Using World Bank Project dataΒΆ

In this tutorial you will use data from the World Bank to inspect project spending. You will analyze money spent by the world bank and try to understand how World Bank project money is distributed among different regions of the world.

The World Bank has a set of APIs that allow anyone to download and use data describing how they invest in projects around the world. The data can be downloaded in a JSON format. If you want to follow along you can download this data using the World Bank Project API at World Bank Project API and then import into your database using mongoimport.

You might need to massage the data a bit since the API returns a single object per API call for 500 projects at a time - but all as sub-documents in one big document. Here is a small snippet of the document:

{"rows":10,"os":"0","page":"1","total":"15476","projects":{
"P129828":{"id":"P129828","projectfinancialtype":"IDA","regionname":"Africa","prodline":" ...
,"P145310":{"id":"P145310","projectfinancialtype":"IDA","regionname":"East ...
,"P144665":{"id":"P144665","projectfinancialtype":"OTHER","regionname":"Middle ...
}}

It would be easier to import and use this data if it had the form of an array of documents similar to:

[
{"id":"P129828","projectfinancialtype":"IDA","regionname":"Africa","prodline":" ...
,{"id":"P145310","projectfinancialtype":"IDA","regionname":"East ...
,{"id":"P144665","projectfinancialtype":"OTHER","regionname":"Middle ...
]

You can strip out the initial:

{"rows":10,"os":"0","page":"1","total":"15476","projects":{,

the last } and add [ and a ]. If you’re using a *nix system (including MacOS) you can then finish preparing the file using a sed command of the form:

sed 's/"P......":{"id":/{"id":/g'

When the data is in the database you can logon to JSON Studio and click the Analytics link at the top right to go to the Aggregation Builder:

_images/world_bank_1.jpg

In the Aggregation Builder select the collection with the data from the Starting Collection pull down:

_images/world_bank_2.jpg

The collection data will be displayed in the collection viewer pane at the bottom left:

_images/world_bank_3.jpg

Looking at this sample collection, you can see that there is a regionname (in this document the value is Africa) and a set of amounts such as lendprojectcost, idacommant, totalamt and grantamt. It is natural to want to aggregate the data by region and see how much money is spent across all projects for that region, what is the distribution of total amount spent etc. However, looking at this data there is a problem; the data for the amounts are strings and if we try to sum these up (e.g. using an aggregation stage) it will yield 0. So, you’ll take a detour through the Spreadsheet Bridge to do some manipulation on the data).

Click the Spreadsheet Bridge icon in the collection viewer pane (marked in the previous image with an arrow). Use the Spreadsheet Bridge to export the data to Excel where you can manipulate and enrich the data and then bring it back into the database.

To export the data to Excel click the export button:

_images/world_bank_4.jpg

This downloads a spreadsheet with your data and opens Excel (assuming you have Excel or another spreadsheet program installed). Each document forms a row in the spreadsheet. Normally the first column will be the id of the document which is used if you want to bring the data back in and update the existing documents.

If you look at the columns ibrdcommamt, idacommamt, idacommamt, grantamt you will indeed see that they are strings. In Excel, mark the relevant cells and select Convert to Number; the data will now be numeric.

_images/world_bank_5.jpg

Before updating the data back in the database let’s add one more column to represent percentage of total spend. Add a column right after grantamt and call it total_plus_grant. Then define the first cell (e.g. W2) as the sum of U2 and V2. Then copy this formula to the rest of the rows in the column:

_images/world_bank_6.jpg

At the bottom of this column add another formula that is the sum of all the number in this column:

_images/world_bank_7.jpg

Now add one more column after total_plus_grant and call it percentage_of_total_spent. Define the first cell as a formula taking the total_plus_grant, dividing it by the sum you just created and multiplying it by 100. Make sure to use absolute positioning for the sum. Then copy the formula to the rest of the cells in that column. Make sure to change the format and allow enough decimal spaces or it will look like it’s always zero:

_images/world_bank_8.jpg

Save your spreadsheet and go back to the Spreadsheet Bridge. If you want to update your existing collection just hit the upload button and pick the spreadsheet. In this case, let’s save your data into a new collection. In the Save into field edit the collection name and enter world_bank_enriched. Then click the upload icon (with the up arrow), select the spreadsheet using the Choose button and click Upload:

_images/world_bank_9.jpg

The Spreadsheet Bridge will now be showing documents from the new collection. Your data now includes numbers as amounts plus you now have the total and percentage fields per document.

To see this without scrolling through everything, click on the Finder link at the top right of the screen to navigate to the Finder. Don’t worry about the warning message - the Spreadsheet Bridge is just telling you that you have mapping data that has not been saved, but the data is safely in the database already. In the Finder select your enriched collection from the top-left pane and look at the collection viewer to the left. Alt-Double-click on “borrower”, “countryname”, “regionname”, “total_plus_grant” and “percentage_of_total_spen”. As you make your selections you will see them being added to the Additional select area in the Point-and-click tab. Click on the Execute icon to run the query:

_images/world_bank_10.jpg

Alternativey, you can configure the Studio to run the queries as you are selecting and de-selecting values. Click the Perefereces link at the top right and find the “Search on any change to facet/condition” preference. Turn it to 1:

_images/world_bank_11.jpg

Now that our data is numeric we can go to the Aggregation Builder and use the aggregation framework to summarize data. Click the Aggregation Builder link at the top right and select the enriched collection from the pull down. Enter a new pipeline name and hit enter. This creates a new pipeline where you can start building stages.

In order to aggregate based on the region create a new Group stage by clicking on the “+” icon in the stages facet, selecting Group, entering a description and clicking the Create button:

_images/world_bank_12.jpg

Now double click on regionname in the JSON document - this adds the grouping specifier in the middle pane:

_images/world_bank_13.jpg

Per region you’ll add a count of how many projects exist in that region, a sum of the total amount spent in that region and what percentage of the total spent is spent in that region.

Since a COUNT is very common, there is a special link for that so simply click on the Add COUNT link. Then, select the Aggregation values radio button in the center pane. This tells the editor that selections from the JSON document will now add aggregate values (as opposed to additional grouping specifiers). Leave the SUM aggregation function selected and find the three fields you want to sum in the left JSON document and select them using a double click each. The by_region edit pane is now ready:

_images/world_bank_14.jpg

Under the covers this creates a group aggregation stage. Click the magnifying glass icon to see what will be run on the database:

_images/world_bank_15.jpg

Run the current pipeline by clicking one of the play buttons (in the center pane, in the grid showing all stages or at the top of the screen). The data is displayed in the collection viewer to the right:

_images/world_bank_16.jpg

Let’s also sort the aggregate data. It can make sense to sort it either by the number of projects or by the total amount spend - let’s sort by count. Create a new pipeline stage of type Sort, give it a name and edit it. In the collection viewer to the right, double click on the count field. This adds a sorting specifier. Select descending using the checkbox:

_images/world_bank_17.jpg

Run the entire pipeline either using the Run Pipeline button in the top toolbar or using the run until button:

_images/world_bank_18.jpg

Save the pipeline if you want to use it in the future. If you wish to see the full pipeline code click the magnifying glass icon in the top toolbar:

_images/world_bank_19.jpg

Next, let’s visualize the data. Click the Visualize icon in the collection viewer (the light-bulb icon). This opens the Visualizer application. At the bottom pane pick Pie as the chart type. Then drag the _id text icon from the top left pane into the X-axis text input box (or simply type in _id). Drag the count into the #series 1 box to show a pie chart of the project distribution across regions:

_images/world_bank_20.jpg

If you want to see both the count and the percentage allocated by region in a column chart select Column from the type drop down and add the second data series. As you can see, while Africa has 50% more projects than East Asia and the Pacific, the total spend there is only 20% more.

_images/world_bank_21.jpg

Table Of Contents

Previous topic

Build a Report in 5 Minutes

Next topic

Using Twitter data

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.