An Aggregation Pipeline Primer for those Familiar with SQL

This tutorial walks you through a simple example of building an aggregation pipeline using the Aggregation Builder starting with a SQL Example. You will use company data in JSON Format that describes company funding, acquisition etc. The data itself is available using an API from Cruchbase and a typical document looks like the following (some fields are removed for brevity and because it is irrelevant to this tutorial):

{
  "_id": {
    "$oid": "52cdef7c4bab8bd675297d8a"
  },
  "name": "Wetpaint",
  "permalink": "abc2",
  "crunchbase_url": "http://www.crunchbase.com/company/wetpaint",
  "homepage_url": "http://wetpaint-inc.com",
  "blog_url": "http://digitalquarters.net/",
  "blog_feed_url": "http://digitalquarters.net/feed/",
  "twitter_username": "BachelrWetpaint",
  "category_code": "web",
  "number_of_employees": 47,
  "founded_year": 2005,
  "founded_month": 10,
  "founded_day": 17,
  "deadpooled_year": 1.0,
  "tag_list": "wiki, seattle, elowitz, media-industry, media-platform, social-distribution-system",
  "alias_list": "",
  "email_address": "info@wetpaint.com",
  ...
  "overview": "<p>Wetpaint is a technology platform company that uses its proprietary state-of-the-art technology and expertise in social media to build and monetize audiences for digital publishers. Wetpaint’s own online property, Wetpaint Entertainment, an entertainment news site that attracts more than 12 million unique visitors monthly and has over 2 million Facebook fans, is a proof point to the company’s success in building and engaging audiences. Media companies can license Wetpaint’s platform which includes a dynamic playbook tailored to their individual needs and comprehensive training. Founded by Internet pioneer Ben Elowitz, and with offices in New York and Seattle, Wetpaint is backed by Accel Partners, the investors behind Facebook.</p>",
  "image": {
    ...
  },
  "products": [
    ...
  ],
  "relationships": [
    ...
  ],
  "competitions": [
    ...
  ],
  "providerships": [],
  "total_money_raised": "$39.8M",
  "funding_rounds": [
    ...
  ],
  "investments": [],
  "acquisition": {
    "price_amount": 3.0E7,
    "price_currency_code": "USD",
    "term_code": "cash_and_stock",
    "source_url": "http://allthingsd.com/20131216/viggle-tries-to-bulk-up-its-social-tv-business-by-buying-wetpaint/?mod=atdtweet",
    "source_description": " Viggle Tries to Bulk Up Its Social TV Business by Buying Wetpaint",
    "acquired_year": 2013,
    "acquired_month": 12,
    "acquired_day": 16,
    "acquiring_company": {
      "name": "Viggle",
      "permalink": "viggle"
    }
  },
  "acquisitions": [],
  "offices": [
    ...
  ],
  "milestones": [
    ...
  ],
  ...
}

Given this data structure, you will create a list of companies that have been acquiring start-ups and for each such company (like Google or Microsoft) you will compute how many acquisitions the company made, what was the total amount spent by the company in it’s acquisitions, and what is the average number of employees of the acquired company (for example, this might indicate whether the company prefers to acquire early-stage technology companies or likes to acquire companies for an install base).

Before we start building the pipeline, assume that you also have a table in a relational database that incorporates all this data and that it has already been de-normalized. In that case the SQL statement to perform the suggested task might look like:

SELECT
   AcquiringCompany as acquirer,
   Avg(NumberOfEmployees) as avg_emp,
   COUNT(*) as ccount,
   SUM(AcquisitionPrice) as amount_spent
FROM Companies
WHERE AcquiringCompany NOT NULL
GROUP BY acquirer
HAVING ccount > 3
ORDER BY ccount

Before you delve into building the stages, here is an overview of what the resulting pipeline looks like:

_images/sql_agg1.jpg

At a high level, the pipeline stages do the following:

  • Filter and remove all companies that have not been acquired.
  • Group the data by the acquiring company and for each such company compute how many acquisitions occurred, the sum of the amount spent, and the average number of employees of the acquired companies.
  • Sort by the number of acquisitions
  • Filter out any acquirer that has not had more than three acquisitions.

Log into the Studio and navigate to the Aggregation Builder. Select the collection with the Crunchbase data and enter a new pipeline name.

The first stage in the pipeline is a filter stage that filters out all companies that have not been acquired. Add a stage by clicking on the + button and choose the Filter radio button. Give the stage a name and click Create. Choose the EXISTS operator and double click on the left-hand-side cursor on name inside the acquiring_company sub-document, inside the acquisition sub-document.

The next stage in the pipeline is the most important one - the grouping stage. This is the stage that does most of the work. Add a group stage and give it a name. Then double click on the name inside the acquiring_company sub-document, inside the acquisition sub-document. This defines the “group by” element. Click on the COUNT link to add a counte of how many documents will be grouped together - this is the equivalent of the COUNT(*) clause. Then click on the Aggregate Values radio button and double click on the price_amount (for the SUM). Click on AVG operator and double click on the number_of_employees field. Finally, change the field names to make them easier to read. The stage will look like the image shown below:

_images/sql_agg3.jpg

The next two stages are simple - a sort stage that implements the ORDER BY functionality and a filter stage that implements the HAVING functionality. Two sample documents of the result of running the entire pipeline are:

{
"_id": {
   "acquirer": "Google"
},
"count": 71,
"amount spent": 4.6332E9,
"avg
emp": 37.111111111111114
},

{ "_id": {
   "acquirer": "Microsoft"
},
"count": 55,
"amount spent": 1.0095E10,
"avg emp": 95.58333333333333
}

The following image depicts the mapping from the SQL statement to the pipeline stages; while SQL is certainly a compact factor, aggregation pipelines are very intuitive and are often easier to build since they mimic the way people think rather than try to express things in a declarative language like SQL:

_images/sql_agg2.jpg

By clicking on the view icon in the Aggregation Builder’s toolbar you can view the aggregation code and can copy/paste this to other scripts or application code. For the example above the generated pipeline code is:

{
  $match: {
    "acquisition.acquiring_company.name": {
      $exists: true
      }
    }
  },
{
  $group: {
    _id: {
      acquirer: "$acquisition.acquiring_company.name"
      },
      count: {
      $sum: 1
      },
      amountspent: {
      $sum: "$acquisition.price_amount"
      },
      avgemp: {
      $avg: "$number_of_employees"
      }
    }
  },
{
  $sort: {
    "count": -1
    }
  },
{
  $match: {
    "count": {
      $gt: 3
      }
    }
}

Table Of Contents

Previous topic

Using Twitter data

Next topic

Using the Gateway to build composites (Chaining API calls)

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.