JSON Data Pump (jPump)

The JSON Data Pump (jPump) is an application allowing analysts to pull data from a SonarW or MongoDB database by using an Excel spreadsheet. Analysts use a spreadsheet that defines both the lookup parameters as well as the attributes they want to retrieve from the system. They then submit this spreadsheet using the jPump and get back a report or CSV/spreadsheet with the data they requested.

_images/pump1.jpg

There are two types of pumps - a simple pump and a full pump. A simple pump extracts data from a single collection and does not involve any specialized querying. It is just a convenient interface for analysts who do not want to use the full-blow JSON Studio for simple retrievals.

A full pump allows analysts to pull data based on complex data aggregations and pipelines that may have a multi-stage extraction process and might pull data from multiple collections in order to generate the final output. Full pumps are composed of aggregation pipelines that are built within JSON Studio. These chains of pipelines are built by a power-user familiar with JSON Studio and then used by analysts who only provide the input data using a spreadsheet.

_images/pump2.jpg

As an example, an analyst may want to provide a list of customers by name and look up all products owned by these customers. The first pump stage would rad the spreadsheet and lookup a set of customer attributes (e.g. identifiers) based on the customer name. These attributes are then used to search through the products collection and produce the result spreadsheet.

_images/pump3.jpg

Performance of a pump depends on many things - the speed of the underlying SonarW or MongoDB database, the number of steps in the pump, the size and complexity of the documents and when using MongoDB whether the queries performed are using an index or not. If the database is fast enough and/or the data is indexed for the queries the throughput of a single step of a pump is around 250 documents per second.

Using Simple Pumps

To use a simple pump all you need is to define a spreadsheet with the input data to filter on and the column you want extracted. You do not need to define a query or a pipeline or anything else. The spreadsheet has two parts to it - the input and the output specifications. The image below show an example:

_images/pump5.jpg

The two sections are separated by a single | entered into one of the columns in the header line. This delimits the input from the output. All columns to the left of this delimiter (column C in the image above) specify the input. All columns to the right of this delimiter specify the output.

The input specification allows you to use any field in the collection you are querying. Use dot notation to query based on fields in sub-documents. Specify lines of values as shown. This translates to a query which is an OR of ANDs of the specified values. If you leave a value empty (as shown in row 4 above it means “any value”. Do not leave all columns of a single input line empty as that means “all documents” - the pump will fail with an error message to prevent you from querying the entire database and placing unnecessary load on the server.

The output specification includes which fields you want populated as part of the output.

To invoke the pump navigate to https://<hostname>:<port>/pump.xhtml. To invoke a simple pump enter the data source values allowing you to connect to the database and then specify the collection name you want to query, the output/export type and upload your input spreadsheet using the “Choose File” button. Then submit.

If you chose CSV as the export type you will receive a CSV (which can open as a spreadsheet) with all lines that represent all documents that match your input criteria and for each document you will get the columns you requested. If you left the output specification empty (i.e. the | delimiter was your last column) you will get all data from the documents. You can control whether sub-documents are expanded using dot notation through a the Expand sub-documents checkbox.

In addition to a CSV export you can view the same data as a report, JSON documents or through the Schema Analyzer. The Schema Analyzer takes all the documents resulting from the pump (and after applying the output specification) analyzes all the data to report on counts and frequencies. For each field in the output you will receive how many documents of the result set have this field and what is the relative frequency (in percentage) that this field occurs. Note that for fields in arrays of documents the percentage values can be larger than 100% and the counts larger than the size of the result set. For example, if every document in the result has an array with five sub-documents then the percentage value will be shown as 500%.

Building and Using Full Pumps

A full pump allows you to build an entire process for data retrieval. You can create arbitrarily complex processing, can query from multiple collections and even from collections in different databases (within the same instance) and use both simple queries and aggregation queries to produce the result. In full pumps there are usually two actors involved - the designer of the pump that uses JSON Studio to build and combine the aggregation pipelines used by the pump and the analysts using the pump that do not need to understand the underlying process and just use the pump as a data API.

A full pump is composed of a number of aggregation pipelines. These aggregation pipelines are constructed using JSON Studio and saved by name (e.g. pipe1, pipe2 etc.) These need to be stored in a single Studio database. The Studio DB will serve both for the metadata as well as the interim data that is passed between one step to the next.

Data is passed from the input to the first pipeline in interim collections. These interim collections reside in the Studio DB and are automatically dropped when no longer needed for you. You can prevent these interim collections from being dropped for debugging purposes while you are building the pump by entering a value into the Trace prefix field in the pump.xhtml form.

The mechanism for using data computed by the previous pipeline or the input in a full pump in the bind variables mechanism. Bind variables parameterize a query/pipeline. In a full pump you parameterize each pipeline with bind variables and make sure that the previous stage produces documents with values for this pipeline. The input portion of the spreadsheet in a full pump includes a column for each bind variable used in the first stage of the pipeline. For example, if your pump includes two stages and the first stage had a parameter called custname then your input spreadsheet should have a single column to the left of the | delimiter called custname. If your second pipeline has two parameters called prod_id1 and prod_id2 then the first pipeline must produce documents with a field called prod_id1 and a field called prod_id2. If the first stage only produces prod_id 1 and prod_id2 is something that should be a part of the input data then your spreadsheet should have a second column to the left of the | delimiter called prod_id2 and the first pipeline in your pump must have a projection stage with:

prod_id2: {$literal: "$$prod_id2"}

Once your pipelines are defined and tested individually you need to define the pump. The pump is a JSON document inserted into the lmrm__json_pump collection in the Studio DB. You can insert this document manually or using a spreadsheet upload using the Spreadsheets application in JSON Studio. A pump has a name and an array of pipeline specifications. For example, to create a pump with two steps add the following specification with an array of two documents - each specifying the pipeline name and the database and collection it operates on:

{
  "name": "customer_to_prods",
  "pipelines": [
    {
      "db": "cust",
      "col": "customer",
      "pipeline": "pipe1"
    },
    {
      "db": "prod",
      "col": "products",
      "pipeline": "pipe2"
    }
  ]
}

To activate the full pump omit the collection name and use customer_to_prods in the pump name field of the pump.xhtml submission form.

Table Of Contents

Previous topic

Dashboard Builder

Next topic

jSonarR - Using Gateway data from within R

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.