Working with differs

JSON Studio provides two tools for diffing collections - the Differ and SonarDiffer.

The Differ is a GUI tool that computes a diff between two documents that you are viewing within the Studio (e.g. in the Finder, the Dicer, the Spreadsheet Bridge, etc). It is most useful for chunks of collections where you want to see a change that you made in the Studio. The Differ compares documents according to a certain sequence as do tools such as windiff, Unix diff and meld.

SonarDiffer is a script that traverses two collections and a definition of what makes up the key to these collections and does a comprehensive comparison between the two collections to identify what documents exist in one of the collections but not the other and for documents that exist in both collections, what the precise delta is. SonarDiffer is precise because it uses the definition of the logical key (provided by you) to know exactly what the deltas are. It is mostly used in QA environments, for validating outcome of ETL processes, and more.

GUI Differ

Use the Differ to compare JSON documents side-by-side or inline to easily identify differences:

_images/differ1.jpg

You can open the Differ from a navigation hyperlink or from any collection viewer. Once open you can paste arbitrary text into any of the text areas and compute the diff.

When you open a Differ for the first time, it is opened in a separate window/tab. The Differ is the only application within JSON Studio that can (and should) remain open in parallel to any other application as a separate tab/window. If you open a JSON text from any of the other applications’ collection viewers (using the left or right arrow) it will use the already-open Differ tab/window. This allows you to easily compare JSON texts even if they come from different applications. Note that the Differ does not do JSON parsing or validation - only a diff function.

SonarDiffer

SonarDiffer is a python script that gets deployed as part of the JSON Studio installation. It is comprised of the script sonardiffer.py and the config file SonarDiffer.conf. You can run the script manually or as a daemon, cron job etc.

This python script has only been tested and certified for Linux. It may however work on any platform that has the appropriate python packages and modules.

The configuration file includes a set of datasource specifications telling SonarDiffer how to connect to the databases, e.g.:

[source]
db_name = source
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

[target]
db_name = target
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

[out]
db_name = out
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

The names “source”, “target” and “out” in the sample above are the logical names for datasources that are then used when invoking the script.

The last two parameters in the command line are important. gdp_diff in the example is the logical name that helps you identify which results (documents in lmrm_differ_results and lmrm_differ_deltas) belong to this run - each document forming the result will have this name. _id in this example is the key field through which the differ will decide which documents are compared. The key field does not have to be a single value - it can be an array that together define uniqueness. Specify the fields in dot notation that, together, uniquely identify a document within the collection.

SonarDiffer assumes that the logical keys you provide ensure uniqueness. Therefore, if a match is found in the target document the differ does not continue matching the document it is currently processing. Therefore, only one match/delta per source document is recorded.

To use a specific config file (rather that the one in the “local directory” use the -c flag.

Diffs are stored in lmrm_differ_results and lmrm_differ_deltas. Lmrm_differ_results will include one or more documents per run specifying which documents are in the source collection but not in the target collection. It will also have information as to how many documents were matched on the key fields and a delta computed, and how many deltas existed (and were then calculated and inserted into lmrm_differ_deltas”. Each such document takes the following form:

{
"_id": { "$oid": "52e15212a92f4a47738bf1b3" },
"number_of_docs_compared": 500,
"in_source_only": [],
"number_of_deltas": 5,
"number_only_in_source": 0,
"uri" : {
     "source" : "mongodb://localhost:27017/source",
     "target" : "mongodb://localhost:27017/target",
     "output" : "mongodb://localhost:27017/out"
 },
"job_info" : {
     "start" : "Fri Jan 24 18:02:05 2014",
     "user_name" : "geoffgeoff-NE570",
     "name" : "aggregation_test",
     "command_input" : [
         "./sonardiffer.py",
         "source",
         "aggregate",
         "target",
         "aggregate",
         "out",
         "aggregation_test",
         "_id"
     ]
 }

When you read results using a find make sure to read all the documents using the queue_id since there may be multiple such documents (e.g. if the number of _ids cannot fit in a single document because in_source_only has too many elements to fit into a single document).

Lmrm_differ_deltas will contain documents that exist in both collections but have some difference. Each document in the delta collection is a superset of the documents in the source and target collections. Where the fields are the same the values are used as in the source/target documents. Where they differ the delta is shown at the lowest possible level, e.g. if the source is:

{
 a: 1,
 b: 2,
 c: {ca:1},
 d: [1, 2, "a"]
}

and the target is:

{
 a: "ee",
 b: 2,
 c: 1,
 d: [1, 2, "adf"],
 e: 3
}

then the delta subdocument will include:

delta: {
 a: {lmrm__source: 1, lmrm__target:"ee"},
 b: 2,
 c: {lmrm__source: {ca:1}, lmrm__target: 1},
 d: [1, 2, {lmrm__source: "a", lmrm__target: "adf"}],
 e: {lmrm__target: 3}
}

Additionally, each document will include a subdocument similar to the following to help you identify which delta documents you should be looking at:

"source": { "$oid": "52b213b38594d8a2be17c78e" },
"target": { "$oid": "52b213b38594d8a2be17c78e" },
"job_name": {
   "start" : "Fri Jan 24 18:02:05 2014",
   "user_name" : "geoffgeoff-NE570",
   "name" : "aggregation_test",
 }

SonarDiffer makes a lot of calls on the database and is also computationally intensive. When run on large collections it will typically consume a full core. You can request that it throttle itself and sleep for half the time in order for it not to compete with other processes running. It will then take approximately double the time to complete. To turn on throttling add the following to the output section of your SonarDiffer.conf file:

throttle=true

When the differ runs it processes one chunk at a time in each collection and by default that chunk size of a 1000 documents. If you want to run with a larger chunk size use the -batch_size flag (e.g. -batch_size 10000). This consumes more memory but will increase throughput considerably since it will have 10x less iterations on the target collection (since each batch will read from the database the entire target collection).

Additional flags:

  • Verbosity: -v or –verbosity provide different verbosity levels. You may be asked by a support engineer to run in this mode in order to troubleshoot and issue.
  • Use -case_insensitive when you want the case of the field and column names in relational databases to be ignored. This is especially important when one side is a relational store since in some databases column names are case insensitive unless they are double-quoted. This is also useful when comparing two mongodb collections if you are not sure which case the fields are although less so because in mongodb field names are definitely case sensitive.
  • Use -sets when you want the diff to ignore order when comparing arrays
  • Use -source_query and -target_query when you want to compare a result of a mongodb query vs an entire collection. The query string should be of the form “{<match>},{<projection>}” as you would use in a find command.
  • Prior to version 1.4.1 you could use the -sql_statement when you want to compare the results of a SQL query vs. a table in an RDBMS. Starting from version 1.4.1 use the SQL Statement (as a quoted string) instead of the table name, e.g. mysqldata source “select * from customers”.
  • Use -key_based_lookup when you want a separate query to be issued on the target collection for every source document. Normally both the source and the target collections are iterated-upon using a cursor. The batch size determines how many times the target collection will be fully iterated. If the fields you are using for uniqueness have an index in the target collection performance may be increased by issuing a fast index-based query per each source document.

SQL Diffs

Starting with version 1.4 of JSON Studio SonarDiffer can compare data in MongoDB collections with data in a relational database. All databases that have a python interface and a Linux driver can be used. Use the command line in order to submit a job, for example:

./sonardiffer.py source gdp target gdp out gdp_diff _id

In this example you are asking SonarDiffer to compare the collection called gdp in the database specified by the source section of the config file to gdp in the database specified by the target section of the config file. The results will be placed in the lmrm_differ_results and lmrm_differ_deltas collections in the database specified by the out section of the config file (more on these collections in a bit). Note that if the source is an RDBMS source then gdp (in the example above) is the table name.

If you want to compare based on a SELECT statement rather than the whote table put a quotes string instead of the table name above and the differ will runt he query on your behalf.

To reference a relational data source, add the connection URL to the SonarDiffer.conf file as folows:

[sql_source]
type = sql
sqlalchemy_db_url = "oracle://test:test@ec2-64-176-47-184.compute-1.amazonaws.com:1521/XE"

Additional flags added in version 1.4 for SQL diffs:

  • Use -sql_transform when you need to compare flat relational data with hierarchical JSON data. Data retrieved from a relational table or a SQL query is always flat (column names and values). If your JSON data has subdocuments you use this parameter to pass in a JSON document that mimics the structure in your database where the values are the column names as returned from the relational store. Here is an example taking customer data that is coming from a relational query but is a sub-document in mongodb:

    {
    "phone":"phone",
    "customerNumber":"customerNumber",
    "product":{
       "Name":"productName",
       "Line":"productLine",
       "Description":"productDescription",
       "Scale":"productScale",
       "Vendor":"productVendor"
    },
    "orderNumber":"orderNumber",
    "orderLineNumber":"orderLineNumber",
    "customer":{
       "Name":"customerName"
    }
    }
    

Diffs using Aggregation

Starting with version 2.1 you can perform diffs where one query is an aggregation pipeline. The source query can be either a MongoDB query or a relational query. The target can be a pipeline query. This is useful when the structures you need to compare with are different. For example, it is useful when the source has flat data structures (either table-based or collection-based) and the target data is a MongoDB collection with arrays. In such a case you can use the aggregation framework’s unwind operation to perform the comparison. Another example is when you want to compare summary data - e.g. the result of a SELECT statement with a group by clause with the equivalent aggregation pipeline in MongoDB.

When you use an aggregation pipeline as a target query you specify the aggregation code within the SonarDiffer.conf file rather than on the command line. Specifically, you add an “aggregation” parameter with the pipeline. You then run the differ with the -aggregation_lookup flag.

Let’s look at am example. Assume that you have the following two collections in the same database and you want to compare the two:

> db.agg1.find()
{ "_id" : ObjectId("
53bdcebc9bddb61735794c22"), "id" : 1, "vendor" : 1 }
{ "_id" : ObjectId("53bdcebc9bddb61735794c23"), "id" : 1, "vendor" : 2 }
{ "_id" : ObjectId("53bdcebc9bddb61735794c24"), "id" : 1, "vendor" : 3 }
{ "_id" : ObjectId("53bdcebc9bddb61735794c25"), "id" : 2, "vendor" : 2 }
{ "_id" : ObjectId("53bdcebc9bddb61735794c26"), "id" : 2, "vendor" : 3, "a" : 1 }
{ "_id" : ObjectId("53bdcebc9bddb61735794c27"), "id" : 3, "vendor" : 2 }

> db.agg2.find()
{ "_id" : ObjectId("53bdce629bddb61735794c1f"), "id" : 1, "vendors" : [  1,  2,  3 ] }
{ "_id" : ObjectId("53bdce629bddb61735794c20"), "id" : 2, "vendors" : [  2,  3 ] }
{ "_id" : ObjectId("53bdce629bddb61735794c21"), "id" : 3, "vendors" : [  3 ] }

In SonarDiffer you would have:

[source]
db_name = source
port = 27017
host = localhost
auth_method =
username =
password =
kerberos_service_name = mongodb
aggregation = [{'$match':{'id':'$$id'}},{'$unwind':'$vendors'},{'$project':{'id':1,'vendor':'$vendors'}},{'$match':{'vendor':'$$vendor'}}]

And the command you would run is:

./sonardiffer.py source agg1 source agg2 out -aggregation_lookup

Because of the -aggregation_lookup flag, the system will look for the aggregation parameter in the conf file and run that as the target query on the target collection. It would run this for every row/document coming back from the source query while replacing the bind variables in the aggregation pipeline with the relevant values in the source document being checked. This is similar to what happens when you use two finds with the -key_based_lookup flag but uses more flexible aggregation pipelines.

In the above example the first row retrieved from the source collection is:

{ "_id" : ObjectId("53bdcebc9bddb61735794c23"), "id" : 1, "vendor" : 2 }

From which the following aggregation would be run:

[{'$match':{'id':'1'}},{'$unwind':'$vendors'},{'$project':{'id':1,'vendor':'$vendors'}},{'$match':{'vendor':2}}]

to produce:

{ "_id" : ObjectId("53bdce629bddb61735794c1f"), "id" : 1, "vendor" : 2 }

Note that it is important that the first step is the pipelines are $match steps based on an indexed fields or performance will suffer. When this is done, performance is better than doing a single $unwind and trying to do batch-based comparisons in memory.

The resulting delta in the above example is:

{
 "_id" : ObjectId("53c467fd72a53f19eb224c1b"),
 "job_info" : {
     "start" : "Mon Jul 14 16:30:05 2014",
     "user_name" : "geoffgeoff-NE570",
     "name" : null
 },
 "source" : ObjectId("53bdcebc9bddb61735794c26"),
 "target" : ObjectId("53bdce629bddb61735794c20"),
 "delta" : {
     "a" : {
         "lmrm_source" : 1
     },
     "vendor" : 3,
     "id" : 2
 }
}

In the second example you will compare a summarizing relational query with an aggregation pipeline. Assume a relational table of the form:

mysql> desc Customers;
+------------------------+-------------+------+-----+---------+-------+
| Field                  | Type        | Null | Key | Default | Extra |
+------------------------+-------------+------+-----+---------+-------+
| customerNumber         | int(11)     | NO   | PRI | NULL    |       |
| customerName           | varchar(50) | NO   |     | NULL    |       |
| contactLastName        | varchar(50) | NO   |     | NULL    |       |
| contactFirstName       | varchar(50) | NO   |     | NULL    |       |
| phone                  | varchar(50) | NO   |     | NULL    |       |
| addressLine1           | varchar(50) | NO   |     | NULL    |       |
| addressLine2           | varchar(50) | YES  |     | NULL    |       |
| city                   | varchar(50) | NO   |     | NULL    |       |
| state                  | varchar(50) | YES  |     | NULL    |       |
| postalCode             | varchar(15) | YES  |     | NULL    |       |
| country                | varchar(50) | NO   |     | NULL    |       |
| salesRepEmployeeNumber | int(11)     | YES  |     | NULL    |       |
| creditLimit            | double      | YES  |     | NULL    |       |
+------------------------+-------------+------+-----+---------+-------+

and the equivalent collection in MongoDB. Issuing the following SELECT query:

mysql> select city, count(*) as c from Customers group by city order by c desc;
+-------------------+---+
| city              | c |
+-------------------+---+
| NYC               | 5 |
| Madrid            | 5 |
| Paris             | 3 |
| Singapore         | 3 |
| Auckland          | 3 |
...

For every row returned by the SQL query, the following aggregation pipeline will be run based on the aggregation specifier to the conf file:

aggregation = [{ "$match": { "city": "$$city" } }, { "$group": { "_id": { "city": "$city" }, "c": { "$sum": 1 } } }]

$$city will be the value extracted from the returned row.

You also need to specify a transform document since each row from the relational query is flat and each document coming back from the aggregation is recursive:

-sql_transform '{"_id": { "city": "city"},"c":"c"}'

Note that you could also add projection phases to the aggregation pipeline to flatten it out.

Finally, be aware that the JSON expressions you specify in the conf file for aggregation and in the transform documents must be valid JSON and cannot be the more relaxed MongoDB expressions. E.g. the following is valid:

aggregation = [{ "$match": { "city": "$$city" } }, { "$group": { "_id": { "city": "$city" }, "c": { "$sum": 1 } } }]

but this is not valid:

aggregation = [{ $match: { city: "$$city" } }, { $group: { _id: { city: "$city" }, c: { $sum: 1 } } }]

The final diff run is (use -encoding_source and -encoding_target if the character sets are different than UTF-8):

./sonardiffer.py
   mysql_source "select city, count(*) as c  from Customers group by city"
   target customers
   out_datasource
   -name try
   -aggregation_lookup
   -encoding_source latin1
   -sql_transform '{"_id": { "city": "city"},"c":"c"}'

Table Of Contents

Previous topic

Sonar Resource Manager

Next topic

Working with Sonar SQL Transfer

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.