Working with Sonar SQL Transfer

Sonar SQL Transfer is a simple ETL tool. Using it you can move data from any relational database into MongoDB, based on a table or the output of a SQL statement. You can also transform the data - i.e. map the data returned by the SQL statement onto any MongoDB hierarchical structure. The only limitation is that each row returned by the SQL statement generates a single document in the collection.

Sonar SQL Transfer (sql_transfer.py) is a Python utility, currently tested and certified only on Linux. It has the same prerequisites as the Sonar Differ, with which it shares much of its code.

The configuration file (SqlTransfer.conf) has the same format as the differ's and includes a set of datasource specifications telling SQL Transfer how to connect to the databases, e.g.:

[source]
db_name = source
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

[target]
db_name = target
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

[out]
db_name = out
port = 27017
host =  localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb

[sql_source]
type = sql
sqlalchemy_db_url = "oracle://test:test@ec2-64-176-47-184.compute-1.amazonaws.com:1521/XE"

The names “source”, “target” and “out” in the sample above are the logical names for datasources that are then used when invoking the script.
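
The examples below also use a SQL datasource named mysql; a corresponding section might look like the following sketch (the connection details are illustrative, not part of the sample config above):

[mysql]
type = sql
sqlalchemy_db_url = "mysql://user:password@localhost:3306/classicmodels"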

These are the main use cases for Sonar SQL Transfer:

  • Transfer of an existing SQL table and creation of documents for each row in the table.
  • Transfer of data from multiple tables (most often using a join) into a MongoDB collection (usually de-normalizing the data in the transfer). Each resulting document is flat, like a row in the result set of the select statement.
  • Similar to the previous use case but each document can have sub-documents and any desired structure.

The simplest use case is transferring data from a table to a collection. Assuming mysql is the name of the datasource for your database and out is the name of the datasource for your collection, the following will create a document in the offices_etl collection for every record in the Offices table:

$ ./sql_transfer.py mysql "select * from Offices" out offices_etl
[########################################] 100%
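
You can sanity-check the load with a quick document count; a minimal pymongo sketch, assuming the [out] datasource above (a local MongoDB with a database named out):

from pymongo import MongoClient

# The document count in offices_etl should equal the row count in the
# Offices table. Connection details assume the [out] datasource above.
client = MongoClient("localhost", 27017)
print(client["out"]["offices_etl"].count_documents({}))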

Assuming you have data in two tables such as Offices and Employees with a foreign key relationship, you can create a document for each joined record. For example, assume two tables with the following sample data:

mysql> select * from Offices;
+------------+---------------+------------------+--------------------------+--------------+------------+-----------+------------+-----------+
| officeCode | city          | phone            | addressLine1             | addressLine2 | state      | country   | postalCode | territory |
+------------+---------------+------------------+--------------------------+--------------+------------+-----------+------------+-----------+
| 1          | San Francisco | +1 650 219 4782  | 100 Market Street        | Suite 300    | CA         | USA       | 94080      | NA        |
| 2          | Boston        | +1 215 837 0825  | 1550 Court Place         | Suite 102    | MA         | USA       | 02107      | NA        |..
mysql> select * from Employees;
+----------------+-----------+-----------+-----------+---------------------------------+------------+-----------+----------------------+
| employeeNumber | lastName  | firstName | extension | email                           | officeCode | reportsTo | jobTitle             |
+----------------+-----------+-----------+-----------+---------------------------------+------------+-----------+----------------------+
|           1002 | Murphy    | Diane     | x5800     | dmurphy@classicmodelcars.com    | 1          |      NULL | President            |
|           1056 | Patterson | Mary      | x4611     | mpatterso@classicmodelcars.com  | 1          |      1002 | VP Sales             |..

The following will create an employees_etl collection:

$ ./sql_transfer.py mysql "select * from Employees, Offices where Employees.officeCode=Offices.officeCode order by city" out employees_etl
[########################################] 100%

Each of the resulting documents will look like:

{
   "_id" : ObjectId("5372f51c144db511cf7366ab"),
   "city" : "Boston",
   "territory" : "NA",
   "firstName" : "Steve",
   "extension" : "x4334",
   "lastName" : "Patterson",
   "reportsTo" : NumberLong(1143),
   "officeCode" : "2",
   "phone" : "+1 215 837 0825",
   "state" : "MA",
   "addressLine1" : "1550 Court Place",
   "employeeNumber" : NumberLong(1216),
   "jobTitle" : "Sales Rep",
   "postalCode" : "02107",
   "addressLine2" : "Suite 102",
   "country" : "USA",
   "email" : "spatterson@classicmodelcars.com"
}

Finally, you can use the -sql_transform flag to structure the resulting documents. For example, the following will create an employees_etl2 collection:

$ ./sql_transfer.py mysql "select * from Employees, Offices where Employees.officeCode=Offices.officeCode order by city" out employees_etl2
   -sql_transform '{"empno": "employeeNumber", "lname": "lastName", "fname": "firstName", "phone": "phone", "ext":"extension", "email":"email",
   "manager":"reportsTo", "office": { "city": "city", "addr1": "addressLine1", "addr2": "addressLine2", "state":"state", "country":"country",
   "zip":"postalCode", "territory":"territory"}}'
[########################################] 100%

Each of the resulting documents will look like:

{
  "_id": {
    "$oid": "53730603144db512da651cf2"
  },
  "office": {
    "city": "Boston",
    "zip": "02107",
    "country": "USA",
    "state": "MA",
    "addr2": "Suite 102",
    "addr1": "1550 Court Place",
    "territory": "NA"
  },
  "ext": "x4334",
  "empno": 1216,
  "lname": "Patterson",
  "phone": "+1 215 837 0825",
  "manager": 1143,
  "fname": "Steve",
  "email": "spatterson@classicmodelcars.com"
}

For convenience, the formatted transformation used above is:

{
  "empno": "employeeNumber",
  "lname": "lastName",
  "fname": "firstName",
  "phone": "phone",
  "ext": "extension",
  "email": "email",
  "manager": "reportsTo",
  "office": {
    "city": "city",
    "addr1": "addressLine1",
    "addr2": "addressLine2",
    "state": "state",
    "country": "country",
    "zip": "postalCode",
    "territory": "territory"
  }
}
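
Conceptually, the transform document maps result-set columns to document paths: string values name source columns, and nested objects become sub-documents. Here is an illustrative Python sketch of that mapping (not the tool's actual code):

def apply_transform(row, spec):
    # String values in the transform document name source columns;
    # nested dicts become sub-documents in the resulting document.
    doc = {}
    for key, value in spec.items():
        if isinstance(value, dict):
            doc[key] = apply_transform(row, value)  # recurse for sub-documents
        else:
            doc[key] = row[value]  # copy the named column's value
    return doc

row = {"employeeNumber": 1216, "lastName": "Patterson", "city": "Boston"}
spec = {"empno": "employeeNumber", "lname": "lastName",
        "office": {"city": "city"}}
print(apply_transform(row, spec))
# {'empno': 1216, 'lname': 'Patterson', 'office': {'city': 'Boston'}}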

When you need to keep a target database in sync with a source database, use the -upsert flag. You must specify the fields that form the logical key of the application - i.e. the combination of fields that uniquely identifies a document (typically not the _id field, unless you build your _id values from application logic).
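
For example, with the flat employees_etl collection above, employeeNumber uniquely identifies an employee, so a re-run such as the following (an illustrative invocation; the -upsert syntax matches the logging example later in this section) updates existing documents rather than inserting duplicates:

$ ./sql_transfer.py mysql "select * from Employees, Offices where Employees.officeCode=Offices.officeCode order by city" out employees_etl -upsert "employeeNumber"
[########################################] 100%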

Use sql_transfer.py --help to see all flags.

ELT or ETL?

SQL Transfer is not a full-blown ETL tool but rather an extract/load tool that uses the underlying databases for some transformation capabilities. On the RDBMS side any SQL query can be used and therefore transformations can occur during the extract. Additionally, SQL Transfer also has a mapping capability for doing lookup-based replacements.

While SQL Transfer itself may not be a full-blown ETL tool, together with the other tools in JSON Studio you can build transformation pipelines and implement both ETL and ELT. The most common usage pattern is to use SQL Transfer for the extract/load phase, perhaps with some transformations done in the SQL query or through the lookup-replacement option, and then to use aggregation pipelines built with the Studio and chained by the Gateway for additional transformation steps.

Together with the other tools in JSON Studio, SQL Transfer gives you full-blown ETL/ELT at a fraction of the cost, with a much smaller deployment footprint and learning curve.

Dictionary Lookups in SQL Transfer

SQL Transfer has a regular expression-based mechanism that can replace/transform string data while extracting it from the RDBMS and inserting it into MongoDB. For example, if your source data has NY and CA as state values and you want to transform them into “New York” and “California”, you can do so with transformation rules in SQL Transfer. (For such simple transformations it is often easier to use a mapping table in the RDBMS and a join in the extract phase.) Because this mechanism is based on regular expression match/replace, you can, for example, transform source data that looks like +1.781.234.3456 into (781) 234-3456.
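
For example, the phone-number transformation just described can be expressed as a rule of the following form (a sketch using the rule syntax explained below; the col and priority values are illustrative):

{
   "col" : ".*",
   "field" : "phone",
   "from" : "\\+1\\.([0-9]{3})\\.([0-9]{3})\\.([0-9]{4})",
   "to" : "(lmrm_match_1) lmrm_match_2-lmrm_match_3",
   "priority" : 1
}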

Mapping occurs between the extract and the load phases. Each row is first extracted from the RDBMS using the SQL query, then transformed into the desired document structure (using the -sql_transform document or, if none is specified, a flat document format), and then the regex mapping is applied. The resulting document is then inserted/upserted into MongoDB.

Mapping rules are defined in a MongoDB collection named lmrm_etl_regex. You specify where the rules are taken from using the -regex_rewrite flag, which names a datasource from the config file; the lmrm_etl_regex collection is read from that database to populate the transformation rules.
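
Rules are ordinary documents, so you can add one from the mongo shell; a minimal sketch, assuming the out datasource above points at a database named out:

> use out
> db.lmrm_etl_regex.insert({ "col": "employees_etl3", "field": "office.city", "from": "Boston", "to": "Cold City", "priority": 1 })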

Each rule in lmrm_etl_regex takes the following form:

{
   "_id" : ObjectId("5425b32dc03d9e26f2f7c645"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : "Boston",
   "to" : "Cold City",
   "priority" : 1
}

col specifies the collection name to which the rule applies and can be a regular expression to match multiple collections. field is a dot-notation path to the field on which the replacement will be attempted and can also be a regular expression to match multiple fields. from and to specify the value to match and its replacement. When using group-match expressions in the from field, you can refer to the matched groups in the to field using lmrm_match_<group number>. priority is used when more than one rule matches a transformed document: only one rule is applied per field - the one with the highest priority.
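
The per-field selection logic is equivalent to the following Python sketch (illustrative only, not the tool's actual code; treating a missing priority as 0 is an assumption of the sketch):

import re

def pick_rule(rules, collection, field_path):
    # Candidates are the rules whose col and field regular expressions
    # match this collection name and dot-notation field path.
    candidates = [r for r in rules
                  if re.match(r["col"], collection)
                  and re.match(r["field"], field_path)]
    if not candidates:
        return None
    # Only one rule is applied per field: the one with the highest
    # priority (a missing priority is treated as 0 here).
    return max(candidates, key=lambda r: r.get("priority", 0))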

Let’s look at a few examples and assume that a simple sql_transfer ETL yields a collection called employees_etl3 with documents of the form:

{
  "_id": {
    "$oid": "542818b3d419f612b857625d"
  },
  "office": {
    "city": "Boston",
    "zip": "02107",
    "country": "USA",
    "state": "MA",
    "addr2": "Suite 102",
    "addr1": "1550 Court Place",
    "territory": "NA"
  },
  "ext": "x4334",
  "empno": 1216,
  "lname": "Patterson",
  "phone": "+1 215 837 0825",
  "manager": 1143,
  "fname": "Paul",
  "email": "spatterson@classicmodelcars.com"
}

If you add the following rule:

{
   "_id" : ObjectId("5425b32dc03d9e26f2f7c645"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : "Boston",
   "to" : "Cold City",
   "priority" : 1
}

then the above document will look like:

{
  "_id": {
    "$oid": "542818b3d419f612b857625d"
  },
  "office": {
    "city": "Cold City",
    "zip": "02107",
    "country": "USA",
    "state": "MA",
    "addr2": "Suite 102",
    "addr1": "1550 Court Place",
    "territory": "NA"
  },
  "ext": "x4334",
  "empno": 1216,
  "lname": "Patterson",
  "phone": "+1 215 837 0825",
  "manager": 1143,
  "fname": "Paul",
  "email": "spatterson@classicmodelcars.com"
}

If you have two rules such as:

{
   "_id" : ObjectId("5425b32dc03d9e26f2f7c645"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : "Boston",
   "to" : "Cold City",
   "priority" : 1
}
{
   "_id" : ObjectId("54281a20bcff37c487347287"),
   "col" : "employees_etl3",
   "field" : "office.territory",
   "from" : "NA",
   "to" : "Unknown"
}

Then the result document will look like:

{
  "_id": {
    "$oid": "542818b3d419f612b857625d"
  },
  "office": {
    "city": "Cold City",
    "zip": "02107",
    "country": "USA",
    "state": "MA",
    "addr2": "Suite 102",
    "addr1": "1550 Court Place",
    "territory": "Unknown"
  },
  "ext": "x4334",
  "empno": 1216,
  "lname": "Patterson",
  "phone": "+1 215 837 0825",
  "manager": 1143,
  "fname": "Paul",
  "email": "spatterson@classicmodelcars.com"
}

If you have two rules transforming the city field, as in the following rule set (the territory rule from before plus two city rules):

{
   "_id" : ObjectId("5425b32dc03d9e26f2f7c645"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : "Boston",
   "to" : "Cold City",
   "priority" : 1
}
{
   "_id" : ObjectId("54281a20bcff37c487347287"),
   "col" : "employees_etl3",
   "field" : "office.territory",
   "from" : "NA",
   "to" : "Unknown"
}
{
   "_id" : ObjectId("54281a8cbcff37c487347288"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : "Boston",
   "to" : "Warm in Summer",
   "priority" : 10
}

Then the result document will look like:

{
  "_id": {
    "$oid": "542818b3d419f612b857625d"
  },
  "office": {
    "city": "Warm in Summer",
    "zip": "02107",
    "country": "USA",
    "state": "MA",
    "addr2": "Suite 102",
    "addr1": "1550 Court Place",
    "territory": "Unknown"
  },
  "ext": "x4334",
  "empno": 1216,
  "lname": "Patterson",
  "phone": "+1 215 837 0825",
  "manager": 1143,
  "fname": "Paul",
  "email": "spatterson@classicmodelcars.com"
}

Adding a regular expression rule with a higher priority:

{
   "_id" : ObjectId("54281f4ebcff37c48734728b"),
   "col" : "employees_etl3",
   "field" : "office.city",
   "from" : ".*",
   "to" : "Redacted",
   "priority" : 100
}

Yields:

{
   "_id" : ObjectId("542818b3d419f612b857625d"),
   "office" : {
      "city" : "Redacted",
      "zip" : "02107",
      "country" : "USA",
      "state" : "MA",
      "addr2" : "Suite 102",
      "addr1" : "1550 Court Place",
      "territory" : "Unknown"
   },
   "ext" : "x4334",
   "empno" : NumberLong(1216),
   "lname" : "Patterson",
   "phone" : "+1 215 837 0825",
   "manager" : NumberLong(1143),
   "fname" : "Paul",
   "email" : "spatterson@classicmodelcars.com"
}

Or, to redact all fields in the office subdoc:

{
   "_id" : ObjectId("542822d1bcff37c487347297"),
   "col" : "employees_etl3",
   "field" : "office.*",
   "from" : ".*",
   "to" : "Redacted",
   "priority" : 100
}

Yields:

{
   "_id" : ObjectId("542818b3d419f612b857625d"),
   "office" : {
      "city" : "Redacted",
      "zip" : "Redacted",
      "country" : "Redacted",
      "state" : "Redacted",
      "addr2" : "Redacted",
      "addr1" : "Redacted",
      "territory" : "Redacted"
   },
   "ext" : "x4334",
   "empno" : NumberLong(1216),
   "lname" : "Patterson",
   "phone" : "+1 215 837 0825",
   "manager" : NumberLong(1143),
   "fname" : "Paul",
   "email" : "spatterson@classicmodelcars.com"
}

Finally, here is an example using match groups in the regular expression. To transform each American/Canadian phone number in the document from a +1 215 837 0825 format to a 215-837-0825 format, use the following rule:

{
   "_id" : ObjectId("542aa366efcff99775e43dcd"),
   "col" : "employees_etl3",
   "field" : "phone",
   "from" : "\\+[0-9]* ([0-9]{3}) ([0-9]{3}) ([0-9]{4})",
   "to" : "lmrm_match_1-lmrm_match_2-lmrm_match_3",
   "priority" : 1
}

The “from” regular expression has parentheses that identify groups to match and the “to” field uses those match groups. Each matched group is represented by the variable lmrm_match_<group number>. In this example a row in the RDBMS of the form:

employeeNumber: 1216
      lastName: Patterson
     firstName: Paul
     extension: x4334
         email: spatterson@classicmodelcars.com
    officeCode: 2
     reportsTo: 1143
      jobTitle: Sales Rep
    officeCode: 2
          city: Boston
         phone: +1 215 837 0825
  addressLine1: 1550 Court Place
  addressLine2: Suite 102
         state: MA
       country: USA
    postalCode: 02107
     territory: NA

is transformed into the following document:

{
   "_id" : ObjectId("542818b3d419f612b857625d"),
   "office" : {
      "city" : "Boston",
      "zip" : "02107",
      "country" : "USA",
      "state" : "MA",
      "addr2" : "Suite 102",
      "addr1" : "1550 Court Place",
      "territory" : "NA"
   },
   "ext" : "x4334",
   "empno" : NumberLong(1216),
   "lname" : "Patterson",
   "phone" : "215-837-0825",
   "manager" : NumberLong(1143),
   "fname" : "Paul",
   "email" : "spatterson@classicmodelcars.com"
}
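
The substitution semantics map directly onto standard regular expression groups; here is a minimal Python sketch (illustrative only, not the tool's actual code) of how the from/to pair above behaves:

import re

# Values copied from the rule document above.
rule_from = r"\+[0-9]* ([0-9]{3}) ([0-9]{3}) ([0-9]{4})"
rule_to = "lmrm_match_1-lmrm_match_2-lmrm_match_3"

def apply_rule(value):
    m = re.match(rule_from, value)
    if m is None:
        return value  # no match: leave the field value untouched
    result = rule_to
    # Replace each lmrm_match_<n> variable with the matched group.
    for i, group in enumerate(m.groups(), start=1):
        result = result.replace("lmrm_match_%d" % i, group)
    return result

print(apply_rule("+1 215 837 0825"))  # prints: 215-837-0825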

Logging

Both error messages and status messages are inserted into the lmrm_etl_logs collection in the target database. If, for example, you have an error in your transformation rules, the tool outputs a message such as:

There are syntax errors in the test.lmrm_etl_regex collection.
Please check test.lmrm_etl_regex_logs for more information

and the collection will contain an error document:

{
   "_id" : ObjectId("5374ef74144db531b83fcd4f"),
   "bad_regex" : "**",
   "document" : {
      "from" : "\\+[0-9]* ([0-9]{3}) ([0-9]{3}) ([0-9]{4)",
      "to" : "lmrm_match_1-lmrm_match_2-lmrm_match_3",
      "priority" : 1,
      "field" : "**",
      "_id" : ObjectId("542aa86aefcff99775e43dcf"),
      "col" : "employees_etl3"
   },
   "error" : "The regex contained in field field could not be compiled"
}

Status/progress log entries are logged at the beginning of the run, the end of the run, and during the run every 1000 documents (or whatever batch size you specify).

The start and completion entries log all details about the run, e.g.:

{
   "_id" : ObjectId("53758b59144db536ce294ab5"),
   "start_time" : ISODate("2014-05-15T23:51:53.128Z"),
   "args" : [
      "./sql_transfer.py",
      "mysql_source",
      "select * from Employees, Offices where Employees.officeCode=Offices.officeCode order by city",
      "target",
      "employees_etl3",
      "-sql_transform",
      "{\"empno\": \"employeeNumber\", \"lname\": \"lastName\", \"fname\": \"firstName\", \"phone\": \"phone\", \"ext\":\"extension\", \"email\":\"email\", \"manager\":\"reportsTo\", \"office\": { \"city\": \"city\", \"addr1\": \"addressLine1\", \"addr2\": \"addressLine2\", \"state\":\"state\", \"country\":\"country\", \"zip\":\"postalCode\", \"territory\":\"territory\"}}",
      "-upsert",
      "empno",
      "-regex_rewrite",
      "target"
   ],
   "end_time" : ISODate("2014-05-15T23:51:53.154Z"),
   "rows_completed" : NumberLong(23)
}

The progress messages only log the time and how many batches were completed, e.g.:

{
   "_id" : ObjectId("53758b59144db536ce294ab4"),
   "start_time" : ISODate("2014-05-15T23:51:53.128Z"),
   "batches_completed" : 1,
   "batch_size" : 1000
}

Use the -job_id flag to specify a unique ID that helps you track and query all log entries for a specific run, e.g.:

$ ./sql_transfer.py mysql_source "select * from Employees, Offices where Employees.officeCode=Offices.officeCode order by city"
   target employees_etl3 -sql_transform '{"empno": "employeeNumber", "lname": "lastName", "fname": "firstName", "phone": "phone",
   "ext":"extension", "email":"email", "manager":"reportsTo", "office": { "city": "city", "addr1": "addressLine1", "addr2": "addressLine2",
   "state":"state", "country":"country", "zip":"postalCode", "territory":"territory"}}' -upsert "empno" -regex_rewrite target -job_id abc

In which case the log records look like:

{
   "_id" : ObjectId("53758c80144db536da82a67e"),
   "start_time" : ISODate("2014-05-15T23:56:48.160Z"),
   "batches_completed" : 1,
   "job_id" : "abc",
   "batch_size" : 1000
}

and you can use the following find against the lmrm_etl_logs collection in the target database to view all log entries for this run:

db.lmrm_etl_logs.find({"job_id": "abc"})
