Working with the Spreadsheet BridgeΒΆ

Use the Spreadsheet Bridge to use your JSON data within Excel (or any other spreadsheet application that can read/write xls, xlsx and xslxm files). You can extract data into Excel and read back data from Excel. You can “flatten” data and then reconstruct it back using the Spreadsheet Bridge. This is a very powerful capability - a spreadsheet’s data model is tabular/flat and JSON documents are hierarchical. Using mapping metadata that you build using the Spreadsheet Bridge (or infer from the documents) you can bridge this data model. You extract elements from documents and sub-documents into spreadsheet column and you can create new documents or update existing documents from modified spreadsheets. You can even create new sub-documents through new spreadsheet columns.

When you map between JSON data and spreadsheet data, spreadsheet rows map to JSON documents and columns map to fields. For the most part a row maps to a document stored inside your collection but it can map to a partial document that is used to update many documents in the database (more on that later - for not it’s simpler to think of each row as a document stored in the database). Each column maps to a field and you use dot notation to map to an embedded field. A cell in the spreadsheet can contain a primitive data element (like a string or a number) or can even contain a complex JSON expression. That means that you can extract an entire sub-document into a spreadsheet cell and can update/create a new sub-document from data in a spreadsheet cell.

When you first open up the Spreadsheet Bridge you will see two main panels as shown below. On the left-hand-side is a standard collection viewer as you’ve already seen in the Finder and the Aggregation Builder. The collection viewer in the Spreadsheet Bridge has an embedded projection/selection/sorting capability so that you can narrow down the data that you are looking for either by reducing the number of fields or the number of documents before you export the data as a spreadsheet. You can double-click and alt-double-click to add additional conditions to the query or projection. You can ctrl/cmd-double click (ctrl on Windows and command on Macs) to add additional sorting conditions. Note that as of release 1.2 the where area supports subqueries.

_images/celler1.jpg

Note: If you have URLs in your data you can toggle the display to show them as hyperlinks. When you do so, all double-clicks are disabled. Toggle out of hyperlink mode to select. This also changes the word-wrap attributes allowing you to see long texts without using the horizontal scrollbar.

Use the right-hand-side panel to export/import the data (using the toolbar at the top) and define how to perform the mapping that controls the export/import.

You export data to a spreadsheet by clicking on the Export button in the toolbar (the button with the down arrow). To export data from a collection to excel you need to tell the Spreadsheet Bridge which fields to export as columns. If you do not specify anything (i.e. just select a collection and hit the export button), the Spreadsheet Bridge creates a default mapping for you that maps every field in the first document currently shown in the collection viewer (recursively - i.e. every field of a primitive type gets mapped to a column in the spreadsheet). Each column has a header that is the field name in dot-notation. As an example, the image above shows the mapping that was created automatically, specifying that the data in field _id goes to column A, the data in field 2008 goes to column B etc. The resulting spreadsheet would look like:

_images/celler2.jpg

The key to working with spreadsheets is to understand how the mapping affects the export, and especially the import. For each column in the spreadsheet that you want to populate specify a dot-notation. Then, each document is traversed in the collection and the data is extracted using this dot notation specification. If the field in the source document has a primitive type then the column will have that as a value (string, number, date etc.) If that field has an array or sub-document then the cell will have a JSON representation of that complex structure (as a string). Note - when you export data out to a spreadsheet and then re-import it all numeric fields will become doubles.

You have four ways to create mappings - bootstrap from top level, bootstrap from primitives, upload from a file or manually. Clear the metadata using the second bottom from the left on the toolbar (with the “x”). The next button to the right (with the empty circle) creates mapping metadata by traversing the first document shown in the collection viewer and creating a column mapping for each top-level field. If the field is of a primitive type then the values are extracted as such to the spreadsheet. If the fields have a sub-document or array then the cell will have the JSON representation. The column headers will have the field names (which will never be in dot-notation in this case because only top-level fields are mapped). The next button to the right (with the full circle) creates mapping metadata by traversing the first document shown in the collection viewer and creating a column mapping for each (recursive) field that has a primitive value. Note that the first document is the one used to bootstrap this mapping in both bootstrapping methods - so choose wisely. Use the next button to the right (thin upward facing arrow) to upload mapping metadata from an existing spreadsheet file. This looks at the header line and uses the header names to create a mapping.

When you bootstrap mapping metadata the mapping table is filled with field names and the preview populated with real data. The preview pane uses the first 5 documents shown in the collection viewer and the mapping table (that can also be updated manually) to show you what the data will look like when you pull it into a spreadsheet. This preview helps you design your mapping before you start the export. The preview is generated automatically with every change to the mapping metadata or you can click the Preview button (left-most button in the toolbar).

To build mapping metadata manually (or edit existing metadata already populated in the mapping table), edit in-place within the table as shown below:

_images/celler3.jpg

You can change the name, you can delete it completely (in which case an empty column will be created) and you can use dot-notation to extract data from sub-documents. You can also re-order the columns of the table (in which case, depending on your browser, you may have to refresh your screen after each such drag - this is a known issue in this release). In addition, you can add a field from the document to the mapping metadata by alt-ctrl-double-clicking on Windows or alt-command-double-click on Mac while on a field in the collection viewer on the left. Every time you alt-ctrl/command-double-click, the field (in dot notation) is added as the last column of the mapping metadata. You can clear the mapping metadata at any point to start building a mapping from scratch. You can also load existing mapping metadata that you may have stored previously in an lmrm collection using the Load mapping pulldown (see Studio database). When you are happy with the mapping you can export the data and create a spreadsheet and/or save the mapping metadata to the database using the Save mapping as field in the toolbar.

When you export the data to the spreadsheet, the fields underneath the toolbar control where the data is placed - e.g. which worksheet, what column to start on etc. You can also control how many documents to export. If you leave this blank the entire collection will be exported (up to the Excel limit which is just over 1 Million rows).

Every export includes the data and the header. The header is important when you import the data back because that tells the Spreadsheet Bridge what each column means. By default the header comprises of the dot-notation of the field names that you exported. If you select “Spanning column headers” then additional headers are created showing the recursive structure of the data. The only really important header row is the one just above the data itself with the dot notation fields; the rest are a convenience to show the structure of the data.

Let’s look at a simple example. The following shows a document with a non-trivial structure and the mapping being used:

_images/celler4.jpg

The normal export produces the following header:

_images/celler5.jpg

The spanning export produces the header shown below. Note that if you plan on using the spreadsheet and the Spreadsheet Bridge to import data back and you are using spanning columns that when you do the import the starting row field must point to the header with the dot notation fields. In this example the starting row for a correct import is 3. When no spanning headers are used above it is the default 1.

_images/celler6.jpg

It is almost always a good idea to export the _id field as the first column - especially if you want to use the Spreadsheet Bridge to update data back to the original collection. Once you have your data in a spreadsheet you can do any manipulation desired. You can add columns, you can change data, you can use formulas and macros, etc. The next step with the Spreadsheet Bridge is typically to import data back from such a modified/enriched spreadsheet. When you upload/import a spreadsheet every row maps to an update or insert statement in the database. In many cases each row maps to a single document since the update statement uses the document’s id or inserts a single document into the collection. But it’s important to understand that a row loaded generates a save operation to the database.

NOTE: In versions up to and including 1.1 the import was limited to ~2M files. In version 1.2 and up there are two import methods - one limited to 2M files that can import both xls and xlsx files as well as a streaming method available only for xlsx files that can import any Excel file up to 100M. Importing spreadsheets larger than ~20M may require you to increase the memory allocation for Tomcat to 4096 (from the default 1024) by changing setenev.{sh|bat}. Spreadsheets of size 100M almost always require such memory allocation. Also note that the streaming method allows you to have completely empty rows in the soreadsheet (and they will be ignored) while the upload method for xls/xlsx require the data to have no empty rows.

The controls that affect the import are highlighted below:

_images/celler7.jpg
  • Use the import button (arrow facing up) to start an import. This opens the Uploader area on the right. NOTE: In release 1.2 and up you have two import buttons (one with a thin arrow and one with a thick arrow). The first allows you to load from both xls and xlsx files but is limited to spreadsheets of size up to 2M. The second allows you to load very large Excel files (up to 1G) but is limited to xslx.
  • Check Update multi? if you want each row in the spreadsheet to potentially update more than one document (if the update conditions match). If you are importing and updating by id this will never happen. But you can use spreadsheets to update based on any combination of fields and therefore, potentially, update multiple documents with one row of the spreadsheet.
  • Use the pull-down to determine if you want to perform an insert (in which case you may get a duplicate key exception if you include keys), an update or an upsert (an update but if the data is not in the database, an insert is performed).
  • Use the “Update based on” button to open a dialog (shown below) where you can specify which fields of the mapping need to be used for the update conditions (this means that you need to have defined the mapping already or uploaded the mapping from your spreadsheet file). When you perform an update or an upsert, every row from the spreadsheet is used to create an update statement. An update statement has a section that defines which documents to match and a section that defines what data to change. When you use the dialog, anything to the right will be used to match which documents to update.
_images/celler8.jpg
  • Use the upload area to choose which file to upload, upload it or cancel the upload. The import has two stages - the upload itself and the import. If the upload completes and the import operation is running, use the cancel long operation button if you want to cancel the operation.
  • Use the Map to worksheet field to specify which sheet in your workbook to upload data from as well as the starting row and column to tell the Spreadsheet Bridge where to start reading data. This needs to be where the header line begins.
  • Use the import method pull-down to select whether the mapping metadata for the import is taken from the column headers in the file or from the mapping defined on-screen.

Here are some examples that illustrate the rich functionality possible using the Spreadsheet Bridge. In all cases the initial collection includes only the following documents:

{
   "_id" : ObjectId("528517b3bb1177ca391c1767"),
   "2007" : 2007.1,
   "2004" : 2004.1,
   "2005" : 2005.1,
   "2002" : 2002.1,
   "Country Code" : "HUN",
   "Country Name" : "Hungary"
}
{
   "_id" : ObjectId("528517b3bb1177ca391c1768"),
   "2007" : 2007.2,
   "2004" : 2004.2,
   "2005" : 2005.2,
   "2002" : 2002.2,
   "Country Code" : "IDN",
   "Country Name" : "Indonesia"
}
{
   "_id" : ObjectId("528517b3bb1177ca391c170b"),
   "2007" : 2007.3,
   "2004" : 2004.3,
   "2005" : 2005.3,
   "2002" : 2002.3,
   "Country Code" : "ARM",
   "Country Name" : "Armenia"
}

All the examples below perform upserts and operate on the original collection (as opposed to uploading to a new collection each time). Each phase starts with the resulting collection of the previous phase and each phase shows the spreadsheet being uploaded and the resulting collection:

_images/celler9.jpg _images/celler10.jpg _images/celler11.jpg

Note - in the previous example no update mapping was needed even though the update did not include an id. If update metadata is not defined and _id is present it is always used for the update. If neither update metadata is defined and an _id is not one of the fields, then the first column is used to define which documents to update.

Finally, here is an example to better understand inserts versus update. If you apply an upsert on the following it will only update but will not insert a new document:

_images/celler12.jpg

The reason is that often spreadsheets have sum rows and you do not want these to be inserted as a new document when you import.

You can create a new document by either specifying an explicit id or by doing an insert (removing the rows that map to en existing document so that you do not get a duplicate key exception).

_images/celler13.jpg

Note: The Spreadsheet Bridge is not a replacement to mongoexport and mpongoimport. While you can use the Spreadsheet Bridge to export/import data (for example, when you do not have access to export.import tools), you should be aware that while export/import yields the same data, using the Spreadsheet Bridge to export/import may yield different documents for at least the following reasons:

  • The Spreadsheet Bridge exports data potentially based on the first document. By default the first document defines the mapping and therefore any field in other documents that does not exist in the first document will not be included. You can define a superset of the mapping to work around this.
  • Numeric conversion may cause the results to have a different type. For example, if you have Long values in the original collection then after the Spreadsheet Bridge export/import the new documents will have an Int. This is due to the fact that Excel internally maintains all numeric data as floats.

Table Of Contents

Previous topic

Working with the Aggregation Builder

Next topic

Working with the Visualizer

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.