Schema Analyzer/Sampler

The Schema Analyzer/Sampler, sonarsample, is a Python program that samples your database and prepares the lmrm__metadata collection, which is then used by facets in the Finder and by the Schemer. The lmrm__metadata collection is a Studio collection and should not be edited. For each collection it maintains the structure, the field names, and a sampling of values in the database. sonarsample usually runs on a schedule to periodically re-sample the database so that the metadata stays up to date.

You can run sonarsample from any host that can connect to the databases, and you can use sonarsample in any topology. For example, you can have a single sonarsample that prepares the metadata for many instances, or a sonarsample collocated with each instance. The most common deployment is a sonarsample running on each database server, sampling the databases on that instance and putting all metadata for all databases in the instance into one Studio database on the same instance.

When you install JSON Studio, the sonarsample.py program is installed in the sonarFinder directory along with a sample configuration file, SonarSample.conf. There are two important section types, as shown below:

# Sample setup for the "test" database, using test_meta as the
# metadata database.


# The sonarsample.py script reads this file. It looks for the file given as a
# command line parameter; if none is given, it looks in the sonarFinder install directory.

# possible auth_methods are:
#  "none" -- when running with no auth; the default
#  "MONGODB-CR" -- when running in auth mode;
#  "GSSAPI" -- for kerberos.

# default port is 27017, so you don't have to specify it.
# ports can be specified as part of the host name, or in the 'port' parameter
# If you don't specify sample_running_time, the default is 2 minutes. It is specified in seconds.
#
# when connecting to replica sets, specify the replica set members as a comma separated list in square brackets. You can specify the
# port for each replica set member by using host:port format.
#
# max_fields_per_collection  specifies the maximum number of fields, per collection, from which samples will be taken. Default is 1000.
# max_samples_per_field specifies the maximum samples for any given field. Default is 50.
# samples_query@COLLECTION_NAME    Criteria to limit the documents that will be sampled for the collection named "COLLECTION_NAME".
#                                  This must be a valid JSON string with the same selection criteria as a Mongo find command. The
#                                  document must use double quotes, and if it spans multiple lines each line must be indented.
#                                  For example, to only sample a document in "foo" when its field "value" is greater than 15:
#                                  samples_query@foo = {"value" : {"$gt" : 15}}
# samples_project@COLLECTION_NAME  Criteria to limit the fields that will be sampled for the collection named "COLLECTION_NAME".
#                                  Like samples_query, this must be a valid JSON string with an object or a list. For example, to
#                                  only sample the "value" field in "foo", you may use one of the following:
#                                  samples_project@foo = ["value"]
#                                  samples_project@foo = {"value" : true}
# samples_sort@COLLECTION_NAME     Criteria to sort the order in which documents will be sampled. Use this option to make sure
#                                  certain documents are sampled first. This must be a valid JSON string containing an object,
#                                  mapping field names to 1 (for ascending) or -1 (for descending).
# samples_limit@COLLECTION_NAME    The maximum number of documents to read from the database for the collection named "COLLECTION_NAME".


[test]
db_name = test
port = 27017
host =   localhost
username =
password =
auth_method = none
kerberos_service_name = mongodb
sample_running_time = 240
max_samples_per_field = 100
samples_query =
samples_project =
samples_sort = {"$natural": -1}
samples_limit = 10000000

# Optionally, you can specify a database for the sonarFinder metadata.
# If you don't (i.e. all the lines below are commented out or missing),
# the database itself will be used.

[test:meta]
db_name = test_meta
port = 27017
host =  localhost
username =
password =
auth_method = none

Set up a database section for each database you want to sample. In the example above, the lmrm__metadata collection for the test database will be in the test_meta database. If you omit the meta section, the lmrm__metadata collection will be populated in the test database itself. Note that this configuration allows you to place the lmrm__metadata collection on a different instance (not just a different database). If you are running version 1.0.x of JSON Studio, the Studio database must reside on the same instance as your database (see Studio database). Starting from version 1.1 you can place the metadata on a separate instance.
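For example, a hypothetical pair of sections that keeps the metadata on a different instance could look like the following (the host names, ports, and database names here are placeholders, not values from your environment):

[prod]
db_name = prod
host = db1.example.com
port = 27017
auth_method = none

[prod:meta]
db_name = prod_meta
host = meta1.example.com
port = 27019
auth_method = none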

By default, metadata is deleted and replaced on every run. You can instead generate samples and type information incrementally. For example, you can run the sampler once for a longer time on a large collection and then, once a day or once a week, just add to the samples without redoing the entire collection. Use the -k or --keep-existing flag to not delete previous metadata but merge new samples into the existing metadata. When you build metadata incrementally, use the samples_query@COLLECTION_NAME mechanism to ensure that you are sampling new documents on every run; there is no point in merging sample data if you always look at the same first documents. You can do this with a query, e.g.:

samples_query@nationwide = {"created_at": {"$gt": {"$date": "2013-10-21T11:53:22.121Z"}}}
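An incremental run then combines such a query with the -k flag, for example (the install path is a placeholder):

/usr/bin/python <installdir>/sonarsample.py -k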

You can also do this with a sort specifier using {"$natural": -1}, so that you always look at the last few documents in the collection. A common sampling technique is to sample often (perhaps once an hour) and for a few seconds only, using {"$natural": -1} in the sort condition. In summary, you can define a query, a projection, a sort, and a limit, and you can define them per collection, e.g.:

samples_query@Collection = {...}
samples_project@Collection = ["field_a", "field_b", "field_c"]
samples_sort@Collection = {"field_a" : 1, "field_b" : -1}
samples_limit@Collection = 500

and you can have a default specifier which is used for collections that were not explicitly defined:

samples_query = {...}
samples_project = {...}
samples_sort = {"$natural": -1}
samples_limit = 500

If you specify nothing at all, the default is to use {"$natural": -1} with a limit of 100K documents.
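Conceptually, each set of specifiers maps onto a standard MongoDB find. The sketch below is only an illustration using pymongo (the collection, field names, and date are taken from the examples above; the real sampler adds batching and throttling on top of this):

# Illustrative sketch only: the find() that the samples_* settings imply
# for one collection -- not sonarsample's actual code.
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("localhost", 27017)   # host/port from the [test] section
coll = client["test"]["nationwide"]        # database/collection names are illustrative

cursor = (
    coll.find(
        {"created_at": {"$gt": datetime(2013, 10, 21)}},  # samples_query
        projection=["value"],                             # samples_project (list form)
    )
    .sort([("$natural", -1)])                             # samples_sort: newest documents first
    .limit(500)                                           # samples_limit
)
for doc in cursor:
    pass  # here the sampler would record field names, types, and sample values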

To configure a section for a replica set, simply list all the members of the replica set in both the database section and the Studio database section, e.g.:

[replication_example]
db_name = test
port = 27017
host = [mongo0.example.com, mongo1.example.com, mongo2.example.com]
sample_running_time = 30
max_samples_per_field = 15

You do not have to specify the port number for each member of the array; the port parameter will be used. You can override the default port as above, or specify a port number for each member of the array.
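For example, using the host:port format described in the configuration comments, individual members can be given their own ports (host names and ports here are placeholders):

host = [mongo0.example.com:27017, mongo1.example.com:27018, mongo2.example.com]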

When connecting to a replica set, you can (and probably should) prefer a secondary by using the --prefer-secondary flag.
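A run against a replica set might then look like this (the install path is a placeholder):

/usr/bin/python <installdir>/sonarsample.py --prefer-secondary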

The sampler can throttle itself to consume fewer resources and send fewer requests to the database; the side effect is that sampling takes longer. The default is to sample at a rate of 2000 documents per second. If you want to shorten the sampling time and have the resources available, run the sampler with the --no-throttle flag, in which case the sampler will consume a full processor core if it can. If you want the sampler to consume fewer resources, use the --throttle x/y flag, in which case the sampler will sample x documents every y seconds and sleep while not doing anything. For example, to sample 1000 documents every 5 seconds use --throttle 1000/5. This is generally equivalent to --throttle 200/1 but may differ in the spikes/sleeps depending on the batch size used.
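As a rough sketch of the idea only (not the sampler's actual implementation), a --throttle x/y loop reads x documents and then sleeps out the remainder of each y-second window:

import time

def throttled_sample(cursor, x, y):
    """Illustrative only: read up to x documents per y-second window,
    sleeping away the remainder of each window."""
    while True:
        start = time.time()
        batch = [doc for _, doc in zip(range(x), cursor)]
        if not batch:
            break                              # cursor exhausted
        # ... record field names, types, and sample values from the batch ...
        remaining = y - (time.time() - start)
        if remaining > 0:
            time.sleep(remaining)              # throttle: idle until the window ends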

Because documents change, it is a good idea to periodically re-run the sampler and recompute or merge the metadata. Use whatever scheduling system you already have by adding a callout to sonarsample. The SonarSample.conf file should remain in the same directory as sonarsample.py, making integration with the scheduling system easy. For example, to call sonarsample from cron:

  • Log in or su to the user that should connect to mongo and run the metadata build.

  • Run: crontab -e. An editor will open with some comments on how to add a new job to run periodically; each job occupies one line in the file. Please read a general introduction to cron files, such as Cron Scheduling, before continuing.

  • Add a line that describes how often to run the metadata rebuild. For example, if you want to run every hour, add a line such as:

    0 * * * * /usr/bin/python <installdir>/sonarsample.py > <installdir>/logs/sampler.log 2>&1
    
  • Adjust the log file path if you wish to write the output elsewhere (or use /dev/null to discard it).

  • Save the file.

The example shown makes the sampler run once an hour; please read the documentation referenced above to find out how to configure other time intervals. The command that will run once an hour (assuming that the Studio is installed in /home/joe/sonarFinder) is:

/usr/bin/python /home/joe/sonarFinder/sonarsample.py > /home/joe/sonarFinder/logs/sampler.log 2>&1
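Standard cron syntax covers other intervals as well; for example, to run every six hours, or once a day at 02:30 (same paths as above):

0 */6 * * * /usr/bin/python /home/joe/sonarFinder/sonarsample.py > /home/joe/sonarFinder/logs/sampler.log 2>&1
30 2 * * * /usr/bin/python /home/joe/sonarFinder/sonarsample.py > /home/joe/sonarFinder/logs/sampler.log 2>&1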

The sampler's log file will be in the logs subdirectory of the sonarFinder directory, named sampler.log.

To use a specific configuration file (rather than the one in the install directory), use the -c flag.
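For example, assuming a configuration file kept at /etc/jsonar/SonarSample.conf (the path is a placeholder):

/usr/bin/python /home/joe/sonarFinder/sonarsample.py -c /etc/jsonar/SonarSample.conf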

Metadata Collection and Advanced Architectures

The metadata collection lmrm__metadata in the Studio database maintains samples and statistics for the implied schema of each collection. You can keep this data inside your own database or in a separate database. Moreover, you can keep metadata for many databases, and even many instances, in a single Studio database. Therefore, each document in this collection specifies an instance (using IP and port), a database name, and a collection name. Thus, even if metadata from multiple instances happens to share the same database name and the same collection name (but different data), there will be no collisions.
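For illustration only, a document in lmrm__metadata is keyed by something like the following (the field names here are hypothetical, not the actual schema; only the three key components described above are documented):

{ "instance" : "10.1.2.3:27017", "db_name" : "test", "collection" : "orders", ... }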

If you are using replica sets, the replica set name is used instead of the instance identifier (IP and port). This way, if the metadata was computed while connected to the original primary, which later goes down, the metadata will still be used when connected to the new primary (the old secondary). NOTE: if you have multiple different replica sets with the same name, do NOT share the same Studio database, as there will be collisions. Reusing a replica set name is in general not recommended, but if you choose to do so you should use different Studio databases.

If you are using sharded clusters, metadata is maintained separately for each mongos instance (in fact, the metadata key uses the IP and port of the mongos). If you have multiple mongos servers, prefer to always connect to the same one from JSON Studio, or generate the metadata for each mongos (even though this is redundant). It is important to configure sonarsample itself to connect to the mongos and not to the individual mongod processes; otherwise the metadata key would reference a mongod, and the samples would never be used by a Studio connected through the mongos (note that the lmrm__metadata collection is not sharded).
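For example, a hypothetical section pointing sonarsample at a mongos router could look like this (the host name is a placeholder):

[sharded_example]
db_name = test
host = mongos0.example.com
port = 27017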

