Using Twitter data

Twitter offers a platform for developers through which you can call APIs and get tweet data in JSON format. Depending on the number of tweets you want, and on whether access is pull-based or push-based, this may be free or may require payment. For more information see Twitter Dev.

Many companies these days sample tweets about themselves to do sentiment analysis. In this tutorial you’ll be using data that was extracted from Twitter using its APIs and stored in the database. The search strings had to include the name of one of a sample of insurance companies. A collection was created for each insurance company, and the tweets mentioning that company were stored in its collection. The Python program used was very rudimentary, but feel free to use it for your own experimentation:

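import json

import requests
from pymongo import MongoClient

# Connect to a MongoDB instance on localhost and use the "tweets" database.
client = MongoClient()
db = client["tweets"]

USER = '<your user name>'
PW = '<your password>'
collection_names = ["21st century", "nationwide", "esurance", "USAA", "GEICO", "All State", "State Farm", "Amica", "Liberty Mutual", "MAPFRE", "Progressive"]
keywords = ["21st century insurance", "nationwide insurance", "esurance", "usaa", "geico", "all state insurance", "State Farm", "amica", "liberty Mutual", "mapfre", "progressive insurance"]

def check_and_insert(line):
    """Store a tweet in the collection of every company whose keyword it mentions."""
    tweet = json.loads(line)
    # Stream messages that are not tweets have no "text" field.
    text = tweet.get("text", "")
    lowered = text.lower()
    # Compare in lowercase so that keywords such as "State Farm" also match.
    contained = [i for i, keyword in enumerate(keywords) if keyword.lower() in lowered]
    for i in contained:
        collection = db[collection_names[i]]
        # insert_one replaces the deprecated insert(); pass a copy because
        # insert_one adds an _id field to the dict it is given.
        collection.insert_one(dict(tweet))
        print("Inserting", text, "into", collection_names[i])


if __name__ == '__main__':
    # The endpoint and basic authentication are the ones used when the data was
    # collected; current versions of the Twitter API may require something else.
    r = requests.post('https://stream.twitter.com/1.1/statuses/filter.json',
                      data={'track': ','.join(keywords), 'language': 'en'},
                      auth=(USER, PW), stream=True)

    for line in r.iter_lines():
        if line:
            check_and_insert(line)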
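
The routing step of the program can be tried without a database or a network connection. The following is a minimal standalone sketch, not part of the original program: matching_collections is a hypothetical helper name, only three of the keywords are used, and the comparison is done in lowercase on both sides.

```python
# Hypothetical helper illustrating how a tweet is routed to collections.
COLLECTION_NAMES = ["GEICO", "State Farm", "Liberty Mutual"]
KEYWORDS = ["geico", "State Farm", "liberty Mutual"]


def matching_collections(text):
    """Return the names of the collections whose keyword occurs in text."""
    lowered = text.lower()
    return [name for name, keyword in zip(COLLECTION_NAMES, KEYWORDS)
            if keyword.lower() in lowered]


print(matching_collections("Liberty Mutual Legends of Golf Scores - Golf - ESPN"))
print(matching_collections("Comparing GEICO and State Farm quotes"))
```

A tweet that mentions several companies is stored once per matching collection, and a tweet that mentions none is not stored at all.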

An example of a tweet in JSON format:

{
     "_id" : ObjectId("517b0bfeef86987ca8908a45"),
     "contributors" : null,
     "text" : "Liberty Mutual Legends of Golf Scores - Golf - ESPN http://t.co/Y3NAV8gQVu",
     "geo" : null,
     "retweeted" : false,
     "in_reply_to_screen_name" : null,
     "possibly_sensitive" : false,
     "truncated" : false,
     "lang" : "en",
     "entities" : {
             "symbols" : [ ],
             "urls" : [
                     {
                             "expanded_url" : "http://bit.ly/14mzbe6",
                             "indices" : [
                                     52,
                                     74
                             ],
                             "display_url" : "bit.ly/14mzbe6",
                             "url" : "http://t.co/Y3NAV8gQVu"
                     }
             ],
             "hashtags" : [ ],
             "user_mentions" : [ ]
     },
     "in_reply_to_status_id_str" : null,
     "id" : NumberLong("327912332787318785"),
     "source" : "<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>",
     "in_reply_to_user_id_str" : null,
     "favorited" : false,
     "in_reply_to_status_id" : null,
     "retweet_count" : 0,
     "created_at" : "Fri Apr 26 22:29:27 +0000 2013",
     "in_reply_to_user_id" : null,
     "favorite_count" : 0,
     "id_str" : "327912332787318785",
     "place" : null,
     "user" : {
             "location" : "On Twitter",
             "default_profile" : false,
             "profile_background_tile" : false,
             "statuses_count" : 8334,
             "lang" : "en",
             "profile_link_color" : "30A4B8",
             "profile_banner_url" : "https://si0.twimg.com/profile_banners/434650093/1365819648",
             "id" : 434650093,
             "following" : null,
             "protected" : false,
             "favourites_count" : 1,
             "profile_text_color" : "4B821A",
             "description" : "This information will rock your golf world! Get the latest golf topic updates. I follow back. (I don't read DM's)",
             "verified" : false,
             "contributors_enabled" : false,
             "profile_sidebar_border_color" : "C0DEED",
             "name" : "Troy Corupe",
             "profile_background_color" : "5290C7",
             "created_at" : "Mon Dec 12 04:13:48 +0000 2011",
             "default_profile_image" : false,
             "followers_count" : 1380,
             "profile_image_url_https" : "https://si0.twimg.com/profile_images/1688345090/Gold_In_Field_normal.jpg",
             "geo_enabled" : false,
             "profile_background_image_url" : "http://a0.twimg.com/profile_background_images/425359241/golf.jpg",
             "profile_background_image_url_https" : "https://si0.twimg.com/profile_background_images/425359241/golf.jpg",
             "follow_request_sent" : null,
             "entities" : {
                     "description" : {
                             "urls" : [ ]
                     },
                     "url" : {
                             "urls" : [
                                     {
                                             "expanded_url" : null,
                                             "indices" : [
                                                     0,
                                                     29
                                             ],
                                             "url" : "http://www.FunGolflessons.com"
                                     }
                             ]
                     }
             },
             "url" : "http://www.FunGolflessons.com",
             "utc_offset" : -18000,
             "time_zone" : "Eastern Time (US & Canada)",
             "notifications" : null,
             "profile_use_background_image" : true,
             "friends_count" : 1739,
             "profile_sidebar_fill_color" : "DDEEF6",
             "screen_name" : "HoleInOneFun",
             "id_str" : "434650093",
             "profile_image_url" : "http://a0.twimg.com/profile_images/1688345090/Gold_In_Field_normal.jpg",
             "listed_count" : 4,
             "is_translator" : false
     },
     "coordinates" : null,
     "metadata" : {
             "result_type" : "recent",
             "iso_language_code" : "en"
     }
}

Once the data is in the database, log on to JSON Studio and you will be in the Finder. Select one of the collections in the top left pane and the collection viewer on the left displays the tweets - in this case there are 31,844 tweets in the Liberty Mutual collection:

_images/twitter_1.jpg

Now you want to search the documents for specific information. In the World Bank tutorial you saw point-and-click searches. Let’s use faceted search this time. Select the Facet search tab at the top. There are two search boxes where you can type: one that defines the equivalent of a SELECT clause (i.e. what to display in the output) and one that defines the equivalent of a WHERE clause (i.e. under what conditions a document should be part of the result set). If you enter nothing into these fields and just click the execute search (play) button at the top, an empty query is issued, meaning that all documents will be part of the result set and each document is returned in its entirety. The lower pane will show that you are cursoring through all documents (1 of 31844 documents) and the entire document is displayed:

_images/twitter_2.jpg

This is likely not what you want, so let’s start by reducing the information returned per document. If this is the first time you are using this collection, you may not have the metadata that faceted search uses - but the Studio will tell you this and let you click a button to compute the metadata on the fly.

Go to the top search field (the SELECT) and start typing the fields you want to include in each result document. The search field autocompletes your typing since it knows which fields are available. For example, you will surely want the text, so if you type just “te” the Studio proposes text and you can hit Enter to use it:

_images/twitter_3.jpg

The selection shows text:1 since that tells the database to include that field in the result. Keep entering some fields that you want to include in the result set. Note that you can specify fields in sub-documents using dot notation and that the facets autocomplete it for you as you go:

_images/twitter_4.jpg
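
To make the effect of an inclusion list such as text:1 concrete, here is a simplified in-memory sketch of how a projection with dot notation trims a document. It is an illustration under stated assumptions, not the Studio's or the database's implementation: project is a hypothetical helper, the sample tweet is a cut-down version of the example document, arrays are not handled, and _id is kept because MongoDB returns it by default.

```python
def project(document, projection):
    """Keep only the fields named in an inclusion projection (dot notation allowed)."""
    result = {}
    if "_id" in document:
        result["_id"] = document["_id"]  # returned unless explicitly excluded
    for path, include in projection.items():
        if not include:
            continue
        keys = path.split(".")
        source, target = document, result
        for key in keys[:-1]:
            if not isinstance(source.get(key), dict):
                break  # the path does not exist in this document
            source = source[key]
            target = target.setdefault(key, {})
        else:
            if keys[-1] in source:
                target[keys[-1]] = source[keys[-1]]
    return result


tweet = {
    "_id": 1,
    "text": "Liberty Mutual Legends of Golf Scores",
    "retweet_count": 0,
    "user": {"screen_name": "HoleInOneFun", "followers_count": 1380, "lang": "en"},
}
select = {"text": 1, "user.screen_name": 1, "user.followers_count": 1}
print(project(tweet, select))
```

With pymongo, the same select document could be passed as the second argument of collection.find(filter, projection).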

After specifying which fields you are interested in, the result set will still include all documents but each document will only have the fields you chose. You can use the cursor mechanism to show more than one document at a time and to scroll through the result set. You can also save the result set into a new collection, open it as a CSV file, and more:

_images/twitter_5.jpg

Now let’s filter the result set (i.e. add the equivalent of an SQL WHERE clause). In the WHERE search widget start typing retweet_count. The Studio will autocomplete it for you, so press Enter when it is the only selection. The Studio then displays a set of values that it has seen in the sample - in this case 0, 1 or 2:

_images/twitter_6.jpg

Select 2 and press Enter; the result set shows that there are 692 documents that have 2 retweets.

_images/twitter_7.jpg

The options the Studio showed you (0, 1 or 2) are based on sampling. The full list may include more values than the menu shows - either because the sampling was partial or because there are too many values and the menu only shows a subset. You can always type in a value yourself - e.g. to query for all documents that have a retweet_count of 4, simply ignore the menu and type in 4. This time the result set shows that there are only 6 documents:

_images/twitter_8.jpg
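
In MongoDB terms, the WHERE facet corresponds to the filter document of a find. The sketch below is a simplified in-memory illustration of an equality filter on retweet_count: matches is a hypothetical helper, the four sample documents are made up, and only top-level fields are handled.

```python
def matches(document, where):
    """Return True if every field in `where` equals the document's value."""
    return all(document.get(field) == value for field, value in where.items())


# Made-up sample documents standing in for the tweets collection.
tweets = [
    {"text": "first", "retweet_count": 0},
    {"text": "second", "retweet_count": 2},
    {"text": "third", "retweet_count": 4},
    {"text": "fourth", "retweet_count": 2},
]

where = {"retweet_count": 2}  # a value picked from the facet menu
selected = [t for t in tweets if matches(t, where)]
print(len(selected), "sample documents have 2 retweets")

where = {"retweet_count": 4}  # a value typed in by hand
print(sum(matches(t, where) for t in tweets), "sample document has 4 retweets")
```

An empty filter document matches every document, which is the empty query issued when both search boxes are left blank. With pymongo, the same filter document could be passed to collection.count_documents(where) or collection.find(where).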

Let’s take a small detour to see another way to select values for a query, but before we go there, clear the WHERE facet. Back in the collection viewer on the left, click the “View collection schema” icon:

_images/twitter_9.jpg

This takes you to the Schema Analyzer application, which uses the sampling metadata to show you what the collection schema looks like (i.e. how the documents themselves are structured). Each field (including fields in nested sub-documents) is shown with the distribution of the types found in that field. Many fields will be 100% of a certain type but some may have several different types. In the Liberty Mutual collection, for example, in_reply_to_screen_name has a string type in only 3% of the sampled documents and the rest of the documents have a null value:

_images/twitter_10.jpg
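
The idea of a per-field type distribution can be sketched in a few lines. This is an illustration only and not how the Schema Analyzer is implemented: type_distribution and type_name are hypothetical helpers, the four sample documents are made up, and percentages are computed over the documents in which the field occurs.

```python
from collections import Counter, defaultdict


def type_name(value):
    """Map a Python value to a JSON-style type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, dict):
        return "document"
    if isinstance(value, list):
        return "array"
    return type(value).__name__


def type_distribution(documents):
    """Return {field path: {type name: percentage}} over the given documents."""
    counts = defaultdict(Counter)

    def walk(value, path):
        counts[path][type_name(value)] += 1
        if isinstance(value, dict):  # recurse into sub-documents using dot notation
            for key, child in value.items():
                walk(child, path + "." + key)

    for document in documents:
        for key, value in document.items():
            walk(value, key)
    return {path: {name: 100.0 * n / sum(c.values()) for name, n in c.items()}
            for path, c in counts.items()}


sample = [
    {"retweet_count": 0, "in_reply_to_screen_name": None, "user": {"verified": False}},
    {"retweet_count": 2, "in_reply_to_screen_name": None, "user": {"verified": False}},
    {"retweet_count": 1, "in_reply_to_screen_name": "someuser", "user": {"verified": True}},
    {"retweet_count": 0, "in_reply_to_screen_name": None, "user": {"verified": False}},
]
for path, distribution in sorted(type_distribution(sample).items()):
    print(path, distribution)
```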

Scroll down and look at the retweet_count - it shows that 100% of the values are integers. Now click on the magnifying glass. This displays the values that exist from the samples in the collection; not surprisingly, these are the same values you had in the facet menu because they both look at the same thing:

_images/twitter_11.jpg

Now click on the box in the grid itself (in this case the blue box showing 100%):

_images/twitter_13.jpg

This issues a query that looks for all values of that field in the collection that have an integer type. The result is shown in the bottom right-hand pane. This result is not based on sampling - it searches the collection itself but only looks at the first documents, up to a limit that is set in the preferences:

_images/twitter_12.jpg
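
The following sketch mimics this kind of bounded scan in memory: it collects the distinct integer values of a field from the first documents only. It assumes the limit applies to the number of documents examined, which the description suggests but does not state precisely; integer_values is a hypothetical helper and the sample documents are made up. On the database side, MongoDB provides the $type query operator for selecting values by type.

```python
from itertools import islice


def integer_values(documents, field, limit):
    """Return the sorted distinct integer values of `field` in the first `limit` documents."""
    seen = set()
    for document in islice(documents, limit):
        value = document.get(field)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, int) and not isinstance(value, bool):
            seen.add(value)
    return sorted(seen)


# Made-up sample documents.
tweets = [
    {"retweet_count": 0},
    {"retweet_count": 2},
    {"retweet_count": None},
    {"retweet_count": 1},
    {"retweet_count": 5},
    {"retweet_count": 2},
]
print(integer_values(tweets, "retweet_count", limit=4))
print(integer_values(tweets, "retweet_count", limit=10))
```

With a small limit the value 5 is missed because it only occurs later in the collection, which is why the limit in the preferences affects what the pane shows.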

Note that the values are displayed as hyperlinks. Clicking a value adds it as a search condition back in the Finder. As an example, click on the value 5. Upon each click a small message tells you that a condition has been added. Now go back to the Finder by clicking the link at the top right. As you can see, the additional conditions have been added to the “Additional where” section and applied to the result set:

_images/twitter_14.jpg

If you click on the Query tab at the top you will see the actual find command that is issued to the database:

_images/twitter_15.jpg
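
The exact text that JSON Studio generates is the one in the screenshot. As a rough illustration of the general shape of such a command, the sketch below formats a mongo shell find() call from a filter (the WHERE part) and a projection (the SELECT part). find_command is a hypothetical helper, and the field choices are assumptions made for the example.

```python
import json


def find_command(collection, where, select):
    """Format a mongo shell find() call from a filter and a projection."""
    return "db.getCollection({}).find({}, {})".format(
        json.dumps(collection), json.dumps(where), json.dumps(select))


where = {"retweet_count": 5}                 # condition added from the Schema Analyzer
select = {"text": 1, "user.screen_name": 1}  # fields chosen in the SELECT facet
print(find_command("Liberty Mutual", where, select))
```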

You can modify the text in this text area and re-run the query. When you click the play button, whatever is currently marked/highlighted will be sent to the database. While you can edit the text shown above, be aware that any edit you make will be wiped out if you add a condition or use a facet search. Therefore, if you want to edit the query manually, it is better to copy and paste the text above the **** line and only then edit it - anything above the **** line is not touched:

_images/twitter_16.jpg

Copyright © 2013-2016 jSonar, Inc
MongoDB is a registered trademark of MongoDB Inc. Excel is a trademark of Microsoft Inc. JSON Studio is a registered trademark of jSonar Inc. All trademarks and service marks are the property of their respective owners.