Building a Recommendation Engine with R and JSON Studio Using Data Stored in MongoDB

Recommendation engines are everywhere these days: Netflix telling you what you would probably like to watch, Amazon suggesting what you might want to buy, Facebook and LinkedIn telling you who you might know. These are all recommendation engines, and they are probably the best mainstream examples of the advanced/predictive analytics our world is currently filled with.

In this tutorial we build a simple recommendation engine with the R statistical computing environment, the JSON Studio Gateway and the jSonarR R package, using data stored in MongoDB.

The data we use comes from Steam (http://store.steampowered.com), so this tutorial is about gaming - i.e. what games people like to play. From Wikipedia:

Steam is an internet-based digital distribution, digital rights management,
multiplayer, and social networking platform developed by Valve Corporation.

Steam provides the user with installation and automatic updating of games on
multiple computers, and community features such as friends lists and groups,
cloud saving, and in-game voice and chat functionality.

The software provides a freely available application programming interface (API) called Steamworks,
which developers can use to integrate many of Steam's functions, including
networking and matchmaking, in-game achievements, micro-transactions, and
support for user-created content through Steam Workshop, into their products.

As of January 2014, over 3,000 games are available through Steam, which has 75
million active users. Steam has had as many as 8 million concurrent users
as of June 2014.

As part of the Steamworks API you can get data that describes which user plays which games (and how much) in a JSON format. So the first step is to get this JSON into the database. Specifically, each document has the following form:

{
  "_id": {
    "$oid": "53b21af672a53f1ac56db3cd"
  },
  "achievements": [
    {
      "unlock_percentage": 72.08170318603516,
      "is_unlocked": true,
      "name": "Royal Ride",
      "id": "ACH_GOLD_NO_OVERFILL"
    },
    ...
    {
      "unlock_percentage": 12.61406803131104,
      "is_unlocked": false,
      "name": "Candy Stripe",
      "id": "ACH_CANDYCANE"
    }
  ],
  "game": "Audiosurf",
  "game_id": 12900,
  "owner": "76561198000664965",
  "playtime_forever": 196
}

Each document describes usage of a game by a user. It includes the owner (the user), the game ID (and name) and the playtime - i.e. how long the user has spent playing that game. It also includes an array of “achievements”, which can include levels, bonuses etc., but we don’t use this array in this tutorial.

The recommendation engine we build is simplistic, since it’s only meant to demonstrate the “glue” needed to build one (and in fact, Steam has an excellent built-in recommendation engine as part of the service anyway). For example, we could also get friend and group data as JSON from Steam and base recommendations on people’s relationships as well. To keep things simple we build recommendations based only on playing times, using just the documents above. The general idea is that if you play a lot of game A, B and C, and many other people who play a lot of game A, B and C also spend a lot of time playing game D, then D would be a good recommendation for you.

Most of the work happens in R, since R has built-in functions for almost any statistical analysis you can think of. The data comes from the database through an aggregation pipeline that “crunches” it into a form almost ready to be consumed by the R script, and it is pulled from the JSON Studio Gateway using the jSonarR package available on CRAN.

There are many ways to build a recommendation engine in R, and there are many references on the Internet about how to build such engines and which algorithms may be used. For a very quick and simple read see http://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/itembased.html - we use a similar method here (except that we use user-based similarity, whereas the description in that link is item-based). Specifically, we use cosine similarity - one of the possible measures of similarity. The matrix we construct has one row per user and one column per game. A cell in the matrix holds a “normalized” playtime of that game by that user, so each user vector in this matrix represents that user’s preferences. Using cosine similarity we can tell which vectors are close to which other vectors and thus compute recommendations from those “close” vectors.
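For concreteness, here is a minimal sketch (with made-up playtime numbers) of what cosine similarity between two such user vectors looks like in base R:

# Two hypothetical user vectors of normalized playtimes for the same three games
user_a = c(10.0, 0.1, 5.0)
user_b = c(8.0, 0.0, 4.5)

# Cosine similarity: dot product divided by the product of the vector norms
sum(user_a * user_b) / (sqrt(sum(user_a^2)) * sqrt(sum(user_b^2)))
# ~0.998 - close to 1, so these two hypothetical users have very similar tastes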

The normalization takes into account the average playtime for a game. Some games naturally get a lot of playtime and some get much less, so raw playtime alone may not indicate whether you like a game. For example, assume that the average playtime for game A across all users is 10 and that your total playtime for that game is 100 - you probably really like this game. If, on the other hand, the average playtime for that game is 1000, then you probably don’t care that much for it. Normalizing lets us use playtimes as an indicator of “like”.
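As a quick illustration of the normalization, using those hypothetical numbers:

# Your total playtime for game A, from the example above
playtime = 100

playtime / 10    # 10  - ten times the average playtime, a strong "like" signal
playtime / 1000  # 0.1 - a tenth of the average playtime, a much weaker signal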

The first thing we need to do therefore is transform the data into something we can use for this matrix. This is a perfect use-case for an aggregation pipeline so we build it using the JSON Studio. The end result of the pipeline is a set of documents each one looking like:

{
  "_id": {
    "owner": "76561198025055004"
  },
  "games": [
    {
      "game": {
        "game_id": 400
      },
      "weighted_playtime": 0.0863075817522529
    },
    {
      "game": {
        "game_id": 440
      },
      "weighted_playtime": 9.314247629824883E-4
    }
  ]
}

Each such document represents a user’s ownership and weighted playtime of their games.
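As a small sketch of how such a document looks once parsed in R (using the rjson package that we load further below) and how its fields are accessed - the matrix-building code later in this tutorial uses the same access pattern:

library(rjson)

# One pipeline result document as a JSON string (taken from the example above)
doc = fromJSON('{"_id": {"owner": "76561198025055004"},
                 "games": [{"game": {"game_id": 400},
                            "weighted_playtime": 0.0863075817522529}]}')

doc[['_id']][['owner']]                     # "76561198025055004"
doc[['games']][[1]][['game']][['game_id']]  # 400
doc[['games']][[1]][['weighted_playtime']]  # 0.0863...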

Building the pipeline using the studio is simple - here are screen shots of the whole pipeline and of each stage in the pipeline:

_images/recom1.jpg

Stage 1:

_images/recom2.jpg

Stage 2:

_images/recom3.jpg

Stage 3:

_images/recom4.jpg

Stage 4:

_images/recom5.jpg

Stage 5:

_images/recom6.jpg

The entire pipeline code (in case you want to run it from the shell) is:

[
  {
    "$match": {
      "playtime_forever": { "$ne": 0 }
    }
  },
  {
    "$group": {
      "_id": { "game_id": "$game_id" },
      "owners": {
        "$push": {
          "owner": "$owner",
          "playtime": "$playtime_forever"
        }
      },
      "average_playtime": { "$avg": "$playtime_forever" }
    }
  },
  {
    "$unwind": "$owners"
  },
  {
    "$project": {
      "game": "$_id",
      "_id": 0,
      "owner": "$owners.owner",
      "weighted_playtime": {
        "$divide": [ "$owners.playtime", "$average_playtime" ]
      }
    }
  },
  {
    "$group": {
      "_id": { "owner": "$owner" },
      "games": {
        "$push": {
          "game": "$game",
          "weighted_playtime": "$weighted_playtime"
        }
      }
    }
  }
]

There is one more “mini-pipeline” we use - to compute the distinct game IDs. It is not absolutely necessary, but it simplifies the R code a lot. The matrix has a very large number of rows - one per user - potentially in the millions, so we build it as we iterate over the results of the aggregation pipeline. The matrix does not have too many columns, because there are only thousands of games, but the results coming out of the aggregation pipeline are sparse - not every user plays every game. If we know up front how many games there are in total, we can preallocate every user vector and fill in the values, rather than remembering ID mappings while iterating or making two passes over the data (a small sketch of this idea follows the pipeline below). That second aggregation call is indeed very simple and just looks like:

{
  "$group": {
    "_id": "$game_id"
  }
}
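To illustrate why knowing the full set of game IDs up front helps, here is a small sketch with made-up IDs: each user vector can be preallocated with one zero-filled slot per game and then filled in by game ID, so a single pass over the sparse pipeline results is enough:

# Hypothetical list of distinct game IDs returned by the mini-pipeline above
all_game_ids = c("400", "440", "12900")

# Preallocate a user vector with one slot per game, all zero
user_vector = setNames(numeric(length(all_game_ids)), all_game_ids)

# Fill in only the games this user actually plays, addressed by game ID
user_vector["400"] = 0.0863
user_vector["12900"] = 1.52

user_vector  # games the user does not play stay at 0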

Now we can turn to the R script. First, we need to install and load the jSonarR package so we can call the aggregation pipeline through the Gateway and get the data into R:

install.packages("jSonarR")
library('jSonarR')

We also use two other packages: rjson for JSON parsing and lsa, which provides the cosine similarity function:

install.packages('rjson')
library(rjson)
install.packages('lsa')
library(lsa)

Then make a connection to the Gateway (in this case the Gateway runs on localhost, the database is also on localhost, and the database name is steam):

con <- jSonarR::new.SonarConnection('https://localhost:8443', 'localhost',
               'steam', port=47017, username="qa1", pwd="qa1")

When using jSonarR you usually make calls to the Gateway that directly create an R data frame using sonarFind or sonarAgg. These convenience functions shield you from a lot of complexity and give you a data frame to which you can apply any R function. In this tutorial, however, we use a lower-level function that makes the Gateway call and returns the JSON data directly, and we then do some additional manipulation within R to build the matrix. The reason is that it is not possible to convert the game ID data into fields (i.e. matrix columns) using an aggregation pipeline alone.
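For comparison, such a convenience call would look roughly like the following sketch (the query and collection names here mirror the sonarJSON calls below, but the exact argument list is an assumption, so check the jSonarR documentation):

# A hedged sketch: sonarAgg returns the published aggregation results directly
# as a data frame (argument order assumed to match the sonarJSON calls below)
df = jSonarR::sonarAgg(con, 'steam_data', 'steam1')
head(df)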

First we call the two aggregation pipelines, get the results into lists, and compute the list sizes (we need the number of games, for example, to know how many columns to allocate for each user vector):

# Distinct game IDs (these become the matrix columns)
game_list = jSonarR::sonarJSON(con, 'distinct_game_id', 'steam1', type='agg')
columns = length(game_list)
# One document per user, with weighted playtimes per game (the matrix rows)
lines = jSonarR::sonarJSON(con, 'steam_data', 'steam1', type='agg')
rows = length(lines)

We then compute the game vector and allocate the matrix:

# Collect all game IDs; these become the matrix column names
game_vector = integer(columns)
for (i in 1:columns){
  current_game = game_list[[i]]
  game_vector[i] = current_game[['_id']]
}
# Preallocate the user x game matrix
m = matrix(nrow=rows,ncol=columns)
rownames(m) = 1:rows
colnames(m) = game_vector

Now populate the entire matrix - this can take a very long time depending on how much data you’ve extracted from the aggregation pipeline, how much memory you have, etc.:

for (i in 1:rows) {
  current_row = lines[[i]]

  # Use the owner ID as the row name, then fill in this user's games by game ID
  rownames(m)[i] = toString(current_row[['_id']])
  total_games = length(current_row[['games']])
  for (j in 1:total_games){
    m[toString(current_row[['_id']]),
      toString(current_row[['games']][[j]][['game']])] =
      current_row[['games']][[j]][['weighted_playtime']]
  }
}
m[is.na(m)]=0

The last line makes sure that if someone does not own/play a game, the corresponding entry in the matrix is 0.

Everything until now was preparation - now comes the fun stuff. We use a cosine measure for closeness and compute the recommendations; once more, these function calls can take a long time, so be patient:

# Calculates the cosine measure between the columns of a matrix;
# We transpose so that we get the measure between users
# This is from the Latent Semantic Analysis (lsa) package
rec = cosine(t(m))

rec[is.nan(rec)] = 0

# This returns an NxN matrix - i.e. it compares each user with every other user
# Cell X,Y for example contains the cosine similarity between user X and user Y

# So now let's get the recommendations for the first user as an example

# Use the similarity values for the first user
# (i.e. how similar that user is to every other user)
# Multiply it with the full game matrix - so similar users
# will have a larger effect on recommendations
first_row_rec = rec[1,] %*% m

# Sum the weighted playtimes of each game across all users (the column sums
# of m); this is used below to normalize the recommendation scores
ones = matrix(nrow=1,ncol=dim(m)[1])

ones[is.na(ones)] = 1
total_rec = ones %*% m

weighted_first_row_rec = first_row_rec / total_rec
weighted_first_row_rec[is.nan(weighted_first_row_rec)] = 0

# So these are the top 10 recommended games for our first user
top_10_recs = colnames(weighted_first_row_rec)[
         tail(order(weighted_first_row_rec),10)]
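As an aside, the same computation works for any user, not just the first row. Here is a hedged sketch that looks a user up by Steam owner ID instead (the ID below is just an example from earlier in this tutorial; it assumes that ID appears in rownames(m) as set during the matrix build, and that cosine() preserved those IDs as the row names of rec):

# Hypothetical owner ID - substitute any ID that appears in rownames(m)
owner_id = "76561198025055004"

user_rec = rec[owner_id,] %*% m
weighted_user_rec = user_rec / total_rec
weighted_user_rec[is.nan(weighted_user_rec)] = 0

top_10 = colnames(weighted_user_rec)[tail(order(weighted_user_rec),10)]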

An example output showing the top 10 recommendations for our first lucky user:

> top_10_recs
 [1] "63660" "32000" "440" "244590" "19990" "264970" "285160"
 "38530" "284000" "12230"

And if we run a simple report using JSON Studio we can see the names of the recommendations:

_images/recom7.jpg

Editor’s note 1: Most of the recommendations make sense to me - I’m not sure though that “Putt-Putt and Pep’s Balloon-o-Rama” would really be appealing to someone who also plays “Grand Theft Auto III”. But there are plenty more refinements possible - as mentioned, this tutorial is a simplistic attempt at a recommendation engine meant for illustrative purposes.

Editor’s note 2: One of the folks on our team - an uber-gamer and a mathematician who also wrote much of the code for this tutorial - reviewed the write-up and commented: “The Putt-putt recommendation is not as weird as it seems. I recall a lot of people playing that game, ironically, so that would explain it popping up in that list”.

Go figure :-)

