My goal is to design a system for recommending board games that are likely to be enjoyed by a user based on their expressed interest in a small set of games. I will use game rating and game feature data collected from boardgamegeek.com (BGG) to build a series of models of the relationships between user preferences, item similarity, and board game features.
While I hope to build a model that works well for users of all ages, experience levels, and personal preferences, BGG overwhelmingly serves a community of dedicated gamers, with strong representation of and preference for contemporary board games, tabletop games, war games, and abstract strategy games. However, because less mainstream games are now widely accessible through Amazon, eBay, and other online retailers, as well as better represented in big box stores and book stores, my hope is that users only familiar with mainstream games (Risk, Monopoly, Chess) may be exposed to appropriate games that properly cater to what they enjoy in a gaming experience.
The primary evaluation of my model will be user feedback from individuals interacting with the final product. I plan to use online forums to elicit evaluations of the quality of the product, as well as traveling to local game stores and asking patrons to try out my system.
All data were collected from boardgamegeek.com (BGG), an online database and forum for board game fans. While the site is heavily moderated, most content is generated by members, who are awarded an in-site currency 'Geek Gold' for their contributions. The site encourages data mining, and provides an XML interface for accessing game data as well as information about the games in a user's collection.
Despite being a mainstay of the board gaming community (with many of the world's most celebrated game designers as active members), the site lags in its adoption of modern web design practices, which makes for an unfriendly user experience and presented many technical challenges to my data acquisition process. While I will try not to belabor the point, this experience taught me the important lesson that one should never trust that current rules have always been in place, or that stated rules are universally applied.
While the API allows quick access to information, boardgames must be looked up by their gameid (a unique* integer of 1-6 digits), and users must be queried by their username (an alphanumeric with possible spaces and limited punctuation*).
I chose to scrape the top 10k games as ordered by number of ratings. Each game included in this group had at least 100 ratings. Game names and URLs were scraped from 100 pages using the requests library and Scrapy. Gameids were extracted from the URLs using regular expressions. These data were saved as a single CSV file for quick local access. This took around 5 minutes to execute.
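For reference, a minimal sketch of this step looks roughly as follows (the browse URL format, regular expression, and one-second delay are assumptions for illustration; my actual run also used Scrapy):

```python
# Sketch: scrape the ranked game list and pull gameids out of the URLs.
# The browse URL pattern and link regex are assumptions for illustration.
import csv
import re
import time
import requests

GAME_LINK = re.compile(r'href="(/boardgame/(\d+)/[^"]+)"')

rows = set()
for page in range(1, 101):  # 100 pages x 100 games per page
    url = f"https://boardgamegeek.com/browse/boardgame?sort=numvoters&page={page}"
    html = requests.get(url, timeout=30).text
    for link, gameid in GAME_LINK.findall(html):
        rows.add((int(gameid), "https://boardgamegeek.com" + link))
    time.sleep(1)  # stay well under the site's stated request limit

with open("top_games.csv", "w", newline="") as f:
    csv.writer(f).writerows(sorted(rows))
```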
I used the generated list of gameids to batch query 100 games at a time from the XML API. I again used requests to execute my queries, and used xml.etree.ElementTree to access elements from the returned XML document. A number of features were stored into two separate CSV files for later upload into separate tables in PostgreSQL. The first (which became my 'boardgames' table) indexed games by their gameid, and included the name, url, year published, minimum play time, maximum play time, recommended age, minimum number of players, maximum number of players, average rating, an adjusted 'bayes rating' (as calculated by the site to avoid ranking new games with a few highly positive reviews above more established games), the total number of ratings, the rating standard deviation, and a complexity score (a user-generated metric that seeks to quantify how complicated a game is). The second (which became my 'features' table) captured a series of tags attached to boardgames. These included the subdomain for the game (a broad classifier with 8 groups), the category (a more specific classifier with 84 groups), game mechanics (51 items such as dice rolling, card drafting, and modular board), and game family (1747 more esoteric classifications including specific themes). The site assigned each of these features a unique numeric id, henceforth 'fid', which I used to index games alongside the gameid for quick access. This scraping process took around 30 minutes to execute.
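A minimal sketch of one batched query (the XMLAPI2 'thing' endpoint and tag names here are assumptions for illustration; the fields I actually captured were broader):

```python
# Sketch: batch-query 100 games at a time and parse the returned XML.
# Endpoint and tag names follow the XMLAPI2 'thing' format (an assumption).
import requests
import xml.etree.ElementTree as ET

def fetch_games(gameids):
    url = "https://boardgamegeek.com/xmlapi2/thing"
    params = {"id": ",".join(map(str, gameids)), "stats": "1"}
    root = ET.fromstring(requests.get(url, params=params, timeout=60).content)
    for item in root.findall("item"):
        gid = int(item.get("id"))
        name = next(n.get("value") for n in item.findall("name")
                    if n.get("type") == "primary")
        avg = float(item.find("statistics/ratings/average").get("value"))
        yield gid, name, avg  # the real run captured many more fields

gameids = [13, 822, 68448]  # in practice, the full list from the scraped CSV
for i in range(0, len(gameids), 100):
    for row in fetch_games(gameids[i:i + 100]):
        print(row)  # in practice, appended to the 'boardgames' CSV
```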
While the list of board games ordered by number of ratings was accessible via HTML, individual user ratings for each game were rendered with JavaScript at 50 ratings per page. My initial plan was to scrape around 200k pages from the 10k games to capture the approximately 11.5M ratings reported for this set of games. Because the JavaScript elements could sometimes take as long as 5 seconds to load, I anticipated this would take over 250 hours, and so decided to distribute my queries across 10 AWS instances with environments managed by Docker. I used Selenium combined with PhantomJS to set up a headless browser environment that allowed the rendering of page elements without graphical output. I set up a loop to check that the elements to be scraped had finished loading on each page, then used Scrapy to capture the source code, which was batch-extracted by XPath. Data were stored locally on each AWS machine as a CSV of username, gameid, and rating.
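The wait-then-capture loop looked roughly like the following sketch (the CSS selector and page URL are placeholders, and newer Selenium releases have since dropped PhantomJS support):

```python
# Sketch: render a ratings page headlessly and wait for the JS content.
# The selector and URL are placeholders; this reflects the tooling available
# at the time, since PhantomJS support was later removed from Selenium.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("https://boardgamegeek.com/boardgame/<gameid>/<slug>/ratings")  # placeholder URL

# block until the ratings rows have rendered (up to 10 seconds)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.rating-row"))  # placeholder selector
)
html = driver.page_source  # handed off to Scrapy selectors / XPath extraction
driver.quit()
```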
Unfortunately, technical difficulties led to numerous setbacks in this approach. Reading through online forums revealed many reports of execution errors when using PhantomJS at scale, and while I modified my code repeatedly to try to account for these issues, I found that after scraping somewhere between 200 and 900 pages on an instance, my code would time out or proceed so slowly as to not warrant further execution. It is possible that the error was server side; while I had built ample wait times between requests into my code (and the site claims to support 2 requests per IP per second), I may have been throttled by the site for perceived abuse. After around 36 hours of reinitiating and troubleshooting this method, I had only scraped around 800k ratings, somewhat sporadically spread across the games (each instance had been targeted to collect around 1.2 million total ratings starting at a different point in the board game list). Luckily, I had managed to capture 136157 unique usernames, with which I turned to the XML API to access all ratings by a user. But first, it was time to set up PostgreSQL.
I decided to set up PostgreSQL on AWS with Docker to allow for a persistent database that could be quickly replicated should more space be needed or errors occur. I defined a persistent volume on my AWS instance so that databases would be maintained if I needed to rebuild my Docker container. I used SFTP to transfer the CSV files I had captured thus far to AWS so that they could be piped into the PostgreSQL environment through a custom Dockerfile. This allowed me to use a server-side COPY command to quickly generate my tables.
While some modifications and additions were made to tables later during data cleaning and alignment for fast matrix calculations using numpy, the basic framework for my tables is as follows:
boardgames:
gid INT PRIMARY KEY,
gname TEXT,
glink TEXT,
yearpublished INT,
minplaytime INT,
maxplaytime INT,
age INT,
minplayers INT,
maxplayers INT,
avgrating DOUBLE PRECISION,
bayesrating DOUBLE PRECISION,
complexity DOUBLE PRECISION,
numratings INT,
ratingstd DOUBLE PRECISION
users:
uid SERIAL PRIMARY KEY,
uname TEXT NOT NULL UNIQUE
ratings:
uid INT,
gid INT,
rating DOUBLE PRECISION NOT NULL,
PRIMARY KEY(uid, gid)
feature_lookup:
fid INT PRIMARY KEY,
fname TEXT,
fclass TEXT
features:
gid INT,
fid INT,
PRIMARY KEY(gid, fid)
In addition, a temporary table was created to insert all captured usernames (spread across many CSV files) and remove duplicates before generating the unique numeric uid for fast lookup, indexing, and iteration. It was during this process that I discovered that at some early stage of the site, commas had been legal characters in usernames, and a rather active user named "Young, rich, and sexy" needed to be manually removed from many of the files prior to upload (which I accomplished using vim with shortcuts for find and delete line).
Database management for the remainder of the project was handled server side using psql, while data were transferred to and from Python using psycopg2.
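For illustration, the de-duplication step can be expressed from the Python side roughly as follows (the project itself ran this through psql; the staging table and file names are assumptions):

```python
# Sketch: load raw usernames into a temp table, then insert distinct names
# into 'users' so its SERIAL uid is assigned once per unique name.
# Staging table and file name are assumptions; ON CONFLICT needs PostgreSQL 9.5+.
import psycopg2

conn = psycopg2.connect(dbname="bgg", user="postgres", host="localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE raw_users (uname TEXT);")
    with open("usernames_part1.csv") as f:  # one of the many scraped CSVs
        cur.copy_expert("COPY raw_users FROM STDIN WITH CSV", f)
    cur.execute("""
        INSERT INTO users (uname)
        SELECT DISTINCT uname FROM raw_users
        ON CONFLICT (uname) DO NOTHING;
    """)
```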
The BGG XML API generates user collection files offline to prevent server overload. Once files are generated, they are accessible for a week, unless a user modifies their collection. The API accepts one request per IP address every 5 seconds. In order to ensure that my data were collected in a timely fashion, I set up 8 AWS instances to manage this work. 4 instances were dedicated to requesting a new user collection every 5 seconds, which queued the generation of the XML document. 4 instances were dedicated to retrieving the generated XML documents, capturing the gameids and ratings using etree, and bulk inserting these data into the 'ratings' table. I found that XML files were not necessarily generated in the order they were added to the queue, which led to some slowdown in my data acquisition, but in the end users were collected at a rate of between around 100 and 3000 per hour, and all users had been captured after about 72 hours of execution time. A very small number of users were lost due to server-side corruption during XML file generation, but 136109 users were successfully queried.
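A minimal sketch of the collection fetch, collapsing the request/retrieve split into a single loop and assuming the XMLAPI2 collection endpoint's 202-while-queued behavior:

```python
# Sketch: request a user's rated collection and poll until BGG has generated
# the file (the API returns HTTP 202 while the document is still queued).
# Endpoint, parameters, and tag layout follow XMLAPI2 and are assumptions.
import time
import requests
import xml.etree.ElementTree as ET

def fetch_ratings(username):
    url = "https://boardgamegeek.com/xmlapi2/collection"
    params = {"username": username, "rated": 1, "stats": 1}
    while True:
        resp = requests.get(url, params=params, timeout=60)
        if resp.status_code == 200:  # document is ready
            break
        time.sleep(5)  # still queued; also respects the 1-request-per-5s limit
    for item in ET.fromstring(resp.content).findall("item"):
        node = item.find("stats/rating")
        if node is not None and node.get("value") != "N/A":
            yield int(item.get("objectid")), float(node.get("value"))
```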
During acquisition of the user ratings, I was forced to drop the unique constraint on the 'gid' column in the ratings table due to what I perceived as a data generation error that was providing some users with multiple ratings for a single gameid. Upon further investigation, I found that in some instances the site had not assigned unique ids to minor expansions (namely promotional cards), but users had provided ratings for these items in addition to the game associated with the same gameid, which amounted to 37373 ratings on 18186 user-gameid pairs (around .34% of the total data). While the way the data was captured does not easily allow me to identify which rating corresponds to the main game, examining the means of the duplicate ratings grouped by user and gameid reveals that 17074 have means >= 5.5 (the midpoint of the 1-10 rating scale). In order not to lose too many data points from users who favorably rated games and their expansions (some users were represented multiple times within the list), I chose to drop the duplicate entries while keeping the mean rating for those user-game pairs that demonstrated a preference. This results in 10937628 unique user-game ratings without dropping any users. 2023 of these users had only 1 rating in the dataset, and thus were dropped as no comparisons could be made.
During EDA on these 134086 users, the following additional data were dropped:
- 16 ratings had scores of less than 1.
- 45 users gave no ratings above 5.5 (but had combined for 1243 total ratings).
- 41 users had an average rating of <= 3.5 (but had combined for 7777 total ratings).
This left 10926569 total ratings for 134000 users.
During subsequent modeling, the overrepresentation of board game expansions and reimplementations was noticeable. While a filter was applied to reduce their impact on the final recommendations, an extensive manual investigation of the data will be necessary to programmatically remove all of this bias.
Number of ratings per game:
<100: 152
>=100 & <250: 4301
>=250 & <500: 2069
>=500 & <1000: 1444
>=1000 & <5000: 1603
>=5000 & <10000: 245
>=10000 & <50000: 181
>=50000: 5
Number of ratings per user:
<5: 5215
>=5 & <10: 10192
>=10 & <20: 20474
>=20 & <50: 40221
>=50 & <100: 27778
>=100 & <250: 21378
>=250 & <500: 6458
>=500 & <1000: 2024
>=1000 & <2500: 335
>=2500: 11
EDA of game features showed some irregularities and missing data:
- Complexity ratings of 0 for 53 games (most appear to be expansions/promos). 357 games have complexity ratings of 2. Because complexity ratings are generated by votes from the site community and should fall in the range 1-5, these 0 values were reassigned a value of 2 to enable the use of complexity in future modeling.
- Suggested player age was not provided for 539 games
- Max play time was not provided for 418 games
- 41 games had a published year <= 0 (these may be roughly accurate for very old games)
- 98 games have no listed max number of players
Some observed trends (all correlations are Pearson):
- Average overall game rating positively correlates with number of ratings at .1577
- User average rating correlates negatively with number of ratings given at -.2994
- Max rating given by a user correlates positively with number of ratings at .1802
- Min rating given by a user correlates negatively with number of ratings at -.4080
- Age and complexity correlate positively at .3000
- Age and average rating correlate positively at .1876
- Complexity and average rating correlate at .5387
Users who rate more games tend to use a wider range of the rating scale, whereas users with few ratings may only rate games they like and rate them all positively. More popular games tend to have higher ratings overall. Users of the site tend to rate complex games positively.
Because of the size of the data, I chose to develop models using a toy subset of the data. Initially, I chose a very small set of the top 500 rated games and 1000 users (making sure to include user-rating pairs for each game). However, the limited scope of this set proved insufficient for the models being used, so I scaled up to include all games with more than 500 ratings (3476) and a collection of 5000 users with between 50 and 100 ratings. This subset comprises 349395 ratings (3.2% of all data). All components of the recommendation system were tuned on this subset before scaling up. These computations easily ran locally on a MacBook Air with 4GB of RAM.
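The subset can be pulled with a single query along the lines of the following sketch (the thresholds follow the description above; the exact selection logic I used may have differed):

```python
# Sketch: pull the development subset straight from PostgreSQL.
# Connection details and the exact sampling of 5000 users are illustrative.
import psycopg2

conn = psycopg2.connect(dbname="bgg", user="postgres", host="localhost")
with conn.cursor() as cur:
    cur.execute("""
        SELECT r.uid, r.gid, r.rating
        FROM ratings r
        JOIN (SELECT gid FROM ratings
              GROUP BY gid HAVING COUNT(*) > 500) g USING (gid)
        JOIN (SELECT uid FROM ratings
              GROUP BY uid HAVING COUNT(*) BETWEEN 50 AND 100
              ORDER BY random() LIMIT 5000) u USING (uid);
    """)
    toy_ratings = cur.fetchall()
```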
For the final system, I decided to select all users with >= 10 ratings, yielding 10838895 ratings. I converted these data into a numpy matrix with 0 representing all missing values. This matrix was 9.5GB in size, and all operations on the full matrix required a t2.2xlarge instance on AWS. By precomputing relationships, I was ultimately able to reduce the total working memory requirements to around 2.5GB. Because of the size of the data, all operations were conducted in numpy.
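A minimal sketch of building the dense matrix, assuming uid and gid had been re-indexed to consecutive 0-based positions during the cleaning and alignment mentioned earlier:

```python
# Sketch: build the dense users x games matrix, with 0 marking missing
# ratings. Assumes uid/gid were re-indexed to consecutive 0-based positions.
import numpy as np
import psycopg2

conn = psycopg2.connect(dbname="bgg", user="postgres", host="localhost")
with conn.cursor() as cur:
    cur.execute("SELECT uid, gid, rating FROM ratings;")
    triples = cur.fetchall()

n_users = max(t[0] for t in triples) + 1
n_games = max(t[1] for t in triples) + 1
R = np.zeros((n_users, n_games))  # ~9.5GB at full scale
for uid, gid, rating in triples:
    R[uid, gid] = rating
```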
Note: for this discussion, 'features' refers specifically to the tags appended to games described earlier (subdomain, category, mechanic, and family).
A total of 1890 distinct features were scraped across the games. After selecting only features present in 10 or more games, this space was reduced to 453. After one-hot encoding these features for each game, TruncatedSVD was performed with n_components=80 to create a series of vectors that quantify games in explainable ways, with interpretable feature loadings based on informative site-provided tags. I precomputed cosine similarity between all games and stored this as a numpy matrix for quick lookup. This computation took around 15 seconds. The resulting 10k x 10k matrix takes 800MB of working memory.
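A minimal sketch of this step with scikit-learn, where feature_matrix stands in for the one-hot games-by-features array described above:

```python
# Sketch: reduce the 453-dimensional one-hot feature space to 80 components
# and precompute game-to-game cosine similarity.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# feature_matrix: shape (n_games, 453), 1 where a game carries a given tag
feature_matrix = np.load("feature_onehot.npy")          # illustrative file name

svd = TruncatedSVD(n_components=80, random_state=0)
game_vectors = svd.fit_transform(feature_matrix)        # (n_games, 80)

feature_sim = cosine_similarity(game_vectors)           # (n_games, n_games)
np.save("feature_sim.npy", feature_sim)                 # ~800MB for 10k games
```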
I chose to focus on item-item relationships for my recommender as it provides an efficient means to access relationships between games with a limited amount of input from a new user to the system. All processes were planned to require minimal online computation for fast retrieval of recommendations and optimized working memory usage.
For the current implementation, I chose not to remove user, item, or overall bias prior to calculating my correlation matrix. With additional time, I plan to study how working with these biases, as well as possibly only focusing on positive reviews, may impact the quality of recommendations.
I chose Pearson's correlation as the similarity metric between ratings. I implemented it so that correlations were only computed between users who had rated both games being compared, and then applied a scaling penalty to reduce the influence of rarely-paired games. This was the most substantial computation of the project and required over 2 hours (I will update this figure when I rerun with different parameters, but the computation time caught me off guard and the job was left to run overnight). The resultant 10k x 10k matrix takes 800MB of working memory.
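The following sketch shows the unvectorized form of this computation; the n/(n + 50) significance weighting is an illustrative stand-in for the scaling penalty rather than the exact form I used:

```python
# Sketch (unvectorized): Pearson correlation over co-raters only, shrunk by
# n / (n + shrink) so that rarely co-rated pairs contribute less.
import numpy as np

def item_item_pearson(R, shrink=50.0):
    """R: users x games matrix with 0 where unrated."""
    n_games = R.shape[1]
    rated = R > 0
    sim = np.zeros((n_games, n_games))
    for i in range(n_games):
        for j in range(i, n_games):
            both = rated[:, i] & rated[:, j]   # users who rated both games
            n = int(both.sum())
            if n < 2:
                continue
            x, y = R[both, i], R[both, j]
            if x.std() == 0 or y.std() == 0:
                continue                        # correlation undefined
            r = np.corrcoef(x, y)[0, 1] * n / (n + shrink)
            sim[i, j] = sim[j, i] = r
    return sim
```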
My final recommendation function allows users to express interest in any number of board games and receive the top recommended games rank ordered by their predicted level of enjoyment. As mentioned previously, there is a great amount of noise left in the dataset due to its saturation with game expansions. I computed another 10k x 10k matrix (taking another 800MB of memory) to filter through some of these 8000 interactions, but unfortunately the data was often entered in a way that did not assign parent-child relationships between items, leaving many clear expansions without links to their parent game. The scale of this saturation did not become apparent until looking at the full dataset, and I did not have the projected 20+ hours to complete this task in time.
However, I was still able to build out a model that I feel is successful at making accurate, personalized recommendations with minimal processing time. My model approaches the data through two avenues in parallel: the top 100 most similar games for each rated item are thrown into two bags of items, one based on item-item correlation and the other based on feature cosine similarity. Each bag is then passed to the other similarity matrix, where similarity scores between rated items and bagged items are summed and multiplied by the average rating for each game. These item scores are then scaled, summed, and rank ordered (currently using a 70/30 weighting in favor of the items pulled by feature similarity and then scored by item-item correlation).
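Concretely, the scoring works roughly like the sketch below (the per-bag scaling and neighborhood handling are simplified stand-ins for my implementation):

```python
# Sketch: two-bag scoring. Each bag is pulled from one similarity matrix,
# scored against the other, scaled within the bag, then blended 70/30.
import numpy as np

def recommend(liked_gids, corr_sim, feat_sim, avg_rating,
              n_bag=100, w_feat=0.7, top_n=10):
    """liked_gids: list of game indices the user expressed interest in."""
    def bag_and_score(pull_sim, score_sim):
        bag = set()
        for g in liked_gids:                      # top-100 neighbors per item
            bag.update(np.argsort(pull_sim[g])[::-1][:n_bag])
        bag -= set(liked_gids)
        scores = {c: score_sim[liked_gids, c].sum() * avg_rating[c] for c in bag}
        total = sum(abs(v) for v in scores.values()) or 1.0
        return {c: v / total for c, v in scores.items()}   # scale within the bag

    feat_scores = bag_and_score(feat_sim, corr_sim)   # pulled by features
    corr_scores = bag_and_score(corr_sim, feat_sim)   # pulled by correlation
    candidates = set(feat_scores) | set(corr_scores)
    combined = {c: w_feat * feat_scores.get(c, 0.0)
                   + (1 - w_feat) * corr_scores.get(c, 0.0) for c in candidates}
    return sorted(combined, key=combined.get, reverse=True)[:top_n]
```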
I hope to fine-tune my prediction metrics once I am able to better reduce the presence of expansions, but as they are presently overrepresented in both the item-item correlation and feature similarity matrices, I see limited paths forward without a significant investment of manual labor.
However, I am aware that my prediction metric is rather blunt, and that there are likely more nuanced ways to integrate the multiple relationships that I've modeled. Specifically, I am interested in exploring user-item recommendations through decomposition of user ratings, and in factoring in complexity differences to further tune my model. I plan to continue exploring these models as I work to productionize my product.
Because of the speed of prediction and the limited amount of operating memory, it is realistic for me to work towards an online implementation of this system. I hope to spend much of the next week implementing a chatbot that will provide users with thumbnail images, descriptions, and links to the recommended games.