Sunday, July 23, 2017

How not to make millions predicting cryptocurrency prices with social media sentiment scores


Sometimes hobbies can teach you basic survival skills.  In a world where machines are likely to replace many of today’s jobs, I figured that rolling up my sleeves and jumping into deep learning during my recent summer vacation would be both fun and useful.  In any case, I find it helpful to understand the work that I will eventually need to manage in my career.  Since this is a passion and hobby blog, here are my musings on machine learning and deep learning.


I started with a business problem that I wanted to solve.  I’ve always been curious about cryptocurrencies and hypothesized that the market could be more open than the traditional stock market: a prime opportunity for data to help predict prices.  After all, investment firms have long hired the best mathematicians and computer scientists to model trade data in traditional markets.  With only a few days to dedicate to this project, I knew I wasn’t going to come up with anything too groundbreaking.  As an avid Redditor, I often wonder:  What if the information contained in Reddit comments could serve as features to predict the prices of cryptocurrencies like bitcoin, litecoin and ethereum?


First off, I delved into the Machine Learning for Trading course by Georgia Tech, delivered through Udacity.  I really love MOOCs and wish they had existed back when I was in high school and university.  This MOOC was very informative and the programming language used, Python, is my second favourite (after Groovy).  The course’s concepts for making stock market trading decisions with probabilistic machine learning approaches translate well to cryptocurrency trading decisions.  I also decided to use pythonanywhere.com as my cloud IDE.  It’s a cheap, clean and efficient dev environment with great technical support.


Next, I needed data and lots of it!  This was definitely the most challenging step.  Finding cryptocurrency time series data that is both highly granular and free took quite a bit of googling.  I eventually settled on writing a little script to access the API from cryptocompare.com.  Using the pandas library in Python, I was able to save dataframe data into CSV files.  For the social media information, I used the popular PRAW (Python Reddit API Wrapper) library to pull the top 25 comments from the top 25 posts of the top 25 subreddits.  I saved each comment’s upvote score into a table and used TextBlob to get the sentiment score of the comment itself.  This would be the basis of my main feature to help predict the currency prices.  On a side note, I also looked at using the Watson API from IBM’s Bluemix.  It provided a lot more information on the emotions found in the Reddit comments.  However, the API call limits meant that there would be a cost associated with obtaining the sentiment for each of the thousands of Reddit comments.  Needless to say, my decision to use TextBlob was a no-brainer.
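
For readers who want to try something similar, here is a minimal sketch of the comment table described above. It is not my original script. The function names and the sample comments are made up for illustration, and the dictionaries simply mimic the body, score and created_utc attributes that PRAW exposes on a comment. The scoring function is pluggable; the default one uses TextBlob's polarity, which ranges from -1 (negative) to 1 (positive).

```python
import pandas as pd


def textblob_polarity(text):
    """Return TextBlob's polarity score for a piece of text, in [-1, 1]."""
    # Imported lazily so the table builder also works with another scorer.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def build_comment_table(comments, scorer=textblob_polarity):
    """Turn comment records into a dataframe of time, upvotes and sentiment.

    `comments` is an iterable of dicts with the keys "body", "score" and
    "created_utc" (seconds since the Unix epoch). This layout is an
    assumption that mirrors the attributes of a PRAW comment.
    """
    rows = [
        {
            "time": pd.to_datetime(comment["created_utc"], unit="s"),
            "upvotes": comment["score"],
            "sentiment": scorer(comment["body"]),
        }
        for comment in comments
    ]
    return pd.DataFrame(rows, columns=["time", "upvotes", "sentiment"])


# Hypothetical sample data, not real Reddit comments.
sample_comments = [
    {"body": "Bitcoin is amazing, great day!", "score": 120, "created_utc": 1498867200},
    {"body": "This crash is terrible.", "score": 3, "created_utc": 1498870800},
]
```

Calling build_comment_table(sample_comments) scores each comment with TextBlob, and the resulting dataframe can be written out with its to_csv method, just like the price data.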


With the Reddit comments data and associated sentiment scores, I created a time series table (dataframe) that lined up each comment’s timestamp with the matching price timestamp.  A little googling and Stack Overflow later, I had my data ready for analysis.
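
One way to line up comments with prices in pandas is sketched below. This is an illustration rather than my original code. It assumes hourly price rows, rounds each comment down to the start of its hour, averages the sentiment per hour, and treats hours without comments as neutral (0.0). The column names, the neutral fill and the sample numbers are all assumptions.

```python
import pandas as pd


def align_sentiment_to_prices(prices, comments, freq="60min"):
    """Attach the mean comment sentiment of each time bucket to its price row.

    `prices` needs a datetime column "time" with one row per bucket.
    `comments` needs a datetime column "time" and a numeric "sentiment".
    """
    # Round every comment down to the start of its bucket.
    bucketed = comments.assign(time=comments["time"].dt.floor(freq))
    per_bucket = bucketed.groupby("time", as_index=False).agg(
        sentiment=("sentiment", "mean"),
        n_comments=("sentiment", "size"),
    )
    # Keep every price row, even when no comment fell into its bucket.
    merged = prices.merge(per_bucket, on="time", how="left")
    merged["sentiment"] = merged["sentiment"].fillna(0.0)  # assumed neutral
    merged["n_comments"] = merged["n_comments"].fillna(0).astype(int)
    return merged.sort_values("time").reset_index(drop=True)


# Hypothetical sample data for illustration only.
prices = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2017-07-01 00:00", "2017-07-01 01:00", "2017-07-01 02:00"]
        ),
        "close": [2400.0, 2410.0, 2390.0],
    }
)
comments = pd.DataFrame(
    {
        "time": pd.to_datetime(
            ["2017-07-01 00:15", "2017-07-01 00:45", "2017-07-01 02:30"]
        ),
        "sentiment": [0.5, -0.1, 0.8],
    }
)
aligned = align_sentiment_to_prices(prices, comments)
```

Averaging per hour is only one choice. Weighting each comment by its upvote score, or lagging the sentiment by an hour or more before merging, would be equally reasonable variations to test.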


Now the fun part could begin.  Using Google’s open-source TensorFlow library, I created a neural net model that took some of the cryptocurrency prices and the sentiment scores as inputs and produced the price of a single cryptocurrency as output.  Luckily, the fine folks who created Keras made it very easy to create neural net models from data.  Keras truly lowers the barrier to entry into deep learning.  Combined with Dr. Jason Brownlee’s tutorials, anyone with basic programming experience can start experimenting with deep learning.
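
To give a feel for how little code Keras needs, here is a minimal sketch of a small regression network of the kind described above. It is not the exact model I trained. The layer sizes, the optimizer, the number of epochs and the four input features are assumptions, and the training data is random noise standing in for real prices and sentiment scores, so the predictions are meaningless.

```python
import numpy as np
from tensorflow import keras


def build_price_model(n_features):
    """Small feed-forward network mapping a feature row to one price."""
    model = keras.Sequential(
        [
            keras.Input(shape=(n_features,)),
            keras.layers.Dense(16, activation="relu"),
            keras.layers.Dense(8, activation="relu"),
            keras.layers.Dense(1),  # linear output for a regression target
        ]
    )
    model.compile(optimizer="adam", loss="mse")
    return model


# Placeholder data: 200 rows of 4 made-up features (for example three coin
# prices and one sentiment score) and one made-up target price per row.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)).astype("float32")
y = rng.normal(size=(200, 1)).astype("float32")

model = build_price_model(X.shape[1])
history = model.fit(
    X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0
)
predictions = model.predict(X[:5], verbose=0)
```

With real data, the features would usually be scaled first, and the validation rows should come from the end of the time series rather than a random slice, so the model is never evaluated on prices that precede its training data.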


Unfortunately, I didn’t make millions (of course) with my predictive model.  My hypothesis that the “internet’s” sentiment of positivity vs. negativity, derived from Reddit comments, was a predictor of bitcoin, litecoin or ethereum prices was not supported by the models generated from my data.  I did learn first-hand that data are really king and predictive algorithms are a dime a dozen.  Eventually, I remembered to leave my home office to enjoy the warm summer weather!  Maybe next year, I’ll try to get higher quality data with a more granular analysis of emotions from social media, combined with longer time series.  Or, you never know, a machine may be writing my next blog article...