Online brokerages disrupted trading on the Wall Street floor. More recently, the popularity of the Robinhood platform among retail investors forced traditional brokers to offer free trades. Robo-advisors are now attempting to disrupt the financial advisor role by using algorithms to choose investments, letting people invest passively.
Imagine applying the wisdom of the Main Street crowd to investing strategy, getting ahead of Wall Street competitors in spotting what is trending, all while saving the cost of a team of expensive analysts. We will focus on two subreddits, r/investing and r/wallstreetbets, using Natural Language Processing to classify text within the corpora in order to form an investing strategy.
I chose to use the VADER Sentiment Intensity Analyzer from nltk to check the overall tone of each subreddit:
Judging by the chart above and the top 5 most positive and most negative posts from each subreddit, both have a positive tone overall. However, r/wallstreetbets has high negative scores alongside its high positive scores, which is an indicator of the higher risk involved in the strategies used there.
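The tone check described above could be sketched roughly as follows. `SentimentIntensityAnalyzer` and `polarity_scores` are nltk's own names; the helpers `score_posts` and `mean_scores`, and the idea of holding each subreddit's posts in a plain list of strings, are assumptions made for illustration rather than the project's actual code.

```python
from statistics import mean


def score_posts(posts):
    """Score each post with VADER; returns one dict per post with the
    keys 'neg', 'neu', 'pos' and 'compound'."""
    # Imported lazily so the averaging helper below works without nltk.
    # The lexicon has to be fetched once: nltk.download("vader_lexicon")
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(post) for post in posts]


def mean_scores(score_dicts):
    """Average each VADER score across one subreddit's posts."""
    keys = ("neg", "neu", "pos", "compound")
    return {key: mean(d[key] for d in score_dicts) for key in keys}
```

Calling `mean_scores(score_posts(posts))` once per subreddit gives four averages that can be compared side by side; sorting the per-post dicts by `compound` is one way to pull out the most positive and most negative posts.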
I then homed in on the r/investing subreddit:
Overall, the r/investing subreddit is fond of passive investing strategies using index funds, especially those from Vanguard. This subreddit is geared more toward beginners looking for less risky investments while still being involved in the stock market.
I then homed in on the r/wallstreetbets subreddit:
Overall, the r/wallstreetbets subreddit is fond of active options trading and placing huge bets in order to maximize return. This subreddit is geared more toward people who know just enough to be "dangerous."
I established a baseline accuracy of 50%, which was a given because of how the scraped data was combined.
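A baseline like this is the accuracy of always guessing the most common class. Here is a minimal sketch, assuming the combined data holds the same number of posts from each subreddit; the count of 1000 per class is made up for illustration.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Hypothetical balanced label column: 0 = r/investing, 1 = r/wallstreetbets
labels = [0] * 1000 + [1] * 1000
print(baseline_accuracy(labels))  # 0.5
```

Any model worth keeping has to beat this number.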
I decided from the start that the metric I wanted to optimize was recall. I chose recall because someone with a lower risk tolerance who is placed in r/wallstreetbets-based investments will not be happy with the volatility.
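For reference, recall is the share of items truly in a class that the model also assigns to that class. The sketch below takes the class of interest as a parameter, since which class counts as "positive" is a modeling choice; the labels are toy values, not project results.

```python
def recall(y_true, y_pred, positive):
    """Share of truly `positive` items that were also predicted `positive`."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    return true_pos / (true_pos + false_neg)


# Toy labels: 0 = r/investing, 1 = r/wallstreetbets
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

print(recall(y_true, y_pred, positive=0))  # 0.75
print(recall(y_true, y_pred, positive=1))  # 0.5
```

In this toy case, one of the four true r/investing documents was sent to r/wallstreetbets, which is exactly the kind of mistake a cautious investor would notice.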
I used TF-IDF with stop words set to English to vectorize the text of each document. I then added the sentiment scores to the vectorized dataframe and fit Logistic Regression, Multinomial Naive Bayes, and Kernel Support Vector Machine classifier models.
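As a rough sketch of that pipeline in scikit-learn: the documents, labels, and sentiment numbers below are toy values, and the sketch keeps the features in a sparse matrix via scipy's `hstack` rather than a dataframe, which is a substitution for brevity and not necessarily how the project did it.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels (0 = r/investing, 1 = r/wallstreetbets)
docs = [
    "index funds and a long term retirement plan",
    "low cost index fund with steady dividends",
    "yolo weekly call options on earnings",
    "huge options bet before earnings tomorrow",
]
labels = [0, 0, 1, 1]

# Made-up VADER-style columns per document: neg, neu, pos
sentiment = np.array([
    [0.00, 0.80, 0.20],
    [0.05, 0.75, 0.20],
    [0.30, 0.30, 0.40],
    [0.25, 0.35, 0.40],
])

# Vectorize the text, dropping English stop words
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Append the sentiment columns to the right of the TF-IDF columns
features = hstack([tfidf, csr_matrix(sentiment)]).tocsr()

logreg = LogisticRegression()
logreg.fit(features, labels)
```

The same `features` matrix could be passed to `MultinomialNB` or `SVC` for comparison. The compound score is left out of this sketch because it can be negative, and `MultinomialNB` does not accept negative feature values.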
I chose Logistic Regression as my final model due to its interpretability, given that recall scores were relatively similar across the board.
For each one-unit increase in a document's negative sentiment score, its odds of belonging to r/wallstreetbets are multiplied by 1.85. Similarly, each one-unit increase in positive sentiment multiplies those odds by 2.65. This is because r/wallstreetbets as a whole is based on a much more volatile strategy than r/investing. For each one-unit increase in neutrality, the odds of belonging to r/wallstreetbets are only 0.5 times as large.
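These multipliers read most naturally as exponentiated logistic regression coefficients, which act on the odds rather than directly on the probability. The sketch below assumes that reading, recovers the implied coefficients from the three figures quoted above, and shows what one multiplier does to a document that starts at 50/50. The helper name `shifted_probability` is illustrative.

```python
import math

# Odds multipliers quoted above, keyed by sentiment feature
odds_ratios = {"negative": 1.85, "positive": 2.65, "neutral": 0.5}

# Under the assumption above, each coefficient is the natural log of its multiplier
coefficients = {name: math.log(ratio) for name, ratio in odds_ratios.items()}


def shifted_probability(start_probability, coefficient, units=1.0):
    """Probability of r/wallstreetbets after raising one feature by `units`,
    holding every other feature fixed."""
    odds = start_probability / (1.0 - start_probability)
    new_odds = odds * math.exp(coefficient * units)
    return new_odds / (1.0 + new_odds)


# Starting from a 50/50 document, add one unit of positive sentiment
print(round(shifted_probability(0.5, coefficients["positive"]), 3))  # 0.726
```

So a multiplier of 2.65 moves a 50% document to roughly 73%, not to 2.65 times 50%; the effect on the probability depends on where the document starts.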
I used logreg.predict_proba to extract the model's probabilities that a post belonged to class 0 (r/investing) or class 1 (r/wallstreetbets).
Customers would be placed into investment strategies based on the probabilities that their risk assessment statement would be classified as r/investing or r/wallstreetbets, using these percentages as weights.
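One simple way to turn those probabilities into weights is a proportional split. This is only a sketch of the idea: the `allocate` helper, the strategy names, and the example numbers are hypothetical, and the input is assumed to be one row in the column order that `predict_proba` uses for labels 0 and 1.

```python
def allocate(probabilities, amount):
    """Split `amount` between the two strategies in proportion to the
    class probabilities (order: r/investing, r/wallstreetbets)."""
    p_investing, p_wsb = probabilities
    return {
        "r/investing style": amount * p_investing,
        "r/wallstreetbets style": amount * p_wsb,
    }


# One row of the kind predict_proba returns for a single statement;
# the numbers here are made up.
row = [0.75, 0.25]
print(allocate(row, 10_000))
```

A customer whose statement scores 75% r/investing would have three quarters of their money placed in the passive strategy and the rest in the higher-risk one.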
If you would like to learn more about my process and see the full project and code, please visit the following GitHub link.