News sites and other media outlets often show lists of popular or trending topics. While commercial sites surely have many sources of such information and run a battery of statistical analyses, I started to wonder how far you could get with just the free Twitter Sample Stream and some basic frequency data. As you can see below, even simple statistics gathered over a small amount of data give surprisingly good results.
I started by collecting a small number of tweets over the course of two days around Thanksgiving, in the hope that I might be able to detect holiday-specific topics. I collected 2,000 random tweets per hour for the 48 hours spanning Wednesday, November 21st, and Thanksgiving Thursday. I took the tweets from the Twitter Sample Stream, keeping only those tagged as English. As I found earlier, tweets tagged as English are often not actually in English. Using the language filtering technique I developed, I dropped the bottom 50% of the tweets, leaving a relatively clean set of 1,000 x 48 (48,000) English tweets to work with.
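The "drop the bottom 50%" step can be sketched as a sort-and-truncate over one hour's batch. The language filtering technique itself is not described in this post, so the `ascii_ratio` scorer below is only a hypothetical stand-in, and the names `keep_top_half` and `score` are assumptions made for illustration.

```python
def ascii_ratio(text):
    """Placeholder score: share of characters that are ASCII letters or whitespace.
    This is NOT the language filter referred to above, just a stand-in."""
    if not text:
        return 0.0
    good = sum(1 for c in text if c.isascii() and (c.isalpha() or c.isspace()))
    return good / len(text)

def keep_top_half(tweets, score=ascii_ratio):
    """Rank one batch of tweets by score and keep the best-scoring half."""
    ranked = sorted(tweets, key=score, reverse=True)
    return ranked[: len(ranked) // 2]

hour_batch = ["happy thanksgiving", "ありがとう", "turkey time", "спасибо"]
clean_batch = keep_top_half(hour_batch)
```

Applied to each hourly batch of 2,000 tweets, a filter of this shape would leave 1,000 tweets per hour, whatever scoring function is plugged in.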
The most obvious thing to start with was hash tag frequencies. Looking at the list of most common hash tags on Thanksgiving day, it was clear which holiday it was. The top of the list was, in fact, #thanksgiving, followed by other expected topics like #blackfriday. However, the list was peppered with tags that are perennially common, like #teamfollowback and #oomf, and that do not represent anything specific to the day.
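Counting hash tags for a day can be sketched in a few lines. This is a minimal illustration rather than the code used for the charts here: the regular expression is a simplifying assumption (Twitter's real hash tag rules are more involved), and `hashtag_counts` and `sample_tweets` are made-up names and data.

```python
import re
from collections import Counter

# Assumption: a hash tag is '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

def hashtag_counts(tweets):
    """Count lowercased hash tags across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

sample_tweets = [
    "Happy #Thanksgiving everyone!",
    "Turkey time #thanksgiving #family",
    "follow me #teamfollowback",
]
day_counts = hashtag_counts(sample_tweets)
most_common_tag = day_counts.most_common(1)
```

Lowercasing before counting means #Thanksgiving and #thanksgiving are treated as the same tag.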
Hash Tag Frequencies On Thanksgiving Day
To help combat the #teamfollowback problem, I tried comparing hash tag frequencies to the previous day. I ordered the list by the greatest one-day jump in frequency rank. This helped remove the perennially common tags and did bring in a little more breadth, like the Patriots football game. However, because of data sparsity, the list was now polluted by relatively uncommon terms that happened to see a small spike in activity.
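Ordering by one-day jump in frequency rank might look roughly like the sketch below. The handling of terms that did not appear the previous day (ranked just below yesterday's last term) and the alphabetical tie-breaking are assumptions, as are the function names and the toy counts.

```python
from collections import Counter

def rank_table(counts):
    """Map each term to its frequency rank (0 = most frequent).
    Ties are broken alphabetically, which is an arbitrary choice."""
    ordered = sorted(counts, key=lambda term: (-counts[term], term))
    return {term: rank for rank, term in enumerate(ordered)}

def biggest_rank_jumps(today, yesterday, top_n=10):
    """Order today's terms by how many rank positions they climbed since yesterday."""
    today_ranks = rank_table(today)
    yesterday_ranks = rank_table(yesterday)
    # Assumption: a term unseen yesterday sits just below yesterday's last rank.
    unseen_rank = len(yesterday_ranks)
    jumps = {
        term: yesterday_ranks.get(term, unseen_rank) - rank
        for term, rank in today_ranks.items()
    }
    return sorted(jumps, key=lambda term: (-jumps[term], term))[:top_n]

yesterday = Counter({"#teamfollowback": 50, "#oomf": 40, "#thanksgiving": 5})
today = Counter({"#thanksgiving": 60, "#teamfollowback": 48, "#oomf": 35, "#patriots": 10})
trending = biggest_rank_jumps(today, yesterday, top_n=2)
```

The same function works on any pair of count tables, so it applies to raw word counts as well as hash tags. With sparse data, many rare terms share the same low counts, so their ranks are close to arbitrary and a handful of extra mentions can produce a large apparent jump.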
Hash Tag Frequency Change From Previous Day
Since I was using such a small number of tweets, there was not enough volume of hash tags to get good results on previous-day comparisons. Using raw word frequencies avoids that problem and makes the solution more general. The obvious issue there was that the top frequency words were all exceedingly common and had no holiday-specific information.
Word Frequencies on Thanksgiving Day
Ordering based on comparing raw word frequencies to the previous day got more traction than the same operation on hash tags and showed the best results so far. The topics were not only relevant but broad, including turkey, Black Friday, shopping, Cowboys football, etc.
Word Frequency Change From Previous Day
So, with some very simple techniques and a relatively small amount of data, I was able to get a good sense of aggregate activity on Twitter. There are many other ways to improve the system from here. One glaring problem is the whitespace tokenization of the words, which considers “thanksgiving!” and “thanksgiving” to be two different words. Detection of multi-word topics like “Black Friday” is also something that has not been addressed at all. Furthermore, there are many cases of inconsistent spellings and word formulations like #iamthankfulfor vs #imthankfulfor that could potentially be normalized into a common form.
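As a first step toward the tokenization fix, punctuation can be stripped from the ends of each whitespace-separated token before counting. This is only a sketch under simple assumptions: it handles ASCII punctuation only, keeps a leading # so hash tags stay distinct from plain words, and does nothing about multi-word topics or spelling variants. The `tokenize` name is made up for illustration.

```python
import string
from collections import Counter

def tokenize(text):
    """Split on whitespace, lowercase, and strip ASCII punctuation from both ends.
    A leading '#' is restored so hash tags remain separate from plain words."""
    tokens = []
    for raw in text.lower().split():
        is_hashtag = raw.startswith("#")
        word = raw.strip(string.punctuation)
        if word:
            tokens.append("#" + word if is_hashtag else word)
    return tokens

tokens = tokenize("Happy Thanksgiving! thanksgiving, #Thanksgiving!")
word_counts = Counter(tokens)
```

With this change, "thanksgiving!" and "thanksgiving" count as the same word, while internal punctuation such as the apostrophe in "don't" is left alone.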