Detecting Twitter Trending Topics

News sites and other media outlets often show lists of popular or trending topics. While commercial sites surely have many sources of such information and run a battery of statistical analyses, I started to wonder how far you could get with just the free Twitter sample stream and some basic frequency data. As you can see below, even simple statistics gathered over a small amount of data give surprisingly good results.

I started by collecting a small number of tweets over the course of two days around Thanksgiving, in the hope that I might be able to detect holiday-specific topics. I collected 2000 random tweets per hour for the 48 hours spanning Wednesday, November 21st, and Thanksgiving Thursday. I took the tweets from the Twitter Sample Stream, keeping only those tagged as English. As I found earlier, tweets tagged as English are often not actually in English. Using the language filtering technique I developed, I dropped the bottom 50% of the tweets, leaving a relatively clean set of 1000 English tweets per hour for 48 hours (48,000 in total) to work with.

The most obvious place to start was hash tag frequencies. Looking at the list of most common hash tags on Thanksgiving day, it was clear which holiday it was. The top of the list was, in fact, #thanksgiving, followed by other expected topics like #blackfriday. However, the list was peppered with tags that are perennially common, like #teamfollowback and #oomf, and that do not represent anything specific to the day.

Hash Tag Frequencies On Thanksgiving Day

Rank  Frequency  Tag
1     77         #thanksgiving
2     68         #imthankfulfor
3     66         #happythanksgiving
4     65         #mentionsomeoneyourethankfulfor
5     50         #thankful
6     49         #blackfriday
7     41         #teamfollowback
7     41         #peopleschoice
8     37         #oomf
9     33         #iamthankfulfor
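
Counts like the ones in this table can be produced in a few lines. The following is only a minimal sketch, not the code actually used for this post: the hash tag pattern, the lowercasing step, and the sample tweets are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by letters, digits, or underscores.
HASHTAG_RE = re.compile(r"#\w+")

def hashtag_counts(tweets):
    """Count hash tag occurrences across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        # Lowercase so that #Thanksgiving and #thanksgiving are counted together
        # (an assumption; the original normalization is not described).
        counts.update(tag.lower() for tag in HASHTAG_RE.findall(text))
    return counts

# Made-up sample tweets, for illustration only.
sample = [
    "Happy #Thanksgiving everyone!",
    "So much food #thanksgiving #thankful",
    "Follow me #teamfollowback",
]
counts = hashtag_counts(sample)
top = counts.most_common(2)  # the two most frequent tags with their counts
```

Running the same function over one day of tweets and calling most_common(10) would give a list shaped like the table above.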

To help combat the #teamfollowback problem, I tried comparing hash tag frequencies to the previous day. I ordered the list by the greatest one-day jump in frequency rank. This helped remove the perennially common tags and did bring in a little more breadth, like the Patriots football game. However, because of data sparsity, the list was now polluted by relatively uncommon terms that happened to see a small spike in activity.

Hash Tag Frequency Change From Previous Day

Rank Change  Rank  Frequency  Tag
25           4     65         #mentionsomeoneyourethankfulfor
21           13    21         #beausarmy
19           9     33         #iamthankfulfor
19           2     68         #imthankfulfor
17           17    17         #bad25
17           3     66         #happythanksgiving
16           13    21         #whatareyouthankfulfor
16           5     50         #thankful
15           14    20         #mentionsomeoneyourthankfulfor
12           22    12         #patriots
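
The rank comparison can be sketched roughly as follows. This is an illustration under stated assumptions rather than the original code: tied frequencies share a rank (which is what the tied rows in the tables suggest), and an item that did not appear at all the previous day is treated as ranking one place below that day's last rank. The second choice is purely my assumption, and the example counts are made up.

```python
from collections import Counter

def ranks(counts):
    """Map each item to a 1-based frequency rank; equal frequencies share a rank."""
    result = {}
    rank = 0
    prev_freq = None
    for item, freq in counts.most_common():
        if freq != prev_freq:
            rank += 1
            prev_freq = freq
        result[item] = rank
    return result

def rank_changes(today, yesterday):
    """Return (rank change, rank, frequency, item) rows, biggest jump first."""
    today_ranks = ranks(today)
    yesterday_ranks = ranks(yesterday)
    # Assumption: items unseen yesterday rank one below yesterday's last rank.
    unseen_rank = max(yesterday_ranks.values(), default=0) + 1
    rows = [
        (yesterday_ranks.get(item, unseen_rank) - rank, rank, today[item], item)
        for item, rank in today_ranks.items()
    ]
    rows.sort(key=lambda row: -row[0])  # largest upward move first
    return rows

# Made-up counts, for illustration only.
yesterday = Counter({"#teamfollowback": 40, "#oomf": 35, "#thankful": 5})
today = Counter({"#thanksgiving": 77, "#thankful": 50,
                 "#teamfollowback": 41, "#oomf": 37})
rows = rank_changes(today, yesterday)
```

Nothing here is specific to hash tags; the same two functions work on any Counter of tokens.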

Since I was using such a small number of tweets, there was not enough volume of hash tags to get good results on previous-day comparisons. Using raw word frequencies avoids that problem and makes the solution more general. The obvious issue there was that the highest-frequency words were all exceedingly common and carried no holiday-specific information.

Word Frequencies on Thanksgiving Day

Rank  Frequency  Word
1     7726       i
2     7514       the
3     6799       to
4     5166       you
5     4899       a
6     4031       and
7     3581       my
8     3467       for
9     2972       in
10    2798       is

Ordering based on comparing raw word frequencies to the previous day got more traction than the same operation on hash tags and showed the best results so far. The topics were not only relevant but broad, including turkey, Black Friday, shopping, Cowboys football, etc.

Word Frequency Change From Previous Day

Rank Change  Rank  Frequency  Word
182          43    873        thankful
129          172   183        thanksgiving!
110          121   283        friday
103          105   361        family
101          185   167        shopping
96           138   243        food
83           97    385        black
81           252   83         cowboys
70           168   187        turkey
66           29    1241       thanksgiving

So, with some very simple techniques and a relatively small amount of data, I was able to get a good sense of aggregate activity on Twitter. There are many other ways to improve the system from here. One glaring problem is the whitespace tokenization of the words, which considers “thanksgiving!” and “thanksgiving” to be two different words. Detection of multi-word topics like “Black Friday” is also something that has not been addressed at all. Furthermore, there are many cases of inconsistent spellings and word formulations like #iamthankfulfor vs #imthankfulfor that could potentially be normalized into a common form.
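
As a rough sketch of the first two improvements, the snippet below strips surrounding punctuation before counting, so that “thanksgiving!” and “thanksgiving” collapse into one token, and counts adjacent word pairs as candidate multi-word topics. None of this was part of the experiment above; the function names and the sample tweets are hypothetical, and it does nothing about spelling variants like #iamthankfulfor vs #imthankfulfor.

```python
import string
from collections import Counter

def normalize(token):
    """Lowercase a token and strip surrounding punctuation, keeping a leading '#'."""
    token = token.lower()
    is_hashtag = token.startswith("#")
    core = token.strip(string.punctuation)
    if not core:
        return ""  # the token was nothing but punctuation
    return "#" + core if is_hashtag else core

def tokens(text):
    """Split on whitespace, normalize each piece, and drop empty results."""
    return [t for t in (normalize(raw) for raw in text.split()) if t]

def bigram_counts(tweets):
    """Count adjacent word pairs within each tweet."""
    counts = Counter()
    for text in tweets:
        words = tokens(text)
        counts.update(zip(words, words[1:]))
    return counts

# Made-up sample tweets, for illustration only.
sample = [
    "Happy Thanksgiving!",
    "thanksgiving shopping on Black Friday",
    "black friday deals",
]
word_counts = Counter(t for text in sample for t in tokens(text))
pairs = bigram_counts(sample)
```

Frequent pairs would still need the same previous-day comparison to separate something like “black friday” from everyday pairs such as “in the”.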
