Understanding the structure of the data in Twitter streams for sentiment analysis applications
Nov 11, 2016
In the previous post we outlined the basic scientific method used and formalized the problem statement we are solving: “Based on the tweets of the English-speaking population of the United States related to selected new movie releases, can we identify patterns in the public’s sentiments towards these movies in real time and track the progression of these sentiments over time?” In this post we address the first step in the process: understanding the data.
Our goal in the earliest stage of the project is to understand as much as we can about the data: what data sources are available; how much data is being produced; how it is captured and transmitted, with what latencies and on what channels; how long it stays available; how secure it is; how accurate it is; and so on. In our case, we need the following types of data:
- Information about trending movies from which to create a short list of movies we’ll be analyzing. We will use the IMDB database to select several movies of various genres recently released in the US, along with their user ratings.
- The actual tweet stream. We will use the official Twitter streaming API to get a tweet stream filtered by a set of keywords.
- Dictionaries of positive and negative words. Dictionaries serve two purposes: they are the basis for dictionary-based classification, and they support dimensionality reduction (removing irrelevant information, such as the words “an” and “the,” that does not affect sentiment) in an approach employing machine learning. In our demo application, we prefer reusing existing dictionaries over creating our own, for the sake of cost and efficiency. On a long-term commercial project we might decide to invest in a custom dictionary. A number of dictionaries are available under open licenses as a good starting point, e.g. a dictionary of root words based on a dictionary created by Finn Arup Nielsen, Jeffrey Breen’s unigram dictionary, and the MPQA Subjectivity Lexicon, so we used those.
- Training and test datasets. The quality of our models will largely depend on the size of our training datasets and the quality of our testing datasets. For our purposes, we are limited to sources that are open and freely available to the research community. From experience, we know that the best datasets for our type of analysis consist of tweets labeled manually by people as carrying a “positive” or “negative” sentiment. We will use two datasets: the IMDB Large Movie Review Dataset (topical to our subject) and 5K manually labeled tweets from Niek Sanders.
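To make the two roles of a dictionary concrete, here is a minimal sketch of dictionary-based scoring with stop-word removal. The word scores and the stop-word list are made up for illustration; they are not taken from any of the dictionaries named above, and the function names are our own.

```python
import re

# Hypothetical miniature lexicon: word -> sentiment score.
# A real run would load one of the open dictionaries instead.
LEXICON = {"good": 3, "great": 3, "love": 3, "worth": 2,
           "bad": -3, "boring": -2, "hate": -3}

# Hypothetical stop-word list: words assumed to carry no sentiment.
STOP_WORDS = {"a", "an", "the", "is", "my", "i", "out", "and", "was"}


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score(text):
    """Sum the lexicon scores of all tokens that are not stop words."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return sum(LEXICON.get(t, 0) for t in tokens)


def label(text):
    """Map the numeric score to a coarse sentiment label."""
    s = score(text)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


print(label("i hope light's out is worth my time"))
```

Words missing from the lexicon contribute nothing to the score, which is why the choice of dictionary matters so much for this approach.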
Once the data scientists have a basic understanding of the data, they can begin formulating hypotheses about the insights that might be mined from the data and about the approach they might use to gain these insights.
The first task is to get the stream of tweets related to a specific movie. We will employ the filtering capability of the Twitter streaming API. Every tweet containing words similar to a movie name is considered movie-related. For example, for the movie “Lights Out” both of the tweets below will match. Such examples quickly reveal the data quality issues we’ll have to deal with:
- Tweet 1: “i hope light’s out is worth my time” (relevant)
- Tweet 2: “WOW! DNC Turned Lights Out on Bernie Hecklers in Audience to Control Optics.” (irrelevant)
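The sketch below reproduces this effect locally: a naive title matcher that accepts both tweets. It only illustrates why keyword matching alone cannot separate relevant from irrelevant tweets; it is not the matching logic of the Twitter streaming API, and the helper names are our own.

```python
import re


def normalize(text):
    """Lowercase, drop apostrophes, and collapse punctuation to spaces."""
    text = text.lower().replace("'", "").replace("\u2019", "")
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def mentions_title(tweet_text, title):
    """Return True if the title's words appear contiguously in the tweet."""
    words = normalize(tweet_text).split()
    wanted = normalize(title).split()
    if not wanted:
        return False
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


tweet_1 = "i hope light\u2019s out is worth my time"
tweet_2 = ("WOW! DNC Turned Lights Out on Bernie Hecklers "
           "in Audience to Control Optics.")

# Both tweets match, although only the first one is about the movie.
print(mentions_title(tweet_1, "Lights Out"))
print(mentions_title(tweet_2, "Lights Out"))
```

Separating the two cases requires more context than the title alone, which is one of the quality issues to address in later steps.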
The data received with every tweet looks like this, after being captured in JSON format:
{
"created_at" : "Fri Jul 22 21:34:48 +0000 2016",
"id" : 756603425161261057,
"id_str" : "756603425161261057",
"text" : "i hope light's out is worth my time. \ud83d\ude34",
"source" : "\u003ca href=\"http://twitter.com/download/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c/a\u003e",
"truncated" : false,
"in_reply_to_status_id" : null,
"in_reply_to_status_id_str" : null,
"in_reply_to_user_id" : null,
"in_reply_to_user_id_str" : null,
"in_reply_to_screen_name" : null,
"user" : {
"id" : 2989278494,
"id_str" : "2989278494",
"name" : "- q\u03c5een\u0455\u043day \u2728",
"screen_name" : "yungbarbiex0",
"location" : "Queens, NY",
"url" : "http://Instagram.com/xobabyshay",
"description" : "\u03b9'\u043c \u0442\u043da\u0442 \u0432\u03b9\u0442c\u043d yo\u03c5 \u043doe\u0455 \u043da\u0442e. $ | \u2651\ufe0f |",
"protected" : false,
"verified" : false,
"followers_count" : 1226,
"friends_count" : 551,
"listed_count" : 7,
"favourites_count" : 26466,
"statuses_count" : 40548,
"created_at" : "Mon Jan 19 04:01:54 +0000 2015",
"utc_offset" : null,
"time_zone" : null,
"geo_enabled" : true,
"lang" : "en",
"contributors_enabled" : false,
"is_translator" : false,
"profile_background_color" : "C0DEED",
"profile_background_image_url" : "http://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_image_url_https" : "https://abs.twimg.com/images/themes/theme1/bg.png",
"profile_background_tile" : true,
"profile_link_color" : "DD2E44",
"profile_sidebar_border_color" : "000000",
"profile_sidebar_fill_color" : "000000",
"profile_text_color" : "000000",
"profile_use_background_image" : true,
"profile_image_url" : "http://pbs.twimg.com/profile_images/755988257163317248/SRdOYbJA_normal.jpg",
"profile_image_url_https" : "https://pbs.twimg.com/profile_images/755988257163317248/SRdOYbJA_normal.jpg",
"profile_banner_url" : "https://pbs.twimg.com/profile_banners/2989278494/1468891291",
"default_profile" : false,
"default_profile_image" : false,
"following" : null,
"follow_request_sent" : null,
"notifications" : null
},
"geo" : null,
"coordinates" : null,
"place" : null,
"contributors" : null,
"is_quote_status" : false,
"retweet_count" : 0,
"favorite_count" : 0,
"entities" : {
"hashtags" : [],
"urls" : [],
"user_mentions" : [],
"symbols" : []
},
"favorited" : false,
"retweeted" : false,
"filter_level" : "low",
"lang" : "en",
"timestamp_ms" : "1469223288227"
}
[Screenshot of the tweet]
What potentially valuable data can we see here?
What are potential issues and challenges with the data?
The most important data, naturally, is the “text” field containing the tweet itself. There is also location information in the fields “location”, “coordinates”, and “place”. Information reflecting the tweeter’s social influence, such as “followers_count”, could also be interesting. Let’s take a quick look at the distribution of followers, based on a data sample of about 220K tweets collected during July 22-27, 2016 for several movies.
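As an illustration, the fields mentioned above can be pulled out of a captured tweet with a few lines of Python. This is a minimal sketch using a shortened excerpt of the sample tweet; the function name and the output keys are our own choices, not part of the Twitter API.

```python
import json

# Shortened excerpt of the sample tweet shown above.
RAW = r"""
{
  "text": "i hope light's out is worth my time. \ud83d\ude34",
  "user": {
    "screen_name": "yungbarbiex0",
    "location": "Queens, NY",
    "followers_count": 1226
  },
  "coordinates": null,
  "place": null,
  "lang": "en"
}
"""


def extract_fields(raw_json):
    """Keep only the fields we consider potentially useful for analysis."""
    tweet = json.loads(raw_json)
    user = tweet.get("user") or {}
    return {
        "text": tweet.get("text", ""),
        "lang": tweet.get("lang"),
        "followers_count": user.get("followers_count", 0),
        "profile_location": user.get("location"),  # free text from the profile
        "coordinates": tweet.get("coordinates"),   # null in the sample tweet
        "place": tweet.get("place"),               # null in the sample tweet
    }


fields = extract_fields(RAW)
print(fields["followers_count"], fields["profile_location"])
```

Note that in the sample the profile-level “location” is filled in while the tweet-level “coordinates” and “place” are null.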
Quantiles of the “followers number”
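For readers who want to compute such quantiles on their own sample, a minimal sketch with the Python standard library follows. The follower counts below are made up for illustration and are not taken from our dataset.

```python
import statistics

# Hypothetical follower counts; a real run would use the
# followers_count values collected from the tweet stream.
followers = [0, 12, 85, 140, 310, 551, 1226, 4800, 26000, 310000]


def follower_quartiles(counts):
    """Return the three cut points that split the counts into quartiles."""
    return statistics.quantiles(counts, n=4, method="inclusive")


q1, q2, q3 = follower_quartiles(followers)
print(q1, q2, q3)
```

With a heavily skewed distribution like this one, quantiles describe the data much better than the mean does.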
An ANOVA (ANalysis Of VAriance) test indicates that the mean “followers” number differs significantly between movies.
Analysis of Variance Table
Response: followers_count
              Df     Sum Sq    Mean Sq F value    Pr(>F)
movie          4 1.2078e+12 3.0195e+11  10.464 1.789e-08 ***
Residuals 219302 6.3280e+15 2.8855e+10
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
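To read the table: each mean square is the sum of squares divided by its degrees of freedom, and the F value is the ratio of the two mean squares. The short sketch below only re-derives these numbers from the figures printed in the table; it does not recompute the p-value, and the variable names are our own.

```python
# Figures copied from the ANOVA table above.
ss_movie, df_movie = 1.2078e12, 4
ss_resid, df_resid = 6.3280e15, 219302

# Mean square = sum of squares / degrees of freedom.
ms_movie = ss_movie / df_movie
ms_resid = ss_resid / df_resid

# F value = between-group mean square / residual mean square.
f_value = ms_movie / ms_resid

# Degrees of freedom also reveal the sample layout:
# groups = df_movie + 1, observations = df_movie + df_resid + 1.
n_movies = df_movie + 1
n_tweets = df_movie + df_resid + 1

print(f"F = {f_value:.3f}, movies = {n_movies}, tweets = {n_tweets}")
```

The degrees of freedom are consistent with the sample described earlier: five movies and roughly 220K tweets.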
The next important aspect of data understanding is the amount of data. Working with the Twitter Streaming API, we get several dozen tweets per second for a stream filtered by one movie name.
Once our data science team had looked at enough data samples, they could summarize the initial findings:
- Tweet text may contain emoticons and other Unicode-encoded symbols. In this tweet example, the text (“i hope light’s out is worth my time. \ud83d\ude34”) ends with a Unicode-encoded “sleeping face” emoji. This will affect data cleansing and feature extraction.
- Location data is empty for more than 90% of tweets. That makes this information almost useless for analysis.
- The “followers” number range is extremely wide, from zero on the low side to hundreds of thousands on the high side. That could be an interesting area to look for any patterns in sentiment distribution if we track the number of followers per movie over time.
- We should be prepared to process 100 tweets per second per movie to be on the safe side. Depending on the number of movies we want to monitor, we will need to size the processing infrastructure appropriately.
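The first finding has a direct impact on data cleansing. The sketch below shows one possible way to separate symbols such as emoji from the rest of the tweet text, using the Unicode category “So” (Symbol, other) as the criterion. That criterion is a simplifying assumption: it does not cover every emoji-related code point, and the function name is our own.

```python
import json
import unicodedata


def split_symbols(text):
    """Split text into plain text and the names of 'So' symbols in it."""
    kept, symbols = [], []
    for ch in text:
        if unicodedata.category(ch) == "So":
            symbols.append(unicodedata.name(ch, "UNKNOWN"))
        else:
            kept.append(ch)
    return "".join(kept).strip(), symbols


# json.loads turns the escaped surrogate pair into a single character.
text = json.loads(r'''"i hope light's out is worth my time. \ud83d\ude34"''')

clean, symbols = split_symbols(text)
print(clean)
print(symbols)
```

Keeping the symbol names instead of discarding them leaves the option of using emoji as sentiment features later.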
All of this gives us insight into the kind of data we have and stimulates our thinking about hypotheses and directions for further data exploration. It also gives us the grounding needed to select the right dictionary, which is the subject of the next blog post.
References
- IMDB Large Movie Review Dataset
- Learning Word Vectors for Sentiment Analysis, Stanford University 2011
- 5K manually labeled tweets from Niek Sanders
- A new ANEW: Evaluation of a word list for sentiment analysis in microblogs – by Finn Arup Nielsen
- Dictionary of root-words with sentiment scores
- Jeffrey Breen’s positive and negative unigram dictionary
- MPQA Subjectivity Lexicon
- ANOVA