Collecting, cleaning and visualising Twitter data
Twitter can be a useful resource for researchers trying to capture the public mood around events[1]. However, extracting and sorting data from Twitter can be complex, expensive or both. Below are four guides to collecting, cleaning and visualising Twitter data in a free and (relatively) easy way.
Collecting Twitter data with TAGS and exporting to Gephi
TAGS[2] is a Google Sheet that lets you collect tweets by hashtag and/or user directly from Twitter. It can retrieve tweets sent within roughly the last seven days, and you can leave the script running to collect new tweets as they are posted, with no limit on the number of these ‘future’ tweets. This tutorial shows you how to set up TAGS, clean your data with a basic Python script and import it into Gephi ready for visualisation.
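As a rough illustration of the cleaning step, the sketch below turns a TAGS-style CSV export into an edge list that Gephi can import. It is not the script from the tutorial: the column names `from_user` and `text` are assumed to match your TAGS archive sheet, the function names are hypothetical, and the sample rows are invented for the example.

```python
import csv
import io
import re

# Twitter handles are at most 15 word characters; this simple pattern
# is an approximation and will also match things like email fragments.
MENTION_RE = re.compile(r"@(\w{1,15})")


def mentions_to_edges(rows):
    """Yield one (source, target) pair per @mention found in each tweet."""
    for row in rows:
        # Column names are an assumption about the exported sheet.
        source = (row.get("from_user") or "").strip().lower()
        if not source:
            continue  # skip rows with no author
        for target in MENTION_RE.findall(row.get("text") or ""):
            target = target.lower()
            if target != source:  # drop self-mentions
                yield source, target


def write_edge_list(edges, out_file):
    """Write edges as CSV with the Source/Target header Gephi looks for."""
    writer = csv.writer(out_file)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Invented sample standing in for a downloaded TAGS archive.
sample = io.StringIO(
    "from_user,text\n"
    "Alice,Great talk by @Bob and @carol!\n"
    "bob,RT @Alice: thanks @bob\n"
)
edges = list(mentions_to_edges(csv.DictReader(sample)))

out = io.StringIO()
write_edge_list(edges, out)
print(out.getvalue())
```

With a real export you would pass an opened CSV file to `csv.DictReader` instead of the in-memory sample, and write to a file that you then load through Gephi's spreadsheet import. Lower-casing the handles keeps "Alice" and "alice" from becoming two separate nodes.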
Hydrating Datasets for Gephi
To comply with Twitter’s privacy rules, full datasets cannot be published on the web, but Twitter does allow datasets of Tweet identifiers to be shared[3]. This tutorial shows you how to take these identifier datasets and return ("hydrate") them to their original JSON or CSV form for analysis. It then shows you how to clean your data with a basic Python script and import it into Gephi for visualisation.
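Once a dataset has been hydrated, the cleaning step might look something like the sketch below. It is not the script from the tutorial. It assumes the hydrated tweets are stored one JSON object per line, and that each object carries `user.screen_name` and `entities.user_mentions` fields as in Twitter's v1.1 tweet format; your hydration tool may produce a different layout. The function names and sample tweets are made up for illustration.

```python
import csv
import io
import json
from collections import Counter


def count_mention_edges(lines):
    """Count how often each author mentions each other account."""
    weights = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        tweet = json.loads(line)
        # Field names below are an assumption about the hydrated format.
        source = tweet["user"]["screen_name"].lower()
        for mention in tweet.get("entities", {}).get("user_mentions", []):
            target = mention["screen_name"].lower()
            if target != source:  # drop self-mentions
                weights[(source, target)] += 1
    return weights


def write_weighted_edges(weights, out_file):
    """Write a Source/Target/Weight CSV suitable for Gephi's edge import."""
    writer = csv.writer(out_file)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])


# Invented sample standing in for a hydrated JSON-lines file.
sample_lines = [
    json.dumps({"user": {"screen_name": "Alice"},
                "entities": {"user_mentions": [{"screen_name": "Bob"}]}}),
    json.dumps({"user": {"screen_name": "alice"},
                "entities": {"user_mentions": [{"screen_name": "bob"},
                                               {"screen_name": "Carol"}]}}),
]
weights = count_mention_edges(sample_lines)

out = io.StringIO()
write_weighted_edges(weights, out)
print(out.getvalue())
```

Aggregating repeated mentions into a single weighted edge keeps the file small and lets Gephi scale edge thickness by how often one account mentions another.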
Collecting Tweets with Gephi
You can also collect data from Twitter directly through Gephi. However, this method only collects Tweets in real time, and none from the past. You can leave the script running to collect tweets as they are posted and watch your graph grow. You can collect an unlimited number of future tweets. This tutorial shows you how to set up Gephi to collect Tweets.
Analyzing Twitter Networks with Gephi
Once you have collected your Twitter data and imported it into Gephi you will want to do some analysis to help you understand what the data means. This tutorial will show you around the Gephi interface, and point you towards some more advanced analysis techniques.
Other sources of free data for analysis
There are many pre-collected Twitter datasets available across the internet. These can be useful in research, but you should check how they were collected to make sure the data is valid for your purposes. At the time of writing, the datasets below were available for free use:
http://twitterpoliticians.org
https://www.docnow.io/catalog/
https://tweetsets.library.gwu.edu/
https://data.world/datasets/twitter
https://snap.stanford.edu/data/
https://www.trackmyhashtag.com/blog/twitter-datasets-free/
http://followthehashtag.com/datasets/
https://github.com/shaypal5/awesome-twitter-data
https://dataverse.harvard.edu/
- Remember that, while useful for getting a sense of opinion on different topics, Twitter data is not representative of the whole population. For example, U.S. adult Twitter users are younger and more likely to be Democrats than the general public. Most users rarely tweet, but the most prolific 10% create 80% of tweets from adult U.S. users. These patterns will differ between countries and between topics. Read more about these issues in this Pew Research Center report: https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/
- Developed by Martin Hawksey.
- When using any of the data collection methods or pre-made datasets above, you must follow Twitter’s terms for providing downloaded content to third parties, as well as research ethics guidelines.