Collecting, cleaning and visualising Twitter data

Twitter can be a really useful resource for researchers trying to capture the public mood around events[1]. However, extracting and sorting data from Twitter can be complex, expensive or both. Below are four guides to collecting, cleaning and visualising Twitter data in a free and (relatively) easy way.

Collecting Twitter data with TAGS and exporting to Gephi
TAGS[2] is a Google Sheet that allows you to collect tweets from hashtags and/or users directly from Twitter. You can collect tweets sent within roughly the last seven days, and can leave the script running to collect tweets as they are posted. You can collect an unlimited number of ‘future’ tweets. This tutorial shows you how to set up TAGS, clean your data with a basic Python script and then import it into Gephi ready for visualisation.
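
As a rough illustration of what the cleaning step can look like, the sketch below turns rows of tweets into a list of "who mentioned whom" edges and writes them as a CSV with Source and Target columns, which Gephi's spreadsheet importer can read as an edge table. It is a minimal sketch rather than the script used in the tutorial: the column names from_user and text are assumptions about the exported sheet, and the function names are hypothetical, so adjust them to match your own export.

```python
import csv
import io
import re

# Matches @username mentions inside the tweet text.
MENTION = re.compile(r"@(\w+)")


def mention_edges(rows, user_col="from_user", text_col="text"):
    """Return (source, target) pairs, one per mention in each tweet.

    user_col and text_col are assumed column names; change them to
    match the headers in your own export.
    """
    edges = []
    for row in rows:
        source = (row.get(user_col) or "").strip().lower()
        if not source:
            continue  # skip rows with no author
        for target in MENTION.findall(row.get(text_col) or ""):
            target = target.lower()
            if target != source:  # ignore self-mentions
                edges.append((source, target))
    return edges


def write_edge_csv(edges, handle):
    """Write edges with the Source/Target header used for edge tables."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)


# Small in-memory sample standing in for a CSV exported from the sheet.
sample = io.StringIO(
    "from_user,text\n"
    "Alice,Great talk by @Bob and @carol!\n"
    ",orphan row with no author\n"
    "bob,RT @alice: thanks @bob\n"
)
edges = mention_edges(csv.DictReader(sample))

out = io.StringIO()
write_edge_csv(edges, out)
print(out.getvalue())
```

With a real export you would pass an opened file to csv.DictReader instead of the in-memory sample, and write the result to a file on disk. Usernames are lower-cased here so that @Bob and @bob become a single node in the graph.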

Hydrating Datasets for Gephi
To comply with Twitter’s privacy rules, full datasets cannot be published on the web, but Twitter does allow Tweet identifier datasets to be shared[3]. This tutorial will show you how to take these identifier datasets and return them to their original JSON or CSV form for analysis. It will then show you how to clean your data with a basic Python script and import it into Gephi for visualisation.
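
Identifier datasets are usually plain text files with one Tweet ID per line. Before hydrating, it can help to tidy the list and split it into manageable groups. The sketch below is an illustration only and makes no network requests: the function names are hypothetical, and the default batch size of 100 is an assumption that you should adjust to whatever limit your hydration tool or API access imposes.

```python
def load_tweet_ids(lines):
    """Return unique, purely numeric IDs in their original order."""
    seen = set()
    ids = []
    for line in lines:
        candidate = line.strip()
        # Drop blank lines, headers and anything that is not a number.
        if candidate.isdigit() and candidate not in seen:
            seen.add(candidate)
            ids.append(candidate)
    return ids


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items.

    The default of 100 is an assumed limit, not a fixed rule.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Stand-in for the lines of a downloaded identifier file.
raw = ["1001\n", "1002\n", "\n", "1001\n", "not-an-id\n", "1003\n"]
ids = load_tweet_ids(raw)
groups = list(batches(ids, size=2))
print(groups)
```

IDs are kept as strings rather than converted to integers, because Tweet IDs are very long numbers that some tools round or reformat when they treat them as numeric values.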

Collecting Tweets with Gephi
You can also collect data from Twitter directly through Gephi. However, this method only collects Tweets in real time, and none from the past. You can leave the script running to collect tweets as they are posted and watch your graph grow. You can collect an unlimited number of future tweets. This tutorial shows you how to set up Gephi to collect Tweets.

Analyzing Twitter Networks with Gephi
Once you have collected your Twitter data and imported it into Gephi you will want to do some analysis to help you understand what the data means. This tutorial will show you around the Gephi interface, and point you towards some more advanced analysis techniques.
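
One of the simplest measures you will meet in this kind of analysis is in-degree: how many different accounts point at a given account. The sketch below computes it by hand from a small list of (source, target) edges, which can be a useful sanity check alongside the figures shown in Gephi. The edge list and the function name are made up for illustration, and counting each source only once per target is a choice made here, not a description of how Gephi handles repeated edges.

```python
from collections import Counter


def in_degree(edges):
    """Count how many distinct sources point at each target."""
    unique_edges = set(edges)  # repeated mentions count only once
    return Counter(target for _source, target in unique_edges)


# Hypothetical mention edges: (who tweeted, who was mentioned).
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("alice", "bob"),  # duplicate of the first edge
    ("bob", "alice"),
]
ranking = in_degree(edges).most_common()
print(ranking)
```

Accounts at the top of the ranking are the ones most often mentioned by others, which is often a good first place to look when working out who is central to a conversation.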


Other sources of free data for analysis

There are many pre-collected Twitter datasets available across the internet. These can be useful in research, but you should check their methods of collection to ensure the validity of data. At the time of writing the below datasets were available for free use:

Twitter Politicians
This website allows you to explore and download a database of parliamentarians on Twitter in 26 countries. The database is designed to move beyond the one-off nature of most Twitter-based research and in the direction of systematic and rigorous comparative and transnational analysis.
http://twitterpoliticians.org
DocNow
One of the largest collections of free datasets available on the internet. These datasets can be used by following the second of the tutorials above.
https://www.docnow.io/catalog/
George Washington University’s TweetSets
A small collection of very large datasets collected by researchers at George Washington University. This database is used in the hydration tutorial above.
https://tweetsets.library.gwu.edu/
Data World
A collection of 84 datasets covering topics ranging from Trump and Elon Musk to corporate messaging. Most of the datasets are reasonably old, but still valid for use. They can be rehydrated using the method above.
https://data.world/datasets/twitter
SNAP
The Stanford Large Network Dataset Collection (SNAP) contains a range of data about different platforms, including Twitter. The Twitter datasets can be rehydrated using the method above.
https://snap.stanford.edu/data/
Track my Hashtag
A set of 33 free datasets drawn from other sites across the internet. Most data downloads in a format that is usable out of the box, but you will need to check the methodology for each dataset. Most can be rehydrated using the method above.
https://www.trackmyhashtag.com/blog/twitter-datasets-free/
Follow the Hashtag
A smaller collection of datasets, although each is sizable. There are also geotagged datasets available here in case you are running a mapping exercise with your data. Most can be rehydrated using the method above.
http://followthehashtag.com/datasets/
GitHub
There are a number of users of GitHub who make their Twitter scrapes available for free download and use. As always you should check their methodology and the validity of data. A good set of data is provided by Shaypal5:
https://github.com/shaypal5/awesome-twitter-data
Harvard Dataverse
A huge collection of datasets, not all of them Twitter data. It can be difficult to find what you are looking for, but the data comes with clear methodologies and descriptions of any issues during the collection phase. The Twitter datasets can be rehydrated using the method above.
https://dataverse.harvard.edu/

Footnotes

  1. Remember, while useful for getting an idea of the mood around different topics, Twitter data is not going to be representative of the whole population. For example, U.S. adult Twitter users are younger and more likely to be Democrats than the general public. Most users rarely tweet, but the most prolific 10% create 80% of tweets from adult U.S. users. These patterns will differ between countries and between topics. Read more about these issues in this Pew Research Center report: https://www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/
  2. Developed by Martin Hawksey.
  3. When using any of the data collection methods or pre-made datasets above, you must follow Twitter’s terms for providing downloaded content to third parties, as well as research ethics.

 

About
Doug Specht is a Reader in Cultural Geography and Communication, a Chartered Geographer (CGeog. FRGS), and Assistant Head of School in the School of Media and Communication at the University of Westminster.

His research examines how knowledge is constructed and codified through digital and cartographic artefacts, focusing on development issues in Latin America and Sub-Saharan Africa, where he has carried out extensive fieldwork. He also writes and researches on pedagogy, and is author of the Media and Communications Student Study Guide.

He speaks and writes on topics of data ethics, development, education and mapping practices at conferences and invited lectures around the world. He is a member of the editorial board at Westminster Papers in Communication and Culture, and the journal Anthropocenes – Human, Inhuman, Posthuman. He is also Chair of the Environmental Network for Central America.