Subject Guides: Social Media Research: Twitter

X (formerly Twitter) TOS, API, and rules for research

Alongside the TOS, X also has extensive guidance for non-commercial use of the API and specific guidance on what not to do with it.

Things that are restricted or forbidden include:

Deriving or storing health, sexuality, union membership, or other personal data
- Some of this information can be worked with in aggregate, but not on an individual level
Matching X accounts to real-world people
Sharing large numbers of "hydrated" posts (more on this later)

Many tools for the collection of X datasets will require you to sign up for a developer account and get an API key to access the API services. You can do this on the X Developer site.
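Once you have a key, it is safer to keep it out of your scripts and notebooks. The following is a minimal sketch, assuming your key is a bearer token stored in an environment variable. The variable name `X_BEARER_TOKEN` and the helper `auth_headers` are illustrative choices for this example, not names that X or any of the tools below require.

```python
import os


def auth_headers(env_var="X_BEARER_TOKEN"):
    """Build an HTTP Authorization header from a token stored in the environment.

    The environment variable name is a hypothetical choice for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fail early with a clear message instead of sending an empty token.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return {"Authorization": f"Bearer {token}"}
```

Reading the token at run time means the script can be shared or published without exposing your credentials.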

The normal API is also known as the "streaming" API, as it provides access to new posts as they are made but not to old posts. It generally returns posts that are less than a week old, so many of the collection tools are meant to be set up and left running while a conversation is ongoing, collecting posts as they happen over a period of time. It is also unclear how comprehensive the X search system is: it is a black-box algorithm that is expected to prioritize popular posts over comprehensive recall.
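To make the one-week window concrete, here is a small sketch that prepares the parameters for a search request and refuses to look back further than seven days. It makes no network calls. The helper name `recent_search_params` is hypothetical, and the parameter names (`query`, `start_time`, `max_results`) are assumed to follow the v2 recent search endpoint, so check them against the current developer documentation and your own access level before relying on them.

```python
from datetime import datetime, timedelta, timezone


def recent_search_params(query, days_back=6, now=None, max_results=100):
    """Return query parameters for a recent-search style request.

    Assumes a lookback limit of about seven days; the real limit
    depends on your API access level.
    """
    if not 1 <= days_back <= 7:
        raise ValueError("days_back must be between 1 and 7")
    now = now or datetime.now(timezone.utc)
    start = now - timedelta(days=days_back)
    return {
        "query": query,
        # Timestamp formatted as ISO 8601 in UTC.
        "start_time": start.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "max_results": max_results,
    }
```

Because older posts fall out of the window, a collection script built this way would need to run repeatedly (for example, daily) for as long as the conversation you are studying continues.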

There is a specific elevated API access level for academic research that advanced social media researchers may want to look into as it enables more Tweet retrieval and retrieval of older archived Tweets. You can find out more about X features for academic research on the corresponding page.

Finding existing Twitter or X datasets

Twitter has been extensively studied for over a decade, and there are several existing historical datasets of Tweets for use in research. These can be found in repositories across the web.

It's important to note that the Twitter TOS restricts the sharing of full Tweets in datasets. Most of these datasets therefore contain only Tweet IDs, and the associated text and metadata will need to be retrieved before analysis. This process is called "hydration" and there are a few apps that will do it for you, like the DocNow hydrator.
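If you are curious what a hydration tool does before it contacts the API, the preparation step can be sketched as below: clean the list of Tweet IDs, drop duplicates, and group them into batches. This is only an illustration, not how the DocNow hydrator is implemented. The helper names are hypothetical, and the batch size of 100 is an assumption based on the per-request cap of the v2 Tweet lookup endpoint, so confirm the limit for your access level.

```python
def read_ids(lines):
    """Yield unique, numeric Tweet IDs from an iterable of text lines, keeping order."""
    seen = set()
    for line in lines:
        tweet_id = line.strip()
        # Skip blank lines, headers, and anything that is not a plain number.
        if tweet_id.isdigit() and tweet_id not in seen:
            seen.add(tweet_id)
            yield tweet_id


def batches(ids, size=100):
    """Group IDs into lists of at most `size` items (100 is an assumed cap)."""
    batch = []
    for tweet_id in ids:
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Emit the final, partially filled batch.
        yield batch
```

Each batch would then be sent to the API by a hydration tool. Posts that have been deleted or made private since the dataset was published will not come back, so a hydrated dataset is usually smaller than the ID list.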

Existing Twitter datasets:

DocNow Tweet Catalog
- Contains 100+ (at time of writing) datasets on subjects from across political science, environmental science, international studies, and more.
The American Presidency Project and the Trump Twitter Archive archive full-text tweets from Donald Trump's now-suspended account
Twitter's Transparency Archive, which collects deleted Tweets believed to have been deliberate misinformation created by foreign state actors.
- Twitter is working on creating additional no-code/low-code Tweet archives on topics of interest to academic researchers, but as of 05/2022 has only released this one.

Voxgov
Voxgov provides real-time access to social media feeds, press releases, publications, and documents from the federal government. It has built-in features for working with government social media, including creating .csv datasets from search results, graphing how frequently your search terms were tweeted by government accounts, finding the most commonly associated terms or people, and more! *AU community only*

Tools for Twitter research

Netlytic
Netlytic is a browser-based social media research tool that has text mining and network visualization features. It works with Twitter, YouTube, RSS feeds, and Reddit. Free accounts are sufficient for most student purposes. Netlytic has a YouTube channel with demonstrations for a variety of project types.
Chorus
Fully-contained collection-to-visualization Twitter research app. Requires downloads and a fair bit of configuration. It was created specifically for doing Twitter analysis in social science research and is a pretty comprehensive search and analysis tool.
Mozdeh
Mozdeh is a free and open-source (FOSS) tool for quantitative social media analysis that can also collect tweets, like Netlytic or Chorus. It works with the same sources as Netlytic: Tweets, YouTube comments, Reddit comments, and manually imported data. Unlike Netlytic, it is a desktop app. It also has a YouTube channel where you can find guides to collecting and analyzing data.
TAGS (Twitter Archiving Google Sheets)
A fairly straightforward way to get Tweets into Google Sheets to be loaded into whatever qualitative or quantitative analysis software you desire.
Tweet Archiver
A cousin to TAGS, it allows you to create queries and capture corresponding Tweets as they happen.
DocNow App
Twitter dataset collection tool created by Documenting the Now. There is a browser-based version for trying it out. Will additionally need the DocNow Hydrator installed. You can find out more at docnow.io
NCapture
NCapture is a Chrome extension that allows users to manually create text datasets from Facebook, Twitter, and YouTube, as well as capture YouTube videos for analysis in NVivo. *Only able to be used with NVivo - not an option otherwise*.

Please see the "Collected Tools" page of this guide for more information about NVivo.

twarc
twarc is a command-line utility and Python package for the collection and rehydration of Twitter datasets. It was created by Documenting the Now (DocNow), who are also responsible for the DocNow App, Hydrator, and other tools. You can find out more at docnow.io
Twitter's searchtweets Python package
A Twitter-provided Python library for working with the Twitter API in command line or scripts.
Reaper
Reaper, built on the socialreaper Python library, is a desktop app with no coding required. While it calls what it does "scraping", it makes use of site APIs and the user will need to register for an API key for any site they want to use Reaper on. This includes Facebook, Twitter, Reddit, YouTube, Tumblr, and Pinterest. It outputs all data as .csv tabular files.
4CAT
4CAT is a relatively advanced tool for the collection and analysis of social media data. It is best run on a UNIX server and has dependencies that it does not install automatically, but in return it has modules built to work with important but niche platforms like 4chan, 8kun, Parler, and more, as well as Twitter and Reddit.
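Command-line collectors such as twarc write their results as line-delimited JSON (one JSON object per line), while most analysis software expects a table. The sketch below shows one way to turn such a file into a .csv. It assumes each line is a single post with top-level `id`, `created_at`, and `text` fields. Real output may be nested or grouped into pages and may need flattening first, and the function name `jsonl_to_csv` is a hypothetical helper, not part of twarc.

```python
import csv
import json


def jsonl_to_csv(jsonl_lines, out, fields=("id", "created_at", "text")):
    """Write selected fields from line-delimited JSON to a CSV file object.

    Assumes one flat post object per line; missing fields become empty cells.
    Returns the number of rows written.
    """
    writer = csv.DictWriter(out, fieldnames=list(fields))
    writer.writeheader()
    count = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # Skip blank lines between records.
        record = json.loads(line)
        writer.writerow({field: record.get(field, "") for field in fields})
        count += 1
    return count
```

A typical use would be to open the collected file for reading and a new .csv file for writing (with `newline=""`, as the `csv` module recommends), then pass both to the function. The resulting table can be loaded into a spreadsheet, R, or NVivo.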

Example publications in Twitter research

Climate Change Sentiment on Twitter: An Unsolicited Public Opinion Poll
ABSTRACT: The consequences of anthropogenic climate change are extensively debated through scientific papers, newspaper articles, and blogs. Newspaper articles may lack accuracy, while the severity of findings in scientific papers may be too opaque for the public to understand. Social media, however, is a forum where individuals of diverse backgrounds can share their thoughts and opinions.

As consumption shifts from old media to new, Twitter has become a valuable resource for analyzing current events and headline news. In this research, we analyze tweets containing the word “climate” collected between September 2008 and July 2014. Through use of a previously developed sentiment measurement tool called the Hedonometer, we determine how collective sentiment varies in response to climate change news, events, and natural disasters. We find that natural disasters, climate bills, and oil-drilling can contribute to a decrease in happiness while climate rallies, a book release, and a green ideas contest can contribute to an increase in happiness. Words uncovered by our analysis suggest that responses to climate change news are predominately from climate change activists rather than climate change deniers, indicating that Twitter is a valuable resource for the spread of climate change awareness.
Twitter as research data: Tools, costs, skill sets, and lessons learned
Scholars increasingly use Twitter data to study the life sciences and politics. However, Twitter data collection tools often pose challenges for scholars who are unfamiliar with their operation. Equally important, although many tools indicate that they offer representative samples of the full Twitter archive, little is known about whether the samples are indeed representative of the targeted population of tweets.

This article evaluates such tools in terms of costs, training, and data quality as a means to introduce Twitter data as a research tool. Further, using an analysis of COVID-19 and moral foundations theory as an example, we compared the distributions of moral discussions from two commonly used tools for accessing Twitter data (Twitter’s standard APIs and third-party access) to the ground truth, the Twitter full archive. Our results highlight the importance of assessing the comparability of data sources to improve confidence in findings based on Twitter data. We also review the major new features of Twitter’s API version 2.