Digital Research

This guide provides an overview of tips, support, and resources available to complete digital research projects at American University.

Text Mining and Social Media Research Basics

Social media research treats user-generated social media content and interactions as the subject of study. The tutorials below cover the methods and key considerations of social media research, as well as a tool that supports this kind of work.

Text mining uses software to process and analyze large sets of unstructured texts to identify patterns and connections. These resources outline the basics of what text mining is, common approaches, and resources that you can use to conduct this type of analysis.
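
As a rough illustration of what "processing unstructured text to identify patterns" can look like in practice, the sketch below counts word frequencies across a few documents using only the Python standard library. The sample sentences, the tiny stopword list, and the function names are all made up for this example; a real project would load its own corpus and would likely use a fuller stopword list.

```python
import re
from collections import Counter

# Hypothetical sample documents standing in for a real corpus.
DOCUMENTS = [
    "The library offers tools for text mining.",
    "Text mining finds patterns in large sets of texts.",
    "Patterns in texts can refine a research question.",
]

# A tiny, illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "for", "in", "of", "a", "can"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_frequencies(documents, stopwords=STOPWORDS):
    """Count how often each non-stopword token appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in stopwords)
    return counts


frequencies = word_frequencies(DOCUMENTS)

# Show the most frequent words first.
for word, count in frequencies.most_common(5):
    print(word, count)
```

Word counts like these are usually only a starting point; the tools described in this guide build on the same idea with larger corpora and richer analyses.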

Note: Not all online resources allow text mining, and projects of this type may have legal or ethical considerations to take into account. In addition, not all library-licensed materials allow use of Artificial Intelligence (AI) tools for text analysis.

Text Mining and Social Media Research Tools

Google Ngram Viewer is a free, beginner-friendly text searching tool that lets you visualize and graph occurrences of words in texts located in Google Books. It's easy to use and can be a great place to start for refining your research questions and getting a brief preview of the possibilities for large-scale text analysis.

Voyant Tools is a web-based reading and analysis tool for digital texts. Voyant takes your texts, creates a corpus, and calculates word frequencies, correlations between sets of words, commonly repeated phrases, topic clusters, and other analyses of interest to researchers. You can type in multiple URLs, paste in full text, or upload your own files for analysis.

HathiTrust Research Center Analytics is a tool that can perform large-scale text analysis on materials in the HathiTrust Digital Library, which is home to millions of digitized books and publications that you can use to gather a set of texts for your text mining research. The resources below provide information about how to use the HathiTrust Research Center, which you have access to through AU. Note: HathiTrust Research Center will no longer be funded as of the end of 2026, but the team hopes to continue to offer services to the research community.

AntConc is a free concordancer program designed by Laurence Anthony. It can take corpora, analyze clusters of words and/or phrases, and highlight structures and contexts. It works with .txt files and is available for Windows, Mac, and Linux.
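
To give a sense of what a concordancer does, here is a minimal keyword-in-context (KWIC) sketch in Python. It is not AntConc's implementation, only an illustration of the general technique; the sample text, the `kwic` function, and its `width` parameter are hypothetical names chosen for this example.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each keyword hit.

    `width` is the number of words of context kept on each side.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


# Hypothetical sample text standing in for a .txt file from a corpus.
SAMPLE_TEXT = (
    "Text mining uses software to analyze texts. "
    "Researchers use mining tools to find patterns."
)

hits = kwic(SAMPLE_TEXT, "mining", width=2)

# Print each hit with the keyword aligned in a center column.
for left, match, right in hits:
    print(f"{left:>20} | {match} | {right}")
```

Lining up every occurrence of a word with its neighbors makes recurring phrases and contexts easier to spot than reading the texts one at a time.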

NVivo is a research tool for coding data from a variety of sources, including text-based documents, interviews, surveys, maps, audio/video files, and social media data, and then automating a variety of qualitative and quantitative research analyses for those sources. NVivo can be accessed in the Anderson Computing Complex on campus.

R is an open-source software program for statistical analysis that can also perform some text mining analysis. The resource below showcases how to use R for this purpose.

MALLET is a well-respected resource for topic modeling, or exploring relationships of words and topics within large corpora. A Python wrapper is available if you prefer to work in Python.

Sample Collections for Text Mining

These resources may be text and data mined for academic scholarship or educational purposes. Each has its own license agreement and text and data mining rights, so review them before beginning your data collection. Many also prohibit automated mining with Artificial Intelligence (AI) tools.

  • BioMed Central provides a full-text corpus specifically developed for text mining research.
  • Chronicling America is a database of historical newspapers from 1789 to 1922 that are in the public domain.
  • CORE is the world's largest collection of open access research papers. It is accessible through APIs. 
  • Digital Public Library of America (DPLA) makes its content available via APIs and bulk downloads.
  • Early English Books Online provides output in XML files for analysis.
  • English-Corpora.org was developed by Mark Davies, Linguistics Professor at BYU.
  • Folger Shakespeare Library provides digital copies of Shakespeare's plays, poems, and sonnets, offered in multiple formats.
  • HathiTrust is a partnership of academic and research institutions and offers a collection of millions of digitized titles.
  • Internet Archive provides instructions for those interested in bulk download and API access.
  • National Center for Biotechnology Information (NCBI) has developed its own text mining tools to analyze scholarly publications in the biomedical field.
  • New York Times Developer Portal allows access to APIs for mining New York Times publication data.
  • The Oxford Text Archive is a collection of digitized literary and linguistic texts.
  • Project Gutenberg is a large collection of free e-books available for download. Most are in the public domain.
  • PubMed from the National Library of Medicine offers over 37 million citations to biomedical and life sciences literature, links to full-text resources, and public code repositories with APIs.
  • ScienceDirect provides APIs for no charge, for academic use, subject to Elsevier's policies and limits on usage.
  • Web of Science can be mined using APIs from Clarivate Analytics.