This content is drawn from a report authored by the AU Library's Artificial Intelligence Experimentation Working Group. You can read the group's full report, covering its experiments with the use of AI in library work and its recommendations to library leadership, in the American University Research Archive.
The Business Tools subgroup sought to test AI tools to find efficiencies in tasks that are common to all library employees, including supporting meetings, writing and editing content, and creating presentations.
The members of this subgroup divided into three teams, and each investigated the application of AI tools to one of the above tasks.
Meetings: The team charged with testing tools for meeting-related tasks looked at Zoom’s AI Companion. While they considered other tools, such as otter.ai, Zoom is already available campus-wide and the university made AI Companion available to users over the summer, so it seemed wise to investigate its functionality.
This team assessed three scenarios to test the AI tool’s capabilities in in-meeting transcription, note-taking, and summarizing.
The team found that while all of these functions needed a final human editor to add context and fix name errors, Zoom’s AI Companion is a helpful, easy-to-use tool that captures what was said well. The team thinks it would be most helpful when one person is both facilitating and taking notes during a meeting.
Writing: This team tested ChatGPT, Gemini, QuillBot, and Grammarly to see which would be helpful in creating and editing written content in various formats. For paraphrasing, QuillBot and Grammarly were helpful and brought some creativity and clarity to the original text. However, they did not work perfectly every time, and there was a significant difference between the free and premium versions. For summarizing complex information, ChatGPT performed best, providing a detailed response in testing. Overall, AI tools can be very helpful for writing assistance but require a human to check the output, suggesting that AI should be viewed as an assistant, not the lead writer.
Presentations: The presentation team found that Synthesia was helpful for turning text into video. They viewed this tool as being particularly useful for quick videos that require frequent updating—but saw little value in using the video avatar.
This subgroup investigated four different AI projects to describe, enhance, or format archival descriptions and metadata.
Finding Aids: This project aimed to use AI to convert text-based finding aids into EAD XML to prepare them for upload into ArchivesSpace, an archives content management system. The hope was to streamline this process, as manually encoding roughly 200 finding aids in EAD XML would be very time-consuming. ChatGPT-4o was tested on a sample of three finding aids of varying complexity, with experimenters providing context and examples.
They found that while ChatGPT was able to generate EAD XML for simple finding aids, the output did not always validate, was inconsistently formatted, and was sometimes shortened rather than fully encoded. The process also took 2-4 hours per finding aid, and the tool would often time out and have to be prompted by a human to finish. ChatGPT also had trouble with finding aids that contained extensive hierarchical description.
Overall, using AI for this task introduced extra steps into the process of encoding and uploading the finding aids into the new system: the tool had to be prompted numerous times and provided with context and examples, and its output had to be reviewed and corrected before moving on to the upload step.
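Because the generated XML did not always validate, an automated schema check is a natural gatekeeper before anything reaches ArchivesSpace. The sketch below is illustrative rather than part of the working group's workflow; it assumes Python with lxml and a locally downloaded copy of the EAD 2002 schema (ead.xsd), and the file names are placeholders.

```python
from lxml import etree

# Load the EAD 2002 schema; the local path is a placeholder.
schema = etree.XMLSchema(etree.parse("ead.xsd"))

# Parse the AI-generated finding aid and validate it before any upload attempt.
doc = etree.parse("ai_generated_finding_aid.xml")

if schema.validate(doc):
    print("Finding aid validates against the EAD schema.")
else:
    # Report each validation error with its line number so a human editor can correct it.
    for error in schema.error_log:
        print(f"Line {error.line}: {error.message}")
```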
LCSH Genre Headings: This project explored using AI to conduct subject and genre analysis and apply controlled vocabularies to finding aids. Experimenters compared ChatGPT-4o and Claude 3 Opus, using short learning prompts that included context and examples.
They asked the AI tools to generate at least three valid Library of Congress Subject Headings based on the subject matter of the finding aid and create 3-7 genre headings based on the types of materials outlined.
Results showed that both tools were inconsistent. Both invented Name Authority Records for people and organizations when creating LCSH. Claude struggled with both tasks, inventing LCSH and conflating vocabularies when asked to create genre headings. ChatGPT performed better with genre headings.
Despite these issues, using AI tools for this task showed potential to improve the overall workflow. AI sped up the cataloging process and freed up librarians’ time for more complex tasks. It offered a set of headings that catalogers could use as a first draft or starting point and then revise before use.
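Because both tools sometimes invented headings, one lightweight safeguard is to screen each proposed heading against the Library of Congress linked data service before it reaches a cataloger. The following Python sketch is illustrative only; it assumes id.loc.gov's known-label lookup for LCSH (which redirects when a label matches an authorized heading), and the example headings are hypothetical.

```python
import urllib.parse
import requests

def is_authorized_lcsh(label: str) -> bool:
    """Return True if the label resolves to an authorized LCSH heading on id.loc.gov."""
    # The known-label service answers with a redirect when the label exists
    # and a 404 when it does not, so redirect-following is disabled.
    url = "https://id.loc.gov/authorities/subjects/label/" + urllib.parse.quote(label)
    response = requests.get(url, allow_redirects=False, timeout=10)
    return 300 <= response.status_code < 400

# Hypothetical AI-proposed headings to screen before human review.
for heading in ["Universities and colleges", "Entirely Invented Heading Example"]:
    status = "authorized" if is_authorized_lcsh(heading) else "not found; flag for review"
    print(f"{heading}: {status}")
```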
Russian Texts: Using Claude 3 Opus, this subgroup tested AI’s ability to translate and generate MARC records for Russian-language history books in AU’s collection. Claude was provided with images of a book’s title page and back cover, along with some context and an example of a Russian-language MARC record. Prompts were iteratively refined. Claude did a passable job of creating a Russian MARC record, which the cataloger then corrected, and no errors were found in the Russian-English translation.
Image Description: The goal of this task was to produce short, accurate image descriptions to ease the workload of metadata teams. Fifteen images from the Japanese Prints and Manuscripts, Peace Corps Community Archive, and University Photographs collections were provided to both ChatGPT-4o and Claude 3.5 Sonnet.
Both tools produced inconsistent descriptions, often providing inaccurate descriptions of perspective, color, language, meaning, and inferred context. The tools also demonstrated inherent biases with sensitive images. Accuracy improved when detailed, collection-specific context was provided and when constraints on location, time period, or sentence length were introduced. ChatGPT provided more objective and concise descriptions than Claude. This process demonstrated that AI tools can create a good first draft, but that human review of their outputs remains important.
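For illustration, the sketch below shows how the constraints that improved accuracy (collection context, time period, and sentence limits) might be combined into a single prompt using the OpenAI Python SDK. The model name, file path, and wording are assumptions for the sake of the example, not the prompts the subgroup actually used.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode a local scan; the path and collection details here are hypothetical.
with open("peace_corps_photo_001.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt = (
    "Describe this photograph from the Peace Corps Community Archive in no more than "
    "two sentences. The image dates from the 1960s-1980s. Describe only what is visible; "
    "do not infer location, names, or the subjects' intentions."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=120,
)
print(response.choices[0].message.content)
```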
The main goal of this subgroup was to use AI to identify materials owned by the AU Library that could be connected with AU faculty. The subgroup experimented with two separate projects in service of this goal:
Collection Promotion: The group investigated whether AI could create a list of materials and then refine it into specific and appropriate research areas to match with faculty. They tested ChatGPT 3.5 and Copilot. Using three AU faculty members as a test case, the AI tools were asked to identify materials held by AU in each faculty member’s research area, limit the results to a recent publication timeframe, and suggest appropriate Library of Congress classes.
The results of this experimentation were mixed. Neither tool was able to identify materials held by AU. ChatGPT struggled to restrict itself to materials published within a given timeframe, although Copilot managed similar date restrictions. For some research areas, both tools did an acceptable job of identifying appropriate Library of Congress classes, but specificity was lacking, and they sometimes invented examples or duplicated results.
Overall, using the AI tools for this task involved a lot of manual intervention to verify the accuracy of the results, and experimenters often had to break the task down into smaller steps for the tools. However, AI was a decent assistant and could help narrow down titles to a manageable count for later human review.
Workflow Creation: The second task was to investigate whether AI could create a workflow for identifying journals where AU faculty publish, in order to protect those publications from cuts. Tools tested were ChatGPT 3.5 and 4o, Microsoft Power Automate with Copilot, and Microsoft Visual Studio Code with Cody. Experimenters created a sample dataset of 25 randomly selected faculty members with work stored in AU’s digital repository. They then prompted ChatGPT to write code that works with Google Scholar’s API to retrieve those faculty members’ publications from the last five years. Finally, they worked with Copilot for Power Automate to complete the workflow.
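The scripts produced during the experiment are not reproduced in this report, but the sketch below illustrates the kind of retrieval code involved, using the third-party scholarly package to query Google Scholar. The faculty name and the five-year window are placeholders, and the first profile returned by the search is assumed to be the right one, which in practice needs human confirmation.

```python
from datetime import date
from scholarly import scholarly  # third-party Google Scholar client

CUTOFF_YEAR = date.today().year - 5

def recent_publications(faculty_name: str):
    """Return (title, year) pairs for a faculty member's publications from the last five years."""
    # Take the first matching author profile and fill in its publication list.
    author = next(scholarly.search_author(faculty_name))
    author = scholarly.fill(author, sections=["publications"])

    results = []
    for pub in author["publications"]:
        year = pub["bib"].get("pub_year")
        if year and int(year) >= CUTOFF_YEAR:
            results.append((pub["bib"].get("title"), int(year)))
    return results

# Hypothetical faculty member, used for illustration only.
for title, year in recent_publications("Jane Doe, American University"):
    print(f"{year}: {title}")
```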
Findings showed that the R and Python code generated by ChatGPT 4o was accurate and needed only small updates from the user. Although moving data to ChatGPT was a manual process, the tool performed well enough as a co-coder to eliminate most of the need for Cody in Visual Studio Code.
Overall, using AI for this purpose required significant manual guidance and user knowledge. For example, creating the workflow involved multiple back-and-forth “chat” sessions with ChatGPT to refine prompts for each new exception, as it was not capable of recognizing exceptions without assistance. In addition, Copilot for Power Automate was not capable of generating lengthy sequences of complex actions, so experimenters had to guide it step-by-step through the desired workflow. Despite this collaborative approach to ensure accuracy, the AI tools were not able to create a finished product to accomplish the desired task.
This subgroup investigated the potential of AI to improve daily operations in units handling support requests via a ticketing system. Subgroup members identified the AV Airtable database (Airtable is a SaaS database tool) as an ideal use case, making it the focus of an experimental deployment of ChatGPT’s Data Analyst tool. The Ticketing and Support subgroup focused not only on AI’s ability to streamline data analysis, but also on identifying which tasks are most suitable for AI assistance in the future.
Experimenters used an iterative process, discovering new answers as they learned to deploy the Data Analyst tool. They started by exporting the Airtable database and anonymizing it with random user IDs to preserve worker and client privacy. The group identified barriers to analysis in the database, such as nulls, repeats, errors, and formatting issues. Experimenters tested various enterprise AI solutions (Google Gemini, Microsoft Copilot, ChatGPT) to clean this data, ultimately settling on ChatGPT’s Data Analyst. They then generated several priority questions (using insights from AV staff) and sample prompts to perform data analysis with the AI tool.
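A minimal sketch of the export-and-anonymize step is shown below, assuming the Airtable base has been exported to CSV and using pandas; the file and column names are hypothetical rather than the actual AV database schema.

```python
import uuid
import pandas as pd

# Load the exported Airtable data; the file name is a placeholder.
df = pd.read_csv("av_tickets_export.csv")

# Replace identifying values with random IDs so no names reach the AI tool.
for column in ["Requester", "Assigned Staff"]:
    mapping = {value: uuid.uuid4().hex[:8] for value in df[column].dropna().unique()}
    df[column] = df[column].map(mapping)

# Surface the barriers the group noted (nulls, repeats, formatting issues)
# before handing the file to the Data Analyst tool.
print("Null values per column:")
print(df.isna().sum())
print("Duplicate rows:", df.duplicated().sum())

df.to_csv("av_tickets_anonymized.csv", index=False)
```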
Overall, Data Analyst proved helpful in data cleanup, generating logical operators, merging sheets, and transforming unstructured data into insights. However, the tool struggled with Excel documents containing multiple tabs or images, and occasionally different team members received different results with the same prompts. Manual checks remained necessary, which required familiarity with the data.
Although each subgroup was applying AI tools to a different set of library tasks and workflows, common themes emerged during experimentation.
The first theme was that although each subgroup discovered ways to use Generative AI tools to achieve their project goals, experimenters were unsure how applicable their prompts and results might be to a broader set of situations. In other words, each subgroup found ways to compel the AI tool to do a small set of specific tasks that might not be generalizable. More experimentation will be needed to develop approaches that can be used more broadly across the library.
A second common theme to emerge across subgroups involved consistency. In some cases, and with some tools, subgroup members found that using the same prompts in the same order could produce different results. This occurred both for individual users at different times and across users within groups.
Lastly, experimenters often found it challenging to get AI to complete tasks accurately and in alignment with library needs, even after extended experimentation. For many of the tasks, AI only finished the job halfway. The process of working with AI was time-consuming, required frequent adjustments and human oversight, and demanded significant resources, including staffing and time.