Loyola University Maryland

Emerging Scholars

April Crompton, Michelle Ballard, Bu Hyoung Lee, Ph.D.

Data Triage on Transcribed Tapes from Nixon Watergate Trials

Abstract:

The Watergate scandal was a major political scandal of the 1970s that led to President Nixon's resignation. The data for this project are drawn from a portion of the approximately 60 hours of tape subpoenaed by the Watergate Special Prosecution Force (WSPF). The tapes include fully transcribed White House conversations that were deemed evidentiary to the Watergate trial. President Nixon chose to record his meetings because he was concerned that they were not always reported accurately, and he did not want his private discussions to be misconstrued publicly to the benefit of others.

This project supports data triage by improving methods for searching and exploring large text datasets, and by making more economical and effective use of such datasets for analysis. It will explore methods that enable analysts to identify information of interest more quickly within large datasets of transcribed voice collection. For example, keyword searches on transcribed data return a great deal of noise, and on a dataset this large the results are too noisy to be useful. We will explore methods that enable analysts to find relevant data by tagging/posturing the text dataset for rapid retrieval.
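One way to picture tagging a dataset for rapid retrieval is an inverted index from tags to transcript segments. The sketch below is only an illustration under stated assumptions: the segment IDs, the tag names, and the helper functions `build_tag_index` and `find_segments` are hypothetical, and the tags are assumed to have been assigned by some earlier step.

```python
from collections import defaultdict


def build_tag_index(tagged_segments):
    """Invert {segment_id: tags} into {tag: set of segment_ids}."""
    index = defaultdict(set)
    for segment_id, tags in tagged_segments.items():
        for tag in tags:
            index[tag].add(segment_id)
    return dict(index)


def find_segments(index, *tags):
    """Return the segments that carry every one of the requested tags."""
    if not tags:
        return set()
    result = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        result &= index.get(tag, set())
    return result


# Hypothetical segment IDs and tags, invented for illustration only.
tagged_segments = {
    "seg-001": {"topic_0"},
    "seg-002": {"topic_1"},
    "seg-003": {"topic_0", "topic_1"},
}

index = build_tag_index(tagged_segments)
both = find_segments(index, "topic_0", "topic_1")
```

With an index like this, an analyst looks up a tag once instead of scanning every transcript, and combining tags narrows the result set rather than widening it.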

The exploration leverages topic modeling, an unsupervised machine learning technique, to segment the corpus by detecting word patterns or themes within it. Topic modeling was chosen because it is a data-driven approach to identifying the hidden, or latent, topics in the data. Large segments of the corpus will be modeled to determine these topics, which are sets of words associated with defined segments. Topic modeling supports data triage by categorizing the data into bins based on these topics, which allows analysts to explore the specific bins of data relevant to their interests.
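The binning idea can be sketched with a small topic model. The abstract does not name a specific algorithm, so the example below assumes latent Dirichlet allocation (LDA) fitted by collapsed Gibbs sampling, written in plain Python. The function names, the hyperparameters, and the toy segments are all hypothetical; the segments are invented for illustration and are not taken from the transcripts.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics=2, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns per-document and per-topic counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z = [rng.randrange(num_topics) for _ in doc]
        assignments.append(z)
        for w, k in zip(doc, z):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """List the most frequent words of each topic (the 'set of words' for that topic)."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


def assign_bins(doc_topic):
    """Place each segment in the bin of its dominant topic."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in doc_topic]


# Toy tokenized segments, invented for illustration only.
toy_segments = [
    "budget money payment fund money".split(),
    "payment fund budget money".split(),
    "meeting schedule calendar meeting".split(),
    "calendar schedule meeting agenda".split(),
]

doc_topic, topic_word = fit_lda(toy_segments, num_topics=2)
topics = top_words(topic_word)
bins = assign_bins(doc_topic)
```

Here `topics` holds the top words for each latent topic and `bins` holds one bin number per segment. A real corpus would more likely use an established library implementation and a tuned number of topics; the hand-written sampler is only meant to make the mechanics visible.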

This project is of interest because it aims to build the tools to address existing data triage challenges related to national security and technology, leading to better solutions for real-world problems.