“Using and Developing the Weighted Cosine Similarity Score.” Paper presented at the Midwest Political Science Association Annual Meeting, Chicago, IL, April 2016. With Brice Acree, Eric Hansen, and Josh Jansa.

We highlight flaws in extant measures of text similarity used in automated text analysis and introduce a new technique, weighted cosine similarity, to address these flaws.
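For readers unfamiliar with the idea, the sketch below shows one generic way to weight a cosine similarity score: each term's contribution is scaled by a non-negative weight before the dot product and the norms are computed. This is an illustration under assumptions, not the definition used in the paper; the function name `weighted_cosine` and the per-term weight vector `w` are hypothetical.

```python
import math

def weighted_cosine(x, y, w):
    """Cosine similarity of term vectors x and y with per-term weights w.

    Assumed form (not taken from the paper):
        sum(w_i * x_i * y_i) / (sqrt(sum(w_i * x_i^2)) * sqrt(sum(w_i * y_i^2)))
    """
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y, and w must have the same length")
    if any(wi < 0 for wi in w):
        raise ValueError("weights must be non-negative")
    dot = sum(wi * xi * yi for xi, yi, wi in zip(x, y, w))
    norm_x = math.sqrt(sum(wi * xi * xi for xi, wi in zip(x, w)))
    norm_y = math.sqrt(sum(wi * yi * yi for yi, wi in zip(y, w)))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero (weighted) vectors
    return dot / (norm_x * norm_y)

# With uniform weights this reduces to ordinary cosine similarity.
doc_a = [1, 1]   # hypothetical term counts
doc_b = [1, 0]
print(weighted_cosine(doc_a, doc_b, [1, 1]))
# Giving the second term zero weight ignores it entirely.
print(weighted_cosine(doc_a, doc_b, [1, 0]))
```

With uniform weights the score is the familiar unweighted cosine; how the weights should be chosen is exactly the substantive question, and this sketch leaves it open.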

For the working paper, click here.

Classifying Dynamic Frames and Topics Through Supervised Text Mining

While the availability of text has exploded, accompanied by new machine learning methods for its analysis, no text mining method or application within political science has been shown to be resistant to the problem of concept drift. I test the ability of LASSO and ridge regression to classify frames and policy topics over time. To do so, I use one-minute speeches delivered on the floor of the US House over 25 years and hand coded for policy topic. Additionally, I use newspaper articles on immigration and same-sex marriage from the Media Frames Corpus, published over 25 years and hand annotated for frame use. Using these annotated articles, I establish that concept drift, meaning changes in the language associated with a specific concept, is present. Then, I compare two techniques for applying labels: first, a baseline model trained on a random sample drawn from the entire time frame and used to label the remaining documents; and second, an updating model that is trained on documents from the most recent 5 years and then updates as it applies labels to the remaining documents. To evaluate the models, I compare their predictions to coder labels and compare the performance of the baseline and updating models against each other. From this analysis, I establish that: 1) concept drift is a problem when applying topic codes and classifying frames; 2) the iterative approach introduced here identifies the vocabulary items that mark key concepts over time; and 3) an updating model using a smaller window and fewer observations performs as well as the baseline model. The updating approach significantly decreases the cost of building an annotated corpus, facilitates over-time analysis, and allows the researcher to trace shifts in the content of the concept of interest.
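The two training designs compared above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the document records, the field name `year`, and the helpers `baseline_split` and `recent_window_split` are hypothetical, and the classifier itself (LASSO or ridge) and the updating step are deliberately left out.

```python
import random

def baseline_split(docs, n_train, seed=0):
    """Baseline design: train on a random sample drawn from the whole time frame."""
    rng = random.Random(seed)
    indices = list(range(len(docs)))
    rng.shuffle(indices)
    train_idx = set(indices[:n_train])
    train = [d for i, d in enumerate(docs) if i in train_idx]
    rest = [d for i, d in enumerate(docs) if i not in train_idx]
    return train, rest

def recent_window_split(docs, window_years=5):
    """Updating design, first step only: train on the most recent window of years."""
    last_year = max(d["year"] for d in docs)
    cutoff = last_year - window_years + 1
    train = [d for d in docs if d["year"] >= cutoff]
    rest = [d for d in docs if d["year"] < cutoff]
    return train, rest

# Hypothetical corpus: one labeled document per year over 25 years.
docs = [{"year": y, "text": f"speech from {y}", "label": "topic"}
        for y in range(1990, 2015)]

base_train, base_rest = baseline_split(docs, n_train=5)
win_train, win_rest = recent_window_split(docs, window_years=5)
print(len(base_train), len(base_rest))
print(sorted(d["year"] for d in win_train))
```

Holding the number of training documents equal across the two designs, as in this toy setup, isolates the effect of where in time the training data come from.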

For a poster summarizing the working paper, click here.