[www.textthresher.org][1]

TextThresher is mass-collaboration software that allows researchers to direct hundreds of volunteers – working through the internet – to label tens of thousands of text documents according to all the concepts vital to researchers’ theories and questions. With TextThresher, projects that would have required a decade of effort, and the close training of wave after wave of research assistants, can be completed in about a year and a half online.

----------

**How Will People Use TextThresher?**

TextThresher is designed specifically for large and complex content analysis jobs that cannot be completed with existing automated algorithms. It is the ideal tool whenever automated approaches to textual data fail to recognize concepts vital to social scientists’ intricate theories, fail to tease out ambiguous or contextualized meanings, or fail to effectively parse relationships among, or sequences of, social entities.

If you are interested in performing a shallow sentiment analysis of Tweets, or developing an exploratory topic model of some corpus, you won’t need TextThresher. If you have a few dozen interviews to analyze, TextThresher is probably overkill. But if you want to extract hierarchically organized, openly validated, research-grade records of related social entities and concepts across thousands of longer documents, TextThresher is for you. Especially in this first beta version, it is ideally suited for the analysis of news events, historical trends, or the evolution of legal theories.

Here’s how it works: the crowd content analysis assembly line TextThresher enables is organized around two major steps.

First, annotators identify, across the researcher’s documents, text units (words, phrases, sentences) that correspond with the relatively small number of nodes at the highest level of the researcher’s hierarchically organized conceptual/semantic scheme. These high-level nodes describe the researcher’s units of analysis: the social units (be they individuals, events, organizations, etc.) described by all the variables and attributes at the lower-level nodes of the scheme. In contrast to old-style content analysis, an annotator using TextThresher does not even attempt the conceptually overwhelming task of applying dozens of different labels to a full document. They just label text units corresponding with the (usually) 3–6 highest-level concepts important to the researcher. This is comparatively easy work.

In the second step, TextThresher displays those smaller text units, each corresponding with just one case of one unit of analysis, to citizen scientists/crowd workers and guides them through a series of leading questions about the text unit. Since TextThresher already knows the text unit is about a certain type of unit of analysis, it asks only the questions relevant to that unit of analysis, prompting users to search for the variables/attributes of interest to the researcher. By answering this relatively short list of questions and highlighting the words justifying their answers, citizen scientists label the text exactly as highly trained research assistants would.
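To make that structure concrete, here is a minimal sketch of the two-step pipeline in Python. The concept scheme, the `TextUnit` record, and the function names are hypothetical illustrations under our own naming, not TextThresher’s actual data model.

```python
from dataclasses import dataclass

# Hypothetical conceptual/semantic scheme: top-level nodes are units of
# analysis; the nested questions stand in for the lower-level variable nodes.
CONCEPT_SCHEME = {
    "ProtestEvent": [
        "How many protesters were present?",
        "What tactics did the protesters use?",
    ],
    "PoliceAction": [
        "What kind of force, if any, did police use?",
    ],
}

@dataclass
class TextUnit:
    """A span labeled in step 1 as one case of one unit of analysis."""
    doc_id: str
    text: str
    unit_of_analysis: str

def step_one(doc_id, document, highlights):
    """Step 1: annotators highlight spans matching only the handful of
    top-level concepts. `highlights` stands in for the highlighter UI."""
    return [TextUnit(doc_id, document[start:end], concept)
            for start, end, concept in highlights]

def step_two(unit):
    """Step 2: show the unit to a crowd worker, asking only the questions
    relevant to its unit of analysis."""
    return [{"text_unit": unit.text, "question": q, "answer": None}
            for q in CONCEPT_SCHEME[unit.unit_of_analysis]]

doc = "About 300 protesters marched; police deployed tear gas."
units = step_one("article-17", doc,
                 [(0, 28, "ProtestEvent"), (30, 55, "PoliceAction")])
for task in step_two(units[1]):
    print(task["question"])  # only the PoliceAction questions appear
```

In the real interfaces the spans come from annotators’ highlights and the answers from crowd workers; the point is only that step 2 never asks a question outside the text unit’s branch of the scheme.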
Citizen scientists’ work goes much faster, and they are more accurate, because (1) they are only reading relatively short text units; (2) they only need to find a relatively short list of variables, guaranteed to be relevant to the text unit they are analyzing; and (3) the work is organized as a ‘reading comprehension’ task familiar to everyone who has graduated middle school.

TextThresher uses a number of transparent approaches to validate annotators’ labels, including gold-standard pre-testing, Bayesian voting weighted by annotator reputation scores, and active learning algorithms. All the labels are exportable as annotation objects consistent with W3C annotation standards, and they maintain their full provenance. So, in addition to scaling up content analysis for all the ‘big text data’ out there, TextThresher also brings the old method into the light of ‘open science.’
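The techniques above are named here without their mechanics, so the following is just one plausible, heavily simplified reading of reputation-weighted voting; the log-odds weighting and the reputation numbers are illustrative assumptions, not TextThresher’s actual algorithm.

```python
import math
from collections import defaultdict

def weighted_vote(answers):
    """Pick a label from (label, reputation) pairs, treating each annotator
    as right with probability equal to their reputation score in (0.5, 1)
    and summing log-odds evidence per label."""
    scores = defaultdict(float)
    for label, reputation in answers:
        # Trusted annotators contribute more evidence to their label.
        scores[label] += math.log(reputation / (1.0 - reputation))
    return max(scores, key=scores.get)

print(weighted_vote([("tear gas", 0.9),
                     ("pepper spray", 0.6),
                     ("tear gas", 0.7)]))  # -> tear gas
```

And because labels export as W3C-style annotation objects, a single exported label might look roughly like the dictionary below. The field values are invented, but the `@context`, the `TextQuoteSelector`, and the `creator`/`created` provenance fields follow the W3C Web Annotation data model.

```python
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "body": {"type": "TextualBody",
             "value": "PoliceAction: force used = tear gas"},
    "target": {
        "source": "http://example.org/corpus/article-17",  # invented URL
        "selector": {"type": "TextQuoteSelector",
                     "exact": "police deployed tear gas"},
    },
    # Provenance: who produced the label, and when.
    "creator": {"type": "Person", "nickname": "annotator-42"},
    "created": "2017-07-01T12:00:00Z",
}
print(json.dumps(annotation, indent=2))
```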
----------

**How Can I Get My Hands on TextThresher?**

Today, we are announcing that TextThresher lives. It moves data through all of its interfaces as it should. The interfaces are fully functional. (See the demo below.) And TextThresher can be deployed on Scifabric (PYBOSSA), our partner citizen (volunteer) science platform.

In the weeks and months to come, we will be testing TextThresher’s user experience, refining our label validation algorithms, and using TextThresher to collect data for the GoodlyLabs’ DecidingForce project, which analyzes thousands of news articles to uncover patterns in the interactions between police and protesters. Once we feel confident that TextThresher is working smoothly (probably around October 2017), we will invite researchers to apply to become beta users of the software. (If you already know you are excited to use TextThresher, feel free to shoot Nick an email and he will keep you updated about upcoming opportunities.) We hope to release TextThresher 1.0 to the general public in early 2018.

----------

**Demo (The Rough Cut)**

https://youtu.be/KyWjIANHrN8

(We will update this link as we iterate!)

----------

**Our Thanks**

TextThresher would not exist without the support and hard work of many people. We wish to first thank our institutional sponsors. The Hypothes.is “Open Annotation” Fund, the Alfred P. Sloan Foundation, and the Berkeley Institute for Data Science (BIDS) all provided seed funding that allowed us to hire creative and skilled developers. BIDS, too, provided workspace for meetings and support for Nick Adams. The D-Lab and Digital Humanities @ Berkeley also provided essential resources when the project was in its very early stages.

TextThresher’s viability also owes much to the encouragement of the annotation and citizen science communities. Dan Whaley, Benjamin Young, Nick Stenning, and Jake Hartnell of Hypothes.is are especially to blame for motivating and guiding our early efforts. Daniel Lombraña of Scifabric, Chris Lintott of Zooniverse, and Jason Radford of Volunteer Science also bolstered our hopes that the citizen science community would appreciate and use our tools.

And of course, TextThresher would not exist without the collective efforts, lost sleep, and careful programming of our talented and dedicated development team. From our earliest prototype till today, we have been fueled by the voluntary and semi-voluntary efforts of students and freelance developers across the Berkeley campus and Bay Area. As the person who got it all started at a point when I could just barely script my way out of a paper bag, I (Nick) especially wish to thank Daniel Haas, Fady Shoukry, and Tyler Burton for their early efforts architecting TextThresher’s backend and frontend (and for believing in the vision). Steven Elleman is largely responsible for our rather sophisticated (if we do say so!) highlighter tool. Jasmine Deng has built the reading comprehension interface that makes TextThresher so easy to use compared to QDAS packages. Flora Xue, with the mentorship of the busy and brilliant Stefan van der Walt, has refactored our data model through multiple improving iterations. And we can all count on TextThresher to become increasingly efficient thanks to the human-computer interactions enabled by Manisha Sharma’s hand-rolled ‘NLP hints’ module. All of this work has been helped along, too, by a number of volunteers like Allen Cao, Youdong Zhang, Aaron Culich, Arjun Mehta, Piyush Patil, and Vivian Fang, who have taken on quick but essential tasks across the TextThresher codebase.

Finally, I (Nick) have to express my deep gratitude for Norman Gilmore, our development team lead. Norman has not only played an essential role in architecting and writing code throughout TextThresher; he has also served as a patient and caring mentor to all of our developers, helping our team establish and maintain agile scrum practices, proper git etiquette, and a happy, grooving work rhythm. Thanks, Norman!

And thanks to all our friends, family, and colleagues who have been rooting for us. We did it! Our work is done! ;) (Haha!)

[www.textthresher.org][1]

[1]: http://www.textthresher.org