More Evidence that Twitter Language Predicts Heart Disease: A Response and Replication
- Johannes C. Eichstaedt
- H. Andrew Schwartz
- Salvatore Giorgi
- Margaret L. Kern
- Gregory Park
- Maarten Sap
- Darwin R. Labarthe
- Emily E. Larson
- Martin Seligman
- Lyle H. Ungar
Category: Project
Description: A recent preprint by Brown and Coyne titled “No Evidence That Twitter Language Reliably Predicts Heart Disease: A Reanalysis of Eichstaedt et al.” claims to reanalyze our 2015 article published in Psychological Science, “Twitter Language Predicts Heart Disease Mortality,” and disputes its primary findings. While we welcome scrutiny of the study, Brown and Coyne’s paper does not in fact report a reanalysis; rather, it presents a new analysis relating Twitter language to suicide instead of heart disease mortality. In our original article, we showed that Twitter language, fed into standard machine learning algorithms, was able to predict (i.e., estimate cross-sectionally) the out-of-sample heart disease rates of U.S. counties. Further, in a separate analysis, we found that the dictionaries and topics (i.e., sets of related words) that best predicted county atherosclerotic heart disease mortality rates included language related to education and income (e.g., “management,” “ideas,” “conference”), negative social relationships (“hate,” “alone,” “jealous”), disengagement (“tired,” “bored,” “sleepy”), negative emotions (“sorry,” “mad,” “sad”), positive emotions (“great,” “happy,” “cool”), and psychological engagement (“learn,” “interesting,” “awake”). Beyond conducting a new analysis (correlating Twitter language with suicide rates), Brown and Coyne also detail a number of methodological limitations of group-level and social media-based studies. We discussed most of these limitations in our original article, but we welcome this opportunity to emphasize some of the key aspects and qualifiers of our findings, considering each of their critiques and how it relates to our findings.
Of particular note, even though we discuss our findings in the context of what is known about the etiology of heart disease at the individual level, we reiterate here a point made in our original paper: individual-level causal inferences cannot be made from the cross-sectional, group-level analyses we presented. Our findings are intended to provide a new epidemiological tool that takes advantage of large amounts of public data and that complements, rather than replaces, definitive health data collected through other means.
We offer preliminary comments on the suicide language correlations. Previous studies have suggested that county-level suicides are relatively strongly associated with living in rural areas (Hirsch et al., 2006; Searles et al., 2014) and with county elevation (Kim et al., 2011; Brenner et al., 2011). When we control for these two confounds, the dictionary associations reported by Brown and Coyne are no longer significant. We conclude that their analysis is largely unrelated to our study and does not invalidate the findings of our original paper.
In addition, we offer a replication of our original findings across more years, with a larger Twitter data set. We find that (a) Twitter language still predicts county atherosclerotic heart disease mortality with the same accuracy, and (b) the specific dictionary correlations we reported are largely unchanged on the new data set. To help other researchers reproduce our original work, we are also re-releasing the data and code needed to reproduce our original findings, in a more user-friendly form. We will do the same for this replication upon publication.