**Paper:**
https://arxiv.org/abs/1911.03855
**Data:**
The data consists of three files:
- *county_outcomes*: dense table with county drinking and socio-demographic variables
- *topics*: topic frequencies for each county in sparse format
- *unigrams*: word frequencies for each county in sparse format
Features are based on our best model:
- Raking on both income and education
- Smoothing k = 10
- Minimum Adaptive Binning size = 50
**Data Files:**
Data is available in both CSV and MySQL formats:
- CSV
    - county_outcomes.csv
    - topics.csv.zip
    - unigrams.csv.zip
- MySQL
    - correcting_selection_bias_icwsm2022.sql.zip
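
Because *topics* and *unigrams* are stored in sparse (long) format, most analyses will need to pivot them into one row per county first. The sketch below shows one way to do that using only the Python standard library. The column names `group_id`, `feat`, and `group_norm` are assumptions borrowed from the usual DLATK feature-table layout, and the sample rows are made up for illustration; check the header of the unzipped CSV and adjust the arguments accordingly.

```python
import csv
import io
from collections import defaultdict


def sparse_to_dense(rows, group_col="group_id", feat_col="feat", value_col="group_norm"):
    """Pivot sparse (county, feature, value) rows into dense per-county vectors.

    Column names are assumptions; pass the names found in the actual CSV header.
    Features that are absent for a county are filled with 0.0.
    """
    table = defaultdict(dict)
    features = set()
    for row in rows:
        table[row[group_col]][row[feat_col]] = float(row[value_col])
        features.add(row[feat_col])
    columns = sorted(features)
    dense = {
        group: [feats.get(f, 0.0) for f in columns]
        for group, feats in table.items()
    }
    return columns, dense


# Made-up sample rows standing in for an unzipped topics or unigrams CSV.
sample = io.StringIO(
    "group_id,feat,group_norm\n"
    "01001,beer,0.002\n"
    "01001,wine,0.001\n"
    "01003,beer,0.004\n"
)
columns, dense = sparse_to_dense(csv.DictReader(sample))
```

For the real files, replace `sample` with an open handle to the unzipped CSV. County identifiers are kept as strings so that any leading zeros are preserved.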
**Analysis:**
All analyses were run using the [DLATK Python package][1]. DLATK uses MySQL, so the data is also provided as a single SQL dump.
**Code:**
We have released code that lets you run our methods on your own data. See the [GitHub repo][2].
**Citation:**
If you use this data in your work, please cite:
@article{giorgi2022correcting,
title={Correcting Sociodemographic Selection Biases for Population Prediction from Social Media},
author={Salvatore Giorgi and Veronica Lynn and Keshav Gupta and Farhan Ahmed and Sandra Matz and Lyle Ungar and H. Andrew Schwartz},
year={2022},
journal={Proceedings of the International AAAI Conference on Web and Social Media},
}
[1]: http://dlatk.wwbp.org
[2]: https://github.com/wwbp/robust-poststratification