**Paper:** https://arxiv.org/abs/1911.03855

**Data:** The data consists of three files:

- *county_outcomes*: dense table with county drinking and socio-demographic variables
- *topics*: topic frequencies for each county in sparse format
- *unigrams*: word frequencies for each county in sparse format

Features are based on our best model:

- Raking on both income and education
- Smoothing k = 10
- Minimum Adaptive Binning size = 50

**Data Files:** Data is available in both CSV and MySQL formats:

- CSV
    - county_outcomes.csv
    - topics.csv.zip
    - unigrams.csv.zip
- MySQL
    - correcting_selection_bias_icwsm2022.sql.zip

**Analysis:** All analysis was run using the [DLATK Python package][1]. This package uses MySQL, so the data is also made available in a single, convenient SQL dump.

**Code:** We have released code to allow you to run our methods on your data. See the GitHub repo [here][2].

**Citation:** If you use this data in your work, please cite:

    @article{giorgi2022correcting,
      title={Correcting Sociodemographic Selection Biases for Population Prediction from Social Media},
      author={Salvatore Giorgi and Veronica Lynn and Keshav Gupta and Farhan Ahmed and Sandra Matz and Lyle Ungar and H. Andrew Schwartz},
      year={2022},
      journal={Proceedings of the International AAAI Conference on Web and Social Media},
    }

[1]: http://dlatk.wwbp.org
[2]: https://github.com/wwbp/robust-poststratification
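
**Example (unofficial sketch):** The *topics* and *unigrams* files are described above as sparse, so each row presumably holds one (county, feature, value) triple rather than one county per row. The snippet below is a minimal sketch of reading such a file with the Python standard library and expanding it into a dense county-by-feature matrix. The column names `group_id`, `feat`, and `group_norm` are assumptions, not something this page confirms, so check the CSV header and pass the real names. The helper names and the toy rows are invented for illustration only and are not part of the released data or code.

```python
import csv
import io
import zipfile
from collections import defaultdict


def read_sparse_features(lines, group_col="group_id", feat_col="feat",
                         value_col="group_norm"):
    """Parse sparse (county, feature, value) rows into {county: {feature: value}}.

    The default column names are assumptions; adjust them to the real header.
    """
    reader = csv.DictReader(lines)
    features = defaultdict(dict)
    for row in reader:
        features[row[group_col]][row[feat_col]] = float(row[value_col])
    return dict(features)


def read_zipped_csv(zip_path, member=None, **column_names):
    """Read a sparse feature CSV from a zip archive.

    Assumes the archive holds a single CSV unless `member` is given.
    """
    with zipfile.ZipFile(zip_path) as archive:
        name = member or archive.namelist()[0]
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return read_sparse_features(text, **column_names)


def to_dense(features):
    """Expand the nested dict into (counties, vocabulary, matrix).

    Features missing for a county are filled with 0.0.
    """
    counties = sorted(features)
    vocab = sorted({feat for feats in features.values() for feat in feats})
    matrix = [[features[c].get(f, 0.0) for f in vocab] for c in counties]
    return counties, vocab, matrix


# Toy rows invented for illustration; these are not values from the dataset.
sample = io.StringIO(
    "group_id,feat,group_norm\n"
    "01001,beer,0.002\n"
    "01001,wine,0.001\n"
    "01003,beer,0.004\n"
)
features = read_sparse_features(sample)
counties, vocab, matrix = to_dense(features)
```

For the real files, something like `read_zipped_csv("unigrams.csv.zip")` would be the entry point, again assuming the column names match. Keeping county identifiers as strings preserves leading zeros in FIPS-style codes, which matters when joining against *county_outcomes*.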