***Replication Data for "Van Hee, Jacobs, Emmery, et al. 2018. Automatic Detection of Cyberbullying in Social Media Text".***

Accompanying data and metadata for the [preprint][1] and submission of Van Hee, Jacobs, Emmery, et al. 2018. Automatic Detection of Cyberbullying in Social Media Text.

-----

**Download experiment data.**

Here we provide download links to the featurized dataset vector files for both languages, as well as metadata files for indexing feature types and corpus document identifiers.

**English data**

* *EN_feature_vectors.svm.gz*: Feature vector file (3.6 GB, 8.4 GB decompressed) in SVMLight format, gzip compressed. Can be used directly in most ML libraries (e.g., the [scikit-learn loading function](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html) and LibSVM);
* *EN_devset_holdout_indices.json*: JSON holdout indices (780 KB) to split off the same held-in and holdout instance sets as in the paper experiments. Indexes the SVMLight file by row;
* *EN_feature_map_dict.pkl*: Feature type mapping dictionary (33 MB) for mapping the columns of the SVMLight file to their feature types (e.g., word 3-grams: columns 0-14230). This file is a Python 2.7.12 serialized object using the standard pickle module;
* *EN_svm_postid_list.pkl*: Document identifier mapping dictionary (4.8 MB) for mapping the row indices of the SVMLight file to document ids of the AMiCA Cyberbullying corpus. This file is a Python 2.7.12 serialized object using the standard pickle module.

**Dutch data**

* *NL_feature_vectors.svm.gz*: Feature vector file (2.5 GB, 5.8 GB decompressed) in [SVMLight](http://svmlight.joachims.org/) format, gzip compressed. Can be used directly in most ML libraries (e.g., the [scikit-learn loading function](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html) and LibSVM);
* *NL_devset_holdout_indices.json*: JSON holdout indices (526 KB) to split off the same held-in and holdout instance sets as in the paper experiments. Indexes the SVMLight file by row;
* *NL_feature_map_dict.pkl*: Feature type mapping dictionary (31 MB) for mapping the columns of the SVMLight file to their feature types (e.g., word 3-grams: columns 0-14230). This file is a Python 2.7.12 serialized object using the standard pickle module;
* *NL_svm_postid_list.pkl*: Document identifier mapping dictionary (3.3 MB) for mapping the row indices of the SVMLight file to document ids of the AMiCA Cyberbullying corpus. This file is a Python 2.7.12 serialized object using the standard pickle module.

**Topic model seed terms and BootCat scrape replication data**

* *en_nl_topic_model_replication_data.tar.gz*: Tarball containing the seed terms per cyberbullying subtype used in the corpus bootstrapping tool BootCat to collect the topic model background corpus. Also contains the URLs and the original data found by BootCat.

**Overview of all tested system results.**

Here we provide an Excel spreadsheet of all tested system results. We tested 31 feature combinations per language. Due to space restrictions, we could not report all 62 system results in the paper.

* *cyberbullying_detection_2018_all_results.xlsx*: Excel spreadsheet of all results (43 KB).

**Contact.**

Please contact us with any questions regarding this research or the data provided. Contact information can be found at the following links:

* [Cynthia Van Hee][2]
* [Gilles Jacobs][3]

-----

[1]: https://arxiv.org/abs/1801.05617
[2]: https://orcid.org/0000-0001-7365-6703
[3]: https://orcid.org/0000-0001-8846-3015
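
**Loading the files (sketch).**

The files listed above can be read from Python 3 along the following lines. This is a minimal sketch and not the authors' own code. The helper names (`load_py2_pickle`, `load_holdout_indices`, `split_rows`, `load_vectors`) are hypothetical, and the layout of the holdout JSON file is an assumption (a flat list of row indices) that should be checked against the actual file. Because the `.pkl` files were written by Python 2.7, the sketch passes `encoding="latin1"` to `pickle.load` so that Python 2 byte strings decode under Python 3.

```python
import json
import pickle


def load_py2_pickle(path):
    """Load a pickle written by Python 2.7 from Python 3."""
    # encoding="latin1" lets Python 3 decode Python 2 byte strings
    # instead of raising UnicodeDecodeError on non-ASCII content.
    with open(path, "rb") as handle:
        return pickle.load(handle, encoding="latin1")


def load_holdout_indices(path):
    """Read the holdout row indices from the JSON file.

    ASSUMPTION: the file holds a flat list of integer row indices.
    Inspect the file first; the real layout may differ.
    """
    with open(path, "r", encoding="utf-8") as handle:
        return [int(i) for i in json.load(handle)]


def split_rows(n_rows, holdout_indices):
    """Return (heldin, holdout) row index lists for a matrix with n_rows rows."""
    holdout = sorted(set(holdout_indices))
    holdout_set = set(holdout)
    heldin = [i for i in range(n_rows) if i not in holdout_set]
    return heldin, holdout


def load_vectors(path):
    """Load the feature vectors as (sparse matrix X, label array y)."""
    # Imported here so the helpers above work without scikit-learn installed.
    from sklearn.datasets import load_svmlight_file

    # load_svmlight_file decompresses paths ending in ".gz" on the fly.
    return load_svmlight_file(path)
```

With these helpers, `X, y = load_vectors("EN_feature_vectors.svm.gz")` followed by `heldin, holdout = split_rows(X.shape[0], load_holdout_indices("EN_devset_holdout_indices.json"))` yields index lists, and `X[heldin]`, `y[heldin]`, `X[holdout]`, `y[holdout]` select the corresponding rows. The decompressed English file is 8.4 GB, so loading it whole may require a machine with ample memory.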