This document explains the nature of the datasets so that the data can be used appropriately in future projects. The datasets are intended for single-class text classification projects. We created three different datasets from Twitter, starting from data gathered by LINKS Foundation in the period 2020-2021. The tweets concern catastrophes and hazards such as wildfires, hurricanes, terrorist attacks and so on.

Several major catastrophes happened during this wide time span (e.g., COVID-19), along with growing concern about global warming and hazards induced by the climate crisis. As a result, the distribution of examples across classes was highly unbalanced: the vast majority of the tweets gathered talked about COVID-19. These included not only casualty counts, reports and news, but also opinions about lockdown measures, vaccines and other COVID-related issues.

The data was originally gathered by retrieving tweets that contained certain keywords: each label has a set of corresponding keywords that describe different shades of a hazard. This approach was used to gather large numbers of relevant examples and to create a dataset that could contain every different way of referring to the given disasters. We can consider this dataset labeled in a distantly supervised fashion, that is, the labels were assigned automatically based on the keywords used to retrieve the items.

The three datasets built are:

1. The "Gold Twitter dataset": this dataset contains **1000** examples that were correctly classified during the initial retrieval phase.
2. The "Keyword-out-of-context dataset": this dataset contains **100** examples that were labeled with a certain class just because one of its keywords appeared in the tweet, although the tweet does not actually talk about a hazard (e.g. "This team is on **fire**" or "This company is **flooding** the market").
3. The "Multiple-keywords dataset": as the name says, each of the **100** tweets in this dataset mentions at least two hazards, even though it was labeled with a single class.
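
The keyword-based labeling described above can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline actually used to build the datasets: the keyword sets in `HAZARD_KEYWORDS` and the function name `matched_labels` are hypothetical, since the real keyword lists are not given in this document.

```python
import re

# Hypothetical keyword sets, for illustration only; the actual keywords
# used to retrieve the tweets are not listed in this document.
HAZARD_KEYWORDS = {
    "wildfire": {"fire", "wildfire", "burning"},
    "flood": {"flood", "flooding", "flooded"},
    "hurricane": {"hurricane", "cyclone"},
}


def matched_labels(text):
    """Return the sorted labels whose keywords occur as whole words in text."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    return sorted(
        label for label, keywords in HAZARD_KEYWORDS.items() if tokens & keywords
    )


# A figurative use still matches: the kind of tweet that ends up in the
# "Keyword-out-of-context dataset".
out_of_context = matched_labels("This team is on fire")

# Two hazards match at once: the kind of tweet that ends up in the
# "Multiple-keywords dataset" when only one label is kept.
multiple = matched_labels("Hurricane brought flooding to the coast")
```

Matching on keywords alone cannot tell a literal mention from a figurative one, and it yields several candidate labels when a tweet mentions more than one hazard. These are the two situations that the second and third datasets are meant to capture.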