This dataset contains Parquet files obtained by processing geotagged tweets posted between 2015 and 2021. The spatial unit of aggregation (`cell_id`) is the MSOA (Middle layer Super Output Area) of England and Wales. An MSOA is included only if at least 15 users who wrote more than 100 words were identified as its residents. File names include some of the threshold parameters; more details about them can be found in the paper. Here is a short description of the content of each file:
- `cells_mistakes`: frequency of mistakes averaged over residents of MSOAs (`cell_id`), by `cat_id` and `rule_id`, as defined by `LanguageTool`.
- `inter_cell_od_{city}...`: proportion of stays made by residents of given cells (`res_cell_id`) in other cells (`cell_id`).
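
As a rough illustration of how an `inter_cell_od_{city}...` table might be used, the sketch below pivots a long origin-destination table into a wide matrix with pandas. Only `res_cell_id` and `cell_id` come from the description above; the value column name `prop`, the helper `od_matrix`, and the toy cell identifiers are assumptions made for the example, so check the actual column names in the files before reusing it.

```python
import pandas as pd


def od_matrix(od: pd.DataFrame, value_col: str = "prop") -> pd.DataFrame:
    """Pivot a long origin-destination table into a residence x destination matrix.

    `value_col` is a hypothetical name for the column holding the proportion
    of stays; replace it with the column actually present in the file.
    """
    return od.pivot_table(
        index="res_cell_id",   # cell of residence (rows)
        columns="cell_id",     # cell where the stays were made (columns)
        values=value_col,
        aggfunc="sum",
        fill_value=0.0,        # pairs absent from the table count as zero
    )


# Toy data standing in for a real file. With the actual dataset one would
# load a table with pd.read_parquet(path), where `path` points to one of the
# `inter_cell_od_{city}...` files.
toy = pd.DataFrame(
    {
        "res_cell_id": ["A", "A", "B"],
        "cell_id": ["A", "B", "B"],
        "prop": [0.75, 0.25, 1.0],
    }
)

matrix = od_matrix(toy)
row_sums = matrix.sum(axis=1)  # total proportion of stays per residence cell
print(matrix)
```

In this toy table each row sums to 1 because every stay of a cell's residents is listed. Whether that also holds for the real files depends on how the proportions were normalised, which the paper should clarify.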