This project presents a means of collecting Reddit data, identifying users with messages of interest (here, messages about no longer having a job), and selecting neutral comparison pairs for those users.
Sampling methods are further described in the associated paper: [aclanthology.org/2023.wassa-1.41][1]
## Included Data ##
The included data files contain message histories from 14,872 Reddit users, separated into a target sample ([reddit_quitfire_raw.csv.xz][2]) and a paired comparison sample ([reddit_comparison_raw.csv.xz][3]).
Each sample has an aggregated raw subset ([reddit_quitfire_raw_1yr_weekly.csv.xz][4] and [reddit_comparison_raw_1yr_weekly.csv.xz][5]) and a version of that subset with texts scored by SALLEE and LIWC ([reddit_quitfire_scored_1yr_weekly_subset.csv.xz][6] and [reddit_comparison_scored_1yr_weekly_subset.csv.xz][7]).
The scored files are combined to form the final file for analysis: [reddit_combined_scored_1yr_weekly_subset.csv.xz][8]
See the [codebook][9] for descriptions of variables contained in these files.
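The compressed CSVs can be read directly in R, which decompresses `.xz` files transparently; for example:

```R
# Read the combined scored file; R handles .csv.xz decompression itself.
combined <- read.csv("reddit_combined_scored_1yr_weekly_subset.csv.xz")
str(combined)
```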
## Full Data ##
The full sample is too large to provide (the set used to make the included data files consists of nearly 3 million files, totaling over 160 GB), but a new, similar sample can be collected from Reddit using the included scripts.
Running the collection script ([0_collection.R][10]) will first create a `subreddits` directory in which recent messages from select subreddits will be downloaded. These are used to collect user names. The script will then create `user_comments_raw` and `user_posts_raw` directories in which it will download comments and submissions made by each user. This makes up the full sample.
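As a rough illustration of the first step (not the script itself), recent messages can be pulled from a subreddit's public JSON listing; the subreddit names below are placeholders, and `0_collection.R` additionally handles paging, rate limiting, and the per-user comment and submission downloads.

```R
# Minimal sketch of the subreddit collection step, using Reddit's
# public JSON listings; subreddit names are placeholders.
library(jsonlite)

collect_subreddit <- function(subreddit, out_dir = "subreddits") {
  dir.create(out_dir, showWarnings = FALSE)
  listing <- fromJSON(paste0(
    "https://www.reddit.com/r/", subreddit, "/new.json?limit=100"
  ))
  messages <- listing$data$children$data
  saveRDS(messages, file.path(out_dir, paste0(subreddit, ".rds")))
  unique(messages$author) # user names seed the per-user downloads
}

users <- unlist(lapply(c("jobs", "careerguidance"), collect_subreddit))
```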
## Target Identification ##
The target identification script ([1_target_identification.R][11]) first searches through the raw user files and creates the phase 1 search results file: [search_p1.rds][12]
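A simplified version of this search might scan each user's file for quit-related phrases; the pattern and file layout below are placeholders, not what `1_target_identification.R` actually uses.

```R
# Hypothetical phase 1 keyword search; pattern and columns are placeholders.
pattern <- "(?i)\\b(quit my job|got fired|laid off|lost my job)\\b"
files <- list.files("user_comments_raw", full.names = TRUE)
hits <- lapply(files, function(file) {
  comments <- readRDS(file)
  comments[grepl(pattern, comments$body, perl = TRUE), ]
})
saveRDS(do.call(rbind, hits), "search_p1.rds")
```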
Messages identified in this search phase are then processed with a dependency parser, and the results are stored in a newly created `parsed_p1` directory.
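In R, dependency parsing can be done with, for example, the spacyr package (whether the script uses this parser is an assumption here):

```R
# Dependency parsing with spacyr (one possible parser).
library(spacyr)
spacy_initialize()
parsed <- spacy_parse("I quit my job last week.", dependency = TRUE)
# parsed has token, lemma, pos, head_token_id, and dep_rel columns
```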
Once all messages are parsed, an initial refined set of matched sentences is presented for review and later use: [target_p1.csv][13]
A second phase of refinement applies dependency-based rules and searches for time references. These results are stored for review and later use: [target_p2.csv][14]
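One such rule might, for instance, require a first-person subject attached to a quit-related verb, with a regular expression for time references; the lemmas, rule, and pattern below are illustrative placeholders, not the rules applied in the script.

```R
# Illustrative dependency rule, using `parsed` from the spacyr example
# above: keep sentences where "I" is the subject of a quit-related verb.
verbs <- subset(parsed, lemma %in% c("quit", "resign"))
subjects <- subset(parsed, dep_rel == "nsubj" & tolower(token) == "i")
matched <- merge(
  verbs, subjects,
  by.x = c("doc_id", "sentence_id", "token_id"),
  by.y = c("doc_id", "sentence_id", "head_token_id")
)
nrow(matched) > 0 # TRUE if the sentence passes the rule

# Placeholder time-reference search
grepl(
  "\\b(yesterday|today|last (week|month|year)|\\d+ (days?|weeks?|months?) ago)\\b",
  "I quit my job last week.", ignore.case = TRUE
)
```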
A subset of the users identified in the second phase of refinement is selected in the target data script ([2_target_data.R][15]), which then creates the full and aggregated raw target files.
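The weekly aggregation (as in the `_1yr_weekly` files) might look something like this, with the input file and column names assumed for illustration:

```R
# Hypothetical weekly aggregation (input file and column names assumed).
d <- readRDS("target_sample.rds") # placeholder input
d$week <- format(as.Date(d$created), "%Y-%U") # year-week bins
weekly <- aggregate(
  text ~ user + week, data = d,
  FUN = function(texts) paste(texts, collapse = " ")
)
```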
## Comparison Selection ##
The comparison selection script ([3_comparison_selection.R][16]) starts by constructing a large user-by-subreddit matrix containing message counts, as well as dates of first and most recent activity and combined message lengths: [user_subreddit_matrix.rds][17]
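Such a count matrix can be built sparsely with `xtabs`, for example; the input columns here are assumed, and the stored matrix also carries the date and message-length information noted above.

```R
# Sketch of the user x subreddit count matrix (input columns assumed).
library(Matrix)
d <- readRDS("all_messages.rds") # placeholder; user and subreddit columns
m <- xtabs(~ user + subreddit, data = d, sparse = TRUE)
saveRDS(m, "user_subreddit_matrix.rds")
```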
This script also uses the target search files produced in the target identification phases to identify target users and exclude them as potential matches.
For each target user, similarity with all remaining candidate comparison users is measured from the user-by-subreddit matrix, and a comparison user is assigned. This results in the user pairs file: [user_pairs.rds][18]
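One way to implement such an assignment (the script's actual similarity metric and assignment order may differ) is a greedy cosine-similarity match over matrix rows:

```R
# Hedged sketch: greedy nearest-neighbor pairing by cosine similarity,
# using `m` from the sketch above; `target_users` is assumed to hold
# the identified target user names.
cosine_to <- function(a, B) {
  as.vector(B %*% a) / (sqrt(sum(a^2)) * sqrt(rowSums(B^2)))
}

pool <- setdiff(rownames(m), target_users) # candidates, excluding targets
pairs <- setNames(character(length(target_users)), target_users)
for (u in target_users) {
  sims <- cosine_to(m[u, ], m[pool, , drop = FALSE])
  pairs[u] <- pool[which.max(sims)]
  pool <- setdiff(pool, pairs[u]) # assign without replacement
}
saveRDS(pairs, "user_pairs.rds")
```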
Once comparison users are assigned, the comparison data script ([4_comparison_data.R][19]) uses the user pairs file to create the raw comparison files. This script is similar to the target data script, but it uses the raw target data to assign target messages.
## Text Processing ##
Once the raw target and comparison data files are created, they are processed by the process script ([5_process.R][20]), which scores texts with the Receptiviti API and produces the separate and combined `_scored_` files.
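The receptiviti R package offers one way to make these requests (API credentials are required); whether `5_process.R` calls it exactly this way, and the framework names below, are assumptions.

```R
# Sketch of scoring texts through the Receptiviti API; requires the
# RECEPTIVITI_KEY and RECEPTIVITI_SECRET environment variables, and
# the framework names here are assumptions.
library(receptiviti)
scored <- receptiviti(
  c("I quit my job today.", "Looking forward to the weekend."),
  frameworks = c("sallee", "liwc")
)
```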
## Additional Exclusions ##
The additional exclusions script ([6_additional_exclusions.R][21]) identifies some target users who should probably be excluded. This produces the [exclude_users.csv][22] file, which can be used to remove those users from data files.
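Applying the exclusion list might look like this (the user-identifier column name is an assumption):

```R
# Remove excluded users from a data file (the "user" column is assumed).
exclude <- read.csv("exclude_users.csv")
d <- read.csv("reddit_combined_scored_1yr_weekly_subset.csv.xz")
d <- d[!d$user %in% exclude$user, ]
```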
## Original Post Collection ##
Some target messages are comments within threads. These full threads were not originally collected, but they may provide context needed to understand the target messages.
The original post collection script ([7_collect_reply_ops.R][23]) collects the original posts (OPs) of quit messages that are comments; these are included as the `reply_to` and `reply_to_id` fields in the [reddit_quitcommentops.csv.xz][24] file, along with scores of the combined reply and message texts.
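For reference, a comment's original post can be looked up through Reddit's `/by_id` JSON endpoint given the comment's `link_id` (the post's fullname); this sketches only the lookup, not the `7_collect_reply_ops.R` implementation.

```R
# Sketch: fetch a comment's original post from Reddit's /by_id endpoint.
library(jsonlite)

get_op <- function(link_id) {
  # link_id is the t3_-prefixed fullname of the post a comment replies to
  listing <- fromJSON(paste0("https://www.reddit.com/by_id/", link_id, ".json"))
  listing$data$children$data[, c("id", "title", "selftext")]
}

get_op("t3_abc123") # hypothetical post fullname
```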
[1]: https://aclanthology.org/2023.wassa-1.41/
[2]: https://osf.io/yvxnw
[3]: https://osf.io/nqr52
[4]: https://osf.io/wga82
[5]: https://osf.io/jtbq9
[6]: https://osf.io/tbdy2
[7]: https://osf.io/nqpht
[8]: https://osf.io/9uatj
[9]: https://osf.io/xjth3
[10]: https://osf.io/b2hv6
[11]: https://osf.io/p2rt7
[12]: https://osf.io/f6qsg
[13]: https://osf.io/n3b8y
[14]: https://osf.io/xahrc
[15]: https://osf.io/ngqv8
[16]: https://osf.io/ke9pb
[17]: https://osf.io/kfg5t
[18]: https://osf.io/b32ny
[19]: https://osf.io/w7q8v
[20]: https://osf.io/n6sxq
[21]: https://osf.io/f2uv7
[22]: https://osf.io/jgy8k
[23]: https://osf.io/4dgsq
[24]: https://osf.io/74qa8