This project presents a means of collecting Reddit data, identifying users with messages of interest (here, those relating to no longer having a job), and selecting neutral pairs for those users.

## Included Data ##

The included data files contain message histories from 14,872 Reddit users, separated into a target sample ([reddit_quitfire_raw.csv.xz][1]) and a paired comparison sample ([reddit_comparison_raw.csv.xz][2]). Each sample has a subset, aggregated raw file ([reddit_quitfire_raw_1yr_weekly.csv.xz][3] and [reddit_comparison_raw_1yr_weekly.csv.xz][4]) and a version with texts scored by SALLEE and LIWC ([reddit_quitfire_scored_1yr_weekly_subset.csv.xz][5] and [reddit_comparison_scored_1yr_weekly_subset.csv.xz][6]). The scored files are combined to form the final file for analysis: [reddit_combined_scored_1yr_weekly_subset.csv.xz][7].

See the [codebook][8] for descriptions of the variables contained in these files.

## Full Data ##

The full sample is too large to provide (e.g., the set used to make the included data files consists of nearly 3 million files, totaling over 160 GB), but a new, similar sample can be collected from Reddit using the included scripts.

Running the collection script ([0_collection.R][9]) will first create a `subreddits` directory in which recent messages from select subreddits are downloaded. These are used to collect user names. The script will then create `user_comments_raw` and `user_posts_raw` directories in which it downloads the comments and submissions made by each user. This makes up the full sample.

## Target Identification ##

The target identification script ([1_target_identification.R][10]) first searches through the raw user files and creates the phase 1 search results file: [search_p1.rds][11]. Messages identified in this search phase are then processed with a dependency parser, and the results are stored in a created `parsed_p1` directory.
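The phase 1 search step can be sketched as a pattern match over the raw user comment files. The patterns, file layout, and column names below are illustrative assumptions, not the actual rules used by the script:

```r
# Sketch of a phase 1 search over raw user comment files.
# The patterns and the "body" column name are illustrative assumptions;
# see 1_target_identification.R for the actual search rules.
phrases <- c("lost my job", "got fired", "laid off", "quit my job")
pattern <- paste0("\\b(", paste(phrases, collapse = "|"), ")\\b")

search_user_file <- function(file) {
  comments <- read.csv(file, stringsAsFactors = FALSE)
  hits <- grepl(pattern, comments$body, ignore.case = TRUE)
  comments[hits, , drop = FALSE]
}

# Apply to every raw comment file, then save the results for parsing:
# files <- list.files("user_comments_raw", full.names = TRUE)
# search_p1 <- do.call(rbind, lapply(files, search_user_file))
# saveRDS(search_p1, "search_p1.rds")
```

A broad keyword pass like this deliberately over-matches; the later dependency-parsing phases exist to refine these candidates.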
Once all messages are parsed, an initial refined set of matched sentences is presented for review and later use: [target_p1.csv][12]. A second phase of refinement applies dependency-based rules and searches for time references. These results are stored for review and later use: [target_p2.csv][13].

A subset of the users identified in the second phase of refinement is selected in the target data script ([2_target_data.R][14]), which then creates the full and aggregated raw target files.

## Comparison Selection ##

The comparison selection script ([3_comparison_selection.R][15]) starts by constructing a large user by subreddit matrix containing message counts, as well as dates of first and most recent activity and combined message lengths: [user_subreddit_matrix.rds][16]. This script also uses the target search files produced in the target identification phase to identify target users and exclude them as potential matches. For each target user, similarity with all remaining possible comparison users is measured from the user by subreddit matrix, and a comparison user is assigned. This results in the user pairs file: [user_pairs.rds][17].

Once comparison users are assigned, the comparison data script ([4_comparison_data.R][18]) uses the user pairs file to create the raw comparison files. This script is similar to the target data script, but it uses the raw target data to assign target messages.

## Text Processing ##

Once the raw target and comparison data files are created, they are processed by the process script ([5_process.R][19]), which produces the separate and combined `_scored_` files. This script uses the Receptiviti API to process texts.

## Additional Exclusions ##

The additional exclusions script ([6_additional_exclusions.R][20]) identifies some target users who should probably be excluded. This produces the [exclude_users.txt][21] file, which can be used to remove those users from the data files.
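The similarity-based assignment can be sketched as follows. Cosine similarity over subreddit message counts is an assumption for illustration (the script may use a different measure or additional features), and the matrix is a toy example:

```r
# Sketch of assigning a comparison user by similarity of subreddit activity.
# Cosine similarity is an illustrative assumption; the actual measure and
# candidate exclusions are defined in 3_comparison_selection.R.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a ^ 2)) * sqrt(sum(b ^ 2)))

# Toy user x subreddit message-count matrix (stand-in for the full
# user_subreddit_matrix.rds, which also stores dates and message lengths).
m <- rbind(
  target_user = c(news = 10, jobs = 5, aww = 0),
  candidate_a = c(news = 8,  jobs = 4, aww = 1),
  candidate_b = c(news = 0,  jobs = 0, aww = 9)
)

# Measure similarity between the target and each remaining candidate,
# then assign the most similar candidate as the comparison user.
sims <- apply(m[-1, , drop = FALSE], 1, cosine_sim, b = m["target_user", ])
assigned <- names(which.max(sims))
```

Here `candidate_a` would be assigned, since its activity profile is nearly proportional to the target's, while `candidate_b` posts in entirely different subreddits.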
[1]: https://osf.io/yvxnw
[2]: https://osf.io/nqr52
[3]: https://osf.io/wga82
[4]: https://osf.io/jtbq9
[5]: https://osf.io/tbdy2
[6]: https://osf.io/nqpht
[7]: https://osf.io/9uatj
[8]: https://osf.io/xjth3
[9]: https://osf.io/b2hv6
[10]: https://osf.io/p2rt7
[11]: https://osf.io/f6qsg
[12]: https://osf.io/n3b8y
[13]: https://osf.io/xahrc
[14]: https://osf.io/ngqv8
[15]: https://osf.io/ke9pb
[16]: https://osf.io/kfg5t
[17]: https://osf.io/b32ny
[18]: https://osf.io/w7q8v
[19]: https://osf.io/n6sxq
[20]: https://osf.io/f2uv7
[21]: https://osf.io/3e4gx