# Guide to Preliminary Analyses
The Preliminary Analyses are a set of analyses that duplicate the workflow we plan for the main analyses outlined in the registered report, but they were performed using convenience samples and predict the sentiment of tweets (rather than the self-reported mental health variables we plan to study in the registered report).
The folder titled "Preliminary Analyses" should have everything you need to duplicate the analyses described in the registered report. It consists of four folders. The **data** folder contains the data files used in the preliminary analyses. The **lexicons** folder contains the sentiment and emotion lexicons used to score tweets (*note: only sentiment is examined in the preliminary analyses*). The **scripts** folder contains the R code for the analyses and for processing the tweets (to score sentiment and emotion), plus a knitted PDF from the script used to gather one of the samples (the other collection scripts are not in a shareable state, as they currently contain my Twitter authentication information). The fourth folder contains the two training-performance plots saved by the predictive modeling scripts.
The scripts all use relative paths, so you should be able to download the whole folder and start running the scripts (as long as you preserve the folder's current structure). More information about the contents of each folder is provided below, including the consistent file-naming format used in the data folder, which will help you navigate it.
## Background on data
Data were downloaded from Twitter's API using a combination of [twitteR (Gentry, 2015)] and [rtweet (Kearney, 2018)]. There are ultimately three samples:
1. **Costello seed**
    * A sample of users from my own two-step friend network (i.e., my friends' friends).
2. **Obama & Trump Seed**
    * A sample of users that follow Obama or Trump (randomly sampled followers of each).
    * Note: the data folder contains both a filtered version (filtered to have 25+ tweets) and a non-filtered version.
3. **'New' 10 Seed**
    * Followers sampled from 10 different popular accounts:
        * ‘@joelembiid’, ‘@katyperry’, ‘@jimmyfallon’, ‘@billgates’, ‘@oprah’, ‘@kevinheart4real’, ‘@wizkhalifa’, ‘@adele’, ‘@nba’, and ‘@nfl’
A knitted PDF from collecting the third sample is in the scripts folder [here].
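The 25+ tweet filter mentioned for the Obama & Trump sample can be illustrated with a short sketch. This is not the repository's code (the actual scripts are in R); it assumes the filter keeps users with at least 25 tweets, and the `user_id` field name and the toy data are hypothetical.

```python
from collections import Counter


def filter_active_users(tweets, min_tweets=25):
    """Keep only tweets from users who have at least `min_tweets` tweets."""
    # Count tweets per user (the "user_id" key is an assumed field name).
    counts = Counter(t["user_id"] for t in tweets)
    return [t for t in tweets if counts[t["user_id"]] >= min_tweets]


# Toy data: user "a" has 30 tweets, user "b" has only 3.
tweets = [{"user_id": "a", "text": f"tweet {i}"} for i in range(30)]
tweets += [{"user_id": "b", "text": f"tweet {i}"} for i in range(3)]

kept = filter_active_users(tweets)
print(len(kept), sorted({t["user_id"] for t in kept}))
```

Filtering on a per-user count rather than on the total number of rows keeps every tweet from users who pass the threshold and drops low-activity users entirely.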
I attempted to download two things from the API for each of these samples:
1. Each user's friends list
2. Each user's timeline of tweets
Users' tweets were scored for sentiment and emotion using the sentiment and emotion lexicons developed for Twitter by [Mohammad & Kiritchenko (2015)]. The lexicons can be found in the lexicons folder, and the scripts for cleaning/processing and scoring the tweets are in the scripts folder (see below).
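As a rough illustration of lexicon-based scoring, the sketch below averages the lexicon scores of the tokens found in a tweet. It is not the scoring procedure used in the repository's R scripts. It assumes the lexicon is tab-separated with the term in the first column and a numeric score in the second; the toy entries, the tokenizer, and the choice of a mean (rather than a sum) are all assumptions made for the example.

```python
import re


def load_lexicon(lines):
    """Parse tab-separated lines, assumed to start with: term, score."""
    lexicon = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        try:
            lexicon[parts[0]] = float(parts[1])
        except ValueError:
            continue  # skip rows whose score is not numeric
    return lexicon


def score_tweet(text, lexicon):
    """Mean lexicon score over matched tokens; None if nothing matches."""
    # Lowercase, and keep a leading '#' or '@' attached to its token.
    tokens = re.findall(r"[#@]?\w+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else None


# Toy lexicon rows (hypothetical values, not taken from the real lexicon).
toy_lines = ["good\t1.5\t10\t2", "bad\t-2.0\t1\t12", "#happy\t2.5\t20\t1"]
lexicon = load_lexicon(toy_lines)

print(score_tweet("Such a GOOD day #happy", lexicon))
print(score_tweet("nothing matches", lexicon))
```

To try this on a real file, pass an open file handle to `load_lexicon` in place of `toy_lines`, after checking that the column layout matches the assumption above.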
The data folder contains several datasets. Names should be relatively self-explanatory, but the logic of the filenames is as follows:
* They all start with their sample name. So any file that starts with **costello_seed** is from that sample.
* Files with **timelines_df** in the name contain the raw, unprocessed tweet/timeline data.
* Files with **processed tweets** in the name contain scored tweets (for each user available in that sample).
* Files with **friends** or **Users_friends** in the name should contain a user-friend edgelist. A couple of things to note here:
    * The Costello and Obama/Trump seed samples have a single friends/user-friends dataframe; this also contains friend information (including their name/handle).
    * The new 10 seed sample was collected using slightly different functions, so it has two files. One has just friend IDs (numeric IDs) and usernames, and the other (whose name ends with 'friend_info') has the friend information (including their name/handle).
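The naming logic above can be turned into a small helper that labels a file by sample and content type. This is only a sketch: the example filenames and the `.csv` extension are hypothetical, and only the `costello_seed` prefix is taken from the text, so prefixes for the other samples would need to be supplied by the reader.

```python
def describe_file(filename, samples=("costello_seed",)):
    """Return (sample, kind) for a data file, based on its name alone."""
    lower = filename.lower()
    # Sample: the first known prefix the filename starts with, if any.
    sample = next((s for s in samples if lower.startswith(s.lower())), None)
    stem = lower.rsplit(".", 1)[0]  # drop the extension, if present

    if "timelines_df" in lower:
        kind = "raw timelines"
    elif "processed" in lower and "tweets" in lower:
        kind = "scored tweets"
    elif stem.endswith("friend_info"):
        kind = "friend information"
    elif "friends" in lower:  # also matches "Users_friends"
        kind = "user-friend edgelist"
    else:
        kind = "unknown"
    return sample, kind


# Hypothetical filenames, for illustration only.
for name in ["costello_seed_timelines_df.csv",
             "costello_seed_Users_friends.csv",
             "some_sample_friend_info.csv"]:
    print(name, "->", describe_file(name))
```

The `friend_info` check runs before the generic `friends` check so that a friend-information file whose name also contains "friends" is not mislabeled as an edgelist.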
The **lexicons** folder contains two lexicons, both from [Mohammad & Kiritchenko (2015)], downloaded from the [first author's website]:
1. [Hashtag Sentiment Lexicon]
    * titled: "HS-unigrams.txt"
2. [Hashtag Emotion Lexicon]
The fourth folder contains just the two plots (of training performance) that are saved by the predictive modeling scripts. Note that if you download this set of folders and run either predictive modeling script (unmodified), you will overwrite the plots. However, as long as you don't modify the scripts, the new plots should be identical to the originals.
The **scripts** folder contains scripts (and knitted PDF and HTML documents) for the preliminary analyses. The scripts and knitted documents correspond to the following:
## Downloading API Data
* Only available (currently) for new 10 seed sample, and only available as [PDF]
## Processing & Scoring Tweets
* Scoring Tweets for **Costello Seed Sample** [RMD] and [PDF]
* Scoring Tweets for **Obama/Trump Seed Sample** [RMD] and [PDF]
* Scoring Tweets for **New 10 Seed Sample** [RMD] and [html]
## Predicting Sentiment from Friends
* Predictive Modeling for **combined Costello & Obama/Trump seed**
    * Referred to as *initial samples* in manuscript
    * [RMD] & [HTML]
* Predictive Modeling for **New 10 seed Sample**
    * Referred to as *replication sample* in manuscript
    * [RMD] & [HTML]
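As the section title suggests, these scripts predict a user's tweet sentiment from the accounts they follow. The modeling itself lives in the R scripts and is not reproduced here. As an illustration only, the sketch below shows one common way to turn a user-friend edgelist into a binary user-by-friend matrix that a predictive model could take as input; the function name, the tuple layout of the edgelist, and the toy data are assumptions.

```python
def friend_matrix(edges):
    """Build a binary user-by-friend matrix from (user, friend) pairs.

    Returns (users, friends, rows), where rows[i][j] is 1 if users[i]
    follows friends[j] and 0 otherwise.
    """
    users = sorted({u for u, _ in edges})
    friends = sorted({f for _, f in edges})
    row = {u: i for i, u in enumerate(users)}
    col = {f: j for j, f in enumerate(friends)}

    rows = [[0] * len(friends) for _ in users]
    for u, f in edges:
        rows[row[u]][col[f]] = 1
    return users, friends, rows


# Toy edgelist (hypothetical users; handles borrowed from the seed list).
edges = [("u1", "@nba"), ("u1", "@nfl"), ("u2", "@nba"), ("u3", "@adele")]
users, friends, X = friend_matrix(edges)

print(friends)
for user, features in zip(users, X):
    print(user, features)
```

With real data the matrix would be very wide and mostly zeros, so a sparse representation would likely be preferable to nested lists.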