This document outlines the steps necessary to reproduce the results from our paper.
Our code is organized into two broad sections:
1) A set of Figure_X directories that provide the code necessary to reproduce a particular figure from the paper.
* The code in these directories generally operates on processed data files, not directly on raw data.
2) A set of directories named after the different processing steps necessary to produce the data used by the Figure_X code.
Each Figure_X code directory has a corresponding data directory that contains intermediary data necessary to plot the figure.
If you want to regenerate the figures from raw data, please refer to the following guide:
#####################################################################
#####################################################################
Generating from raw data:
This section covers how to generate, from raw data, the data files used by the Figure_X scripts.
NOTE: To get access to the raw unanonymized data please use the link in the Raw_Data directory in OSF.
Install required packages
1) pip install -r requirements.txt
2) apt-get install python3-graph-tool
To generate the intermediary files:
1) Generate retweet network:
Note: these scripts live in the Retweet_Network folder in the OSF repository.
a) Run generate_retweet_networks.py
i) NOTE: you need to modify the constants in the script to point to the raw Twitter data on your machine. There are also variables controlling how many threads to use, which you may want to tune to your machine; 16 or 32 workers should be fine.
ii) You will likely need to point the WORKING_DIR variable to a directory with > 50GB of disk space. The final step of merging retweet and tweet edges requires that the workers have adequate disk space.
iii) Previously generated versions of these files exist in the Retweet_Network folder of the OSF data repository.
Output: This code will generate the URL-classified edgelists for the 2020 retweet network, for use in CI calculations. There is one file per bias, named <bias>_retweet_edges.csv.
NOTE: We apply anonymization after this step; the produced retweet edges and tweet ids are all anonymized.
b) Run create_2016_retweet_networks.py
i) Set the tweet_db_file1, tweet_db_file2, and urls_db_file variables to point to the correct sqlite database files.
ii) Set save_dir to a path for the 2016 networks. NOTE: this should NOT be the same path as the 2020 networks, since the files share a name. Ideally, have separate 2016 and 2020 directories, one for each set of retweet networks.
Output: This script will generate the URL-classified edgelists for the 2016 retweet networks. The files are named <bias>_retweet_edges.csv.
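Before running the two scripts in step 1, it can help to check the free disk space and to set up one output directory per year, since the 2016 and 2020 edgelists share the file name <bias>_retweet_edges.csv. The following is a minimal Python sketch and is not part of the provided code. The helper names (has_enough_space, year_save_dir, existing_edge_files) are hypothetical, and the 50 GB default is taken from the WORKING_DIR note above.

```python
import shutil
import tempfile
from pathlib import Path

GIB = 1024 ** 3  # bytes per gibibyte


def has_enough_space(path, required_gb=50):
    """Return True if the filesystem holding `path` has more than `required_gb` GiB free."""
    return shutil.disk_usage(path).free > required_gb * GIB


def year_save_dir(base_dir, year):
    """Create and return base_dir/<year>, so 2016 and 2020 outputs never share a folder."""
    save_dir = Path(base_dir) / str(year)
    save_dir.mkdir(parents=True, exist_ok=True)
    return save_dir


def existing_edge_files(save_dir):
    """List any <bias>_retweet_edges.csv files already present in save_dir."""
    return sorted(p.name for p in Path(save_dir).glob("*_retweet_edges.csv"))


if __name__ == "__main__":
    # Demonstrated in a temporary directory; replace `base` with your own path.
    with tempfile.TemporaryDirectory() as base:
        for year in (2016, 2020):
            d = year_save_dir(base, year)
            print(d, "already contains:", existing_edge_files(d))
        print("more than 50 GiB free:", has_enough_space(base))
```

The paths returned by year_save_dir can then be copied into save_dir in create_2016_retweet_networks.py and into the corresponding constant in generate_retweet_networks.py.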
2) Compute Collective Influence:
Note: these scripts live in the Collective_Influence folder in the OSF repository.
a) Run setup.py
b) Run generate_graphs.py
i) Be sure to set the base_path variable in this script to point at the retweet network edges files generated in step 1.
c) Run compute_CI_retweet_networks.py
i) This needs to be run once for each graph generated in the previous step. It also expects the ../data/ci_output/<graph|output>/2020 directories to exist.
Output: This will produce the <bias>_<year>_ci.gt files, which are used to compute the top_influencer_<bias>.csv files used to make Figure 4.
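The directories expected by compute_CI_retweet_networks.py can be created ahead of time. The sketch below is an illustration only and rests on one assumption: that <graph|output> denotes two sibling directories named graph and output. The function name make_ci_dirs and its parameters are hypothetical.

```python
from pathlib import Path


def make_ci_dirs(data_root, years=(2020,), kinds=("graph", "output")):
    """Create <data_root>/ci_output/<kind>/<year> for every kind and year.

    Assumption: "<graph|output>" in the step above means the two
    directory names "graph" and "output".
    """
    created = []
    for kind in kinds:
        for year in years:
            path = Path(data_root) / "ci_output" / kind / str(year)
            path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
            created.append(path)
    return created
```

For example, calling make_ci_dirs("../data") from inside the Collective_Influence folder would create the two 2020 directories. If your run also needs 2016 directories, years=(2016, 2020) can be passed instead.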
3) Generate User Mappings:
Note: the script needed here is in the Similarity_Matrix folder.
a) Run assemble_user_maps.py
i) You will need to set target_dir in this script to point at the raw 2020 users.csv files.
Output: After running this script you will have generated the user_map_2016.pkl and user_map_2020.pkl files.
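A quick sanity check on the generated pickle files can catch an empty or truncated output before later steps depend on it. The sketch below makes no assumption about what the user maps contain; it only reports the top-level Python type and, where available, the number of entries. The helper name summarize_pickle is hypothetical. Only unpickle files that you generated yourself or otherwise trust.

```python
import pickle


def summarize_pickle(path):
    """Load a pickle file and return (type name, length or None)."""
    with open(path, "rb") as fh:
        obj = pickle.load(fh)
    # Not every pickled object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return type(obj).__name__, size
```

For example, print(summarize_pickle("user_map_2020.pkl")) run from the directory holding the file would show the container type and its size.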
4) Find the Top Influencers:
Note: This script lives in the Retweet_Network directory of the provided code.
a) Run elites_network_analysis.py
i) Note: this script adds the user handle to its output. Instead, make a map from the new id we create to the user handle / anonymized user handle, and store that file.
Output: top_influencers_<bias>.csv files
Note: This script lives in the Similarity_Matrix directory.
b) get_top_100_unweighted_influencers.py
i) Run for year = 2016 and year = 2020 (in script)
ii) set target_dir to point towards your <bias>_<year>_ci.gt graphs, generated by Collective_Influence compute_CI_retweet_networks.py
Output: top_100_<bias>
Note: This script lives in the Retweet_Network directory of the provided code.
c) analyze_retweet_networks.py
i) Run for year = 2016 and year = 2020 (in script)
ii) point USER_MAPS_<year> variables to the generated user_map_2016.pkl and user_map_2020.pkl files
iii) point the network_dir variable to the output of the compute_CI_retweet_networks.py script.
Output: influencer_rankings_2016.pkl and influencer_rankings_2020.pkl
Note: This script lives in the Retweet_Network directory of the provided code.
d) draw_combined_retweet_graphs.py
i) Run for year = 2016 and year = 2020 (in script)
ii) point save_dir to where the outputs from analyze_retweet_networks.py live and point network_dir to where the outputs of compute_CI_retweet_networks.py live.
Output: two retweet_graph_top_combined_topnum_N.json files, one for 2016 and the other for 2020
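To confirm that step 4 produced the graph files that the Figure 5 scripts will later read, the output directory can be scanned for them. This is a minimal sketch under the assumption that the N in the file name is the only part that varies; the helper names find_retweet_graph_jsons and is_valid_json are hypothetical, and nothing is assumed about the JSON structure itself.

```python
import json
from pathlib import Path


def find_retweet_graph_jsons(directory):
    """Return paths matching retweet_graph_top_combined_topnum_<N>.json, sorted by name."""
    pattern = "retweet_graph_top_combined_topnum_*.json"
    return sorted(Path(directory).glob(pattern))


def is_valid_json(path):
    """Return True if the file can be opened and parsed as JSON."""
    try:
        with open(path, encoding="utf-8") as fh:
            json.load(fh)
        return True
    except (OSError, ValueError):  # json.JSONDecodeError is a ValueError
        return False
```

Running find_retweet_graph_jsons on the 2016 and 2020 save_dir locations should each return one path, and is_valid_json on each path should return True.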
5) Generate Affiliation Mapping
Note: this script lives in the User_Analysis folder in the OSF repository.
a) get_infl_affiliations.py
i) point INFLUENCER_DIR to the correct directories, as done in the previous step
ii) point ANSWER_DIR to the survey answers in the OSF data repository
iii) point MAPS_DIR to the location of allowed_users_anon_id_to_handle.json
Output: infl_affiliation_map_no_handles.json, infl_affiliation_map_no_handles.pkl
6) Generate Similarity Matrix:
Note: these scripts live in the Similarity_Matrix folder in the OSF repository.
a) get_similarity_matrix_2016.py
i) point influencer_dir to the directory containing the top_100 influencer pickle files generated by get_top_100_unweighted_influencers.py.
ii) point raw_retweets_2016 to the 2016 election data sqlite database.
iii) modify the save_dir
b) get_similarity_matrix_2020.py
i) point influencer_dir to the directory containing the top_100 influencer pickle files generated by get_top_100_unweighted_influencers.py.
ii) point raw_retweets_2020 to the raw retweet csv files for 2020.
iii) modify the save_dir
Output: sim_network_large_2016.pkl and sim_network_large_2020.pkl, used by Figure 5
The above steps will generate all intermediary files necessary to plot the figures from the paper. Next, for each figure we will provide a quick overview on what data the figure needs and how to run the scripts to plot it.
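Before moving on to the figure scripts, a short pre-flight check can confirm that the intermediary files exist. The sketch below lists the file names given in the "Output:" lines above that have a fixed name. It assumes, purely for illustration, that you have gathered them in a single directory, which the scripts themselves do not require. The names INTERMEDIARY_FILES and missing_files are hypothetical.

```python
from pathlib import Path

# Fixed file names taken from the "Output:" lines of steps 3 to 6.
INTERMEDIARY_FILES = [
    "user_map_2016.pkl",
    "user_map_2020.pkl",
    "influencer_rankings_2016.pkl",
    "influencer_rankings_2020.pkl",
    "infl_affiliation_map_no_handles.json",
    "infl_affiliation_map_no_handles.pkl",
    "sim_network_large_2016.pkl",
    "sim_network_large_2020.pkl",
]


def missing_files(directory, names=INTERMEDIARY_FILES):
    """Return the names from `names` that are not regular files in `directory`."""
    directory = Path(directory)
    return [name for name in names if not (directory / name).is_file()]
```

An empty list from missing_files means every listed file was found. If your files live in several directories, the function can be called once per directory with the relevant subset of names.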
#####################################################################
#####################################################################
Generating Figure 3:
Note: Figure 3 related scripts are located in the Figure_3 directory
1) Run analyze_answers.py
a) point the user_data variable to the directory containing link_map.pkl (see Generate User Mappings above)
b) point the influencer_data variable to the output of get_top_100_unweighted_influencers.py
c) point the infl_classification_surveys variable to the survey_0.xlsx - survey_8.xlsx files provided in the Influencers_Classification data directory in OSF.
Generating Figure 4:
1) Optionally create a python virtual environment to install the necessary python packages: python -m venv <PATH/TO/VIRTUAL/ENVIRONMENT>
a) Start your virtual environment: source <PATH/TO/VIRTUAL/ENVIRONMENT>/bin/activate
2) Install required python packages: pip install -r requirements.txt
a) Also install jupyter if you don't have it. Note: older versions of jupyter may require you to manually enable ipywidgets; see: https://ipywidgets.readthedocs.io/en/8.1.3/user_install.html
3) Start up a Jupyter notebook: jupyter notebook
4) Run all cells of generate_figure_4.ipynb
5) NOTE:
a) There is one discrepancy from the figure presented in the published paper. For the Extreme Right Bias / Fake News subfigure, the fourth user down, other_2016, erroneously had a circle icon (indicating the user was linked to the media) instead of a triangle (indicating the user was other). This was caused by the manual steps necessary to produce the figure initially.
The original figure first generated 3 sankey diagrams. Then, these were merged into one svg file, and icons, ranked bounding boxes, and legends were added by hand. This resulted in the wrong icon being used. The code presented here generates the figure programmatically and therefore lacks the error.
Generating Figure 5:
1) plot_figure_5_2016.py
a) point SIM_NETWORK_PATH to the similarity network created in get_similarity_matrix_2016.py
b) point RETWEET_GRAPH_JSON_PATH to the json file created by the draw_combined_retweet_graphs.py script.
c) modify the SAVE_DIR to where you want the output
2) plot_figure_5_2020.py
a) point SIM_NETWORK_PATH to the similarity network created in get_similarity_matrix_2020.py
b) point RETWEET_GRAPH_JSON_PATH to the json file created by the draw_combined_retweet_graphs.py script.
c) modify the SAVE_DIR to where you want the output
3) merge_finish_figure_5.py
a) point this script to the generated pdfs of the previous step
4) NOTES:
a) This figure contains one discrepancy from the one in the published paper. In the 2020 Right Leaning top 5 influencers table, the third-ranked user, Polit_both^1,1, has a purple (12) to the left in the figure presented in the paper, while the figure generated here does not. The figure generated by this script is correct, as it builds the top influencer tables automatically instead of manually.
The figure in the paper was originally produced with a manual step in which the two similarity network pdfs were merged into one figure and the top influencer tables were added by hand. When adding the tables, there was an error where the (12) was added to right leaning influencer 3. This likely occurred because this same user had (12) in the extreme bias right category.