# Pre-processing

## Trace data

The descriptive analyses primarily rely on our visit-level database. We publish these data stripped of any potentially identifying information (e.g., URLs) as `/data/Parquet - visits table`. This data set contains the following variables:

- `id`: unique identifier of a visit (integer)
- `u_id`: same as `id`, but missing for duplicated visits, i.e., a URL that has already been visited by that person on that day (integer)
- `person_id`: unique identifier of the participant (integer)
- `country`: the country of the participant (string with values US, NL, or PL)
- `created_utc`: the time of the visit in UTC (timestamp)
- `wave`: the wave of the visit (integer with values 1, 2, or 3)
- `news_domain`: whether the visit was to a news domain (integer, 0 or 1)
- `political_title`: whether the visit title was classified as political (integer, 0 or 1)
- `visit_time_ms`: the duration of the visit in milliseconds (integer). Duration is defined as the time difference to the chronologically next visit, capped at 5 minutes.
- `survey_domain`: whether the visit was to a survey domain (integer, 0 or 1)
- `has_title`: whether the visit has a title (integer, 0 or 1)
- `news_home`: whether the visit was to the home page of a news domain (integer, 0 or 1)
- `local_date`: the local date (string, DD-MM-YYYY)
- `sm_news`: whether a news visit was to the social media site of a news organization (integer, 0 or 1)

This visit-level data set is the basis, first, for the descriptive analyses (see Section \@ref(sec:descriptive) below). Second, we aggregate the data to the person-wave level with the query `1.1-traces-aggregates.sql`; the exported output is used for the predictive analyses (see Section "Predictive analyses"). The second trace data set is in `/data/Parquet - people table`, which is only used for an analysis in the Supplementary Materials.

Our SQL queries are written in Presto SQL and were run using AWS Athena (version 2), a paid query service. If you want to reproduce the query results, the easiest option is to do so in AWS Athena: upload all of the files in `/data/Parquet - visits table/` into an AWS S3 bucket, create a table in Athena from "S3 bucket data", and do the same for `/data/Parquet - people table` if needed. This requires no code adaptation (unless there are version changes in Athena). If you prefer not to pay the AWS fee, you can install a PostgreSQL database locally, load the data, and run the queries there. This will be considerably slower and may require code modifications because of the slightly different SQL dialect (Presto SQL vs. PostgreSQL). Illustrative R sketches for working with the Parquet files outside of Athena are given at the end of this page.

## Survey data

`03_survey-recoding.R` recodes the survey data for all three countries and all three waves. It depends on three helper files, `02_survey-recoding-functions_NL.R`, `02_survey-recoding-functions_PL.R`, and `02_survey-recoding-functions_US.R`, which contain functions used throughout the research project to rename variables, correct errors, and join the waves. Running this file saves `/data/processed/data_survey.RData`, which contains the survey data for all three countries in wide format (one row = one respondent).

Survey and trace data are joined with the file `04-survey-traces-joining.R`. It takes in `/data/processed/data_survey.RData` and the output of the query described in `1.1-traces-aggregates.sql`, which is run on `/data/Parquet - visits table` (a hedged sketch of this joining step also appears below).
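
## Illustrative R sketches

The sketches below are not part of the replication code; they only illustrate how the published files can be handled locally. This first one reads the visit-level Parquet files with the `arrow` package and computes a rough person-wave aggregate with `dplyr`. The column choices and the filter are assumptions for illustration; the authoritative aggregation is the one defined in `1.1-traces-aggregates.sql`.

```r
library(arrow)
library(dplyr)

# Read all Parquet files in the visits table directory as one dataset
visits <- open_dataset("data/Parquet - visits table")

# Rough person-wave aggregate; the real aggregation is defined in
# 1.1-traces-aggregates.sql (Presto SQL, run on AWS Athena)
person_wave <- visits |>
  filter(survey_domain == 0) |>   # illustrative: drop visits to survey domains
  group_by(person_id, country, wave) |>
  summarise(
    n_visits      = n(),
    n_news        = sum(news_domain, na.rm = TRUE),
    n_political   = sum(political_title, na.rm = TRUE),
    total_time_ms = sum(visit_time_ms, na.rm = TRUE)
  ) |>
  collect() |>
  ungroup()
```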
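If you go the local PostgreSQL route mentioned above, one way to load the published Parquet files into a local database is via `DBI` and `RPostgres`. The database name, user, and table name below are placeholders, and the example query is only a trivial stand-in for the repository's Presto SQL.

```r
library(arrow)
library(dplyr)
library(DBI)
library(RPostgres)

# Read the visit-level Parquet files into memory (can take a while)
visits <- open_dataset("data/Parquet - visits table") |> collect()

# Placeholder connection details for a local PostgreSQL instance
con <- dbConnect(RPostgres::Postgres(), dbname = "traces", user = "postgres")
dbWriteTable(con, "visits", visits, overwrite = TRUE)

# The repository's queries can then be run against this table, possibly with
# minor dialect changes; a trivial example:
dbGetQuery(con, "SELECT country, COUNT(*) AS n_visits FROM visits GROUP BY country")

dbDisconnect(con)
```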
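Finally, a hedged sketch of the survey-trace joining step; the actual implementation is `04-survey-traces-joining.R`. The object name loaded from the `.RData` file and the join keys used here are assumptions, not the script's actual names.

```r
library(dplyr)
library(tidyr)

# Loads the wide-format survey data; `data_survey` is an assumed object name
load("data/processed/data_survey.RData")

# Reshape the person-wave aggregates (see the first sketch) to one row per
# person, then attach them to the survey data by participant identifier
traces_wide <- person_wave |>
  pivot_wider(
    id_cols     = c(person_id, country),
    names_from  = wave,
    values_from = c(n_visits, n_news, n_political, total_time_ms)
  )

data_joined <- left_join(data_survey, traces_wide,
                         by = c("person_id", "country"))
```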