# Exploring the Data and Analysis Scripts #
@[osf](jbaq6)
### Raw Datasets ###
Our analysis was conducted on secondary data using five datasets containing logs of foreground app use:
| Dataset | Sample Size | Reference | Link |
| ----------- | ----------- | -----------| ---------|
| LiveLab | 34 | Shepard, Rahmati, Tossell, Zhong, & Kortum (2011) | http://yecl.org/livelab/traces.html |
| Securacy | 165 | Jones, Ferreira, Hosio, Goncalves, & Kostakos (2015); Ferreira et al. (2015)| The dataset is managed by the University of Oulu. See Ferreira et al. (2015). |
| Consumer Mobile Experience | 532 | OFCOM (2019) | https://www.ofcom.org.uk/research-and-data/telecoms-research/mobile-smartphones/consumer-mobile-experience |
| Mobile Phone Usage | 342 | Pielot et al. (2017) | https://crawdad.org/telefonica/mobilephoneuse/20190429/ |
| Health & Smartphone Use| 46 | Shaw et al. (2020) | https://osf.io/a4p78/ |
To comply with data sharing agreements, we are unable to share the majority of the raw (unprocessed) data. The **Health and Smartphone Use** dataset, which was collected by the first author, is available at https://osf.io/6wk4p/
Other raw data sets can be requested from relevant data controllers or accessed via the links above.
For more details on these datasets and how they were acquired, see our supplementary materials:
https://osf.io/6x3fs/
### Processed Datasets ###
Processed datasets are freely available. These **allow for the replication** of all reported confirmatory and exploratory analyses.
Processing included:
- Converting all raw data to the same structure (view an example here: https://osf.io/vyz5c/)
- Merging the 5 raw datasets together
- Extracting data from the 21 most used apps
- Removing blank days of data
- Removing participants who had fewer than 7 days of data
- Removing identifiers (e.g. the smartphone ID)
- Converting data into behavioural profiles
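The two filtering steps in the list above (removing blank days, then removing participants with fewer than 7 days of data) can be sketched as follows. This is an illustration in Python, not the authors' processing pipeline; the record layout (`participant`, `day`, `usage`), the function name `filter_participants`, and the reading of a "blank" day as one with no recorded app use are all assumptions made for the example.

```python
from collections import defaultdict

MIN_DAYS = 7  # threshold stated in the processing list


def filter_participants(records, min_days=MIN_DAYS):
    """Drop blank days, then drop participants with fewer than `min_days` days.

    `records` is a list of dicts with hypothetical keys:
    'participant', 'day', and 'usage' (a dict of app -> count or seconds).
    """
    by_participant = defaultdict(list)
    for rec in records:
        # Assumption: a "blank" day is one with no recorded use of any app.
        if sum(rec["usage"].values()) > 0:
            by_participant[rec["participant"]].append(rec)
    return {
        pid: days
        for pid, days in by_participant.items()
        if len(days) >= min_days
    }


# Toy data: A has 8 usable days, B has 5, C has 8 days of which 2 are blank.
records = (
    [{"participant": "A", "day": d, "usage": {"Facebook": 3}} for d in range(8)]
    + [{"participant": "B", "day": d, "usage": {"Facebook": 1}} for d in range(5)]
    + [{"participant": "C", "day": d, "usage": {"Facebook": 2 if d < 6 else 0}}
       for d in range(8)]
)
kept = filter_participants(records)
print(sorted(kept))
```

Only participant A survives in this toy example: B has too few days, and C falls below the threshold once the blank days are removed.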
Further processing details can be found in the manuscript (https://doi.org/10.1177/09567976211040491) and supplementary materials (https://osf.io/6x3fs/).
**Note:** We have not shared the data or analysis scripts used to derive the split half distributions (see later sections). These scripts created average usage scores from the raw data (which we cannot share) before generating behavioural profiles.
### Information about Processed Datasets ###
After data cleaning, our processed datasets contained 780 participants, each with at least 7 days of smartphone data. This totaled 28,692 days of data across all participants.
Two variables (and corresponding datasets) were created to measure the daily engagement with each app. ***Pickups*** refers to the number of times a user accessed each of the 21 apps per day; ***Durations*** refers to how long (in seconds) each user spent on the same 21 apps per day.
The 21 apps included are: Calculator, Calendar, Camera, Clock, Contacts, Facebook, Gallery, Gmail, Google Play Store, Google Search, Instagram, Internet, Maps, Messaging, Messenger, Phone (native phone call app), Photos, Settings, Twitter, WhatsApp, and YouTube.
### Accessing Processed Datasets ###
The datasets are labelled **PickupsBehavProf.csv** and **TimeBehavProf.csv**. Both can be downloaded from the Open Science Framework project page (https://osf.io/xvd6s/files/).
### What are *"daily behavioural profiles of app use"*? ###
For each day of data, the raw number of pickups and the raw durations (in seconds) for each app were converted into z scores. A behavioural profile therefore consists of 21 standardised scores for any given day. We have deliberately not provided the means or standard deviations of daily engagement for each of the 21 apps, so that the data cannot be reverse engineered into their original form. This aligns with data sharing/privacy agreements.
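As a rough illustration of the z score conversion, the Python sketch below standardises the raw values of a single app. It assumes that each app is standardised against its own mean and standard deviation across days, and that the sample standard deviation (n - 1) is used; neither detail is stated here, so treat both as assumptions. The function name `standardise_column` and the toy values are hypothetical.

```python
from statistics import mean, stdev


def standardise_column(values):
    """Convert raw values to z scores using the sample SD (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical raw daily pickups for one app across five days.
raw_facebook = [10, 12, 8, 15, 5]
z_facebook = standardise_column(raw_facebook)
print([round(z, 2) for z in z_facebook])
```

Without the original mean and standard deviation, the raw counts cannot be recovered from the z scores, which is why those values are withheld.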
### Analysis ###
The relevant files for each analysis are contained in the following .zip folders on the OSF project page:
- Retrieving Coefficients
- Distributions Pickups Analysis
- Distributions Time (Durations) Analysis
- Split Halves Analysis
- Identifying Individuals Analysis
- Exploratory Factor Analysis
### Retrieving Coefficients ###
In our manuscript we discuss how to compare the similarity of two behavioural profiles using ipsative correlations, which are correlations conducted on ranked data. First, within each behavioural profile, we ranked the apps from the most to the least used for that day. A Pearson correlation was then computed on the two sets of ranks to assess whether the two behavioural profiles contained similar app use patterns (e.g. Facebook always used the most and the Calculator app the least). The higher the correlation coefficient, the greater the similarity.
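The rank-then-correlate procedure described above can be sketched in a few lines. This is a minimal Python illustration, not the code in the R scripts; the helper names (`average_ranks`, `pearson`, `ipsative_correlation`), the handling of ties by average rank, and the four-app toy profiles are assumptions made for the example.

```python
from statistics import mean


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank.

    The ranking direction does not change the correlation, provided the
    same direction is used for both profiles.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined if either input has zero variance."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def ipsative_correlation(profile_a, profile_b):
    """Pearson correlation computed on the within-profile ranks."""
    return pearson(average_ranks(profile_a), average_ranks(profile_b))


# Two hypothetical profiles with the same ordering of four apps.
day_1 = [1.2, -0.3, 0.5, -1.0]
day_2 = [0.9, -0.1, 0.4, -1.5]
print(ipsative_correlation(day_1, day_2))
```

Because the two toy profiles order the apps identically, the coefficient is 1 even though the underlying scores differ. A Pearson correlation on ranks is equivalent to a Spearman rank correlation.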
In the **Getting Coefficients** folder, the datasets described above are called **PickupsBehavProf.csv** and **TimeBehavProf.csv**.
This folder contains two scripts, **Durations Coefficients.R** and **Pickups Coefficients.R**, one for each engagement variable. Each script contains two functions that randomly select two behavioural profiles and compute the ipsative correlation coefficient between them. The first function compares two behavioural profiles from different people (between-subject comparison; *BetweenSMW1994*). The second function compares two behavioural profiles from the same person (within-subject comparison; *WithinSMW1994*). The number of ipsative correlations to generate is specified at the beginning of each script; the practical maximum depends on the computational power available. Each script outputs two distributions: a within-subject distribution of coefficients and a between-subject distribution of coefficients. Example outputs for each variable (pickups and time) are in the **Getting Coefficients** folder.
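The difference between the two kinds of comparison comes down to how the pair of profiles is drawn. The Python sketch below shows one plausible sampling scheme and is not a port of *BetweenSMW1994* or *WithinSMW1994*: it assumes a user is picked at random first and days are then picked within that user, which may differ from the R scripts. The function names, the `profiles_by_user` layout, and the toy values are hypothetical.

```python
import random


def sample_within_pair(profiles_by_user, rng):
    """Pick two different days from one randomly chosen user."""
    eligible = [u for u, days in profiles_by_user.items() if len(days) >= 2]
    user = rng.choice(eligible)
    first, second = rng.sample(profiles_by_user[user], 2)
    return first, second


def sample_between_pair(profiles_by_user, rng):
    """Pick one day from each of two different randomly chosen users."""
    user_a, user_b = rng.sample(list(profiles_by_user), 2)
    return rng.choice(profiles_by_user[user_a]), rng.choice(profiles_by_user[user_b])


# Hypothetical data: user id -> list of daily profiles (three apps each).
profiles_by_user = {
    "u1": [[1.0, -0.5, 0.2], [0.9, -0.4, 0.1], [1.1, -0.6, 0.3]],
    "u2": [[-0.8, 0.7, 0.0], [-0.7, 0.6, 0.1]],
    "u3": [[0.0, 0.1, -0.9], [0.1, 0.2, -1.0]],
}
rng = random.Random(42)  # fixed seed so the example is repeatable
within_pair = sample_within_pair(profiles_by_user, rng)
between_pair = sample_between_pair(profiles_by_user, rng)
```

Repeating each draw the requested number of times, and computing an ipsative correlation for every pair, would yield one within-subject and one between-subject distribution of coefficients.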
We have also created [a Shiny app][1], which visualises this part of the analysis. It runs identical code with the addition of automated data visualisations.
### Distributions Analysis ###
This analysis involved comparing the within-subject distribution of coefficients to the between-subject distribution of coefficients. We predicted in our [preregistration][2] that the coefficients would be significantly higher in within-subject comparisons than in between-subject comparisons:
*Hypothesis 1: Within-person smartphone use across days will be statistically and significantly higher than between-person smartphone use across days.*
Data and scripts for this analysis are split across two folders. The folder **Distributions Pickups Analysis** contains the distributions for the pickups data. **Distributions Time (Durations) Analysis** contains the distributions for the durations data.
Each of these folders contains multiple distributions. In our analysis we created 90 distributions (45 for within, 45 for between). Each distribution contained 10 million coefficients (defined by our computational limit).
For each of the 45 within-between distribution pairs, the **Test of Difference.R** script calculated several descriptive and inferential statistics, including:
***Descriptives***
- The mean for the within subject distribution
- The mean for the between subject distribution
- The standard deviation for the within subject distribution
- The standard deviation for the between subject distribution
- The median for the within subject distribution
- The median for the between subject distribution
- The minimum value for the within subject distribution
- The minimum value for the between subject distribution
- The maximum value for the within subject distribution
- The maximum value for the between subject distribution
***t tests***
- t value
- t degrees of freedom
- p value
- The 95% confidence interval around the mean for the within subject distribution
- The 95% confidence interval around the mean for the between subject distribution
- Cohen's d effect size, with 95% confidence interval
***Wilcoxon rank-sum***
- W
- p value
- Vargha and Delaney effect size
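Two of the effect sizes listed above can be computed from first principles. The Python sketch below is illustrative only and is not taken from **Test of Difference.R**: it assumes Cohen's d is computed with the pooled sample standard deviation, and it computes the Vargha and Delaney A statistic by direct pairwise comparison. The function names and toy coefficients are hypothetical.

```python
from statistics import mean, stdev


def cohens_d(x, y):
    """Cohen's d using the pooled sample standard deviation (an assumption)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5


def vargha_delaney_a(x, y):
    """Probability that a value from x exceeds a value from y (ties count half)."""
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Hypothetical within- and between-subject coefficients.
within = [0.8, 0.7, 0.9, 0.6]
between = [0.3, 0.5, 0.4, 0.6]
print(round(cohens_d(within, between), 3))
print(vargha_delaney_a(within, between))
```

An A value of 0.5 indicates no difference between the groups, and values near 1 indicate that within-subject coefficients are almost always higher. The pairwise loop is quadratic, so distributions with millions of coefficients would need the equivalent rank-sum formulation instead.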
In the folder **Distributions Pickups Analysis** the output of this script is called **Day Pair Statistics Pickups.csv**. In the folder **Distributions Time (Durations) Analysis** the output of this script is called **Day Pair Statistics Time.csv**.
In the above two folders there is also a script called **Graphs.R**, which merges the 45 within-subject distributions together and likewise the 45 between-subject distributions. These two 'super distributions', containing 45 million data points each, are then plotted as a histogram. The histogram mirrors the style of Shoda, Mischel, and Wright (1993) as a tribute to their work.
@[osf](w3xp7)
### Split Halves Analysis ###
We re-analyzed data using a ‘split half’ comparison. This involved creating an average behavioral profile for the first and second half of a user’s data and comparing these directly for a within-user comparison, or comparing one half to another user’s half for a between-user comparison. This split half approach removes the unbalanced influence that users with more behavioral profiles have in the day pair comparisons, since all users have only two data points.
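The averaging step behind the split half comparison can be sketched as follows. This Python example is an illustration, not the authors' script (which is not shared): it averages values per app over each half of a user's days, assumes the days are in chronological order, and assigns the extra day to the second half when the number of days is odd. The function name `split_half_profiles` and the toy values are hypothetical.

```python
from statistics import mean


def split_half_profiles(days):
    """Average each app's value over the first and second half of a user's days.

    `days` is a chronologically ordered list of equal-length lists, one value
    per app. With an odd number of days the extra day goes to the second half
    (an assumption made for this sketch).
    """
    mid = len(days) // 2

    def average(half):
        # zip(*half) regroups the values by app rather than by day.
        return [mean(app_values) for app_values in zip(*half)]

    return average(days[:mid]), average(days[mid:])


# Hypothetical user with four days of data for two apps.
user_days = [[1, 2], [3, 4], [5, 6], [7, 8]]
first_half, second_half = split_half_profiles(user_days)
print(first_half, second_half)
```

Every user then contributes exactly two averaged profiles, regardless of how many days of data they provided.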
The folder **Split Halves Analysis** contains data for both pickups and durations variables. Specifically, four ipsative correlation coefficient distributions are reported, each with an N of 780:
1. The between subject distribution for duration data
2. The within subject distribution for duration data
3. The between subject distribution for pickups data
4. The within subject distribution for pickups data
The **Split Halves Script - Durations.R** compares 1 & 2, and exports the same descriptive and inferential statistics as the distributions analysis above into a file called **Split Halves Statistics Duration.csv**. The **Split Halves Script - Pickups .R** compares 3 & 4, and exports the same statistics into a file called **Split Halves Statistics Pickups.csv**.
### Identifying Individuals Analysis ###
Using the **PickupsBehavProf.csv** and **TimeBehavProf.csv** data, we built models to predict which user each behavioural profile belonged to. The analysis was conducted using **Identifying Users Script.R**, which is in the **Identifying Individuals Analysis** folder.
This analysis included:
- Converting each participant into a class to be classified
- Splitting the data into train and test subsets
- Training a pickups random forest model
- Training a durations random forest model
- Calculating the accuracy of both models on test data (see **PickupsConfusionMatrixTest.csv** and **DurationsConfusionMatrixTest.csv**)
- Calculating other performance metrics for each class (user) including Sensitivity, Specificity, Precision, Recall, F1 etc. (see **PickupsPerformanceMeasuresPerPerson.csv** and **DurationsPerformanceMeasuresPerPerson.csv**)
- Calculating, for every behavioural profile, the probability that it belongs to each user (780 probabilities per profile)
- Calculating the % of occurrences where the correct person is in the top 10 most probable users (see **Top10PickupsConfusionMatrixTest.csv** and **Top10DurationsConfusionMatrixTest.csv**)
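The final two steps in the list above amount to a top-k accuracy calculation over the predicted class probabilities. The Python sketch below shows the idea on a toy problem with three users and k = 2; it is not taken from **Identifying Users Script.R**, and the function name `top_k_accuracy`, the data layout, and the probabilities are hypothetical.

```python
def top_k_accuracy(probabilities, true_users, k=10):
    """Share of profiles whose true user is among the k most probable users.

    `probabilities` is a list of dicts mapping user id -> predicted probability,
    one dict per behavioural profile in the test set.
    """
    hits = 0
    for probs, truth in zip(probabilities, true_users):
        top_k = sorted(probs, key=probs.get, reverse=True)[:k]
        hits += truth in top_k
    return hits / len(true_users)


# Hypothetical predictions for three test profiles.
predicted = [
    {"a": 0.6, "b": 0.3, "c": 0.1},
    {"a": 0.5, "b": 0.1, "c": 0.4},
    {"a": 0.2, "b": 0.7, "c": 0.1},
]
actual = ["b", "b", "a"]
print(top_k_accuracy(predicted, actual, k=2))
```

In the toy data the true user falls within the two most probable users for the first and third profiles but not the second, giving a top-2 accuracy of two thirds.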
### Exploratory Factor Analysis ###
Using the **PickupsBehavProf.csv**, and **TimeBehavProf.csv** data, Pearson correlations were calculated on the daily behavioral profile scores for each app to check for similarity in variance of use. The resultant matrices can be accessed in the folder **Exploratory Factor Analysis**. See files named **CorMatrixPickups.csv** and **CorMatrixTime.csv**.
We submitted the correlation matrices to an Exploratory Factor Analysis (EFA) to examine if the variance in daily usage could be explained by latent factors. This analysis included:
- Parallel analysis using a minimum residual factor analysis method
- EFA using the ‘Promax’ oblique rotation method
This analysis was conducted using the script **Factor Analysis.R**. Further details can be found in the supplementary materials: https://osf.io/6x3fs/
### References ###
Ferreira, D., Kostakos, V., Beresford, A.R., Lindqvist, J., & Dey, A.K. (2015). Securacy: An Empirical Investigation of Android Applications’ Network Usage Privacy and Security. In WiSec ’15: Proceedings of the 8th ACM Conference on Security & Privacy in Wireless and Mobile Networks, 11, 1-11. New York, NY, United States. doi: 10.1145/2766498.2766506
Jones, S. L., Ferreira, D., Hosio, S., Goncalves, J., & Kostakos, V. (2015). Revisitation analysis of smartphone app use. Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing - UbiComp ’15, 1197–1208. doi:10.1145/2750858.2807542
OFCOM (2019). The consumer mobile experience. Retrieved from https://www.ofcom.org.uk/research-and-data/telecoms-research/mobile-smartphones/consumer-mobile-experience
Pielot, M., Cardoso, B., Katevas, K., Serrà, J., Matic, A., & Oliver, N. (2017). Beyond interruptibility: Predicting opportune moments to engage mobile phone users. In Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (Vol. 1, p. 91). doi:10.1145/3130956
Shaw, H., Ellis, D. A., Geyer, K., Davidson, B. I., Ziegler, F. V., & Smith, A. (2020). Quantifying Smartphone “Use”: Choice of Measurement Impacts Relationships Between “Usage” and Health. Technology, Mind, and Behavior, 1(2). doi: 10.1037/tmb0000022
Shepard, C., Rahmati, A., Tossell, C., Zhong, L., & Kortum, P. (2011). Livelab: Measuring wireless networks and smartphone users in the field. ACM SIGMETRICS Performance Evaluation Review, 38, 15-20. doi:10.1145/1925019.1925023
[1]: https://behaviouralanalytics.shinyapps.io/AppUseProfiles/
[2]: https://osf.io/u6hsc/ "preregistration"