# SMART Mental Health Prediction Tournament
**Stratified Medicine Approaches foR Treatment Selection (SMART)**
In the SMART Mental Health Prediction Tournament, 13 teams from around the world will compete to see who can build the best predictive model for anxiety and depression treatment response. Each team will be provided with the same large, anonymized mental health treatment outcome dataset from the UK’s national health system. A separate, held-out test sample will be used to determine the winner. This will provide a level playing field for evaluating each model’s efficacy, but it will also allow participants and, ultimately, the field, to understand the reasons for the advantages and disadvantages of each strategy, under different conditions. Head-to-head comparisons of the best approaches for selecting mental health treatments will, we believe, yield knowledge that can be used to maximize the efficiency of mental health care delivery in the future.
The first aim of the tournament will be to learn more about different methodological approaches to building predictive models for treatment selection in mental health. We also hope to contribute to the conversation about barriers to implementation of precision medicine approaches in real-world clinical settings, informed in part by work we will undertake with key stakeholders (service-users and clinicians) as part of this project.
The second aim is to produce an algorithm that could be used to inform treatment decisions in the UK’s IAPT (Improving Access to Psychological Therapies) services, with the goal of enhancing their efficiency, efficacy, and quality of service.
Growing interest in personalized medicine has resulted in a proliferation of research aimed at understanding individual differences in response to treatment. In 2016, researchers from around the world came together for the Treatment Selection Idea Lab (TSIL) to discuss precision medicine approaches in mental health. Many exciting methods for modeling individual differences in treatment response were presented, all of which had been applied to different datasets. As no head-to-head comparisons had been performed, attendees were left without a full understanding of the relative strengths and weaknesses of the approaches in different contexts.
We were inspired by a talk given at TSIL2016 by Barb Mellers about her and Phil Tetlock’s [Good Judgment] project, a tournament of superforecasters in which Tetlock and Mellers compared different individuals’ and groups’ approaches to prediction to better understand the process of forecasting.
**The UK Mental Healthcare System**
The UK mental healthcare system is organized around the [stepped-care] model, which arranges treatment options hierarchically. Step-1 involves contact with a general medical practitioner and may involve pharmacotherapy or “watchful waiting”. IAPT (Improving Access to Psychological Therapies) comprises low intensity (Step-2 – e.g., guided self-help, computerized CBT) and high intensity (Step-3 – e.g., 1-on-1 treatment with a clinician who specializes in a specific intervention) psychological therapies. The majority of patients treated in this IAPT sample (~80%) began in low intensity (LI) treatment. Patients can be stepped up or down at any point of their contact with IAPT services: typically, those who don’t respond to LI are “stepped up” to high intensity (HI) treatment (~15% of this sample), while an additional ~20% (in this sample) are treated at HI from the outset. LI treatments are generally briefer than HI treatments, in terms of both the number of sessions and the length of each session. Further, LI treatments are delivered by staff with post-graduate diplomas in LI therapies, whereas HI treatments are delivered by a mixture of professionals. HI treatments are therefore considerably more expensive for IAPT services to deliver than LI therapies, in terms of staff costs, overheads, and other resources.
**Goals of the Tournament**
The aim of this tournament is to test statistical approaches that could be used to improve the process by which patients are allocated to LI or HI treatment in IAPT, such that a given treatment selection model might improve patient outcomes and the efficiency of resource use within IAPT services.
**Timeline**
- December 2017 – Teams receive training data
- March 2018 – Teams submit Phase I predictive models
- April 2018 – Models evaluated in the test sample
- June 2018 – Tournament results presented and winner announced at TSIL2018
**Datasets**
The primary dataset comprises a large cohort (~N=6,000) of patients with depression and/or anxiety problems treated in an IAPT service in Leeds, England. This dataset has two sets of potential predictors: the set of variables that are routinely collected at all IAPT services across the UK, and an enriched set of variables that have been assessed specifically in the Leeds IAPT services. A second, separate sample from Cumbria IAPT (N~1,000, exact size TBD) will be used as a second held-out test sample for both standard and enriched models. A third, separate IAPT sample from London (N~7,000, exact size TBD) will also be used as a test sample for the standard-variable models, and to test the generalizability of the models. The data from all three of these test samples have never been analyzed. A final, separate validation sample (~N=30,000) will be used to test the generalizability of the winning models (standard variables only). This final sample has been used in previous work and is known to some of the tournament participants, and is thus not a true hold-out.
_Variables Routinely Collected at IAPT Sites_: Diagnosis, Age, Gender, Ethnicity, Disability Status, Employment Status, Comorbid long-term physical condition, GAD-7 (Spitzer et al. 2006), PHQ-9 (Kroenke et al. 2001), Work and Social Adjustment Scale (Mundt et al. 2002), IAPT Phobias Scales
_Enriched Variables_: Chronicity, Number of prior treatment episodes, Family history of mental health problems, Outcome Expectancy (Lutz et al. 2007), Index of Multiple Deprivation (socioeconomic status)(McLennan et al. 2011)
The primary dataset includes N=6,000 cases of **non-randomized** data. 1,000 of these cases constituted the sample analyzed by Delgadillo, Moreea and Lutz (2016). These cases, plus 3,000 cases selected at random from the remaining 5,000, will serve as the common training dataset with which the teams will develop their algorithms and allocation strategies. The remaining 2,000 cases will be held out as a test sample.
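As a minimal sketch of this split (Python; the seed and the `delgadillo_ids` placeholder are illustrative assumptions, not the actual case assignment):

```python
import random

def split_cases(all_ids, prior_sample_ids, n_extra_train=3000, seed=0):
    """Build the training set from the previously analyzed cases plus a
    random draw from the remaining cases; hold out everything else."""
    rng = random.Random(seed)  # illustrative seed, not the actual one
    remaining = sorted(set(all_ids) - set(prior_sample_ids))
    extra_train = set(rng.sample(remaining, n_extra_train))
    train = set(prior_sample_ids) | extra_train
    test = set(all_ids) - train
    return train, test

all_ids = range(6000)          # N=6,000 primary cohort
delgadillo_ids = range(1000)   # hypothetical IDs for the 1,000 prior cases
train, test = split_cases(all_ids, delgadillo_ids)
# len(train) == 4000 and len(test) == 2000
```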
After considering many different outcome metrics as well as current IAPT core metrics (please see the “Outcome metric explanation” document for more details), we developed an adapted binary outcome metric based on the three metrics currently used in IAPT (recovery, reliable change, reliable recovery). The SMART outcome variable is defined for three possible cases in the tables below. Important definitions include:
1) Recovery is defined as scoring below caseness (PHQ-9 <= 9; GAD-7 <= 7) post-treatment.
2) “Reliable change” is defined as improving by at least 6 points on the PHQ-9 and by at least 5 points on the GAD-7.
3) If a patient is classified as a “case” (PHQ-9 >= 10; GAD-7 >= 8), then a change score that equals or exceeds 50% of the pre-treatment score meets criteria for positive change on that measure.
4) “Reliable and clinically significant deterioration” is defined as moving from below caseness pre-treatment to above caseness post-treatment, with an increase of 6 or more points (PHQ-9) or 5 or more points (GAD-7).
![Outcome rule for patients who start above caseness on both PHQ9 and GAD7](http://osf.io/dx6hj/download =500x500)
![Outcome rule for patients who start above caseness only on PHQ9](http://osf.io/m9td8/download =500x500)
![Outcome rule for patients who start above caseness only on GAD7](http://osf.io/y43w6/download =500x500)
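The individual definitions can be sketched as simple threshold checks (Python; anchoring definition 3's percentage to the pre-treatment score, and the per-measure form of definition 4, are our reading of the text; the figures above give the authoritative composite rule):

```python
PHQ9_CASE, GAD7_CASE = 10, 8        # "case" thresholds
RELIABLE = {"phq9": 6, "gad7": 5}   # reliable-change thresholds (points)

def is_case(phq9, gad7):
    """A patient is a 'case' if at or above caseness on either measure."""
    return phq9 >= PHQ9_CASE or gad7 >= GAD7_CASE

def recovered(phq9_post, gad7_post):
    """Definition 1: below caseness on both measures post-treatment."""
    return phq9_post <= 9 and gad7_post <= 7

def reliable_change(pre, post, measure):
    """Definition 2: improvement of >= 6 (PHQ-9) or >= 5 (GAD-7) points."""
    return (pre - post) >= RELIABLE[measure]

def positive_change(pre, post):
    """Definition 3 (assumed reading): improvement of >= 50% of the
    pre-treatment score on that measure."""
    return pre > 0 and (pre - post) / pre >= 0.5

def deteriorated(pre, post, measure, case_threshold):
    """Definition 4 (per measure): moved from below to at-or-above caseness
    with a worsening that meets the reliable-change threshold."""
    return (pre < case_threshold and post >= case_threshold
            and (post - pre) >= RELIABLE[measure])
```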
As secondary outcomes, we will also analyze the three IAPT binary metrics (recovery, reliable change, reliable recovery).
**Data Analysis Plan**
The primary models on which teams will be evaluated will only use the standard set of variables; however, teams will be allowed to submit a secondary set of models that can use the "enriched variables". If those models indicate meaningful added predictive value from those variables, a cost-benefit analysis will be performed.
Teams will be provided with four datasets – two that include only the standard variables (one without imputed data and one with imputed data) and two that include the enriched variables (again, with and without imputed data).
Teams must submit one set of 3 algorithms (described below) that rely only on the standard variables, and a second set of 3 algorithms that can be informed by the enriched variables. This is because only the standard variables are currently collected across all IAPT services in the UK, and thus a model that required the enriched variables would not be useful for IAPT unless and until they changed assessment practices across the entire system.
Teams will be allowed to use any available methodological approach to build their predictive model, although **the first model must produce a prognostic prediction of outcome in LI. The second model must generate a prognostic prediction of outcome for those assigned directly to HI**. Prognostic predictions do not capture the expected “differential” response between LI and HI.
**The third model that teams will submit will produce a prediction of the differential benefit of HI over LI**. This prediction could come from two separate prognostic models (one in HI and one in LI), the predictions of which are subtracted, or from a single model that directly generates a differential prediction. This differential prediction must be amenable to the evaluation scheme described below (which requires a difference between *probability* of good outcome in HI vs LI, and not, for example, predicted differential benefit in terms of a continuous outcome like PHQ-9).
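Under the two-model route, the differential prediction is simply the difference of the two predicted probabilities. A minimal sketch (Python; `p_good_hi` and `p_good_li` stand in for the outputs of hypothetical fitted prognostic models):

```python
def differential_benefit(p_good_hi, p_good_li):
    """Predicted differential benefit of HI over LI, expressed as a
    difference in probabilities of a good outcome (the form the
    evaluation scheme requires)."""
    for p in (p_good_hi, p_good_li):
        if not 0.0 <= p <= 1.0:
            raise ValueError("predictions must be probabilities in [0, 1]")
    return p_good_hi - p_good_li

# A patient with a 0.62 predicted probability of a good outcome in HI and
# 0.45 in LI has a predicted differential benefit of 0.17.
```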
Where possible, all models should be built and submitted in the R software environment, although other programs may be used.
Once each team has submitted their candidate predictive model using ‘routinely collected IAPT variables’ (see Table 1), the predictive accuracy and performance of each team’s model will be tested in the test and validation datasets: (1) the held out test subset of cases in the Leeds IAPT dataset (primary); (2) the second test sample from Cumbria IAPT; (3) the third test set from London IAPT; (4) the wider validation multi-centre dataset including data from 4 other IAPT services from the Northern IAPT Practice Research Network.
Two different types of approaches will be used to evaluate the SMART tournament models. The first type of approach will focus on the accuracy of the predictions generated by each of the prognostic models. The second evaluation will focus on the accuracy of the differential predictions. These evaluations will determine the model(s) that win the SMART tournament.
One goal of the tournament is to reach a group consensus about which approach (e.g., propensity score analysis) we should use to account for the non-randomized nature of the data. One approach that has been proposed is a doubly robust weighting scheme (available as a SAS routine), which could help account for the effects of any observed predictors on treatment allocation and would allow us to evaluate the models as if the test and validation samples had been randomized to LI and HI. The primary evaluations will not rely on these approaches, but secondary analyses accounting for the non-randomized nature of the data will be presented.
*Accuracy Evaluation:* The LI and HI prognosis models will be evaluated using Brier scores (primary outcome), the deviance statistic (secondary), and ROC curves (with AUC, as the tertiary outcome). The differential models will be evaluated in the following way: within the test and validation samples, patients will be ordered, for each model, by the predicted differential benefit (from the smallest predicted benefit of HI over LI to the largest). Then, a sliding window will be used to calculate the observed differential benefit of HI over LI at each point along that spectrum. The accuracy of these models will be calculated by comparing the predicted differential benefit at each point to the observed differential benefit. The quality of each model will be estimated by treating the observed percentage advantage of HI as the dependent variable in a simple regression, with model-predicted advantage as the predictor (see example figure below). The model that yields the largest t-statistic associated with the slope estimate will win.
![Hypothetical Evaluation of Differential Model](https://osf.io/v78qr/download =500x500)
As with the prognostic models, secondary analyses accounting for the non-randomized nature of the data will be presented.
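Both evaluations can be sketched in a few lines (Python; the window width, the step size, and the ordinary-least-squares slope are illustrative assumptions, not choices the protocol fixes here):

```python
from statistics import mean

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and
    binary outcomes (lower is better)."""
    return mean((p - y) ** 2 for p, y in zip(probs, outcomes))

def sliding_window_points(patients, window=100):
    """Order patients by predicted differential benefit of HI over LI, then
    pair each window's mean predicted benefit with the observed HI-minus-LI
    difference in good-outcome rates. `patients` holds tuples of
    (predicted_benefit, treatment, outcome), treatment in {"HI", "LI"}."""
    ordered = sorted(patients, key=lambda p: p[0])
    points = []
    for start in range(len(ordered) - window + 1):
        chunk = ordered[start:start + window]
        hi = [y for b, t, y in chunk if t == "HI"]
        li = [y for b, t, y in chunk if t == "LI"]
        if hi and li:
            points.append((mean(b for b, t, y in chunk),
                           mean(hi) - mean(li)))
    return points

def ols_slope(points):
    """Slope of observed benefit regressed on predicted benefit; the winning
    model is the one whose slope has the largest t-statistic."""
    xs = [x for x, y in points]
    ys = [y for x, y in points]
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in points)
            / sum((x - mx) ** 2 for x in xs))
```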
*Tertiary analyses:* Exploratory analyses will evaluate continuous symptom measure outcomes (PHQ9 and GAD7 at post-treatment and as change scores), dropout rates, treatment dosage / duration, and final treatment pathway outcomes (which would differ for those who were stepped up from LI to HI).
*Generalizability analyses:* As the first hold-out (Leeds) was randomly selected from the same dataset as the training sample, we will be able to compare each model's performance in the Leeds test sample with its performance in the other test samples (London and Cumbria) to see whether a model built in a sample from one IAPT service can generalize to samples drawn from services in other regions that will differ in important ways (see Clark, 2018 Annual Rev Clin Psych and Clark et al., 2018 Lancet). We will also be able to use the fourth sample (PRN validation sample), which contains data from Leeds as well as from IAPT services in nearby regions, to see how each model's performance varies over time in the same region.
*Planned lines of post-hoc investigation:* We are interested in comparisons of predictions made by relatively simple versus more complex modeling approaches, so we plan to explore the differences in predictive power for approaches of varying levels of complexity, with the aim of addressing issues related to implementation (e.g., user familiarity, communication barriers, interpretability of models and user trust in their predictions, etc.).
We are also interested in looking at the consistency of predictions made for different individuals across each team's models. For example, will some test subjects have low variability (i.e., high consistency) in their predictions across all teams' models, while others show high heterogeneity? What features might be associated with this variability (e.g., are outcomes predicted more consistently for patients with high baseline severity, or for patients with certain diagnoses)? And are the predictions generally more accurate for subjects with consistent predictions across all teams' models?
Another question we will explore is whether averaging the predictions from all teams, or a subset of the highest performing teams, would generate superior predictions in new test samples compared to using only the highest performing model's predictions.
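A minimal sketch of the pooling and consistency ideas (Python; the unweighted mean and the max-minus-min spread are illustrative choices among several reasonable ones):

```python
from statistics import mean

def ensemble_predictions(team_predictions):
    """Average each patient's predicted probability across teams.
    `team_predictions` is one list per team, aligned so that index i
    refers to the same patient in every list."""
    return [mean(per_patient) for per_patient in zip(*team_predictions)]

def prediction_spread(team_predictions):
    """Per-patient spread (max minus min) of predictions across teams:
    a simple index of between-team consistency."""
    return [max(p) - min(p) for p in zip(*team_predictions)]
```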
**Dissemination of Results**
All models and model evaluations will be made publicly available through peer-reviewed publication, as well as on the OSF project page and the SMART tournament website. Associated R code for most teams will also be made publicly and freely available for download (teams had the option of requesting that their code not be shared beyond the tournament participants). Although we do not have permission to share the data, we believe the R code and associated discussion of the different modeling approaches will provide a valuable resource to the research community. Each team will respond to a questionnaire (see project document titled "SMART tournament evaluation explanation.docx") that will set the foundation for discussion and dissemination of the different approaches and tournament results.
**Teams**
Team 1) [Rob DeRubeis], Jack Keefe, Colin Xu, and Thomas Kim (University of Pennsylvania)
Team 2) [Adam Kapelner], Alina Levine (Queens College in New York City) and [Joshua Wiley] (Monash University)
Team 3) [Adam Chekroud], Chief Scientist, [Abhishek Chandra], Chief Technology Officer, [Ralitza Gueorguieva], Chair of Biostatistics, [John Krystal], Chair of Psychiatry, [Kevin Anderson], PhD Candidate, [Thomas O'Connell], PhD Candidate, [Stefan Uddenberg], PhD Candidate, [Yoonho Chung], Postdoctoral Associate, [Gianfilippo Coppola], Assistant Professor, [Michael Lopez-Brau], PhD Candidate (Yale University, Spring Health)
Team 4) [Wolfgang Lutz], Julian Rubel, Anne-Katharina Deisenhofer, Brian Schwartz, Björn Bennemann, (Universität Trier), [Aaron Fisher] (University of California, Berkeley)
Team 5) [Steve Pilling], [Rob Saunders] and [Joshua Buckman] (University College London)
Team 6) [Jaime Delgadillo] and [Michael Barkham] (University of Sheffield)
Team 7) [Ronald Kessler], [Ekaterina Sadikova] (Harvard Medical School), and [Alex Luedtke] (Fred Hutchinson Cancer Research Center).
Team 8) [Jasper Smits], [Jason Shumake], [Christopher Beevers], [Derek Pisner], [Santiago Papini] (University of Texas at Austin)
Team 9) [Andrea Niles] (University of California San Francisco)
Team 10) [Cynthia Fu] (University of East London), [Christos Davatzikos] and [Yong Fan] (University of Pennsylvania)
Team 11) [Clarissa Bauer-Staeb], [Katherine Button], [Catherine Barnaby], [Julian Faraway] (University of Bath)
Team 12) [Marjolein Fokkema] (Leiden University), [Miranda Wolpert], Elisa Napoleone and Julian Edbrooke-Childs (Anna Freud Center).
Team 13) [David Benrimoh], [Robert Fratila], [Matthew Krause] ([aifred health], McGill University)
We are grateful that David Clark, Steve Pilling, and Michael Barkham have agreed to act as advisors to the SMART tournament. Their insights into IAPT will help maximize the potential of the project by ensuring that the structure and goals of the tournament are aligned with the real-world context and needs of IAPT services, clinicians, and service-users.