Calibrating the Scientific Ecosystem Through Meta-Research

While some scientists study insects, molecules, brains, or clouds, other scientists study science itself. Meta-research, or research-on-research, is a burgeoning discipline that investigates efficiency, quality, and bias in the scientific ecosystem, topics that have become especially relevant amid widespread concerns about the credibility of the scientific literature. Meta-research may help calibrate the scientific ecosystem toward higher standards by providing empirical evidence that informs the iterative generation and refinement of reform initiatives. We organize these efforts into four broad areas: (a) identifying problems, (b) investigating problems, (c) developing solutions, and (d) evaluating solutions. In each of these areas, we review key meta-research endeavors and discuss several examples of prior and ongoing work. The scientific ecosystem is perpetually evolving; the discipline of meta-research presents an opportunity to use empirical evidence to guide its development and maximize its potential.


INTRODUCTION
Meta-research (research-on-research) is a burgeoning discipline that leverages theoretical, observational, and experimental approaches to investigate quality, bias, and efficiency as research unfolds in a complex and evolving scientific ecosystem (Ioannidis 2018a). Meta-research is becoming especially relevant amid warnings of a transdisciplinary "credibility crisis" and concerns of suboptimal and wasteful applications of the scientific method (Altman 1994, Baker 2016, Chalmers & Glasziou 2009, Ioannidis 2005, Leamer 1983, Pashler & Wagenmakers 2012, Open Sci. Collab. 2015). These concerns have inspired reform initiatives intended to achieve higher standards of efficiency, quality, and credibility in science (Ioannidis 2014, Miguel et al. 2014, Munafò et al. 2017, Nelson et al. 2018, Poldrack et al. 2017).

Defining Meta-Research
Meta-research has been defined as "the study of research itself: its methods, reporting, reproducibility, evaluation, and incentives" (Ioannidis 2018a, p. 1). Given this definition, its boundaries are broad, and its thematic areas interact with several other disciplines. Scientific fields are fuzzy categories with flexible, porous, overlapping borders (Börner et al. 2012). Neighboring disciplines to meta-research include, but are not limited to, philosophy of science, history of science, sociology, research synthesis (e.g., meta-analysis), data science, journalology, bibliometrics, ethics, behavioral economics, and evidence-based medicine. All of these fields may to some extent share the goals of describing, evaluating, and improving the scientific ecosystem.

Historical Roots
Meta-research has deep roots in the beginnings of the scientific method, when intellectuals such as Francis Bacon argued for greater experimentation, openness, and collaboration (Sargent 1999). These early efforts to scrutinize and modify the scientific ecosystem were still largely based on philosophical considerations rather than systematic empirical research. Over the past century, concerns about research quality have repeatedly flared across scientific disciplines, including psychology (Elms 1975), economics (Leamer 1983), and biomedicine (Altman 1994). In parallel, systematic investigations of topics such as publication bias (Sterling 1959), experimenter bias (Rosenthal 1966), and statistical power (Cohen 1962) have reflected a growing shift toward empiricism: an acknowledgment that mostly theoretical arguments about research practices, methods, and bias should eventually be confronted with empirical data (Faust & Meehl 2002). Initiatives such as the Cochrane Collaboration in the domain of evidence-based medicine have achieved some success at addressing suboptimal research quality; however, overall, reform efforts have often failed to gain traction. Nevertheless, the recent credibility crisis has sparked a transdisciplinary discussion (Chalmers & Glasziou 2009, Miguel et al. 2014, Nosek et al. 2015, Poldrack et al. 2017), prompted a cascade of reform initiatives (Ioannidis 2014, Munafò et al. 2017), and catalyzed the emergence of the meta-research discipline.


IDENTIFYING PROBLEMS
At this stage, researchers identify potential problems in the scientific ecosystem, typically using theoretical arguments, modeling, simulations, or early empirical data (e.g., evaluation of studies suffering from these problems). A central challenge is the sheer complexity of the scientific ecosystem. Multiple stakeholders, infrastructures, and processes push and pull in different directions, interact, and evolve. Any individual problem, even if it can be reasonably well described, forms part of a complex causal network of interleaved factors. Delineating symptoms versus causes is typically not straightforward. For example, one assessment mapped 235 different biases in the biomedical literature (Chavalarias & Ioannidis 2010), and biases may manifest differently or have different prevalence and consequences across distinct scientific fields (Fanelli et al. 2017, Goodman 2019). Other discipline-specific characteristics, such as the ratio of true (non-null) to absent (null) relationships among those relationships under scrutiny, and design considerations, such as statistical power and flexibility in design and analysis decisions, may translate to very different chances of getting correct or wrong answers (Ioannidis 2005). In this section, we discuss some salient problems that have been proposed, debated, and studied.

Incentives and Norms
Many problems may arise in the scientific ecosystem due to a fundamental misalignment of scientific ideals and incentive structures. While there is no doctrine defining a set of scientific ideals, the sociologist Robert Merton suggested that scientists share a set of informal cultural norms (Merton 1973): universalism (researchers should evaluate claims based on the evidence rather than irrelevant personal characteristics such as ethnicity, nationality, gender, or professional affiliation); communalism (scientific methods and results belong to the entire scientific community); disinterestedness (science should be free from personal, monetary, and other biases); and organized skepticism (researchers should engage in impersonal critical scrutiny). The extent to which these norms accurately describe scientists' behaviors and beliefs has been challenged (Mulkay 1976), although some survey evidence suggests that many working scientists subscribe to them (Anderson et al. 2010).
Several authors have asserted that key stakeholders in the scientific ecosystem are acting counter to these norms by exerting a preference for scientific findings with certain aesthetic qualities at the expense of authenticity (Giner-Sorolla 2012, Nosek et al. 2012). Specifically, stakeholders such as universities, journals, reviewers, and funders may prefer newsworthy, positive, or clean findings over incremental, negative, or messy findings. Because they exert considerable influence over how research is performed and evaluated, these stakeholders could be creating selection pressures that affect the quality and veracity of research.
Several modeling and simulation efforts have explored the consequences of incentive structures by mimicking the workings of the reward system in science (Bakker et al. 2012, Grimes et al. 2018, Higginson & Munafò 2016, Smaldino & McElreath 2016). For example, Grimes et al. (2018) showed how the emphasis on positive results in top-tier journals can undermine the trustworthiness of scientific findings. Conversely, trustworthiness improved when journals were agnostic as to whether a result was positive or not. The authors' model also suggested that a decrease of allocated funding amplifies competition between scientists, giving rise to an environment in which false-positive results are actively rewarded, thereby further decreasing trustworthiness. Smaldino & McElreath (2016) designed a dynamic model of competing research labs to demonstrate the "natural selection of bad science." Successful labs pursue novel, positive results by selecting suboptimal methods that can deliver such results in large quantities. Ultimately these labs receive higher payoffs (e.g., citations, prestige, funding), allowing them to reproduce at a higher rate and replicate through many offspring labs, perpetuating suboptimal methods in the scientific ecosystem. Higginson & Munafò (2016) extended this framework to research designs and estimated that a publish-or-perish culture contributes to designs with inadequate statistical power and high false-positive rates.
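To make the selection dynamic concrete, the following is a deliberately minimal sketch in the spirit of these models rather than a reimplementation of any of them; the number of labs, the payoff rule, and the mapping from "effort" to false-positive rate are hypothetical choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Drastically simplified, hypothetical selection model: labs that tolerate more
# false positives publish more "positive" findings, are copied more often, and
# so low-rigor practices spread even though no lab behaves dishonestly.
N_LABS, GENERATIONS, EXPERIMENTS = 100, 500, 20
BASE_RATE = 0.1   # share of tested hypotheses that are actually true
POWER = 0.8       # probability of detecting a true effect

def false_positive_rate(effort):
    # Lower methodological "effort" inflates the false-positive rate.
    return 0.05 + 0.30 * (1.0 - effort)

effort = rng.uniform(0.2, 1.0, N_LABS)
print(f"Initial mean false-positive rate: {false_positive_rate(effort).mean():.2f}")

for _ in range(GENERATIONS):
    alpha = false_positive_rate(effort)
    is_true = rng.random((N_LABS, EXPERIMENTS)) < BASE_RATE
    positive = np.where(is_true,
                        rng.random((N_LABS, EXPERIMENTS)) < POWER,
                        rng.random((N_LABS, EXPERIMENTS)) < alpha[:, None])
    payoff = positive.sum(axis=1)  # only "positive" results are rewarded

    # Selection: the least productive lab is replaced by a noisy copy of the most productive one.
    worst, best = payoff.argmin(), payoff.argmax()
    effort[worst] = np.clip(effort[best] + rng.normal(0.0, 0.02), 0.05, 1.0)

# The mean false-positive rate drifts upward as low-rigor labs are copied.
print(f"Final mean false-positive rate:   {false_positive_rate(effort).mean():.2f}")
```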

Lack of Transparency
The importance of transparency has a long historical precedent: the world's oldest scientific institution, the Royal Society, has the motto nullius in verba, "take nobody's word for it." Transparency is also regarded by some as an ethical imperative. For example, the Declaration of Helsinki on ethical principles for clinical research states that researchers have an ethical obligation to publish and disseminate complete and accurate reports of their research (World Med. Assoc. 2013). A lack of transparency in the conduct, reporting, and dissemination of research may undermine trust in science (Vazire 2017), waste resources (Chalmers & Glasziou 2009), and disrupt self-correction mechanisms (Ioannidis 2012).
One major concern, publication bias, refers to the phenomenon whereby scientific findings are selectively published based on various aesthetic characteristics of the results, such as being positive or newsworthy (Dwan et al. 2013). Publication bias can emerge through multiple mechanisms. For example, it could be that journals (or reviewers) express a preference for certain types of scientific findings and selectively publish findings that best match those preferences. Alternatively or additionally, researchers may not submit findings with certain characteristics for publication (the file drawer effect; Rosenthal 1979). Publication bias can also emerge from multiple sources, including selective reporting of entire studies or experiments, individual experimental conditions, specific measured outcomes, and outcomes arising from particular analyses (Phillips 2004). The consequence is that publications in academic journals fail to capture all of the findings generated by the scientific enterprise, thus providing a skewed impression of the evidentiary landscape.
Even when research findings are reported, they can be undermined by a lack of transparency about how they were generated. Much activity in the scientific ecosystem centers on research articles as a principal commodity (Young et al. 2008); however, a research article is an incomplete snapshot that cannot fully capture the rich network of research resources (e.g., protocols, materials, raw data, analysis scripts) that provide a more direct account of the research process and output. Being able to access research resources can facilitate independent verification and evidence synthesis, and it may promote new discoveries. For example, access to raw data could enable reanalyses that probe the reproducibility and robustness of the original findings, facilitate more sophisticated forms of meta-analysis (Tierney et al. 2015), or generate new insights through the application of novel techniques or merging with other data sets (Voytek 2016). Occasionally, overriding ethical or legal concerns may limit transparency (Meyer 2018). In such cases, explicitly declaring such negative constraints on sharing should be a minimum expectation (Morey et al. 2016).
Without transparency it can be unclear what the original research hypothesis was, what the raw data looked like, how many studies or analyses were attempted, and how many unattractive results were disregarded. This information is indispensable for properly appraising research.

Statistical Schools of Thought and Statistical Misuse
Most scientific claims are reinforced by a scaffold of statistical analyses that support inductive inferences from samples of data. There are multiple approaches to statistical inference, including Bayesian and likelihood-based, but frequentist inference is the most prevalent (Chavalarias et al. 2016). Deep philosophical rifts exist between these schools of thought (Mayo 2018). Frequentist inference is often used in the form of null-hypothesis significance testing (NHST), a hybrid of two different statistical schools of thought (Gigerenzer 2004, Goodman 1993) that is highly prone to misinterpretation and misuse (Szucs & Ioannidis 2017b, Wagenmakers 2007). Regardless of the statistical paradigm employed, all statistical analyses have the potential to be misused. There are many researcher's degrees of freedom in data analysis and interpretation: just 10 binary analytic choices result in 2^10 = 1,024 unique analysis specifications (Gelman & Loken 2014). This flexibility can lead to large vibration of effects whereby many different results can be obtained from the same data and research question; results from the same study may often point in opposite directions (the so-called Janus phenomenon) (Ioannidis 2008, Patel et al. 2015). This situation can easily be exploited, whether intentionally or not, to extract more desirable results (see Section 2.1) from any given data set.
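To illustrate the combinatorics, here is a minimal sketch; the specific analytic choices listed are hypothetical, but any ten binary decisions multiply out the same way.

```python
from itertools import product

# Ten hypothetical binary analytic choices (for illustration only; any study has its own).
choices = {
    "exclude_outliers":      (False, True),
    "log_transform_outcome": (False, True),
    "covariate_age":         (False, True),
    "covariate_gender":      (False, True),
    "parametric_test":       (False, True),
    "pool_across_sites":     (False, True),
    "winsorize":             (False, True),
    "include_followup_wave": (False, True),
    "per_protocol_sample":   (False, True),
    "cluster_robust_errors": (False, True),
}

# Every combination of choices defines a distinct analysis specification of the same data set.
specifications = list(product(*choices.values()))
print(len(specifications))  # 2**10 = 1024
```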
Making analytic decisions in a data-dependent manner without using appropriate corrections generates more false positives (Simmons et al. 2011). Data-dependent activities collectively known as p-hacking include deciding when to stop data collection, dropping outliers, selecting covariates, or inappropriately rounding p-values based on whether those actions shift the results toward statistical significance. Analytic flexibility varies between domains and partly depends on the degree of standardization of data processing and analytic approaches. In the field of neuroimaging, for example, there is enormous flexibility in the data processing pipeline (Carp 2012), a situation that enabled one research team to detect apparent signatures of brain activity in a dead Atlantic salmon (Bennett et al. 2009).
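As a simple illustration of one such practice, the following simulation (not taken from Simmons et al. 2011; the batch sizes and stopping rule are arbitrary choices) repeatedly tests accumulating data under a true null hypothesis and stops as soon as p < 0.05, which inflates the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping_significant(n_start=20, n_max=100, step=10, alpha=0.05):
    """Collect data under a true null, testing after every batch and stopping
    as soon as p < alpha. Returns True if 'significance' was ever reached."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while True:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True
        if len(a) >= n_max:
            return False
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))

n_sims = 2000
fp_rate = np.mean([optional_stopping_significant() for _ in range(n_sims)])
# Typically well above the nominal 5%, despite every individual test being "valid."
print(f"False-positive rate with optional stopping: {fp_rate:.2f} (nominal alpha = 0.05)")
```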
A related phenomenon that can compound p-hacking is opaque HARKing (hypothesizing after results are known): presenting a finding as if it were hypothesized all along, thus adding false confidence in its validity (Kerr 1998, O'Boyle et al. 2013). HARKing adds further degrees of freedom to the analysis process, enabling p-hacked findings to be explained convincingly by tailor-made hypotheses.

Reproducibility
Use of the term "reproducibility" and related terms such as "replicability" and "repeatability" can vary across fields (Barba 2018). Goodman et al. (2016) proposed one framework that delineates methods reproducibility (obtaining similar results given the same data and analytical tools; often called analytic reproducibility or computational reproducibility), results reproducibility (obtaining similar results given the same analytical and experimental tools but new data; often called replication), and inferential reproducibility (drawing qualitatively similar inferences from an independent methods or results reproduction of a study).
Reproducibility is a core tenet of the scientific method: If one researcher performs a study and makes a claim, a second researcher should be able to repeat the original methods and obtain similar results. Repeating original analyses with the raw data should enable recovering the originally reported findings. However, in the context of stochastic phenomena, we should expect the findings of replication studies to differ to some extent from original studies (Stanley & Spence 2014). It is also unclear how (or if) one should compare the results of two studies (an original and a replication) and conclude whether one was successful at replicating the other (Goodman et al. 2016, Nosek & Errington 2017). Some replication attempts have sparked heated debates that often result in further nonreproducibility of inferences: Despite examining the same results, researchers can disagree about what they mean (Ioannidis 2017).

INVESTIGATING PROBLEMS
At this stage, researchers conduct more in-depth empirical investigations to examine the prevalence and severity of problems. Investigations may involve meta-epidemiological assessments of potential bias, the impact of study characteristics on observed effects, the distribution of research evidence in different settings, or quantification of heterogeneity (Murad & Wang 2017). Most studies adopt retrospective observational designs and typically involve manual examination of published research articles. A serious challenge is that low transparency may undermine efforts to systematically study other problems. Consequently, some meta-epidemiological studies rely on surrogates that assess whether the pattern of published results is compatible with the theoretical impact of some particular bias. Because a number of benign factors may also contribute to such patterns, such surrogates are approximate/imperfect indicators of bias.

Incentive Structures
In evaluations for hiring, promotion, and tenure, many institutions use simple metrics, such as the journal impact factor, which are known to be problematic (Moher et al. 2018). Survey evidence suggests that many scientists view number of publications, journal ranking, and authorship order as being directly associated with performance assessment and career promotions (van Dalen & Henkens 2012, Walker et al. 2010). In some countries, including China, South Korea, and Turkey, cash rewards are offered for publishing in top-tier journals (Franzoni et al. 2011). Another strong incentive is to claim novelty, which many biomedical researchers do even when this is demonstrably false. In a study of 1,101 randomized controlled trials (RCTs) with 5 or more preceding trials combined in a meta-analysis, 46% cited only 0 or 1 prior trial on the subject, a percentage that increased when there were more prior trials to cite (Robinson & Goodman 2011). Such incentives may hinder research quality: researchers in competitive environments with greater publication pressure are more likely to report statistically significant results, a pattern that could be indicative of biased reporting (Fanelli 2010). The need to align incentive structures with good scientific practice is now widely recognized. For example, in one study, about 80% of surveyed scientists thought that incentivizing better research practices would improve reproducibility (Baker 2016).

Publication Bias and Selective Reporting
Empirical investigation of publication bias is challenging because unpublished studies and results are difficult to unearth. One approach is to seek out signals of publication bias in the published literature. For example, Fanelli (2011) manually examined a sample of 4,656 articles across scientific disciplines and observed an overwhelming frequency of positive (i.e., statistically significant) findings (see also Sterling 1959). Similarly, text-mining extraction of p-values from MEDLINE abstracts and full-text articles in PubMed Central showed that 96% of abstracts and full-text articles claimed significant results with p-values <0.05 (Chavalarias et al. 2016). This is simply too good to be true; it is implausible that scientists are routinely testing hypotheses that so frequently turn out to be accurate, especially when statistical power is typically very low (Ioannidis & Trikalinos 2007).
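A back-of-the-envelope calculation shows why such a high rate of positive findings is implausible: even under generous hypothetical assumptions about average power and the proportion of tested hypotheses that are true, the expected share of significant results falls far short of what is observed.

```python
# Expected share of statistically significant results under hypothetical,
# deliberately optimistic assumptions (illustrative only).
alpha = 0.05        # type I error rate
power = 0.80        # optimistic average power (empirical estimates are often far lower)
prior_true = 0.50   # optimistic share of tested hypotheses that are actually true

p_significant = prior_true * power + (1 - prior_true) * alpha
print(f"Expected proportion significant: {p_significant:.3f}")  # 0.425, far below the ~96% observed
```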
More direct evidence for publication bias has arisen from comparing public records, such as dissertations or study registries, with the published literature. For example, in the domain of management research, O'Boyle and colleagues (2013) found that the ratio of supported to unsupported hypotheses was more than twice as high in a corpus of published articles when compared to corresponding student dissertations. Similarly, Franco and colleagues (2014) capitalized on an institutional rule that required questionnaires and data underlying a series of psychology studies to be made publicly available. By comparing these materials to published research articles, they observed that about 40% of articles did not report all experimental conditions, about 70% of articles did not report all outcome variables, and reported effect sizes were about twice as large as unreported effect sizes (also see Franco et al. 2016).

In medicine, researchers have used study protocols available in ethics board documentation or study registries to identify selective reporting of studies and outcomes in the published literature (Chan et al. 2004, Dechartres et al. 2017, Dickersin et al. 1992, Dwan et al. 2011, Easterbrook et al. 1991, Goldacre et al. 2019, Ross et al. 2012). For example, a systematic review of studies comparing those sources of information with their corresponding trial reports highlighted recurrent discrepancies (Dwan et al. 2014).

Transparency of Research Resources
Assessment of articles published across multiple scientific domains suggests minimal availability of critical research resources such as raw data, protocols, materials, and analysis scripts (Alsheikh-Ali et al. 2011, Hardwicke et al. 2019b). Wallach et al. (2018) assessed a random sample of 149 articles published in the biomedical domain between 2015 and 2017 and found that 19 articles had data availability statements, 31 articles had materials availability statements, one article shared a full protocol, and no articles shared analysis scripts. Data sharing may have improved recently in some domains, but there is still much room for improvement (compare Iqbal et al. 2016). Attempts to request research resources directly from researchers, particularly raw data, are often unsuccessful (Rowhani-Farid & Barnett 2016, Vanpaemel et al. 2015, Vines et al. 2014, Wicherts et al. 2006), even for some of the most influential studies (Hardwicke & Ioannidis 2018b). For example, Hardwicke & Ioannidis contacted the authors of 111 highly cited studies published in psychology and psychiatry between 2006 and 2016 and asked if they would be willing to share the associated raw data. Only 15 data sets (14%) were made available in a completely unrestricted form, and ultimately data from 76 studies (68%) were not made available in any form. Naudet et al. (2018) were more successful and obtained 17 data sets from a sample of 37 (46%) RCTs published in the BMJ and PLOS Medicine for the purpose of reanalysis.
More extensive assessment, continuous monitoring, and evaluation of research resource transparency are limited by the time-consuming nature of this type of research, which typically requires manual data extraction and coding. It is possible that computational tools can be developed to automatically extract similar information; however, the performance of such tools will need careful assessment to ensure reasonable sensitivity and specificity.

Suboptimal Research Design
Suboptimal research design may produce misleading results, wasting already scarce resources. Numerous biases have been shown to affect the scientific literature (Fanelli et al. 2017). Design deficiencies differ from one scientific field to another, but some patterns are highly prevalent across fields. For example, lack of sufficient statistical power to detect a range of plausible effect sizes, which makes it difficult to detect small effects reliably, is common across many different disciplines (Button et al. 2013, Cohen 1962, Moher et al. 1994, Sedlmeier & Gigerenzer 1989, Smaldino & McElreath 2016, Szucs & Ioannidis 2017a). Furthermore, small studies tend to generate more heterogeneous results (IntHout et al. 2015). For example, a meta-epidemiological study assessing 85,002 forest plots from the Cochrane Database of Systematic Reviews showed that most large treatment effects originated from small studies, and when aggregated with other studies, pooled treatment effects tended to be smaller (Pereira et al. 2012).
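A small Monte Carlo sketch (with a hypothetical effect size and sample sizes, not drawn from any particular study) illustrates how severely a typical small study can be underpowered for a small effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulated_power(effect_size, n_per_group, alpha=0.05, n_sims=5000):
    """Monte Carlo power estimate for a two-sample t-test (standardized effect size)."""
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n_per_group)
        b = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A "small" standardized effect (d = 0.2) with 20 participants per group:
print(simulated_power(0.2, 20))   # roughly 0.09-0.10, i.e., ~10% power
# Reaching ~80% power for d = 0.2 requires roughly 394 participants per group:
print(simulated_power(0.2, 394))  # roughly 0.80
```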
Other design considerations have also been associated with effect magnitude. For example, exaggerated treatment estimates are observed (on average) in observational studies compared to RCTs (Hemkens et al. 2016), surrogate outcomes compared to patient-relevant outcomes (Ciani et al. 2013), single-center compared to multicenter clinical trials (Dechartres et al. 2011), and studies with inadequate allocation concealment, random-sequence generation, and blinding (Page et al. 2016).

Statistical Misuse
Intentional or unintentional misuse of statistics may occur during the selection, implementation, reporting, and interpretation of statistical analyses (Table 1). For example, a recent meta-epidemiological survey found that most RCTs with subgroup claims did not adjust for multiple testing, did not use an appropriate test of interaction, and were rarely validated (Wallach et al. 2017). Similarly, a study of 157 neuroscience articles found that 79 made interaction claims but did not appropriately test for one (Nieuwenhuis et al. 2011); instead, they inferred an interaction when the outcome in one group was statistically significant and the outcome in the other was not, a known statistical fallacy (Gelman & Stern 2006). The causes of statistical misuse are multifaceted. The misapplication of statistical tools may reflect a response to a flawed incentive structure (Section 2.1) or may be due to a lack of understanding and/or poor training, leading to repeated use of mindless statistical rituals (Gigerenzer 2004).
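The fallacy identified by Gelman & Stern (2006) can be illustrated with two hypothetical, independent estimates of the same effect (the numbers below are invented for illustration): one estimate is statistically significant and the other is not, yet the difference between them is far from significant.

```python
import numpy as np
from scipy import stats

# Two independent experiments estimating the same effect; hypothetical numbers
# in the spirit of the example discussed by Gelman & Stern (2006).
est_a, se_a = 25.0, 10.0   # z = 2.5 -> "statistically significant"
est_b, se_b = 10.0, 10.0   # z = 1.0 -> "not statistically significant"

def two_sided_p(estimate, se):
    return 2 * stats.norm.sf(abs(estimate / se))

print(f"Experiment A: p = {two_sided_p(est_a, se_a):.3f}")  # ~0.012
print(f"Experiment B: p = {two_sided_p(est_b, se_b):.3f}")  # ~0.317

# The relevant comparison is the difference between the two estimates:
diff = est_a - est_b
se_diff = np.sqrt(se_a**2 + se_b**2)
print(f"Difference A - B: p = {two_sided_p(diff, se_diff):.3f}")  # ~0.29, not significant
```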

Reproducibility
Replication studies remain uncommon in many fields (Hardwicke et al. 2019b, Iqbal et al. 2016, Makel et al. 2012, Sterling 1959). In recent years, however, a spate of high-profile replication attempts in psychology (Open Sci. Collab. 2015, Yong 2012) and industry-based preclinical research (Begley & Ellis 2012, Prinz et al. 2011) has generated serious concern about a transdisciplinary reproducibility crisis (Baker 2016). A series of multilaboratory efforts adopting high transparency standards and (typically) relatively large sample sizes have been deployed to investigate replicability across several fields. The Reproducibility Project: Psychology (RPP), for example, set out to replicate 100 studies published in high-profile journals. The project reported several indices of replication success; for example, although 97 of the original studies had statistically significant results (p < 0.05), only 36 of the replications did so. When the outcome of a replication study appears to contradict previous evidence, it is important to consider several factors, such as the track record of the theory under scrutiny and the fidelity of the replication attempt. Some research has explored whether prediction markets can be used to estimate replication success (see the sidebar titled Estimating Replication Success Using Prediction Markets and Surveys).
One fairly consistent pattern that has emerged from the RPP and subsequent large-scale replication studies in the social sciences is that effect sizes observed in the replication studies have been on average approximately half as large as those reported in the original studies (Camerer et al. 2016, 2018), consistent with the idea that most published effects are inflated by selective reporting and other biases (Ioannidis 2005, 2008). A Bayesian analysis of the RPP also highlighted that the evidential value of many of the original studies (and some of the replication studies) was too weak to support robust inferences (Etz & Vandekerckhove 2016).
Several studies have also investigated whether published findings can be recovered by repeating the original analysis on the raw data (i.e., analytic reproducibility; e.g., Hardwicke et al. 2018, Stodden et al. 2018). For example, Hardwicke et al. (2018) encountered at least one nonreproducible value in 24 out of 35 published psychology articles, often due to ambiguous, incomplete, or incorrect specification of the original analyses or mismanagement of data files. Some issues were resolved after the original authors provided (previously unreported) information that enabled reproducibility. Importantly, there was no clear evidence that the conclusions of the original studies had been undermined. Generally, studies of analytic reproducibility have highlighted that basic human error is common, suggesting that greater attention should be paid to quality control systems and use of software tools that enable writing reproducible scientific papers (Marwick et al. 2017).

Table 1 Examples of statistical misuse

Selection issues | Illustrative references
Circular analysis (e.g., attempting to correlate brain activity measures with personality measures after selecting from the former only data that have surpassed a threshold; also known as double dipping) | Fiedler 2011, Vul et al. 2009

Implementation issues | Illustrative references
Failure to account for multiplicity (e.g., multiple comparisons, optional stopping) | Armitage et al. 1969, Cramer et al. 2016, John et al. 2012, Strasak et al. 2007, Wallach et al. 2017
Exploiting flexibility in analysis decisions in order to obtain more favorable outcomes

Reporting issues | Illustrative references
Incomplete reporting of outcomes (e.g., not reporting effect sizes, interval estimates, or standard deviations) | Counsell & Harlow 2017, Cumming et al. 2007
Distorted presentation of nonsignificant results (e.g., spins in conclusions) | Boutron et al. 2010
Selective reporting (e.g., only reporting experiments, outcomes, or analyses that achieved statistical significance) | Chan et al. 2004; Dwan et al. 2011, 2013, 2014; Easterbrook et al. 1991; Franco et al. 2014, 2016; Goldacre et al. 2019; John et al. 2012
Presenting post hoc hypotheses as if they were specified a priori (opaque HARKing) | John et al. 2012, Kerr 1998

Interpretation issues | Illustrative references
Incorrectly assuming that a nonsignificant outcome means that there is no effect | Fidler et al. 2006, Hoekstra et al. 2006, Schatz et al. 2005, Sedlmeier & Gigerenzer 1989
Assuming that the difference between significant and not significant is itself significant, or, relatedly, erroneous analysis of interactions | Gelman & Stern 2006, Nieuwenhuis et al. 2011, Wallach et al. 2017

Based on table 1 in Hardwicke et al. (2019a). Abbreviation: HARKing, hypothesizing after results are known.


ESTIMATING REPLICATION SUCCESS USING PREDICTION MARKETS AND SURVEYS
Prediction markets elicit group beliefs about replication success by having participants place monetary bets (Camerer et al. 2016, 2018; Dreber et al. 2015; Forsell et al. 2019). Dreber et al. (2015) found that prediction markets (71%) outperformed premarket surveys (58%) when participants were asked to predict replication success for 44 studies from the RPP. However, subsequent studies evaluating social science studies observed that market beliefs were not more accurate than surveys (Camerer et al. 2016, 2018). A recent study (Forsell et al. 2019) of the Many Labs 2 replication project (R.A. Klein et al. 2018) reported that performance depended on the type of replication outcomes being predicted. Prediction markets more accurately predicted replication significance (i.e., a statistically significant effect in the same direction as the original study), whereas surveys more accurately predicted replication effect sizes. The relatively low cost and rapid results of prediction markets and surveys make them attractive tools for predicting replication outcomes. However, in existing assessments, forecasters had prior information about the studies under scrutiny. Thus, a future challenge for market predictions will be their performance with new, unfamiliar studies.

DEVELOPING SOLUTIONS
During this stage, researchers and other stakeholders, such as universities, funders, and journals, attempt to develop and implement solutions to problems delineated in previous stages in order to improve the efficiency, quality, and credibility of scientific research (Ioannidis 2014, Munafò et al. 2017). Arguably, many reform initiatives are facilitated by the availability of suitable software and technological infrastructure, such as data analysis tools that emphasize reproducibility and repositories for registering study protocols and sharing critical research resources such as raw data (Spellman 2015). Proper (re)training of the scientific workforce may also be crucial for the success of many such initiatives.

Journal, Funder, Society, and University Policies
Some journals, funders, academic societies, universities, and other institutions have begun to introduce policy changes trying to address problems. For example, Nosek et al. (2015) developed the Transparency and Openness Promotion (TOP) guidelines, a set of tiered policy recommendations encompassing data sharing, materials sharing, analysis code transparency, data citation standards, design and analysis reporting, preregistration, and replication. At the time of writing, the TOP website (https://cos.io/our-services/top-guidelines/) reports that over 5,000 organizations (including journals, publishers, and funders) are signatories; however, this only involves "expressing their support of the principles of openness, transparency, and reproducibility, expressing interest in the guidelines," and a commitment to "conducting a review within a year of the standards and levels of adoption." The website also notes that "We know of over 1,100 journals or organizations that have implemented one or more TOP-compliant policy as of June 2019."

Preregistration and Registered Reports
Preregistration involves creating a time-stamped, read-only copy of a study protocol (e.g., hypotheses, methods, and analysis plan) and archiving it in a registry before study commencement. The intention is to mitigate or enable detection of questionable research practices, such as p-hacking and HARKing, by making clear what was planned and what was not (Nosek et al. 2018). For example, selective reporting can potentially be identified by comparison of the protocol and the report (Goldacre et al. 2019). Additionally, preregistration may help researchers avoid capitalizing on chance by exploiting (perhaps unintentionally) degrees of freedom in the analysis process. Crucially, the intention of preregistration is not to reduce opportunities for exploratory (data-dependent) analyses, but to make clear the exploratory nature of such work (Kimmelman et al. 2014). Although the concept of preregistration is emerging in the basic sciences and preclinical domains (Nosek et al. 2018), it has a longer precedent in the context of clinical trials registration (Dickersin & Rennie 2012) and has sparked opposition and debate in some domains (e.g., compare Dal-Ré et al. 2014, Lash & Vandenbroucke 2012). Implementation of preregistration in practice depends on the research domain, the existence of legal mandates, and the registry that a researcher chooses (or is required) to use (Table 2). Some registries are tailored to the needs of specific fields and offer or require completion of specific registration templates. Templates facilitate standardization but may not be optimized for some designs. Registries can also influence the level of transparency conferred by preregistration. Some registries automatically make registrations public (e.g., ClinicalTrials.gov), and some offer an optional and time-limited embargo period during which the registration is hidden before the information eventually becomes public (e.g., the Open Science Framework). Others allow registrations to be kept hidden indefinitely (e.g., AsPredicted), which may help to allay concerns about ideas being scooped. However, hidden registrations cannot be effectively monitored by the scientific community. Monitoring can help address issues such as creating multiple similar registrations, registering but not publishing, and registering but still using questionable research practices.
The registered reports article format involves embedding protocol registration directly within the publication pipeline (Chambers 2013). Study protocols are peer reviewed and may be offered in-principle acceptance for publication before the study has even been conducted. By focusing on the quality of study design rather than the aesthetic appeal of the study findings, registered reports may improve study quality and mitigate publication bias. This type of publishing model appears to have been employed by the European Journal of Parapsychology from 1976 for almost two decades (Wiseman et al. 2019) as well as The Lancet from 1997 for at least a decade (Hardwicke & Ioannidis 2018a). The current registered reports format (http://cos.io/rr) was introduced at Cortex in 2013 (Chambers 2013), and adoption has spread across journals and disciplines (Hardwicke & Ioannidis 2018a).

Reporting Guidelines
Reporting guidelines are intended to draw attention to key design and analysis decisions and ensure they are adequately reported in research reports (Altman & Simera 2016). In 2008 the EQUATOR (Enhancing the Quality and Transparency of Health Research) network was launched to provide resources, education, and training to facilitate good research reporting (https://www.equator-network.org/). Currently the network provides 411 reporting guidelines covering a variety of study types (Table 3), but it is predominantly focused on health research.

Peer Review
Peer review is often considered to be an essential quality control gateway that should prevent low-quality research from entering the scientific literature. There are many studies on peer review, many of which are presented at the International Conference on Peer Review that runs every four years. However, assessment of interventions to improve peer review has been relatively sparse (Bruce et al. 2016). Several variations or amendments to peer review procedures have been proposed to make it more effective and mitigate potential negative consequences, such as the introduction of bias by peer reviewers themselves. For example, there has been much debate and some empirical scrutiny of whether the identities of peer reviewers and/or the content of their reviews should be made publicly available (Ross-Hellauer & Görögh 2019) and whether peer review procedures should be double-blind (Justice et al. 1998, McGillivray & De Ranieri 2018). The registered reports publication model represents a radical departure from traditional peer review procedures, as it involves results-blind peer review. It has also been proposed that peer reviewers might decline to review a particular manuscript if critical research resources such as raw data, materials, and analysis scripts are not made publicly available or the manuscript does not contain a statement explaining why they cannot be made available (Morey et al. 2016). Finally, there has been discussion about whether specialized statistical review (a process that appears to have had some degree of success in biomedical domains) might also help other fields where statistical review appears to be relatively rare, such as psychology (Hardwicke et al. 2019a).

Collaboration
Pooling expertise, financial resources, and other resources in collaborative work may improve statistical power, reproducibility, access to unique populations, and results generalizability (Ioannidis 2014, Munafò et al. 2017). Collaborative efforts have completely transformed many scientific fields, such as genetics (Seminara et al. 2007), and have become the norm in many physical and space sciences. This model has recently started to gain traction in disciplines where such large-scale collaboration was previously rare, such as psychology (Open Sci. Collab. 2015, R.A. Klein et al. 2018). The new Psychological Science Accelerator operates as a large network of labs with different committees responsible for coordinating various tasks, such as study selection and data management (Moshontz et al. 2018). Similarly, the Observational Health Data Sciences and Informatics initiative aims to bring collaborators together in order to merge health data from multiple sites into one standardized database, reproduce analyses across sites, and explore the optimization of design and analytic choices in observational research.

Table 4 Proposed approaches to statistical reform
Success depends on extent of adoption or enforcement by stakeholders.

Proposal | Key challenges | Comments
Abandon p-values entirely | Not easy because insufficient information might be available to compute other statistics; many articles do not report effect sizes and/or confidence intervals | Previous efforts have not gained traction. May be more successful in some fields (e.g., assessment of diagnostic performance or choosing predictors for prognostic models, in which p-values would make little sense)
Focus on effect sizes and their uncertainty | Often this information is not reported at all, but it has become more common in recent literature | Relevant to the vast majority of the clinical literature; should be heavily endorsed as more directly linked to decision making and may be easier to promote than more sophisticated solutions
(Re)train the scientific workforce | Takes time and major commitment to achieve sufficient statistical literacy | Potentially a more effective solution in the long term but may require major recasting of training priorities in curricula
Address biases that lead to inflated results | Requires major training; biases are often impossible to detect in published reports | Preemptively dealing with biases is ideal but needs concerted commitment of multiple stakeholders to promote and incentivize better research practices
Overall, an increase in large-scale collaborations seems welcome; however, it does raise practical and conceptual issues. For example, accurately and effectively crediting hundreds of authors in a team effort could prove difficult in the current system, which values certain authorship positions more than others. Large-scale collaborations may also dilute beneficial competition and disagreement by focusing on finding a lowest common denominator consensus at the expense of more radical approaches. One interesting hybrid approach known as adversarial collaboration involves two groups of researchers who have a theoretical disagreement collaborating on a project in an effort to maximize the informational value of study design and minimize the influence of their respective biases (Matzke et al. 2015). Some empirical work is already addressing the relative merits of small versus large team science (Wu et al. 2019).

Statistical Reform
In response to widespread statistical misuse (see Section 3.5), there have been many proposals for statistical reform (see Table 4). The American Statistician recently published a collection of 43 articles containing several such proposals (Wasserstein et al. 2019). As NHST is the dominant statistical paradigm in most disciplines (Section 2.3), many solutions are focused either on trying to improve the use of NHST or on replacing it completely. For example, Benjamin et al. (2017) proposed that reducing the traditional threshold for declaring statistical significance from 0.05 to 0.005 would help researchers more appropriately calibrate their inferences to the strength of the evidence. It has also been suggested that, instead of using a theoretical null distribution, researchers use a data-driven null constructed using negative controls (i.e., associations that we know should be null) to produce calibrated p-values. Others have proposed completely abandoning significance testing (McShane et al. 2019). However, an empirical assessment of articles published during a journal ban of significance testing suggested a tendency to overstate conclusions (Fricker et al. 2019), which is a concern in the absence of any statistical inference (Ioannidis 2019).
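One line of reasoning behind stricter thresholds rests on upper bounds for the evidence a p-value can provide. The sketch below uses the well-known -1/(e * p * ln p) bound on the Bayes factor; this is one of several arguments of this kind rather than the exact calculation in Benjamin et al. (2017).

```python
import math

def max_bayes_factor(p):
    """Upper bound on the Bayes factor (alternative vs. null) implied by a p-value,
    using the -1/(e * p * ln p) bound; valid for p < 1/e."""
    return -1.0 / (math.e * p * math.log(p))

for p in (0.05, 0.005):
    print(f"p = {p}: Bayes factor at most ~{max_bayes_factor(p):.1f}")
# p = 0.05 corresponds to, at best, fairly weak evidence (~2.5);
# p = 0.005 corresponds to substantially stronger evidence (~14).
```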
Others suggest a switch to alternative paradigms, such as Bayesian statistics, which has several advantages (Wagenmakers 2007). However, it is noteworthy that different inferential approaches often arrive at similar conclusions, at least in simple scenarios (van Dongen et al. 2019). Furthermore, the success of any given approach is largely dependent on the capabilities of the researchers conducting the statistical analyses. Statistical reform is therefore heavily dependent on resolving other issues, such as poor statistical education (Altman 1994, Goodman 2019), misaligned incentives (Nosek et al. 2012), and a lack of transparency, which complicates independent evaluation. Innovative approaches may be worth evaluating, such as requiring authors to indicate their degree of belief in their findings (Goodman 2018).

Evidence Synthesis
The Cochrane Collaboration was founded in 1993 and quickly became a champion of evidence synthesis. Several more recent initiatives have been launched in order to encourage a more transparent and collaborative process for evidence synthesis, such as the Stanford MetaLab (http://metalab.stanford.edu/) in developmental psychology and PROSPERO (https://www.crd.york.ac.uk/prospero/) for preregistration of systematic reviews. Systematic reviews and meta-analyses are growing exponentially, with an increase of 2,728% and 2,635%, respectively, between 1991 and 2014, but many are redundant, misleading, or conflicted (Ioannidis 2016).

EVALUATING SOLUTIONS
The effectiveness of reform initiatives will depend not only on their theoretical sophistication but also on how well they are implemented. Evaluation and ongoing monitoring of reform initiatives are crucial to detect unintended negative consequences and verify that anticipated benefits are being realized in practice (Ioannidis 2015). The success of an intervention depends on how well it is calibrated to the needs, motivations, and capability of scientists, and these factors may vary substantially across research communities. Furthermore, any single reform initiative will typically only address a subset of a complex range of overlapping and interacting processes that are influenced by multiple stakeholders within the scientific ecosystem. Even with the best intentions and flawless implementation, reform initiatives may fail because the system evolves to resist them.
Conducting informative evaluation studies can be challenging. Many initiatives are introduced without considering evaluation, which means that most evaluation studies are necessarily limited to retrospective observational designs. A given reform initiative may involve both proximal and distal goals that will emerge across the short, medium, and long terms. Distal goals may take longer to materialize and be more difficult to isolate and operationalize, but they are critical indices of a reform initiative's success or failure. Proximal outcomes can be measured sooner, providing feedback that can be used to address weaknesses and optimize reform initiatives. Here, we describe some specific examples of evaluating solutions relevant to journal policy, reporting guidelines, preregistration, and registered reports.

Journal Policy
Generally, policy effectiveness appears to depend on how stringent policy requirements are, how the policy is worded and interpreted, and how robustly the policy is enforced. For example, data sharing policies that recommend that authors share upon request tend to be markedly less effective relative to more stringent policies that require authors to make data publicly available in an online repository prior to publication (Nuijten et al. 2017, Rowhani-Farid & Barnett 2016). For instance, Nuijten et al. (2017) observed a dramatic increase in data availability from 8.6% to 87.4% of articles after a new policy at Judgment and Decision Making asked authors to publicly share data prior to article publication. By contrast, data sharing at a comparator journal with no data sharing policy, the Journal of Behavioral Decision Making, remained negligible across the same period (prepolicy, 0%; postpolicy, 1.7%).
Although some data sharing policies might achieve their proximal goal of increasing data availability, they are not necessarily achieving their more distal goal of facilitating data reuse or results verification. Using an interrupted time series analysis, Hardwicke et al. (2018) observed a substantial increase in data availability statements after a mandatory data sharing policy was introduced at the journal Cognition (from 25% of prepolicy articles to 78% of postpolicy articles). However, among data sets that were reportedly available, the proportion that were actually available, complete, and understandable was only 22% prepolicy and 62% postpolicy (also see Section 3.6). In response to this study, the Cognition editorial team outlined policy changes they intend to implement (Tsakiris et al. 2018), an example of how meta-research can create a feedback loop between the solution evaluation and solution development stages (Figure 1).

Reporting Guidelines
In line with the recommendations of the International Committee of Medical Journal Editors (ICMJE), medical journals are increasingly adopting reporting guidelines. However, an examination of the online instructions to authors of 168 high-impact-factor journals revealed heterogeneous recommendations on the use of Consolidated Standards of Reporting Trials (CONSORT) guidelines (Shamseer et al. 2016). Sixty-three percent endorsed the use of CONSORT, of which 42% made it a prerequisite for submission. An early pre/post study assessing the impact of CONSORT showed improvements in completeness and transparency of published reports (Moher et al. 2001). For example, unclear description of allocation concealment significantly decreased (mean change −22%). Although the overall quality of reporting improved with uptake of guidelines, it remains suboptimal, and deficiencies persist (Turner et al. 2012).

Preregistration and Registered Reports
In 2005, the ICMJE implemented their policy requiring trial registration for publication, and the number of registrations on ClinicalTrials.gov increased by 73% in 6 months (Zarin et al. 2005). As of April 2019, 301,795 studies have been registered on ClinicalTrials.gov. However, many journals still publish unregistered and retrospectively registered trials (Gopal et al. 2018, Loder et al. 2018, Trinquart et al. 2018). An overview of studies evaluating the registration status of published reports in medical journals found that half of the trials published were not registered and only 20% were prospectively registered (Trinquart et al. 2018).

Registration of trials is also intended to prevent selective reporting, yet changes between the registered primary outcomes and published outcomes are common (Goldacre et al. 2019, Gopal et al. 2018, Jones et al. 2015, Mathieu et al. 2009, Scott et al. 2015), with up to 31% of articles showing discrepancies in reported versus registered outcomes. Other investigators have reported that mandatory registration of primary outcomes was associated with a substantial decline in the number of trials reporting statistically significant findings, perhaps due to registration effectively mitigating selective reporting (Kaplan & Irvin 2015).
Journal adoption of registered reports has many potential benefits, but a number of implementation issues were found during an exploratory investigation of the format (Hardwicke & Ioannidis 2018a). For example, at the time of the investigation, most registered reports had not been formally registered, and there was no reliable way of tracking their existence and status in the publication pipeline. Furthermore, most in-principle accepted protocols were not publicly available. This investigation is another example of how meta-research can create a virtuous feedback loop to inform and refine solution development (Figure 1). Many of the implementation issues identified were quickly addressed (at least in part) through the creation of a central registry for registered reports (http://cos.io/rr) and efforts to coordinate and update journal policy to ensure that protocols are registered and made publicly available (Chambers & Mellor 2018). Further evaluation and monitoring will be necessary to ascertain how effective these changes have been and if additional implementation issues should be addressed.

SUMMARY POINTS
1. Meta-research, or research-on-research, is a burgeoning discipline that investigates efficiency, quality, and bias in the scientific ecosystem.
2. Meta-research efforts can be organized into four broad areas: identifying problems, investigating problems, developing solutions, and evaluating solutions.
3. Using theoretical arguments, modeling, simulations, or early empirical data, meta-researchers have identified potential problems that might cause inefficiency, hamper research quality, and undermine the veracity of the published literature.
4. Empirical investigations have examined the prevalence and severity of problems including publication bias and selective reporting, transparency of critical research resources, suboptimal research design, incentive structures, statistical misuse, and reproducibility.
5. Scientific stakeholders such as universities, funders, journals, and researchers have introduced a number of reform initiatives related to transparency of research resources, preregistration, registered reports, reporting guidelines, peer review, collaboration, statistical reform, and evidence synthesis. The goal is to improve the efficiency, quality, and credibility of scientific research.
6. Evaluation and ongoing monitoring of reform initiatives are crucial to check for unintended negative consequences and verify that anticipated benefits are being realized in practice.
7. Meta-research can help to calibrate the scientific ecosystem toward higher standards by providing a stratum of empirical evidence that informs the iterative generation and refinement of reform initiatives.

FUTURE ISSUES
1. Meta-researchers are operating in the same scientific ecosystem as other researchers and are therefore subject to the same selection pressures that can infuse bias into the research process. It is important that meta-research is held to the same high standards expected of other research.
2. Often researchers promoting reform initiatives are also those with the interest, motivation, and means to evaluate them. These researchers may have the best of intentions, and important studies might not even be performed without their efforts, but such nonindependent evaluation does create a risk of bias. Independent evaluations should be prioritized when feasible, and high transparency standards are imperative.
3. As an emerging cross-disciplinary field, meta-research lacks the traditional infrastructures that support more established disciplines. Consequently, many aspects of meta-research are ad hoc and poorly supported, including training, student recruitment, funding, and publication outlets. Career trajectories for meta-researchers are unclear as university departments specializing in meta-research are rare.
4. Meta-research is frequently complicated by a lack of transparency and poor standardization. Important information is often buried in articles and has to undergo time-consuming and error-prone manual extraction. Other information is hidden in journal publishing systems or in researchers' files. Improved transparency standards combined with technological developments will enable more efficient, comprehensive, and effective meta-research.
5. Recent advances in text mining, machine learning, and other automated tools may create new opportunities for meta-research on topics that were previously out of reach. Automated tools may also be able to enhance peer review by detecting simple errors or identifying suggestive patterns that can be evaluated further by a human.
6. Different disciplines often share similar problems but have attempted different solutions. More interdisciplinary collaboration and cross-fertilization of ideas may help to ensure that the most effective strategies are widely shared.
7. Widespread concerns about the veracity of the scientific literature can be unsettling, but this is an exciting time to be a scientist. The scientific method is still the best route to finding truths about nature and we can leverage scientific methods to study science itself. At this critical juncture, meta-research has a crucial role to play in guiding scientists' attempts to calibrate the scientific ecosystem toward higher standards of efficiency, quality, and credibility.

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.