The RNAseq experiment related to these files was performed in Prof Bart Vanhaesebroeck’s group (UCL). The data are from a paired-end mRNA sequencing experiment using a strand agnostic library.
Sample information (mouse MEFs):
The derivation and culture of the wild-type and PIK3CAWT/H1047R MEFs used in this study have been reported previously (Moniz et al., 2017). Cell pellets were collected on dry ice 48 h after induction of heterozygous PIK3CA-H1047R expression, without prior starvation. Annotation:
F1-F4 (flp/PIK3CA-WT/WT)
O1-O4 (flp/PIK3CA-WT/H1047R)
The raw sequencing reads were mapped to the mouse genome build GRCm38, and gene level counts were performed using Spliced Transcripts Alignment to a Reference (STAR) v2.5.
Subsequent analysis was performed with the limma pipeline developed for microarrays, applying the voom method without sample quality weights. Voom estimates the mean-variance relationship of the log-counts.
Analysis workflow detailed in: https://www.bioconductor.org/help/workflows/RNAseq123/
All files contained in this directory and any subdirectories should be self-explanatory with the help of the accompanying RNotebook. All published results are possible to reproduce using the supplied .RData file
DETAILED ANALYSIS BACKGROUND
Characteristics of the limma pipeline:
1) Linear modelling to analyse complex experiments with multiple treatment factors
2) (can take into account weights to account for variations in precision between different observations) - we won’t use this.
3) Empirical Bayes statistical methods to borrow strength between genes
Based on reference: voom: precision weights unlock linear model analysis tools for RNA-seq read counts (Law et al. Genome Biology 2014)
Voom takes the approach that modelling the mean-variance relationship in the data is more important than specifying the exact probabilistic distribution of the counts (which is an issue with the count-based RNAseq data anyway). Heteroscedastic distribution of counts, with larger variances for larger counts. Why important to model mean-variance relationship: because once you log-transform counts, you do bring the variance down for high counts which otherwise tend to have large SD, but now the log-transformed large counts have smaller std deviations than small log-counts.
Voom relies on statistical methods based on the normal distribution to estimate a mean-variance function for the data at hand. Estimated non-parametrically. Log counts normalised per sequence depth. The mean-variance is fitted to the gene-wise standard deviations of the log-cpm as a function of average log-count. Then modify limma’s empirical Bayes procedure to incorporate a mean-variance trend (if weights were included: mean-variance trend is incorporated into a precision weight for each individual normalised observation). The first method (mean variance trend into the empirical Bayes procedure in limma) = known as limma trend. The second method known as voom = “variance modelling at the observational level”, the one we are using. Limma-trend applies the mean-variance relationship at the gene level whereas voom applies it at the level of individual observations. Limma-trend and voom perform almost equally well when the sequencing depths are the same for each RNA sample (which is the case here). When the sequencing depths are different, voom is the better performer.
No normalisation to transcript length so can’t compare different transcripts to each other, i.e. only estimate relative changes in abundance for a given transcript across different samples, not absolute levels across different transcripts.
log-cpm: dividing each read count by the corresponding library size (in millions) to get counts per million (cpm), then log2.
The counts are augmented by a small positive value (a half of one read) to avoid taking the logarithm of zero. This ensures no missing log-cpm values and reduces the variability at low count values.
log-cpm values are treated as analogous to log-intensity values from a microarray analysis.
Shown mathematically that log-cpm values generally show a smoothly decreasing mean-variance trend with count size (with higher counts, it eventually asymptotes to a value that should be determined by biological variability - see original reference for explanation). The log-cpm transformation therefore roughly de-trends the variance of the RNAseq counts as a function of count size for genes with larger counts. The log-cpm is approximately equal to the CV of the raw RNAseq counts (which has the same relationship = decreasing function of count size for small to moderate counts but for larger counts, it asymptotes to a value that depends on biological variability).
You can’t just input the log-cpm values straight into the well-established microarray analysis pipeline provided by the limma package because it will only behave well for reasonably large counts, but will ignore the mean-variance trend for lower counts (here the variance will be larger). The microarray limma pipeline has therefore been modified to include a mean-variance trend as part of the variance modelling: involves a modification of the empirical Bayes procedure of the limma package so that the gene-wise variances are squeezed towards a global mean-variance curve instead of towards a constant pooled variance = this is limma-trend.
voom: the limma-trend models the variance at the gene level, but in RNAseq the count sizes may vary considerably from sample to sample for the same gene. So even if the cpm is the same, different count sizes might be quite different (and be associated with different degrees of uncertainty). Therefore, voom models the mean-variance trend of the log-cpm values at the individual observation level, instead of applying a gene-level variability estimate to all observations from the same gene. Strategy: estimate non-parametrically the mean-variance trend of the logged read counts and use this mean-variance relationship to predict the variance of each log-cpm value. The predicted variance is then encapsulated as an inverse weight for the log-cpm value —> when these weights are incorporated into a linear modelling procedure, the mean-variance relationship in the log-cpm values is effectively eliminated.
Technical difficulty: you want to predict the variances of individual observations although there is by definition no replication at the observational level from which variances could be estimated. What is done instead: estimate the mean-variance trend at the gene level (like in limma-trend) then interpolate this trend to predict the variances of individual observations. In practice:
1) Fit gene-wise linear models to the normalised log-cpm values, taking into account the experiment design, treatment conditions, replicates etc.
2) Take the residual standard deviation for each gene from the fitted linear model and fit a robust trend to these deviations as a function of the average log-count for each gene (i.e. for a given gene the log count will be different in each sample prior to normalisation).
3) The fitted log-cpm for each observation (taken from the fitted linear model in 1)) is converted into a predicted count (standard algebra). Based on the predicted count size, the standard deviation trend (from 2)) is then interpolated to predict the standard deviation of each individual observation based on its predicted count size (because in 2 we got a model of standard deviation as a function of count size by using the residual standard deviations).
4) The inverse squared predicted standard deviation for each observation then becomes the weight for that observation.
5) The log-cpm values and associated weights are then input into limma’s standard differential expression pipelines.
Key: the use of normal models gives voom access to tractable empirical Bayes distribution theory, facilitating reliable estimation of the Bayesian hyper parameters and exact small sample distributions for the test statistics. Among other things, this allows accurate estimate of the prior degrees of freedom determining the optimal amount of squeezing to be applied to the variances.