Main content



Loading wiki pages...

Wiki Version:
Sample Sources -------------- The original sequences analyzed in this project were obtained from public repositories, and their sample identifiers were retained in the downstream outputs. For more information about the samples, please see and cite the appropriate original studies. Original BALF studies and data sources: - [Chen et al. 2020][1], NCBI BioProject PRJNA601736 - [Wu et al. 2020][2], NCBI BioProject PRJNA603194 - [Zhou et al. 2020][3], NCBI BioProject PRJNA605983 - [Shen et al. 2020][4], NGDC BioProject PRJCA002202 and NCBI BioProject PRJNA605907 - [Xiong et al. 2020][5], NGDC BioProject PRJCA002326 - [Michalovich et al. 2019][6], NCBI BioProject PRJNA434133 - [Ren et al. 2018][7], NCBI BioProject PRJNA390194 - [Huang et al. 2019][8], NCBI BioProject PRJNA484025 Non-BALF studies used only for bioinformatics benchmarking: - [Ibberson et al. 2019][9], NCBI BioProject PRJNA573047 - [Nowicki et al. 2018][10], NCBI BioProject PRJNA387475 Research Team ---------------- The human metatranscriptome results were analyzed by a team of scientists on the [COVID-19 International Research Team (COV-IRT)][11], including the following: - Michael Jochum, Baylor College of Medicine - Mike Lee, NASA Ames Research Center - Viktorija Zaksas, University of Chicago - Elizabeth Vitalis, Inscripta - Cassie Conley, NASA Ames Research Center - Winnie Li, Rice University - Carla Sipahioglu, Rice University - Kristen Curry, Rice University - Bryce Kille, Rice University - Dan Cornforth, Georgia Institute of Technology - Britt Ross, Georgia Institute of Technology - Jason Chin, DNAnexus - Chai Fungtammasan, DNAnexus - Todd Treangen, Rice University - Krista Ternus, Signature Science, LLC Special thanks to the following scientists for providing helpful feedback, contributions, and reviewing the results: - Afshin Beheshti, NASA Ames Research Center - Kjersti Marie Aagaard, Baylor College of Medicine - Marvin Whiteley, Georgia Institute of Technology - Craig Everroad, NASA Ames Research Center - John Fonner, Texas Advanced Computing Center - Leo Elworth, Rice University - Dreycey Albin, Rice University - Yunxi Liu, Rice University - Yongze Yin, Rice University - Tianzhou (Charles) Ma, University of Maryland School of Public Health - Huw Ogilvie, Rice University - Ian Fiddes, Inscripta - Cem Meydan, Weill Cornell Medicine - Sharon Grim, University of Michigan - Anupama Kozhiyalam, San Jose State University - Sonia Villapol, Houston Methodist - Katie Fisch, UCSD Center for Computational Biology & Bioinformatics - Deanne Taylor, Children's Hospital of Philadelphia - Robert Meller, Morehouse School of Medicine - Matthew Frieman, University of Maryland School of Medicine Microbial Analysis ---------------- The commands described below were used to analyze the taxonomic and functional profiles of the microbial community of metatranscriptomes. Because the datasets in this study were derived from different publications and sequencing approaches, all the datasets were processed through the quality control steps to yield single-end reads. Kraken2 was used to eliminate human reads, as well as phiX contamination that appeared in some samples, before further downstream processing. **Quality Assessment of Reads with [FastQC][12] (version 0.11.9)** fastqc {input.fq.gz} -o {output_fastqc} **Quality Filtering of Single End Reads with [Trimmomatic][13] (version 0.39)** trimmomatic SE input.fq.gz output_trim25.fq.gz ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:25 **Merge Paired-End Reads with [Flash][14] (version 1.2.11)** flash input_name_1.fq.gz input_name_2.fq.gz -M 70 -o input_name 2>&1 | tee input_name_flash.log **Quality Filtering of Merged Reads with Trimmomatic (version 0.39)** trimmomatic SE sample.extendedFrags.fastq sample_extendedFrags_trim25.fq ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:25 **Quality Filtering of Unmerged Reads with Trimmomatic (version 0.39)** trimmomatic PE sample.notCombined_1.fastq sample.notCombined_2.fastq sample.notCombined_trim25_1.fastq sample.notCombined_trim25_1_se sample.notCombined_trim25_2.fastq sample.notCombined_trim25_2_se ILLUMINACLIP:adapters_combined_256_unique.fasta:2:30:15:1:true LEADING:25 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:25 The Illumina adapter file used with Trimmomatic is located online [here][15]. **Combine Quality Filtered Outputs from Paired-End Reads (i.e., merged reads + forward and reverse single end unmerged reads + forward reads from paired-end unmerged reads)** cat sample_extendedFrags_trim25.fq sample.notCombined_trim25_1.fastq sample.notCombined_trim25_1_se sample.notCombined_trim25_2_se > sample_combined_trim25.fq **Compress Combined Reads** gzip sample_combined_trim25.fq **Build [Kraken2][16] Database to Remove Human and PhiX Reads with and without Masking** kraken2-build --download-taxonomy --db kraken2_human_and_phiX_db_19_June_2020/ kraken2-build --download-taxonomy --db kraken2_human_and_phiX_db/ Notes about this process are captured here: **Remove Human and PhiX Reads with Kraken2 (version 2.0.8-beta) with Masking** kraken2 --db kraken2_human_and_phiX_db_19_June_2020 --threads 10 sample_combined_1_reads_trim25_1.fq.gz --use-names --confidence 0 --report --unclassified-out sample_filtered_1_reads_trim25_kraken2_unclassified_human_and_phiX_db_confidence0#.fq --classified-out sample_filtered_1_reads_trim25_kraken2_classified_human_and_phiX_db_confidence0#.fq --output sample_filtered_1_reads_trim25_kraken2_human_and_phiX_db_confidence0.out **Remove Low Complexity Sequences with [Fastp][17] (0.20.1)** fastp -i sample_filtered_1_reads_trim25_1.fq.gz -o sample_complexity_filtered_1_reads_trim25_1.fq.gz --low_complexity_filter --complexity_threshold 20 --disable_quality_filtering --disable_adapter_trimming --thread 4 **Remove Human and PhiX Reads with Kraken2 (version 2.0.8-beta) without Masking** kraken2 --db kraken2_human_and_phiX_db --threads 10 sample_complexity_filtered_1_reads_trim25_1.fq.gz --use-names --confidence 0 --report --unclassified-out sample_qual_complexity_filtered_1_reads_trim25_kraken2_unclassified_human_and_phiX_db_confidence0#.fq --classified-out sample_qual_complexity_filtered_1_reads_trim25_kraken2_classified_human_and_phiX_db_confidence0#.fq --output sample_qual_complexity_filtered_1_reads_trim25_kraken2_human_and_phiX_db_confidence0.out **Taxonomic Assignments with Kraken2 (version 2.0.8-beta)** kraken2 --db mason_db --threads 10 sample_filtered_1_reads_trim25_1.fq.gz --use-names --confidence 0 --report --unclassified-out sample_filtered_1_reads_trim25_kraken2_unclassified_mason_db_confidence0#.fq --classified-out sample_filtered_1_reads_trim25_kraken2_classified_mason_db_confidence0#.fq --output sample_filtered_1_reads_trim25_kraken2_mason_db_confidence0.out The database run with Kraken2 for taxonomic classification is the same database used in the [Butler *et al*. 2020 SARS-CoV-2 preprint][18]. The Kraken2 (k=35) database was constructed by the Mason Lab with complete genome or chromosome-level assemblies from RefSeq for archaea, bacteria, protozoa, fungi, human and viruses at the time of the preprint publication. Final Kraken2 results were summarized in this way - **Visualization of Kraken2 Results with [Pavian][19] R package (v1.0.0)** pavian::runApp(port=5000) **Quantify Taxon Abundance with [Bracken][20] (version 2.2) (Analysis Removed)** bracken -d mason_db -i -r {read_length} -l {taxon-level} -t 0 -o {bracken output} **Summarize Bracken Results with [Bioinf Tools (Bit)][21] (version 1.8.03) (Analysis Removed)** bit-combine-bracken-and-add-lineage -i input-files.tsv A tsv input file contains a single column with the Bracken output file names, and an optional second column with new names for use in the compiled output. **Count Sequences in fq.gz File** parallel "echo {} && gunzip -c {} | wc -l | awk '{d=\$1; print d/4;}'" ::: sample.fq.gz **Convert Single End or Combined Paired-End Reads from .fq.gz to .fasta** gunzip -c sample_filtered_1_reads_trim25_1.fq.gz | paste - - - - | cut -f 1,2 | sed 's/^/>/' | tr "\t" "\n" > sample_filtered_1_reads_trim25_1.fa SeqScreen requires fasta files as input, so the filtered *.fq.gz files were converted to fasta. To optimize runs on our servers, input fasta files were parsed into a smaller number of sequences per file for large datasets, and then concatenated together after each component run was completed. **Count Sequences in Fasta** grep -c ">" sample_filtered_1_reads_trim25_1.fa **Subset Fasta into Smaller Files with [pyfasta][22] (v.0.5.2) as Needed** pip install pyfasta pyfasta split -n 2 sample_filtered_1_reads_trim25_1.fa **Assign GO Terms to Reads with [SeqScreen][23] (v1.2) "Fast" Mode (DIAMOND + High Confidence Protein Annotations)** nextflow --fasta {input_fasta} --databases {database_location} --working {output_directory} --threads {number_threads} The SeqScreen v1.2 [DIAMOND][24] protein database was created by parsing UniProt to exclude TrEMBL and UniParc proteins. The functional annotation performance of SeqScreen improved by removing these under-annotated proteins, which typically were not reviewed or had only a 1 or 2 (out of 5) UniProt annotation score. **Summarize All GO Terms by Domain with [CoV-IRT-Micro][25]** cut -f 1,2,3 sample_filtered_trim25_seqscreen_report.tsv > sample_filtered_trim25_seqscreen_parsed.tsv cov-summarize-go-annots-w-domains -g go_basic -i sample_filtered_trim25_seqscreen_parsed.tsv --keep-zeroes -o sample_filtered_trim25_seqscreen_GO_summary_v2.tsv cov-combine-go-summaries-with-domains -i * -o combined_seqscreen_GO_summary_v2.tsv Notes about this process are captured here: We decided not to use slimmed GO terms, but notes about that process are captured here: **Propagate Parental GO Terms with [CoV-IRT-Micro][26]** cov-propagate-go-terms --input combined_seqscreen_GO_summary_v2.tsv --keep-zeroes -o combined_seqscreen_GO_summary_v2_parent_propagated.tsv **Get GO Term Children with [Bit][27]** bit-get-go-term-info GO:0016787 > hydrolase_activity_bit_output.txt **Summarize Read Classifications for Select GO Terms with [SeqScreen Parsing Scripts][28]** python3 scripts/ -g go_terms_list.txt --prefix sample_go_terms_list --force sample_filtered_1_reads_trim25_1_seqscreen_report.tsv **Summarize UniProt Counts from SeqScreen Outputs with [CoV-IRT-Micro][29]** cut -f 18 sample_filtered_1_reads_trim25_1_seqscreen_report.tsv > sample_uniprot_ids.tsv awk '{for(w=1;w<=NF;w++) print $w}' ./sample_uniprot_ids.tsv | sort | uniq -c | sort -nr > sample_uniprot_id_counts.tsv cov-combine-uniprot-ID-counts -i *_uniprot_id_counts.tsv -o uniprot_id_summary.tsv Notes about this process are captured here: The COV-IRT microbial conda package can be found here: Viral Analysis ---------------- The determination and analysis of viral variants was performed according to the methods described in the [Sapoval et al. 2020 preprint][30]. Human Leukocyte Antigen (HLA) Analysis ---------------- The RNA-seq for each individual was mapped to GRCh38 reference genome with no ALT contigs using STAR mapper version 2.6.1a. The mapped BAM files were used for HLA typing with the arcasHLA Docker image from, which was pulled from master branch on 2020-Jun-08. [1]: [2]: [3]: [4]: [5]: [6]: [7]: [8]: [9]: [10]: [11]: [12]: [13]: [14]: [15]: [16]: [17]: [18]: [19]: [20]: [21]: [22]: [23]: [24]: [25]: [26]: [27]: [28]: [29]: [30]: