### __Data__: _de novo_ transcriptome assembly comparison

For further information, please contact: (__martin__ dot __hoelzer__ at __uni-jena__ dot __de__).

Here, we provide data accompanying our manuscript [Martin Hölzer and Manja Marz. "_De novo_ transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers." GigaScience, 8 (5), 2019.](https://doi.org/10.1093/gigascience/giz039) published in [GigaScience](https://doi.org/10.1093/gigascience/giz039).

We evaluated the performance of [10 different assembly tools](http://www.rna.uni-jena.de/supplements/assembly/index.html#assembler):

* Trinity (v2.8.4)
* Trans-ABySS (v2.0.1)
* SOAPdenovo-Trans (v1.03)
* Oases (v0.2.08)
* IDBA-Tran (v1.1.1)
* Bridger (v2014-12-01)
* BinPacker (v1.0)
* Shannon (v0.0.2)
* SPAdes-sc (v3.13.0)
* SPAdes-rna (v3.13.0)

on [9 RNA-Seq data sets](http://www.rna.uni-jena.de/supplements/assembly/index.html#data):

* ECO: _E. coli_ (bacteria)
* CAL: _C. albicans_ (fungi)
* ATH: _A. thaliana_ (plant)
* MMU: _M. musculus_ (mouse)
* HSA: _H. sapiens_ (human)
* HSA-EBOV: _H. sapiens_ w/ Ebola infection (human+virus)
    * -3h: 3h p.o.i.
    * -7h: 7h p.o.i.
    * -23h: 23h p.o.i.
* HSA-FLUX: _H. sapiens_ simulated data of chromosome 1

For a full description see our [manuscript](https://doi.org/10.1093/gigascience/giz039) or our [electronic supplement](http://www.rna.uni-jena.de/supplements/assembly/index.html).

The full electronic supplement is also available here for download as a `zip` archive. Open the `index.html` file in any browser to access the supplemental data.

### Available data

#### Reads

We provide the trimmed read data (`FASTQ`) that was used for assembly. The original raw read data is listed [here](http://www.rna.uni-jena.de/supplements/assembly/index.html#data).
We used [prinseq](http://prinseq.sourceforge.net/manual.html) with the following parameters for trimming:

````
prinseq -fastq $FASTQ1 -fastq2 $FASTQ2 -out_good $DIR -out_bad null -log $DIR/foo.log -no_qual_header -min_len 25 -trim_tail_right 8 -trim_qual_right 20 -trim_qual_left 20 -ns_max_n 0 -trim_qual_window 5
````

Paired-end data is uploaded as two files with `_1` and `_2` suffixes.

We re-formatted some of the `FASTQ` files downloaded from the NCBI SRA because some of the assemblers could not work with the SRA-formatted headers:

````
# original:
@SRR203276.2 61DFRAAXX100204:1:100:10000:14603/1
# re-formatted:
@SRR203276.2:61DFRAAXX100204:1:100:10000:14603 1:N:0:ACACAC
````

The _Homo sapiens_ (HSA) paired-end data set (`hsa_1.fastq` and `hsa_2.fastq`) was split into four archived `FASTQ` files due to file size:

* hsa_1.1.fastq.zip
* hsa_1.2.fastq.zip
* hsa_2.1.fastq.zip
* hsa_2.2.fastq.zip

Download and extract the archives, then `cat` the files together:

````
cat hsa_1.1.fastq hsa_1.2.fastq > hsa_1.fastq
cat hsa_2.1.fastq hsa_2.2.fastq > hsa_2.fastq
````

to obtain the original input files that we used for assembly.

#### Assemblies

We also provide the assembly of each tool for each data set in `FASTA` format, 90 assemblies overall.

#### HISAT2 mappings

To assess the re-mapping rate of the processed RNA-Seq data to the assemblies that were built from the same reads, we mapped the reads back to the corresponding assembly using HISAT2. We generated 90 mapping files in `BAM` format, which are uploaded here. Because some of the files are larger than 5 GB, we needed to split them into multiple BAM files.
You can simply download the BAM chunks and use `bamtools merge` to generate the final mapping file again:

````
bamtools merge -in binpacker.sorted.bam00 -in binpacker.sorted.bam01 -in binpacker.sorted.bam02 -out output_alignments_merged.bam
````

#### Blast alignments

To assess the number of (nearly) full-length reconstructed protein-coding transcripts, we blasted (`blastx`) all transcripts of each assembly against the [UniProtKB/Swiss-Prot](https://www.uniprot.org/) database and provide the tabular output files here.

````
blastx -task blastx -query $ASSEMBLY -db uniprot_sprot.fasta -out $OUT/$TOOL.blastx.max1.outfmt6 -evalue 1e-20 -num_threads $THREADS -max_target_seqs 1 -outfmt 6
````

_Please note_ that we used the `-max_target_seqs 1` option here to save runtime and because one (maybe suboptimal) hit is enough for our evaluation. However, using this parameter can be fatal, because it does not do what most users expect from its description. Setting `-max_target_seqs` to `1` does not report the _best_ blast hit, but just _some_ (the first) hit matching the other criteria (e-value, ...). If you are interested, this (huge) problem was recently discussed in the community and a [short note was published in Bioinformatics](https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/bty833/5106166).

#### Assembly commands

To allow full reproducibility of our comparison, we provide all commands (as `bash.sh` scripts) used to execute the assemblies on each data set. _Please note_ that the bash scripts cannot be executed directly: you have to change the variables to match your directory structure and installation requirements. The scripts also include the assembly commands for Mira, which was ultimately not included in the manuscript (see our paper for details).

__Fig. 1__ Workflow overview. See manuscript for further details.
![Real data is truth!](https://github.com/hoelzer-lab/hoelzer-lab.github.io/blob/master/assets/osf/assembly_overview.png?raw=true =900x)

If you use our data, please cite:

[Martin Hölzer and Manja Marz. "_De novo_ transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers." GigaScience, 8 (5), 2019.](https://doi.org/10.1093/gigascience/giz039)
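As a rough illustration of the header re-formatting described under Reads, the following sketch rewrites SRA-style `FASTQ` headers into the re-formatted style shown there. It is not the script we used: the function name `reformat_sra_headers` is hypothetical, the index sequence (`ACACAC` in the example) has to be supplied by the user, and the sketch assumes four-line `FASTQ` records whose headers end in `/1` or `/2`.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the original pipeline).
# Reads FASTQ from stdin and writes FASTQ to stdout.
# $1: index sequence to place in the new header (assumption: user-supplied).
reformat_sra_headers() {
    local index="$1"
    awk -v index_seq="$index" '
        # Only header lines (every 4th line, starting at line 1) that end in /1 or /2
        NR % 4 == 1 && match($0, /\/[12]$/) {
            mate = substr($0, RSTART + 1, 1)    # mate number after the slash
            body = substr($0, 1, RSTART - 1)    # header without the /1 or /2 suffix
            sub(/ /, ":", body)                 # join read ID and description with ":"
            print body " " mate ":N:0:" index_seq
            next
        }
        # Sequence, "+" and quality lines (and non-matching headers) pass through unchanged
        { print }
    '
}
```

A call could look like `reformat_sra_headers ACACAC < reads_1.fastq > reads_1.reformatted.fastq`, where both file names are placeholders. Headers that do not end in `/1` or `/2` are left untouched, so it is worth checking a few records of the output before running an assembler on it.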