## Data

This supplementary data accompanies our study of non-coding RNAs (ncRNAs) in 16 publicly available bat genome assemblies. If you use this data, please cite:

Mostajo _et al._ "A comprehensive annotation and differential expression analysis of short and long non-coding RNAs in 16 bat genomes." Accepted at NAR Genomics and Bioinformatics, (2019)

The full supplement can be found here: [rna.uni-jena.de/supplements/bats][1]

All scripts and code used to analyze and annotate the genomes can be found here: [github.com/rnajena/bats_ncrna][2]

Besides all annotation files, in this repository, we additionally provide big (intermediate) data files for

* genomes (original and re-named with internal IDs used in processing the data) | ``fasta``
* mappings | ``sorted.bam``
* merged annotations for each bat species | ``gtf``
* blast results | ``tsv``
* differential gene expression | ``tar.bz2``

__Please note__: for our computational analysis we re-named all contigs of each bat genome assembly following a three-letter abbreviation code, e.g. MLU_123 for a contig of the _M. lucifugus_ genome assembly. Afterwards, we re-named all annotation files to be compatible with the current genome ``fasta`` files provided by the NCBI. Here, we uploaded all files with their original (NCBI) contig IDs for direct usage.

Our ``gtf`` annotations follow a general format derived from the Ensembl annotation format. We strictly stick to a _{gene, transcript, exon}_ hierarchy. This might not be necessary for most ncRNAs; however, it means our annotations can be directly merged and used with existing annotations of protein-coding genes.

Each ``gtf`` row is composed of 9 columns:

* 1 contig ID
* 2 annotation source
* 3 type {gene, transcript, exon}
* 4 start
* 5 stop
* 6 .
* 7 strand {+,-,.}
* 8 .
* 9 description with specific attributes

For example, written as arrays:

````
['MLU_1', 'gorap', 'gene', '50', '150', '.', '+', '.', 'gene_id "MLUG00000000001"; gene_name "NARF"; gene_source "gorap"; gene_biotype "ncRNA";']
['MLU_1', 'gorap', 'transcript', '50', '150', '.', '+', '.', 'gene_id "MLUG00000000001"; gene_name "NARF"; gene_source "gorap"; gene_biotype "ncRNA"; transcript_id "MLUT00000000001";']
['MLU_1', 'gorap', 'exon', '50', '150', '.', '+', '.', 'gene_id "MLUG00000000001"; gene_name "NARF"; gene_source "gorap"; gene_biotype "ncRNA"; transcript_id "MLUT00000000001"; exon_id "MLUE00000000001";']
````

For our gene/transcript/exon IDs we stick to the following pattern:

````
<SPECIES_ABBR><TYPE><NCRNA_CLASS><00000000000>
````

with

````
TYPE = {G, T, E} // gene, transcript, exon
````

and

````
NCRNA_CLASS = {
  rRNA => R,
  tRNA => T,
  miRNA => M,
  miRNA w/ miRDeep2 (de novo) => D,
  snoRNA => S,
  ncRNA/miscRNA/other => N,
  lncRNA => L,
  lncRNA-hot-spot => H,
  mito => O
}
````

Example for an rRNA (R) gene (G) annotated in _Myotis lucifugus_ (MLU): `MLUGR00000000001`

### Final merged GTFs

For each bat species we merged the known NCBI annotations (protein-coding and non-coding) with our new ncRNA annotations in two steps to provide extended full annotations that are directly usable for further computational analyses.

__First__, for each bat species we merged our new ncRNA annotations using a custom script ([merge_gtf_global_ids.py][3]). After reading in all features and asserting correct file structure, overlaps were resolved in the following manner:

1. Exons are considered overlapping if more than 50 % of the shorter one is covered by the longer one.
2. If only one exon of the overlapping set is of biotype protein-coding, remove all others.
3. For further ties, keep only the exon that is highest on a priority list based on annotation source.
4. For further ties, keep only the longest of the exons.

For each exon to be removed, the corresponding transcript is deleted, and gene records that lost all of their transcripts are also deleted.

__Second__, we converted and filtered the NCBI annotations to a compatible format with a custom script ([format_ncbi.py][4]) and then combined the results with our merged novel ncRNA annotations, using the same strategy to resolve overlaps as above but imposing less strict format rules ([merge_gtf_ncbi.py][5]).

### How to join split files and extract compressed files

Some files were split for uploading. To join these files use:

```
cat somefile.tar.bz2.parta* > somefile.tar.bz2
```

To extract `.tar.bz2` compressed files use:

```
tar -xjf somefile.tar.bz2
```

[1]: https://www.rna.uni-jena.de/supplements/bats
[2]: https://github.com/rnajena/bats_ncrna
[3]: https://github.com/rnajena/bats_ncrna/blob/master/merge_gtf_global_ids.py
[4]: https://github.com/rnajena/bats_ncrna/blob/master/format_ncbi.py
[5]: https://github.com/rnajena/bats_ncrna/blob/master/merge_gtf_ncbi.py
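
As an illustration of the ``gtf`` layout and the ID pattern described under Data, the following minimal Python sketch splits one row into its 9 columns, reads the attribute column into a dictionary, and decodes an ID into its parts. The function names `parse_gtf_row` and `decode_id` are hypothetical and not part of the published scripts. The sketch assumes tab-separated rows, a three-letter uppercase species code, and an 11-digit running number as in the example `MLUGR00000000001`.

```python
import re

# Class letters as listed in the NCRNA_CLASS table.
NCRNA_CLASSES = {
    "R": "rRNA",
    "T": "tRNA",
    "M": "miRNA",
    "D": "miRNA (miRDeep2, de novo)",
    "S": "snoRNA",
    "N": "ncRNA/miscRNA/other",
    "L": "lncRNA",
    "H": "lncRNA-hot-spot",
    "O": "mito",
}
FEATURE_TYPES = {"G": "gene", "T": "transcript", "E": "exon"}

# Assumption: <3 uppercase letters><G|T|E><class letter><11 digits>,
# modelled on the documented example MLUGR00000000001.
ID_PATTERN = re.compile(r"^([A-Z]{3})([GTE])([RTMDSNLHO])(\d{11})$")

# Attribute entries look like: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_row(line):
    """Split one tab-separated gtf row into named fields (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    return {
        "contig": columns[0],
        "source": columns[1],
        "type": columns[2],
        "start": int(columns[3]),
        "stop": int(columns[4]),
        "strand": columns[6],
        "attributes": dict(ATTRIBUTE_PATTERN.findall(columns[8])),
    }


def decode_id(identifier):
    """Decode an ID that follows the pattern above (hypothetical helper)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"ID does not follow the expected pattern: {identifier}")
    species, feature, ncrna_class, number = match.groups()
    return {
        "species": species,
        "type": FEATURE_TYPES[feature],
        "ncrna_class": NCRNA_CLASSES[ncrna_class],
        "number": int(number),
    }


if __name__ == "__main__":
    print(decode_id("MLUGR00000000001"))
```

Keeping the attributes in a plain dictionary makes it straightforward to filter rows by `gene_biotype` or `gene_source` before merging them with other annotations.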
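
The first overlap rule under "Final merged GTFs" (more than 50 % of the shorter exon is covered by the longer one) can be sketched as below. This is only an illustration of the stated criterion, not the code from `merge_gtf_global_ids.py`. The function name `exons_overlap` is hypothetical, coordinates are assumed to be 1-based and inclusive as in ``gtf``, and the sketch assumes both exons are already known to lie on the same contig; whether the original script also compares strands is not covered here.

```python
def exons_overlap(exon_a, exon_b, min_fraction=0.5):
    """Return True if more than min_fraction of the shorter exon is covered.

    exon_a and exon_b are (start, stop) tuples with 1-based inclusive
    coordinates. Both exons are assumed to be on the same contig.
    """
    a_start, a_stop = exon_a
    b_start, b_stop = exon_b

    # Number of positions shared by both intervals (inclusive coordinates).
    shared = min(a_stop, b_stop) - max(a_start, b_start) + 1
    if shared <= 0:
        return False

    shorter_length = min(a_stop - a_start + 1, b_stop - b_start + 1)
    return shared / shorter_length > min_fraction


if __name__ == "__main__":
    print(exons_overlap((50, 150), (100, 300)))  # 51 of 101 positions shared
    print(exons_overlap((50, 150), (101, 300)))  # 50 of 101 positions shared
```

Measuring the shared fraction against the shorter exon means that a short exon fully contained in a long one always counts as overlapping, regardless of how little of the long exon it covers.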