<h3><strong>SuperCRUNCH Analysis of UCE Data</strong></h3> <hr> <p>Here, I provide all details, files, and commands used for this particular SuperCRUNCH analysis.</p> <h4><strong><em>Background and Starting Material:</em></strong></h4> <p>To demonstrate the ability of SuperCRUNCH to handle large sequence-capture datasets, I assembled a UCE supermatrix using recently published data for the microhylid frog genus <em>Kaloula</em>. I downloaded sequence data from the NCBI nucleotide database using the search terms <em>Kaloula</em> and “ultra conserved element”, which resulted in a 32MB fasta file containing 38,568 records.</p> <p>The following instructions allow replication of my analysis, and the commands for each module are provided. Note that directory and file path locations will differ depending on where these data are downloaded and stored locally. In the module commands, paths to directories are denoted by /PATH/, whereas files end with typical extensions (.txt, .fasta, etc.). Also note that I generally move the relevant output files to new folders before performing the next step.</p> <p>The overall workflow for this analysis includes obtaining taxon names directly from the fasta file, taxonomic assessment, parsing loci, orthology filtration, sequence selection, creating an accession table, adjusting sequence directions, alignment, relabeling, trimming, and concatenation.</p> <h4><strong><em>1: Getting Taxon Names:</em></strong></h4> <p>I obtained a taxon list directly from the fasta file using <em>Fasta_Get_Taxa.py</em>; the list contained ten species and four subspecies labels. This was accomplished using the following command:</p> <pre class="highlight"><code>python Fasta_Get_Taxa.py -i /01_Get_Taxa/Kaloula_UCE.fasta -o /01_Get_Taxa/Output</code></pre> <p>This produced the <em>Species_Names.txt</em> and <em>Subspecies_Names.txt</em> output files.
I combined the names into a new file called <em>Kaloula_Names.txt</em>.</p> <h4><strong><em>2: Taxon Assessment:</em></strong></h4> <p>I used the <em>Kaloula_Names.txt</em> file to search the starting sequence set (<em>Kaloula_UCE.fasta</em>) for any unmatched taxon names with the following command:</p> <pre class="highlight"><code>python Taxa_Assessment.py -i /02_Taxon_Assessment/Kaloula_UCE.fasta -t /02_Taxon_Assessment/Kaloula_Names.txt -o /02_Taxon_Assessment/Output/</code></pre> <p>Out of the 38,568 records in the starting data, 9,731 records were written to <em>Unmatched_Taxa.fasta</em> and 28,837 records were written to <em>Matched_Taxa.fasta</em>.</p> <h4><strong><em>3: Parse Loci:</em></strong></h4> <p>I created a general-use UCE locus file from the UCE 5k probe set, which was used to conduct searches. I searched for 5,041 UCE loci using the search terms in <em>UCE_5k_locus_file.txt</em>. I used the following command to retrieve locus-specific fasta files:</p> <pre class="highlight"><code>python Parse_Loci.py -i /03_Parse_Loci/Matched_Taxa.fasta -l /03_Parse_Loci/UCE_5k_locus_file.txt -t /03_Parse_Loci/Kaloula_Names.txt -o /03_Parse_Loci/Output/</code></pre> <p>The number of sequences found for each locus is reported in <em>Loci_Record_Counts.log</em>.</p> <h4><strong><em>4: Cluster, Blast and Extract (no reference sequences):</em></strong></h4> <p>I used automated clustering and orthology filtering for the retrieved UCE loci. The records contained in these particular fasta files were not expected to contain multiple loci and were therefore suitable for this method. I selected the dc-megablast algorithm for blastn, which is best for interspecific sequence searches, and used the -m span method to merge blast coordinates (which merges non-overlapping coordinates if they are within 100 bases of each other).
The command for this step is below (note the UCE locus-specific fasta files were moved to the /04_Cluster_Blast/ directory):</p> <pre class="highlight"><code>python Cluster_Blast_Extract.py -i /04_Cluster_Blast/ -b dc-megablast -m span</code></pre> <p>The orthology-filtered fasta files for all loci and the corresponding log files are in the output directory, specifically in the output folder '/04_Trimmed_Results/'. The log files show the original length, trimmed length, and coordinates used for every accession number included in the filtered fasta files. The resulting orthology-filtered fasta files were used for sequence selection.</p> <h4><strong><em>5: Sequence Filtering and Selection:</em></strong></h4> <p>I used <em>Filter_Seqs_and_Species.py</em> to select representative sequences by length, using the following command (note the UCE filtered fasta files were moved to the /05_Filter_Select/ directory):</p> <pre class="highlight"><code>python Filter_Seqs_and_Species.py -i /05_Filter_Select/ -f length -l 150 -t /03_Parse_Loci/Kaloula_Names.txt</code></pre> <p>This produced fasta files in which every available taxon is represented by a single best sequence (passing the respective filters), ready for pre-alignment steps. These fasta files can also be used to quickly create a table of NCBI accession numbers for all taxa and loci, explained in the following step.
</p> <h4><strong><em>6: Make Accession Table:</em></strong></h4> <p>This step produced an accession table containing columns of loci and rows of taxa, filled with NCBI accession numbers (if there was a sequence for that taxon and locus). Blank cells are filled with a dash. I moved the filtered fasta files produced in the previous step to a new directory called '/06_Species_Filtering_Results/' for this step.</p> <pre class="highlight"><code>python Make_Acc_Table.py -i /06_Species_Filtering_Results/</code></pre> <h4><strong><em>7: Adjust Direction:</em></strong></h4> <p>To prevent disasters in the alignment stage, it was necessary to ensure all sequences are in the correct orientation. This step was executed for a directory of fasta files, in this case the filtered, single-sequence-per-taxon fasta files. I used the following command for this step (note that all the relevant fasta files were moved to the /07_Adjust/ directory):</p> <pre class="highlight"><code>python Adjust_Direction.py -i /07_Adjust/</code></pre> <p>Although this step may seem unnecessary, the <em>Log_Sequences_Adjusted.txt</em> file that is produced shows how many sequences per locus required adjusting. In this case, no sequences required adjustment.</p> <h4><strong><em>8: Align:</em></strong></h4> <p>I performed alignments using two main strategies: I aligned the direction-adjusted fasta files with both mafft and clustal-o.
I used the following two commands, one per aligner (note that all the fasta files were in the /08_Align/ directory for this step):</p> <pre class="highlight"><code>python Align.py -i /08_Align/ -a mafft
python Align.py -i /08_Align/ -a clustalo</code></pre> <p>I ended up selecting the mafft alignments for the next steps.</p> <h4><strong><em>9: Relabel Fasta Records:</em></strong></h4> <p>I relabeled the records in the aligned fasta files to allow for concatenation downstream. The records up to this point contain the accession number and original description lines, but for concatenation they need to be relabeled with just the species or subspecies names. I relabeled by species, and used a text file to allow subspecies names to be included too. Relabeling was accomplished using the following command (note that all the fasta files were in the /09_Relabel/ directory for this step):</p> <pre class="highlight"><code>python Relabel_Fasta.py -i /09_Relabel/ -r species -s /03_Parse_Loci/Kaloula_Names.txt</code></pre> <h4><strong><em>10: Trim Alignments:</em></strong></h4> <p>I performed automated trimming of the aligned fasta files. I ran trimal using a gap threshold of 0.05 (which removes columns with &gt;95% missing data). The command for doing this is as follows (note that all the relabeled fasta files were in the /10_Trimming/ directory):</p> <pre class="highlight"><code>python Trim_Alignments.py -i /10_Trimming/ -f fasta -a gt</code></pre> <h4><strong><em>11: Concatenate Alignments:</em></strong></h4> <p>With the aligned fasta files trimmed and relabeled accordingly, the supermatrix was ready to be assembled.
The alignments were concatenated using the following command (note that all the trimmed, aligned fasta files were moved to the /11_Concatenation/ directory for this step):</p> <pre class="highlight"><code>python Concatenation.py -i /11_Concatenation/ -f fasta -s dash -o phylip</code></pre> <p>Note that this produced a concatenated alignment in phylip format and wrote a '-' symbol for missing data sequences. The 1,785 alignment files contained 14 unique taxa, and the final concatenated alignment contained 22,751 sequences and was 1,352,656 base pairs long. A report of the number of loci for each taxon is contained in the <em>Taxa_Loci_Count.log</em> file, and the <em>Data_Partitions.txt</em> file contains the location of each locus in the concatenated alignment, which can be used for partitioned phylogenetic analyses and other applications. The final phylip alignment was used as input to run a RAxML analysis.</p>
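<p>To make the coordinate-merging behavior of the -m span option from step 4 concrete, here is a minimal Python sketch. This is an illustration of the merging rule stated above (non-overlapping blast coordinates are joined when they fall within 100 bases of each other), not the actual Cluster_Blast_Extract.py code; the function name and input format are my own.</p>

```python
# Sketch of the '-m span' idea: merge blast coordinate intervals that
# overlap or lie within 100 bases of each other. Hypothetical helper,
# not the actual Cluster_Blast_Extract.py implementation.

def merge_spans(coords, max_gap=100):
    merged = []
    for start, end in sorted(coords):
        # Extend the previous interval if this one overlaps it or
        # starts within max_gap bases of its end.
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

# Example: the first two hits are 50 bases apart (merged); the third
# hit is 300 bases downstream (kept as a separate span).
hits = [(1, 200), (250, 400), (700, 900)]
spans = merge_spans(hits)
```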
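<p>The length-based filtering in step 5 can likewise be sketched: keep the single longest sequence per taxon, discarding anything below the minimum length (150 bases in the command above). This is a simplified stand-in for Filter_Seqs_and_Species.py; the function, record format, and accession labels here are hypothetical.</p>

```python
# Sketch of length-based sequence selection: one best (longest)
# sequence per taxon, subject to a minimum-length filter. Not the
# actual Filter_Seqs_and_Species.py code.

def select_by_length(records, min_length=150):
    # records: list of (taxon, accession, sequence) tuples
    best = {}
    for taxon, acc, seq in records:
        if len(seq) < min_length:
            continue  # fails the length filter entirely
        if taxon not in best or len(seq) > len(best[taxon][1]):
            best[taxon] = (acc, seq)  # keep the longer sequence
    return best

# Example with made-up accessions: the 300-bp sequence wins for
# Kaloula pulchra; the 100-bp Kaloula baleata sequence is dropped.
records = [
    ("Kaloula_pulchra", "AC1", "A" * 200),
    ("Kaloula_pulchra", "AC2", "A" * 300),
    ("Kaloula_baleata", "AC3", "A" * 100),
]
best = select_by_length(records)
```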
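<p>Finally, the concatenation step can be sketched in a few lines of Python. This is a simplified illustration of the supermatrix assembly described above, not the actual Concatenation.py implementation: it assumes each per-locus alignment is a dict mapping taxon names to aligned sequences, fills missing taxa with '-' characters (the -s dash behavior), and records 1-based partition coordinates for each locus.</p>

```python
# Simplified sketch of supermatrix concatenation (not the actual
# Concatenation.py code): per-locus alignments are dicts mapping
# taxon name -> aligned sequence; missing taxa are padded with '-'.

def concatenate(alignments):
    # Union of all taxa across loci, sorted for stable output.
    taxa = sorted({t for aln in alignments for t in aln})
    supermatrix = {t: [] for t in taxa}
    partitions = []  # 1-based (start, end) of each locus
    pos = 0
    for aln in alignments:
        length = len(next(iter(aln.values())))  # all seqs same length
        partitions.append((pos + 1, pos + length))
        pos += length
        for t in taxa:
            supermatrix[t].append(aln.get(t, "-" * length))
    return {t: "".join(parts) for t, parts in supermatrix.items()}, partitions

# Example with two tiny made-up "loci":
locus1 = {"Kaloula_pulchra": "ACGT", "Kaloula_baleata": "ACGA"}
locus2 = {"Kaloula_pulchra": "TTGCA"}
matrix, parts = concatenate([locus1, locus2])
```

<p>The partition list here plays the same role as the <em>Data_Partitions.txt</em> output: it records where each locus sits in the concatenated alignment for downstream partitioned analyses.</p>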