<p><strong><em>TL;DR for impatient people: If you just need a qiime2 GTDB classifier quickly and don't care how I built it, download the .qza files and follow the two template bash scripts to slice the reference to your primers.</em></strong></p>
<p><strong>Goal</strong>: To correlate SSU amplicons with GTDB taxonomy, so that we know how to pick genomes that are representative of organisms identified in amplicon sequencing surveys.</p>
<p><strong>Method</strong>: I took the SSU sequence files from GTDB ssu_r86.1_20180911, curated them semi-manually to remove obvious inconsistencies between GTDB and SILVA132 taxonomy (assigned using the RDP classifier) at the domain and phylum levels, and created taxonomy/fna artifacts. These can be used in your qiime2 pipeline with minimal modification: slice to your primer sequences, then train your classifier.</p>
<p><strong>NB</strong>: Most scripts/steps are included, but I did some steps with bash tools like cut/sed and have not included all of them.</p>
<p><strong>Note for oceanographers</strong>: Some taxa that are abundant in amplicon data appear to be either missing from or under-represented in GTDB at the moment. Two examples I've seen so far are <em>Candidatus</em> Actinomarina and SAR11 Group IV. They don't get classified beyond "Bacteria" using "qiime feature-classifier classify-sklearn" with default parameters in qiime2-2018.8. If you fool around with the settings (e.g. set --p-confidence to -1), you can get qiime2 to give a proper phylum-level classification, but it seems inaccurate beyond that level (e.g. <em>Candidatus</em> Actinomarina SSU seqs get classified as some other actinobacterium from a non-marine environment).</p>
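<p>The "slice to your primer sequences and then train your classifier" step from the Method can be sketched roughly as follows. This is a minimal sketch, not the template scripts themselves: every file name is a hypothetical placeholder, the primer pair is just an example V4 pair to be replaced with your own, and the script only prints the commands unless you set DRY_RUN=0 in an environment where qiime2 is installed.</p>

```shell
#!/usr/bin/env bash
# Sketch: slice a GTDB SSU reference to a primer pair, train a
# naive Bayes classifier, then classify amplicon sequences.
# All file names and primers below are placeholders (assumptions).
set -euo pipefail

REF_SEQS="gtdb_ssu_seqs.qza"       # hypothetical reference sequence artifact
REF_TAX="gtdb_ssu_taxonomy.qza"    # hypothetical reference taxonomy artifact
QUERY="rep-seqs.qza"               # hypothetical: your own representative sequences
F_PRIMER="GTGYCAGCMGCCGCGGTAA"     # example forward primer; replace with yours
R_PRIMER="GGACTACNVGGGTWTCTAAT"    # example reverse primer; replace with yours

# Dry run by default: print each command instead of executing it.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# 1. Slice the reference sequences to the region between the primers.
run qiime feature-classifier extract-reads \
  --i-sequences "$REF_SEQS" \
  --p-f-primer "$F_PRIMER" \
  --p-r-primer "$R_PRIMER" \
  --o-reads ref-seqs-sliced.qza

# 2. Train the classifier on the sliced reference.
run qiime feature-classifier fit-classifier-naive-bayes \
  --i-reference-reads ref-seqs-sliced.qza \
  --i-reference-taxonomy "$REF_TAX" \
  --o-classifier classifier.qza

# 3. Classify your amplicon sequences with the trained classifier.
run qiime feature-classifier classify-sklearn \
  --i-classifier classifier.qza \
  --i-reads "$QUERY" \
  --o-classification taxonomy.qza
```

<p>To try the relaxed setting mentioned in the note for oceanographers, add --p-confidence -1 to the last command; as described above, expect classifications below phylum level to be unreliable for those taxa.</p>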
