# Phylign
Phylign (Břinda *et al.*, 2023) allows efficient searching and full alignment of
query sequences to huge datasets of bacterial assemblies that have been phylogenetically
compressed using MiniPhy (https://github.com/karel-brinda/MiniPhy).
We have recently adapted Phylign to allow users to align query sequences to AllTheBacteria v0.2
or subsets of this dataset. Detailed information, including how to run Phylign on computing clusters,
can be found in the README on our GitHub page (https://github.com/AllTheBacteria/Phylign/blob/main/README.md).
Briefly, the main steps to set this up are as follows:
1. Install conda via `apt-get` on Linux, `brew` on macOS, or by following the instructions on the Anaconda
website (https://conda.io/projects/conda/en/latest/user-guide/install/index.html).
2. Clone the Phylign repository by opening a terminal and running
`git clone https://github.com/AllTheBacteria/Phylign && cd Phylign`.
3. Install all of the Phylign dependencies in a conda environment by running
`conda env create -f environment.yaml && conda activate phylign`.
4. Download the MiniPhy-compressed batches of assemblies you want to query and place them in `asms/`.
The compressed assemblies are available at https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/assembly/.
5. Download the compressed COBS indices that match the batches of assemblies you downloaded and
place them in `cobs/`. The COBS indices are available at
https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/indexes/phylign/.
6. If you only want to query a subset of the assemblies, edit `data/batches_2m.txt` so that it
lists only the batches of assemblies you are interested in.
7. Replace all of the files in `input/` with your query FASTA or FASTQ files.
8. Run `make` to search for the query sequences in the assemblies you downloaded. The results will
be saved to `output/`.
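
Step 6 can be done by hand in a text editor, but for a long batch list a small helper may be less error-prone. The following is a minimal sketch, not part of Phylign itself. It assumes that `data/batches_2m.txt` contains one batch name per line; the function name `filter_batches` and the file `my_batches.txt` are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash

# Reduce a batch list to the batches named in a second file.
# Assumption: both files hold one batch name per line.
filter_batches() {
    local batch_list="$1"   # e.g. data/batches_2m.txt
    local wanted="$2"       # hypothetical file listing the batches to keep
    local tmp
    tmp="$(mktemp)" || return 1

    # -F: treat patterns as fixed strings, -x: match whole lines only,
    # -f: read the patterns (wanted batch names) from a file.
    if grep -Fx -f "$wanted" "$batch_list" > "$tmp"; then
        # Overwrite in place so the original file permissions are kept.
        cat "$tmp" > "$batch_list"
        rm -f "$tmp"
    else
        # No wanted batch was found: leave the batch list untouched.
        rm -f "$tmp"
        echo "no matching batches found; $batch_list left unchanged" >&2
        return 1
    fi
}
```

For example, running `filter_batches data/batches_2m.txt my_batches.txt` from the repository root would keep only the listed batches, in their original order. It is worth keeping a copy of the original `data/batches_2m.txt` (or restoring it with `git checkout`) in case you want to query the full dataset later.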