# Description of the dataset
## Structure
| Folder | Content |
|---|---|
| alignments | Tir receptor family MSA |
| anchor | Anchor motif predictions |
| disopred | DISOPRED disorder predictions |
| disopred_agg-clas | DISOPRED aggregation and classes |
| fasta | Sequence collections from UniProt |
| figures | Plots derived from analysis |
| iupred | IUPred 1.0 disorder predictions |
| iupred_agg-clas | IUPred 1.0 aggregation and classes |
| maps | Species and taxa in FASTA collection |
| motif_vs_disorder | Merged data from anchor and aggregated DISOPRED |
## Logic
**The code for data processing can be found in [this repository](https://osf.io/cxkjf/)**
Sequences were fetched from UniProt and sorted in collections under `fasta`.
Three effector collections were assembled: *E. coli* EHEC, *E. coli* EPEC, and *C. rodentium*.
For each of them, the corresponding taxon and species names were extracted. The resulting dictionaries were saved under `maps`.
The taxon lists were used to fetch the available UniProt reference proteomes for each collection. As a reference, the human proteome was also collected. All of these sequence collections are also stored under `fasta`.
Then, each collection was processed using IUPred 1.0 *short* and *long* modes and DISOPRED 3.1.
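
The taxon and species extraction step can be illustrated with a minimal sketch. It is not the code from the repository linked above. It assumes the collections use standard UniProt FASTA headers, which carry the organism name in the `OS=` field and the NCBI taxon identifier in the `OX=` field. The function name `taxon_species_map` and the choice of taxon ID as dictionary key are hypothetical.

```python
import re

# Standard UniProt FASTA headers contain "OS=<organism name> OX=<taxon id>".
HEADER_RE = re.compile(r"OS=(?P<species>.+?) OX=(?P<taxon>\d+)")


def taxon_species_map(fasta_lines):
    """Return {taxon_id: species_name} built from the header lines of a FASTA stream.

    Hypothetical helper: the layout of the dictionaries under `maps` may differ.
    """
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        match = HEADER_RE.search(line)
        if match:  # headers without OS=/OX= fields are ignored
            mapping[match.group("taxon")] = match.group("species")
    return mapping
```

Any iterable of lines works as input, so an open file handle for one of the collections under `fasta` can be passed directly.
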