Adding More Reference Genomes to vConTACT2 Clusters

The virus clustering programme vConTACT2 is a fantastic tool for assigning taxonomy to large sets of viral contigs. In short, it clusters unknown viruses with those in the RefSeq database based on the protein clusters they share.

To give viral clusters even more context, though, you may wish to include reference genomes beyond those in RefSeq.

To supplement the RefSeq genomes, I took all of the phage genomes on MillardLab and removed those already in RefSeq (to avoid duplication). The remaining genomes were run through dedupe.sh (part of BBTools) with a 95% minimum identity to remove highly similar sequences. This left a custom subset of 7,527 genomes.
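
The filtering and deduplication steps above could be sketched roughly as follows. How the RefSeq genomes were identified is not stated, so this assumes a plain-text list of sequence IDs to exclude, one per line. The file names (refseq_ids.txt, millardlab.fasta and so on) and the drop_listed_ids helper are hypothetical. The dedupe.sh call is left commented out because it needs BBTools installed; minidentity=95 is one plausible reading of "95% minimum ID", not a confirmed record of the command used.

```shell
# drop_listed_ids IDS.txt GENOMES.fasta
# Print every FASTA record whose ID (first word of the header, minus ">")
# is NOT listed in IDS.txt. Both file names are placeholders.
drop_listed_ids() {
    awk 'FILENAME == ARGV[1] { if (NF) skip[$1] = 1; next }
         /^>/ { id = substr($1, 2); keep = !(id in skip) }
         keep' "$1" "$2"
}

# Example usage (assumed file names):
# drop_listed_ids refseq_ids.txt millardlab.fasta > non_refseq.fasta
#
# Deduplication with BBTools (assumption: 95% minimum ID maps to minidentity=95):
# dedupe.sh in=non_refseq.fasta out=custom_refs.fasta minidentity=95
```

IDs are matched exactly, so accession version suffixes (for example ".1") need to be consistent between the list and the FASTA headers.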

Genes were called on the 7,527 genomes using Prodigal. From the predicted proteins, a protein sequence (.faa) file and a gene-to-genome mapping (.csv) file were produced so the reference genomes could be used to supplement vConTACT2 clustering.
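
For anyone wanting to build similar files from their own genomes, here is a minimal sketch of the gene calling and mapping steps. It assumes Prodigal's usual protein headers (">CONTIG_N # start # end # ..."), the three-column vConTACT2 mapping layout (protein_id,contig_id,keywords), and "None" as a placeholder keyword. The faa_to_mapping helper, the file names and the choice of -p meta are illustrative assumptions, not a record of the exact commands used here.

```shell
# faa_to_mapping PROTEINS.faa
# Write a gene-to-genome mapping CSV from a Prodigal protein FASTA.
faa_to_mapping() {
    awk 'BEGIN { print "protein_id,contig_id,keywords" }
         /^>/ {
             protein = substr($1, 2)      # header text up to the first space
             contig = protein
             sub(/_[0-9]+$/, "", contig)  # strip the gene index that Prodigal appends
             print protein "," contig ",None"
         }' "$1"
}

# Example usage (assumed file names; requires Prodigal):
# prodigal -i custom_refs.fasta -a database.faa -p meta
# faa_to_mapping database.faa > database.csv
```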

Click HERE for the mapping (.csv) file.
Click HERE for the sequence (.faa) file.
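
One way to use the two files is to combine them with your own proteins and mapping before running vConTACT2. The sketch below assumes both mapping files start with the same header row; my_contigs.faa, my_contigs.csv, the output names and the merge_mappings helper are hypothetical. The vcontact2 call follows the pattern in the vConTACT2 README, but the database name and ClusterONE path will depend on your installation, so treat them as placeholders.

```shell
# merge_mappings FIRST.csv SECOND.csv
# Concatenate two mapping files, keeping only the first file's header row.
merge_mappings() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$1" "$2"
}

# Example usage (assumed file names):
# cat my_contigs.faa database.faa > combined.faa
# merge_mappings my_contigs.csv database.csv > combined.csv
#
# vcontact2 --raw-proteins combined.faa --proteins-fp combined.csv \
#     --db 'ProkaryoticViralRefSeq85-Merged' --rel-mode Diamond \
#     --pcs-mode MCL --vcs-mode ClusterONE \
#     --c1-bin /path/to/cluster_one-1.0.jar --output-dir vcontact2_out
```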

A list of these genomes can also be pulled from the second column of the mapping file using the following command (potentially useful when visualising the resulting network):

awk -F ',' '{print $2}' database.csv | sort | uniq
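
If the mapping file starts with a header row (vConTACT2 mapping files normally begin with protein_id,contig_id,keywords), the command above will list "contig_id" as if it were a genome. A hedged variant that skips such a header when it is present is sketched below; list_genomes is a hypothetical helper name.

```shell
# list_genomes MAPPING.csv
# Print the unique values of column 2, ignoring a "contig_id" header if present.
list_genomes() {
    awk -F ',' 'NR == 1 && $2 == "contig_id" { next } { print $2 }' "$1" | sort -u
}

# Example usage:
# list_genomes database.csv > reference_genomes.txt
```

Both versions split on every comma, so they assume genome names do not themselves contain commas.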

Happy clustering!