Each month we produce iTOL annotation files for the INPHARED genomes. These are produced by parsing the taxonomy data from GenBank files. However, NCBI taxonomy lags behind ICTV taxonomy, so some of the new families etc. haven’t been updated yet.
Below are iTOL annotation files for all phages in the current VMR40, to allow the latest taxonomy to be rapidly applied to trees in iTOL.
See our publication in PHAGE to read about how this dataset is produced and some of our analyses of it. Please consider citing this paper if you are using this database or the information on this webpage. You can also generate an up-to-date version of the database, with useful files for vConTACT2, MASH, and iTOL, using our Perl script available on GitHub. Updates to the script this month include a new column in the TSV outputs, which includes anything identified as “host” or “lab_host” within the original GenBank files. However, these values may be inconsistent or downright bizarre (so please use them with caution).
We also recently added annotations using PHROGs (more details available here), and you can download the updated annotations from HERE. Please note that we won’t be re-uploading the updated annotations on a monthly basis, as the file is huge; the current version was updated on 1 Jan 2025. Having the first ~32,000 genomes already annotated will save users a lot of time when using the Perl script themselves.
If you don’t want to run the script yourself, please download all of the files ready-made from below:
For continued testing and demonstration of taxMyPhage we have run all the new genomes added to the INPHARED database in the last 30 days through taxMyPhage (dsDNA phages). Running 250 genomes takes < 2 hrs.
The web version of taxMyPhage is here, or see the GitHub. The preprint is here.
Analysing these ~200 genomes demonstrates the large diversity of phages that are routinely being identified, with 50 not falling within currently defined genera.
As our group sequences a lot of bacteriophages, we often want to know as quickly as possible whether they represent a new genus and/or species. So after several incarnations we have developed taxMyPhage to do this as efficiently as possible. It included contributions from several undergrad research projects (Maria Lestido, Moi Thomas and Deven Webster), further brainstorming with Thomas and code development, Remi, who made it work a lot quicker and made all the conda packages work, and taxonomy guidance and testing from Dann. We now have a standalone tool, a web version and a preprint.
What it will do
Classify dsDNA phages at the genus and/or species level only.
The webserver will provide taxonomy for a predicted phage and allow download of an upper-right matrix of similarity against other phages classified by ICTV. It will NOT compare against other phages in NCBI, only species that are classified by ICTV.
The standalone tool offers the same, with additional options. It has no restriction on the number of input sequences and can be run on 1000s of sequences. If the provided genomes are not complete then inaccurate results may be obtained, because we are implementing the algorithm developed here, which normalises sequence similarity over total genome length. Additionally, the standalone tool will produce similarity matrices for multi-fasta input, allowing calculation of intergenomic similarity. One use case for this might be identifying representative sequences from a large dataset.
Interpretation
The output of taxMyPhage will be an upper-right matrix (example below) and a TSV that provides the assigned taxonomy.
Example of new genus and new species
The matrix below shows the query has < 70% ANI to any currently classified phage, so it will be a new genus based on the ICTV criterion of >70% for membership of the same genus. As it is < 70% ANI, it is also < 95% ANI, and so is also a new species. The closest genomes identified in searching were all members of the genus Phapecoctavirus. Further phylogenetic analysis would be required to confirm this is the closest genus of phages, as this is just based on similarity cutoffs.
Example of an existing species and conflicts in taxonomy
Below, the query has >95% ANI to an existing phage, so would be the same species (Muminvirus mumin). However, there are other things to note. The query has greater than 70% ANI with two distinct genera (other already classified phages do too). Taxonomy is not perfect; taxMyPhage cannot solve these issues but will report them, for the user then to decide what to do.
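To make the two cutoffs concrete, here is a minimal sketch of the decision described in the examples above. It is not taxMyPhage code: the function name is made up, it only looks at the single best hit (so it will not spot the two-genera conflict mentioned above), and treating a value exactly at 70 or 95 as "within" is my assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT part of taxMyPhage.
# Input: highest % similarity between the query and any ICTV-classified phage.
# Assumption: values exactly on a cutoff (70 or 95) count as within the taxon.
classify_by_similarity() {
    local best_hit="$1"
    awk -v s="$best_hit" 'BEGIN {
        if (s >= 95)      print "existing genus, existing species"
        else if (s >= 70) print "existing genus, new species"
        else              print "new genus, new species"
    }'
}

classify_by_similarity 62.4   # prints: new genus, new species
classify_by_similarity 97.1   # prints: existing genus, existing species
```

The similarity value itself would come from the matrix that taxMyPhage writes out; this only shows how the thresholds stack.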
Based on the command we use for searching for phages in GenBank for the INPHARED database, I recently came across some useful further E-utilities commands. Writing them here for when I forget.
To list the most recent genomes, from the last 50 days for instance, use the -days flag.
Then use the esummary and xtract utilities to extract specific information, in this case the accession, creation date, update date, title and TaxID, to produce an easily parsable list in tab format.
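A rough sketch of how those pieces chain together with NCBI EDirect is below. It assumes `esearch`, `esummary` and `xtract` are installed and on the PATH. The function name is made up, the query string in the example call is only a placeholder and is not the actual INPHARED search, and the element names are the nucleotide document summary fields as I understand them.

```shell
#!/usr/bin/env bash
# Sketch only: needs NCBI EDirect and network access when actually called.
# list_recent_records is a hypothetical wrapper name.
list_recent_records() {
    local query="$1"   # Entrez query (placeholder, supply your own)
    local days="$2"    # only return records from the last N days
    esearch -db nucleotide -query "$query" -days "$days" |
        esummary |
        xtract -pattern DocumentSummary \
               -element AccessionVersion CreateDate UpdateDate Title TaxId
}

# Example call, commented out so nothing is fetched unless you opt in:
# list_recent_records 'phage[Title] AND complete genome[Title]' 50 > recent.tsv
```

Each output line is one record with the fields separated by tabs, so it drops straight into `cut`, `awk` or a spreadsheet.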
More ranting to myself than anything, but others might find it useful
The “VMR” is described as
“The current Virus Metadata Resource (VMR) that provides a list of all exemplar viruses can be downloaded from the link below.”
Most of it is, and it is really useful … apart from the accessions that don’t point to phages but to bacterial genomes, without warning or indication:
“Bartogtaviriformidae Bartonegtaviriform Bartonegtaviriform andersoni E Bartonella gene transfer agent BaGTA BX897699 Partial genome dsDNA bacteria “
with accession BX897699 pointing to the genome of a GTA in Bartonella … why GTAs are classified as viruses is slightly confusing (but heyho, argument for another day).
But all these accessions point to bacterial genomes.
If you were trying to use the VMR to automatically extract all representative genomes of viruses that infect bacteria, then you need to check that they are actually bacteriophages (or GTAs!) and not an entire bacterial genome.
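One way to catch these before downloading anything is to list the rows worth a manual look. The sketch below assumes the VMR has been exported to a tab-separated file and that the columns are headed "Virus GENBANK accession", "Genome coverage" and "Host source". Those header names and the function name are my assumptions, so check them against the VMR version you have. Flagging anything not marked complete is only a heuristic, and checking the length of the sequence behind each accession would be a sensible second step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print accession and coverage for bacterial-host rows
# whose genome coverage is not marked as complete.
# Assumption: VMR exported as TSV with the header names given below.
flag_partial_bacterial_rows() {
    local vmr_tsv="$1"
    awk -F'\t' -v acc="Virus GENBANK accession" \
               -v cov="Genome coverage" \
               -v host="Host source" '
        NR == 1 {
            # Map header names to column numbers so column order does not matter
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!(acc in col) || !(cov in col) || !(host in col)) {
                print "expected column missing from header" > "/dev/stderr"
                exit 1
            }
            next
        }
        tolower($(col[host])) ~ /bacteria/ && tolower($(col[cov])) !~ /^complete/ {
            print $(col[acc]) "\t" $(col[cov])
        }
    ' "$vmr_tsv"
}

# Example call (file name is a placeholder):
# flag_partial_bacterial_rows VMR.tsv
```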
With the influx of phage genomic data, there have been several changes to bacteriophage taxonomy. See the paper “A Roadmap for Genome-Based Phage Taxonomy” by Evelien Adriaenssens and Dann Turner who have led these efforts with their work with ICTV. As a result, the classical phage families of Podoviridae, Siphoviridae and Myoviridae are: kaput, have shuffled off their mortal coil, given up the ghost, or any other preferred phrase… generally, they are no more. Some will be upset by this, I am sure. But personally, I am more than happy to see them go. While they were useful for a period of time, the ability to rapidly sequence most phage genomes and put phages “in boxes” based on their genomic content, rather than what they looked like, offers so much resolution to describe differences. For Podoviridae, Siphoviridae and Myoviridae aficionados, the morphotypes of Myovirus, Podovirus and Siphovirus live on.
As a result of updates to phage taxonomy, 100s of new genera have now been created. For our own work, we are interested in rapidly identifying which genus newly isolated phages fall within. Consequently, we have collected genomes of all (dsDNA) phages classified by ICTV and organised them into directories based on genus. We have also extracted the common marker terL for each phage, which can then be used as input for alignments for creating phylogenies, to determine how related a new phage may be. Additionally, we have run VIRIDIC on each genus.
We are hoping to have an automated process available that will rapidly identify if a new phage genome falls within a known phage genus, based on current ICTV guidelines. As others are probably comparing phage genomes to known taxa, and the data may well be useful to them, we have made it all downloadable as a single file called ICTV_genera.tar.gz, which can be downloaded here.
It contains 1653 directories (Genera)
Each directory contains:
*gff files of every ICTV classified phage genome of that genus.
*fsa individual fasta files of every ICTV classified phage genome of that genus.
*_terL.ffn – automated extract of terL gene of every ICTV classified phage genome of that genus.
04_VIRIDIC_out folder
*pdf heatmap for that genus
*clusters.csv file from VIRIDIC for that genus
*MA_genCol.csv from VIRIDIC for that genus
If genera only have 1 species, we don’t run VIRIDIC for obvious reasons
If the directory is empty, it is a genus of an RNA phage (still sorting this- see above about dsDNA)
To get the data use :
wget http://warwick.s3.climb.ac.uk/inphared/ICTV_genera.tar.gz
tar -xvf ICTV_genera.tar.gz
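Once it is unpacked, a quick sanity check is to count the genus directories and see how many are empty or have VIRIDIC output. This is a small sketch, assuming one directory per genus directly under the path you pass in, and a 04_VIRIDIC_out folder sitting directly inside each genus directory, as in the listing above. The function name and the example path are made up.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarise an unpacked ICTV_genera archive.
# Assumption: each genus is a directory directly under the given root.
summarise_genera() {
    local root="$1" n_total=0 n_empty=0 n_viridic=0 dir
    for dir in "$root"/*/; do
        [ -d "$dir" ] || continue              # skip if the glob matched nothing
        n_total=$((n_total + 1))
        if [ -z "$(ls -A "$dir")" ]; then
            n_empty=$((n_empty + 1))           # empty: genus of an RNA phage
        elif [ -d "${dir}04_VIRIDIC_out" ]; then
            n_viridic=$((n_viridic + 1))       # VIRIDIC was run for this genus
        fi
    done
    printf 'genera\t%d\nempty\t%d\nwith_viridic\t%d\n' \
        "$n_total" "$n_empty" "$n_viridic"
}

# Example call (use whatever directory the archive unpacks to):
# summarise_genera ./ICTV_genera
```

Genera that are neither empty nor counted under with_viridic should be the single-species ones where VIRIDIC was not run.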
If you find this useful, consider citing Cook, et al 2021. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. https://doi.org/10.1089/phage.2021.0007
Ryan has recently been testing the Short Read Eliminator kit from Circulomics to enrich for long reads for nanopore sequencing of viromes. The input for virome samples was various liquids produced or excreted by cows … often smelly and sticky, and generally not much fun to extract the viral fraction from.
Given the nature of the samples, getting large amounts of DNA, let alone HMW DNA, from the small sample volumes is not possible, so we resort to MDA amplification to produce enough DNA for nanopore sequencing. To enrich for longer reads, Ryan tried multiple samples with and without enrichment using the Short Read Eliminator kit.
The results look encouraging.
Duplicates of the same library were treated with the Short Read Eliminator kit and compared to untreated samples.
It shifts the median read length from 1.9 kb to 6.9 kb, which still aren’t huge reads. But given the average size of phage genomes and the low input issues, it might make a difference to the final assembly. Initial results of assembly of a small number of samples look encouraging, with ~1000 complete phage genomes as predicted by CheckV. This is in line with a similar number of genomes we have previously assembled from a single seawater sample, where we found a 650 kb phage genome.
We still have to determine how this SRE kit might exclude some phages that have small genomes, as we have previously observed differences in the population of smaller phage genomes when comparing nanopore and Illumina sequencing (Cook et al 2021 Microbiome).
*As I found this recently, here is Ryan talking about some of his previous work on PromethION sequencing of viromes.
The recent PHROGs database from Terzian et al is a great resource for phage annotation. Previously we re-formatted this database into HMMs that are suitable for use within Prokka (read about it HERE and download the HMMs for yourself HERE).
Ryan has added this resource to our INPHARED dataset to re-annotate the genomes of all cultured phages that we can identify in GenBank. The updated GenomesDB folder of INPHARED can be downloaded from here (warning: it’s a big tar file), with > 19,000 genomes now annotated in a consistent manner. We have found the PHROGs annotation really useful for finding homologues by string searching on annotations, thanks to the standardised annotation provided by the PHROGs team.
These annotations are fully automated, so for those who have spent 100s of hours annotating one phage, these annotations are most likely not “better” annotations. But they are entirely consistent across all the phages we have re-annotated, which matters for the analysis we are interested in doing. Ryan has more specific details on how to update the database on his GitHub page. The PHROGs team provide a brilliant interactive site to explore all the PHROGs they annotated here.
Removal of incomplete phage genomes
Thanks to Evelien, who has identified several hundred incomplete phages in the database. These have been removed and added to the exclusion list. Full details of those excluded are on the GitHub page, where you can also add accessions of other phages that you might spot, which will then be excluded in versions going forward.