Phage genomes up to 2018
While filtering for complete phage genomes, it became clear that phages with the same name can have multiple accession numbers, for several reasons. The simplest is the presence of both a RefSeq and an original accession number, e.g. Synechococcus phage S-CAM9 has a RefSeq accession (NC_031922) and the original (KU686206). So while the RefSeq record has a unique accession, it is a duplicate of an existing genome. We were therefore counting these phages twice, and they are easy to remove from the total. Doing so reduces the total by ~2300, leaving ~7700 genomes, as seen above.
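As a rough illustration of the RefSeq de-duplication step, here is a minimal Python sketch. It is not the script we used. It assumes the records have already been reduced to (accession, phage name) pairs, and it relies on the fact that RefSeq accessions start with two letters and an underscore. The function names and the "NC_PLACEHOLDER" entry are made up for the example.

```python
import re

# RefSeq accessions start with two letters and an underscore (e.g. NC_031922).
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def is_refseq(accession):
    """Return True if the accession looks like a RefSeq accession."""
    return bool(REFSEQ_PREFIX.match(accession))


def drop_refseq_duplicates(records):
    """Remove RefSeq records that share a phage name with an original record.

    `records` is a list of (accession, phage_name) tuples. A RefSeq record
    with no same-named original is kept, so no phage is lost entirely.
    """
    names_with_original = {name for acc, name in records if not is_refseq(acc)}
    return [
        (acc, name)
        for acc, name in records
        if not (is_refseq(acc) and name in names_with_original)
    ]


records = [
    ("NC_031922", "Synechococcus phage S-CAM9"),  # RefSeq copy
    ("KU686206", "Synechococcus phage S-CAM9"),   # original submission
    ("NC_PLACEHOLDER", "Hypothetical phage X"),   # made-up RefSeq-only entry
]
deduplicated = drop_refseq_duplicates(records)
print(deduplicated)
```

Matching on name rather than dropping every RefSeq accession outright is a cautious choice: a RefSeq record is only removed when an original record for the same phage is actually present in the set.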
But this doesn’t mean that all of the remaining genomes are unique. Again using the example of Synechococcus phage S-CAM9, there are three accession numbers, KU686206, KU686205 and KU686204, that are all derivatives of S-CAM9, each being a different isolate [see the paper]. The filtering we used pulled out the “phage name” but not the “isolate number” [something we need to fix in our script]. It is not possible to determine from the GenBank file descriptions how similar phage isolates are, as the differences can vary from 1 SNP to multiple gene deletions, so we grouped all phages with identical “phage names” into a single entity, on the basis that they are very similar. This data is also plotted as “uniquely named phage”.
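The grouping by name can be sketched in a few lines of Python. Again, this is an illustrative sketch rather than our actual script. It assumes the phage name has already been extracted from each GenBank description, and the function and variable names are made up for the example.

```python
from collections import defaultdict


def group_by_phage_name(records):
    """Collect accessions under their shared phage name.

    `records` is a list of (accession, phage_name) tuples. Isolate
    information is deliberately ignored, as in the filtering described above.
    """
    groups = defaultdict(list)
    for accession, phage_name in records:
        groups[phage_name.strip()].append(accession)
    return dict(groups)


isolate_records = [
    ("KU686206", "Synechococcus phage S-CAM9"),
    ("KU686205", "Synechococcus phage S-CAM9"),
    ("KU686204", "Synechococcus phage S-CAM9"),
]
name_groups = group_by_phage_name(isolate_records)

# The number of keys is the "uniquely named phage" count for this input.
print(len(name_groups))
print(name_groups["Synechococcus phage S-CAM9"])
```

Here the three S-CAM9 isolates collapse into a single named entity while their accessions are kept, so the isolates can still be separated later once the isolate number is parsed.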
Just how similar phages with the same name are is hard to tell. In some instances, phages from experimental evolution experiments are different isolates of the same phage but have identical genomes, whereas environmental isolates with the same phage name but different isolate numbers are roughly the same phage species [whatever that is, but let’s not go there] and may lack several genes.
So what does this mean?
Well, the number of unique phage genomes in GenBank [not 100% identical] is probably somewhere between ~6200 and ~7700. A number of isolates of the same genome have been sequenced multiple times.
The number of phages with RefSeq accessions has barely increased in the last 6 months. However, the numbers of “total genomes” and “uniquely named phage” are increasing month on month. Therefore, using only phage genomes from RefSeq is likely to vastly under-represent the diversity of phages when constructing a phage database.
Files for all phage genomes and for RefSeq-only genomes, along with a premade MASH database, can be downloaded below.