Bias in phage genomes

What started sometime in 2019 as a search for a number to put into the introduction of a paper has ended up, a few years later, as what is hopefully a useful paper. That number was how many complete phage genomes are currently available in public databases. At the time, NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/), which contains some of this information, had not been released. Nathan Brown and I wrote a quick script that used the esearch/efetch facilities to extract phage genomes, then applied several filtering steps, with lots of manual filtering, to extract “complete” phage genomes. We started providing this data on the website for download. After requests from people about how to cite this list, and some reminding from Branko Rihtman, we have finally got to a pre-print. Ryan Cook has tidied up the code a lot and parsed lots of information that can be extracted from the GenBank files.
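For anyone curious what an esearch/efetch approach can look like, here is a minimal sketch in Python. It is not the script we wrote. The function names, the example search term and the "complete genome" title check are all assumptions made for illustration, and a title check like this is far cruder than the manual filtering described above.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities web service.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=100):
    """Build an esearch URL that returns matching nucleotide record IDs as JSON."""
    params = {"db": "nuccore", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(ids):
    """Build an efetch URL that returns GenBank flat files for the given IDs."""
    params = {"db": "nuccore", "id": ",".join(ids), "rettype": "gb", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


def looks_complete(definition):
    """Hypothetical first-pass filter on a GenBank DEFINITION line.

    Assumption: a record is kept if its title says "complete genome"
    and does not say "partial". Real filtering needs manual checks.
    """
    text = definition.lower()
    return "complete genome" in text and "partial" not in text


def fetch_text(url):
    """Download a URL and return the body as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

To try it, pass a search term of your choice to build_esearch_url, download the result with fetch_text, then feed the returned IDs to build_efetch_url in small batches. NCBI asks users to limit their request rate, so a real script would pause between calls.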

In extracting this information we found many things:

There is a big bias in the hosts that phages are isolated on – most phages are isolated on a small number of host bacteria

There are far more lytic phage genomes than temperate ones – with most temperate phage genomes coming from an even smaller number of hosts

The number of putative antibiotic resistance genes differs between lytic and temperate phages, and between hosts

Jumbo phages are not always rare – again dependent on the host

Even for hosts where large numbers of phages have been isolated, we are a long way from sampling the number of predicted phage species

All the data can be accessed via GitHub: https://github.com/RyanCook94/

The pre-print is on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.05.01.442102v1.article-metrics