INPHARED re-annotated with PHROGs

The recent PHROGs database from Terzian et al. is a great resource for phage annotation. Previously, we reformatted this database into HMMs that are suitable for use within Prokka (read about it HERE and download the HMMs for yourself HERE).

Ryan has added this resource to our INPHARED dataset to re-annotate the genomes of all cultured phages that we can identify in GenBank. The updated GenomesDB folder of INPHARED can be downloaded from here (warning: it is a big tar file), with more than 19,000 genomes now annotated in a consistent manner. Because the PHROGs team provide standardised annotations, we have found them really useful for finding homologues by simple string searches of the annotation text.
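As a rough illustration of what we mean by string searching, here is a minimal Python sketch that pulls /product qualifiers out of GenBank-format text and keeps those containing a keyword. The function name find_products and the example records are made up for illustration, and the sketch assumes the annotation text lives in the /product qualifier; it is not the code we use ourselves.

```python
import re

# Matches a /product="..." qualifier, including values wrapped over several lines.
PRODUCT_RE = re.compile(r'/product="([^"]*)"')


def find_products(genbank_text, keyword):
    """Return the product annotations that contain keyword (case-insensitive)."""
    hits = []
    for match in PRODUCT_RE.finditer(genbank_text):
        # Collapse line wraps and feature-table indentation into single spaces.
        product = " ".join(match.group(1).split())
        if keyword.lower() in product.lower():
            hits.append(product)
    return hits


# Hypothetical fragment of a GenBank feature table, for illustration only.
example = '''
     CDS             1..1500
                     /locus_tag="EXAMPLE_0001"
                     /product="terminase large subunit"
     CDS             1600..2400
                     /locus_tag="EXAMPLE_0002"
                     /product="major head
                     protein"
     CDS             2500..3000
                     /locus_tag="EXAMPLE_0003"
                     /product="Terminase small subunit"
'''

print(find_products(example, "terminase"))
```

Because every genome carries the same annotation vocabulary, a search like this behaves the same way across all the genomes, which is the whole point of the re-annotation.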

These annotations are fully automated, so for those who have spent hundreds of hours annotating a single phage, they are most likely not “better” annotations. They are, however, entirely consistent across all the phages we have re-annotated, and that consistency is what matters for the analyses we are interested in. Ryan has more specific details on how to update the database on his GitHub page. The PHROGs team provide a brilliant interactive site for exploring all the PHROGs they annotated here.

Removal of incomplete phage genomes

Thanks to Evelien, who has identified several hundred incomplete phages in the database; these have been removed and added to the exclusion list. Full details of the excluded genomes are on the GitHub page, where you can also add accessions of other phages that you might spot here, which will then be excluded from future versions.
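If you want to apply an exclusion list to your own set of accessions, something along these lines would do it. This is only a sketch: it assumes the list is plain text with one accession per line (blank lines and lines starting with # ignored), which may not match the actual file on GitHub, and the accessions shown are invented.

```python
def load_exclusions(text):
    """Parse an exclusion list: one accession per line, blanks and # comments skipped."""
    excluded = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            # Drop any version suffix so "XX000002.1" and "XX000002" compare equal.
            excluded.add(line.split(".")[0])
    return excluded


def filter_accessions(accessions, excluded):
    """Keep only the accessions that are not on the exclusion list."""
    return [acc for acc in accessions if acc.split(".")[0] not in excluded]


# Invented accessions, for illustration only.
exclusion_text = """# incomplete genomes
XX000002
XX000004.1
"""
genomes = ["XX000001.1", "XX000002.1", "XX000003.2", "XX000004.1"]

print(filter_accessions(genomes, load_exclusions(exclusion_text)))
```

Comparing accessions without the version suffix is a design choice in this sketch, so that a genome stays excluded even if the record is later updated.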

PHAGE ANNOTATION WITH PHROGS

Recently, PHROGs was released by Terzian et al. (https://doi.org/10.1093/nargab/lqab067). Full details are provided on their webpages and in the publication. Briefly, their curated dataset provides tens of thousands of PHROGs, with a standardised annotation attributed to each PHROG. All of this is available through their searchable website and can also be downloaded.

For first-pass phage genome annotation this seems like a great resource. We routinely use Prokka for annotation of phage genomes, which allows custom HMM databases to be used for annotation. Unfortunately, because of differences in formats, the HMMs provided directly by the PHROGs team don't slot neatly into Prokka in a way that lets the annotation linked to each PHROG appear in the final annotation.

However, they provide all their data in an easily downloadable form, so we have taken it and reformatted it to produce HMMs with the annotations included, so that they play nicely with HMMER3 as part of Prokka. We have produced a single file that can be put in the /opt/prokka/db/hmm directory of Prokka. Thanks to Thomas Sicheritz-Pontén for helping to get the correct annotation into the 38,000 HMMs …

A single file containing all the HMMs, which can be added directly to Prokka, can be downloaded here. Warning: it is 3 GB when unzipped. Thanks to Terzian et al., who did all the hard work of producing the original PHROGs and curated annotation and making them available; we have just reformatted them for our own use and for anybody else who might want to use them with Prokka.
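To give a feel for the kind of reformatting involved, here is a minimal Python sketch that writes an annotation into the DESC line of each model in a HMMER3 text file. It is not the script used to build our file. It assumes the annotation is carried in the DESC field and placed just before the LENG line, and the function name, model name and product shown are invented for the example.

```python
def add_descriptions(hmm_text, annotations):
    """Set the DESC line of each HMMER3 model whose NAME is a key in annotations."""
    out = []
    name = None
    for line in hmm_text.splitlines():
        if line.startswith("NAME "):
            name = line.split(None, 1)[1].strip()
        elif line.startswith("DESC ") and name in annotations:
            continue  # drop the old description; a new one is written below
        elif line.startswith("LENG ") and name in annotations:
            # Write DESC just before LENG, keeping the usual header order.
            out.append("DESC  " + annotations[name])
        elif line.strip() == "//":
            name = None  # end of this model
        out.append(line)
    return "\n".join(out) + "\n"


# Truncated stub of a HMMER3 header (not a complete model), for illustration only.
stub = (
    "HMMER3/f [3.1b2 | February 2015]\n"
    "NAME  model_1\n"
    "LENG  120\n"
    "ALPH  amino\n"
    "//\n"
)

print(add_descriptions(stub, {"model_1": "terminase large subunit"}))
```

Models whose name is not in the annotation table are passed through unchanged, so an incomplete table cannot wipe out descriptions that are already there.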

To get it running within Prokka, first locate the Prokka installation:

$ prokka --listdb

In my case this results in output of /usr/local/bioinf/prokka/db

and [08:43:23] * HMMs: all_VOG HAMAP

telling us there are already some HMM databases called all_VOG and HAMAP.

Within /usr/local/bioinf/prokka/db is a directory called hmm

Thus, the full path is /usr/local/bioinf/prokka/db/hmm

The downloaded database needs to be copied into /usr/local/bioinf/prokka/db/hmm

Then run $ prokka --setupdb

Running the command $ prokka --listdb again now gives:

[08:43:23] * HMMs: all_phrogs all_VOG HAMAP

all_phrogs will now be used by Prokka. If you only want to use the PHROGs database, consider using the Prokka flag --hmms and specifying /usr/local/bioinf/prokka/db/hmm/all_phrogs

Full details on adding databases are explained on the Prokka GitHub page.
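If you are setting this up on several machines, it can be handy to check from a script that the new database has been picked up. The sketch below simply parses text in the same form as the --listdb line shown above; the function name is our own invention, and capturing the output of prokka itself is left to you.

```python
import re


def hmm_databases(listdb_output):
    """Return the HMM database names found in `prokka --listdb` style output."""
    for line in listdb_output.splitlines():
        match = re.search(r"\* HMMs:\s*(.*)$", line)
        if match:
            return match.group(1).split()
    return []


# The log line shown above, after copying in the database and running --setupdb.
log = "[08:43:23] * HMMs: all_phrogs all_VOG HAMAP"

print("all_phrogs" in hmm_databases(log))
```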

Bias in phage genomes

What started some time in 2019 as a search for a number to put into the introduction of a paper has ended up, a few years later, as what is hopefully a useful paper. That number was how many complete phage genomes are currently available in public databases. At the time, NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/), which contains some of this information, had not been released. Nathan Brown and I wrote a quick script that used the esearch/efetch facilities to extract phage genomes, then applied several filtering steps, with lots of manual filtering, to extract “complete” phage genomes. We started providing this data on the website for download. After requests from people about how to cite this list, and some reminding from Branko Rihtman, we have finally got to a pre-print. Ryan Cook has tidied up the code a lot and parsed lots of information that can be extracted from the GenBank files.

In extracting this information we found many things:

There is a big bias in the hosts that phages are isolated on: most phages are isolated on a small number of host bacteria.

There are far more lytic phage genomes than temperate ones, with most temperate phage genomes coming from an even smaller number of hosts.

The number of putative antibiotic resistance genes differs between lytic and temperate phages, and between hosts.

Jumbo phages are not always rare; again, this is dependent on the host.

Even for hosts where large numbers of phages have been isolated, we are a long way from sampling the predicted number of phage species.

All the data can be accessed via GitHub: https://github.com/RyanCook94/

The paper is available on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.05.01.442102v1.article-metrics