Quickly listing phages in Genbank

Building on the command we use to search GenBank for phages for the INPHARED database, I recently came across some further useful E-utilities options. Writing them here for when I forget.

To list the most recent genomes, from the last 50 days for instance, use the -days flag:

esearch -db nucleotide -query "gbdiv_PHG[prop]" -days 50 | efilter -query "40000:800000 [SLEN]" -sort "Date" | esummary | xtract -pattern DocumentSummary -element AccessionVersion -element CreateDate -element UpdateDate -element Title -element TaxId

esummary and xtract are then used to pull out specific fields, in this case the accession, creation date, update date, title and TaxID, to produce an easily parsable tab-delimited list:

OR420333.1      2023/10/01      2023/10/01      UNVERIFIED: Macromonas phage BK-30P, complete genome
OR473000.1      2023/10/01      2023/10/01      Synechococcus phage S-CREM2, complete genome
OR500351.1      2023/10/01      2023/10/01      Xanthomonas phage Murka, complete genome
OR257569.1      2023/10/01      2023/10/01      Ruegeria phage RpAliso, complete genome
OR338916.1      2023/10/01      2023/10/01      Bacillus phage DZ1, complete genome
OR339795.1      2023/10/01      2023/10/01      Escherichia phage phi456, complete genome
OR296290.1      2023/10/01      2023/10/01      Escherichia phage SDYTW1-F1-2-2_3, complete genome
OR157981.1      2023/10/01      2023/10/01      Staphylococcus phage BUCT_X001, complete genome
OQ326580.1      2023/10/01      2023/10/01      UNVERIFIED_ORG: Phage vB_PaeM_Sem1, complete genome
OR148986.1      2023/10/01      2023/10/01      Caudovirus D_HF5_2C, complete genome
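
Because the list is tab-delimited, it is easy to post-process with standard tools. As a minimal sketch (assuming the title is the fourth tab-separated column, as in the listing above), the helper below drops records flagged UNVERIFIED and keeps only the accession. The function name verified_accessions is made up for illustration; the rows are copied from the output above.

```shell
# Keep only accessions whose title (column 4) does not start with UNVERIFIED.
# Assumes tab-separated input: accession, create date, update date, title.
verified_accessions() {
  awk -F '\t' '$4 !~ /^UNVERIFIED/ {print $1}'
}

# A few rows copied from the listing above, rebuilt as tab-separated lines.
printf '%s\t%s\t%s\t%s\n' \
  'OR420333.1' '2023/10/01' '2023/10/01' 'UNVERIFIED: Macromonas phage BK-30P, complete genome' \
  'OR473000.1' '2023/10/01' '2023/10/01' 'Synechococcus phage S-CREM2, complete genome' \
  'OQ326580.1' '2023/10/01' '2023/10/01' 'UNVERIFIED_ORG: Phage vB_PaeM_Sem1, complete genome' \
  'OR148986.1' '2023/10/01' '2023/10/01' 'Caudovirus D_HF5_2C, complete genome' |
  verified_accessions
```

In practice the same function can simply be added to the end of the esearch pipeline.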

Pointless Accessions in the VMR

More ranting to myself than anything, but others might find it useful.

The “VMR” is described as

“The current Virus Metadata Resource (VMR) that provides a list of all exemplar viruses can be downloaded from the link below.”

Most of it does, and it is really useful … apart from the accessions that point not to phages but to bacterial genomes, without warning or indication:

“Bartogtaviriformidae Bartonegtaviriform Bartonegtaviriform andersoni E Bartonella gene transfer agent BaGTA BX897699 Partial genome dsDNA bacteria “

with accession BX897699 given as the genome of a gene transfer agent (GTA) in Bartonella … why GTAs are classified as viruses is slightly confusing (but hey-ho, an argument for another day).

But all of these accessions point to complete bacterial genomes:

BX897699.1 Bartonella henselae strain Houston-1, complete genome: 1931047bp

CP001357.1 Brachyspira hyodysenteriae WA1, complete genome: 3000694bp

CP001312.1 Rhodobacter capsulatus SB 1003, complete genome: 3738958bp

CP000830.1 Dinoroseobacter shibae DFL 12, complete genome: 3789584bp

CP000031.2 Ruegeria pomeroyi DSS-3, complete genome: 4109437bp

CP015418.1 Rhodovulum sulfidophilum DSM 1374, complete genome: 4132586bp

If you are trying to use the VMR to automatically extract all representative genomes of viruses that infect bacteria … then you need to check that each accession is actually the bacteriophage (or GTA!) and not the entire bacterial genome.
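
One crude way to catch these is to flag anything implausibly long for a phage. The sketch below assumes you have already collected accession and sequence length as two tab-separated columns; the 800000 bp cut-off is just borrowed from the SLEN filter in the GenBank search in the first post and is an assumption, not a rule. The function name flag_long_records and the EXAMPLE_PHAGE row are hypothetical.

```shell
# Print records whose length (column 2) exceeds a maximum plausible phage size.
# Usage: flag_long_records MAX_LENGTH < accession_length.tsv
flag_long_records() {
  max_len=$1
  awk -F '\t' -v max="$max_len" \
    '$2 + 0 > max + 0 {print $1 "\t" $2 "\tlonger than " max " bp, check by hand"}'
}

# Two accessions and lengths from the list above, plus one made-up phage-sized record.
printf '%s\t%s\n' \
  'BX897699.1' '1931047' \
  'CP001357.1' '3000694' \
  'EXAMPLE_PHAGE' '45000' |
  flag_long_records 800000
```

Anything flagged still needs a look by eye: a length cut-off says nothing about what the record actually is.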

ICTV Bacteriophage Genera

With the influx of phage genomic data, there have been several changes to bacteriophage taxonomy. See the paper “A Roadmap for Genome-Based Phage Taxonomy” by Evelien Adriaenssens and Dann Turner, who have led these efforts through their work with ICTV. As a result, the classical phage families of Podoviridae, Siphoviridae and Myoviridae are: kaput, have shuffled off their mortal coil, given up the ghost, or any other preferred phrase… generally, they are no more. Some will be upset by this, I am sure. But personally, I am more than happy to see them go. While they were useful for a period of time, the ability to rapidly sequence most phage genomes and put phages “in boxes” based on their genomic content, rather than what they looked like, offers so much more resolution to describe differences. For Podoviridae, Siphoviridae and Myoviridae aficionados, the morphotypes of Myovirus, Podovirus and Siphovirus live on.

As a result of updates to phage taxonomy, 100s of new genera have now been created. For our own work, we are interested in rapidly identifying which genus newly isolated phages fall within. Consequently, we have collected the genomes of all (dsDNA) phages classified by ICTV and organised them into directories based on genus. We have also extracted the common marker gene terL from each phage, which can then be used as input for alignments and phylogenies to determine how related a new phage may be. Additionally, we have run VIRIDIC on each genus.

We are hoping to have an automated process available that will rapidly identify whether a new phage genome falls within a known phage genus, based on current ICTV guidelines. As others are probably comparing phage genomes to known taxa, and the data may well be useful to them, we have made it all downloadable as a single file called ICTV_genera.tar.gz, which can be downloaded here.

It contains 1653 directories (Genera)

Each directory contains:

  • *gff files of every ICTV classified phage genome of that genus.
  • *fsa individual fasta files of every ICTV classified phage genome of that genus.
  • *_terL.ffn – automated extract of the terL gene of every ICTV classified phage genome of that genus.
  • 04_VIRIDIC_out folder
    • *pdf heatmap for that genus
    • *clusters.csv file from VIRIDIC for that genus
    • *MA_genCol.csv from VIRIDIC for that genus
  • If a genus has only 1 species, we don’t run VIRIDIC, for obvious reasons
  • If the directory is empty, it is a genus of an RNA phage (still sorting this; see above about dsDNA)

To get the data use :

wget http://warwick.s3.climb.ac.uk/inphared/ICTV_genera.tar.gz 
tar -xvf ICTV_genera.tar.gz 
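
Once unpacked, a quick sanity check is to count the terL extracts in each genus directory. This is only a sketch: it assumes one sub-directory per genus directly under a top-level folder (the name ICTV_genera is a guess based on the archive name) and relies on the *_terL.ffn naming described above. The function name count_terl_per_genus is made up for illustration.

```shell
# Print "genus<TAB>number of *_terL.ffn files" for every sub-directory of a root folder.
# Assumes (not confirmed) one sub-directory per genus directly under the root.
count_terl_per_genus() {
  root=$1
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue            # nothing matched, or not a directory
    genus=$(basename "$dir")
    n=$(find "$dir" -maxdepth 1 -name '*_terL.ffn' | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$genus" "$n"
  done
}

# Hypothetical top-level directory name; adjust to whatever the archive unpacks to.
count_terl_per_genus ICTV_genera
```

Genera reporting 0 are either the empty RNA phage directories or ones where terL could not be extracted automatically.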

If you find this useful, consider citing Cook, et al 2021. INfrastructure for a PHAge REference Database: Identification of Large-Scale Biases in the Current Collection of Cultured Phage Genomes. PHAGE. https://doi.org/10.1089/phage.2021.0007

Tinkering with size selection for nanopore sequencing of viromes

Ryan has recently been testing the Short Read Eliminator (SRE) kit from Circulomics to enrich for long reads for nanopore sequencing of viromes. The input for the virome samples was various liquids produced or excreted by cows … often smelly and sticky, and generally not much fun to extract the viral fraction from.

Given the nature of the samples, getting large amounts of DNA, let alone HMW DNA, from the small sample volumes is not possible, so we resorted to MDA amplification to produce enough DNA for nanopore sequencing. To enrich for longer reads, Ryan compared multiple samples with and without enrichment using the short read eliminator kit.

The results look encouraging.

Duplicates of the same library were treated with the short read eliminator kit and compared to untreated samples.

It shifts the median read length from 1.9 kb to 6.9 kb, which still aren’t huge reads. But given the average size of phage genomes and the low input issues, it might make a difference to the final assembly. Initial assemblies of a small number of samples look encouraging, with ~1000 complete phage genomes as predicted by CheckV. This is in line with the number of genomes we have previously assembled from a single seawater sample, where we found a 650 kb phage genome.
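
For anyone wanting to check the same statistic on their own runs, the median read length can be pulled straight from a FASTQ file. This is a minimal sketch, not the tool we used to produce the numbers above: it assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name median_read_length is made up.

```shell
# Print the median read length of a four-line-per-record FASTQ stream on stdin.
median_read_length() {
  awk 'NR % 4 == 2 {print length($0)}' |   # sequence is line 2 of every record
    sort -n |
    awk '{a[NR] = $1}
         END {
           if (NR == 0) exit 1              # no reads seen
           if (NR % 2) print a[(NR + 1) / 2]
           else print (a[NR / 2] + a[NR / 2 + 1]) / 2
         }'
}

# Usage (file names are hypothetical):
#   median_read_length < reads.fastq
#   zcat reads.fastq.gz | median_read_length
```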

We still have to determine whether the SRE kit might exclude some phages that have small genomes, as we have previously observed differences in the population of smaller phage genomes when comparing nanopore and Illumina sequencing (Cook et al 2021 Microbiome).

*As I found this recently, here is Ryan talking about some of his previous work on PromethION sequencing of viromes

INPHARED re-annotated with PHROGs

The recent PHROGs database from Terzian et al is a great resource for phage annotation. Previously we re-formatted this database into HMMs that are suitable for use within Prokka (read about it HERE and download the HMMs for yourself HERE).

Ryan has added this resource to our INPHARED dataset to re-annotate the genomes of all cultured phages that we can identify in GenBank. The updated GenomesDB folder of INPHARED can be downloaded from here (warning: it’s a big tar file), with > 19,000 genomes now annotated in a consistent manner. Because of the standardised annotation provided by the PHROGs team, we have found the PHROGs annotation really useful for finding homologues by string searching on annotations.

These annotations are fully automated, so for those who have spent 100s of hours annotating one phage, they are most likely not “better” annotations. But they are entirely consistent across all the phages we have re-annotated, which is what matters for the analyses we are interested in doing. Ryan has more specific details on how to update the database on his github page. The PHROGs team provide a brilliant interactive site to explore all the PHROGs they annotated here.

Removal of incomplete phage genomes

Thanks to Evelien, who identified several hundred incomplete phage genomes in the database, these have been removed and added to the exclusion list. Full details of those excluded are on the github page, where you can also add accessions of other phages that you might spot; these will be excluded in versions going forward.

Phage annotation with PHROGs

Recently PHROGs was released by Terzian et al (https://doi.org/10.1093/nargab/lqab067 ). Full details are provided on their webpages and in the publication. Briefly, their curated dataset provides tens of thousands of PHROGs, with a standardised annotation attributed to each PHROG. All of this is available through their searchable website and can also be downloaded.

For first-pass phage genome annotation this seems like a great resource. We routinely use Prokka for annotation of phage genomes, and it allows custom HMM databases to be used for annotation. Unfortunately, because of differences in format, the HMMs provided directly by the PHROGs team don’t slot neatly into Prokka in a way that lets the annotation linked to each PHROG appear in the final annotation.

However, they provide all their data in an easily downloadable form, so we have taken it and reformatted it to produce HMMs with the annotations included, so that it plays nicely with HMMER3 as part of Prokka. We have produced a single file that can be put in the db/hmm directory of Prokka (e.g. /opt/prokka/db/hmm). Thanks to Thomas Sicheritz-Pontén for helping to get the correct annotation into the 38,000 HMMs …

A single file containing all the HMMs, which can be added directly to Prokka, can be downloaded here. Warning: it’s 3 GB when unzipped. Thanks to Terzian et al, who did all the hard work of producing the original PHROGs and curated annotation and making them available; we have just reformatted them for our own use and for anybody else who might want to use them with Prokka.

To get it running within Prokka, first locate the Prokka installation:

$ prokka --listdb

In my case the output includes the database directory /usr/local/bioinf/prokka/db

and [08:43:23] * HMMs: all_VOG HAMAP

telling us there are already HMM databases called all_VOG and HAMAP.

Within /usr/local/bioinf/prokka/db is a directory called hmm

Thus, the full path is /usr/local/bioinf/prokka/db/hmm

The downloaded database needs to be copied into /usr/local/bioinf/prokka/db/hmm

Then run prokka --setupdb

Running prokka --listdb again now gives

[08:43:23] * HMMs: all_phrogs all_VOG HAMAP

all_phrogs will now be used by Prokka. If you only want to use the PHROGs database, consider using the Prokka flag --hmms and specifying /usr/local/bioinf/prokka/db/hmm/all_phrogs

Full details on adding databases are explained on the Prokka github page
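
Put together, the steps above can be sketched as a small shell function. This is an illustration rather than a tested installer: the function name install_prokka_hmm is made up, the file name all_phrogs.hmm in the usage line is an assumption based on the all_phrogs entry shown by --listdb, and the db path is the one from my machine.

```shell
# Copy a downloaded HMM file into Prokka's db/hmm directory, then re-index.
# $1 = path to the HMM file, $2 = Prokka db directory (as reported by prokka --listdb)
install_prokka_hmm() {
  hmm_file=$1
  db_dir=$2
  [ -f "$hmm_file" ] || { echo "no such file: $hmm_file" >&2; return 1; }
  [ -d "$db_dir/hmm" ] || { echo "no hmm directory under: $db_dir" >&2; return 1; }
  cp "$hmm_file" "$db_dir/hmm/" || return 1
  # Only re-index when prokka is actually on the PATH.
  if command -v prokka >/dev/null 2>&1; then
    prokka --setupdb && prokka --listdb
  else
    echo "prokka not found on PATH; run 'prokka --setupdb' by hand" >&2
  fi
}

# Usage (file name assumed, path from the example above):
#   install_prokka_hmm all_phrogs.hmm /usr/local/bioinf/prokka/db
```

You will probably need write permission on the Prokka db directory for the copy to work.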

Bias in phage genomes

What started sometime in 2019 as a search for a number to put into the introduction of a paper has ended up, a few years later, as a hopefully useful paper. That number was how many complete phage genomes are currently available in public databases. At the time, NCBI Virus (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/), which contains some of this information, had not been released. Nathan Brown and I wrote a quick script that used the esearch/efetch facilities to extract phage genomes, then applied several filtering steps, with lots of manual filtering, to extract “complete” phage genomes. We started providing this data on the website for download. After requests from people about how to cite this list, and some reminding from Branko Rihtman, we have finally got to a pre-print. Ryan Cook has tidied up the code a lot and parsed lots of information that can be extracted from the GenBank files.

In extracting this information we found many things:

There is a big bias in the hosts that phages are isolated on: most phages are isolated on a small number of host bacteria.

There are far more lytic phage genomes than temperate ones, with most temperate phage genomes coming from an even smaller number of hosts.

The number of putative antibiotic resistance genes differs between lytic and temperate phages, and between hosts.

Jumbo phages are not always rare; again, this depends on the host.

Even for hosts where large numbers of phages have been isolated, we are a long way from sampling the predicted number of phage species.

All the data can be accessed via github: https://github.com/RyanCook94/

And the paper is on bioRxiv: https://www.biorxiv.org/content/10.1101/2021.05.01.442102v1.article-metrics

Adding More Reference Genomes to vConTACT2 Clusters

The virus clustering programme vConTACT2 is a fantastic tool for applying taxonomy to large sets of viral contigs. In short, it clusters unknown viruses with those in the RefSeq database based on shared protein clusters.

To provide even more context to viral clusters though, you may wish to include more reference genomes than those in RefSeq.

To supplement the RefSeq genomes, I took all of the phage genomes on MillardLab, and removed the RefSeq genomes (to avoid duplication). The remaining genomes were processed through dedupe.sh at 95% minimum ID to remove highly similar sequences. This led to a custom subset of 7,527 genomes.

Genes were called on the 7,527 genomes using Prodigal. From this, .faa and .csv mapping files were produced so the reference genomes could be used to supplement vConTACT2 clustering.

Click HERE for the mapping (.csv) file.
Click HERE for the sequence (.faa) file.

Furthermore, a list of these genomes can be obtained from the mapping file using the following command (potentially useful when visualising the resultant network):

awk -F ',' '{print $2}' database.csv | sort | uniq
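
A small variation on the same idea also counts how many proteins map to each genome, which is handy for spotting genomes with suspiciously few called genes. This sketch assumes the mapping file has a header row and the genome (contig) name in the second column; the function name proteins_per_genome is made up for illustration.

```shell
# Print "genome<TAB>number of proteins" from a comma-separated mapping file.
# Assumes a header line and the genome/contig name in column 2.
proteins_per_genome() {
  awk -F ',' 'NR > 1 {n[$2]++} END {for (g in n) printf "%s\t%d\n", g, n[g]}' "$@" | sort
}

# Usage:
#   proteins_per_genome database.csv
```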

Happy clustering!

Updating the DIAMOND database file for ViromeQC

The new virome quality control software, ViromeQC, determines the viral enrichment of sequenced viromes. In short, FASTQ reads are aligned to ribosomal sequences using Bowtie2 and to bacterial signature sequences using DIAMOND. These markers of bacterial contamination are used to estimate viral enrichment.

The pipeline was built using DIAMOND v.0.9.9. At the time of writing, the latest version of DIAMOND is v.0.9.29. Somewhere between these two versions, the format of DIAMOND databases changed. Therefore, if you have the latest version of DIAMOND, the pipeline will not run properly and you may see this error:

Error: Database was built with an older version of Diamond and is incompatible.

The issue is with the database:

viromeqc/index/amphora_bacteria.dmnd

To overcome this, I installed DIAMOND v.0.9.9, extracted the sequences from the database, and produced a new database using DIAMOND v.0.9.29 as follows:

/v.0.9.9/diamond getseq -d amphora_bacteria.dmnd | /v.0.9.29/diamond makedb -d new_db.dmnd
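
If you would rather do the rebuild and the swap in one go, the same pipe can be wrapped in a small function that keeps the old database as a backup. This is a sketch under assumptions: the function name rebuild_dmnd and the .rebuilt.dmnd and .old file names are made up, and the paths to the two DIAMOND binaries are whatever they are on your system.

```shell
# Rebuild a DIAMOND database with a newer binary and swap it into place.
# $1 = old diamond binary, $2 = new diamond binary, $3 = path to the .dmnd file
rebuild_dmnd() {
  old_diamond=$1
  new_diamond=$2
  db=$3
  rebuilt="${db%.dmnd}.rebuilt.dmnd"
  # Dump sequences with the old version, rebuild with the new one (makedb reads stdin).
  "$old_diamond" getseq -d "$db" | "$new_diamond" makedb -d "$rebuilt" || return 1
  # Keep the original alongside as a backup, then move the rebuilt file into place.
  mv "$db" "$db.old" && mv "$rebuilt" "$db"
}

# Usage (binary paths as in the command above):
#   rebuild_dmnd /v.0.9.9/diamond /v.0.9.29/diamond viromeqc/index/amphora_bacteria.dmnd
```

Note that only the exit status of the makedb step is checked, so it is worth confirming the new file is a sensible size before deleting the backup.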

The new version of the database can be downloaded here:

http://s3.climb.ac.uk/ADM_share/crap/amphora_bacteria.dmnd

Replace the old database with the new one and ViromeQC should run beautifully.