INPHARED re-annotated with PHROGs

The recent PHROGs database from Terzian et al. is a great resource for phage annotation. We previously reformatted this database into HMMs suitable for use within Prokka (read about it HERE and download the HMMs for yourself HERE).

Ryan has added this resource to our INPHARED dataset to re-annotate the genomes of all cultured phages that we can identify in GenBank. The updated GenomesDB folder of INPHARED can be downloaded from here (warning: it’s a big tar file), with more than 19,000 genomes now annotated in a consistent manner. Because the PHROGs team provide standardised annotations, we have found them really useful for finding homologues by string searching on the annotation text.
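As a rough illustration of that kind of string search, here is a minimal sketch. It assumes the unpacked GenomesDB folder contains GenBank-format files ending in .gbk; the function name, the folder layout and the file extension are assumptions for the example, not something confirmed by the dataset itself.

```shell
# Hypothetical helper: list GenBank files under a directory whose /product
# qualifiers mention a search term (case-insensitive).
# Assumes files end in .gbk; adjust --include if the dataset differs.
search_products() {
    term="$1"   # e.g. "terminase large subunit"
    dir="$2"    # e.g. the unpacked GenomesDB folder (layout assumed)
    grep -r -i -l --include='*.gbk' -e "/product=\"[^\"]*${term}" "$dir" | sort
}

# Example call (paths are placeholders):
# search_products "terminase large subunit" GenomesDB/
```

Long product names can wrap onto a second line in GenBank files, so a multi-word term may be missed by a line-based search like this one; searching on a single distinctive word is the safer option.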

These annotations are fully automated, so for anyone who has spent hundreds of hours annotating a single phage, they are most likely not “better” annotations. But they are entirely consistent across all the phages we have re-annotated, which is what matters for the analyses we are interested in. Ryan has more specific details on how to update the database on his GitHub page. The PHROGs team provide a brilliant interactive site to explore all the PHROGs they annotated here.

Removal of incomplete phage genomes

Thanks to Evelien, who has identified several hundred incomplete phages in the database. These have been removed and added to the exclusion list. Full details of those excluded are on the GitHub page, and you can add accessions of other incomplete phages that you spot here; these will be excluded from future versions.

Adding More Reference Genomes to vConTACT2 Clusters

The virus clustering programme vConTACT2 is a fantastic tool for applying taxonomy to large sets of viral contigs. In short, it clusters unknown viruses with those in the RefSeq database based on shared protein clusters.

To provide even more context to viral clusters though, you may wish to include more reference genomes than those in RefSeq.

To supplement the RefSeq genomes, I took all of the phage genomes on MillardLab and removed the RefSeq genomes (to avoid duplication). The remaining genomes were processed through dedupe.sh with a 95% minimum identity threshold to remove highly similar sequences. This led to a custom subset of 7,527 genomes.
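The RefSeq-removal step could be sketched as follows. This is an illustration rather than the exact commands used: it assumes two plain-text lists of accessions, one per line (the file names are hypothetical), and the dedupe.sh invocation in the comment is an assumed form using BBTools' minidentity option.

```shell
# Hypothetical helper: print every accession in the first list that does
# not appear in the second list (whole-line, fixed-string matching).
exclude_accessions() {
    all_ids="$1"      # e.g. all_phage_ids.txt (assumed file name)
    refseq_ids="$2"   # e.g. refseq_ids.txt (assumed file name)
    grep -v -x -F -f "$refseq_ids" "$all_ids"
}

# Example call (file names are placeholders):
# exclude_accessions all_phage_ids.txt refseq_ids.txt > non_refseq_ids.txt
#
# The sequences for the remaining accessions could then be deduplicated,
# for example (assumed invocation, not copied from the original analysis):
# dedupe.sh in=non_refseq.fasta out=deduped.fasta minidentity=95
```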

Genes were called on the 7,527 genomes using Prodigal. From this, .faa and .csv mapping files were produced so the reference genomes could be used to supplement vConTACT2 clustering.
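For anyone wanting to build similar files from their own genomes, here is a minimal sketch of deriving a mapping file from a Prodigal protein file. It assumes Prodigal's default protein headers, where each protein is named after its contig plus an underscore and a gene number, and it uses the protein_id,contig_id,keywords layout of a vConTACT2 gene-to-genome file. The function name, the file names and the "None" keyword placeholder are assumptions for the example.

```shell
# Gene calling (assumed invocation; -p meta is a choice made here for
# mixed or short sequences, not necessarily the mode used for the download):
# prodigal -i genomes.fna -a proteins.faa -p meta

# Hypothetical helper: turn Prodigal .faa headers into a mapping CSV.
faa_to_mapping() {
    awk 'BEGIN { print "protein_id,contig_id,keywords" }
         /^>/ {
             id = substr($1, 2)           # protein ID without the leading ">"
             contig = id
             sub(/_[0-9]+$/, "", contig)  # strip the trailing _<gene number>
             print id "," contig ",None"  # "None" is a placeholder keyword
         }' "$1"
}

# Example call (file names are placeholders):
# faa_to_mapping proteins.faa > database.csv
```

If the header row is kept, a listing of the second column will also include the literal string contig_id.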

Click HERE for the mapping (.csv) file.
Click HERE for the sequence (.faa) file.

Furthermore, a list of these genomes can be obtained from the mapping file using the following command (potentially useful when visualising the resultant network):

awk -F ',' '{print $2}' database.csv | sort | uniq

Happy clustering!

Updating the DIAMOND database file for ViromeQC

The new virome quality control software, ViromeQC, determines viral enrichment of sequenced viromes. In short, FASTQ reads are aligned to ribosomal sequences using Bowtie2 and to bacterial signature sequences using DIAMOND. These markers of bacterial contamination are used to estimate viral enrichment.

The pipeline was built using DIAMOND v.0.9.9. At the time of writing, the latest version of DIAMOND is v.0.9.29. Somewhere between these two versions, the format of DIAMOND databases changed. Therefore, if you have the latest version of DIAMOND, the pipeline will not run properly and you may see this error:

Error: Database was built with an older version of Diamond and is incompatible.

The issue is with the database:

viromeqc/index/amphora_bacteria.dmnd

To overcome this, I installed DIAMOND v.0.9.9, extracted the sequences from the database, and produced a new database using DIAMOND v.0.9.29 as follows:

/v.0.9.9/diamond getseq -d amphora_bacteria.dmnd | /v.0.9.29/diamond makedb -d new_db.dmnd

The new version of the database can be downloaded here:

http://s3.climb.ac.uk/ADM_share/crap/amphora_bacteria.dmnd

Replace the old database with the new one and ViromeQC should run beautifully.
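If you would rather keep a copy of the original file while swapping it out, something along these lines should work. It is only a sketch: the function name and the .bak suffix are made up for the example, and the index directory is whatever path your ViromeQC installation uses.

```shell
# Hypothetical helper: back up the old DIAMOND database, then copy the
# new one into place under the file name ViromeQC expects.
replace_db() {
    new_db="$1"      # path to the downloaded or rebuilt .dmnd file
    index_dir="$2"   # e.g. viromeqc/index
    old_db="$index_dir/amphora_bacteria.dmnd"

    # Refuse to touch anything if the new file is missing.
    [ -f "$new_db" ] || { echo "missing: $new_db" >&2; return 1; }

    # Keep the old database so the swap can be undone.
    if [ -f "$old_db" ]; then
        mv "$old_db" "$old_db.bak"
    fi
    cp "$new_db" "$old_db"
}

# Example call (paths are placeholders):
# replace_db ~/Downloads/amphora_bacteria.dmnd viromeqc/index
```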