All posts by Andy


PHROGs was recently released by Terzian et al. Full details are provided on their webpages and in their publication. Briefly, their curated dataset provides tens of thousands of PHROGs, with a standardised annotation attributed to each PHROG. All of this is available through their searchable website and can also be downloaded.

For first-pass phage genome annotation this seems like a great resource. We routinely use Prokka to annotate phage genomes, and Prokka allows custom HMM databases to be used for annotation. Unfortunately, because of differences in format, the HMMs provided directly by the PHROGs team don't slot neatly into Prokka in a way that lets the annotation linked to each PHROG appear in the final annotation.

However, they provide all their data in an easily downloadable form, so we have taken it and reformatted it to produce HMMs with the annotations included, so that they play nicely with HMMER3 as part of Prokka. We have produced a single file that can be put in the /opt/prokka/db/hmm directory of Prokka. Thanks to Thomas Sicheritz-Pontén for helping with getting the correct annotation into the 38,000 HMMs.

A single file containing all the HMMs, which can be added directly to Prokka, can be downloaded here. Warning: it is 3 Gb when unzipped. Thanks to Terzian et al., who did all the hard work of producing the original PHROGs and curated annotation and making them available; we have just reformatted it for our own use and for anybody else who might want to use it with Prokka.

To get it running within Prokka, first locate the Prokka installation:

$ prokka --listdb

In my case this results in output of /usr/local/bioinf/prokka/db

and [08:43:23] * HMMs: all_VOG HAMAP

telling us there are already HMM databases called all_VOG and HAMAP.

Within /usr/local/bioinf/prokka/db is a directory called hmm

Thus, the full path is /usr/local/bioinf/prokka/db/hmm

The downloaded database needs to be copied into /usr/local/bioinf/prokka/db/hmm

Then run $ prokka --setupdb

Running the command $ prokka --listdb should now give:

[08:43:23] * HMMs: all_phrogs all_VOG HAMAP

all_phrogs will now be used by Prokka. If you only want to use the PHROGs database, consider using the Prokka flag --hmms and specifying /usr/local/bioinf/prokka/db/hmm/all_phrogs
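A quick way to confirm that the new database has been picked up is to look for it on the HMMs line of the --listdb output. The snippet below is only a minimal Python sketch of that check. The function name listed_hmm_dbs is made up for this example, and it assumes the log line keeps the format shown above.

```python
import re

def listed_hmm_dbs(listdb_output):
    """Return the HMM database names found in `prokka --listdb` style output.

    Assumes a line such as '[08:43:23] * HMMs: all_phrogs all_VOG HAMAP',
    as shown in this post; other Prokka versions may format it differently.
    """
    for line in listdb_output.splitlines():
        match = re.search(r"\*\s*HMMs:\s*(.*)$", line)
        if match:
            return match.group(1).split()
    return []

# Example using the line from this post
example = "[08:43:23] * HMMs: all_phrogs all_VOG HAMAP"
print("all_phrogs" in listed_hmm_dbs(example))
```

To run it against a real installation you would need to capture the output of prokka --listdb yourself. Prokka appears to write its log messages to stderr, so it is worth checking both output streams.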

Full details on adding databases are explained on the Prokka GitHub page

Bias in phage genomes

What started off sometime in 2019 as a search for a number to put into the introduction of a paper has ended up, a few years later, as what is hopefully a useful paper. That number was how many complete phage genomes are currently available in public databases. At the time, NCBI Virus, which contains some of this information, had not been released. Nathan Brown and I wrote a quick script that used the esearch/efetch facilities to extract phage genomes, then applied several filtering steps, with lots of manual filtering, to extract “complete” phage genomes. We started providing this data on the website for download. After requests from people asking how to cite this list, and some reminding from Branko Rihtman, we have finally got to a pre-print. Ryan Cook has tidied up the code a lot and parsed lots of information that can be extracted from the GenBank files.

In extracting this information we found many things:

There is a big bias in the hosts that phages are isolated on: most phages are isolated on a small number of host bacteria.

There are far more lytic phage genomes than temperate ones, with most temperate phage genomes coming from an even smaller number of hosts.

The number of putative antibiotic resistance genes differs between lytic and temperate phages, and between hosts.

Jumbo phages are not always rare; again, this is dependent on the host.

Even for hosts where large numbers of phages have been isolated, we are a long way from sampling the number of predicted phage species.

All the data can be accessed via GitHub

And the paper on

All v all comparison of coliphages

Having recently sequenced several coliphages, we wanted to compare them to all other coliphages. To do this, we downloaded all complete (or near complete) bacteriophage genomes [see here]. We then filtered these genomes based on their GenBank description to pull out all phages that have Escherichia, E.coli or coliphage in their description. Having done this, we used MASH for an all-v-all comparison to construct a matrix of similarity, then visualised this using heatmaply.
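The description filter can be sketched in a few lines of Python. This is only an illustration of the idea, not the script we actually used: the regular expression is an assumption about how the three keywords might be matched, and the accessions and descriptions below are made-up placeholders.

```python
import re

# Keywords are the ones named in the post; the regex is an assumed way of
# matching them (case-insensitive, allowing "E.coli" or "E. coli").
COLIPHAGE_PATTERN = re.compile(r"escherichia|e\.\s?coli|coliphage", re.IGNORECASE)

def is_coliphage(description):
    """True if a GenBank description mentions Escherichia, E. coli or coliphage."""
    return bool(COLIPHAGE_PATTERN.search(description))

def filter_coliphages(records):
    """Keep (accession, description) pairs whose description looks like a coliphage."""
    return [(acc, desc) for acc, desc in records if is_coliphage(desc)]

# Hypothetical records, for illustration only
records = [
    ("ACC0001", "Escherichia phage example1, complete genome"),
    ("ACC0002", "Pseudomonas phage example2, complete genome"),
    ("ACC0003", "Enterobacteria phage example3 (coliphage), complete genome"),
]
print(filter_coliphages(records))
```

A keyword filter like this will miss coliphages whose description mentions neither the host nor the word coliphage, so the resulting list is best treated as a starting point.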

This can be seen below. An interactive webpage of the image  is available here 

Looking closely at the clusters, it is clear that phages within a genus form discrete clusters, e.g. the top right of the plot is T4virus (and other genera in the Tevenvirinae subfamily).


We have moved ….

The lab has now moved from Warwick Medical School to the Dept of Infection, Immunity, and Inflammation at the University of Leicester. To be more specific, I have moved, with the rest of the lab group still at Warwick.

After 17 years at Warwick, knowing who to speak to and where to find things, it has been an interesting experience starting at Leicester. Not knowing how to get into the building, or where exactly my office/lab is in it, has been a new experience. But it has also been great to meet new colleagues, who have helped me find my way.



Welcome to Branko and Slawek

October has brought the start of a new term and the arrival of two new lab members (OK, it is November before posting). Branko joins the lab as an EPSRC fellow working on AMR; he joins Paul, who has further extended his EPSRC fellowship and will be with us for a few more months.

Slawek joins as a CENTA PhD student and will be looking at the role of the marine virome in maintaining a pool of AMR genes.

*Alex Wilcox has also joined the lab and, like Branko and Paul, he has gained an EPSRC fellowship.

Bacteriophage genome assembly and annotation workshop

We will be running a bacteriophage genome assembly and annotation workshop at WMS on Monday 9th of January.  The course will be run on CLIMB  virtual machines, so please register for an account in advance.


Date: Monday 9th January 2017

Cost: £50

Registration and payment by Monday 5th December* – registration form is open – here

Spaces: 20


Attendees may provide up to four samples of bacteriophage DNA (by 5th Dec 2016) in advance of the workshop, which will be sequenced and the data available for analysis on the 9th of Jan. During the workshop attendees will learn how to quality control their data, assemble bacteriophage genomes, annotate and prepare their genome in a format for submission to EBI.

Registration is on a first come first served basis, so register early. Spaces are limited to 20 people

MRC CLIMB infrastructure will be used for the workshop.

Prior experience:

No prior experience of genome annotation is needed.


Analysis will be run on CLIMB. Users are encouraged to bring their own laptop – a limited number of laptops are available.

A free CLIMB account is also needed. Register here

DNA samples

DNA samples must be received by Monday 5th December for them to be sequenced in time for the course. It is not necessary to send phage samples to attend the course. The workshop is a genome annotation training workshop; sequencing of isolates is a bonus so that attendees can annotate their own genomes.

DNA samples must be sent in a 96-well plate. A minimum of 10 µl of DNA at 10 ng/µl is required. Larger volumes are fine, but must be at a concentration of 10 ng/µl.

DNA must be column purified prior to sending; Zymo DNA purification columns are recommended. Concentration must be determined using a fluorescent detection system (e.g. Qubit), not Nanodrop.

Phage do not have to be CsCl purified prior to DNA extraction. The method we use for extraction can be found here; the DNA is then run through a DNA clean-up column.

Please contact Andrew Millard after registration, prior to sending any samples. DO NOT JUST SEND SAMPLES

  1. Contact prior to sending
  2. Label the plate so that your name can be read on the side of the plate.
  3. Complete the form that will be sent when you make contact
  4. Sample names are to be alphanumeric only.

Comparing all (cultured) bacteriophage genomes

Given the recent increase in the number of bacteriophage genomes sequenced, Nathan (@NathanMB3) has updated the all-v-all comparison with more genomes (~5500 in total). The image is at the bottom of the page.

After reading the recent paper “Mash: fast genome and metagenome distance estimation using MinHash” and meeting Nathan Brown at the University of Leicester, we discussed using MASH for the identification and comparison of phage genomes. The authors of the Genome Biology paper had included viruses in the microbial comparison in Figure 3. Here we focused just on bacteriophage genomes.

For rapid identification of phage genomes we first constructed a database of publicly available phage genomes. This included all phage genomes from the NCBI, which were then filtered to remove eukaryotic viruses. In addition, phage genomes were collected from the website. A sketch was made for each of these phages and the sketches were collated; the resulting MASH database can be downloaded here.

We are using this database to rapidly identify newly sequenced phage isolates. This has worked well with the 100+ novel phages isolated so far and gives very similar results to blastn if there is a similar phage already in the database (it’s just quicker). We have found it to be a good starting point for further comparative genomics.

Using this database we then constructed an all-versus-all comparison of phage genomes. The advantage of MASH is that it allows this to be done extremely rapidly. MASH outputs a text file with a Jaccard distance for each pair of genomes. The Jaccard distance is a measure of dissimilarity between genomes (on a scale of 0 to 1, where 0 is nearly identical and 1 is completely different), which we then plotted on a heatmap comparing all phage genomes. To do that we used the NeatMap package (Rajaram and Oono, BMC Bioinformatics 2011) in R to first arrange the phage genomes along the axes using an nMDS clustering algorithm with the Jaccard distances (Taguchi and Oono, Bioinformatics 2005) and then plot a heatmap. The ordering of the phage genomes and the hierarchical clustering shown on each axis are based on the nMDS results and are not the same as a phylogenetic/genomic tree.
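Before any clustering or plotting, the pairwise text output has to be turned into a square matrix. Our own script for this is in R, so the Python below is only a rough sketch of the idea. It assumes the tab-delimited layout that mash dist writes (reference, query, distance, p-value, shared hashes), and the file names and distances in the example are invented.

```python
def read_mash_dist(lines):
    """Build a symmetric distance matrix from `mash dist` style lines.

    Each line is expected to be tab-delimited with the reference name,
    query name and distance in the first three columns.
    """
    names = []   # genome names in order of first appearance
    index = {}   # genome name -> row/column position
    pairs = {}   # (reference, query) -> distance
    for line in lines:
        if not line.strip():
            continue
        ref, query, dist = line.rstrip("\n").split("\t")[:3]
        for name in (ref, query):
            if name not in index:
                index[name] = len(names)
                names.append(name)
        pairs[(ref, query)] = float(dist)

    n = len(names)
    # Pairs that never appear are treated as maximally distant (1.0)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0.0
    for (ref, query), dist in pairs.items():
        i, j = index[ref], index[query]
        matrix[i][j] = dist
        matrix[j][i] = dist
    return names, matrix

# Invented example lines, for illustration only
example_lines = [
    "phageA.fa\tphageB.fa\t0.05\t0\t300/1000",
    "phageB.fa\tphageC.fa\t0.25\t1e-10\t5/1000",
]
names, matrix = read_mash_dist(example_lines)
print(names)
print(matrix)
```

The names list gives the row and column order, so the matrix can be passed on to whatever clustering or heatmap tool you prefer.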

Regardless, the clustered phage genomes shown by the green squares on the heatmap are, for the most part, related to each other according to existing phage taxonomy. This confirms that nMDS clustering of the Jaccard distances from MASH gives a good approximation of structure in the global phage population sequenced to date. Further analysis may reveal new patterns in the global phage population structure and illustrate the bias in phage sampling and sequencing. The dense green boxes in the top right are composed of Mycobacterium phages, which are by far the most numerically abundant in the database.

Below in Figure 1

Figure 1 “The known phage universe”. All versus all comparison of phage genomes.


The script for the clustering and production of heatmap can be found here 

The list of phage on each axis  is here

Below is the updated version with a larger number of genomes than in 2016. Not much has really changed: most phages still have limited similarity to other phages, there are just more of them! Still, many more phage genomes need to be sampled.

All phage compared against all phage


Preparing Bacteriophage genomes for submission to EBI

As part of an undergraduate teaching project, we have recently sequenced a number of bacteriophage genomes. At the end of the sequencing analysis comes submission to a database prior to publication. There are several databases to which newly sequenced genomes can be submitted: EBI, NCBI and DDBJ. My preferred database is EBI, frankly because the submission process is (relatively) less painful than using NCBI and receiving an accession number for the submitted genome takes less time. Having gone through it several times recently, I thought it would be a good idea to outline the steps required to complete the submission in the least painful and time-consuming way.

In order to complete a submission process, you need several things:

  • Webin account
  • A complete annotated genome (gzipped),
  • Raw Fastq reads (gzipped),
  • TaxonID code
  • Phage name
  • Locus tag
  • Library insert size
  • Coverage of the genome
  • Description of the sequencing technology and the machine used, e.g. NexteraXT with a MiSeq
  • The relevant metadata for the phage, e.g. date of isolation, source of isolate etc.

If you haven’t re-sequenced a previously sequenced bacteriophage, then a new Taxon ID code is required for your newly sequenced phage. This code has to be requested in advance of submission (you can start the submission, but will not be able to get very far without the Taxon ID). For a new Taxon ID, a PROJECT ID is needed, so the first step in submitting a genome is to create a new PROJECT under ENA Webin submission (if you don’t have a Webin account you will have to create one first). During this process, you can also reserve or create a unique locus tag for the genome. Requesting a new Taxon ID only takes a few days. Here is a brief list of requirements for the Taxon ID request:

  • Phage name

  • Host

  • Submitter name and email address used for ENA submission account

  • WebinID

  • Project/study Id

Full details of how to get a new Taxon ID can be found here

Once these have been created, the genome can be annotated and formatted correctly for submission. A range of programs can be used for genome annotation; a good option for the first pass is PROKKA. Using the “--locustag XXXX” option with PROKKA will automatically create the locus tags with the correct locus identifier if XXXX is replaced by the locus tag created/reserved when the project was created, e.g.

prokka1.11 --locustag XXXX --addgenes contigs.fasta --prefix phage1 --outdir phage1

Running this command will create a file called phage1.gbk that we can use as the basis for submission. The next step is to change the format of this file to EMBL format. For this, I use a script to create a file called phage1.embl. This conversion using BioPerl creates a file with the following few header lines:

ID unknown; SV 1; linear; unassigned DNA; STD; UNC; 348043 BP.
AC unknown;
OS Genus species
OC Unclassified.
CC Annotated using prokka 1.11 from

Unfortunately, these few lines are not correct for submission in their current form and need to be changed to meet the current flat file standard, so that they look like the lines below. The ID line should only include linear or circular, depending on the type of phage you have. The PR line contains the project ID that is obtained when the study is registered. Full details of what is needed can be found here

ID XXX; XXX; linear; XXX; XXX; XXX; XXX.
AC * _{chr1}
PR Project:PRJEB1234;
RL Submitted (17-JUL-2016) to the INSDC.
CC Annotated using prokka 1.11 from .

I did this using a script. The PR line should contain the PROJECT ID that was generated earlier. This file can now be checked with the flat file validator program from EBI, which can be obtained here.
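The header changes described above can be sketched in Python as follows. This is not the script I used, just an illustration under a few assumptions: the function name fix_embl_header and its parameters are invented for this example, the line codes are followed by three spaces as in standard EMBL flat files (the listings in this post show the spacing collapsed), the entry name is written without braces, and any other header lines (OS, OC, CC) are passed through untouched.

```python
def fix_embl_header(lines, project_id, topology="linear",
                    entry_name="chr1", submitted="17-JUL-2016"):
    """Rewrite the ID and AC lines and add PR and RL lines after AC.

    All other lines are returned unchanged. Only pass in header lines,
    or a whole flat file whose other lines do not start with ID or AC.
    """
    if topology not in ("linear", "circular"):
        raise ValueError("topology must be 'linear' or 'circular'")
    fixed = []
    for line in lines:
        if line.startswith("ID"):
            # Everything except the topology is replaced with XXX
            fixed.append(f"ID   XXX; XXX; {topology}; XXX; XXX; XXX; XXX.")
        elif line.startswith("AC"):
            fixed.append(f"AC   * _{entry_name}")
            fixed.append(f"PR   Project:{project_id};")
            fixed.append(f"RL   Submitted ({submitted}) to the INSDC.")
        else:
            fixed.append(line)
    return fixed

# Header as produced by the conversion step, using the example values above
header = [
    "ID   unknown; SV 1; linear; unassigned DNA; STD; UNC; 348043 BP.",
    "AC   unknown;",
    "OS   Genus species",
    "OC   Unclassified.",
]
for line in fix_embl_header(header, project_id="PRJEB1234"):
    print(line)
```

Whatever tool does the rewriting, the validator is still the final word on whether the header is acceptable.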

This will identify any issues with the files prior to submission that will need to be fixed. The validator program output file VAL_ERROR.txt contains the details of any errors and gives clues as to how they may be fixed. Once these reports have no errors, the final EMBL file can be gzipped ready for submission. This is easily done with the command gzip.

One of the additional requirements for submission is an md5sum value. More information on what MD5SUM values are and why they are used can be found here. The MD5SUM value of the gzipped file embl1.gz can be produced using the following command line:

md5sum embl1.gz

MD5SUM values will also be needed for the gzipped fastq files. These values can be obtained in a similar manner to the above example for the embl file.
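If you have many files to process, the same values can be computed from Python with the standard hashlib module. The sketch below should give the same hex digest as md5sum for a given file; the helper name md5_of_file and the throwaway demo file are just for illustration.

```python
import gzip
import hashlib
import os
import tempfile

def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a small throwaway gzipped file (a stand-in for a real fastq.gz)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "reads_1.fastq.gz")
with gzip.open(demo_path, "wb") as handle:
    handle.write(b"@read1\nACGT\n+\nIIII\n")
print(demo_path, md5_of_file(demo_path))
```

Note that the value is calculated on the gzipped file itself, not on its uncompressed contents, so compute it after compression and do not re-compress the file afterwards.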

Submission also requires the insert size of the library and the coverage of the genome. There are a number of ways these parameters can be calculated. It is highly likely that in the process of genome assembly a BAM file has been generated, with reads mapped back to the assembled phage genome. We can use a program called Qualimap to analyse the BAM files and get the insert size and the coverage. In order to make this process easier for multiple genomes, I use Perl scripts.
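Qualimap reports coverage directly from the BAM file. As a rough cross-check on that figure, mean coverage is simply the total number of mapped bases divided by the genome length. The Python below sketches that arithmetic. The read count and read length are invented for illustration; only the genome length is taken from a phage discussed elsewhere on this blog.

```python
def mean_coverage(mapped_bases, genome_length):
    """Mean coverage = total mapped bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return mapped_bases / genome_length

# Hypothetical run: 200,000 mapped reads of 250 bp on a 348,043 bp genome
mapped_reads = 200_000
read_length = 250
genome_length = 348_043
print(round(mean_coverage(mapped_reads * read_length, genome_length), 1))
```

This ignores soft-clipping, unmapped reads and uneven coverage, so expect it to differ a little from the value Qualimap gives.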

The final checklist of things that are needed are

  1. Annotated phage genome gzipped and md5sum value

  2. Raw fastq files gzipped and md5sum values

  3. Taxon ID value

  4. Insert size of library

  5. Coverage of genome

  6. Assembly algorithm used

  7. Sequencing platform used

Once you have this information it is relatively painless (note relatively) to submit a single genome to EBI. As you have created a PROJECT ID already to get a locus tag and Taxon ID, there is no need to create another project. I normally submit reads first, then my assembly and in the process they are all associated with the same PROJECT ID.

Upon submission of the assembly files, there is a bewildering number of combinations of choices of scaffolds/contigs/chromosomes and annotated or not. Most bacteriophages will assemble into a complete genome without gaps or NNNs. In that case the choices would be:

Does the assembly contain contigs? NO

Does the assembly contain scaffolds? NO

Does the assembly contain chromosomes? YES

Are the chromosomes functionally annotated? YES

These choices are what I have been advised to use by a member of the ENA Team.

This therefore requires a chromosome list file (and the MD5SUM of it). The format of this file is very simple and the details of how it can be produced are here.

In its simplest form, it is a one-line tab-delimited file, e.g.

chr01 I Chromosome
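For more than a handful of genomes it is easier to generate this file than to type it. A minimal Python sketch is below, using the three columns from the example line above. The function names are invented for illustration, and writing the file gzipped is an assumption on my part about what the submission expects.

```python
import gzip
import os
import tempfile

def chromosome_list_line(object_name, chromosome_name, chromosome_type="Chromosome"):
    """Return one tab-delimited chromosome list line, ending in a newline."""
    return "\t".join([object_name, chromosome_name, chromosome_type]) + "\n"

def write_chromosome_list(path, rows):
    """Write one line per (object_name, chromosome_name, chromosome_type) row, gzipped."""
    with gzip.open(path, "wt") as handle:
        for row in rows:
            handle.write(chromosome_list_line(*row))

# Demo: write the single-line example from this post to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "chromosome_list.txt.gz")
write_chromosome_list(demo_path, [("chr01", "I", "Chromosome")])
with gzip.open(demo_path, "rt") as handle:
    print(repr(handle.read()))
```

The first column needs to match the name used for the sequence in the flat file, so check the two against each other before submitting.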

Jumbo Phages

Recently Carrie Smith, a talented undergrad student, isolated two bacteriophages on E. coli from swine feces that have very large genomes.

Currently, the top ten largest bacteriophage genomes within the ENA are

Rank  Phage name                                        Size (kb)  Accession
1     Bacillus phage G                                  497.513    JN638751
2     Aureococcus anophagefferens virus isolate BtV-01  370.92     KJ645900
3     Cronobacter phage vB_CsaM_GAP32                   358.663    JN882285
4     Escherichia phage 121Q                            348.532    KM507819
5     Escherichia phage PBECO 4                         348.113    KC295538
6     Enterobacteria phage vB_KleM-RaK2                 345.809    JQ513383
7     Pseudomonas phage 201phi2-1                       316.674    EU197055
8     Pseudomonas phage PhiPA3                          309.208    HQ630627
9     Pseudomonas phage OBP                             284.757    JN627160
10    Pseudomonas phage Lu11                            280.538    JQ768459

The phage we isolated has a genome size of 348,043 bp and just misses out on being in the top five largest phage genomes. Whilst all the above phage genomes are larger than the vast majority of phage genomes, only Bacillus phage G makes it into the top 10 largest viral genomes. Even that is dwarfed by the giant Pandoravirus salinus (~2.4 Mb)!

Preliminary analysis of the genome suggests the majority of predicted genes have no similarity to genes in current databases. However, a number of genes have been identified that are homologues of host genes. These include gyrA, gyrB and rpoD homologues.

Analysis of the gene encoding the major capsid protein suggests it is distantly related to other T4-like T4_gp23