Preparing Bacteriophage genomes for submission to EBI

As part of an undergraduate teaching project, we have recently sequenced a number of bacteriophage genomes. At the end of the sequencing analysis is a submission to a database prior to publication. There are several databases to which newly sequenced genomes can be submitted: EBI, NCBI and DDJB . My preferred database is EBI, frankly because the submission process is less painful (relatively) than using NCBI and receiving an accession number of the submitted genome takes less time. Having gone through it several times recently, I thought it would be a good idea to outline the steps required to complete the submission in the least painful and time-consuming way.

In order to complete a submission process, you need several things:

  • Webin account
  • A complete annotated genome (gzipped),
  • Raw Fastq reads (gzipped),
  • TaxonID code
  • Phage name
  • Locus tag
  • Library insert size
  • Covearge of the genome
  • Description of the sequencing technology and the machine used . eg NexteraXT with a MiSeq
  • The relevant metadata for the phage , eg data of isolation, source of isolate etc

If you haven’t re-sequenced a previously sequenced bacteriophage, then a new Taxon ID code is required for your newly sequenced phage. This code has to be requested and needs to be done in advance of submission (you can start the submission, but will not be able to get very far without the Taxon ID). For a new Taxon ID a PROJECT ID is needed. So the first step in submitting a genome is to create a new PROJECT under ENA Webin submission (If you dont have a Webin account you will have to create one first). During this process, you can also reserve or create a unique locustag for the genome.The process of requesting a new Taxon ID only takes few day and needs to be done in advance of submission, as you won’t get far in the submission process without it.  Here is a brief list of requirements for the Taxon ID request::

  • Phage name

  • Host

  • Submitter name and email address used for ENA submission account

  • WebinID

  • Project/study Id

Full details of how to get a new Taxon ID can be found here

Once these have been created then the genome can be annotated and formatted correctly for submission. A range of programs can be used for genome annotation,  a good option for the first pass is PROKKA . Using the “–locustag XXXX” option with PROKKA will automatically create the locus tags with correct locus identifier if XXXX is replaced by the locus tag created/reserved when the project was created.  eg

prokka1.11 —locustag  XXXX —addgenes contigs.fasta –prefix phage1 —outdir phage1

Running this command, will create a file called phage1.gbk that we can use for the basis of submission. The next step is to change the format of this file to EMBL format. For this, I use a script called genbank_2_embl.pl  to create a file called phage1.embl.This conversion using BioPerl creates  a file with the following few header lines:

ID unknown; SV 1; linear; unassigned DNA; STD; UNC; 348043 BP.
XX
AC unknown;
XX
XX
XX
OS Genus species
OC Unclassified.
XX
CC Annotated using prokka 1.11 from http://www.vicbioinformatics.com.
XX

Unfortunately, these few lines are not correct for submission in the current form and need to be changed to meet current flat file standard. The above lines outlined in red need to be changed to meet the current requirements. Therefore those lines need to be changed to look like this. The ID line should only include linear or circular , depending on the type of phage you have. PR, contains the project ID that is obtained when the study is registered. Full details of what is needed can be found here

ID XXX; XXX; linear; XXX; XXX; XXX; XXX.
XX
AC XXX;
AC * _{chr1}
PR Project:PRJEB1234;
XX
RL Submitted (17-JUL-2016) to the INSDC.
XX
XX
XX
XX
CC Annotated using prokka 1.11 from http://www.vicbioinformatics.com .

I did this using a script called emb_reformat.pl .The line PR should contain the PROJECT ID that was generated earlier. This file can now be checked with the validator program for flatfiles from EBI that can be obtained here.

This will identify any issues with the files prior to submission that will need to be fixed. The validator program output file  VAL_ERROR.txt  contains the details of any errors and give clues to how they may be fixed. Once these reports have no errors the final EMBL file can be gzipped ready for submission. This is easily done with the command gzip.

One of the additional requirements for submission is a md5sum value. More information on what MD5SUM values are and why they are used can be found here. MD5SUM value of the gzipped filw embl1.gz file can be produced using the following command line:

md5sum embl1.gz

MD5SUM values will also be needed for the gzipped fastq files. These values can be obtained in a similar manner to the above example for the embl file.

Submission also requires the insert size of the library and the coverage of the genome. There is a number of ways these parameters can be calculated. It is highly likely that in the process of genome assembly, a BAM file has been generated with reads being mapped back to the assembled phage genome.  We can use a program called Qualimap to analyse the produced BAM files and get the insert size and the coverage.  In order to make this process easier for multiple genomes, I use the following perl scripts calc_bam_coverage.pl and calc_bam_insert.pl

The final checklist of things that are needed are

  1. Annotated phage genome gzipped and md5sum value

  2. Raw fastq files gzipped and md5sum values

  3. TaxaID value

  4. Insert size of library

  5. Coverage of genome

  6. Assembly algorithm used

  7. Sequencing platform used

Once you have this information it is relatively painless (note relatively) to submit a single genome to EBI. As you have created a PROJECT ID already to get a locus tag and Taxon ID, there is no need to create another project. I normally submit reads first, then my assembly and in the process they are all associated with the same PROJECT ID.

Upon submission of the assembly files, there is a bewildering number of combinations of choices of scaffolds/contigs/chromosomes and annotated or not. Most bacteriophages will assemble into a complete genome without gaps or NNNs. There the choice would be

Does the assembly contain contigs? NO

Does the assembly contain scaffolds? NO

Does the assembly contain chromosomes? YES

Are the chromosomes functionally annotated? YES

These choices are what I have been advised to use by a member of the ENA Team.

This therefore requires a Chromosome list file ( and the MD5SUM of it ). This format of this file is very simple and the detaisl of how it can be produced arehere 

In its simplest form, it is a one line tab delimited file . eg

chr01 I Chromosome

Jumbo Phages

Recently Carrie Smith a talented undergrad student isolated two bacteriophages on E. coli from swine feces that have a very large genome size.

Currently, the top ten largest bacteriophage genomes within the ENA are

Rank Phage Name Size kb Accession
1 Bacillus phage G 497.513 JN638751
2 Aureococcus anophagefferens virus isolate BtV-01 370.92 KJ645900
3 Cronobacter phage vB_CsaM_GAP32 358.663 JN882285
4 Escherichia phage 121Q 348.532 KM507819
5 Escherichia phage PBECO 4 348.113 KC295538
6 Enterobacteria phage vB_KleM-RaK2 345.809 JQ513383
7 Pseudomonas phage 201phi2-1 316.674 EU197055
8 Pseudomonas phage PhiPA3 309.208 HQ630627
9 Pseudomonas phage OBP 284.757 JN627160
10 Pseudomonas phage Lu11 280.538 JQ768459

The phage we isolated has a genome size of 348,043 bp and just misses out on being in the top five largest phage genomes.  Whilst all the above phage genomes are larger than the vast majority of phage genomes, only Bacillus phage G makes it into the top 10 largest viral genomes. Even that is dwarfed by the giant Pandoravirus salinus (~ 2.4 Mb ) ! 

Preliminary analysis of the genome suggests the majority of predicted genes have no similarity to genes in current databases. However, a number of genomes have been identified that are homologues of host genes. These genes include gyrA, gryB and rpoD homologues.

Analysis of the gene encoding for the major capsid protein, suggests it is distantly related to other T4like T4_gp23

Viruses Inhibit CO2 Fixation in the Most Abundant Phototrophs on Earth

This week, we published a paper in Current Biology, titled “Viruses Inhibit CO2Fixation in the Most Abundant Phototrophs on Earth”. The paper is a product of work done by Rich Puxty in the course of his Phd thesis under the supervision ofDave Scanlan, Dave Evans and myself. During his postgraduate research, Rich was looking into the effect of cyanophage infection on host photosynthesis and carbon fixation, using a model system of marine cyanophages S-PM2 and S-RSM4, with their host cyanobacterium Synechococcus WH7803.

Cyanophages are a group of viruses that specifically infect cyanobacteria –  photoautotrophic bacteria  which  utilise light energy to  synthesize sugars and drive metabolic processes. The two most interesting phyla of marine cyanobacteria are Synechococcus and Prochlorococcus. They are widespread in the global oceans and play an important role in the fixation of CO2 and production of oxygen. It has been known for two decades that cyanophages infect cyanobacteria and for over a decade we have known that cyanophages carry genes that can maintain the photosynthetic apparatus of their host.

Our latest published research has shown that whilst photosynthesis is maintained during infection, CO2 fixation is not, thus effectively decoupling CO2 fixation from photosynthesis. If we use these findings with the current data for the number of infected marine cyanobacterial cells per day, the abundance of cyanobacteria and the data from model system of S-PM2/S-RSM4 and WH7803 ,  we can extrapolate that up to 5.39 petagram (1015 grams) of carbon per year is lost to viral-induced inhibition of marine CO2 fixation. Just to put this number in some perspective, that ~ 10% of total marine primary production. More importantly, this huge impact that viruses have on global carbon fixation has not been taken into account in all of the global models and we are hoping that our research will allow us to get a far more accurate picture about global carbon balances.
Another surprising finding in this work was difference in “fitness ” between cyanophages S-PM2 and S-RSM4. These phages contain a number of auxiliary metabolic genes (AMGs) – genes which are suspected to participate in bacterial metabolic processes of the infected host. Our assumption was that since AMGs are supposed to provide the cyanophage with more metabolites required for successful infection cycle, and thus with selective advantage, the phage which carries more AMGs, would possess greater “fitness”.

By using  infection physiological parameters – burst size (number of phage particles produced per cell) and latent period (time from initial attachment to production of new phage particles), we discovered that surprisingly the phage which contained fewer putative AMGs – S-PM2, had both a larger burst size and a shorter latent period, suggesting selective advantage over AMG-replete S-RSM4.

Puxty, RJ., Millard, AD., Evans, DJ., Scanlan, DJ., Viruses inhibit CO2 fixation in the most abundant phototrophs on Earth. Current Biology. doi: 10.1016/j.cub.2016.04.036. http://dx.doi.org/10.1016/j.cub.2016.04.036

More comments on this work in the general press can be found here

ScienceNews 

Bacteriophage Genome assembly

Recently, we published a paper in PeerJ titled “Assessing Illumina technology for the high-throughput sequencing of bacteriophage genomes “. The work on this problem started several years ago, when I was actively encouraged (forced) to write a blog post on the Warwick Microbiology and Infection Unit blog (http://blogs.warwick.ac.uk/microbialunderground/). At the time, I looked at some basic parameters that influenced the assembly of bacterial genomes and showed how above a certain coverage there was no benefit in increasing sequencing depth alone. 

Given my interested in bacteriophage genomes and some work we were doing on sequencing multiple phage genomes, I started to look at the amount of sequence coverage required to successfully assemble bacteriophage genomes.  Combined with a throwaway comment on a grant review I received saying that  “it won’t work anyway, as most phage genomes cannot be assembled”, I had enough motivation to start looking at this in far more detail, largely out of frustration and the wish to prove the reviewer wrong.

Using in silico modelling of ~2000 complete phage genomes from ENA (some later got removed as the initial assemblies contained too many ambiguous bases), we investigated how increasing the depth of coverage and insert size affected the assembly of phage genomes when using SPAdes as an assembly algorithm. We demonstrated that the majority of bacteriophage genomes could be assembled without errors and went on to test this in vitro, by the sequencing of some novel phage genomes.

The resulting paper was submitted to PeerJ, where I experienced, for the first time, open peer review, as two of the four reviewers signed their reviews. An interesting observation from the review was that the two signed reviews provided the most useful critique by far. The feedback from the reviewers undoubtedly improved the manuscript by requesting that we further test our hypothesis using additional two assemblyalgorithms (Ray & Velvet), using a variety of sequencing depths and insert sizes.

As a result, after completing ~ 75,000 phage assemblies with a range of insert sizes, sequencing depths and assembly algorithms,we demonstrated that:

  • The majority of phage genomes (>98 %) can be completely assembled at 30x coverage
  • Multiple phages can be combined in a single sequencing library
  • the sequenced phage genomes from combined libraries can be completely reconstructed.

For full details, check out the paper:
Assessing Illumina technology for the high-throughput sequencing of bacteriophage genomes.
Rihtman B, Meaden S, Clokie MR, Koskella B, Millard AD.
PeerJ. 2016 Jun 1;4:e2055. doi: 10.7717/peerj.2055. eCollection 2016.