Bacteriophage Genomes

***As of October 2018, monthly updates will each have their own page***

This page will remain, as it contains details of how the genomes were extracted.

Table of >9000 complete bacteriophage genomes extracted from GenBank on 31st May 2018. The number of genomes has increased by only ~50 since last month. Based on feedback, we have excluded a number of genomes from last month that were not complete but passed through our automated filtering system. The large increase this last month seems in part due to the submission of predicted prophages from bacterial genomes, but without experimental evidence that they are functional.

Number of complete (or near complete, see below) genomes.

 

 


We used the command esearch -db nucleotide -query "gbdiv_PHG"[prop] | efetch -format gb > phages.gb to first pull out all phage nucleotide data.

We then filtered this file for records matching

"Complete" & "Genome|chromosome" or

"Sequence Length > 10000" & "Complete" & "Sequence|genome"

or an accession number that matches a list of accessions we have manually curated of complete bacteriophage genomes that don't match the above criteria. We updated these parsing parameters after helpful feedback that pointed out we were missing some phages. We think this allows the extraction of all complete (or near complete) bacteriophage genomes, whilst excluding fragments of genomes. We have manually looked through the list of descriptions we extract to identify obvious fragments, e.g. "21 kb region of a cyanophage genome", and added further rules to exclude these specific fragments.
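The filtering rules above can be sketched in Python roughly as follows. This is only an illustration, not the script we actually ran: it assumes the keywords are matched case-insensitively against the DEFINITION line of each GenBank record and that the sequence length is read from the LOCUS line. The function names parse_records and looks_complete are made up for this example, and the extra rules that exclude specific fragments are not included.

```python
import re


def parse_records(text):
    """Yield (accession, length, definition) for each record in GenBank flat-file text."""
    for chunk in text.split("\n//"):
        lines = chunk.strip().splitlines()
        if not lines or not lines[0].startswith("LOCUS"):
            continue
        # The LOCUS line carries the sequence length, e.g. "48502 bp".
        m = re.search(r"(\d+) bp", lines[0])
        length = int(m.group(1)) if m else 0
        accession, definition, in_def = "", [], False
        for line in lines[1:]:
            if line.startswith("DEFINITION"):
                in_def = True
                definition.append(line[len("DEFINITION"):].strip())
            elif in_def and line.startswith(" "):
                # Continuation lines of a wrapped DEFINITION are indented.
                definition.append(line.strip())
            else:
                in_def = False
                if line.startswith("ACCESSION") and not accession:
                    parts = line.split()
                    accession = parts[1] if len(parts) > 1 else ""
        yield accession, length, " ".join(definition)


def looks_complete(accession, length, definition, curated=frozenset()):
    """Apply the three rules: curated accession, or one of the two keyword rules."""
    if accession in curated:
        return True
    d = definition.lower()
    if "complete" not in d:
        return False
    # Rule 1: "Complete" & "Genome|chromosome"
    if re.search(r"genome|chromosome", d):
        return True
    # Rule 2: length > 10000 & "Complete" & "Sequence|genome"
    return length > 10000 and re.search(r"sequence|genome", d) is not None
```

With a file produced by the esearch/efetch step, something like [r for r in parse_records(open("phages.gb").read()) if looks_complete(*r)] would give the retained records, under the assumptions stated above.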

Thanks to Kelly Williams for highlighting and then helping filter out some further issues. The table below will contain duplicates if a phage has been sequenced multiple times, for instance in an experimental evolution experiment. It also contains both the RefSeq entry and the original GenBank entry for a genome, so the number of unique genomes will be lower. Using the phage description as an identifier, this is 5446 genomes (from March).
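As a rough illustration of counting unique genomes by description and separating RefSeq from GenBank entries, here is a minimal Python sketch. It assumes that RefSeq accessions can be recognised by their two-letter prefix followed by an underscore (e.g. NC_), and that descriptions are compared after lowercasing and collapsing whitespace; that normalisation is our assumption for the example, and the function names are hypothetical.

```python
import re

# RefSeq accessions have a two-letter prefix and an underscore, e.g. NC_029032.
REFSEQ_RE = re.compile(r"^[A-Z]{2}_\d+")


def is_refseq(accession):
    """Return True if the accession looks like a RefSeq accession."""
    return REFSEQ_RE.match(accession) is not None


def count_unique_descriptions(records):
    """Count distinct descriptions in an iterable of (accession, description) pairs."""
    return len({" ".join(desc.lower().split()) for _, desc in records})


def split_refseq(records):
    """Return (refseq_records, other_records) from (accession, description) pairs."""
    refseq = [r for r in records if is_refseq(r[0])]
    other = [r for r in records if not is_refseq(r[0])]
    return refseq, other
```

A RefSeq entry and its original GenBank entry usually share a description, so they collapse to one genome in this count, as do repeated sequencings of the same phage that carry identical descriptions.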

Again thanks to Kelly Williams, who has found more complete phage genomes that are not identified by "gbdiv_PHG" in GenBank or are in the ENV database. The accessions for these genomes are:

FN436269,NC_029032,NC_023585,NC_028693,NC_021793,NC_025471,NC_026606,NC_008355,NC_028651,NC_028998,NC_029103,NC_027984,NC_006938,HQ157199,NC_028990,NC_025444,HQ157198,KT626047,NC_028658,NC_014322,NC_028671,NC_024711,NC_028992,NC_029020,NC_029016,NC_029027,NC_029011 

I will add these to our database next time we update it.
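For anyone wanting to fetch these extra records themselves, the sketch below turns a comma-separated accession list like the one above into an efetch call from the same EDirect toolkit used earlier. It is only a sketch: the function name efetch_command is made up for this example, and removing duplicate accessions is our own addition.

```python
def efetch_command(accessions_csv, db="nucleotide", fmt="gb"):
    """Build an efetch argument list from a comma-separated accession string."""
    seen = set()
    unique = []
    for acc in accessions_csv.split(","):
        acc = acc.strip()
        # Skip empty entries and accessions already seen, keeping first-seen order.
        if acc and acc not in seen:
            seen.add(acc)
            unique.append(acc)
    return ["efetch", "-db", db, "-id", ",".join(unique), "-format", fmt]
```

The returned list can be passed to subprocess.run with stdout redirected to a file, so that the records can then be appended to phages.gb before filtering.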

Thanks to feedback from Andrew Kropinski, we have removed some further non-complete genomes.

Flat files are here for those wanting to batch download the entire list

Flat files with REFSEQ entries removed

Flat files with only REFSEQ entries – This is as extracted from this dataset for phages that have a RefSeq accession

[Table of genomes not available]