Key Steps in Phage Genome Assembly and Annotation
We routinely sequence bacteriophage genomes. Below is a short list of the steps involved in genome assembly and annotation. The process is described in more detail in a previous how-to guide we have published.
- Sequencing and Read Quality Control
Raw sequencing reads are first assessed and cleaned to remove low-quality bases, adapters, and contaminants (trim_galore). Reads are then normalised (bbnorm.sh).
- Genome Assembly
Reads are assembled into contigs using short- or long-read assemblers (SPAdes, Flye). Assembly graphs should be inspected carefully, as circular graphs do not necessarily indicate circular genomes but may reflect circular permutation or terminal repeats.
- Assembly Validation and Completion
Assemblies are checked for completeness, misassemblies, and unresolved repeats (pilon). Read mapping back to the assembly is used to confirm coverage, identify assembly errors, and infer genome termini where possible.
- Genome Orientation and Re-ordering
Completed phage genomes are standardised by re-ordering the sequence to begin at a biologically meaningful position, typically the small terminase subunit or another conserved gene, improving comparability between genomes (dnaapler).
- Gene Prediction
Open reading frames (ORFs) are predicted using phage-appropriate gene callers. Automated predictions are reviewed to minimise false positives and ensure biologically realistic gene structures (prodigal, run from within Prokka).
- Functional Annotation
Predicted proteins are compared against reference databases to assign putative functions. Most phage genes remain hypothetical, but key structural, replication, and lysis genes are prioritised for accurate annotation (Prokka, with PHROGs database).
- tRNA and Non-coding Feature Detection
Genomes are screened for tRNAs and other non-coding elements that may contribute to phage fitness or host interaction (Prokka).
- Genome Classification and Contextualisation
The annotated genome is compared to existing phage genomes to determine relatedness and taxonomy.
- Preparation for Database Submission
Final genomes and annotations are formatted to meet public repository standards (e.g. EMBL).
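
To make the genome orientation and re-ordering step more concrete, the sketch below shows the underlying idea in plain Python: rotate a sequence that is being treated as circular so that a chosen anchor gene begins at position 1, reverse-complementing first if the gene lies on the minus strand. This is only an illustration of the concept and is not how dnaapler is implemented. The function names (`reorient`, `reverse_complement`), the coordinate convention (0-based, half-open), and the toy sequences are all assumptions made for this example; in practice the anchor gene's coordinates would come from a homology search.

```python
# Illustrative sketch only: NOT dnaapler's implementation.
# Function names, coordinate convention and toy sequences are hypothetical.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reorient(seq: str, gene_start: int, gene_end: int, strand: int) -> str:
    """Rotate a circular sequence so the anchor gene starts at index 0.

    gene_start, gene_end: 0-based, half-open coordinates on the input
        sequence (gene_start < gene_end), regardless of strand.
    strand: +1 if the gene is on the forward strand, -1 if on the reverse.
    The returned sequence always has the anchor gene on the forward strand.
    """
    if strand not in (1, -1):
        raise ValueError("strand must be +1 or -1")
    if not 0 <= gene_start < gene_end <= len(seq):
        raise ValueError("gene coordinates fall outside the sequence")

    if strand == 1:
        # Forward strand: a simple rotation is enough.
        return seq[gene_start:] + seq[:gene_start]

    # Reverse strand: flip the whole genome, then rotate. After reverse
    # complementing, the gene's first base sits at len(seq) - gene_end.
    flipped = reverse_complement(seq)
    new_start = len(seq) - gene_end
    return flipped[new_start:] + flipped[:new_start]


# Toy example: the "gene" ATGCCCTAA sits at positions 2..11 (half-open).
forward_genome = "TTATGCCCTAAGG"
print(reorient(forward_genome, 2, 11, 1))   # ATGCCCTAAGGTT

# The same genome submitted in the opposite orientation gives the same result.
reverse_genome = reverse_complement(forward_genome)
print(reorient(reverse_genome, 2, 11, -1))  # ATGCCCTAAGGTT
```

A rotation like this only makes sense when the sequence can reasonably be treated as circular. Where read mapping has revealed defined genome termini, those may be a better choice of start point than an anchor gene.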