Key Steps in Phage Genome Assembly and Annotation

We routinely sequence bacteriophage genomes. Below is a short list of the steps involved in genome assembly and annotation. The process is described in more detail in a how-to guide we have previously published.

  1. Sequencing and Read Quality Control
    Raw sequencing reads are first assessed and cleaned to remove low-quality bases, adapters, and contaminants (trim_galore). Reads are then normalised by coverage depth (bbnorm.sh).
  2. Genome Assembly
    Reads are assembled into contigs using short- or long-read assemblers (SPAdes, Flye). Assembly graphs should be inspected carefully, as circular graphs do not necessarily indicate circular genomes but may reflect circular permutation or terminal repeats.
  3. Assembly Validation and Completion
    Assemblies are checked for completeness, misassemblies, and unresolved repeats (pilon). Read mapping back to the assembly is used to confirm coverage, identify assembly errors, and infer genome termini where possible.
  4. Genome Orientation and Re-ordering
    Completed phage genomes are standardised by re-ordering the sequence to begin at a biologically meaningful position, typically the small terminase subunit or another conserved gene, improving comparability between genomes (dnaapler).
  5. Gene Prediction
    Open reading frames (ORFs) are predicted using phage-appropriate gene callers. Automated predictions are reviewed to minimise false positives and ensure biologically realistic gene structures (Prodigal, run from within Prokka).
  6. Functional Annotation
    Predicted proteins are compared against reference databases to assign putative functions. Most phage genes remain hypothetical, but key structural, replication, and lysis genes are prioritised for accurate annotation (Prokka, with PHROGs database).
  7. tRNA and Non-coding Feature Detection
    Genomes are screened for tRNAs and other non-coding elements that may contribute to phage fitness or host interaction (Prokka).
  8. Genome Classification and Contextualisation
    The annotated genome is compared to existing phage genomes to determine relatedness and taxonomy.
  9. Preparation for Database Submission
    Final genomes and annotations are formatted to meet public repository standards (e.g. EMBL).
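
As an illustration of step 4, the re-ordering itself can be sketched in a few lines of Python. This is a minimal sketch and not how dnaapler is implemented: the function names (reverse_complement, reorient), the 0-based coordinates, and the example sequences are all hypothetical, and the position of the chosen gene is assumed to be known already (for example from a prior annotation). Rotation of this kind only makes sense when the assembled sequence is circular or circularly permuted.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(complement)[::-1]


def reorient(genome: str, gene_start: int, strand: str = "+") -> str:
    """Rotate a circular(ly permuted) genome so that it begins at a chosen gene.

    gene_start is the 0-based forward-strand index of the first base of the
    gene's start codon. For a gene on the "-" strand this is the rightmost
    base of the gene in forward coordinates; the genome is then reverse
    complemented so that the gene reads left to right from position 0.
    """
    if not 0 <= gene_start < len(genome):
        raise ValueError("gene_start lies outside the genome")

    if strand == "+":
        # Simple rotation: everything from the gene onwards, then the rest.
        return genome[gene_start:] + genome[:gene_start]

    if strand == "-":
        # Flip the genome, convert the coordinate, then rotate as above.
        flipped = reverse_complement(genome)
        flipped_start = len(genome) - 1 - gene_start
        return flipped[flipped_start:] + flipped[:flipped_start]

    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # Toy sequence with a forward-strand ATG beginning at index 5.
    print(reorient("TTGACATGAAACCC", 5))
    # Toy sequence whose ATG is on the reverse strand (CAT at indices 3 to 5).
    print(reorient("GGGCATAA", 5, "-"))
```

Both calls return a sequence of the same length as the input that begins with the start codon of the chosen gene. In practice a dedicated tool such as dnaapler also finds the gene, which this sketch leaves out.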