Checking Assembly Graphs for Bubbles During Phage Genome Annotation
Assembly graphs provide an important quality-control step when annotating bacteriophage genomes, particularly for identifying “bubbles”: alternative paths in the graph that represent sequence ambiguity. Bubbles commonly arise from sequencing errors, local repeats, strain variation, or unresolved heterogeneity in the input data. In phage genomics, they can also reflect true biological variation, such as microdiversity within a plaque, that is, the presence of two similar phages within the same plaque.
Inspecting assembly graphs allows us to determine whether a phage genome has been resolved into a single, coherent path or whether competing paths remain.
Visualisation tools such as Bandage enable direct inspection of graph structure, making it possible to identify bubbles associated with low coverage, short repeat regions, or conflicting read support. Combining graph inspection with coverage information and read mapping helps distinguish between biologically meaningful variation and assembly artefacts.
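Alongside visual inspection, candidate bubbles can be listed programmatically. The following is a minimal sketch, assuming the assembler wrote its graph in GFA1 format with tab-separated `L` (link) lines. The function names `parse_gfa_links` and `find_simple_bubbles` are hypothetical. The sketch only reports the simplest case, where every alternative path is a single segment; it is not a general bubble-detection algorithm.

```python
from collections import defaultdict


def parse_gfa_links(gfa_text):
    """Build a successor map {(segment, orientation): set of (segment, orientation)}
    from the 'L' lines of a GFA1 file. Overlap fields are ignored."""
    flip = {"+": "-", "-": "+"}
    succ = defaultdict(set)
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "L" or len(fields) < 5:
            continue
        a, a_or, b, b_or = fields[1:5]
        succ[(a, a_or)].add((b, b_or))
        # Each link also implies the same link read on the opposite strand.
        succ[(b, flip[b_or])].add((a, flip[a_or]))
    return succ


def find_simple_bubbles(succ):
    """Return (source, arms, sink) tuples where two or more single-segment
    arms leave the same source and rejoin at the same sink."""
    bubbles = []
    for source, arms in list(succ.items()):
        if len(arms) < 2:
            continue
        arms_by_sink = defaultdict(list)
        for arm in arms:
            next_nodes = succ.get(arm, set())
            if len(next_nodes) == 1:  # arm continues to exactly one segment
                arms_by_sink[next(iter(next_nodes))].append(arm)
        for sink, grouped in arms_by_sink.items():
            if len(grouped) >= 2:
                bubbles.append((source, sorted(grouped), sink))
    return bubbles
```

Because links are stored for both strands, each bubble is reported twice, once per orientation. The arms returned here are the segments whose coverage and read support would then be compared, for example in Bandage.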
Resolving bubbles, by selecting the most supported path or re-assembling with adjusted parameters, improves genome continuity and annotation accuracy. Routine inspection of assembly graphs is a very useful tool for producing reliable, high-quality phage genomes.
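A major cause of bubbles in assembly graphs is too much sequencing data, a common problem with phage genomes. At very high depth, recurrent sequencing errors and low-frequency variants can gain enough read support to form alternative paths. Always sub-sample or normalise your data prior to assembling a phage genome.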
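The arithmetic behind sub-sampling can be sketched as follows. This is an illustration only: the function name `subsample_fraction` is hypothetical, the default target of 100x is an assumed placeholder rather than a recommended value, and the calculation assumes that most reads derive from the phage and that an approximate genome length is known.

```python
def subsample_fraction(total_bases, genome_length, target_coverage=100):
    """Fraction of reads to keep so that expected coverage is roughly
    target_coverage. The 100x default is an assumption for illustration."""
    if total_bases <= 0 or genome_length <= 0 or target_coverage <= 0:
        raise ValueError("all inputs must be positive")
    current_coverage = total_bases / genome_length
    # Never request more than all of the reads.
    return min(1.0, target_coverage / current_coverage)


# Example: 2 million reads of 150 bp against a genome of about 50 kb.
fraction = subsample_fraction(2_000_000 * 150, 50_000)
```

In this example the raw data correspond to about 6,000x coverage, so only around 1.7% of the reads would be kept. The resulting fraction can then be passed to whichever read sub-sampling tool is in use.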
Important note on circular structures in assembly graphs
The presence of a circular structure in an assembly graph does not necessarily indicate a biologically circular phage genome. In many bacteriophages, especially tailed dsDNA phages, apparent circles arise from circular permutation, where there is no fixed biological start site. Tailed dsDNA phages do not have circular genomes, but assembly algorithms may represent them as a continuous loop even though the packaged DNA is linear.
In addition to circular permutation, terminal repeat structures can also generate circular assembly graphs. Phages with direct terminal repeats (DTRs) or inverted terminal repeats (ITRs) contain identical sequences at both ends of a linear genome. During assembly, these repeated termini can be collapsed or joined by the assembler, producing a circular path in the graph despite the genome having defined physical ends. This effect is particularly common when repeat lengths exceed read length or when coverage across the termini is high and uniform.
As a result, circular paths in assembly graphs may reflect circular permutation or terminal repeat architecture, rather than true genome circularity. Correct interpretation therefore requires additional evidence, such as read mapping patterns, coverage shifts at genome ends, identification of repeat boundaries, or known packaging strategies. Careful evaluation of these features is essential to avoid incorrect assignment of genome topology during phage genome annotation and database submission.
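One piece of such evidence, the identification of repeat boundaries, can be sketched in code for the case where the contig is linear and the repeats have not been collapsed by the assembler. The function names and the 20 bp minimum length are assumptions for illustration, and only exact matches are considered, so this is a rough screen rather than a definitive test of genome ends.

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def direct_terminal_repeat(seq, min_len=20):
    """Length of the longest exact match between the start and the end of
    seq (candidate DTR), or 0 if no match of at least min_len is found."""
    seq = seq.upper()
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def inverted_terminal_repeat(seq, min_len=20):
    """Length of the longest exact match between the start of seq and the
    reverse complement of its end (candidate ITR), or 0 if below min_len."""
    seq = seq.upper()
    rc = revcomp(seq)
    limit = len(seq) // 2
    k = 0
    # An ITR means seq and its reverse complement share a common prefix.
    while k < limit and seq[k] == rc[k]:
        k += 1
    return k if k >= min_len else 0
```

A non-zero result only flags a candidate repeat; it still needs to be checked against read mapping and coverage at the ends, since assemblers can also leave overlapping sequence at the ends of a contig built from a circular path.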
| Genome architecture | Assembly graph appearance | Read-mapping signature | Interpretation |
|---|---|---|---|
| True circular genome | Single circular path | Reads map continuously across the entire genome; long reads span the circular junction | Genome is biologically circular. Possible ssDNA phage or non-tailed dsDNA phage |
| Circularly permuted linear genome | Circular graph despite linear packaging | Uniform read coverage across the genome; no consistent coverage peak or drop; reads map across any chosen start position | Linear genome with no fixed biological start site |
| Direct terminal repeats (DTRs) | Circular graph or short loop at ends | Increased coverage over terminal regions; reads or read pairs map to both ends of the assembly | Linear genome with repeated termini |
| Inverted terminal repeats (ITRs) | Circular or ambiguous graph structure | Reads map to both genome ends in opposite orientations; coverage enrichment at termini | Linear genome with inverted terminal repeats |
| Assembly artefact / unresolved repeats | Bubbles or multiple alternative paths | Uneven coverage; weak or conflicting read support across paths | Likely technical artefact requiring resolution |
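The coverage signatures in the table can be screened numerically. The following is a minimal sketch, assuming per-base read depths along the contig are already available as a list of numbers (for example, exported from a read-mapping tool). The function name `terminal_coverage_ratio` and the 500 bp window are assumptions for illustration; the window should be chosen to suit the expected repeat length.

```python
from statistics import median


def terminal_coverage_ratio(depths, window=500):
    """Return (left_ratio, right_ratio): median depth in the first and last
    `window` positions, each divided by the median depth of the interior."""
    if window <= 0 or len(depths) < 3 * window:
        raise ValueError("contig too short for the chosen window")
    interior = median(depths[window:-window])
    if interior == 0:
        raise ValueError("interior has zero median depth")
    left = median(depths[:window])
    right = median(depths[-window:])
    return left / interior, right / interior
```

Ratios close to 1 at both ends are consistent with the uniform coverage expected for a circular or circularly permuted genome, whereas clearly elevated ratios point towards terminal repeats. Either result is a prompt for closer inspection of the read mapping rather than a conclusion about topology.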