Checking Assembly Graphs for Bubbles During Phage Genome Annotation

Assembly graphs provide an important quality-control step when annotating bacteriophage genomes, particularly for identifying “bubbles”: alternative paths in the graph that represent sequence ambiguity. Bubbles commonly arise from sequencing errors, local repeats, strain variation, or unresolved heterogeneity in the input data. In phage genomics, they can also reflect true biological variation, such as microdiversity within a plaque, that is, the presence of two similar phages within the same plaque.

Inspecting assembly graphs allows us to determine whether a phage genome has been resolved into a single, coherent path or whether competing paths remain.

Visualisation tools such as Bandage enable direct inspection of graph structure, making it possible to identify bubbles associated with low coverage, short repeat regions, or conflicting read support. Combining graph inspection with coverage information and read mapping helps distinguish between biologically meaningful variation and assembly artefacts.
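
As a rough complement to visual inspection, the sketch below scans the text of a GFA file for the simplest kind of bubble (one segment splitting into two branches that rejoin at the same segment) and reports the read depth of each branch. It is an illustration under stated assumptions, not a replacement for a proper graph tool: it assumes GFA version 1 `S` and `L` lines, it assumes depth is stored in a `dp` tag (assemblers differ, and some record k-mer counts instead), and the function names `parse_gfa` and `simple_bubbles` are hypothetical.

```python
from collections import defaultdict

FLIP = {"+": "-", "-": "+"}


def parse_gfa(text):
    """Return (depth per segment, successors per oriented segment) from GFA1 text."""
    depth = {}
    succ = defaultdict(set)
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            name = fields[1]
            depth[name] = None  # stays None if no depth tag is present
            for tag in fields[3:]:
                parts = tag.split(":", 2)
                # Assumption: depth is in a "dp"/"DP" tag; adjust for your assembler.
                if len(parts) == 3 and parts[0].lower() == "dp":
                    depth[name] = float(parts[2])
        elif fields[0] == "L" and len(fields) >= 5:
            a, a_or, b, b_or = fields[1:5]
            succ[(a, a_or)].add((b, b_or))
            # Each link also implies the reverse-complement link.
            succ[(b, FLIP[b_or])].add((a, FLIP[a_or]))
    return depth, succ


def simple_bubbles(succ, depth):
    """Find source -> {branch A, branch B} -> sink patterns, reported once each."""
    pred = defaultdict(set)
    for node, nexts in succ.items():
        for nxt in nexts:
            pred[nxt].add(node)

    seen = set()
    bubbles = []
    for source, branches in succ.items():
        if len(branches) != 2:
            continue
        a, b = sorted(branches)
        if a[0] == b[0]:
            continue  # same segment in both orientations, not a two-branch bubble
        # Each branch must be entered only from the source ...
        if pred[a] != {source} or pred[b] != {source}:
            continue
        # ... and both branches must lead to one shared sink and nowhere else.
        next_a, next_b = succ.get(a, set()), succ.get(b, set())
        if len(next_a) != 1 or next_a != next_b:
            continue
        sink = next(iter(next_a))
        key = frozenset((a[0], b[0]))
        if key in seen:
            continue  # already found while walking the opposite strand
        seen.add(key)
        bubbles.append({
            "ends": tuple(sorted((source[0], sink[0]))),
            "branches": (a[0], b[0]),
            "depths": (depth.get(a[0]), depth.get(b[0])),
        })
    return bubbles
```

A branch whose depth is far below that of its alternative is a candidate artefact, but that reading is only a starting point and should be checked against read mapping before a path is discarded.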

Resolving bubbles, either by selecting the best-supported path or by re-assembling with adjusted parameters, improves genome continuity and annotation accuracy. Routine inspection of assembly graphs is a very useful tool for producing reliable, high-quality phage genomes.

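A major cause of bubbles in assembly graphs is too much sequencing data, a common problem with phage genomes: at very high depth, recurrent sequencing errors and rare variants gather enough read support to appear as alternative paths. Always sub-sample or normalise your data before assembling a phage genome.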
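
The arithmetic behind sub-sampling can be sketched as follows. This is a minimal illustration, not a recommended pipeline: the default `target_coverage=100` is a placeholder assumption rather than a value taken from this guide, the expected genome size has to be estimated by the user, and the function names are hypothetical. In practice a dedicated sub-sampling or normalisation tool would normally be used instead.

```python
import random


def keep_fraction(total_bases, genome_size, target_coverage=100):
    """Fraction of reads to keep so that expected coverage is about target_coverage.

    target_coverage=100 is an illustrative placeholder, not a recommendation.
    """
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def subsample_fastq(lines, fraction, seed=0):
    """Randomly keep whole 4-line FASTQ records with probability `fraction`.

    Using the same seed on both files of a read pair keeps mates together,
    provided both files list the same reads in the same order.
    """
    rng = random.Random(seed)
    kept = []
    for i in range(0, len(lines) - 3, 4):
        if rng.random() < fraction:
            kept.extend(lines[i:i + 4])
    return kept
```

For example, 50 Mb of reads for an expected 50 kb genome corresponds to roughly 1000x coverage, so `keep_fraction(50_000_000, 50_000)` returns 0.1 with the placeholder target of 100x.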

Important note on circular structures in assembly graphs

It is important to note that the presence of a circular structure in an assembly graph does not necessarily indicate a biologically circular phage genome. In many bacteriophages, especially tailed dsDNA phages, apparent circles arise from circular permutation, where there is no fixed biological start site. Tailed dsDNA phages do not have circular genomes, but assembly algorithms may represent them as a continuous loop even though the packaged DNA is linear.

In addition to circular permutation, terminal repeat structures can also generate circular assembly graphs. Phages with direct terminal repeats (DTRs) or inverted terminal repeats (ITRs) contain identical sequences at both ends of a linear genome. During assembly, these repeated termini can be collapsed or joined by the assembler, producing a circular path in the graph despite the genome having defined physical ends. This effect is particularly common when repeat lengths exceed read length or when coverage across the termini is high and uniform.
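
To make the idea of repeated termini concrete, the sketch below checks whether the start of a linear contig reappears at its end, either in the same orientation (a direct repeat) or as its reverse complement (an inverted repeat). It is deliberately simple and rests on several assumptions: it only finds exact matches, it only helps when the assembler has left both copies of the repeat in the contig (a collapsed repeat leaves nothing to find in the sequence), and `terminal_repeat`, `min_len` and `max_len` are hypothetical names with arbitrary defaults.

```python
# Translation table for complementing DNA; other characters (e.g. N) are left as-is.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def terminal_repeat(seq, min_len=20, max_len=None):
    """Return ("DTR", k) or ("ITR", k) for the longest exact terminal repeat
    of length k >= min_len, or (None, 0) if none is found.

    min_len=20 is an arbitrary illustrative cut-off, not a biological rule.
    """
    seq = seq.upper()
    if max_len is None:
        max_len = len(seq) // 2  # the two copies must not overlap
    for k in range(max_len, min_len - 1, -1):
        head, tail = seq[:k], seq[-k:]
        if head == tail:
            return ("DTR", k)  # same sequence, same orientation, at both ends
        if head == revcomp(tail):
            return ("ITR", k)  # end is the reverse complement of the start
    return (None, 0)
```

A hit from a check like this is only a hint: short exact matches can occur by chance, so read mapping across the termini is still needed to support the call.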

As a result, circular paths in assembly graphs may reflect circular permutation or terminal repeat architecture, rather than true genome circularity. Correct interpretation therefore requires additional evidence, such as read mapping patterns, coverage shifts at genome ends, identification of repeat boundaries, or known packaging strategies. Careful evaluation of these features is essential to avoid incorrect assignment of genome topology during phage genome annotation and database submission.
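
One of the lines of evidence listed above, coverage shifts at genome ends, can be quantified with a few lines of code. The sketch below compares mean read depth in a window at each end of a contig with the median depth of the interior. It assumes a list of per-base depths is already available (for example, exported from a read-mapping tool), and the function name `terminal_coverage_ratio` and the default `window=500` are hypothetical choices. No threshold is implied.

```python
from statistics import mean, median


def terminal_coverage_ratio(depths, window=500):
    """Return (left_ratio, right_ratio): mean depth in each terminal window
    divided by the median depth of the interior of the contig.

    window=500 is an arbitrary illustrative size; choose one that suits the
    expected repeat length and read length.
    """
    if len(depths) < 3 * window:
        raise ValueError("contig too short for the chosen window")
    interior = median(depths[window:-window])
    if interior == 0:
        raise ValueError("interior depth is zero")
    left = mean(depths[:window]) / interior
    right = mean(depths[-window:]) / interior
    return left, right
```

Ratios close to 1 at both ends are what uniform coverage would produce, while a ratio well above 1 at an end would be consistent with a repeated terminus. Either result should be weighed together with the other evidence described above.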

| Genome architecture | Assembly graph appearance | Read-mapping signature | Interpretation |
|---|---|---|---|
| True circular genome | Single circular path | Reads map continuously across the entire genome; long reads span the circular junction | Genome is biologically circular; possible ssDNA phage or non-tailed dsDNA phage |
| Circularly permuted linear genome | Circular graph despite linear packaging | Uniform read coverage across the genome; no consistent coverage peak or drop; reads map across any chosen start position | Linear genome with no fixed biological start site |
| Direct terminal repeats (DTRs) | Circular graph or short loop at ends | Increased coverage over terminal regions; reads or read pairs map to both ends of the assembly | Linear genome with repeated termini |
| Inverted terminal repeats (ITRs) | Circular or ambiguous graph structure | Reads map to both genome ends in opposite orientations; coverage enrichment at termini | Linear genome with inverted terminal repeats |
| Assembly artefact / unresolved repeats | Bubbles or multiple alternative paths | Uneven coverage; weak or conflicting read support across paths | Likely technical artefact requiring resolution |