All posts by Ryan Cook

Adding More Reference Genomes to vConTACT2 Clusters

The virus clustering programme vConTACT2 is a fantastic tool for applying taxonomy to large sets of viral contigs. In short, it clusters unknown viruses with those in the RefSeq database based on shared protein clusters.

To give viral clusters even more context, though, you may wish to include reference genomes beyond those in RefSeq.

To supplement the RefSeq genomes, I took all of the phage genomes on MillardLab and removed those already in RefSeq (to avoid duplication). The remaining genomes were processed through dedupe.sh (part of BBTools) at 95% minimum identity to remove highly similar sequences. This led to a custom subset of 7,527 genomes.
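A rough sketch of these two steps is given below. The file names are hypothetical placeholders, and it assumes that each FASTA header starts with an accession that matches the exclusion list exactly (including any version suffix). The `minidentity=95` setting is assumed to be the dedupe.sh option corresponding to the 95% minimum ID mentioned above; the exact command used for this dataset is not recorded here.

```shell
# Sketch only: file names are hypothetical placeholders.

# Drop FASTA records whose ID (first word of the header, minus ">")
# appears in a list of accessions, one per line.
# Usage: drop_listed EXCLUDE_LIST FASTA
drop_listed() {
    awk 'FILENAME == ARGV[1] { skip[$1] = 1; next }
         /^>/                { keep = !(substr($1, 2) in skip) }
         keep' "$1" "$2"
}

# Collapse near-identical genomes with BBTools dedupe.sh.
# Usage: dedupe_95 IN_FASTA OUT_FASTA
dedupe_95() {
    dedupe.sh in="$1" out="$2" minidentity=95
}

# Example usage (not run here):
#   drop_listed refseq_accessions.txt millardlab_genomes.fasta > non_refseq.fasta
#   dedupe_95 non_refseq.fasta custom_subset.fasta
```

Keeping the filter separate from the deduplication makes it easy to check how many genomes each step removes.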

Genes were called on the 7,527 genomes using Prodigal. From this, .faa and .csv mapping files were produced so the reference genomes could be used to supplement vConTACT2 clustering.
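For anyone wanting to build similar files from their own genomes, the sketch below shows one way it could be done. It is not necessarily how the files offered here were made: the file names are hypothetical, the `-p meta` mode is an assumption, and `None` is used as a placeholder in the keywords column. It relies on Prodigal naming each protein `<contig>_<n>`, and on the `protein_id,contig_id,keywords` layout of the vConTACT2 mapping file.

```shell
# Sketch only: file names are hypothetical placeholders.

# Call genes with Prodigal; -a writes the protein translations (.faa).
# -p meta is an assumption, chosen because the input mixes many genomes.
# Usage: call_genes GENOMES_FASTA PROTEINS_FAA
call_genes() {
    prodigal -i "$1" -a "$2" -p meta -q > /dev/null
}

# Build a vConTACT2-style mapping from Prodigal's FASTA headers.
# Stripping the trailing _<n> from a protein ID recovers the contig ID.
# Usage: make_mapping PROTEINS_FAA > MAPPING_CSV
make_mapping() {
    echo "protein_id,contig_id,keywords"
    awk '/^>/ {
        protein = substr($1, 2)
        contig = protein
        sub(/_[0-9]+$/, "", contig)
        print protein "," contig ",None"   # "None" is a placeholder keyword
    }' "$1"
}

# Example usage (not run here):
#   call_genes custom_subset.fasta database.faa
#   make_mapping database.faa > database.csv
```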

Click HERE for the mapping (.csv) file.
Click HERE for the sequence (.faa) file.

Furthermore, a list of these genomes can be obtained from the mapping file using the following command (potentially useful when visualising the resultant network):

awk -F ',' '{print $2}' database.csv | sort | uniq
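If the mapping file starts with a header row (assumed here to be `protein_id,contig_id,keywords`), that header's second field will also show up in the list. The variant below is a small sketch that skips the first line and additionally counts the proteins per genome; the function name is made up for illustration.

```shell
# Sketch, assuming the CSV starts with a header row
# (protein_id,contig_id,keywords). Prints "genome,protein_count" lines.
# Usage: count_proteins MAPPING_CSV
count_proteins() {
    awk -F ',' 'NR > 1 { n[$2]++ }
                END    { for (g in n) print g "," n[g] }' "$1" | sort
}

# Example usage (not run here):
#   count_proteins database.csv > proteins_per_genome.csv
```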

Happy clustering!

Updating the DIAMOND database file for ViromeQC

The new virome quality control software, ViromeQC, determines viral enrichment of sequenced viromes. In short, FASTQ reads are aligned to ribosomal sequences using Bowtie2 and to bacterial signature sequences using DIAMOND. These markers of bacterial contamination are used to estimate viral enrichment.

The pipeline was built using DIAMOND v.0.9.9. At the time of writing, the latest version of DIAMOND is v.0.9.29. Somewhere between these two versions, the format of DIAMOND databases changed. Therefore, if you have the latest version of DIAMOND, the pipeline will not run properly and you may see this error:

Error: Database was built with an older version of Diamond and is incompatible.

The issue is with the database:

viromeqc/index/amphora_bacteria.dmnd

To overcome this, I installed DIAMOND v.0.9.9, extracted the sequences from the database, and produced a new database using DIAMOND v.0.9.29 as follows:

/v.0.9.9/diamond getseq -d amphora_bacteria.dmnd | /v.0.9.29/diamond makedb -d new_db.dmnd
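The same conversion can also be done in two steps with an intermediate FASTA file, which makes it possible to check that sequences were actually extracted before rebuilding. This is only a sketch: the function name and the intermediate file name are hypothetical, and the paths to the two DIAMOND binaries are passed in as arguments.

```shell
# Sketch only: binary paths and file names are supplied by the caller.
# Usage: rebuild_db OLD_DIAMOND NEW_DIAMOND OLD_DB NEW_DB_PREFIX
rebuild_db() {
    old_diamond=$1; new_diamond=$2; old_db=$3; new_db=$4
    fasta="${new_db}.faa"

    # Dump the sequences with the old binary, which can still read the database.
    "$old_diamond" getseq -d "$old_db" > "$fasta" || return 1

    # Stop if nothing was extracted, rather than building an empty database.
    if [ ! -s "$fasta" ]; then
        echo "no sequences extracted from $old_db" >&2
        return 1
    fi

    # Rebuild the database with the new binary.
    "$new_diamond" makedb --in "$fasta" -d "$new_db"
}

# Example usage (not run here):
#   rebuild_db /v.0.9.9/diamond /v.0.9.29/diamond amphora_bacteria.dmnd new_db
```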

The new version of the database can be downloaded here:

http://s3.climb.ac.uk/ADM_share/crap/amphora_bacteria.dmnd

Replace the old database with the new one and ViromeQC should run beautifully.
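One cautious way to do the replacement is to keep a copy of the original database alongside the new one. The sketch below assumes the database lives at `index/amphora_bacteria.dmnd` inside the ViromeQC directory, as shown above; the function name and the `.bak` suffix are made up for illustration.

```shell
# Sketch only: paths are supplied by the caller.
# Usage: swap_db NEW_DB VIROMEQC_DIR
swap_db() {
    target="$2/index/amphora_bacteria.dmnd"

    # Refuse to continue if the new database is missing.
    [ -f "$1" ] || { echo "new database not found: $1" >&2; return 1; }

    # Back up the original once; never overwrite an existing backup.
    if [ -f "$target" ] && [ ! -e "$target.bak" ]; then
        cp -p "$target" "$target.bak" || return 1
    fi

    cp "$1" "$target"
}

# Example usage (not run here):
#   swap_db new_db.dmnd /path/to/viromeqc
```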