
This vignette covers the per-cell genome annotation wrappers. All of them are built on the MAP framework — see Map scripts for the underlying mechanism and how to write your own.

QUAST — assembly quality

QUAST computes standard assembly quality metrics from contigs, such as N50, the number and total length of contigs, GC content, and misassemblies. It is the de facto standard for reporting assembly statistics and for comparing the quality of different assemblies.

  • Website: https://github.com/ablab/quast
  • If you use this tool, please cite: Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–1075.
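To make the N50 metric concrete, here is a minimal sketch in base R that computes it from a vector of contig lengths. The helper `compute_n50()` and the toy lengths are illustrative assumptions; they are not part of Bascet or QUAST, and QUAST's own reporting may apply additional filters (for example a minimum contig length).

```r
# N50: the contig length L such that contigs of length >= L together
# cover at least half of the total assembly length.
# Hypothetical helper for illustration; not a Bascet or QUAST function.
compute_n50 <- function(contig_lengths) {
  stopifnot(is.numeric(contig_lengths), length(contig_lengths) > 0)
  sorted <- sort(contig_lengths, decreasing = TRUE)
  half_total <- sum(sorted) / 2
  # First contig at which the running total reaches half of the assembly
  sorted[which(cumsum(sorted) >= half_total)[1]]
}

# Toy contig lengths (made-up values)
contig_lengths <- c(100, 200, 300, 400, 500)
n50 <- compute_n50(contig_lengths)
print(n50)
```

With these toy values the total length is 1500, and the two longest contigs (500 + 400) already cover more than half of it, so the N50 is 400.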

(SLURM-compatible step)

BascetMapCellQUAST(
  bascetRoot,
  inputName  = "contigs",  #or other source of contigs
  outputName = "quast"
)

Then aggregate the results for visualization. This example caches the result to speed up reloading; caching is optional.

quast_aggr <- BascetCacheComputation(
  bascetRoot,
  "cache_quast",
  MapListAsDataFrame(BascetAggregateMap(
    bascetRoot,
    "quast",
    aggr.quast
  ))
)
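The idea behind caching an aggregation can be sketched in base R as follows. This is only an illustration of the general pattern (store the result on disk, reuse it on later calls); `cache_rds()` is a hypothetical helper, and nothing here is claimed to reflect how `BascetCacheComputation()` is implemented.

```r
# Hypothetical helper, not part of Bascet: cache the value of `expr`
# in an RDS file and reuse it on later calls.
cache_rds <- function(path, expr) {
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: `expr` is never evaluated
  }
  value <- expr            # cache miss: force the lazily evaluated argument
  saveRDS(value, path)
  value
}

# Stand-in for an expensive aggregation; counts how often it really runs
n_evaluations <- 0
slow_computation <- function() {
  n_evaluations <<- n_evaluations + 1
  data.frame(cell = c("A", "B"), N50 = c(1200, 3400))  # made-up values
}

cache_file <- tempfile(fileext = ".rds")
first  <- cache_rds(cache_file, slow_computation())  # computes and stores
second <- cache_rds(cache_file, slow_computation())  # reads from disk
print(n_evaluations)
```

Because R evaluates function arguments lazily, the expensive expression runs only when the cache file is missing. Delete the file to force a recomputation.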

Abricate — AMR / virulence screening

Abricate screens contigs in bulk against several curated databases of antimicrobial resistance and virulence genes (e.g. NCBI, CARD, ResFinder, VFDB, PlasmidFinder). It reports only acquired resistance genes, not resistance-conferring point mutations, and is widely used because it is fast and its output is easy to interpret.

The NCBI database is used by default. See ListDatabaseAbricate() for a list of other databases.

(SLURM-compatible step)

BascetMapCellAbricate(
  bascetRoot,
  inputName  = "contigs",  #or other source of contigs
  outputName = "abricate",
  db         = "ncbi"
)

Then aggregate the per-cell results:

abricate_mat <- BascetAggregateAbricate(
  bascetRoot,
  inputName = "abricate"
)
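Once screening hits are available as a cell-by-gene table, typical first summaries are the number of genes per cell and the prevalence of each gene across cells. The sketch below uses a mock presence/absence matrix; the layout (rows are cells, columns are genes) and the values are assumptions for illustration, so inspect the actual structure of `abricate_mat` before adapting it.

```r
# Mock presence/absence matrix (made-up values).
# Assumed layout: one row per cell, one column per gene.
hits <- matrix(
  c(1, 0, 1,
    1, 0, 0,
    0, 0, 1,
    1, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("cell", 1:4), c("blaTEM", "tetA", "sul1"))
)

# Number of distinct genes detected in each cell
genes_per_cell <- rowSums(hits > 0)

# Fraction of cells in which each gene was detected
prevalence <- colMeans(hits > 0)

print(genes_per_cell)
print(sort(prevalence, decreasing = TRUE))
```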

Bakta — genome annotation

Bakta is a fast, standardized annotation tool for bacterial genomes and plasmids. It identifies coding sequences, ncRNAs, tRNAs, CRISPR arrays, and more, and assigns functional descriptions through alignment-free sequence identification. It is a good choice when you want a complete genome annotation comparable across cells.

  • Website: https://github.com/oschwengers/bakta
  • If you use this tool, please cite: Schwengers O, Jelonek L, Dieckmann MA, Beyvers S, Blom J, Goesmann A. Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics. 2021;7(11):000685.

First download a database:

DownloadDatabaseBakta(
  dbdir  = "~/bakta",  # create this directory before running the command
  dbtype = "light"
)

You can then run Bakta on all cells:

(SLURM-compatible step)

BascetMapCellBakta(
  bascetRoot,
  inputName  = "contigs",  #or other source of contigs
  outputName = "bakta",
  db         = "~/bakta"
)

Then aggregate the results for visualization. This example caches the result to speed up reloading; caching is optional.

bakta_aggr <- BascetCacheComputation(
  bascetRoot,
  "cache_bakta",
  MapListAsDataFrame(BascetAggregateMap(
    bascetRoot,
    "bakta",
    aggr.bakta
  ))
)

Ariba — AMR identification from reads

Ariba detects antimicrobial resistance genes (and other gene panels) directly from sequencing reads, without first assembling. It builds local assemblies around reference genes and reports SNPs and indels relative to the reference, which makes it useful when assembly quality is too low to trust contig-based screening.

  • Website: https://github.com/sanger-pathogens/ariba
  • If you use this tool, please cite: Hunt M, Mather AE, Sánchez-Busó L, Page AJ, Parkhill J, Keane JA, Harris SR. ARIBA: rapid antimicrobial resistance genotyping directly from sequencing reads. Microbial Genomics. 2017;3(10):e000131.

(SLURM-compatible step)

BascetMapCellAriba(
  bascetRoot,
  inputName  = "filtered",
  outputName = "ariba",
  db         = "/path/to/ariba_db/out.prepareref"
)

Then aggregate the per-cell results:

ariba_mat <- BascetAggregateAriba(
  bascetRoot,
  inputName = "ariba"
)

AMRfinder — NCBI AMRFinderPlus

AMRFinderPlus screens contigs (or proteins) against the NCBI Reference Gene Catalog to identify acquired AMR genes, point mutations conferring resistance, and selected virulence and stress-response genes. It is maintained by NCBI and is the source for many downstream AMR databases.

  • Website: https://github.com/ncbi/amr
  • If you use this tool, please cite: Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W. AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Scientific Reports. 2021;11(1):12728.

First download the database:

DownloadDatabaseAMRfinder("/path/to/amrfinder_db")

Then run on all cells:

(SLURM-compatible step)

BascetMapCellAMRfinder(
  bascetRoot,
  inputName  = "contigs",
  outputName = "AMRfinder",
  db         = "/path/to/amrfinder_db"
)

Then aggregate the per-cell results:

amr_df <- BascetAggregateAMRfinder(
  bascetRoot,
  inputName = "AMRfinder"
)
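If the aggregated hits come as a long table with one row per hit, a cross-tabulation gives a quick overview of which resistance classes occur in which cells. The sketch below works on a mock data frame; the column names `cell`, `gene`, and `class` and all values are assumptions for illustration and may not match the columns of `amr_df`.

```r
# Mock long-format hit table (made-up values, hypothetical column names)
amr_hits <- data.frame(
  cell  = c("cell1", "cell1", "cell2", "cell3", "cell3", "cell3"),
  gene  = c("blaTEM-1", "tet(A)", "blaTEM-1", "sul1", "tet(A)", "aadA1"),
  class = c("BETA-LACTAM", "TETRACYCLINE", "BETA-LACTAM",
            "SULFONAMIDE", "TETRACYCLINE", "AMINOGLYCOSIDE")
)

# Cell-by-class table of hit counts
class_by_cell <- table(amr_hits$cell, amr_hits$class)

# Number of cells carrying at least one gene of each class
cells_per_class <- colSums(class_by_cell > 0)

print(class_by_cell)
print(cells_per_class)
```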

GECCO — biosynthetic gene clusters

GECCO predicts biosynthetic gene clusters (BGCs) in assembled contigs using a conditional random field over Pfam domain compositions. It is much faster than antiSMASH and is well suited to scanning thousands of single-cell assemblies for natural-product potential.

  • Website: https://gecco.embl.de/
  • If you use this tool, please cite: Carroll LM, Larralde M, Fleck JS, Ponnudurai R, Milanese A, Cappio Barazzone E, Zeller G. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv. 2021. doi:10.1101/2021.05.03.442509

(SLURM-compatible step)

BascetMapCellGECCO(
  bascetRoot,
  inputName  = "contigs",
  outputName = "gecco"
)

Aggregate the per-cell cluster tables into a single list of data.frames:

gecco_aggr <- BascetAggregateGECCO(
  bascetRoot,
  inputName = "gecco"
)
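A list with one data.frame per cell is often easier to plot or filter after it has been stacked into a single table with a cell identifier column. The sketch below shows one way to do that in base R on a mock list; it assumes the list is named by cell, and the column names `cluster_id` and `type` are illustrative rather than a description of what `BascetAggregateGECCO()` returns.

```r
# Mock per-cell cluster tables (made-up values, hypothetical column names).
# Assumption: the list is named by cell; cell2 has no predicted clusters.
clusters_by_cell <- list(
  cell1 = data.frame(cluster_id = c("c1", "c2"), type = c("NRP", "Polyketide")),
  cell2 = data.frame(cluster_id = character(0), type = character(0)),
  cell3 = data.frame(cluster_id = "c1", type = "Terpene")
)

# Drop cells without clusters, then tag each remaining table with its cell name
non_empty <- Filter(function(df) nrow(df) > 0, clusters_by_cell)
with_cell <- Map(
  function(df, cell) cbind(cell = cell, df),
  non_empty, names(non_empty)
)

# Stack everything into one data.frame
all_clusters <- do.call(rbind, with_cell)
rownames(all_clusters) <- NULL

# Number of predicted clusters per cell
clusters_per_cell <- table(all_clusters$cell)
print(all_clusters)
print(clusters_per_cell)
```

Cells without any predicted cluster drop out of the stacked table, so keep the original list names if you later need to report zero counts.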