Bascet has two types of MAP functions

Zorn/Bascet is designed to let you run all kinds of software on your cells. The input can be the raw reads, the contigs, or any other data that you produce. Because these operations can be computationally intensive, all of this happens through the MAP framework.

Bascet currently hosts two types of MAP functions:

  • MAP functions written in Rust for the highest performance. If the original software is not written in Rust, we perform static analysis-mediated LLM translation to make it compatible.
  • MAP functions that call software through shell scripts. This is not as fast, but it avoids the creation of thousands (or millions) of small files.

Calling a MAP function — one example

All MAP wrappers follow the same pattern: they take an input shard, run a per-cell tool on it, and produce an output shard. Here is QUAST as a representative example:

(SLURM-compatible step)

BascetMapCellQUAST(
  bascetRoot,
  inputName  = "contigs",
  outputName = "quast"
)

This is a thin wrapper around BascetMapCell that knows the right script name. The raw form is equivalent:

BascetMapCell(
  bascetRoot,
  withfunction = "_quast",
  inputName    = "contigs",
  outputName   = "quast"
)

For the full QUAST workflow (including aggregation), see genome annotation.

Available MAP wrappers

Each wrapper is documented in detail in the vignette listed in the last column. That vignette covers any required databases, the matching BascetAggregate*() step, and downstream usage.

Wrapper                | Tool                         | Typical input | Typical output | Documented in
-----------------------|------------------------------|---------------|----------------|---------------------
BascetMapCellSKESA     | SKESA de novo assembly       | filtered      | contigs        | assembly
BascetMapCellFASTQC    | FastQC read QC               | filtered      | fastqc         | read quality control
BascetMapCellQUAST     | QUAST assembly QC            | contigs       | quast          | genome annotation
BascetMapCellAbricate  | Abricate AMR / virulence     | contigs       | abricate       | genome annotation
BascetMapCellBakta     | Bakta genome annotation      | contigs       | bakta          | genome annotation
BascetMapCellAriba     | Ariba AMR from reads         | filtered      | ariba          | genome annotation
BascetMapCellAMRfinder | NCBI AMRfinder               | contigs       | AMRfinder      | genome annotation
BascetComputeMinhash   | k-mer minhash sketches       | filtered      | minhash        | k-mer analysis

Arguments to MAP functions

Some scripts require additional arguments (such as the path to a database file). These are passed by setting the args argument. The call below sets two environment variables whose contents can then be picked up by the script:

BascetMapCell(
  ...
  args = list(DB = "some/path", OTHERARG = "hi")
  ...
)
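On the script side, the values arrive as ordinary environment variables. The following is a minimal sketch of how a bash-based MAP script might read them, using the DB and OTHERARG names from the call above. The helper name read_map_args, the choice to treat DB as required, and the fallback value for OTHERARG are illustrative assumptions, not part of Bascet:

```shell
# Hypothetical helper: read values passed through args as environment variables.
read_map_args() {
  # DB is treated as required in this sketch: fail if it is unset or empty.
  if [ -z "${DB:-}" ]; then
    echo "DB is not set; pass it through the args argument" >&2
    return 1
  fi
  # OTHERARG is treated as optional, with an illustrative default.
  local other="${OTHERARG:-none}"
  echo "db=${DB} other=${other}"
}
```

Failing early on a missing variable gives a clearer error than letting the wrapped tool fail later on an empty path.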

Aggregating MAP results

Once you have run your MAP function, you most likely want to load the results into R. We call this procedure “aggregate”. Most wrappers have a matching BascetAggregate*() function; see the per-tool vignette for the exact call.

There is also a catch-all aggregate function, which is called in a slightly unusual way. The example below takes the “out.txt” file generated by the tool for each cell and stores the raw file content in a list. This is not pretty, but it may help you in debugging and development:

raw_aggr <- MapListAsDataFrame(BascetAggregateMap(
  bascetRoot,
  inputName="..",
  aggr.raw("out.txt")
))

Custom MAP functions - introduction

It is easy to add new functions! The easiest way is to copy and modify the code for an existing script. You can start from either:

  • QUAST, which takes contigs as input
  • SKESA, which takes FASTQ as input

Once you have written your script, you invoke it with a direct path:

BascetMapCell(
  bascetRoot,
  withfunction = "/path/to/your/script.sh",
  inputName = "...",
  outputName = "..."
)

In most cases you want to write your own aggregate function. This function takes the output from your tool, parses it, and puts it in a sensible R object. Have a look at the example and the existing aggregate functions for inspiration.

Custom MAP functions - details

If you look at any of our MAP functions, you will find that it is a BASH script that conforms to a certain pattern. In fact, a MAP function can be a script in any language, as long as it accepts the command line arguments below.

--bascet-api

The script returns the API version. This also confirms that it is a valid script for MAP calls.

--expect-files

The script returns a list of the files to extract from the Bascet for each cell. Here, “*” means to get everything. Asking for fewer files gives higher performance.

--missing-file-mode

The script tells Bascet what to do if the expected files are missing. “skip” means to just proceed with the next cell.

--compression-mode

How to compress the output files. “default” means to compress. However, if your tool generates compressed files already, it is just a waste of time trying to do it again, in which case the script can return “uncompressed”.
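The four query flags described so far can be answered by a single case statement. This is a minimal bash sketch, not a copy of an existing Bascet script: the function name mapcell_query, the version placeholder, and the file name contigs.fa are assumptions made for illustration. Take the exact strings that Bascet expects from an existing script such as QUAST.

```shell
# Sketch of the query part of a MAP script (names and strings are assumptions).
mapcell_query() {
  case "$1" in
    # Placeholder only: use the real API version string from an existing script.
    --bascet-api)        echo "API-VERSION-PLACEHOLDER" ;;
    # Hypothetical file name; "*" would request every file for the cell.
    --expect-files)      echo "contigs.fa" ;;
    # "skip" proceeds with the next cell when files are missing.
    --missing-file-mode) echo "skip" ;;
    # Return "uncompressed" instead if the tool already writes compressed output.
    --compression-mode)  echo "default" ;;
    # Anything else is not a query flag.
    *)                   return 1 ;;
  esac
}
```

Returning a non-zero status for unknown flags lets the caller fall through to the code that does the actual per-cell work.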

--input-dir XXX

This is the directory where input files are located

--output-dir YYY

The directory in which to store the output. This directory has already been created.

--num-threads ZZZ

How many threads to use for this particular process. Note that Bascet already calls multiple MAP scripts in parallel, so there is typically little benefit in making individual processes multithreaded.
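The three execution arguments can be parsed with a simple loop. The bash sketch below assumes that each option is followed by exactly one value, as the XXX/YYY/ZZZ placeholders suggest. The function name mapcell_run is hypothetical, and the file-counting step is only a stand-in for the real per-cell tool; it writes out.txt to match the file name used in the aggregate example.

```shell
# Sketch of the per-cell execution step of a MAP script (names are assumptions).
mapcell_run() {
  local input_dir="" output_dir="" num_threads=1

  # Each of the three options takes exactly one value.
  while [ "$#" -gt 0 ]; do
    if [ "$#" -lt 2 ]; then
      echo "missing value for $1" >&2
      return 1
    fi
    case "$1" in
      --input-dir)   input_dir="$2" ;;
      --output-dir)  output_dir="$2" ;;
      --num-threads) num_threads="$2" ;;
      *) echo "unknown argument: $1" >&2; return 1 ;;
    esac
    shift 2
  done

  if [ -z "$input_dir" ] || [ -z "$output_dir" ]; then
    echo "--input-dir and --output-dir are required" >&2
    return 1
  fi

  # Placeholder for the real tool: count the extracted input files.
  local n=0 f
  for f in "$input_dir"/*; do
    if [ -e "$f" ]; then n=$((n + 1)); fi
  done

  # The output directory already exists, so no mkdir is needed.
  printf 'files=%d threads=%d\n' "$n" "$num_threads" > "$output_dir/out.txt"
}
```

A real script would replace the counting loop with a call to the wrapped tool and pass num_threads on to it if the tool supports multithreading.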

--recommend-threads

Return how many threads (at least 1) the job should get. This is used if the user runs mapcell but only specifies the total number of threads. Bascet will then try to allocate workers accordingly. Return 1 if your mapcell script does not support multithreading

--preflight-check

This is called only once, to check that the script has the software dependencies it needs. If it does, the script returns “MAPCELL-CHECK”.
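The last two flags can be sketched in bash as follows. The function names are hypothetical, and the source does not say what a script should do when a dependency is missing; this sketch assumes that printing an error to stderr and returning a non-zero status is acceptable. The tool names to check are passed as arguments so that the same helper works for any wrapped tool.

```shell
# Sketch of --preflight-check: print MAPCELL-CHECK only if every tool is on PATH.
preflight_check() {
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      # Assumed failure behaviour: report on stderr and return non-zero.
      echo "missing dependency: $tool" >&2
      return 1
    fi
  done
  echo "MAPCELL-CHECK"
}

# Sketch of --recommend-threads for a tool without multithreading support.
recommend_threads() {
  echo 1
}
```

For example, a QUAST wrapper might call preflight_check with the name of the QUAST executable installed on your system.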