Bascet has two types of MAP functions
Zorn/Bascet is designed to let you run all kinds of software that operates on your cells. The input can be the raw reads, the contigs, or any other data that you produce. Because these operations can be computationally intense, all of this happens through the MAP framework.
Bascet currently hosts two types of MAP functions:
- MAP functions written in Rust for highest performance. If the original software is not written in Rust, we perform static analysis-mediated LLM translation to make it compatible
- MAP functions calling software through shell scripts. This is not as fast but avoids the creation of thousands (or millions) of small files
Calling a MAP function — one example
All MAP wrappers follow the same pattern: they take an input shard, run a per-cell tool on it, and produce an output shard. Here is QUAST as a representative example:
BascetMapCellQUAST(
bascetRoot,
inputName = "contigs",
outputName = "quast"
)
This is a thin wrapper around BascetMapCell that knows the right script name. The raw form is equivalent:
BascetMapCell(
bascetRoot,
withfunction = "_quast",
inputName = "contigs",
outputName = "quast"
)
For the full QUAST workflow (including aggregation), see genome annotation.
Available MAP wrappers
Each wrapper is documented in detail in the vignette listed in the
last column. That vignette covers any required databases, the matching
BascetAggregate*() step, and downstream usage.
| Wrapper | Tool | Typical input | Typical output | Documented in |
|---|---|---|---|---|
| BascetMapCellSKESA | SKESA de novo assembly | filtered | contigs | assembly |
| BascetMapCellFASTQC | FastQC read QC | filtered | fastqc | read quality control |
| BascetMapCellQUAST | QUAST assembly QC | contigs | quast | genome annotation |
| BascetMapCellAbricate | Abricate AMR / virulence | contigs | abricate | genome annotation |
| BascetMapCellBakta | Bakta genome annotation | contigs | bakta | genome annotation |
| BascetMapCellAriba | Ariba AMR from reads | filtered | ariba | genome annotation |
| BascetMapCellAMRfinder | NCBI AMRfinder | contigs | AMRfinder | genome annotation |
| BascetMapCellGECCO | GECCO biosynthetic clusters | contigs | gecco | genome annotation |
| BascetComputeMinhash | k-mer minhash sketches | filtered | minhash | k-mer analysis |
Arguments to MAP functions
Some scripts require additional arguments (such as the path to a database file). These are passed by setting the args argument. Its entries are set as environment variables, so that the script can pick up their contents.
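On the script side, picking up such values could look like the sketch below. The variable names DATABASE_DIR and SEARCH_MODE are hypothetical and chosen only for illustration. They are not names that Bascet defines, and the function read_db_settings is not part of any existing script.

```shell
#!/usr/bin/env bash
# Sketch only: DATABASE_DIR and SEARCH_MODE are hypothetical variable names,
# assumed to have been set through the args argument.

read_db_settings() {
    # Required value: fail early with a clear message if it was not passed
    local db_dir="${DATABASE_DIR:-}"
    # Optional value: fall back to a default if it was not passed
    local mode="${SEARCH_MODE:-fast}"

    if [ -z "$db_dir" ]; then
        echo "DATABASE_DIR is not set" >&2
        return 1
    fi

    echo "db=$db_dir mode=$mode"
}
```

Checking required variables at the start of the script gives a readable error once, instead of a tool failure repeated for every cell.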
Aggregating MAP results
Once you have run your map function, you most likely want to load the
results into R. We call this procedure “aggregate”. Most wrappers have a
matching BascetAggregate*() function — see the per-tool
vignette for the exact call.
There is also a catch-all aggregate function, which is called in a slightly different way. The example below takes “out.txt”, generated by each tool, and stores the raw file content in a list. This is not pretty, but it may help you with debugging and development:
raw_aggr <- MapListAsDataFrame(BascetAggregateMap(
bascetRoot,
inputName="..",
aggr.raw("out.txt")
))
Custom MAP functions - introduction
It is easy to add new functions! The easiest way is to simply copy and modify the code for an existing script. You can start from either
- QUAST, which takes contigs as input
- SKESA, which takes FASTQ as input
Once you have written your script, you invoke it with a direct path:
BascetMapCell(
bascetRoot,
withfunction = "/path/to/your/script.sh",
inputName = "...",
outputName = "..."
)
In most cases you want to write your own aggregate function. This function will take the output from your tool, parse it, and put it in a sensible R object. Have a look at the existing aggregate functions for inspiration.
Custom MAP functions - details
If you look at any of our MAP functions, you will find that it is a BASH script that conforms to a certain pattern. In fact, it is just a script (in any language) that accepts the following command-line arguments.
--bascet-api
The script returns the API version, also validating that it is a valid script for MAP calls
--expect-files
The script returns a list of what files to extract from the Bascet, for each cell. Here, “*” means to get everything. Asking for less means higher performance
--missing-file-mode
Tells Bascet what to do if the files are missing. “skip” means to just proceed with the next cell
--compression-mode
How to compress the output files. “default” means to compress. However, if your tool generates compressed files already, it is just a waste of time trying to do it again, in which case the script can return “uncompressed”.
--input-dir XXX
This is the directory where input files are located
--output-dir YYY
Where to store the output. This directory has already been created
--num-threads ZZZ
How many threads to use for this particular process. Note that Bascet is already calling multiple MAP scripts in parallel, so there is typically little benefit in making individual processes multithreaded
--recommend-threads
Return how many threads (at least 1) the job should get. This is used if the user runs mapcell but only specifies the total number of threads. Bascet will then try to allocate workers accordingly. Return 1 if your mapcell script does not support multithreading
--preflight-check
This is called only once, to check that the script has the needed software dependencies. If so, it returns “MAPCELL-CHECK”
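The arguments above can be put together into a minimal script skeleton. This is a sketch under stated assumptions, not a copy of any script shipped with Bascet: it assumes that "returns" means printing to standard output, the API version string "1.0" and the file name contigs.fa are placeholders, and wc stands in for the real per-cell tool. The logic is wrapped in a function (map_script, a hypothetical name) only to make it easy to try out.

```shell
#!/usr/bin/env bash
# Sketch of a custom MAP script. Values marked "assumption" are placeholders.

map_script() {
    local input_dir="" output_dir="" num_threads=1

    while [ "$#" -gt 0 ]; do
        case "$1" in
            # Queries: print one answer and stop
            --bascet-api)        echo "1.0"; return 0 ;;         # assumption: version format
            --expect-files)      echo "contigs.fa"; return 0 ;;  # assumption: file name; "*" = everything
            --missing-file-mode) echo "skip"; return 0 ;;
            --compression-mode)  echo "default"; return 0 ;;
            --recommend-threads) echo "1"; return 0 ;;           # this sketch is single-threaded
            --preflight-check)
                # Check dependencies; wc stands in for the real tool
                if command -v wc >/dev/null 2>&1; then
                    echo "MAPCELL-CHECK"; return 0
                fi
                echo "missing dependency: wc" >&2; return 1 ;;
            # Options that take a value
            --input-dir|--output-dir|--num-threads)
                if [ "$#" -lt 2 ]; then
                    echo "missing value for $1" >&2; return 2
                fi
                case "$1" in
                    --input-dir)   input_dir="$2" ;;
                    --output-dir)  output_dir="$2" ;;
                    --num-threads) num_threads="$2" ;;  # parsed but unused here
                esac
                shift 2 ;;
            *) echo "unknown argument: $1" >&2; return 2 ;;
        esac
    done

    if [ -z "$input_dir" ] || [ -z "$output_dir" ]; then
        echo "need --input-dir and --output-dir" >&2
        return 2
    fi

    # Placeholder for the real tool: write the input size in bytes to out.txt
    wc -c < "$input_dir/contigs.fa" > "$output_dir/out.txt"
}

# Run only when arguments are given, so the file can also be sourced
if [ "$#" -gt 0 ]; then
    map_script "$@"
fi
```

Before handing such a script to BascetMapCell, it can be exercised by hand on a directory containing the files for one cell, passing --input-dir and --output-dir and then inspecting out.txt.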