Count sketch KMER-based workflow • Zorn

First set up Zorn/Bascet according to the install instructions. This tutorial assumes that you have debarcoded the reads.

When to use the Count sketch KMER-based workflow

This workflow takes debarcoded reads from each cell and maps each cell to a high-dimensional vector space. This space does not retain the original KMERs but can be queried for the similarity of cell content, and the presence of a set of KMERs. The workflow shines in that it does not require any prior knowledge, and it makes use of all the sequencing data captured for each cell. It is a good “second workflow” after first using the KRAKEN2-based workflow, which also assigns taxonomic IDs.

ℹ️ Count sketching relies on the following ideas: - Genomes can be represented as a vector of KMER counts. If these KMERs are long enough, they exist nearly-only once in a genome, and seldom in genomes of different origins - Angles between KMER count vectors ignore sequencing depth and thus represent underlying genome prior to sampling-by-sequencing - The distance between high dimensional vectors is preserved also after random linear projection into a lower dimensional space (lower, but still high-dimensional) - The random projection matrix can be represented implicitly, avoiding the need for the original full KMER vectors

Note that these linear projections are different from Min-Hashes, and have further nice properties beyond the description here.

Preprocessing

To first compute the CountSketch, we will use the BascetRunCountsketch-command. The length of KMERs, and number of reduced dimensions, is set at this stage: - 31 bp is a common choice of KMER lengths as it fits well in 32-bit registers of a CPU, while being rather unique. k=31 is used by default. The KMER length should typically be odd, to remove KMERs symmetric under reverse complement (e.g., AATT). - ⚠️ The reduced number of dimensions must be a power of two, such as 2048, 4096, 8192, 16384 etc. This restriction is due to the choice of underlying algorithm. The amount of dimensions controls how pairwise distances between cells are preserved in the sketch, following the Johnson-Lindenstrauss lemma. Higher dimensions yield more accurate distance estimates and generally improve clustering.

The following command computes a countsketch matrix. The input is here the raw reads, which is suitable for metatranscriptomics, and works for metagenomics as well. For higher precision, you can also use the de novo assembly commands to extract unique KMERs as input for sketching (not shown here).

(SLURM-compatible step)

#Gather count sketches into a single matrix file
BascetRunCountsketch(
  bascetRoot,
  kmerSize=31,      #Default
  sketchSize=4096,  #Need to be power of two

)

Postprocessing with Signac/Seurat

You can now load all of this data into R as a Seurat object:

adata <- BascetLoadCountSketchMatrix(
  bascetRoot,
  inputName = "countsketch_mat"
)

The data is rather large; we recommend enabling multithreading for Seurat. Note that in the current version of R/Bioconductor, as of writing, threading might be disabled in RStudio. You then need to run the code in a separate R session. If you see warning messages about threading later on, then this is the only backup we can offer for now!

library(future)
plan("multicore", workers = 10)

Note that the data was loaded directly as a Seurat reduction, rather than as regular counts. This is because the counts have little meaning of their own, and it makes little sense to perform PCA, SVD or similar on the data. Rather, we recommend using all the dimensions, and compute fewer in the first steps if you want to reduce. The following code produces a UMAP, setting the dimension to all available dimensions in the reduction:

adata <- CountSketchUMAP(adata)

The result can be visualized:

DimPlot(adata)

We recommend that you integrate this object with the output of KRAKEN2 to get some clue about what the clusters mean.