Informative KMER-based workflow • Zorn

First set up your Zorn/Bascet workdirectory as before. If you wish to run these steps on a SLURM cluster, see separate vignette and adapt accordingly.

library(Zorn)
bascet_runner.default <- LocalRunner(direct = TRUE, showScript=TRUE)
bascetRoot <- "/home/yours/an_empty_workdirectory"

Preprocessing

To extract informative KMERs, you first need to obtain a list of them. Bascet/Zorn offers one way, in which we first extract minhashes from each cell. These are finger prints that we later can look for in other cells.

(SLURM-compatible step)

### Compute minhashes for each cell
BascetComputeMinhash(
  bascetRoot
)

### Gather minhashes into a single histogram
BascetMakeMinhashHistogram(
  bascetRoot
)

If two cells of the same species are shallowly sequenced, they may not share minhashes, simply because the reads don’t span all of their genomes. We can however look for KMERs that are more frequently shared by counting minhashes across all cells, and generating a histogram.

kmerHist <- BascetReadMinhashHistogram(bascetRoot)

kmerHist$rank <- 1:nrow(kmerHist)
ggplot(kmerHist[kmerHist$cnt>2,], aes(rank, cnt)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

It is an open problem which KMERs to pick. KMERs of low abundance will not help clustering much as the UMAP algorithm will not find many other cells they are correlated with. Likewise, KMERs present in each cell are not informative either. In practice, we don’t see this happening; just picking the most common top 10,000 KMERs or so seems to give useful results. You can also pick KMERs based on a cutoff:

### Pick KMERs
useKMERs <- kmerHist$kmer[kmerHist$cnt>5]  #Pick KMERs present in more than 5 cells

Using the list of KMERs, you can now re-scan all genomes for their presence.

(SLURM-compatible step)

### Build count table by looking up selected KMERs in per-cell KMER databases
BascetQueryFq(
  bascetRoot,
  useKMERs=useKMERs
)

The output format is one binary HDF5 file for each shard, roughly in the Anndata file format. To load these, use the following command that loads the files and concatenates them into a single matrix. It can then be loaded into Seurat:

library(Seurat)
### Read count matrix
cnt <- ReadBascetCountMatrix(
  bascetRoot,
  "kmer_counts"
  #TODO note that the is a maximum count by default
)

Most likely you will have to filter out the low abundant cells. Note that with this workflow, the cutoff is based on the number of KMERs rather than the number of reads.

  ## Load subset of real cells
  keep_cells <- rowSums(cnt)>20000 #this cutoff is based on the number of KMERs, not read count!
  sum(keep_cells)
  adata <- CreateSeuratObject(
    counts = CreateAssayObject(t(cnt[keep_cells,])),  #note that t() is needed
    assay = "infokmer"
  )

You can now proceed to generate a dimensional reduction in the form of a UMAP. See the tutorials for Seurat and Signac for details. We summarize the required command below, where you can apply either an RNA-seq or ATAC-seq dimensional reduction depending on what you think is appropriate.

if(TRUE){
  #ATAC-seq style analysis
  library(Signac)
  adata <- RunTFIDF(adata)
  adata <- FindTopFeatures(adata, min.cutoff = 'q0')
  adata <- RunSVD(adata)
  DepthCor(adata)
  adata <- RunUMAP(object = adata, reduction = 'lsi', dims = 1:30, reduction.name = "infokmers_umap")  
} else {
  #RNA-seq style analysis
  adata <- NormalizeData(adata)
  adata <- FindVariableFeatures(adata, selection.method = "vst", nfeatures = 2000)
  adata <- ScaleData(adata, features = rownames(adata))
  adata <- RunPCA(adata, features = VariableFeatures(object = adata))
  adata <- RunUMAP(adata, dims = 1:20, reduction.name = "infokmers_umap")
}

You can now plot a UMAP with your cells. Note that the cells do not have any labels. One way to obtain labels is by also running the KRAKEN-based workflow; or you can use other annotation tools through the mapcell-system.

DimPlot(object = adata, reduction = "infokmers_umap") + 
  xlab("KMER1") + 
  ylab("KMER2")