Skip to contents

When to perform SNP-analysis

To compare closely related genomes, it is common to analyze SNPs (Single Nucleotide Polymorphisms) - sites in the genome which differ between two or more genomes. This method works fairly well for comparing, e.g., human genomes; for microbes, this analysis is typically restricted to highly similar genomes or subsets of genomes. SNP-analysis can also be use to separate cells from different donors.

Citing

Bascet wraps cellsnp-lite for SNP-calling, so please cite this paper if you use it:

Xianjie Huang, Yuanhua Huang, Cellsnp-lite: an efficient tool for genotyping single cells, Bioinformatics, Volume 37, Issue 23, December 2021, Pages 4569–4571, https://doi.org/10.1093/bioinformatics/btab358

Alignment

SNP-analysis relies on having data aligned to a reference, to which SNPs (differences) are called. First follow the alignment workflow to generate the needed BAM-files.

SNP-calling

You can now check how each cells genome deviate from the reference and obtain SNPs. This is done by calling the integrated CellSNP-lite. The input data must be position sorted reads.

(SLURM-compatible step)

BascetRunCellSNP(
  bascetRoot,
  inputName="myref_aligned",
  numLocalThreads=5,
  runner=SlurmRunner(bascet_runner.default, ncpu="5")
)

Loading the count matrix

The heavy work is now over. You can load the count matrix like this:

# get aligned counts
cnt_myref <- ReadBascetCountMatrix(bascetRoot,"cnt_myref")

# Compute fraction aligned
mapped_reads <- Matrix::rowSums(cnt_myref@X)
cnt_myref@obs$tot_reads <- as.double(cnt_myref@obs$unclassified_reads + mapped_reads)
cnt_myref@obs$frac_mapped <- as.double(mapped_reads / cnt_myref@obs$tot_reads)

Subsetting cells

If you do microbial analysis and only want cells that are similar enough for comparision, you can filter out other cells using the following procedure.

First produce a kneeplot of how much cells deviate SNP-wise:

### Knee plot of alignment
df <- data.frame(
  frac_mapped = sort(cnt_myref$obs$frac_mapped, decreasing = TRUE),
  index = 1:nrow(cnt_myref$obs)
)
ggplot(df, aes(index, frac_mapped)) +
  geom_line() +
  scale_x_log10() +
  theme_bw()

You can then subset as follows:

cnt_myref <- cnt_myref[cnt_myref$obs$frac_mapped>0.8,]

Clustering by SNPs

You can produce UMAPs and clusters based on the SNP matrix. Load the CellSNP-lite counts as follows:

snp_ad <- ReadCellSNPmatrix(
  bascetRoot,
  "cellsnp"
)

Create a Seurat object from the counts:

adata <- CreateSeuratObject(counts = t(snp_ad), project = "adata3k", min.cells = 3, min.features = 0)

#transfer count metadata (optional)
adata$frac_mapped <-  cnt_mutans$obs[colnames(adata),]$frac_mapped
adata$tot_reads <-  cnt_mutans$obs[colnames(adata),]$tot_reads

Do the usual transformations and clustering (another option is to normalize by sequencing depth of cell, not SNP count; or run ATAC-seq-like normalization. Best practices are yet to be established):

### Normalize data, PCA etc
adata <- NormalizeData(adata, normalization.method = "LogNormalize", scale.factor = 10000)
adata <- NormalizeData(adata)
adata <- FindVariableFeatures(adata, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(adata)
adata <- ScaleData(adata, features = all.genes)

adata <- RunPCA(adata, features = VariableFeatures(object = adata))

adata <- FindNeighbors(adata, dims = 1:10)
adata <- FindClusters(adata, resolution = 0.5)

adata <- RunUMAP(adata, dims = 1:10)

adata$snp_count <- rowSums(adata@assays$RNA@layers$counts)

Do some quick plots to check for clusters:

DimPlot(adata, reduction = "umap")
FeaturePlot(adata, reduction = "umap", features = c("snp_count"))
FeaturePlot(adata, reduction = "umap", features = c("frac_mapped"))
FeaturePlot(adata, reduction = "umap", features = c("tot_reads"))

Find differential SNPs using the Seurat marker tools

adata.markers <- FindAllMarkers(adata, only.pos = TRUE)

adata.markers %>%
  group_by(cluster) %>%
  dplyr::filter(avg_log2FC > 2) %>%
  slice_head(n = 100) %>%
  ungroup() -> top10

as.data.frame(top10)

Plot SNPs to your liking

FeaturePlot(adata, reduction = "umap", features = c("NZ-CP077404.1-7148-T-to-C"))

That’s it!

SNP-based deconvolution of cells from different donors

The output files from SNP-analysis are directly compatible with Vireo. See their tutorials for how to perform this analysis.