Skip to contents

First set up Zorn/Bascet according to the install instructions. This tutorial assumes that you have debarcoded the reads.

When to perform alignment

Most single-cell analysis’ will involve an alignment step:

  • For analysis of model organisms (classic scRNA-seq), alignment is needed to see which genes are expressed
  • For metagenomics, alignment is a good way of removing human reads, or reads from contaminating DNA/RNA included with the enzmymes of buffers (see filter step later)
  • For metagenomics, you still need alignment prior to SNP-analysis (covered in separate section)

Indexing a genome

Before alignment, you need to prepare your reference by indexing it. Bascet wraps several aligners and use their native indices. If you want full access to all options, indices can be built the old-school way by accessing them like this:

#Access BWAMEM2 inside Bascet
./bascet exttool bwa-mem2 index ...

But you can also perform the indexing via R. Here we assume that you have a genome FASTA-file “genome.fa” or “genome.fa.gz”:

(SLURM-compatible step)

#Build a reference for BWAMEM2
BascetIndexGenomeBWAMEM2(
  bascetRoot,
  "/your_reference/genome.fa"
)

If you have long-read data, Minimap2 is the better option

#Build a reference for Minimap2
BascetIndexGenomeMinimap2(
  bascetRoot,
  "/your_reference/genome.fa"
)

You can also use STAR if you want to align RNA-seq data (and thus need a splice-aware aligner)

#Build a reference for STAR
BascetIndexGenomeSTAR(
  bascetRoot,
  fastaFile="/your_reference/genome.fa",
  gtfFile="/your_reference/genome.gtf"  #note that GTF-file is needed here to inform STAR about splice junctions
)

Alignment

You can now proceed with alignment using BWA. By default, this command outputs both unsorted and sorted BAM-files.

  • The sorted BAM-files are required for counting of reads in genomic regions, i.e., WGS species counting and RNA-seq
  • The unsorted BAM-files are required for filtering of host reads

(SLURM-compatible step)

### Perform alignment
BascetAlignToReference(
  bascetRoot,
  aligner="BWAMEM2",
  useReference="/your_reference/genome.fa"
)

Host filtering workflow

A common task in metagenomics is to filter out human reads. After you have aligned the reads, you can run the following command to only retain unmapped reads

BascetFilterAlignment(
  bascetRoot,
  inputName="aligned_cell",
  outputName="aligned_cell_nohost",
  keepMapped=TRUE
)

To get the output BAM file back in Bascet TIRP format, use the following command

(SLURM-compatible step)

#Convert aligned filtered BAM to TIRP
BascetMapTransform(
  bascetRoot,
  inputName="aligned_cell_nohost",
  outputName = "filtered_nohost",
  outFormat="tirp.gz"
)

Option #1: Counting chromosomes (presence/absence of species)

A simplified scenario is when you just want to get the number of reads per chromosome. These counts can later be summarized on the levels of genomes (summing up across chromosomes). We use this to for example compare the alignment to an expected mock community, and thus just want to see how species mix together. The following command generates counts files:

(SLURM-compatible step)

### Count reads per chromosome
BascetCountChrom(
  bascetRoot
)

The output format is one binary HDF5 file for each shard, following the AnnData on-disk layout for the main sparse count matrix. Cells are stored as observations, features as variables, and per-cell unmapped read counts are available in cnt@obs$unclassified_reads. To load these, use the following command that loads the files and concatenates them into a single matrix. It can then be loaded into Seurat:

library(Seurat)

cnt <- ReadBascetCountMatrix(bascetRoot,"chromcount", verbose=FALSE)

adata <- CreateSeuratObject(
  cnt,
  assay = "chrom_cnt"
)

Most likely, some cells will have to be discarded. Initially we recommend that you keep the cutoff low until you are better informed. You can then apply the cutoff as follows:

keep_cells <- adata$nCount_chrom_cnt > 10000 #10k reads
sum(keep_cells)                              #See how many cells pass this cutoff
adata <- adata[,keep_cells]                  #Reduce to sensible number

If you have a list of which chromosomes belong together, then you can furthermore sum them up to get species-level counts. mapSeq2strain should contain two columns: (todo) (TODO filter after instead)

adata[["species_cnt"]] <- ChromToSpeciesCount(adata, mapSeq2strain)  

#Optional: Figure out which species has most reads in which cell
cnt <- GetAssayData(adata, assay = "species_cnt", layer = "counts")
adata$species_aln <- rownames(cnt)[apply(cnt, 2, which.max)]

Knowing the species, you can generate a per-species kneeplot as follows:

DefaultAssay(adata) <- "species_cnt"
KneeplotPerSpecies(adata)

You can now proceed to generate a dimensional reduction in the form of a UMAP. See the tutorials for Seurat and Signac for details. We summarize the required command below, where you can apply either an RNA-seq or ATAC-seq dimensional reduction depending on what you think is appropriate.

ATAC-seq style analysis (recommended also for DNA-seq!):

#ATAC-seq style analysis
library(Signac)
adata <- RunTFIDF(adata)
adata <- FindTopFeatures(adata, min.cutoff = 'q0')
numdim <- min(50, nrow(adata)-1)
adata <- RunSVD(adata, n = numdim)
DepthCor(adata)
adata <- RunUMAP(object = adata, reduction = 'lsi', dims = 1:numdim, reduction.name = "chrom_umap")  

RNA-seq style analysis:

#RNA-seq style analysis
adata <- NormalizeData(adata)
adata <- FindVariableFeatures(adata, selection.method = "vst", nfeatures = nrow(adata))
adata <- ScaleData(adata, features = rownames(adata))
numdim <- min(50, nrow(adata)-2)
adata <- RunPCA(adata, features = VariableFeatures(object = adata), npcs = numdim)
adata <- RunUMAP(adata, dims = 1:numdim, reduction.name = "chrom_umap")

Finally, you can plot your UMAP. If you also extracted the dominant species column in your matrix, you can visualize it on top!

DimPlot(object = adata, label = TRUE, group.by = "species_aln") + 
  xlab("BWA1") + 
  ylab("BWA2")

Option #2: Feature counting (RNA-seq analysis, etc)

For protocols such as RNA-seq, the aim is to get the number of reads per gene, per cell. After alignment and sorting of reads based on genome coordinate (default setting), you can overlay the reads with gene coordinates from an annotation file. The format is detected from the file extension; the following are all supported: .gff, .gff.gz, .gtf, .gtf.gz and .bed. The following command performs this operation:

BascetCountFeature(
  bascetRoot,
  gffFile = "/my/genome.gtf.gz",
  useFeature = "gene",     #Default
  attrGeneId = "gene_id",  #Default
  attrGeneName = "name"    #Default
)

The ID and name of genes are extracted from the GFF attributes column. Not all files have both name and ID; the ID will be used as name if nothing specified. By default, the rows having “gene” as the feature description will

After the features (genes) have been counted, you can load them into Seurat:

library(Seurat)

cnt <- ReadBascetCountMatrix(bascetRoot,"featurecount", verbose=FALSE)

adata <- CreateSeuratObject(cnt)

From here, you can follow an RNA-seq tutorial for Seurat. But for the impatient, this is some minimal example code to get your first UMAP:

adata <- NormalizeData(adata)
adata <- FindVariableFeatures(adata, selection.method = "vst")
adata <- ScaleData(adata)
adata <- RunPCA(adata, features = VariableFeatures(object = adata))

adata <- FindNeighbors(adata, dims = 1:20)
adata <- FindClusters(adata, resolution = 0.5)

adata <- RunUMAP(adata, dims=1:10)

Option #3: Counting with Signac

If you want to analyze where reads mapped, either for RNA-seq counting for genes, or more generally, what the count is within a genomic region, you can use Signac for great flexibility in this task. Signac is designed to analyze single-cell ATAC-seq data, where the reads are stored in a specialized BED-file. You can use the command below to generate this file.

Note one caveat: depending on chemistry, this workflow may not handle UMIs, and just report raw read counts. However, for quick and dirty operations this is typically enough.

(SLURM-compatible step)

### Generate fragments BED file suited for quantifying
### reads/chromosome using Signac later 
BascetBam2Fragments(
  bascetRoot
)

Optional: TIRP to FASTQ

If you want to feed the FASTQ data into another tool (or aligner), you can use the following command to convert TIRP to FASTQ. Each FASTQ read will contain the name of the cell

(SLURM-compatible step)

### Get reads in fastq format
BascetMapTransform(
  bascetRoot,
  inputName="filtered",
  outputName="asfq",
  out_format="R1.fq.gz"
)