Overview of File formats • Zorn

Bascet-TIRP (.tirp.gz)

TIRP (Tabix-indexed read pairs) is a bgzip-compressed text file with the following tab-separated columns:

Name of cell
1 (start position in tabix format; ignored)
1 (end position in tabix format; ignored)
R1 sequence
R2 sequence
R1 quality score
R2 quality score
UMI (always present, may be empty)

Except immediately after debarcoding, TIRPs are always sorted by cell name and then indexed using tabix. This means that you can retrieve the list of cells in a file using the tabix tool, but we advise you to use the higher-level wrappers in Zorn/Bascet so that you do not get locked into this file format.
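The column layout above can be illustrated with a minimal Python sketch. It is not part of Zorn or Bascet; the names TirpRecord, parse_tirp_line and iter_tirp are hypothetical and chosen only for this example. It relies on the fact that a bgzip file is also a valid multi-member gzip file, so the standard gzip module can stream it sequentially (without using the tabix index).

```python
import gzip
from typing import Iterator, NamedTuple


class TirpRecord(NamedTuple):
    """One read pair from a TIRP line (hypothetical helper type)."""
    cell: str
    r1_seq: str
    r2_seq: str
    r1_qual: str
    r2_qual: str
    umi: str  # always present as a column, but may be an empty string


def parse_tirp_line(line: str) -> TirpRecord:
    """Split one TIRP line into its 8 tab-separated columns."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 tab-separated columns, got {len(fields)}")
    # Columns 2 and 3 are the tabix start/end positions, which are ignored.
    cell, _start, _end, r1_seq, r2_seq, r1_qual, r2_qual, umi = fields
    return TirpRecord(cell, r1_seq, r2_seq, r1_qual, r2_qual, umi)


def iter_tirp(path: str) -> Iterator[TirpRecord]:
    """Stream all records of a .tirp.gz file from start to end."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            yield parse_tirp_line(line)


# A line with an empty UMI still has 8 columns (note the trailing tab).
record = parse_tirp_line("cellA\t1\t1\tACGT\tTTGA\tFFFF\tFFFF\t\n")
```

For anything beyond inspection, the Zorn/Bascet wrappers remain the recommended route, since they hide the details of the format.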

An optional read count histogram can be stored as a file: xxx.tirp.gz.hist

Bascet-ZIP (.zip)

We use zip files as a means of storing general data. These are the conventions:

File Y for cell XX is stored as XX/Y
If a cell has reads, they are stored as XX/r1.fq and XX/r2.fq
If a cell has contigs, they are stored as XX/contigs.fa
The file XX/_mapcell.log is the output from the mapcell script
In general, files named XX/_YY are reserved for special output from future tools, so avoid storing files whose names start with an underscore (_)
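As a rough illustration of these conventions, the following Python sketch groups the entries of a zip archive by cell using only the standard zipfile module. The helper names files_by_cell and read_cell_file are hypothetical, and the sketch assumes that cell names do not themselves contain a "/" character.

```python
import io
import zipfile
from collections import defaultdict


def files_by_cell(zf: zipfile.ZipFile) -> dict[str, list[str]]:
    """Group archive entries by cell, following the XX/Y convention."""
    cells: dict[str, list[str]] = defaultdict(list)
    for name in zf.namelist():
        # Assumption: the cell name is everything before the first "/".
        cell, sep, filename = name.partition("/")
        if sep and filename:  # skip top-level files and bare directory entries
            cells[cell].append(filename)
    return dict(cells)


def read_cell_file(zf: zipfile.ZipFile, cell: str, filename: str) -> bytes:
    """Return the raw bytes of file Y for cell XX, stored as XX/Y."""
    return zf.read(f"{cell}/{filename}")


# Build a tiny in-memory archive with made-up content to demonstrate.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("cellA/r1.fq", "@cellA::1\nACGT\n+\nFFFF\n")
    archive.writestr("cellA/r2.fq", "@cellA::1\nTTGA\n+\nFFFF\n")
    archive.writestr("cellA/_mapcell.log", "done\n")
    archive.writestr("cellB/contigs.fa", ">c1\nACGT\n")

with zipfile.ZipFile(buffer) as archive:
    layout = files_by_cell(archive)
    contigs = read_cell_file(archive, "cellB", "contigs.fa")
```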

See the separate section on how to work with Bascet-ZIP.

Bascet-FASTQ (.R1.fq.gz and .R2.fq.gz)

Some tools require FASTQ as input or output. To keep track of the cell origin of reads, the reads have a special naming convention:

cellID “:” UMI “:” read_number

where:

cellID is the name of the cell. As FASTQ only supports some characters, names will be mangled in the future (to be implemented)
UMI is the unique molecular identifier
read_number is just a number, shared by R1 and R2. It can be used to track read correspondence if reads are filtered, multimapped, etc.

Reads in Bascet-FASTQ should be sorted by (cellID, read_number, read_index), in this order. This makes it easy to read all reads for one cell without having to scan through the entire file.

Example:

@cell1:AAAA:1
ATCGATCGATCG
+
FFFFFFFFFFFF
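The naming convention and the sort order can be exploited as in the following Python sketch, which splits a read name into its parts and groups consecutive reads by cell. The names ReadName, parse_read_name and reads_by_cell are hypothetical. The sketch splits from the right so that a cell name containing ":" would still parse, and it assumes read_number is an integer; both are assumptions of this example.

```python
from itertools import groupby
from typing import Iterable, Iterator, NamedTuple


class ReadName(NamedTuple):
    cell_id: str
    umi: str          # may be empty
    read_number: int  # same value for R1 and R2 of a pair


def parse_read_name(header: str) -> ReadName:
    """Parse 'cellID:UMI:read_number', with or without the leading '@'."""
    name = header.strip().removeprefix("@")
    # Split from the right: the last two fields are UMI and read_number.
    cell_id, umi, read_number = name.rsplit(":", 2)
    return ReadName(cell_id, umi, int(read_number))


def reads_by_cell(headers: Iterable[str]) -> Iterator[tuple[str, list[ReadName]]]:
    """Group read names by cell; relies on the input being sorted by cellID."""
    parsed = (parse_read_name(h) for h in headers)
    for cell_id, group in groupby(parsed, key=lambda r: r.cell_id):
        yield cell_id, list(group)


example = parse_read_name("@cell1:AAAA:1")
groups = list(reads_by_cell(["@cell1:AAAA:1", "@cell1::2", "@cell2:CCGG:1"]))
```

Because groupby only merges adjacent items, each cell can be processed as soon as its last read has been seen, without holding the whole file in memory.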

See the separate section on how to work with Bascet-FASTQ.

Bascet-BAM (.bam)

To keep track of the cellular origin, reads in BAM files typically follow the same naming convention as in Bascet-FASTQ. Thus, running any aligner on a Bascet-FASTQ should result in valid Bascet-BAM.

To support other tools, the reads may also be annotated using tags. If reads are not named according to the FASTQ scheme, tools must instead scan for tags:

CB:Z:…. cell_ID
UB:Z:…. UMI

This is similar to CellRanger annotation (https://www.10xgenomics.com/support/software/cell-ranger/7.2/analysis/outputs/cr-outputs-bam), except Bascet does not have tags for yet-to-be-corrected cell IDs and UMIs.
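The lookup order described above (read name first, tags as fallback) might be sketched as follows for a single SAM-formatted text line. This is not Bascet's implementation: the function cell_and_umi is hypothetical, and the rule used to decide whether a name follows the FASTQ scheme (a UMI made only of A, C, G, T or N, followed by a numeric read number) is an assumption made for this example.

```python
import re
from typing import Optional

# Assumption: a Bascet-style name ends in ":<UMI of ACGTN, maybe empty>:<digits>".
_BASCET_NAME = re.compile(r"^(?P<cell>.+):(?P<umi>[ACGTN]*):(?P<num>\d+)$")


def cell_and_umi(sam_line: str) -> tuple[Optional[str], Optional[str]]:
    """Return (cell ID, UMI) from the read name, else from the CB/UB tags."""
    fields = sam_line.rstrip("\r\n").split("\t")
    match = _BASCET_NAME.match(fields[0])
    if match:
        return match.group("cell"), match.group("umi")
    # SAM has 11 mandatory columns; optional TAG:TYPE:VALUE fields follow.
    tags = {}
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        tags[tag] = value
    return tags.get("CB"), tags.get("UB")


named = "cell1:AAAA:1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF"
tagged = "read7\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF\tCB:Z:cell2\tUB:Z:GGCC"
from_name = cell_and_umi(named)
from_tags = cell_and_umi(tagged)
```

A read with neither a conforming name nor the tags yields (None, None), which a caller can treat as "cell of origin unknown".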

Bascet-HDF5 (.h5)

Bascet generates count matrices from several tools:

KRAKEN count matrices
Informative KMER count matrices
Feature and chromosome count matrices

Our format is very similar to the AnnData count matrix format, although we may need to adjust our writers further for full conformity. We provide our own compatible readers.