2. Preparing Files

The pipeline is controlled by editing configuration and manifest files. Defaults are found in the nextflow.config file.

Figure: SINCLAIR Process Overview. Overview of the single-cell RNA-Seq gene expression process.

2.1 Preparing input data

SINCLAIR is designed to start at one of two entry points: raw unaligned reads in fastq.gz format, or pre-aligned counts in .h5 format. File names should follow the naming convention used by 10X CellRanger; rename files as needed to meet it. The directory layouts described below also show the expected file naming format.

2.1.1 Starting from fastq.gz files

The fastq.gz file names must match the format produced by 10X CellRanger mkfastq, bcl-convert, or bcl2fastq. If the files were already generated via CellRanger, proceed to the next step. Otherwise, arrange the sample files and directories in the following format.

For a dataset with 2 samples, each run on 2 lanes, the file structure would be:

/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_I1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_I1_001.fastq.gz

/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_I1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_I1_001.fastq.gz

Not every dataset will include all of these files. The minimum required to run the CellRanger alignment is the forward (R1) and reverse (R2) reads from a single lane. Using the above example:

/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz

/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
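Before launching the pipeline, it can help to confirm that a sample directory's file names fit this convention and that every lane has both R1 and R2. The following is a minimal Python sketch, not part of SINCLAIR; the function name `check_fastq_names` and the regular expression are illustrative assumptions based only on the listing above. The lane field is matched as any run of digits so that the names shown above are accepted as written.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz
# (derived from the example listing, not taken from the SINCLAIR source).
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d+)"
    r"_(?P<read>[RI][12])_001\.fastq\.gz$"
)


def check_fastq_names(filenames):
    """Return (lanes_missing_reads, unrecognized_names) for a list of file names."""
    reads_per_lane = defaultdict(set)
    unrecognized = []
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            unrecognized.append(name)
            continue
        reads_per_lane[(match["sample"], match["lane"])].add(match["read"])

    # R1 and R2 are the minimum per lane; index reads (I1/I2) are optional.
    missing = {
        key: sorted({"R1", "R2"} - reads)
        for key, reads in reads_per_lane.items()
        if not {"R1", "R2"} <= reads
    }
    return missing, unrecognized
```

Passing the result of `os.listdir("/path/to/sample1")` to `check_fastq_names` would return two empty collections for a directory laid out as in the example above.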

2.1.2 Starting from aligned .h5 files

When starting from .h5 files that are generated from CellRanger alignment, the directory structure is simpler:

/path/to/sample1/outs/filtered_feature_bc_matrix.h5
/path/to/sample2/outs/filtered_feature_bc_matrix.h5

As before, it is strongly recommended to follow the naming conventions generated by CellRanger, as SINCLAIR looks primarily for the filtered_feature_bc_matrix.h5 file, which has already had empty droplets algorithmically filtered out.
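A quick way to confirm that each sample directory contains the expected matrix file is sketched below in Python. This is an illustrative helper, not part of SINCLAIR; `find_h5_inputs` is a hypothetical name, and the search is recursive so that it does not depend on the exact subdirectory name.

```python
from pathlib import Path

H5_NAME = "filtered_feature_bc_matrix.h5"


def find_h5_inputs(sample_dirs):
    """Map each sample directory name to the matrix files found beneath it."""
    found = {}
    for sample_dir in sample_dirs:
        root = Path(sample_dir)
        # rglob searches the directory and all of its subdirectories.
        found[root.name] = sorted(str(p) for p in root.rglob(H5_NAME))
    return found
```

A sample that maps to an empty list has no `filtered_feature_bc_matrix.h5` file and would need to be aligned or converted first.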

Addendum: Starting from matrix files

Older versions of CellRanger, as well as some other workflows (e.g. DropSeq, Smart-Seq, PipSeq), tend to produce a file structure that resembles the following for each sample:

/path/to/sample1/outs/filtered_feature_matrix/matrix.mtx.gz
/path/to/sample1/outs/filtered_feature_matrix/features.tsv.gz
/path/to/sample1/outs/filtered_feature_matrix/barcodes.tsv.gz

In this case, the files will need to be converted into a .h5 file and organized as described previously. This can be done with a script such as this:

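# Convert a matrix.mtx.gz / features.tsv.gz / barcodes.tsv.gz triplet into one .h5 file
library(Seurat)
library(DropletUtils)

# Expected arguments: sample name, matrix file, features file, barcodes file
args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 4) {
  stop("Expected 4 arguments: sampleName mtx_file features_file barcodes_file")
}

sampleName <- args[1]
mtx_file <- args[2]
features_file <- args[3]
barcodes_file <- args[4]

# Read the sparse count matrix together with its feature and barcode labels
counts <- Seurat::ReadMtx(mtx = mtx_file, cells = barcodes_file, features = features_file)

# Write <sampleName>.h5; the .h5 extension selects HDF5 output
outfile <- paste0(sampleName, ".h5")
DropletUtils::write10xCounts(x = counts, path = outfile)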

2.2 Preparing Manifests

Two manifests are required. They describe the samples and the desired contrasts:

assets/input_manifest.csv OR assets/input_manifest_cellranger.csv
assets/contrast_manifest.csv

2.2.1 Input Manifest

This manifest contains sample-level information. It includes the following column headers:

masterID: This is the biological sample ID; duplicates are allowed in this column
uniqueID: This is a unique sample level ID; duplicates are not allowed in this column
groupID: This is the group ID, which must match a group listed in the contrast manifest; duplicates are allowed in this column
dataType: This is the data type of the input sample; currently only gex is permitted
input_dir: This is the input directory for the data files of the sample type (e.g. /path/to/sample1/fastq or /path/to/sample1/outs)

An example input manifest file is shown below:

masterID	uniqueID	groupID	dataType	input_dir
WB_Lysis_1	sample1	group1	gex	/data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample1
WB_Lysis_1	sample2	group1	gex	/data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample2
WB_Lysis_2	sample3	group2	gex	/data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample3
WB_Lysis_2	sample4	group2	gex	/data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample4
WB_Lysis_3	sample5	group3	gex	/data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample5
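The column rules above (all five headers present, no duplicate uniqueID values, dataType limited to gex) can be checked before a run. The following Python sketch is an illustration only and is not part of SINCLAIR; `validate_input_manifest` is a hypothetical name. The example above is displayed with tab-separated columns while the file is named .csv, so the delimiter is left as a parameter rather than assumed.

```python
import csv
import io

REQUIRED_COLUMNS = ["masterID", "uniqueID", "groupID", "dataType", "input_dir"]


def validate_input_manifest(text, delimiter=","):
    """Return a list of problems found in the manifest text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        uid = row["uniqueID"]
        if uid in seen:
            problems.append(f"line {line_no}: duplicate uniqueID '{uid}'")
        seen.add(uid)
        if row["dataType"] != "gex":
            problems.append(
                f"line {line_no}: dataType '{row['dataType']}' is not 'gex'"
            )
    return problems
```

For a tab-separated file, the text can be read with `open(path).read()` and passed with `delimiter="\t"`. Duplicate masterID and groupID values are deliberately not reported, since both are allowed.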

2.2.2 Contrast Manifest

This manifest lists the groups to be used in each differential comparison. A few requirements:

groups listed must match groups in the groupID column of the input manifest
the header must include one column for each group in the largest contrast. In the example below, the second contrast contains 3 groups, and so the header includes contrast1-contrast3
more groups can be added by extending the header, and more contrasts by adding rows, as needed

An example contrast file:

contrast1	contrast2	contrast3
group1	group2
group1	group2	group3
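The first requirement, that every group in a contrast also appears in the groupID column of the input manifest, is easy to check ahead of time. The Python sketch below is illustrative and not part of SINCLAIR; `read_contrasts` and `unknown_groups` are hypothetical names. It treats each row after the header as one contrast, as in the example above, and takes the delimiter as a parameter.

```python
import csv
import io


def read_contrasts(text, delimiter=","):
    """Return one list of group names per contrast row, skipping the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    contrasts = []
    for row in rows[1:]:
        # Drop empty cells so a two-group row under a three-column header
        # yields exactly two groups.
        groups = [cell.strip() for cell in row if cell.strip()]
        if groups:
            contrasts.append(groups)
    return contrasts


def unknown_groups(contrasts, manifest_groups):
    """List groups used in contrasts that are absent from the input manifest."""
    known = set(manifest_groups)
    return sorted({g for contrast in contrasts for g in contrast if g not in known})
```

An empty list from `unknown_groups` means every contrast refers only to groups defined in the input manifest.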