SINCLAIR Quickstart¶
Overview¶
SINCLAIR is an end-to-end Nextflow pipeline for processing single-cell RNA-seq (scRNA-seq) data from either raw fastq.gz files or count files in .h5 format, as produced by the 10X CellRanger pipeline. It primarily uses Seurat v5 as its backbone for downstream processing.
In short, SINCLAIR performs the following functions:
- Alignment from FASTQ (optional)
- Initial quality control and cell filtering per sample
- Sample combination
- Batch correction
- Preliminary cell type annotation
- Preliminary clustering
The final outputs are a set of Seurat .rds files that contain all provided samples with and without batch correction, with batch correction evaluated using several algorithms.
Installation and Initialization¶
Via CCBR Pipeliner on Biowulf (NIH HPC)¶
If working on Biowulf, start an interactive session with a minimum of 16 CPUs, 64 GB of memory, 8 hours of wall time, and 128 GB of local scratch space:
sinteractive --mem=64g --cpus-per-task=16 --time=8:00:00 --gres=lscratch:128
As of CCBR Pipeliner release 8, instantiate Pipeliner as a module:
module load ccbrpipeliner
Navigate to your working directory and initialize SINCLAIR:
sinclair init
Setting up input files¶
All input files should follow the naming convention used by CellRanger (https://www.10xgenomics.com/support/jp/software/cell-ranger/8.0/tutorials/inputs/cr-specifying-fastqs). When starting from fastq files, each sample should have its own directory containing at least the R1 and R2 files (i.e. forward and reverse reads). Optional additional files include I1 index files and reads from multiple lanes. Example minimum data structure for two samples:
`/path/to/sample1/sample1_S1_L001_R1_001.fastq.gz`
`/path/to/sample1/sample1_S1_L001_R2_001.fastq.gz`
`/path/to/sample2/sample2_S1_L001_R1_001.fastq.gz`
`/path/to/sample2/sample2_S1_L001_R2_001.fastq.gz`
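Before launching a run, it can help to confirm that every sample directory really contains an R1 and an R2 file with a CellRanger-style name. The following is a minimal sketch, not part of SINCLAIR: the function name `has_read_pair` is hypothetical, and the pattern assumes names of the form `<sample>_S<n>_L<lane>_<read>_001.fastq.gz`.

```shell
# Hypothetical helper (not provided by SINCLAIR): succeeds only if the given
# directory holds at least one R1 and one R2 file with a CellRanger-style name.
has_read_pair() {
    fq_dir=$1
    for rt in R1 R2; do
        # Assumed pattern: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
        ls "$fq_dir" 2>/dev/null \
            | grep -Eq "_S[0-9]+_L[0-9]+_${rt}_001[.]fastq[.]gz\$" || return 1
    done
    return 0
}

# Example use (paths are placeholders):
#   has_read_pair /path/to/sample1 || echo "sample1 is missing R1 or R2"
```

The check only looks at file names; it does not verify that the files are valid gzip archives or that the read counts match.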
When starting from .h5 files generated by CellRanger alignment, the directory structure is simpler:
/path/to/sample1/outputs/filtered_feature_bc_matrix.h5
/path/to/sample2/outputs/filtered_feature_bc_matrix.h5
The .h5 matrix files should be named filtered_feature_bc_matrix.h5, with the sample name indicated in the directory path.
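To collect the directories that hold a count matrix (for example, to fill in the manifest later), a small search over the project root is usually enough. This is a sketch under the layout described above; `list_h5_dirs` is a hypothetical name, not a SINCLAIR command.

```shell
# Hypothetical helper (not provided by SINCLAIR): print, in sorted order,
# every directory under the given root that contains a
# filtered_feature_bc_matrix.h5 file.
list_h5_dirs() {
    find "$1" -type f -name 'filtered_feature_bc_matrix.h5' | sort \
        | while IFS= read -r h5_file; do
            dirname "$h5_file"
        done
}

# Example use (path is a placeholder):
#   list_h5_dirs /path/to
```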
Setting Up Manifest Files¶
Manifest files are comma-separated values (.csv) files in the assets folder of the SINCLAIR working directory. These contain the file paths for the input sample files and the contrasts to be included as group identities for downstream differential expression.
Two options exist for sample inputs: input_manifest.csv and input_manifest_cellranger.csv. Usage depends on the entry point, i.e. whether the samples have already been aligned using CellRanger.
input_manifest.csv¶
This file is used when starting from fastqs and requires alignment via CellRanger. The .csv file in assets can be used as a template and modified as follows:
masterID | uniqueID | groupID | dataType | input_dir |
---|---|---|---|---|
parentID_1 | uniqueSampleID_1 | groupID_1 | gex | path/to/fastqs/1 |
parentID_2 | uniqueSampleID_2 | groupID_2 | gex | path/to/fastqs/2 |
The masterID column can be used to indicate that samples are replicates of the same sample; its values are not required to be unique. The groupID column indicates the contrast group for each sample. The input_dir column must point to a directory of fastq files as generated by the 10X Chromium pipeline, or fastq files that follow the same naming convention.
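A quick structural check of the manifest can catch a missing column or a stray comma before the pipeline starts. The sketch below assumes the .csv header is exactly the five column names shown in the table, separated by commas; `check_manifest` is a hypothetical helper, not a SINCLAIR command.

```shell
# Hypothetical helper (not provided by SINCLAIR): verify that a manifest has
# the expected header and exactly five comma-separated fields on every row.
# Assumption: the header matches the column names shown in the table above.
check_manifest() {
    awk -F',' '
        BEGIN { bad = 0 }
        $0 == "" { next }    # ignore blank lines
        NR == 1 {
            if ($0 != "masterID,uniqueID,groupID,dataType,input_dir") {
                print "unexpected header: " $0
                bad = 1
            }
            next
        }
        NF != 5 {
            print "line " NR ": expected 5 fields, found " NF
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Example use:
#   check_manifest assets/input_manifest.csv && echo "manifest looks OK"
```

The function returns a non-zero status when any problem is reported, so it can gate a submission script. It does not check that the paths in input_dir exist.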
input_manifest_cellranger.csv¶
This file is used if alignment and read counting have already been performed, such as through the 10X CellRanger software or a similar tool. The input is expected to be in .h5 format. Tools will be made available to convert mtx triplet files into .h5 format.
masterID | uniqueID | groupID | dataType | input_dir |
---|---|---|---|---|
parentID_1 | uniqueSampleID_1 | groupID_1 | gex | path/to/h5Counts/1 |
parentID_2 | uniqueSampleID_2 | groupID_2 | gex | path/to/h5Counts/2 |
The primary difference here is that input_dir points to the directory containing the .h5 file rather than the uncounted .fastq.gz files.
contrast_manifest.csv¶
This file contains the comparisons to be generated, and will indicate samples that should be included in different combinations. For each contrast indicated, only the samples within the specified groups will be processed and combined.
contrast1 | contrast2 | contrast3 |
---|---|---|
group1 | group2 | |
group1 | group2 | group3 |
A contrast can include as many groups as exist, with a minimum of 2 groups; each group must be specified in both the sample manifest (input_manifest.csv or input_manifest_cellranger.csv) and contrast_manifest.csv. If running SINCLAIR on a single sample, the above can be formatted as:
contrast1 |
---|
group1 |
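Because every group named in a contrast must also appear in the groupID column of the sample manifest, a cross-check between the two files is a cheap safeguard. The following sketch assumes both files are plain comma-separated text with a single header row and that groupID is the third column, as in the tables above; `missing_groups` is a hypothetical helper.

```shell
# Hypothetical helper (not provided by SINCLAIR): print each group that is
# named in the contrast manifest but absent from the groupID column
# (assumed to be column 3) of the sample manifest.
missing_groups() {
    sample_manifest=$1
    contrast_manifest=$2
    awk -F',' '
        # First file: remember every groupID, skipping the header row.
        NR == FNR { if (FNR > 1 && $3 != "") known[$3] = 1; next }
        # Second file: skip the header, then test every non-empty cell.
        FNR == 1 { next }
        {
            for (i = 1; i <= NF; i++)
                if ($i != "" && !($i in known)) print $i
        }
    ' "$sample_manifest" "$contrast_manifest" | sort -u
}

# Example use:
#   missing_groups assets/input_manifest.csv assets/contrast_manifest.csv
```

Empty output means every contrast group was found in the sample manifest.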
Starting the Run¶
These instructions will start a basic run of SINCLAIR. For more detailed instructions, please refer to 3. Running the Pipeline. When running SINCLAIR from CCBR Pipeliner, the following commands are used. When running from a GitHub installation, sinclair should be replaced with bin/sinclair.
To start a local instance with CellRanger alignment (which is also the default setting):
sinclair run --mode local --species <genome> --run_cellranger true
To start a slurm run:
sinclair run --mode slurm --species <genome> --run_cellranger true
By default, the genome is hg19; other options include mm10 and hg38. To run SINCLAIR without CellRanger alignment, set the parameter --run_cellranger false; SINCLAIR will then read the input_manifest_cellranger.csv manifest.
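The relationship between the entry point and the --run_cellranger flag can be captured in a small wrapper that prints the command to run. This is an illustrative sketch only: `sinclair_cmd` and its `fastq`/`h5` keywords are hypothetical, and it uses only the flags shown above.

```shell
# Hypothetical wrapper (not provided by SINCLAIR): print the run command for
# a given execution mode, genome, and entry point.
sinclair_cmd() {
    run_mode=$1      # local or slurm
    genome=$2        # e.g. hg19, hg38, or mm10
    entry_point=$3   # fastq (align with CellRanger) or h5 (already counted)
    case $entry_point in
        fastq) run_cellranger=true ;;
        h5)    run_cellranger=false ;;
        *)     echo "entry point must be fastq or h5" >&2; return 1 ;;
    esac
    echo "sinclair run --mode $run_mode --species $genome --run_cellranger $run_cellranger"
}

# Example use: print the command for a slurm run from existing .h5 counts.
#   sinclair_cmd slurm hg38 h5
```

Printing the command rather than executing it keeps the sketch safe to try outside an environment where sinclair is installed.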
Expected Outputs¶
During execution, the SINCLAIR workflow stores all temporary outputs in the work directory. This directory also supports workflow recovery: if the run fails, intermediate files in work allow the pipeline to resume from the point of failure when the user re-runs the pipeline.
Final results are saved in the results directory unless a different output directory was specified in the parameters. The results directory will contain 4 subdirectories:
- batch_correct contains the combined Seurat .rds files for each of the contrasts, with a separate file for each batch correction method, as well as a summary .html file.
- cellranger_counts contains the .h5 counts files for each sample produced by the CellRanger software.
- samplesheets contains the parsed sample sheets based on the manifest files, as interpreted by Nextflow and SINCLAIR.
- seurat contains two subdirectories:
  - merge contains the combined sample Seurat .rds files for each set of contrasts prior to batch correction (which can otherwise be referred to as the "uncorrected" object).
  - preprocess contains the individual sample .rds files.
When proceeding to downstream secondary analysis, such as differential expression, please use the batch_correction_integration.html files to determine which batch correction method, or even lack thereof, best fits the data. The appropriate file can then be analyzed in R through the Seurat workflow.
For multi-sample analyses, the output directory will have the following structure:
results
├── batch_correct
│   ├── group1-group2_batch_correction_cca.rds
│   ├── group1-group2_batch_correction_harmony.rds
│   ├── group1-group2_batch_correction_integration.html
│   ├── group1-group2_batch_correction_liger.rds
│   └── group1-group2_batch_correction_rpca.rds
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
│   ├── ...
│   └── sampleN
│       └── outs
│           └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    │   └── group1-group2_seurat_merged.rds
    └── preprocess
        ├── sample1_seurat_preprocess.pdf
        ├── sample1_seurat_preprocess.rds
        ├── ...
        ├── sampleN_seurat_preprocess.pdf
        └── sampleN_seurat_preprocess.rds
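The file names in the tree above encode both the contrast and the batch correction method, so the available methods for a contrast can be listed directly from the batch_correct directory. The sketch below assumes the `<contrast>_batch_correction_<method>.rds` naming shown in the tree; `list_methods` is a hypothetical helper.

```shell
# Hypothetical helper (not provided by SINCLAIR): print the batch correction
# methods for which a .rds file exists for the given contrast.
# Assumed file naming: <contrast>_batch_correction_<method>.rds
list_methods() {
    bc_dir=$1
    contrast=$2
    for rds in "$bc_dir"/"${contrast}"_batch_correction_*.rds; do
        [ -e "$rds" ] || continue                 # glob matched nothing
        name=${rds##*/}                           # strip the directory
        name=${name#"${contrast}_batch_correction_"}
        echo "${name%.rds}"                       # strip the extension
    done
}

# Example use:
#   list_methods results/batch_correct group1-group2
```

The .html report is skipped because only files ending in .rds match the pattern.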
For single-sample analyses, the output directory will be similar in structure to the above, but without the batch_correct results:
results
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    └── preprocess
        ├── sample1_seurat_preprocess.pdf