SINCLAIR Quickstart
Overview
SINCLAIR is an end-to-end Nextflow pipeline for processing single-cell RNA-seq (scRNA-seq) data from either raw fastq.gz files or count files in .h5 format, as produced by the 10X CellRanger pipeline. It primarily uses Seurat v5 as its backbone for downstream processing.
In short, SINCLAIR performs the following functions:
- Alignment from FASTQ (optional)
- Initial quality control and cell filtering per sample
- Sample combination
- Batch correction
- Preliminary cell type annotation
- Preliminary clustering
The final outputs are a set of Seurat .rds files that contain all provided samples, both without batch correction and with batch correction applied by each of several algorithms.
Installation and Initialization
Via CCBR Pipeliner on Biowulf (NIH HPC)
If working on Biowulf, start an interactive session with at least 16 CPUs, 64 GB of memory, 8 hours of wall time, and a local scratch allocation (node-local temporary disk space) of 128 GB:
sinteractive --mem=64g --cpus-per-task=16 --time=8:00:00 --gres=lscratch:128
As of CCBR Pipeliner release 8, instantiate Pipeliner as a module:
module load ccbrpipeliner
Navigate to your working directory and initialize SINCLAIR:
sinclair init
Setting Up Input Files
All input files should follow the naming conventions used by CellRanger (https://www.10xgenomics.com/support/jp/software/cell-ranger/8.0/tutorials/inputs/cr-specifying-fastqs). When starting from fastq files, each sample should have its own directory containing at least the R1 and R2 files (i.e. forward and reverse reads). Optional additional files include I1 index files and reads from multiple lanes. Example minimum data structure for two samples:
`/path/to/sample1/sample1_S1_L001_R1_001.fastq.gz`
`/path/to/sample1/sample1_S1_L001_R2_001.fastq.gz`
`/path/to/sample2/sample2_S1_L001_R1_001.fastq.gz`
`/path/to/sample2/sample2_S1_L001_R2_001.fastq.gz`
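A quick pre-flight check can catch naming problems before a run is started. The following is a minimal sketch and not part of SINCLAIR: the helper name `check_fastq_dir` and its regular expression are assumptions modeled on the example file names above.

```shell
# Hypothetical helper (not shipped with SINCLAIR): succeed only if every
# *.fastq.gz file in a sample directory follows the CellRanger-style naming
# pattern and at least one R1 and one R2 file are present.
check_fastq_dir() {
    dir="$1"
    r1=0
    r2=0
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue    # the glob matched nothing
        name=$(basename "$f")
        # Assumed pattern: <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_<chunk>.fastq.gz
        if ! printf '%s\n' "$name" |
            grep -Eq '_S[0-9]+_L[0-9]+_(R1|R2|I1|I2)_[0-9]+\.fastq\.gz$'; then
            echo "unexpected FASTQ name: $name" >&2
            return 1
        fi
        # Extract the read type from the end of the file name.
        read_type=$(printf '%s\n' "$name" |
            sed -E 's/.*_(R1|R2|I1|I2)_[0-9]+\.fastq\.gz$/\1/')
        case "$read_type" in
            R1) r1=1 ;;
            R2) r2=1 ;;
        esac
    done
    [ "$r1" -eq 1 ] && [ "$r2" -eq 1 ]
}

# Example use (placeholder path):
#   check_fastq_dir /path/to/sample1 && echo "sample1 looks complete"
```

The function only inspects file names; it does not open the files or verify that R1 and R2 belong to the same lane.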
When starting from .h5 files that are generated from CellRanger alignment, the directory structure is simpler:
`/path/to/sample1/outputs/filtered_feature_bc_matrix.h5`
`/path/to/sample2/outputs/filtered_feature_bc_matrix.h5`
The .h5 matrix files should be named filtered_feature_bc_matrix.h5, with the sample name indicated in the directory path.
Setting Up Manifest Files
Manifest files are comma-separated values (.csv) files located in the assets folder of the SINCLAIR working directory. They contain the file paths for the input sample files and the contrasts to be included as group identities for downstream differential expression.
Two options exist for sample inputs: input_manifest.csv and input_manifest_cellranger.csv. Usage depends on the entry point, i.e. whether the samples have already been aligned using CellRanger.
input_manifest.csv
This file is used when starting from fastqs and requires alignment via CellRanger. The .csv file in assets can be used as a template and modified as follows:
| masterID | uniqueID | groupID | dataType | input_dir |
|---|---|---|---|---|
| parentID_1 | uniqueSampleID_1 | groupID_1 | gex | path/to/fastqs/1 |
| parentID_2 | uniqueSampleID_2 | groupID_2 | gex | path/to/fastqs/2 |
The masterID column can be used to indicate whether samples are replicates of the same parent sample; its values are not required to be unique. The groupID indicates the contrast group for each sample. The input_dir must point to a directory of fastq files as generated by the 10X Chromium pipeline, or fastq files that follow the same naming convention.
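As an illustration, the table above could be written out as a plain CSV file along these lines. This is a sketch under assumptions: the column order is taken from the table, but whether SINCLAIR expects exactly this header row should be confirmed against the template in assets. All IDs and paths are placeholders, and the `MANIFEST` variable is introduced here only for the example.

```shell
# Write a two-sample manifest to the current directory (or to $MANIFEST if set).
# Every ID and path below is a placeholder to be replaced with real values.
manifest="${MANIFEST:-input_manifest.csv}"
cat > "$manifest" <<'EOF'
masterID,uniqueID,groupID,dataType,input_dir
parentID_1,uniqueSampleID_1,groupID_1,gex,path/to/fastqs/1
parentID_2,uniqueSampleID_2,groupID_2,gex,path/to/fastqs/2
EOF

# Sanity check: every row, header included, should have exactly five fields.
awk -F',' '
    NF != 5 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
    END     { exit (bad ? 1 : 0) }
' "$manifest"
```

The finished file would then be moved into the assets folder of the SINCLAIR working directory, where the pipeline looks for its manifests.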
input_manifest_cellranger.csv
This file is used if the alignment and read counting have already been performed, such as through 10X CellRanger software or a similar tool. The input is expected to be in .h5 format. Tools will be made available to convert mtx triplet files into .h5.
| masterID | uniqueID | groupID | dataType | input_dir |
|---|---|---|---|---|
| parentID_1 | uniqueSampleID_1 | groupID_1 | gex | path/to/h5Counts/1 |
| parentID_2 | uniqueSampleID_2 | groupID_2 | gex | path/to/h5Counts/2 |
The primary difference here is that the input_dir points to the directory containing the .h5 file rather than to unaligned .fastq.gz files.
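Before launching a run from counts, it can help to confirm that each input_dir actually holds a matrix file. The sketch below is not part of SINCLAIR. It assumes a CSV with a header row and input_dir in the fifth column, and it assumes that filtered_feature_bc_matrix.h5 sits directly inside input_dir; the function name `check_h5_inputs` is hypothetical.

```shell
# Hypothetical helper: report every manifest row whose input_dir does not
# contain filtered_feature_bc_matrix.h5. Returns non-zero if any is missing.
check_h5_inputs() {
    manifest="$1"
    missing=0
    {
        read -r _header    # skip the header row
        while IFS=',' read -r _master _unique _group _type input_dir; do
            [ -n "$input_dir" ] || continue    # ignore blank or short rows
            if [ ! -f "$input_dir/filtered_feature_bc_matrix.h5" ]; then
                echo "missing .h5 in: $input_dir" >&2
                missing=1
            fi
        done
    } < "$manifest"
    [ "$missing" -eq 0 ]
}

# Example use (placeholder path):
#   check_h5_inputs assets/input_manifest_cellranger.csv
```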
contrast_manifest.csv
This file contains the comparisons to be generated and indicates which samples should be included in each combination. For each contrast indicated, only the samples within the specified groups will be processed and combined.
| contrast1 | contrast2 | contrast3 |
|---|---|---|
| group1 | group2 | |
| group1 | group2 | group3 |
Each contrast can include as many groups as exist, with a minimum of 2 groups, as specified in both the input_manifest.csv/input_manifest_cellranger.csv and the contrast_manifest.csv files. If running SINCLAIR on a single sample, the above can be formatted as:
| contrast1 |
|---|
| group1 |
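Because group names must agree between the input manifest and the contrast manifest, a simple cross-check can catch typos early. The following is a sketch under assumptions, not a SINCLAIR feature: it assumes both files are plain CSVs with a header row, that groupID is the third column of the input manifest, and that every non-empty cell below the contrast header is a group name. The function name `check_contrast_groups` is hypothetical.

```shell
# Hypothetical helper: verify that every group named in the contrast manifest
# also appears in the groupID column of the input manifest.
check_contrast_groups() {
    input_manifest="$1"
    contrast_manifest="$2"
    # Known groups: third column of the input manifest, header skipped.
    known=$(tail -n +2 "$input_manifest" | cut -d',' -f3 | sort -u)
    status=0
    # Every non-empty cell below the contrast header is treated as a group name.
    for g in $(tail -n +2 "$contrast_manifest" | tr ',' '\n' |
                   sed '/^[[:space:]]*$/d' | sort -u); do
        if ! printf '%s\n' "$known" | grep -Fxq -- "$g"; then
            echo "group not found in input manifest: $g" >&2
            status=1
        fi
    done
    return "$status"
}

# Example use (placeholder paths):
#   check_contrast_groups assets/input_manifest.csv assets/contrast_manifest.csv
```

Group names containing spaces would be split by this loop, so the sketch is only suitable for simple identifiers like the ones shown in the tables.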
Starting the Run
These instructions will start a basic run of SINCLAIR. For more detailed instructions, please refer to 3. Running the Pipeline. When running SINCLAIR from CCBR Pipeliner, the following commands are used. When running from a GitHub installation, sinclair should be replaced with bin/sinclair.
To start a local instance with CellRanger alignment (which is also the default setting):
sinclair run --mode local --species=<genome> --run_cellranger true
To start a slurm run:
sinclair run --mode slurm --species <genome> --run_cellranger true
By default, the genome is hg19; other options include mm10 and hg38. To run SINCLAIR without CellRanger alignment, set the parameter --run_cellranger false; SINCLAIR will then read the input_manifest_cellranger.csv manifest instead.
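The options described above can be combined in a small wrapper that validates its arguments and prints the resulting command without running it. This is only a sketch: `build_sinclair_cmd` is a hypothetical helper, it uses only the flags and values named in this quickstart, and the restriction to hg19, hg38, and mm10 is an assumption, since the pipeline may accept other genomes.

```shell
# Hypothetical helper: assemble a `sinclair run` command line from a mode,
# a genome, and a CellRanger switch. It only prints the command.
build_sinclair_cmd() {
    mode="$1"
    species="$2"
    run_cellranger="$3"
    case "$mode" in
        local|slurm) ;;
        *) echo "mode must be local or slurm" >&2; return 1 ;;
    esac
    case "$species" in
        hg19|hg38|mm10) ;;    # only the genomes named in this quickstart
        *) echo "unrecognized genome: $species" >&2; return 1 ;;
    esac
    case "$run_cellranger" in
        true|false) ;;
        *) echo "run_cellranger must be true or false" >&2; return 1 ;;
    esac
    printf 'sinclair run --mode %s --species %s --run_cellranger %s\n' \
        "$mode" "$species" "$run_cellranger"
}

# Print the command for a slurm run that starts from existing .h5 counts.
build_sinclair_cmd slurm mm10 false
```

Printing rather than executing keeps the sketch safe to try on a machine where SINCLAIR is not installed; the printed line can be reviewed and then run by hand.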
Expected Outputs
During execution, the SINCLAIR workflow stores all temporary outputs in the work directory. This directory also supports workflow recovery: if the run fails, intermediate files in work allow the pipeline to resume from the point of failure when the user re-runs the pipeline.
Final results are saved in the results directory unless a different output directory was specified in the parameters. The results directory will contain 4 subdirectories:
- batch_correct contains the combined Seurat .rds files for each of the contrasts, with a separate file for each batch correction method, as well as a summary .html file.
- cellranger_counts contains the .h5 counts files for each sample produced by the CellRanger software.
- samplesheets contains the parsed sample sheets based on the manifest files, as interpreted by Nextflow and SINCLAIR.
- seurat contains two subdirectories:
    - merge contains the combined sample Seurat .rds files for each set of contrasts prior to batch correction (which can otherwise be referred to as the "uncorrected" object).
    - preprocess contains the individual sample .rds files.
When proceeding to downstream secondary analysis, such as differential expression, use the batch_correction_integration.html files to determine which batch correction method, or no correction at all, best fits the data. The corresponding .rds file can then be analyzed in R through the Seurat workflow.
For multi-sample analyses, the output directory will have the following structure:
results
├── batch_correct
│   ├── group1-group2_batch_correction_cca.rds
│   ├── group1-group2_batch_correction_harmony.rds
│   ├── group1-group2_batch_correction_integration.html
│   ├── group1-group2_batch_correction_liger.rds
│   └── group1-group2_batch_correction_rpca.rds
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
│   ├── ...
│   └── sampleN
│       └── outs
│           └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    │   └── group1-group2_seurat_merged.rds
    └── preprocess
        ├── sample1_seurat_preprocess.pdf
        ├── sample1_seurat_preprocess.rds
        ├── ...
        ├── sampleN_seurat_preprocess.pdf
        └── sampleN_seurat_preprocess.rds
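Given the layout above, the batch-corrected objects can be enumerated by contrast and method directly from their file names. The sketch below assumes the `<contrast>_batch_correction_<method>.rds` naming shown in the tree; the function name `list_batch_methods` is hypothetical and not part of SINCLAIR.

```shell
# Hypothetical helper: print one tab-separated line per batch-corrected
# Seurat object, in the form: contrast, method, file path.
list_batch_methods() {
    results_dir="$1"
    for f in "$results_dir"/batch_correct/*_batch_correction_*.rds; do
        [ -e "$f" ] || continue    # the glob matched nothing
        name=$(basename "$f" .rds)
        contrast=${name%%_batch_correction_*}    # text before the marker
        method=${name##*_batch_correction_}      # text after the marker
        printf '%s\t%s\t%s\n' "$contrast" "$method" "$f"
    done
}

# Example use (placeholder path):
#   list_batch_methods results
```

The integration .html report is skipped because only .rds files match the glob.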
For single-sample analyses, the output directory will be similar in structure to the above, but without the batch_correct results:
results
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    └── preprocess
        ├── sample1_seurat_preprocess.pdf