SINCLAIR Quickstart

Overview

SINCLAIR is an end-to-end Nextflow pipeline for processing single cell RNASeq (scRNASeq) data. It starts from either raw fastq.gz files or count files in .h5 format, as produced by the 10X CellRanger pipeline, and primarily uses Seurat v5 as its backbone for downstream processing.

In short, SINCLAIR performs the following functions:

  • Alignment from FASTQ (optional)
  • Initial quality control and cell filtering per sample
  • Sample combination
  • Batch correction
  • Preliminary cell type annotation
  • Preliminary clustering

The final outputs are a set of Seurat .rds files that contain all provided samples with and without batch correction. Batch correction is evaluated with several algorithms, each producing its own output file.

Installation and Initialization

Via CCBR Pipeliner on Biowulf (NIH HPC)

If working on Biowulf, start an interactive session with a minimum of 16 CPUs, 64 GB of memory, 8 hours of wall time, and a local scratch allocation (lscratch) of 128 GB:

sinteractive --mem=64g --cpus-per-task=16 --time=8:00:00 --gres=lscratch:128

As of CCBR Pipeliner release 8, instantiate Pipeliner as a module:

module load ccbrpipeliner

Navigate to your working directory and initialize SINCLAIR:

sinclair init

Setting up input files

All input files should follow the naming convention used by CellRanger (https://www.10xgenomics.com/support/jp/software/cell-ranger/8.0/tutorials/inputs/cr-specifying-fastqs). When starting from fastq files, each sample should have its own directory containing at least the R1 and R2 files (i.e. forward and reverse reads). I1 index files and reads from multiple lanes may also be included. Example minimum data structure for two samples:

`/path/to/sample1/sample1_S1_L001_R1_001.fastq.gz`
`/path/to/sample1/sample1_S1_L001_R2_001.fastq.gz`

`/path/to/sample2/sample2_S1_L001_R1_001.fastq.gz`
`/path/to/sample2/sample2_S1_L001_R2_001.fastq.gz`

When starting from .h5 files that are generated from CellRanger alignment, the directory structure is simpler:

/path/to/sample1/outputs/filtered_feature_bc_matrix.h5
/path/to/sample2/outputs/filtered_feature_bc_matrix.h5

Each .h5 matrix file should be named filtered_feature_bc_matrix.h5, with the sample name given in the directory path.
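The FASTQ layout described above can be checked before a run with a short shell sketch. The helper name `check_fastq_dir` and the glob pattern are illustrative assumptions based on the CellRanger naming convention; they are not part of SINCLAIR. The demo uses empty placeholder files in a temporary directory.

```shell
# Hypothetical helper (not part of SINCLAIR): confirm that a sample
# directory holds at least one R1 and one R2 FASTQ file.
check_fastq_dir() {
    local dir=$1 read status=0
    for read in R1 R2; do
        # Assumed pattern: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
        if ! ls "$dir"/*_S[0-9]*_L[0-9]*_"${read}"_001.fastq.gz >/dev/null 2>&1; then
            echo "missing ${read} file in ${dir}"
            status=1
        fi
    done
    return $status
}

# Demo with empty placeholder files in a temporary directory.
demo=$(mktemp -d)
mkdir -p "$demo/sample1" "$demo/sample2"
touch "$demo/sample1/sample1_S1_L001_R1_001.fastq.gz" \
      "$demo/sample1/sample1_S1_L001_R2_001.fastq.gz" \
      "$demo/sample2/sample2_S1_L001_R1_001.fastq.gz"

check_fastq_dir "$demo/sample1" && echo "sample1 looks complete"
check_fastq_dir "$demo/sample2" || echo "sample2 is incomplete"
```

In this demo, sample2 deliberately lacks an R2 file, so the helper reports it as incomplete.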

Setting Up Manifest Files

Manifest files are comma-separated values (.csv) files in the assets folder of the SINCLAIR working directory. They contain the file paths for the input sample files and the contrasts to be included as group identities for downstream differential expression.

Two options exist for sample inputs: input_manifest.csv and input_manifest_cellranger.csv. Usage depends on the entry point, i.e. whether the samples have already been aligned using CellRanger.

input_manifest.csv

This file is used when starting from fastqs and requires alignment via CellRanger. The .csv file in assets can be used as a template and modified as follows:

| masterID | uniqueID | groupID | dataType | input_dir |
| -------- | -------- | ------- | -------- | --------- |
| parentID_1 | uniqueSampleID_1 | groupID_1 | gex | path/to/fastqs/1 |
| parentID_2 | uniqueSampleID_2 | groupID_2 | gex | path/to/fastqs/2 |

The masterID column can be used to indicate that samples are replicates of the same sample, so its values are not required to be unique. The groupID indicates the contrast group for each sample. The input_dir must point to a set of fastq files as generated by the 10X Chromium pipeline, or fastq files that follow the same naming convention.
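As a rough illustration of the manifest described above, the sketch below writes an input_manifest.csv and runs two sanity checks with awk. It assumes the header row matches the column names shown in the table and that uniqueID values must be unique (implied by the name, not stated explicitly). A temporary directory stands in for the assets folder.

```shell
# Temporary stand-in for the assets folder of a SINCLAIR working directory.
assets="$(mktemp -d)/assets"
mkdir -p "$assets"

# Two replicates of the same parent sample, placed in different groups.
cat > "$assets/input_manifest.csv" <<'EOF'
masterID,uniqueID,groupID,dataType,input_dir
parentID_1,uniqueSampleID_1,groupID_1,gex,path/to/fastqs/1
parentID_1,uniqueSampleID_2,groupID_2,gex,path/to/fastqs/2
EOF

# Count rows that do not have exactly five comma-separated fields.
bad_rows=$(awk -F',' 'NF != 5 {n++} END {print n+0}' "$assets/input_manifest.csv")

# Count uniqueID values that appear more than once (header skipped).
dup_ids=$(awk -F',' 'NR > 1 && seen[$2]++ {n++} END {print n+0}' "$assets/input_manifest.csv")

echo "rows with the wrong number of fields: $bad_rows"
echo "repeated uniqueID values: $dup_ids"
```

The repeated masterID in the example is intentional: it marks the two rows as replicates of one parent sample.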

input_manifest_cellranger.csv

This file is used if alignment and read counting have already been performed, such as through 10X CellRanger software or a similar tool. The input is expected to be in .h5 format. Tools will be made available to convert mtx triplet files into .h5.

masterID uniqueID groupID dataType input_dir
parentID_1 uniqueSampleID_1 groupID_1 gex path/to/h5Counts/1
parentID_2 uniqueSampleID_2 groupID_2 gex path/to/h5Counts/2

The primary difference here is that input_dir points to the directory containing the .h5 file rather than the uncounted .fastq.gz files.

contrast_manifest.csv

This file contains the comparisons to be generated, and will indicate samples that should be included in different combinations. For each contrast indicated, only the samples within the specified groups will be processed and combined.

contrast1 contrast2 contrast3
group1 group2
group1 group2 group3

You can include as many contrasts as there are groups. Each contrast needs a minimum of 2 groups, and every group must be specified in both the input manifest (input_manifest.csv or input_manifest_cellranger.csv) and contrast_manifest.csv. If running SINCLAIR on a single sample, the above can be formatted as:

contrast1
group1
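Because the group names must agree between the input manifest and the contrast manifest, a quick cross-check can catch typos before a run. The sketch below assumes that contrast_manifest.csv has a header row of contrast1, contrast2, ... and that each following row lists the groups for one comparison; that reading of the table above is an assumption. Both example files are written to a temporary directory.

```shell
work=$(mktemp -d)

cat > "$work/input_manifest.csv" <<'EOF'
masterID,uniqueID,groupID,dataType,input_dir
parentID_1,uniqueSampleID_1,group1,gex,path/to/fastqs/1
parentID_2,uniqueSampleID_2,group2,gex,path/to/fastqs/2
EOF

# Assumed layout: one comparison per row, one group per column.
cat > "$work/contrast_manifest.csv" <<'EOF'
contrast1,contrast2,contrast3
group1,group2,
group1,group2,group3
EOF

# First file: remember every groupID (column 3, header skipped).
# Second file: print any non-empty group that was never seen.
missing=$(awk -F',' '
    NR == FNR { if (FNR > 1) known[$3] = 1; next }
    FNR > 1   { for (i = 1; i <= NF; i++) if ($i != "" && !($i in known)) print $i }
' "$work/input_manifest.csv" "$work/contrast_manifest.csv" | sort -u)

echo "groups missing from the input manifest: ${missing:-none}"
```

Here group3 appears in a contrast but has no sample in the input manifest, so it is reported as missing.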

Starting the Run

These instructions will start a basic run of SINCLAIR. For more detailed instructions, please refer to 3. Running the Pipeline. The following commands apply when running SINCLAIR from CCBR Pipeliner. When running from a GitHub installation, replace sinclair with bin/sinclair.

To start a local instance with CellRanger alignment (which is also the default setting):

sinclair run --mode local --species <genome> --run_cellranger true

To start a slurm run:

sinclair run --mode slurm --species <genome> --run_cellranger true

By default, the genome is hg19; other options include mm10 and hg38. To run SINCLAIR without CellRanger alignment, set --run_cellranger false. SINCLAIR will then read the input_manifest_cellranger.csv manifest instead of input_manifest.csv.
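The relationship between the --run_cellranger setting and the manifest that SINCLAIR reads can be summarized in a small sketch. The helper names `manifest_for` and `build_run_cmd` are hypothetical. The script only prints the command string, built from the flags shown above, and does not invoke sinclair.

```shell
# Hypothetical helper: name the manifest that matches --run_cellranger.
manifest_for() {
    if [ "$1" = "true" ]; then
        echo "input_manifest.csv"             # start from fastq files
    else
        echo "input_manifest_cellranger.csv"  # start from .h5 counts
    fi
}

# Hypothetical helper: assemble the command string without executing it.
build_run_cmd() {
    local mode=$1 species=$2 run_cellranger=$3
    echo "sinclair run --mode $mode --species $species --run_cellranger $run_cellranger"
}

run_cellranger=false
echo "manifest to fill in: assets/$(manifest_for "$run_cellranger")"
echo "command to run:      $(build_run_cmd slurm hg38 "$run_cellranger")"
```

Printing the command first makes it easy to confirm that the chosen manifest and flags agree before submitting a run.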

Expected Outputs

During execution, the SINCLAIR workflow stores all temporary outputs in the work directory. This directory also supports workflow recovery: if the run fails, intermediate files in work allow the pipeline to resume from the point of failure when the user re-runs the pipeline.

Final results are saved in the results directory unless a different output directory was specified in the parameters. The results directory will contain 4 subdirectories:

  • batch_correct contains the combined Seurat .rds files for each of the contrasts, with a separate file for each batch correction method, as well as a summary .html file.
  • cellranger_counts contains the .h5 counts files for each sample produced by the CellRanger software.
  • samplesheets contains the parsed sample sheets based on the manifest files, as interpreted by Nextflow and SINCLAIR.
  • seurat contains two subdirectories:
    • merge contains the combined sample Seurat .rds files for each set of contrasts prior to batch correction (which can otherwise be referred to as the "uncorrected" object).
    • preprocess contains the individual sample .rds files.

When proceeding to downstream secondary analysis, such as differential expression, please use the batch_correction_integration.html files to determine which batch correction method, if any, best fits the data. The appropriate .rds file can then be analyzed in R through the Seurat workflow.

For multi-sample analyses, the output directory will have the following structure:

results
├── batch_correct
│   ├── group1-group2_batch_correction_cca.rds
│   ├── group1-group2_batch_correction_harmony.rds
│   ├── group1-group2_batch_correction_integration.html
│   ├── group1-group2_batch_correction_liger.rds
│   └── group1-group2_batch_correction_rpca.rds
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
│   ├── ...
│   └── sampleN
│       └── outs
│           └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    │   └── group1-group2_seurat_merged.rds
    └── preprocess
        ├── sample1_seurat_preprocess.pdf
        ├── sample1_seurat_preprocess.rds
        ├── ...
        ├── sampleN_seurat_preprocess.pdf
        └── sampleN_seurat_preprocess.rds
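
After a multi-sample run, a short sketch such as the following can report which batch-corrected objects are present for a contrast. The helper name `list_batch_outputs` is hypothetical. The method suffixes and file name pattern are taken from the tree above, and the demo uses a mock results directory with empty placeholder files rather than real pipeline output.

```shell
# Hypothetical helper: report which batch-correction outputs exist
# for one contrast, using the file names shown in the tree above.
list_batch_outputs() {
    local results=$1 contrast=$2 method file
    for method in cca harmony liger rpca; do
        file="$results/batch_correct/${contrast}_batch_correction_${method}.rds"
        if [ -f "$file" ]; then
            echo "found   $method"
        else
            echo "missing $method"
        fi
    done
}

# Mock results directory with two of the four expected files.
demo=$(mktemp -d)
mkdir -p "$demo/results/batch_correct"
touch "$demo/results/batch_correct/group1-group2_batch_correction_cca.rds" \
      "$demo/results/batch_correct/group1-group2_batch_correction_harmony.rds"

list_batch_outputs "$demo/results" group1-group2
```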

For single-sample analyses, the output directory will be similar in structure to the above, but without the batch_correct results:

results
├── cellranger_counts
│   ├── sample1
│   │   └── outs
│   │       └── filtered_feature_bc_matrix.h5
├── samplesheets
│   ├── project_contrast_samplesheet.csv
│   ├── project_gex_samplesheet.csv
│   └── project_groups_samplesheet.csv
└── seurat
    ├── merge
    │   ├── group1-group2_seurat_merged.pdf
    └── preprocess
        ├── sample1_seurat_preprocess.pdf