2. Preparing Files

The pipeline is controlled by editing configuration and manifest files. Defaults are found in the nextflow.config file.

Figure: SINCLAIR process overview (single-cell RNA-Seq gene expression process).

2.1 Preparing input data

SINCLAIR can start from one of two entry points: raw unaligned reads in fastq.gz format, or pre-aligned counts in .h5 format. File names should follow the naming convention used by 10X CellRanger; rename files as needed to meet it. The directory structures shown below also illustrate the expected file naming format.

2.1.1 Starting from fastq.gz files

The fastq.gz file names need to follow the format produced by 10X CellRanger with mkfastq, bcl-convert or bcl2fastq. If the files were already generated via CellRanger, proceed to the next step. Otherwise, the sample files and directories should be arranged in the following format.

This is the file structure of a dataset with 2 samples that were each run in 2 lanes:

/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_I1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_I1_001.fastq.gz

/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_I1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_I1_001.fastq.gz

Not every dataset will include all of these files. The bare minimum required to run the CellRanger alignment is the forward and reverse reads (R1 and R2) from a single lane. Using the above example:

/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz

/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
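The naming rules above can be checked before launching the pipeline. The following is a minimal sketch, not part of SINCLAIR: the function name `check_fastq_names` and the regular expression are assumptions based on the example file names shown here, and the lane field is matched with any number of digits so that both `L001`-style and `L0001`-style names pass.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
# The lane field accepts any number of digits (e.g. L001 or L0001).
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d+)_(?P<read>R1|R2|I1|I2)_001\.fastq\.gz$"
)

def check_fastq_names(filenames):
    """Return (bad_names, incomplete_lanes) for a list of FASTQ file names."""
    bad_names = []
    reads_by_lane = defaultdict(set)
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match is None:
            bad_names.append(name)
            continue
        reads_by_lane[(match["sample"], match["lane"])].add(match["read"])
    # A lane is usable only if both the R1 and R2 files are present.
    incomplete = sorted(
        key for key, reads in reads_by_lane.items() if not {"R1", "R2"} <= reads
    )
    return bad_names, incomplete

names = [
    "sample1_S1_L0001_R1_001.fastq.gz",
    "sample1_S1_L0001_R2_001.fastq.gz",
    "sample1_S1_L0002_R1_001.fastq.gz",  # R2 missing for this lane
    "sample1_reads.fastq.gz",            # does not follow the convention
]
bad, incomplete = check_fastq_names(names)
print("Badly named files:", bad)
print("Lanes missing R1 or R2:", incomplete)
```

Index files (I1, I2) are accepted but not required, which mirrors the minimum requirement described above.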

2.1.2 Starting from aligned .h5 files

When starting from .h5 files that are generated from CellRanger alignment, the directory structure is simpler:

/path/to/sample1/outs/filtered_feature_bc_matrix.h5
/path/to/sample2/outs/filtered_feature_bc_matrix.h5

As before, it is strongly recommended to follow the naming conventions generated by CellRanger, as SINCLAIR looks primarily for the filtered_feature_bc_matrix.h5 file, which has already had empty droplets algorithmically filtered out.
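A quick way to confirm that every sample directory holds the expected file is sketched below. This is an illustrative helper only: the function name `missing_h5` is hypothetical, and it assumes the file sits directly inside each directory passed in. The demonstration builds a throwaway directory tree rather than touching real data.

```python
import tempfile
from pathlib import Path

H5_NAME = "filtered_feature_bc_matrix.h5"

def missing_h5(sample_dirs):
    """Return the sample directories that do not contain the filtered .h5 file."""
    return [str(d) for d in sample_dirs if not (Path(d) / H5_NAME).is_file()]

# Demonstration on a temporary directory tree (paths are illustrative only).
with tempfile.TemporaryDirectory() as root:
    good = Path(root) / "sample1" / "outs"
    bad = Path(root) / "sample2" / "outs"
    good.mkdir(parents=True)
    bad.mkdir(parents=True)
    (good / H5_NAME).touch()  # only sample1 gets the expected file
    missing = [Path(d).parent.name for d in missing_h5([good, bad])]
    print("Samples without a filtered .h5 file:", missing)
```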

Addendum: Starting from matrix files

Older versions of CellRanger, as well as some other workflows (e.g. DropSeq, Smart-Seq, PipSeq), tend to produce a file structure that resembles the following for each sample:

/path/to/sample1/outs/filtered_feature_matrix/matrix.mtx.gz
/path/to/sample1/outs/filtered_feature_matrix/features.tsv.gz
/path/to/sample1/outs/filtered_feature_matrix/barcodes.tsv.gz

In this case, the files will need to be converted into a .h5 file and organized as described previously. This can be done with a script such as this:

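# Convert a matrix.mtx.gz / features.tsv.gz / barcodes.tsv.gz triplet into a .h5 file.
# Arguments (in order): sampleName, matrix file, features file, barcodes file
library(Seurat)
library(DropletUtils)

args <- commandArgs(trailingOnly = TRUE)
if (length(args) != 4) {
  stop("Expected 4 arguments: sampleName, mtx file, features file, barcodes file")
}

sampleName    <- args[1]
mtx_file      <- args[2]
features_file <- args[3]
barcodes_file <- args[4]

# Read the sparse count matrix (features x cells)
counts <- Seurat::ReadMtx(mtx = mtx_file, cells = barcodes_file, features = features_file)

# write10xCounts selects HDF5 output automatically when the path ends in .h5
outfile <- paste0(sampleName, ".h5")
DropletUtils::write10xCounts(path = outfile, x = counts)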

2.2 Preparing Manifests

Two manifests are required. They describe the samples and the desired contrasts. These files are:

  • assets/input_manifest.csv OR assets/input_manifest_cellranger.csv
  • assets/contrast_manifest.csv

2.2.1 Input Manifest

This manifest contains sample-level information. It includes the following column headers:

  • masterID: This is the biological sample ID; duplicates are allowed in this column
  • uniqueID: This is a unique sample level ID; duplicates are not allowed in this column
  • groupID: This is the group ID, which must match the groups listed in the contrast manifest; duplicates are allowed in this column
  • dataType: This is the data type of the input sample; currently only gex is permitted
  • input_dir: This is the input directory for the data files of the sample type (e.g. /path/to/sample1/fastq or /path/to/sample1/outs)

An example input manifest is shown below:

masterID uniqueID groupID dataType input_dir
WB_Lysis_1 sample1 group1 gex /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample1
WB_Lysis_1 sample2 group1 gex /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample2
WB_Lysis_2 sample3 group2 gex /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample3
WB_Lysis_2 sample4 group2 gex /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample4
WB_Lysis_3 sample5 group3 gex /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample5
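The column rules above (required headers, unique uniqueID values, gex as the only permitted dataType) can be checked with a short script. This is a sketch under assumptions: the function name `validate_input_manifest` is hypothetical, the manifest is assumed to be comma-separated because of its .csv extension, and the example rows use placeholder paths.

```python
import csv
import io

REQUIRED_COLUMNS = ["masterID", "uniqueID", "groupID", "dataType", "input_dir"]

def validate_input_manifest(handle):
    """Return a list of problems found in an input manifest (empty if none)."""
    reader = csv.DictReader(handle)  # assumes comma-separated values
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["uniqueID"] in seen:
            problems.append(f"line {line_no}: duplicate uniqueID {row['uniqueID']}")
        seen.add(row["uniqueID"])
        if row["dataType"] != "gex":
            problems.append(f"line {line_no}: unsupported dataType {row['dataType']}")
    return problems

# Deliberately broken example: repeated uniqueID and a non-gex dataType.
example = io.StringIO(
    "masterID,uniqueID,groupID,dataType,input_dir\n"
    "WB_Lysis_1,sample1,group1,gex,/path/to/sample1\n"
    "WB_Lysis_1,sample1,group1,atac,/path/to/sample2\n"
)
problems = validate_input_manifest(example)
for problem in problems:
    print(problem)
```

To check a real file, pass an open file handle, for example `open("assets/input_manifest.csv", newline="")`, in place of the in-memory example.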

2.2.2 Contrast Manifest

This manifest lists the groups to use in each differential comparison. A few requirements:

  • groups listed must match groups within the input_manifest groupID column
  • the header must have as many columns as the largest contrast has groups. In the example below, the second contrast contains 3 groups, so the header includes contrast1-contrast3
  • larger contrasts can be accommodated by extending the header with additional contrast columns, as needed

An example contrast file:

contrast1 contrast2 contrast3
group1 group2
group1 group2 group3
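The requirements above can be cross-checked against the input manifest before running the pipeline. The sketch below is illustrative only: the function name `check_contrasts` is hypothetical, the file is assumed to be comma-separated, and the set of known groups would normally be collected from the groupID column of the input manifest.

```python
import csv
import io

def check_contrasts(contrast_handle, known_groups):
    """Return problems found in a contrast manifest, given the valid groupIDs."""
    rows = list(csv.reader(contrast_handle))  # assumes comma-separated values
    if not rows:
        return ["contrast manifest is empty"]
    header, contrasts = rows[0], rows[1:]
    problems = []
    for line_no, row in enumerate(contrasts, start=2):  # line 1 is the header
        groups = [g for g in row if g.strip()]  # ignore padding in shorter contrasts
        if len(groups) > len(header):
            problems.append(
                f"line {line_no}: {len(groups)} groups but only {len(header)} header columns"
            )
        for group in groups:
            if group not in known_groups:
                problems.append(f"line {line_no}: unknown group {group}")
    return problems

# group4 is not present in the (assumed) input manifest groups.
contrast_text = io.StringIO(
    "contrast1,contrast2,contrast3\n"
    "group1,group2\n"
    "group1,group2,group4\n"
)
issues = check_contrasts(contrast_text, {"group1", "group2", "group3"})
for issue in issues:
    print(issue)
```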