2. Preparing Files
The pipeline is controlled by editing configuration and manifest files. Defaults are found in the nextflow.config file.
Overview of Single Cell RNASeq Gene Expression Process
2.1 Preparing input data
SINCLAIR has been designed to start at one of two entry points: raw unaligned reads in fastq.gz format, or pre-aligned counts in .h5 format. File names should follow the naming convention used by 10X CellRanger; files can be renamed as needed to meet this convention. The directory structures described below also illustrate the expected file naming format.
2.1.1 Starting from fastq.gz files
The fastq.gz file names need to resemble the format they would have if they were generated by 10X CellRanger with mkfastq, bcl-convert, or bcl2fastq. If the files have already been generated via CellRanger, then proceed to the next step. Otherwise, the sample files and directories should be in the following format.
This would be the file structure of a dataset with 2 samples that were each run in 2 lanes:
/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_I1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_R2_001.fastq.gz
/path/to/sample1/sample1_S1_L0002_I1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_I1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0002_I1_001.fastq.gz
Not every dataset will include all of these files. The bare minimum required to run the CellRanger alignment is the forward and reverse reads (R1 and R2) from a single lane. Using the above example:
/path/to/sample1/sample1_S1_L0001_R1_001.fastq.gz
/path/to/sample1/sample1_S1_L0001_R2_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R1_001.fastq.gz
/path/to/sample2/sample2_S1_L0001_R2_001.fastq.gz
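A quick pre-flight check of file names against this convention can be sketched as follows. The regular expression is an assumption inferred only from the example paths above (sample name, `S` index, `L` lane, read type, `001` suffix); it is not taken from SINCLAIR or CellRanger, and the helper name `check_fastq_names` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the example file names above. This is an assumption,
# not the pattern SINCLAIR or CellRanger uses internally.
FASTQ_RE = re.compile(
    r"^(?P<sample>.+)_S(?P<index>\d+)_L(?P<lane>\d+)"
    r"_(?P<read>R1|R2|I1|I2)_001\.fastq\.gz$"
)

def check_fastq_names(filenames):
    """Check that every (sample, lane) pair has both R1 and R2.

    Returns a tuple (missing, unmatched):
      missing   - dict mapping (sample, lane) to the set of absent read types
      unmatched - list of file names that do not fit the pattern at all
    """
    seen = defaultdict(set)
    unmatched = []
    for name in filenames:
        match = FASTQ_RE.match(name)
        if match is None:
            unmatched.append(name)
            continue
        seen[(match["sample"], match["lane"])].add(match["read"])

    missing = {}
    for key, reads in seen.items():
        absent = {"R1", "R2"} - reads
        if absent:
            missing[key] = absent
    return missing, unmatched

if __name__ == "__main__":
    files = [
        "sample1_S1_L0001_R1_001.fastq.gz",
        "sample1_S1_L0001_R2_001.fastq.gz",
        "sample2_S1_L0001_R1_001.fastq.gz",  # R2 deliberately left out
        "notes.txt",
    ]
    missing, unmatched = check_fastq_names(files)
    print("lanes missing reads:", missing)
    print("unrecognised names:", unmatched)
```

In this example, sample2 would be flagged because its lane has no R2 file, and `notes.txt` would be listed as unrecognised. Index files (I1, I2) are accepted but not required, matching the bare minimum described above.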
2.1.2 Starting from aligned .h5 files
When starting from .h5 files that are generated from CellRanger alignment, the directory structure is simpler:
/path/to/sample1/outs/filtered_feature_bc_matrix.h5
/path/to/sample2/outs/filtered_feature_bc_matrix.h5
As before, it is strongly recommended to follow the naming conventions used by CellRanger, as SINCLAIR looks primarily for the filtered_feature_bc_matrix.h5 file, which has already had empty droplets algorithmically filtered out.
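To confirm that each sample directory contains the expected file before launching the pipeline, a minimal sketch is shown below. It assumes the directory given for each sample is the one that directly holds filtered_feature_bc_matrix.h5; the function name `dirs_missing_h5` is hypothetical and not part of SINCLAIR.

```python
from pathlib import Path

# File name SINCLAIR primarily looks for, per the text above.
H5_NAME = "filtered_feature_bc_matrix.h5"

def dirs_missing_h5(sample_dirs):
    """Return the sample directories that do not contain the expected .h5 file."""
    return [str(d) for d in sample_dirs if not (Path(d) / H5_NAME).is_file()]

if __name__ == "__main__":
    # Placeholder paths from the example above; replace with real directories.
    dirs = ["/path/to/sample1/outs", "/path/to/sample2/outs"]
    for d in dirs_missing_h5(dirs):
        print(f"no {H5_NAME} found in {d}")
```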
Addendum: Starting from matrix files
Older versions of CellRanger, as well as some other workflows (e.g. DropSeq, Smart-Seq, PipSeq), tend to produce a file structure that resembles the following for each sample:
/path/to/sample1/outs/filtered_feature_matrix/matrix.mtx.gz
/path/to/sample1/outs/filtered_feature_matrix/features.tsv.gz
/path/to/sample1/outs/filtered_feature_matrix/barcodes.tsv.gz
In this case, the files will need to be converted into a .h5 file and organized as described previously. This can be done with a script such as this:
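# Convert a matrix/features/barcodes triplet into a single 10X-style .h5 file.
# Arguments (in order): sample name, matrix file, features file, barcodes file
library(Seurat)
library(DropletUtils)

args <- commandArgs(trailingOnly = TRUE)
sampleName <- args[1]
mtx_file <- args[2]
features_file <- args[3]
barcodes_file <- args[4]

# Read the three files into a sparse feature-by-cell count matrix
counts <- Seurat::ReadMtx(mtx = mtx_file, cells = barcodes_file, features = features_file)

# Write the counts to <sampleName>.h5; the .h5 extension selects HDF5 output
outfile <- paste0(sampleName, ".h5")
DropletUtils::write10xCounts(x = counts, path = outfile)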
2.2 Preparing Manifests
Two manifests are required. They describe the samples and the desired contrasts. These files are:
- assets/input_manifest.csv OR assets/input_manifest_cellranger.csv
- assets/contrast_manifest.csv
2.2.1 Input Manifest
This manifest contains sample-level information. It includes the following column headers:
- masterID: This is the biological sample ID; duplicates are allowed in this column
- uniqueID: This is a unique sample-level ID; duplicates are not allowed in this column
- groupID: This is the group ID, which should match the groups in the contrast_manifest; duplicates are allowed in this column
- dataType: This is the data type of the input sample; currently only gex is permitted
- input_dir: This is the input directory for the sample's data files (e.g. /path/to/sample1/fastq or /path/to/sample1/outs)
An example input manifest is shown below:
masterID | uniqueID | groupID | dataType | input_dir |
---|---|---|---|---|
WB_Lysis_1 | sample1 | group1 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample1 |
WB_Lysis_1 | sample2 | group1 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample2 |
WB_Lysis_2 | sample3 | group2 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample3 |
WB_Lysis_2 | sample4 | group2 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample4 |
WB_Lysis_3 | sample5 | group3 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample5 |
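The column rules above can be checked automatically before a run. The sketch below is a minimal illustration, assuming the manifest is a comma-separated file whose header row uses exactly the column names listed above; the function name `validate_input_manifest` is hypothetical and not part of SINCLAIR.

```python
import csv
import io

REQUIRED_COLUMNS = ["masterID", "uniqueID", "groupID", "dataType", "input_dir"]
ALLOWED_DATA_TYPES = {"gex"}  # the only dataType currently permitted

def validate_input_manifest(handle):
    """Return a list of human-readable problems found in an input manifest."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen_ids = set()
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        unique_id = row["uniqueID"]
        data_type = row["dataType"]
        if unique_id in seen_ids:
            problems.append(f"line {line_no}: duplicate uniqueID '{unique_id}'")
        seen_ids.add(unique_id)
        if data_type not in ALLOWED_DATA_TYPES:
            problems.append(f"line {line_no}: unsupported dataType '{data_type}'")
    return problems

if __name__ == "__main__":
    example = io.StringIO(
        "masterID,uniqueID,groupID,dataType,input_dir\n"
        "WB_Lysis_1,sample1,group1,gex,/path/to/sample1\n"
        "WB_Lysis_1,sample1,group1,gex,/path/to/sample2\n"  # repeated uniqueID
    )
    for problem in validate_input_manifest(example):
        print(problem)
```

Duplicate masterID and groupID values are deliberately not reported, since both columns allow them.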
2.2.2 Contrast Manifest
This manifest lists the groups to compare in each differential comparison, with one contrast per row. A few requirements:
- groups listed must match groups within the input_manifest groupID column
- headers should be included for the maximum number of groups in any single contrast. In the example below, the second contrast contains 3 groups, and so the header includes contrast1 through contrast3
- additional groups can be added by extending the header with further contrast columns, as needed
An example contrast file:
contrast1 | contrast2 | contrast3 |
---|---|---|
group1 | group2 | |
group1 | group2 | group3 |
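The first requirement, that every group in a contrast also appears in the input manifest, can be checked with a short script. This is a sketch under assumptions: the contrast manifest is taken to be a comma-separated file with one contrast per row and empty cells for unused columns, and the function names `read_contrasts` and `unknown_groups` are hypothetical.

```python
import csv
import io

def read_contrasts(handle):
    """Return one list of group names per contrast row, skipping blank cells."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the contrast1, contrast2, ... header row
    contrasts = []
    for row in reader:
        groups = [cell.strip() for cell in row if cell.strip()]
        if groups:  # ignore completely empty lines
            contrasts.append(groups)
    return contrasts

def unknown_groups(contrasts, known_groups):
    """Return group names used in contrasts but absent from the input manifest."""
    used = {group for contrast in contrasts for group in contrast}
    return sorted(used - set(known_groups))

if __name__ == "__main__":
    manifest = io.StringIO(
        "contrast1,contrast2,contrast3\n"
        "group1,group2,\n"
        "group1,group2,group3\n"
    )
    contrasts = read_contrasts(manifest)
    # groupID values as they would be collected from the input manifest
    known = {"group1", "group2", "group3"}
    print(contrasts)
    print("unknown groups:", unknown_groups(contrasts, known))
```

With the example manifests in this section, every group is known, so the list of unknown groups would be empty.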