2. Preparing Files
The pipeline is controlled by editing configuration and manifest files. Defaults are found in the /PIPELINEDIR/conf and /PIPELINEDIR/ directories.
Overview of Single Cell RNASeq Gene Expression Process
2.1 Configs
The configuration files control parameters and software of the pipeline. These files are listed below:
- nextflow.config
- conf/base.config
- conf/modules.config
- conf/process_params.config
- conf/Rpack.config
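Before editing, it can help to confirm that all five files are present. The following is a minimal sketch, not part of the pipeline: the helper name `missing_configs` is hypothetical, and it assumes the paths above are relative to the pipeline directory.

```python
from pathlib import Path

# Config files named in this section, relative to the pipeline directory.
EXPECTED_CONFIGS = [
    "nextflow.config",
    "conf/base.config",
    "conf/modules.config",
    "conf/process_params.config",
    "conf/Rpack.config",
]


def missing_configs(pipeline_dir):
    """Return the expected config files that are absent under pipeline_dir.

    Hypothetical helper for illustration; the pipeline does not ship it.
    """
    root = Path(pipeline_dir)
    return [name for name in EXPECTED_CONFIGS if not (root / name).is_file()]
```

Calling `missing_configs("/PIPELINEDIR")` returns an empty list when every file is in place, and otherwise lists the files that still need to be restored or created.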
2.1.1 Nextflow Config
This configuration file sets the global information used throughout the pipeline.
2.1.2 Base Config
This configuration file controls job submission to the Biowulf HPC. Its parameters can be changed in two ways: by editing the default settings, or by creating or editing individual rules. Edit these parameters with caution, and only after significant testing.
2.1.3 Modules Config
The configuration file dictates process-specific processing parameters, including:
- the version of each software or program that is being used in the pipeline
- output location and file names
- additional arguments to be passed to the process
2.1.4 R Package Config
This configuration file dictates which R libraries, and which versions, are loaded into the accompanying R script.
2.1.5 Process Parameters
This configuration file dictates process-specific user parameters, which vary for each process. For example, users can choose different resolution values or QC methods.
2.2 Preparing Manifests
Two manifests are required. They describe the samples and the desired contrasts. These files are:
- /assets/input_manifest.csv
- /assets/contrast_manifest.csv
2.2.1 Input Manifest
This manifest contains sample-level information. It includes the following column headers:
- masterID: This is the biological sample ID; duplicates are allowed in this column
- uniqueID: This is a unique sample level ID; duplicates are not allowed in this column
- groupID: This is the group ID, which should match a group in the contrast_manifest; duplicates are allowed in this column
- dataType: This is the data type of the input sample; options are 'gex', 'atac', or 'vdj'
- input_dir: This is the input directory containing the data files for the sample (e.g., "/path/to/sample1/fastq")
An example input manifest is shown below:
masterID | uniqueID | groupID | dataType | input_dir |
---|---|---|---|---|
WB_Lysis_1 | sample1 | group1 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/ |
WB_Lysis_1 | sample2 | group1 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/ |
WB_Lysis_2 | sample3 | group2 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/ |
WB_Lysis_2 | sample4 | group2 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/ |
WB_Lysis_3 | sample5 | group3 | gex | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/WB_Lysis_Granulocytes_3p_Introns_8kCells_fastqs/sample5 |
WB_Lysis_1 | sample6 | group1 | atac | /data/CCBR_Pipeliner/Pipelines/TechDev_scRNASeq_Dev2023/test_dir/ |
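The column rules above can be checked before a run. The sketch below is illustrative only and is not the pipeline's own validation: the function name `validate_input_manifest` is hypothetical, and it assumes the manifest is a comma-separated file with a header row, as the `.csv` extension suggests. The paths in the usage lines are placeholders.

```python
import csv
import io

REQUIRED_COLUMNS = ["masterID", "uniqueID", "groupID", "dataType", "input_dir"]
ALLOWED_DATATYPES = {"gex", "atac", "vdj"}


def validate_input_manifest(handle):
    """Return a list of problems found in an input manifest CSV.

    Hypothetical helper; checks only the rules stated in this section.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        uid = row["uniqueID"]
        if uid in seen:  # uniqueID must not repeat; masterID and groupID may
            problems.append(f"line {line_no}: duplicate uniqueID '{uid}'")
        seen.add(uid)
        if row["dataType"] not in ALLOWED_DATATYPES:
            problems.append(f"line {line_no}: unknown dataType '{row['dataType']}'")
    return problems


# Usage with an in-memory manifest; the second row repeats a uniqueID
# and uses a dataType that is not one of the listed options.
example = io.StringIO(
    "masterID,uniqueID,groupID,dataType,input_dir\n"
    "WB_Lysis_1,sample1,group1,gex,/path/to/sample1/fastq\n"
    "WB_Lysis_1,sample1,group1,rna,/path/to/sample1/fastq\n"
)
print(validate_input_manifest(example))
```

To check a real file, pass an open handle instead, for example `open("assets/input_manifest.csv", newline="")`.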
2.2.2 Contrast Manifest
This manifest lists the groups to be compared in differential analyses. A few requirements:
- groups listed must match groups within the input_manifest groupID column
- headers should be included for the maximum number of groups in any contrast. In the example below, the second contrast contains 3 groups, and so the header includes contrast1-contrast3
- more groups can be added to a contrast by extending the header, and more contrasts can be added as additional rows, as needed
An example contrast file:
contrast1 | contrast2 | contrast3 |
---|---|---|
group1 | group2 | |
group1 | group2 | group3 |
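The first requirement can be checked against the group IDs from the input manifest. This is a minimal sketch under stated assumptions: the function name `unknown_contrast_groups` is hypothetical, the file is assumed to be comma-separated with one contrast per row, and empty cells are treated as unused slots.

```python
import csv
import io


def unknown_contrast_groups(contrast_handle, known_groups):
    """Return (line number, group) pairs for groups absent from known_groups.

    Hypothetical helper; known_groups is the set of groupID values
    taken from the input manifest.
    """
    reader = csv.reader(contrast_handle)
    next(reader, None)  # skip the contrast1..contrastN header row
    unknown = []
    for line_no, row in enumerate(reader, start=2):
        for group in row:
            group = group.strip()
            # Empty cells pad shorter contrasts and are ignored.
            if group and group not in known_groups:
                unknown.append((line_no, group))
    return unknown


# Usage: the second contrast names group4, which is not a known group.
contrasts = io.StringIO(
    "contrast1,contrast2,contrast3\n"
    "group1,group2,\n"
    "group1,group2,group4\n"
)
print(unknown_contrast_groups(contrasts, {"group1", "group2", "group3"}))
```

An empty result means every group named in the contrast manifest also appears in the input manifest.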