
Clean Raw Counts

Usage

clean_raw_counts(
  moo,
  sample_names_column = "Sample",
  gene_names_column = "GeneName",
  samples_to_rename = c(""),
  data_type = "Bulk RNAseq",
  cleanup_column_names = TRUE,
  split_gene_name = TRUE,
  aggregate_rows_with_duplicate_gene_names = TRUE,
  gene_name_column_to_use_for_collapsing_duplicates = ""
)

Arguments

moo

multiOmicDataSet object (see create_multiOmicDataSet_from_dataframes())

sample_names_column

The column from your input Sample Metadata table containing the sample names. The names in this column must exactly match the names used as the sample column names of your input Counts Matrix. Only columns of Text type from your input Sample Metadata table will be available to select for this parameter.
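The exact-match requirement can be checked before calling clean_raw_counts() by comparing the two sets of names. The sketch below is only an illustration in base R: the data frames sample_meta and raw_counts are hypothetical toy inputs, not the package's example data, and the check is not part of the function itself.

```r
# Hypothetical toy inputs; the column names mirror the defaults above.
sample_meta <- data.frame(
  Sample = c("A1", "A2", "B1"),
  Group = c("A", "A", "B"),
  stringsAsFactors = FALSE
)
raw_counts <- data.frame(
  GeneName = c("Xkr4", "Rp1"),
  A1 = c(0, 3),
  A2 = c(1, 4),
  B1 = c(2, 5),
  stringsAsFactors = FALSE
)

# Sample columns are every counts column except the feature ID column.
count_samples <- setdiff(colnames(raw_counts), "GeneName")

# Names present in one table but not the other; both should be empty.
missing_from_counts <- setdiff(sample_meta$Sample, count_samples)
missing_from_metadata <- setdiff(count_samples, sample_meta$Sample)
```

If either vector is non-empty, the listed names are the ones to reconcile before cleaning.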

gene_names_column

The column from your input Counts Matrix containing the Feature IDs (usually gene or protein IDs). This is usually the first column of your input Counts Matrix. Only columns of Text type from your input Counts Matrix will be available to select for this parameter.

samples_to_rename

If you do not have a Plot Labels Column in your sample metadata table, you can use this parameter to rename samples manually for display on the PCA plot. Use "Add item" to add each additional sample to rename. Give each entry in the format old_name: new_name, where old_name is the sample's current name in your sample metadata table.
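To make the old_name: new_name format concrete, here is a minimal base R sketch of how such entries could be parsed and applied to a vector of sample names. The entries and sample names are hypothetical, and the parsing shown is an assumption for illustration, not the package's internal implementation.

```r
# Hypothetical entries in the "old_name: new_name" format.
samples_to_rename <- c("A1: Control_1", "B1: Treated_1")

# Split each entry at the first colon and trim surrounding whitespace.
old_names <- trimws(sub(":.*$", "", samples_to_rename))
new_names <- trimws(sub("^[^:]*:", "", samples_to_rename))
rename_map <- setNames(new_names, old_names)

# Apply the mapping; names without an entry are left unchanged.
samples <- c("A1", "A2", "B1")
renamed <- ifelse(
  samples %in% names(rename_map),
  unname(rename_map[samples]),
  samples
)
```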

data_type

Type of data to process. Options: c("Bulk RNAseq", "Proteomics")

cleanup_column_names

Invalid raw counts column names can cause errors in downstream analysis. If TRUE, any invalid column names are automatically converted to a valid format: an "X" is prepended to any column name that begins with a numeral, and some special characters ("-,:. ") are replaced with underscores ("_"). Invalid sample names and any changes made are detailed in the template log.
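The two renaming rules described above can be sketched in base R as follows. This assumes the special characters are exactly the five listed (hyphen, comma, colon, period, space); the column names are hypothetical, and the function's actual implementation may differ.

```r
# Hypothetical raw column names, several of them problematic.
raw_names <- c("1A", "B-2", "C:3", "D 4", "E.5", "F,6", "G7")

# Replace the listed special characters ("-", ",", ":", ".", " ") with "_".
clean_names <- gsub("[-,:. ]", "_", raw_names)

# Prepend "X" to any name that begins with a numeral.
clean_names <- sub("^([0-9])", "X\\1", clean_names)

# Names that were altered, e.g. for reporting in a log.
changed <- raw_names[raw_names != clean_names]
```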

split_gene_name

If TRUE, split the gene name column by any of these special characters: ,|_-:
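As a rough illustration of splitting on those characters, the base R sketch below applies strsplit() with a character class. The gene_ids values are illustrative strings only. Note that this sketch splits every value naively, whereas the example output further down shows names such as RP23-271O17.1 kept intact, so the function evidently applies additional logic that is not reproduced here.

```r
# Illustrative feature IDs combining an Ensembl-style ID and a gene symbol.
gene_ids <- c("ENSMUSG00000051951|Xkr4", "Gm26206", "ENSMUSG00000025900_Rp1")

# Split on any of the characters , | _ - :
parts <- strsplit(gene_ids, "[,|_:-]")

# Number of pieces obtained from each ID.
n_parts <- lengths(parts)
```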

aggregate_rows_with_duplicate_gene_names

If a Feature ID (from the "Cleanup Column Names" parameter above) is duplicated on multiple rows of the raw counts, the log will report these Feature IDs. With the default behavior (TRUE), the counts for all rows sharing a Feature ID are aggregated into a single row: counts are summed across the duplicate Feature ID rows within each sample. Additional identifier columns, if present (e.g. Ensembl IDs), are preserved, and multiple matching identifiers in such columns appear as comma-separated values in the aggregated row.
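The aggregation described above (sum counts per sample, join extra identifiers with commas) can be sketched in base R with aggregate(). The counts table, its column names, and the placeholder identifiers are hypothetical, and this is an assumption about the behavior rather than the package's own code.

```r
# Hypothetical counts with a duplicated Gene and an extra identifier column.
counts <- data.frame(
  Gene = c("Xkr4", "Xkr4", "Rp1"),
  EnsemblID = c("ID_a", "ID_b", "ID_c"),
  A1 = c(1, 2, 5),
  A2 = c(0, 4, 6),
  stringsAsFactors = FALSE
)

# Sum counts within each sample across rows sharing a Gene value.
summed <- aggregate(cbind(A1, A2) ~ Gene, data = counts, FUN = sum)

# Collapse the additional identifier column into comma-separated values.
ids <- aggregate(
  EnsemblID ~ Gene,
  data = counts,
  FUN = function(x) paste(unique(x), collapse = ",")
)

# One row per Gene, with identifiers followed by the summed counts.
aggregated <- merge(ids, summed, by = "Gene")
```

Here the two Xkr4 rows collapse into one row whose counts are the per-sample sums and whose EnsemblID holds both placeholder identifiers.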

gene_name_column_to_use_for_collapsing_duplicates

Select the column with Feature IDs to use as grouping elements to collapse the counts matrix. The log output will list the columns available to identify duplicate row IDs in order to aggregate information. If data_type is "Bulk RNAseq", the column selected for Feature ID will be renamed to "Gene". If data_type is "Proteomics", it will be renamed to "Feature ID". If left blank, your "Feature ID" column will be used to aggregate rows. If the "Feature ID" column can be split into multiple IDs, the non-Ensembl ID name will be used to aggregate duplicate IDs. If the "Feature ID" column does not contain Ensembl IDs, the split Feature IDs will be named 'Feature_id_1' and 'Feature_id_2'; in this case an error will occur and you will have to manually enter the column ID for this field.

Value

multiOmicDataSet with cleaned counts

Examples

# Build a multiOmicDataSet from the example data frames, then clean its raw counts.
moo <- create_multiOmicDataSet_from_dataframes(
  as.data.frame(nidap_sample_metadata),
  as.data.frame(nidap_raw_counts),
  sample_id_colname = "Sample"
) %>%
  clean_raw_counts(sample_names_column = "Sample", gene_names_column = "GeneName")
#>   .
#>  A1
#>  A2
#>  A3
#>  B1
#>  B2
#>  B3
#>  C1
#>  C2
#>  C3
#> [1] ""
#> [1] "Not able to identify multiple id's in GeneName"
#> [1] ""
#> [1] "Columns that can be used to aggregate gene information"
#> [1] "Gene"
#> [1] ""
#> [1] "Aggregating the counts for the same ID in different chromosome locations."
#> [1] "Column used to Aggregate duplicate IDs: "
#> [1] "Gene"
#> [1] "Number of rows before Collapse: "
#> [1] 43280
#> [1] "no duplicated IDs in Gene"
#> [1] "Bulk RNAseq"

head(moo@counts$clean)
#>            Gene A1 A2 A3 B1 B2 B3 C1 C2 C3
#> 1 RP23-271O17.1  0  0  0  0  0  0  0  0  0
#> 2       Gm26206  0  0  0  0  0  0  0  0  0
#> 3          Xkr4  0  0  0  0  0  0  0  0  0
#> 4 RP23-317L18.1  0  0  0  0  0  0  0  0  0
#> 5 RP23-317L18.4  0  0  0  0  0  0  0  0  0
#> 6 RP23-317L18.3  0  0  0  0  0  0  0  0  0