
This is often the first step in the QC portion of an analysis to filter out features that have very low raw counts across most or all of your samples.

Usage

filter_counts(
  moo,
  count_type = "raw",
  gene_names_column = "gene_id",
  sample_names_column = "sample_id",
  group_column = "Group",
  label_column = "Label",
  columns_to_include = c("Gene", "A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3"),
  outlier_samples_to_remove = c(),
  minimum_count_value_to_be_considered_nonzero = 8,
  minimum_number_of_samples_with_nonzero_counts_in_total = 7,
  minimum_number_of_samples_with_nonzero_counts_in_a_group = 3,
  use_cpm_counts_to_filter = TRUE,
  use_group_based_filtering = FALSE,
  principal_component_on_x_axis = 1,
  principal_component_on_y_axis = 2,
  legend_position_for_pca = "top",
  point_size_for_pca = 1,
  add_label_to_pca = TRUE,
  label_font_size = 3,
  label_offset_y_ = 2,
  label_offset_x_ = 2,
  samples_to_rename_manually = c(""),
  color_histogram_by_group = FALSE,
  set_min_max_for_x_axis_for_histogram = FALSE,
  minimum_for_x_axis_for_histogram = -1,
  maximum_for_x_axis_for_histogram = 1,
  legend_position_for_histogram = "top",
  legend_font_size_for_histogram = 10,
  number_of_histogram_legend_columns = 6,
  colors_for_plots = c("indigo", "carrot", "lipstick", "turquoise", "lavender", "jade",
    "coral", "azure", "green", "rum", "orange", "olive"),
  number_of_image_rows = 2,
  interactive_plots = FALSE,
  plot_correlation_matrix_heatmap = TRUE,
  make_plots = TRUE
)

Arguments

moo

multiOmicDataSet object (see create_multiOmicDataSet_from_dataframes())

count_type

the type of counts to use – must be a name in the counts slot (moo@counts)

gene_names_column

The column from your input Counts Matrix containing the Feature IDs (Usually Gene or Protein ID). This is usually the first column of your input Counts Matrix. Only columns of Text type from your input Counts Matrix will be available to select for this parameter.

sample_names_column

The column from your input Sample Metadata table containing the sample names. The names in this column must exactly match the names used as the sample column names of your input Counts Matrix. Only columns of Text type from your input Sample Metadata table will be available to select for this parameter.

group_column

The column from your input Sample Metadata table containing the sample group information. This is usually a column showing to which experimental treatments each sample belongs (e.g. WildType, Knockout, Tumor, Normal, Before, After, etc.). Only columns of Text type from your input Sample Metadata will be available to select for this parameter.

label_column

The column from your input Sample Metadata table containing the sample labels as you wish them to appear in the plots produced by this template. This can be the same Sample Names Column. However, you may desire different labels to display on your figure (e.g. shorter labels are sometimes preferred on plots). In that case, select the column with your preferred Labels here. The selected column should contain unique names for each sample.

columns_to_include

Which columns would you like to include? Usually, you will choose a feature ID column (e.g. gene or protein ID) and all sample columns. Columns excluded here will be removed in this step and from further analysis downstream of this step.

outlier_samples_to_remove

A list of sample names to remove from the analysis.

minimum_count_value_to_be_considered_nonzero

Minimum count value to be considered non-zero for a sample

minimum_number_of_samples_with_nonzero_counts_in_total

Minimum number of samples (total) with non-zero counts

minimum_number_of_samples_with_nonzero_counts_in_a_group

Only keeps genes that have at least this number of samples with nonzero CPM counts in at least one group

use_cpm_counts_to_filter

If no transformation has been performed on the counts matrix (e.g. raw counts), set to TRUE. If TRUE, counts will be transformed to CPM and filtered based on the given criteria. If the gene counts matrix has already been transformed (e.g. log2, CPM, FPKM, or some form of normalization), set to FALSE. If FALSE, no further transformation will be applied and features will be filtered as is. For RNA-seq data, raw counts should be transformed to CPM in order to filter properly.

use_group_based_filtering

If TRUE, only keeps features (e.g. genes) that have at least a certain number of samples with nonzero CPM counts in at least one group

principal_component_on_x_axis

The principal component to plot on the x-axis for the PCA plot. Choices include 1, 2, 3, ... (default: 1)

principal_component_on_y_axis

The principal component to plot on the y-axis for the PCA plot. Choices include 1, 2, 3, ... (default: 2)

legend_position_for_pca

legend position for the PCA plot

point_size_for_pca

geom point size for the PCA plot

add_label_to_pca

label points on the PCA plot

label_font_size

label font size for the PCA plot

label_offset_y_

label offset y for the PCA plot

label_offset_x_

label offset x for the PCA plot

samples_to_rename_manually

If you do not have a Plot Labels Column in your sample metadata table, you can use this parameter to rename samples manually for display on the PCA plot. Use "Add item" to add each additional sample for renaming. Use the following format to describe which old name (in your sample metadata table) you want to rename to which new name: old_name: new_name

color_histogram_by_group

Set to FALSE to label histogram by Sample Names, or set to TRUE to label histogram by the column you select in the "Group Column Used to Color Histogram" parameter (below). Default is FALSE.

set_min_max_for_x_axis_for_histogram

whether to set min/max value for histogram x-axis

minimum_for_x_axis_for_histogram

x-axis minimum for histogram plot

maximum_for_x_axis_for_histogram

x-axis maximum for histogram plot

legend_position_for_histogram

legend position for the histogram plot. Consider setting to 'none' for a large number of samples.

legend_font_size_for_histogram

legend font size for the histogram plot

number_of_histogram_legend_columns

number of columns for the histogram legend

colors_for_plots

Colors for the PCA and histogram will be picked, in order, from this list. If you have more than 12 samples or groups, the program will choose from a wide range of random colors.

number_of_image_rows

number of rows for the plot image. 1 = side-by-side, 2 = stacked

interactive_plots

set to TRUE to make PCA and Histogram plots interactive with plotly, allowing you to hover your mouse over a point or line to view sample information. The similarity heat map will not display if this toggle is set to TRUE. Default is FALSE.

plot_correlation_matrix_heatmap

Data sets with a large number of samples may be too large to create a correlation matrix heat map. If this template takes longer than 5 minutes to run, set this to FALSE and the correlation matrix will not be created. Default is TRUE.

make_plots

whether to create plots
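
The "old_name: new_name" format described for samples_to_rename_manually can be illustrated with a small base R sketch. This is not the package's own parsing code, which is not shown on this page; the sample names and the `lookup` helper are hypothetical.

```r
# Hypothetical rename entries in the "old_name: new_name" format
renames <- c("A1: Control_1", "B1: Treated_1")

# Split each entry at the colon and trim surrounding whitespace
parts <- strsplit(renames, ":", fixed = TRUE)
old_names <- trimws(vapply(parts, `[`, character(1), 1))
new_names <- trimws(vapply(parts, `[`, character(1), 2))
lookup <- setNames(new_names, old_names)

# Apply the renames to a vector of plot labels; unmatched labels are kept
labels <- c("A1", "A2", "B1")
to_rename <- labels %in% names(lookup)
labels[to_rename] <- unname(lookup[labels[to_rename]])
labels
```

Entries whose old name does not appear among the labels are simply ignored in this sketch.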

Value

multiOmicDataSet with filtered counts

Details

This function takes a multiOmicDataSet containing raw counts and a sample metadata table, and returns the multiOmicDataSet object with filtered counts. It also produces an image consisting of three QC plots.
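
When use_cpm_counts_to_filter is TRUE, raw counts are converted to counts per million (CPM) before filtering. The following is a minimal base R sketch of that transformation on a toy matrix. It assumes the standard CPM definition (count divided by the sample's library size, times one million); the exact routine the package calls is not stated on this page, and the data are made up.

```r
# Toy raw counts: 3 genes x 2 samples (hypothetical data)
raw <- matrix(
  c(10, 0, 990,      # sample S1
    200, 300, 500),  # sample S2
  nrow = 3,
  dimnames = list(c("gene1", "gene2", "gene3"), c("S1", "S2"))
)

# Library size = total counts per sample (column sums)
library_sizes <- colSums(raw)

# CPM: divide each column by its library size, then scale to one million
cpm <- sweep(raw, 2, library_sizes, "/") * 1e6
cpm
```

Because CPM rescales every sample to the same total, one threshold can be compared across samples with different sequencing depths.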

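You can tune the threshold that determines how low the counts for a given gene must be before they are deemed "too low" and filtered out of downstream analysis. This threshold is set with minimum_count_value_to_be_considered_nonzero, which defaults to 8 in the usage above; values below the threshold count as "too low". When use_cpm_counts_to_filter is TRUE, the threshold is applied to CPM values rather than raw counts.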
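
The difference between total-based and group-based filtering can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes a value passes when it is greater than or equal to the threshold, and the variable names (`min_value`, `min_total`, `min_in_group`) are stand-ins for the corresponding arguments. The data are hypothetical.

```r
# Hypothetical values (e.g. CPM) for 2 features across 6 samples in 2 groups
vals <- rbind(
  f1 = c(A1 = 20, A2 = 25, A3 = 30, B1 = 0, B2 = 0, B3 = 0),
  f2 = c(A1 = 9, A2 = 0, A3 = 0, B1 = 10, B2 = 0, B3 = 12)
)
groups <- c(A1 = "A", A2 = "A", A3 = "A", B1 = "B", B2 = "B", B3 = "B")

min_value <- 8     # stand-in for minimum_count_value_to_be_considered_nonzero
min_total <- 3     # stand-in for ..._nonzero_counts_in_total
min_in_group <- 3  # stand-in for ..._nonzero_counts_in_a_group

# Assumption: a value "passes" when it is >= min_value
passing <- vals >= min_value

# Total-based: count passing samples across all samples
keep_total <- rowSums(passing) >= min_total

# Group-based: count passing samples per group, then require that
# at least one group reaches the minimum
samples_by_group <- split(colnames(vals), groups[colnames(vals)])
per_group <- sapply(samples_by_group, function(samples) {
  rowSums(passing[, samples, drop = FALSE])
})
keep_group <- apply(per_group, 1, max) >= min_in_group

rownames(vals)[keep_total]
rownames(vals)[keep_group]
```

In this toy data both features have three passing samples overall, but only f1 has all three within a single group, so the group-based rule is the stricter one here.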

The QC plots are provided to help you assess: (1) PCA Plot: the within and between group variance in expression after dimensionality reduction; (2) Count Density Histogram: the dis/similarity of count distributions between samples; and (3) Similarity Heatmap: the overall similarity of samples to one another based on unsupervised clustering.
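
To make the quantities behind the PCA plot and the similarity heatmap concrete, here is a small sketch using base R's prcomp() and cor(). The simulated counts, the log2(x + 1) transformation, and the variable names are all assumptions for illustration; the transformation and plotting code that filter_counts uses internally are not described on this page.

```r
set.seed(1)

# Simulated counts: 10 genes x 6 samples (hypothetical data)
counts <- matrix(
  rpois(60, lambda = 50),
  nrow = 10,
  dimnames = list(paste0("g", 1:10), c("A1", "A2", "A3", "B1", "B2", "B3"))
)

# Assumption: log-transform before PCA and correlation
log_vals <- log2(counts + 1)

# PCA with samples as rows, so each sample gets a coordinate per component
pca <- prcomp(t(log_vals), center = TRUE, scale. = FALSE)

# Pick the components to plot, mirroring principal_component_on_x_axis / _y_axis
pc_x <- 1
pc_y <- 2
coords <- pca$x[, c(pc_x, pc_y)]

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Sample-to-sample correlation matrix, the kind of input a similarity heatmap uses
sample_cor <- cor(log_vals)
```

Samples that sit close together in `coords` and have high values in `sample_cor` have similar expression profiles, which is what the QC plots are meant to reveal.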

Examples

moo <- create_multiOmicDataSet_from_dataframes(
  as.data.frame(nidap_sample_metadata),
  as.data.frame(nidap_clean_raw_counts),
  sample_id_colname = "Sample"
) %>%
  calc_cpm(gene_colname = "Gene") %>%
  filter_counts(
    sample_names_column = "Sample",
    gene_names_column = "Gene"
  )
head(moo@counts$filt)
#>            Gene   A1  A2   A3   B1   B2  B3   C1  C2   C3
#> 1 0610007P14Rik 1049 950  934 1068 1140 947 1393 907 1427
#> 2 0610009B22Rik  283 590  615  241  383 608  299 186  696
#> 3 0610010F05Rik  352 678 1377  958  879 616  332   0  186
#> 4 0610011F06Rik  430 565  553  462  558 688  710 826  706
#> 5 0610012G03Rik  480 589  683  324  596 673  909 933  419
#> 6 0610037L13Rik  467 570  593  558  330 423  356 198  568