Getting Started
Hello! Start in the left-hand sidebar by:
1. browsing for a .csv file with the gene counts
2. browsing for a .csv file with the experimental design
3. clicking the Run Analysis button, which appears after the input files are verified as valid for analysis (see tips below)
Note that the DE analysis results and plots may take several moments to process depending on the size of the input gene counts table.
This version of the DA app is designed for use on Posit Cloud and it is missing some features found in the local version of the DA app.
Helpful Tips
Tip 1: The input raw gene counts table is expected to contain numeric integer values.
Tip 2: Gene names in the first column of the input gene counts table are expected to be character values.
Tip 3: Sample names in the first line of the gene counts table must match the sample names contained in the first column of the experimental design table.
Tip 4: Sample names contained in the gene counts and experimental design tables are expected to be character values.
Tip 5: The input gene counts and experimental design tables must end in the .csv file extension.
Tip 6: Lines containing HTSeq stats (no_feature, ambiguous, too_low_aQual, not_aligned, alignment_not_unique) are automatically removed from the input counts table.
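The tips above can be sketched as a small pre-upload check. This is only an illustration, not the app's own validation code: the function name `check_inputs` and the example tables are hypothetical, the sample names are compared as sets (the tips do not say whether order matters), and a leading `__` is stripped from row names because htseq-count usually writes its summary rows that way.

```python
import csv
import io

# HTSeq summary rows named in Tip 6 (htseq-count usually prefixes them with "__").
HTSEQ_STATS = {"no_feature", "ambiguous", "too_low_aQual",
               "not_aligned", "alignment_not_unique"}


def check_inputs(counts_csv, design_csv):
    """Return (cleaned count rows, list of problems) for two CSV strings."""
    counts = [r for r in csv.reader(io.StringIO(counts_csv)) if r]
    design = [r for r in csv.reader(io.StringIO(design_csv)) if r]
    problems = []

    # Tip 3: first line of the counts table vs. first column of the design table.
    count_samples = counts[0][1:]
    design_samples = [row[0] for row in design[1:]]
    if set(count_samples) != set(design_samples):
        problems.append("sample names do not match")

    cleaned = []
    for row in counts[1:]:
        gene, values = row[0], row[1:]
        # Tip 6: drop HTSeq summary rows.
        if gene.lstrip("_") in HTSEQ_STATS:
            continue
        # Tip 1: raw counts should be written with digits only (no decimals or signs).
        if not all(v.isdigit() for v in values):
            problems.append(f"non-integer count for {gene}")
        cleaned.append(row)
    return cleaned, problems


# Hypothetical example tables.
counts_csv = "gene,s1,s2\ngeneA,10,12\ngeneB,0,3\n__no_feature,100,90\n"
design_csv = "sample,group\ns1,cntrl\ns2,treat\n"
rows, problems = check_inputs(counts_csv, design_csv)

bad_counts_csv = "gene,s1,s3\ngeneA,10.5,12\n"
bad_rows, bad_problems = check_inputs(bad_counts_csv, design_csv)
```

Running a check like this before uploading may help explain why the Run Analysis button does not appear for a given pair of files.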
Data Formatting
Example gene counts and experimental design tables are displayed below.
Example Gene Counts Tables
Example gene counts table of six samples and five genes:
Example gene counts table of twelve samples and three genes:
Example Experimental Design Tables
Processing
The DE analysis results and plots may take several moments to process depending on the size of the input gene counts or experimental design tables.
Helpful Tips
Tip 1: The plots and results may take several moments to appear depending on the size of the input gene counts table.
Tip 2: Navigate to the Analysis, Data Normalization, Data Exploration, or Results steps by clicking the tabs above.
Tip 3: Further information about choosing dispersion values and methods for obtaining dispersions may be found in the edgeR manual (e.g., section 2.12).
Tip 4: Examples of designing model expressions for ANOVA-like tests are available in the edgeR manual (e.g., sections 3.2.6 & 4.4.9).
Tip 5: If the normalization plot or other results look strange, make sure that the input table contains raw gene counts that have not been normalized.
DE Analysis
Begin the differential expression (DE) analysis by selecting an analysis type, log2-fold change (LFC) cut off, and false discovery rate (FDR) adjusted p-value cut off.
Select LFC Cut Off:
Select FDR Adjusted p-Value Cut Off:
Select Analysis Type:
Pairwise Comparison
Exact tests are performed to identify differences in the means between two groups of negative-binomially distributed counts. A comparison or contrast is a linear combination of means for groups of samples.
Choose Factor Levels for Comparison:
Enter Dispersion Value:
The dispersion value may be either a character string or a numeric value. The character string is used to indicate that dispersions should be taken from the data. Allowable character values are common, trended, tagwise or auto. If the input is numeric, then it can be a common value for all genes.
Note that the default dispersion value is auto, which uses the most complex dispersions found in the data.
GLM Comparison
The GLM is used to perform an ANOVA-like analysis to identify any significant main effect associated with an explanatory variable. An explanatory variable may be a categorical factor with two or more levels, such as treat and cntrl.
Additionally, genes above the input log2 fold change (LFC) threshold are identified as significantly DE using t-tests relative to a threshold (TREAT) with the glmTreat function of edgeR. If the input LFC cut off is set to 0, then the glmQLFTest function is used instead.
Enter Expression for Comparison:
Enter Dispersion Value:
Tip! Make sure that the factors used in the expression are spelled the same as in the experimental design file (shown below)
Examples of designing model expressions for ANOVA-like tests are available in the edgeR manual (e.g., sections 3.2.6 & 4.4.9). A detailed description of designing model expressions (e.g., for studies with multiple factors) is also provided in the paper "A guide to creating design matrices for gene expression experiments", doi: 10.12688/f1000research.27893.1.
The dispersion value may be either NULL or a numeric scalar. If the input is NULL, then the dispersions will be extracted from the data. The order of precedence is genewise dispersions, trended dispersions, common dispersion. If the input is numeric, then the dispersion value can be a common value for all genes.
Note that the default dispersion value is NULL.
Examples of typical dispersion values and methods for obtaining dispersions may be found in the edgeR manual (e.g., section 2.12). For example, the common BCV (square-root dispersion) values typically are 0.4 for human data, 0.1 for data on genetically identical model organisms or 0.01 for technical replicates. Furthermore, the dispersion may be estimated from the data given a sizeable number of control transcripts that should not be DE.
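Because the typical values above are quoted as BCVs while the input field asks for a dispersion, it can help to convert between the two. The sketch below assumes the entered number is used directly as a negative binomial dispersion, in which case it would be the square of the BCV; the helper name `bcv_to_dispersion` is hypothetical.

```python
# Typical common BCV values quoted from the edgeR manual in the text above.
typical_bcv = {
    "human data": 0.4,
    "genetically identical model organisms": 0.1,
    "technical replicates": 0.01,
}


def bcv_to_dispersion(bcv):
    """The dispersion is the square of the BCV."""
    return bcv ** 2


dispersions = {label: bcv_to_dispersion(b) for label, b in typical_bcv.items()}

for label, value in dispersions.items():
    print(f"{label}: BCV {typical_bcv[label]} -> dispersion {value:.4f}")
```

Under this assumption, a BCV of 0.4 for human data corresponds to a dispersion of about 0.16.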
Click to Analyze:
Design Table:
Data Normalization
Download Plot
The plot of library sizes shows the sequencing library size for each sample before Trimmed Mean of M-values (TMM) normalization. Libraries are the collection of RNA-seq reads associated with each sample.
Number of Genes with Sufficiently Large Counts:
Filtering is performed to remove genes that were identified as not sufficiently expressed under the experimental conditions.
Normalized Gene Counts Table:
Download Table
Normalized values were calculated in counts per million (CPM) using the normalized library sizes. The normalization method used with edgeR was the Trimmed Mean of M-values (TMM). Note that TMM normalization factors do not take into account library sizes.
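As a rough illustration of how CPM values follow from normalized library sizes, the sketch below multiplies each raw library size by its normalization factor to get an effective library size, then scales the counts to one million. The function name and all numbers are hypothetical, and the TMM factors are taken as given rather than computed.

```python
def cpm(counts, lib_sizes, norm_factors):
    """Counts per million using effective (normalized) library sizes.

    counts: dict of gene -> list of raw counts, one per sample.
    lib_sizes: raw library size of each sample.
    norm_factors: normalization factor of each sample (e.g., from TMM).
    """
    effective = [size * nf for size, nf in zip(lib_sizes, norm_factors)]
    return {
        gene: [1e6 * c / eff for c, eff in zip(row, effective)]
        for gene, row in counts.items()
    }


# Hypothetical two-sample example.
raw = {"geneA": [10, 20], "geneB": [0, 5]}
normalized = cpm(raw, lib_sizes=[1_000_000, 2_000_000], norm_factors=[1.0, 0.5])
```

Here the second sample has twice the raw library size, but its normalization factor of 0.5 gives both samples the same effective library size.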
Data Exploration
Download Plot
The above principal component analysis (PCA) plot shows the distances between samples, which approximate the expression differences. The expression differences were calculated as the average of the largest absolute LFCs between each pair of samples, and the same genes were selected for all comparisons. Note that the points are replaced by the sample names and colored by the associated factor level.
PCAs are commonly used to visualize the signal-to-noise relationship within a data set, for example the patterns of variation between and within groups.
Download Plot
The above multidimensional scaling (MDS) plot shows the distances between samples, which approximate the expression differences. The expression differences were calculated as the average of the largest absolute LFCs between each pair of samples, and the top genes were selected separately for each pairwise comparison. Note that the points are replaced by the sample names and colored by the associated factor level.
Download Plot
The biological coefficient of variation (BCV) is the square root of the dispersion parameter under the negative binomial model, so plotting the BCV is equivalent to plotting the estimated dispersions of the negative binomial model.
The negative binomial distribution is used to identify genes with sufficiently large counts to be considered a real signal and measures what it expects to be missing data, or a measure of dispersion. For example, a dispersion (BCV^2) of 0.04 corresponds to a BCV of 0.2, which indicates a 20% difference between samples.
The negative binomial distribution models biological noise rather than sequencing noise (e.g., library size normalization).
DE Analysis Results
Begin the differential expression (DE) analysis on the Analysis tab by selecting input values and clicking the Analyze button.
The inputs may also be adjusted on the Analysis tab and updated by clicking the Analyze button.
Pairwise Comparison
Note that results will not appear if there are invalid input values (e.g., dispersions).
Pairwise Results
Download Plot
The mean-difference (MD) plot shows the log2 fold changes (LFCs) in expression versus the average log2 CPM values. Red points are significantly up-expressed genes and blue points are significantly down-expressed genes, where significance was determined by the input FDR cut off. The blue lines indicate the input LFC cut off, which will be used to further filter the set of significantly DE genes.
Number of Significantly DE Genes:
The above table shows the number of significantly DE genes that were up- or down-expressed in the input comparison. Significance was determined by the input LFC and FDR cut offs.
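A count like this can be pictured as a simple filter on each gene's LFC and FDR adjusted p-value. The sketch below is only an illustration with hypothetical gene names and values. It assumes strict inequalities at the cut offs and a plain post hoc filter, whereas a TREAT-style test evaluates significance relative to the LFC threshold rather than filtering afterwards.

```python
def classify(results, lfc_cutoff, fdr_cutoff):
    """Count up- and down-expressed genes that pass both cut offs.

    results: dict of gene -> (log2 fold change, FDR adjusted p-value).
    """
    summary = {"up": 0, "down": 0, "not_significant": 0}
    for lfc, fdr in results.values():
        # Assumption: a gene must be strictly beyond both cut offs.
        if fdr < fdr_cutoff and lfc > lfc_cutoff:
            summary["up"] += 1
        elif fdr < fdr_cutoff and lfc < -lfc_cutoff:
            summary["down"] += 1
        else:
            summary["not_significant"] += 1
    return summary


# Hypothetical results for four genes.
results = {
    "geneA": (2.5, 0.001),   # large positive LFC, small FDR
    "geneB": (-1.8, 0.02),   # large negative LFC, small FDR
    "geneC": (0.3, 0.01),    # small FDR but LFC below the cut off
    "geneD": (3.0, 0.2),     # large LFC but FDR above the cut off
}
summary = classify(results, lfc_cutoff=1.0, fdr_cutoff=0.05)
```

In this example, geneC and geneD each fail one of the two cut offs and are counted as not significant.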
DE Analysis Results Table:
Download Table
A table of pairwise DE analysis results sorted by increasing FDR adjusted p-values may be downloaded by clicking the above button.
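For readers curious how FDR adjusted p-values and the sort order relate, here is a minimal sketch of the Benjamini-Hochberg adjustment followed by sorting. The text only says "FDR adjusted", so the choice of Benjamini-Hochberg (the default adjustment in edgeR's topTags function) is an assumption about this app, and the gene names and p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
genes = ["g1", "g2", "g3", "g4"]
pvalues = [0.001, 0.04, 0.03, 0.2]
fdr = bh_adjust(pvalues)

# Sort genes by increasing FDR adjusted p-value, as in the downloaded table.
ranked = [gene for gene, _ in sorted(zip(genes, fdr), key=lambda pair: pair[1])]
```

Note that g2 and g3 receive the same adjusted value here, because the adjustment keeps the adjusted p-values monotone in the raw p-values.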
Significant DE Analysis Results Table:
Download Table
A table of significant pairwise DE analysis results sorted by increasing FDR adjusted p-values may be downloaded by clicking the above button. Significance was determined by the input LFC and FDR cut offs.
DE Gene IDs:
Download Table
A list of the DE gene IDs from the pairwise analysis may be downloaded by clicking the above button.
Significantly DE Gene IDs:
Download Table
A list of the significantly DE gene IDs from the pairwise analysis may be downloaded by clicking the above button. Significance was determined by the input LFC and FDR cut offs.
Results Exploration
Download Plot
The heatmap displays the hierarchical clustering of individual samples by the log2 CPM expression values of significantly DE genes from the pairwise analysis. Significance was determined by the input FDR and LFC cut offs.
Note that the heatmap function requires at least 2 significantly DE genes to create the plot.
Download Plot
The above volcano plot displays the association between statistical significance (e.g., p-value) and magnitude of gene expression (fold change). Significance and magnitude were determined by the input FDR and LFC cut offs.
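The coordinates of a volcano plot are commonly the LFC on the x-axis and the negative log10 of the p-value on the y-axis, so that more significant genes sit higher on the plot. The sketch below follows that common convention. Whether this app plots raw or FDR adjusted p-values is not stated here, and the function name and values are hypothetical.

```python
import math


def volcano_point(lfc, pvalue):
    """Return (x, y) for one gene: LFC and -log10 of the p-value.

    Assumes pvalue > 0, since the logarithm of zero is undefined.
    """
    return lfc, -math.log10(pvalue)


# Hypothetical gene with a log2 fold change of 2 and a p-value of 0.001.
x, y = volcano_point(2.0, 0.001)
```

A gene with a p-value of 0.001 is therefore drawn at a height of about 3, while a gene with a p-value of 0.1 is drawn at a height of about 1.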
GLM Comparison
Note that results will not appear if there are invalid input values (e.g., dispersions).
GLM Results
Download Plot
The mean-difference (MD) plot shows the log2 fold changes (LFCs) in expression versus the average log2 CPM values. Red points are significantly up-expressed genes and blue points are significantly down-expressed genes, where significance was determined by the input FDR cut off. The blue lines indicate the input LFC cut off, which will be used to further filter the set of significantly DE genes.
Number of Significantly DE Genes:
The above table shows the number of significantly DE genes that were up- or down-expressed in the input comparison. Significance was determined by the input LFC and FDR cut offs.
DE Analysis Results Table:
Download Table
A table of GLM DE analysis results sorted by increasing FDR adjusted p-values may be downloaded by clicking the above button.
Significant DE Analysis Results Table:
Download Table
A table of significant GLM DE analysis results sorted by increasing FDR adjusted p-values may be downloaded by clicking the above button. Significance was determined by the input LFC and FDR cut offs.
DE Gene IDs:
Download Table
A list of the DE gene IDs from the ANOVA-like analysis may be downloaded by clicking the above button.
Significantly DE Gene IDs:
Download Table
A list of the significantly DE gene IDs from the ANOVA-like analysis may be downloaded by clicking the above button. Significance was determined by the input LFC and FDR cut offs.
Model Exploration
Download Plot
Above is a plot of the genewise quasi-likelihood (QL) dispersion against the log2 CPM gene expression levels. The dispersion estimates are obtained after fitting negative binomial models to the counts.
Results Exploration
Download Plot
The heatmap displays the hierarchical clustering of individual samples by the log2 CPM expression values of significantly DE genes from the GLM analysis. Significance was determined by the input FDR and LFC cut offs.
Note that the heatmap requires at least 2 significantly DE genes to create the plot.
Download Plot
The above volcano plot displays the association between statistical significance (e.g., p-value) and magnitude of gene expression (fold change). Significance and magnitude were determined by the input FDR and LFC cut offs.
Helpful Information
A tutorial for this application can be found in the scripts directory of the freeCount GitHub.
The latest version of this application may be downloaded from the freeCount GitHub.
Example gene counts and experimental design tables are also provided on GitHub.
Gene tables may be created from RNA-seq data as described in Bioinformatics Analysis of Omics Data with the Shell & R.
A tutorial of the biostatistical analysis performed in this application is provided in Downstream Bioinformatics Analysis of Omics Data with edgeR.
Cite
Elizabeth Mae Brooks, Sheri A Sanders, and Michael E Pfrender. 2024. FreeCount: A Coding Free Framework for Guided Count Data Visualization and Analysis. In Practice and Experience in Advanced Research Computing 2024: Human Powered Computing (PEARC '24). Association for Computing Machinery, New York, NY, USA, Article 37, 1–4. https://doi.org/10.1145/3626203.3670605