Analysis

Significance Testing

Analysis was completed using the statistical computing language R (version 4.1.0) and edgeR, a package for the differential expression analysis of RNA-sequencing profiles (Robinson, McCarthy, & Smyth, 2010). edgeR models count data with a negative binomial distribution, which can be viewed as an over-dispersed Poisson model, and uses an empirical Bayes procedure to moderate the degree of overdispersion across genes.

To find which genes are differentially expressed, we need to look at the difference in expression between the archenteron cells and all other cells in the embryo (GFP positive vs GFP negative).

edgeR was first used to calculate, for each gene, the fold change between the GFP-positive and GFP-negative cell populations for both biological replicates. The fold change is a measure of how much higher or lower expression is in the archenteron compared to all other cells.
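To make the idea of a fold change concrete, here is a minimal sketch in Python. It is not the edgeR calculation, which was run in R and fits a model across replicates; it only shows the ratio itself and its log2 form, which is the scale edgeR reports (the logFC column). The gene names, expression values and the `prior` offset are made up for illustration.

```python
import math


def log2_fold_change(positive, negative, prior=0.5):
    """log2 ratio of expression in GFP-positive cells to GFP-negative cells.

    `prior` is a hypothetical small offset added to both values so that a
    gene with zero expression in one population does not cause a division
    by zero. It is an assumption of this sketch.
    """
    return math.log2((positive + prior) / (negative + prior))


# Made-up normalised expression values: (GFP positive, GFP negative).
example = {
    "geneA": (80.0, 20.0),  # higher in the archenteron
    "geneB": (15.0, 15.0),  # no difference
    "geneC": (5.0, 40.0),   # lower in the archenteron
}

for gene, (pos, neg) in example.items():
    print(gene, round(log2_fold_change(pos, neg), 2))
```

On the log2 scale, a value of 0 means no difference, a value of 1 means expression is doubled, and a value of -1 means it is halved, so increases and decreases of the same size are symmetric around zero.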

However, an observed difference in gene expression can arise by chance, for two reasons: natural biological variation between the replicates, which means they express genes at different levels, or error in the experiment and data collection. Therefore, we cannot accept the fold change as the single determining factor for whether a gene is differentially expressed. We also consider the P-value: the probability of seeing a difference at least this large if there were no real difference in expression. The smaller the P-value, the more confident we can be that the difference is not due to chance.

edgeR provides likelihood ratio tests and quasi-likelihood F-tests for calculating the P-value. For these calculations, dispersion was estimated using the Cox-Reid profile-adjusted likelihood (CR) method, and the biological coefficient of variation (BCV) was estimated using the generalized linear model (GLM) method provided in the package.
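The link between dispersion and the BCV can be illustrated with a short Python sketch. Under the negative binomial model, the variance of a count with mean mu and dispersion phi is mu + phi * mu^2, and the BCV is the square root of the dispersion. The code below only evaluates these two formulas; it does not reproduce how edgeR estimates the dispersion, and the numbers used are made up.

```python
import math


def nb_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean^2.

    With dispersion = 0 this reduces to the Poisson case (variance = mean).
    """
    return mean + dispersion * mean ** 2


def bcv(dispersion):
    """Biological coefficient of variation: square root of the dispersion."""
    return math.sqrt(dispersion)


mean_count = 100.0   # hypothetical mean read count for one gene
dispersion = 0.04    # hypothetical dispersion estimate

print("Poisson variance:", nb_variance(mean_count, 0.0))
print("Negative binomial variance:", nb_variance(mean_count, dispersion))
print("BCV:", bcv(dispersion))
```

With these made-up values the BCV is about 0.2, which would mean that true expression varies by roughly 20% between biological replicates, on top of the counting noise that a Poisson model alone would describe.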

These data were then visualized using R and ggplot2 (Wickham, 2016). Any gene with a P-value less than 0.05 was deemed statistically significant. Among these, a gene was classed as upregulated if its fold change was greater than two and as downregulated if its fold change was less than one half, as shown in the volcano plot (Figure 1). In my analysis, 260 statistically significant genes were identified.
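The thresholds described here can be written down as a small Python sketch. It mirrors the cut-offs in the text (P-value below 0.05, fold change above two or below one half), but the function name, the labels and the example values are made up, and how a significant gene with a small fold change is labelled is an assumption of this sketch.

```python
def classify_gene(fold_change, p_value, p_cutoff=0.05, fc_cutoff=2.0):
    """Label a gene using a P-value cut-off and a fold change cut-off.

    `fold_change` is the plain ratio (not log2) of expression in
    GFP-positive cells to expression in GFP-negative cells.
    """
    if p_value >= p_cutoff:
        return "not significant"
    if fold_change > fc_cutoff:
        return "upregulated"
    if fold_change < 1.0 / fc_cutoff:
        return "downregulated"
    # Significant P-value, but the change is smaller than two-fold.
    # This label is an assumption made for the sketch.
    return "significant, small change"


# Made-up (fold change, P-value) pairs for illustration.
examples = [(4.0, 0.001), (0.25, 0.01), (3.0, 0.20), (1.2, 0.03)]
for fc, p in examples:
    print(fc, p, classify_gene(fc, p))
```

On a volcano plot these rules correspond to one horizontal cut-off for the P-value and two vertical cut-offs for the fold change, so the upregulated and downregulated genes fall in the upper right and upper left regions of the plot.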

Figure 2 shows the statistically significant genes on a scatter plot of gene expression in the archenteron against expression in all other cells, for both biological replicates. In principle, a gene that is not differentially expressed should lie exactly on the y = x line, upregulated genes should lie above the diagonal, and downregulated genes should lie below it; the further above or below, the greater the magnitude of the fold change. These graphs demonstrate that the statistical tests are necessary for identifying significant genes, as not every point lies where we would expect based on RNA counts alone.

Figure 1. Volcano plot showing fold change against P-value for the full transcriptome of the combined replicates
Figure 2. Logarithmic graph of gene expression in the archenteron vs all other cells