Biological Question: How does dexamethasone - a glucocorticoid steroid used to treat asthma changes gene expression in human airway smooth muscle (ASM) cells? Which genes does it switch on or off, and how does this tell us about its anti-inflammatory effects?
RNA sequencing (RNA-seq) measures genome-wide gene expression by sequencing mRNA and counting how many reads map to each gene. Compared to microarrays, RNA-seq can detect novel transcripts, works across a huge dynamic range, and gives actual counts rather than relative fluorescence intensities.
The dataset used here comes from Himes et al. (2014) [1], who treated 4 human ASM cell lines with dexamethasone and sequenced both the treated and untreated version of each, 8 samples total. Their main finding was that CRISPLD2 is one of the most strongly glucocorticoid-responsive genes in these cells.
| Tool | Purpose |
|---|---|
| R >= 4.0 + RStudio | Programming environment |
DESeq2 (Bioconductor) |
Differential expression analysis |
airway (Bioconductor) |
Example RNA-seq count dataset |
EnhancedVolcano |
Publication-ready volcano plots |
ComplexHeatmap |
Heatmap visualization |
plotly |
Interactive plots |
shiny |
Interactive web app (app.R) |
DT |
Interactive data tables |
dplyr |
Data manipulation |
ggplot2 |
Static plotting |
tibble |
Tidy data frames |
## class: RangedSummarizedExperiment
## dim: 63677 8
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
## ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
A SummarizedExperiment object with 63,677 genes across 8
samples - 4 donors x 2 conditions (treated/untreated). Each number in
the count matrix is how many RNA-seq reads mapped to that gene in that
sample. Nothing has been normalized yet.
Let’s look at the sample metadata interactively:
colData(airway)[, c("cell", "dex")] %>%
as.data.frame() %>%
tibble::rownames_to_column("Sample") %>%
DT::datatable(
caption = "Airway Dataset: Sample Metadata",
rownames = FALSE,
options = list(pageLength = 8, dom = "t"),
class = "compact stripe"
)Let’s look at how many reads each sample produced
# Pull raw counts and log-transform for visualization
count_mat <- assay(airway)
log_counts <- log2(count_mat + 1)
# Reshape to long format for plotting
long_df <- as.data.frame(log_counts) %>%
tibble::rownames_to_column("gene") %>%
tidyr::pivot_longer(-gene, names_to = "sample", values_to = "log2count") %>%
dplyr::left_join(
as.data.frame(colData(airway)) %>%
tibble::rownames_to_column("sample") %>%
dplyr::select(sample, dex),
by = "sample"
)
ggplot(long_df, aes(x = sample, y = log2count, fill = dex)) +
geom_boxplot(outlier.size = 0.3, outlier.alpha = 0.3) +
scale_fill_manual(values = c("untrt" = "#a29bfe", "trt" = "#f9ca24"),
labels = c("untrt" = "Untreated", "trt" = "Treated")) +
labs(
title = "Raw Count Distribution per Sample (before normalization)",
x = "Sample",
y = "log2(count + 1)",
fill = "Treatment"
) +
theme_minimal(base_size = 12) +
theme(
axis.text.x = element_text(angle = 35, hjust = 1, size = 9),
plot.title = element_text(face = "bold"),
legend.position = "right"
)The boxes should be roughly the same height and centered around similar values. If one sample’s box is dramatically lower or shifted, it may have been sequenced to a much shallower depth, which DESeq2’s normalization will correct but it’s good to know. Here the samples look reasonably consistent, which is a good sign.
dds <- DESeqDataSet(airway, design = ~ dex)
dds$dex <- relevel(dds$dex, ref = "untrt") # untreated = reference/baseline
dds <- DESeq(dds) # normalize + estimate dispersion + test
res <- results(dds)
vsd <- vst(dds, blind = FALSE) # variance-stabilized data for heatmap##
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 1881, 5.6%
## LFC < 0 (down) : 1503, 4.5%
## outliers [1] : 51, 0.15%
## low counts [2] : 14833, 44%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
What DESeq2 does:
res_df <- as.data.frame(res) %>%
tibble::rownames_to_column("gene") %>%
dplyr::filter(!is.na(padj))
n_total <- format(nrow(res_df), big.mark = ",")
n_sig <- format(sum(res_df$padj < 0.05), big.mark = ",")
n_up <- format(sum(res_df$padj < 0.05 & res_df$log2FoldChange > 0), big.mark = ",")
n_down <- format(sum(res_df$padj < 0.05 & res_df$log2FoldChange < 0), big.mark = ",")
htmltools::HTML(paste0('
<div class="stat-row">
<div class="stat-card">
<span class="number">', n_total, '</span>
<span class="label">Genes Tested</span>
</div>
<div class="stat-card">
<span class="number">', n_sig, '</span>
<span class="label">Significant (padj < 0.05)</span>
</div>
<div class="stat-card">
<span class="number" style="color:#c0392b">', n_up, '</span>
<span class="label">Upregulated</span>
</div>
<div class="stat-card">
<span class="number" style="color:#2980b9">', n_down, '</span>
<span class="label">Downregulated</span>
</div>
</div>
'))vol_df <- as.data.frame(res) %>%
tibble::rownames_to_column("gene") %>%
dplyr::filter(!is.na(pvalue), !is.na(log2FoldChange)) %>%
dplyr::mutate(
neg_log10_p = -log10(pvalue),
significant = dplyr::case_when(
padj < 0.05 & log2FoldChange > 1.5 ~ "Upregulated",
padj < 0.05 & log2FoldChange < -1.5 ~ "Downregulated",
TRUE ~ "Not significant"
)
)
color_map <- c(
"Upregulated" = "#c0392b",
"Downregulated" = "#2980b9",
"Not significant" = "#bdc3c7"
)
plot_ly(
vol_df,
x = ~log2FoldChange,
y = ~neg_log10_p,
type = "scatter",
mode = "markers",
color = ~significant,
colors = color_map,
marker = list(size = 4, opacity = 0.7),
text = ~paste0(
"<b>", gene, "</b><br>",
"LFC: ", round(log2FoldChange, 3), "<br>",
"p-value: ", signif(pvalue, 3), "<br>",
"padj: ", signif(padj, 3)
),
hoverinfo = "text"
) %>%
layout(
title = list(text = "Volcano Plot: Dexamethasone vs Untreated", font = list(size = 16)),
xaxis = list(title = "log2 Fold Change", zeroline = TRUE),
yaxis = list(title = "-log10(p-value)"),
shapes = list(
list(type = "line", x0 = -1.5, x1 = -1.5, y0 = 0, y1 = 40,
line = list(dash = "dash", color = "gray", width = 1)),
list(type = "line", x0 = 1.5, x1 = 1.5, y0 = 0, y1 = 40,
line = list(dash = "dash", color = "gray", width = 1)),
list(type = "line", x0 = -8, x1 = 8,
y0 = -log10(0.05), y1 = -log10(0.05),
line = list(dash = "dash", color = "gray", width = 1))
),
legend = list(orientation = "h", y = -0.15)
)Hover over any point to see the gene ID, fold change, and p-value. Red = significantly upregulated by dexamethasone. Blue = significantly downregulated. Dashed lines mark the cutoffs (|LFC| > 1.5, padj < 0.05) is considered a hit. You can adjust these in the Shiny app.
ma_df <- as.data.frame(res) %>%
tibble::rownames_to_column("gene") %>%
dplyr::filter(!is.na(log2FoldChange), !is.na(padj)) %>%
dplyr::mutate(color = ifelse(padj < 0.05, "#2980b9", "#bdc3c7"))
plot_ly(
ma_df,
x = ~log(baseMean + 1),
y = ~log2FoldChange,
type = "scatter",
mode = "markers",
marker = list(size = 4, opacity = 0.6, color = ~color),
text = ~paste0(
"<b>", gene, "</b><br>",
"Mean count: ", round(baseMean, 1), "<br>",
"LFC: ", round(log2FoldChange, 3), "<br>",
"padj: ", signif(padj, 3)
),
hoverinfo = "text"
) %>%
layout(
title = list(text = "MA Plot: Mean Expression vs Log2 Fold Change", font = list(size = 16)),
xaxis = list(title = "log(Mean Normalized Count + 1)"),
yaxis = list(title = "log2 Fold Change", zeroline = TRUE),
shapes = list(
list(type = "line", x0 = 0, x1 = 15, y0 = 1.5, y1 = 1.5,
line = list(dash = "dash", color = "red", width = 1)),
list(type = "line", x0 = 0, x1 = 15, y0 = -1.5, y1 = -1.5,
line = list(dash = "dash", color = "red", width = 1))
),
annotations = list(
list(x = 13, y = 2.2, text = "<b>Blue = padj < 0.05</b>",
showarrow = FALSE, font = list(color = "#2980b9", size = 12))
)
)Each dot is a gene. X-axis = average expression level. Y-axis = fold change (treatment vs untreated). Blue dots are statistically significant (padj < 0.05). Red dashed lines show the LFC threshold. Genes at low mean counts scatter more, expected due to measurement uncertainty.
topN <- head(order(res$padj, na.last = TRUE), 50)
mat <- assay(vsd)[topN, ]
mat_scaled <- t(scale(t(mat)))
col_anno <- HeatmapAnnotation(
Treatment = colData(vsd)$dex,
col = list(Treatment = c("untrt" = "#a29bfe", "trt" = "#f9ca24"))
)
Heatmap(
mat_scaled,
name = "Z-score",
top_annotation = col_anno,
show_row_names = TRUE,
show_column_names = TRUE,
row_names_gp = gpar(fontsize = 7),
column_names_gp = gpar(fontsize = 9),
column_title = "Top 50 Differentially Expressed Genes",
column_title_gp = gpar(fontsize = 13, fontface = "bold"),
clustering_distance_rows = "euclidean",
clustering_distance_columns = "euclidean",
col = circlize::colorRamp2(c(-2, 0, 2), c("#2980b9", "white", "#c0392b"))
)Each row is a gene, each column is a sample. Red = high expression, Blue = low expression (Z-score relative to the gene mean). The annotation bar at the top shows treatment group: yellow = treated, purple = untreated. Treated and untreated samples confirming dexamethasone has a strong, consistent transcriptional effect across all 4 donors.
counts_data <- plotCounts(
dds,
gene = "ENSG00000103196",
intgroup = "dex",
returnData = TRUE
)
p_counts <- ggplot(counts_data, aes(x = dex, y = count, color = dex, fill = dex)) +
geom_jitter(width = 0.15, size = 4, alpha = 0.8) +
geom_boxplot(alpha = 0.2, outlier.shape = NA) +
scale_color_manual(values = c("untrt" = "#2980b9", "trt" = "#c0392b")) +
scale_fill_manual(values = c("untrt" = "#2980b9", "trt" = "#c0392b")) +
scale_y_log10() +
labs(
title = "CRISPLD2 (ENSG00000103196) - Normalized Counts",
subtitle = "Top hit from Himes et al. 2014",
x = "Treatment",
y = "Normalized Count (log scale)"
) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"), legend.position = "none")
ggplotly(p_counts, tooltip = c("x", "y"))CRISPLD2 was the headline finding in Himes et al. they used RT-qPCR to validate it independently. Every single treated sample has higher counts than every untreated sample, which is a clean result. It’s also not a statistical artifact from one outlier donor, all 4 cell lines show the same direction of change.
Paper: Himes et al. (2014). RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene. PLoS ONE 9(6): e99625.
top_hits <- as.data.frame(res) %>%
tibble::rownames_to_column("Gene (Ensembl ID)") %>%
dplyr::filter(!is.na(padj)) %>%
dplyr::arrange(padj) %>%
dplyr::slice_head(n = 20) %>%
dplyr::mutate(
log2FoldChange = round(log2FoldChange, 3),
pvalue = formatC(pvalue, format = "e", digits = 2),
padj = formatC(padj, format = "e", digits = 2),
Direction = ifelse(log2FoldChange > 0, "Up", "Down"),
Note = ifelse(`Gene (Ensembl ID)` == "ENSG00000103196",
"★ CRISPLD2 - Himes et al. top hit", "")
) %>%
dplyr::select(`Gene (Ensembl ID)`, log2FoldChange, pvalue, padj, Direction, Note)
DT::datatable(
top_hits,
caption = "Table 1. Top 20 DEGs from our DESeq2 analysis (sorted by adjusted p-value)",
rownames = FALSE,
filter = "top",
options = list(
pageLength = 20,
dom = "tp",
columnDefs = list(list(className = "dt-center", targets = "_all"))
),
class = "compact stripe hover"
) %>%
DT::formatStyle(
"Direction",
color = DT::styleEqual(c("Up", "Down"), c("#c0392b", "#2980b9")),
fontWeight = "bold"
) %>%
DT::formatStyle(
"Note",
color = "#e67e22",
fontWeight = "bold"
)Comparison with Himes et al. 2014: The paper’s headline finding was CRISPLD2 (ENSG00000103196), which they described as the most significantly glucocorticoid-responsive gene in airway smooth muscle cells. Our DESeq2 analysis reproduces this result - CRISPLD2 appears in our top-ranked genes with a large positive fold change (LFC ≈ 2.2) and an extremely small adjusted p-value, confirming our pipeline faithfully replicates the published findings. Other genes highlighted by Himes et al. - including DUSP1, KLF15, and PER1, also appear among our significant hits.
On our Table: If sorted by padj alone, CRISPLD2 lands around position 16–20. That’s not a problem with the analysis, it just reflects that statistical ranking and biological importance aren’t always the same thing.
The analysis is also packaged as a Shiny app (app.R) so
you can explore results without re-knitting the Rmd. To launch it, open
app.R in RStudio and click Run App, or
run:
The app has three tabs (Volcano Plot, Heatmap, Results Table), all controlled by sidebar sliders.
Data setup - runs once when the app loads:
data(airway)
dds <- DESeqDataSet(airway, design = ~ dex)
dds$dex <- relevel(dds$dex, ref = "untrt")
dds <- DESeq(dds)
res <- results(dds)
vsd <- vst(dds, blind = FALSE)Running DESeq2 once at startup means all three tabs read from the
same res and vsd objects, so adjusting a
slider rerenders instantly.
Volcano plot responds to your p-value and LFC sliders:
output$volcano <- renderPlot({
EnhancedVolcano(res,
lab = rownames(res),
x = "log2FoldChange",
y = "pvalue",
pCutoff = input$pval,
FCcutoff = input$lfc,
title = "Dexamethasone vs Untreated",
pointSize = 2.0,
labSize = 3.0)
})Heatmap shows however many top DEGs you select:
output$heatmap <- renderPlot({
library(ComplexHeatmap)
topN <- head(order(res$padj, na.last = TRUE), input$topn)
mat <- assay(vsd)[topN, ]
# Scale rows for better visualization
mat_scaled <- t(scale(t(mat)))
# Color for treatment annotation
col_anno <- HeatmapAnnotation(
treatment = colData(vsd)$dex,
col = list(treatment = c("untrt" = "#a29bfe", "trt" = "#f9ca24"))
)
Heatmap(mat_scaled,
name = "Z-score",
top_annotation = col_anno,
show_row_names = TRUE,
show_column_names = TRUE,
row_names_gp = gpar(fontsize = 6),
column_title = paste("Top", input$topn, "DEGs"),
clustering_distance_rows = "euclidean",
clustering_distance_columns = "euclidean")
})Result Table shows DESeq2 Output:
output$table <- DT::renderDataTable({
as.data.frame(res) %>%
tibble::rownames_to_column("gene") %>%
dplyr::filter(!is.na(padj)) %>%
dplyr::arrange(padj)
})| Tab | What it shows |
|---|---|
| Volcano Plot | Adjustable p-value & LFC cutoffs |
| Heatmap | Top N genes (slider: 10–100), Z-score scaled |
| Results Table | Full sortable/searchable DESeq2 results |
Running DESeq2 on the airway dataset is a good way to get comfortable with the full RNA-seq pipeline - from raw counts to visualizations - before applying it to a dataset of your own. The steps here (load → normalize → test → visualize) are the same ones you’d use on any GEO dataset; only the metadata handling changes.
The key takeaway biologically: dexamethasone has a broad transcriptional effect in airway smooth muscle cells, and CRISPLD2 is one of the clearest responders. Our results line up with what Himes et al. reported, which is a good sign the pipeline is working correctly.
If you want to try this on real data from GEO, search
rnaseq counts[filter] on NCBI GEO and look for zipped raw
count files - the setup will look very similar to what’s here.
Himes BE, et al. (2014). RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene. PLoS ONE 9(6): e99625. https://doi.org/10.1371/journal.pone.0099625
Love MI, Huber W, Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8
Blighe K, Rana S, Lewis M. (2018). EnhancedVolcano. Bioconductor. https://bioconductor.org/packages/EnhancedVolcano
Gu Z, Eils R, Schlesner M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32: 2847-2849.
Sievert C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com