RNA-seq Vignette: DESeq2 Analysis of the Airway Dataset

Introduction

Biological Question: How does dexamethasone - a glucocorticoid steroid used to treat asthma changes gene expression in human airway smooth muscle (ASM) cells? Which genes does it switch on or off, and how does this tell us about its anti-inflammatory effects?

RNA sequencing (RNA-seq) measures genome-wide gene expression by sequencing mRNA and counting how many reads map to each gene. Compared to microarrays, RNA-seq can detect novel transcripts, works across a huge dynamic range, and gives actual counts rather than relative fluorescence intensities.

The dataset used here comes from Himes et al. (2014) [1], who treated 4 human ASM cell lines with dexamethasone and sequenced both the treated and untreated version of each, 8 samples total. Their main finding was that CRISPLD2 is one of the most strongly glucocorticoid-responsive genes in these cells.

Materials / Software Required

Tool	Purpose
R >= 4.0 + RStudio	Programming environment
`DESeq2` (Bioconductor)	Differential expression analysis
`airway` (Bioconductor)	Example RNA-seq count dataset
`EnhancedVolcano`	Publication-ready volcano plots
`ComplexHeatmap`	Heatmap visualization
`plotly`	Interactive plots
`shiny`	Interactive web app (app.R)
`DT`	Interactive data tables
`dplyr`	Data manipulation
`ggplot2`	Static plotting
`tibble`	Tidy data frames

Early Steps

Step 1 - Load the Airway Dataset

1Start here - load raw count data

data(airway)
airway

## class: RangedSummarizedExperiment 
## dim: 63677 8 
## metadata(1): ''
## assays(1): counts
## rownames(63677): ENSG00000000003 ENSG00000000005 ... ENSG00000273492
##   ENSG00000273493
## rowData names(10): gene_id gene_name ... seq_coord_system symbol
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample

A SummarizedExperiment object with 63,677 genes across 8 samples - 4 donors x 2 conditions (treated/untreated). Each number in the count matrix is how many RNA-seq reads mapped to that gene in that sample. Nothing has been normalized yet.

Let’s look at the sample metadata interactively:

colData(airway)[, c("cell", "dex")] %>%
  as.data.frame() %>%
  tibble::rownames_to_column("Sample") %>%
  DT::datatable(
    caption  = "Airway Dataset: Sample Metadata",
    rownames = FALSE,
    options  = list(pageLength = 8, dom = "t"),
    class    = "compact stripe"
  )

Let’s look at how many reads each sample produced

# Pull raw counts and log-transform for visualization
count_mat <- assay(airway)
log_counts <- log2(count_mat + 1)

# Reshape to long format for plotting
long_df <- as.data.frame(log_counts) %>%
  tibble::rownames_to_column("gene") %>%
  tidyr::pivot_longer(-gene, names_to = "sample", values_to = "log2count") %>%
  dplyr::left_join(
    as.data.frame(colData(airway)) %>%
      tibble::rownames_to_column("sample") %>%
      dplyr::select(sample, dex),
    by = "sample"
  )

ggplot(long_df, aes(x = sample, y = log2count, fill = dex)) +
  geom_boxplot(outlier.size = 0.3, outlier.alpha = 0.3) +
  scale_fill_manual(values = c("untrt" = "#a29bfe", "trt" = "#f9ca24"),
                    labels = c("untrt" = "Untreated", "trt" = "Treated")) +
  labs(
    title = "Raw Count Distribution per Sample (before normalization)",
    x     = "Sample",
    y     = "log2(count + 1)",
    fill  = "Treatment"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x  = element_text(angle = 35, hjust = 1, size = 9),
    plot.title   = element_text(face = "bold"),
    legend.position = "right"
  )

The boxes should be roughly the same height and centered around similar values. If one sample’s box is dramatically lower or shifted, it may have been sequenced to a much shallower depth, which DESeq2’s normalization will correct but it’s good to know. Here the samples look reasonably consistent, which is a good sign.

Step 2 - Build DESeqDataSet and Run DESeq2

2Run the core differential expression analysis

dds <- DESeqDataSet(airway, design = ~ dex)
dds$dex <- relevel(dds$dex, ref = "untrt")   # untreated = reference/baseline
dds <- DESeq(dds)                             # normalize + estimate dispersion + test
res <- results(dds)
vsd <- vst(dds, blind = FALSE)               # variance-stabilized data for heatmap

summary(res)

## 
## out of 33469 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 1881, 5.6%
## LFC < 0 (down)     : 1503, 4.5%
## outliers [1]       : 51, 0.15%
## low counts [2]     : 14833, 44%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

What DESeq2 does:

Size factor normalization - corrects for differences in sequencing depth so samples are comparable
Dispersion estimation - models biological variability; genes that vary a lot across replicates need stronger evidence to be called significant
Wald test - for each gene, tests whether log2FoldChange is significantly different from 0 using a negative binomial model because RNA-seq counts aren’t normally distributed

Step 3 - Summary Statistics

3How many genes were significant?

res_df <- as.data.frame(res) %>%
  tibble::rownames_to_column("gene") %>%
  dplyr::filter(!is.na(padj))

n_total <- format(nrow(res_df), big.mark = ",")
n_sig   <- format(sum(res_df$padj < 0.05), big.mark = ",")
n_up    <- format(sum(res_df$padj < 0.05 & res_df$log2FoldChange > 0), big.mark = ",")
n_down  <- format(sum(res_df$padj < 0.05 & res_df$log2FoldChange < 0), big.mark = ",")

htmltools::HTML(paste0('
<div class="stat-row">
  <div class="stat-card">
    <span class="number">', n_total, '</span>
    <span class="label">Genes Tested</span>
  </div>
  <div class="stat-card">
    <span class="number">', n_sig, '</span>
    <span class="label">Significant (padj &lt; 0.05)</span>
  </div>
  <div class="stat-card">
    <span class="number" style="color:#c0392b">', n_up, '</span>
    <span class="label">Upregulated</span>
  </div>
  <div class="stat-card">
    <span class="number" style="color:#2980b9">', n_down, '</span>
    <span class="label">Downregulated</span>
  </div>
</div>
'))

18,585 Genes Tested

2,700 Significant (padj < 0.05)

1,550 Upregulated

1,150 Downregulated

Later Steps: Visualizing Results

Interactive Volcano Plot

vol_df <- as.data.frame(res) %>%
  tibble::rownames_to_column("gene") %>%
  dplyr::filter(!is.na(pvalue), !is.na(log2FoldChange)) %>%
  dplyr::mutate(
    neg_log10_p = -log10(pvalue),
    significant = dplyr::case_when(
      padj < 0.05 & log2FoldChange >  1.5 ~ "Upregulated",
      padj < 0.05 & log2FoldChange < -1.5 ~ "Downregulated",
      TRUE                                 ~ "Not significant"
    )
  )

color_map <- c(
  "Upregulated"     = "#c0392b",
  "Downregulated"   = "#2980b9",
  "Not significant" = "#bdc3c7"
)

plot_ly(
  vol_df,
  x     = ~log2FoldChange,
  y     = ~neg_log10_p,
  type  = "scatter",
  mode  = "markers",
  color = ~significant,
  colors = color_map,
  marker = list(size = 4, opacity = 0.7),
  text  = ~paste0(
    "<b>", gene, "</b><br>",
    "LFC: ", round(log2FoldChange, 3), "<br>",
    "p-value: ", signif(pvalue, 3), "<br>",
    "padj: ", signif(padj, 3)
  ),
  hoverinfo = "text"
) %>%
  layout(
    title  = list(text = "Volcano Plot: Dexamethasone vs Untreated", font = list(size = 16)),
    xaxis  = list(title = "log2 Fold Change", zeroline = TRUE),
    yaxis  = list(title = "-log10(p-value)"),
    shapes = list(
      list(type = "line", x0 = -1.5, x1 = -1.5, y0 = 0, y1 = 40,
           line = list(dash = "dash", color = "gray", width = 1)),
      list(type = "line", x0 =  1.5, x1 =  1.5, y0 = 0, y1 = 40,
           line = list(dash = "dash", color = "gray", width = 1)),
      list(type = "line", x0 = -8, x1 = 8,
           y0 = -log10(0.05), y1 = -log10(0.05),
           line = list(dash = "dash", color = "gray", width = 1))
    ),
    legend = list(orientation = "h", y = -0.15)
  )

Hover over any point to see the gene ID, fold change, and p-value. Red = significantly upregulated by dexamethasone. Blue = significantly downregulated. Dashed lines mark the cutoffs (|LFC| > 1.5, padj < 0.05) is considered a hit. You can adjust these in the Shiny app.

Interactive MA Plot

ma_df <- as.data.frame(res) %>%
  tibble::rownames_to_column("gene") %>%
  dplyr::filter(!is.na(log2FoldChange), !is.na(padj)) %>%
  dplyr::mutate(color = ifelse(padj < 0.05, "#2980b9", "#bdc3c7"))

plot_ly(
  ma_df,
  x    = ~log(baseMean + 1),
  y    = ~log2FoldChange,
  type = "scatter",
  mode = "markers",
  marker = list(size = 4, opacity = 0.6, color = ~color),
  text = ~paste0(
    "<b>", gene, "</b><br>",
    "Mean count: ", round(baseMean, 1), "<br>",
    "LFC: ", round(log2FoldChange, 3), "<br>",
    "padj: ", signif(padj, 3)
  ),
  hoverinfo = "text"
) %>%
  layout(
    title  = list(text = "MA Plot: Mean Expression vs Log2 Fold Change", font = list(size = 16)),
    xaxis  = list(title = "log(Mean Normalized Count + 1)"),
    yaxis  = list(title = "log2 Fold Change", zeroline = TRUE),
    shapes = list(
      list(type = "line", x0 = 0, x1 = 15, y0 =  1.5, y1 =  1.5,
           line = list(dash = "dash", color = "red", width = 1)),
      list(type = "line", x0 = 0, x1 = 15, y0 = -1.5, y1 = -1.5,
           line = list(dash = "dash", color = "red", width = 1))
    ),
    annotations = list(
      list(x = 13, y = 2.2, text = "<b>Blue = padj < 0.05</b>",
           showarrow = FALSE, font = list(color = "#2980b9", size = 12))
    )
  )

Each dot is a gene. X-axis = average expression level. Y-axis = fold change (treatment vs untreated). Blue dots are statistically significant (padj < 0.05). Red dashed lines show the LFC threshold. Genes at low mean counts scatter more, expected due to measurement uncertainty.

Heatmap of Top 50 DEGs

topN       <- head(order(res$padj, na.last = TRUE), 50)
mat        <- assay(vsd)[topN, ]
mat_scaled <- t(scale(t(mat)))

col_anno <- HeatmapAnnotation(
  Treatment = colData(vsd)$dex,
  col = list(Treatment = c("untrt" = "#a29bfe", "trt" = "#f9ca24"))
)

Heatmap(
  mat_scaled,
  name                        = "Z-score",
  top_annotation              = col_anno,
  show_row_names              = TRUE,
  show_column_names           = TRUE,
  row_names_gp                = gpar(fontsize = 7),
  column_names_gp             = gpar(fontsize = 9),
  column_title                = "Top 50 Differentially Expressed Genes",
  column_title_gp             = gpar(fontsize = 13, fontface = "bold"),
  clustering_distance_rows    = "euclidean",
  clustering_distance_columns = "euclidean",
  col = circlize::colorRamp2(c(-2, 0, 2), c("#2980b9", "white", "#c0392b"))
)

Each row is a gene, each column is a sample. Red = high expression, Blue = low expression (Z-score relative to the gene mean). The annotation bar at the top shows treatment group: yellow = treated, purple = untreated. Treated and untreated samples confirming dexamethasone has a strong, consistent transcriptional effect across all 4 donors.

CRISPLD2 Gene Count Plot

counts_data <- plotCounts(
  dds,
  gene       = "ENSG00000103196",
  intgroup   = "dex",
  returnData = TRUE
)

p_counts <- ggplot(counts_data, aes(x = dex, y = count, color = dex, fill = dex)) +
  geom_jitter(width = 0.15, size = 4, alpha = 0.8) +
  geom_boxplot(alpha = 0.2, outlier.shape = NA) +
  scale_color_manual(values = c("untrt" = "#2980b9", "trt" = "#c0392b")) +
  scale_fill_manual(values  = c("untrt" = "#2980b9", "trt" = "#c0392b")) +
  scale_y_log10() +
  labs(
    title    = "CRISPLD2 (ENSG00000103196) - Normalized Counts",
    subtitle = "Top hit from Himes et al. 2014",
    x        = "Treatment",
    y        = "Normalized Count (log scale)"
  ) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"), legend.position = "none")

ggplotly(p_counts, tooltip = c("x", "y"))

CRISPLD2 was the headline finding in Himes et al. they used RT-qPCR to validate it independently. Every single treated sample has higher counts than every untreated sample, which is a clean result. It’s also not a statistical artifact from one outlier donor, all 4 cell lines show the same direction of change.

Comparison to Publication

Paper: Himes et al. (2014). RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene. PLoS ONE 9(6): e99625.

top_hits <- as.data.frame(res) %>%
  tibble::rownames_to_column("Gene (Ensembl ID)") %>%
  dplyr::filter(!is.na(padj)) %>%
  dplyr::arrange(padj) %>%
  dplyr::slice_head(n = 20) %>%
  dplyr::mutate(
    log2FoldChange = round(log2FoldChange, 3),
    pvalue         = formatC(pvalue, format = "e", digits = 2),
    padj           = formatC(padj,   format = "e", digits = 2),
    Direction      = ifelse(log2FoldChange > 0, "Up", "Down"),
    Note           = ifelse(`Gene (Ensembl ID)` == "ENSG00000103196",
                            "★ CRISPLD2 - Himes et al. top hit", "")
  ) %>%
  dplyr::select(`Gene (Ensembl ID)`, log2FoldChange, pvalue, padj, Direction, Note)

DT::datatable(
  top_hits,
  caption  = "Table 1. Top 20 DEGs from our DESeq2 analysis (sorted by adjusted p-value)",
  rownames = FALSE,
  filter   = "top",
  options  = list(
    pageLength = 20,
    dom        = "tp",
    columnDefs = list(list(className = "dt-center", targets = "_all"))
  ),
  class = "compact stripe hover"
) %>%
  DT::formatStyle(
    "Direction",
    color      = DT::styleEqual(c("Up", "Down"), c("#c0392b", "#2980b9")),
    fontWeight = "bold"
  ) %>%
  DT::formatStyle(
    "Note",
    color      = "#e67e22",
    fontWeight = "bold"
  )

Comparison with Himes et al. 2014: The paper’s headline finding was CRISPLD2 (ENSG00000103196), which they described as the most significantly glucocorticoid-responsive gene in airway smooth muscle cells. Our DESeq2 analysis reproduces this result - CRISPLD2 appears in our top-ranked genes with a large positive fold change (LFC ≈ 2.2) and an extremely small adjusted p-value, confirming our pipeline faithfully replicates the published findings. Other genes highlighted by Himes et al. - including DUSP1, KLF15, and PER1, also appear among our significant hits.

On our Table: If sorted by padj alone, CRISPLD2 lands around position 16–20. That’s not a problem with the analysis, it just reflects that statistical ranking and biological importance aren’t always the same thing.

Interactive Shiny App

The analysis is also packaged as a Shiny app (app.R) so you can explore results without re-knitting the Rmd. To launch it, open app.R in RStudio and click Run App, or run:

shiny::runApp("app.R")

The app has three tabs (Volcano Plot, Heatmap, Results Table), all controlled by sidebar sliders.

Data setup - runs once when the app loads:

data(airway)
dds <- DESeqDataSet(airway, design = ~ dex)
dds$dex <- relevel(dds$dex, ref = "untrt")
dds <- DESeq(dds)
res <- results(dds)
vsd <- vst(dds, blind = FALSE)

Running DESeq2 once at startup means all three tabs read from the same res and vsd objects, so adjusting a slider rerenders instantly.

Volcano plot responds to your p-value and LFC sliders:

output$volcano <- renderPlot({
    EnhancedVolcano(res,
                    lab = rownames(res),
                    x = "log2FoldChange",
                    y = "pvalue",
                    pCutoff = input$pval,
                    FCcutoff = input$lfc,
                    title = "Dexamethasone vs Untreated",
                    pointSize = 2.0,
                    labSize = 3.0)
  })

Heatmap shows however many top DEGs you select:

output$heatmap <- renderPlot({
    library(ComplexHeatmap)
    
    topN <- head(order(res$padj, na.last = TRUE), input$topn)
    mat  <- assay(vsd)[topN, ]
    
    # Scale rows for better visualization
    mat_scaled <- t(scale(t(mat)))
    
    # Color for treatment annotation
    col_anno <- HeatmapAnnotation(
      treatment = colData(vsd)$dex,
      col = list(treatment = c("untrt" = "#a29bfe", "trt" = "#f9ca24"))
    )
    
    Heatmap(mat_scaled,
            name = "Z-score",
            top_annotation = col_anno,
            show_row_names = TRUE,
            show_column_names = TRUE,
            row_names_gp = gpar(fontsize = 6),
            column_title = paste("Top", input$topn, "DEGs"),
            clustering_distance_rows = "euclidean",
            clustering_distance_columns = "euclidean")
  })

Result Table shows DESeq2 Output:

output$table <- DT::renderDataTable({
    as.data.frame(res) %>%
      tibble::rownames_to_column("gene") %>%
      dplyr::filter(!is.na(padj)) %>%
      dplyr::arrange(padj)
  })

Tab	What it shows
Volcano Plot	Adjustable p-value & LFC cutoffs
Heatmap	Top N genes (slider: 10–100), Z-score scaled
Results Table	Full sortable/searchable DESeq2 results

Conclusion

Running DESeq2 on the airway dataset is a good way to get comfortable with the full RNA-seq pipeline - from raw counts to visualizations - before applying it to a dataset of your own. The steps here (load → normalize → test → visualize) are the same ones you’d use on any GEO dataset; only the metadata handling changes.

The key takeaway biologically: dexamethasone has a broad transcriptional effect in airway smooth muscle cells, and CRISPLD2 is one of the clearest responders. Our results line up with what Himes et al. reported, which is a good sign the pipeline is working correctly.

If you want to try this on real data from GEO, search rnaseq counts[filter] on NCBI GEO and look for zipped raw count files - the setup will look very similar to what’s here.

References

Himes BE, et al. (2014). RNA-Seq Transcriptome Profiling Identifies CRISPLD2 as a Glucocorticoid Responsive Gene. PLoS ONE 9(6): e99625. https://doi.org/10.1371/journal.pone.0099625
Love MI, Huber W, Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15: 550. https://doi.org/10.1186/s13059-014-0550-8
Blighe K, Rana S, Lewis M. (2018). EnhancedVolcano. Bioconductor. https://bioconductor.org/packages/EnhancedVolcano
Gu Z, Eils R, Schlesner M. (2016). Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32: 2847-2849.
Sievert C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

RNA-seq Vignette: DESeq2 Analysis of the Airway Dataset

Yusra Amena

2026-04-21

Introduction

Materials / Software Required

Early Steps

Step 1 - Load the Airway Dataset

Step 2 - Build DESeqDataSet and Run DESeq2

Step 3 - Summary Statistics

Later Steps: Visualizing Results

Interactive Volcano Plot

Interactive MA Plot

Heatmap of Top 50 DEGs

CRISPLD2 Gene Count Plot

Comparison to Publication

Interactive Shiny App

Conclusion

References