Motivation

Is expression of genes in a gene set associated with experimental condition?

Many methods, a recent review is Kharti et al., 2012.

What is a gene set?

Any a priori classification of `genes’ into biologically relevant groups

Sets do not need to be…

Collections of gene sets

Gene Ontology (GO) Annotation (GOA)

Pathways

E.g., MSigDb

Statistical approaches

Initially based on a presentation by Simon Anders, CSAMA 2010

Approach 1: hypergeometric tests

Steps

  1. Classify each gene as ‘differentially expressed’ DE or not, e.g., based on P < 0.05
  2. Are DE genes in the set more common than DE genes not in the set?
In gene set?
Yes No
Differentially Yes k K
expressed? No n - k N - K
  1. Fisher hypergeometric test, via fiser.test() or GOstats

Notes

Approach 2: enrichment score

Steps

Approach 3: category \(t\)-test

E.g., Jiang & Gentleman, 2007;

Expression in NEG vs BCR/ABL samples for genes in the ‘ribosome’ KEGG pathway; Category vignette.

Competitive versus self-contained null hypothesis

Goemann & Bühlmann, 2007

Approach 4: linear models

E.g., Hummel et al., 2008,

limma

Approach 5: pathway topology

E.g., Tarca et al., 2009,

Evidence plot, colorectal cancer. Points: pathway gene sets. Significant after Bonferroni (red) or FDR (blue) correction.

Issues with sequence data?

E.g., Young et al., 2010, goseq

DE genes vs. transcript length. Points: bins of 300 genes. Line: fitted probability weighting function.

Approach 6: de novo discovery

Example: Langfelder & Hovarth, WGCNA

Representing gene sets in R

Conclusions

Gene set enrichment classifications

Selected Packages

Approach Packages
Hypergeometric GOstats, topGO
Enrichment limma::romer()
Category \(t\)-test Category
Linear model GlobalAncova, GSEAlm, limma::roast()
Pathway topology SPIA
Sequence-specific goseq
de novo WGCNA

Practical

This practical is based on section 6 of the goseq vignette.

1-6 Experimental design, …, Analysis of gene differential expression

This (relatively old) experiment examined the effects of androgen stimulation on a human prostate cancer cell line, LNCaP (Li et al., 2008). The experiment used short (35bp) single-end reads from 4 control and 3 untreated lines. Reads were aligned to hg19 using Bowtie, and counted using ENSEMBL 54 gene models.

Input the data to edgeR’s DGEList data structure.

library(edgeR)
path <- system.file(package="goseq", "extdata", "Li_sum.txt")

table.summary <- read.table(path, sep='\t', header=TRUE, stringsAsFactors=FALSE)
counts <- table.summary[,-1]
rownames(counts) <- table.summary[,1]
grp <- factor(rep(c("Control","Treated"), times=c(4,3)))
summarized <- DGEList(counts, lib.size=colSums(counts), group=grp)

Use a ‘common’ dispersion estimate, and compare the two groups using an exact test

disp <- estimateCommonDisp(summarized)
tested <- exactTest(disp)
topTags(tested)
## Comparison of groups:  Treated-Control 
##                     logFC   logCPM       PValue          FDR
## ENSG00000127954 11.557868 6.680748 2.574972e-80 1.274766e-75
## ENSG00000151503  5.398963 8.499530 1.781732e-65 4.410322e-61
## ENSG00000096060  4.897600 9.446705 7.983756e-60 1.317479e-55
## ENSG00000091879  5.737627 6.282646 1.207655e-54 1.494654e-50
## ENSG00000132437 -5.880436 7.951910 2.950042e-52 2.920896e-48
## ENSG00000166451  4.564246 8.458467 7.126763e-52 5.880292e-48
## ENSG00000131016  5.254737 6.607957 1.066807e-51 7.544766e-48
## ENSG00000163492  7.085400 5.128514 2.716461e-45 1.681014e-41
## ENSG00000113594  4.051053 8.603264 9.272066e-44 5.100255e-40
## ENSG00000116285  4.108522 7.864773 6.422468e-43 3.179507e-39

7. Comprehension

Start by extracting all P values, then correcting for multiple comparison using p.adjust(). Classify the genes as differentially expressed or not.

padj <- with(tested$table, {
    keep <- logFC != 0
    value <- p.adjust(PValue[keep], method="BH")
    setNames(value, rownames(tested)[keep])
})
genes <- padj < 0.05
table(genes)
## genes
## FALSE  TRUE 
## 19535  3208

Gene symbol to pathway

Under the hood, goseq uses Bioconductor annotation packages (in this case org.Hs.eg.db and r Biocannopkg("GO.db") to map from gene symbols to GO pathways.

Expore these packages through the columns() and select() functions. Can you map between ENSEMBL gene identifiers (the row names of topTable()) to GO pathway? What about ‘drilling down’ on particular GO identifiers to discover the term’s definition?

Probability weighting function

Calculate the weighting for each gene. This looks up the gene lengths in a pre-defined table (how could these be calculated using TxDb packages? What challenges are associated with calculating these ‘weights’, based on the knowledge that genes typically consist of several transcripts, each expressed differently?)

pwf <- nullp(genes,"hg19","ensGene")
## Loading hg19 length data...
## Warning in pcls(G): initial point very close to some inequality
## constraints

head(pwf)
##                 DEgenes bias.data        pwf
## ENSG00000230758   FALSE       247 0.03757470
## ENSG00000182463   FALSE      3133 0.20436865
## ENSG00000124208   FALSE      1978 0.16881769
## ENSG00000230753   FALSE       466 0.06927243
## ENSG00000224628   FALSE      1510 0.15903532
## ENSG00000125835   FALSE       954 0.12711992

Over- and under-representation

Perform the main analysis. This includes association of genes to GO pathway

GO.wall <- goseq(pwf, "hg19", "ensGene")
## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...
head(GO.wall)
##         category over_represented_pvalue under_represented_pvalue
## 10729 GO:0044763            8.237627e-15                        1
## 10708 GO:0044699            2.079753e-14                        1
## 2453  GO:0005737            2.956026e-10                        1
## 3004  GO:0006614            6.131543e-09                        1
## 7499  GO:0031982            1.101818e-08                        1
## 2372  GO:0005576            1.339207e-08                        1
##       numDEInCat numInCat
## 10729       1893     8355
## 10708       2054     9201
## 2453        1790     8080
## 3004          39      107
## 7499         638     2675
## 2372         669     2836
##                                                              term ontology
## 10729                            single-organism cellular process       BP
## 10708                                     single-organism process       BP
## 2453                                                    cytoplasm       CC
## 3004  SRP-dependent cotranslational protein targeting to membrane       BP
## 7499                                                      vesicle       CC
## 2372                                         extracellular region       CC

What if we’d ignored gene length?

Here we do the same operation, but ignore gene lengths

GO.nobias <- goseq(pwf,"hg19","ensGene",method="Hypergeometric")
## Fetching GO annotations...
## For 9751 genes, we could not find any categories. These genes will be excluded.
## To force their use, please run with use_genes_without_cat=TRUE (see documentation).
## This was the default behavior for version 1.15.1 and earlier.
## Calculating the p-values...

Compare the over-represented P-values for each set, under the different methods

idx <- match(GO.nobias$category, GO.wall$category)
plot(log10(GO.nobias[, "over_represented_pvalue"]) ~
     log10(GO.wall[idx, "over_represented_pvalue"]),
     xlab="Wallenius", ylab="Hypergeometric",
     xlim=c(-5, 0), ylim=c(-5, 0))
abline(0, 1, col="red", lwd=2)

References