Methylation Arrays – Lab

Epigenomics 2014
Author: Martin Morgan (mtmorgan@fhcrc.org)
Date: 24 August, 2014

This case study takes a quick tour through the minfi Bioconductor package. The main goal is to become familiar with the use of Bioconductor objects and methods; see the vignette accompanying the minfi package for more background on analysis of Illumina arrays for methylation analysis.

Start by attaching the minfi and minfiData packages. Use browseVignettes("minfi") to access the vignette for additional background information.

require(minfi)
require(minfiData)
browseVignettes("minfi")

The first step in any work flow is to read in the data. A sample data set is provided at the following location.

baseDir <- system.file("extdata", package = "minfiData")
baseDir
## [1] "/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata"
dir(baseDir)
## [1] "5723646052"      "5723646053"      "SampleSheet.csv"
dir(file.path(baseDir, "5723646052"))
## [1] "5723646052_R02C02_Grn.idat" "5723646052_R02C02_Red.idat"
## [3] "5723646052_R04C01_Grn.idat" "5723646052_R04C01_Red.idat"
## [5] "5723646052_R05C02_Grn.idat" "5723646052_R05C02_Red.idat"

Of course your own data would be at another location, and you might enter the path (with 'tab completion') instead of using system.file(). A typical organization is that each 'slide' (containing 12 arrays) is stored in a separate directory. The top-level directory contains a .csv file describing the samples; inside each slide directory are IDAT files representing the output of the Illumina scanner.

Next read in the sample sheet, and then the raw probe-level data. Take a moment to use R to explore the sample sheet. Read the 'man' page for read.450k.sheet and read.450k.exp to see what options are available.

## 'pData'
targets <- read.450k.sheet(baseDir)
## [read.450k.sheet] Found the following CSV files:
## [1] "/home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/SampleSheet.csv"
head(targets)
##   Sample_Name Sample_Well Sample_Plate Sample_Group Pool_ID person age sex
## 1    GroupA_3          H5           NA       GroupA      NA    id3  83   M
## 2    GroupA_2          D5           NA       GroupA      NA    id2  58   F
## 3    GroupB_3          C6           NA       GroupB      NA    id3  83   M
## 4    GroupB_1          F7           NA       GroupB      NA    id1  75   F
## 5    GroupA_1          G7           NA       GroupA      NA    id1  75   F
## 6    GroupB_2          H7           NA       GroupB      NA    id2  58   F
##   status  Array     Slide
## 1 normal R02C02 5.724e+09
## 2 normal R04C01 5.724e+09
## 3 cancer R05C02 5.724e+09
## 4 cancer R04C02 5.724e+09
## 5 normal R05C02 5.724e+09
## 6 cancer R06C02 5.724e+09
##                                                                                                    Basename
## 1 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R02C02
## 2 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R04C01
## 3 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646052/5723646052_R05C02
## 4 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R04C02
## 5 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R05C02
## 6 /home/mtmorgan/R/x86_64-unknown-linux-gnu-library/3.1-2.14/minfiData/extdata/5723646053/5723646053_R06C02
## 'raw' probe-level data
RGset <- read.450k.exp(base = baseDir, targets = targets)

As a basic quality assessment, visualize the distribution of beta values across each array, coloring the density functions by sample. Are there any concerns about the data?

## Basic QA -- comparable densities across samples?
densityPlot(RGset, sampGroups = RGset$Sample_Group, 
    main = "Beta", xlab = "Beta")

plot of chunk unnamed-chunk-6

A technical artifact is that probe intensities differ depending on their sequence composition, so it is necessary to perform a 'background correction'. Also, the distribution of probe intensities differ from one another as a consequence of sample preparation steps, e.g., slightly different initial amounts of DNA from one sample compared to another. Basic steps in microarray analysis (many of these steps are shared by other high-throughput assays) are therefore background correction and between-array normalization. Use the preprocessIllumina() command to perform these steps. Are there other normalization strategies available in this package?

## background correction and normalization
##     like Illumina Genome Studio (other approaches available)
MSet.norm <- preprocessIllumina(RGset, bg.correct = TRUE,
    normalize = "controls", reference = 2)

Once data are background-corrected and normalized, it is possible to compare the vector of methylation values of each sample. Use the mdsPlot() function to visualize the multidimensional relationship between samples in reduced dimensions. Use arguments of mdsPlot() to name and highlight the different sample groups.

## How similar are the samples to one another?
mdsPlot(MSet.norm, numPositions = 1000, sampGroups = MSet.norm$Sample_Group, 
    sampNames = MSet.norm$Sample_Name)

plot of chunk unnamed-chunk-8

Take a portion of the data (the first 100,000 probes), retrieve the logit-transformed beta values, and then use dmpFinder() CpGs where methylation status is associated with sample group. From the help page, references, and your own knowledge, any ideas about shrinkVar?

## Identify probes with methylation status differing between groups
mset <- MSet.norm[1:100000,]
## logit(beta)
M <- getM(mset, type = "beta", betaThreshold = 0.001)
dmp <- dmpFinder(M, pheno=mset$Sample_Group, type="categorical")
head(dmp)
##            intercept    f      pval    qval
## cg18207348     3.916 4499 2.960e-07 0.01382
## cg07485273     4.273 3630 4.545e-07 0.01382
## cg02025578    -6.526 2970 6.788e-07 0.01382
## cg16650529    -2.236 2739 7.977e-07 0.01382
## cg12019520     9.964 2011 1.479e-06 0.01936
## cg10805483    -9.964 1706 2.053e-06 0.01936

Visualize differential methylation using plotCpg()

plotCpg(mset, cpg=rownames(dmp)[1], pheno=mset$Sample_Group)

plot of chunk unnamed-chunk-10

Probes interrogate genomic locations. Use mapToGenome() to translate the probe identifiers to genomic coordinates. This transforms mset into an object of class SummarizedExperiment. A SummarizedExperiment is similar to an expression set, bui with GRanges to annotate rows, rather than a data.frame. Use rowData() to extract the GRanges from the SummarizedExperiment; explore it. Add the annotations about differentially expressed probes to the row data of the SummarizedExperiment.

## Genomic locations
mset <- mset[rownames(dmp),]
mse <- mapToGenome(mset)              # 'SummarizedExperiment'
rowData(mse)
## GRanges with 100000 ranges and 0 metadata columns:
##              seqnames               ranges strand
##                 <Rle>            <IRanges>  <Rle>
##   cg13869341     chr1     [ 15865,  15865]      *
##   cg12045430     chr1     [ 29407,  29407]      *
##   cg20826792     chr1     [ 29425,  29425]      *
##   cg00381604     chr1     [ 29435,  29435]      *
##   cg17149495     chr1     [530959, 530959]      *
##          ...      ...                  ...    ...
##   cg02050847     chrY [22918038, 22918038]      *
##   cg25427172     chrY [23566730, 23566730]      *
##   cg10267609     chrY [24453757, 24453757]      *
##   cg26983430     chrY [24549675, 24549675]      *
##   cg08265308     chrY [28555550, 28555550]      *
##   ---
##   seqlengths:
##     chr1  chr2  chr3  chr4  chr5  chr6 ... chr13 chr14 chr15  chrX  chrY
##       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    NA
mcols(rowData(mse)) <- cbind(mcols(rowData(mse)), dmp)