Contents

0.1 Introduction

The purpose of this workshop is to serve as a guide for analyzing 450k Illumina Human Methylation data to find differences in methylation at the base pair and region levels. We will use previous research findings of differentially methylated locations, the TRIM29 gene, and other transcription regulators between various solid tissue samples, including breast (BRCA), colon (COAD), lung (LUSC). Normal and tumor breast (BRCA) samples are observed as guide.

We will use mainly the minfi and epivizrChart packages in Bioconductor. The minfi package is used to identify differentially methylated positions, regions, and blocks. The epivizrChart package is used to create interactive visualizations for genomic data objects in R. This protocol aims to walk through the analysis and visualization of Illumina Infinium Human Methylation 450k data from the Cancer Genome Atlas (TCGA) project.

0.2 Motivation

0.2.0.1 Tissue Specific DNA Methylation in Normal Human Breast Epithelium and in Breast Cancer (Avraham et al., 2014)

Research findings from Avraham et al. in 2014 conclude that the TRIpartite Motif-containing 29 (TRIM29), also called the ataxia telangiectasia group D-complementing (ATDC), has differing functions in various cancer types. In breast cancer, the expression of TRIM29 is less than the expression in normal breast tissue, and it acts as a tumor suppressor. The promoter region of the gene is hypo-methylated in normal breast tissue, but hyper-methylated in other normal tissues, like the colon, lung, and endometrium, where it is oncogenic. In tumor breast tissues, the methylation increased and expression decreased when compared with normal breast tissues. In comparisons between other cancer types and their respective normal tissues, the opposite methylation pattern was observed1.

The researchers identified 110 genes that were differentially methylated in the breast compared to all other epithelial tissues tested. Of these 110 genes, a group 15 of them were DNA binding and transcription regulators, which were found to have a notable relevance to cancer. Their analysis of TCGA data for these genes, including ALX4, GATA5, MGMT, NEUROG1, SOX10, SREBP1, ST18, TRIM29, and TP73, revealed methylation differences at most of the corresponding loci between normal and tumor breast samples. The methylation pattern also varied between breast cancer types, including basal, luminal-A and luminal-B, and between different normal tissue types, identifying these gene regions as differentially methylated regions (DMRs). The TRIM29 gene has a CpG island at the transcription start site of the gene (between 120,008,000 and 120,009,000 on chromosome 11)1.

In breast tissues, it was found that this CpG island was hypo-methylated, and in other tissues, including colon and lung, it has full methylation. These results were found analyzing Illumina-Infinium HumanMethylation27 array data. TRIM29 expression was higher in normal breast tissues than in other normal tissues tested. The main conclusion proposed by authors was that, “Epigenetic alterations affecting tissue-specific differentiation are the predominant mechanism by which epigenetic changes cause cancer1.” In this protocol we will recapitulate some of these findings and use interactive visualization to explore additional findings in this data.

0.3 Gathering and Preprocessing Data

The data used in this protocol is 450k Illumina Human Methylation data from the Cancer Genome Atlas (TCGA) project. In order to analyze the data and prepare it for visualization in epiviz, it must first be loaded into R. We use four different types of 450k samples, as shown in the chart below, so that methylation analyses could be done between the three normal samples and the normal and tumor BRCA samples, separately.

Cancer Type Tissue # Normal Samples # Tumor Samples
BRCA breast 15 15
COAD colon 15 0
LUSC lung 15 0

The accompanying vignette data_preprocessing.Rmd contains information on how to download and preprocess data used in this workshop. The end result are objects in classes defined by minfi:

Data Input Processing Function Output
Raw data (IDAT files) read.metharray RGChannelSet
RGChannelSet preprocessIllumina MethylSet
MethylSet mapToGenome GenomicMethylSet
GenomicMethylSet ratioConvert GenomicRatioSet

Installing, the epivizrChart package

library(devtools)
BiocInstaller::biocLite("epiviz/epivizrChart")

For this workshop we have subsetted these to chromosomes 10, 11 and 20:

library(epivizWorkshop)
library(minfi)
library(epivizrChart)
data(meth_set)
data(gm_set)
data(gratio_set)

meth_set
gm_set
gratio_set

0.4 Epiviz Web Components

The epivizrChart package is used to add interactive charts and dashboards for genomic data visualization into RMarkdown and HTML documents using the epiviz framework. It provides an API to interactively create and manage web components that encapsulate epiviz charts. Charts can be embedded in R markdown/notebooks to create interactive documents. Epiviz Web components are built using the Google Polymer library.

0.4.1 Epiviz Charts

Epiviz charts are used to visualize genomic data objects in R/BioConductor. The data objects can be BioConductor data types for ex: Genomic Ranges, ExpressionSet, SummarizedExperiment etc.

For example, to visualize hg19 reference genome as a genes track at a particular genomic location (chr, start, end)

library(Homo.sapiens)

genes_track <- epivizChart(Homo.sapiens, chr="chr11", start=118000000, end=121000000)

genes_track

epivizChart infers the chart type from the data object that was passed. Parameters are available to explicitly plot using a particular chart type.

0.4.2 Epiviz Environment

An important part of the epivizrChart design is that data and plots are separated: you can make multiple charts from the same data object without having to replicate data multiple times. This way, data queries are made by data object, not per chart, which leads to a more responsive design of the system. To enable this, we built the epiviz-environment web component. The environment element also enables brushing across all the charts.

To create an environment,

epivizEnv <- epivizEnv(chr="chr11", start=118000000, end=121000000)

# to add charts to the environment
genes_track <- epivizChart(Homo.sapiens, parent=epivizEnv)

data(cgi_gr)
cgi_track <- epivizChart(cgi_gr, parent=epivizEnv, datasource_name="CpG Islands")

epivizEnv

0.4.3 Epiviz Navigation

epiviz-navigation is a specific instance of environment elements with genomic context linked to it. In interactive sessions with a data provider, navigation elements provide functionality to search for a gene/probe and navigate to a genomic location. Navigation elements also provide an ideogram view when collapsed. This will be more clear as we walk through the rest of the vignette.

To create a navigation,

epivizNav <- epivizNav(chr="chr11", start=118000000, end=121000000)

# to add charts to the environment
genes_track <- epivizChart(Homo.sapiens, parent=epivizNav)

data(cgi_gr)
cgi_track <- epivizChart(cgi_gr, parent=epivizNav, datasource_name="CpG Islands")

epivizNav

Note: you can create environments without any genomic location. This will then plot all the data from a data object. Navigation elements must be initialized with a genomic location.

0.5 DMP Visualization

For this analysis, we will display our analysis data along with contextual including gene annotation, CpG Island locations and gene expression from the Gene Expression Barcode project. We will setup workspaces similar to those from Timp et al. in 2014 that included visualizations of large hypo-methylated blocks2 available at http//epiviz.github.io/timp2014. We’ll start by exploring a couple of those to get a sense of the type of visualizations and analyses we can perform.

To start our visualization, we will add a gene annotation track for the hg19 reference, a track with CpG islands obtained from UCSC using the AnnotationHub resource, and a heatmap with expression from the gene expression barcode project.

Let’s start the application with hg19 as the reference for genes track. We will be adding charts to the environment element to enable brushing and reuse data objects.

epivizEnv <- epivizEnv(chr="chr11", start=118000000, end=121000000)

library(Homo.sapiens)
genes_track <- epivizChart(Homo.sapiens, parent=epivizEnv)

genes_track

Now, let’s add the CpG island track included in this package.

data(cgi_gr)
cgi_track <- epivizChart(cgi_gr, parent=epivizEnv, datasource_name="CpG Islands")

cgi_track

Now, let’s add a heatmap with the gene expression barcode data

data(bcode_eset)

settings <- list(maxColumns = 300, colLabel="SYMBOL")
bcode_hmap <- epivizChart(bcode_eset, parent=epivizEnv, datasource_name="Gene Expression Barcode", chart="HeatmapPlot", settings=settings)

bcode_hmap

To view the current workspace -

epivizEnv

0.5.0.1 Differentially methylated position (DMP) analysis

Minfi’s main function to find differentially methylated positions is “dmpFinder”. It uses single probe analysis on CpGs to find methylation differences and a t-test (or F-test) between two or more phenotype groups phenotypes to determine CpG positions that are differentially methylated (in this case between BRCA normal and tumor samples).

Before we do this analysis, we may want to add a track showing promoter regions for genes since we may want to distinguish DMPs that occur within promoters. We will use Bioconductor annotation infrastructure to do this.

library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

promoter_regions <- promoters(TxDb.Hsapiens.UCSC.hg19.knownGene, upstream=1000, downstream=200)
promoters_track <- epivizChart(promoter_regions, parent=epivizEnv, datasource_name="Promoters")

promoters_track

The promoters_track object serves as the interface to the specific track we just added to the epiviz workspace. We can change track settings through the object. Let’s first find out settings that are available to modify for this track.

To do so, we use the

promoters_track$get_available_settings()

Let’s change the track color to distinguish it from the CpG track we added previously:

library(RColorBrewer)
color <- brewer.pal(3, "Paired")

promoters_track$set_colors(color)
promoters_track

Now, we are ready to perform the minfi analysis.

library(doParallel)
cores <- detectCores()
registerDoParallel(cores)
meth_set_breast <- meth_set[, which(meth_set$Tissue == "breast")]
dmp <- dmpFinder(meth_set_breast, meth_set_breast$Status,type="categorical")
head(dmp)
##             intercept        f         pval         qval
## cg10819493  1.8178602 154.3067 6.562626e-13 1.481329e-08
## cg08833577  0.8150518 140.5385 1.983895e-12 1.481329e-08
## cg23055735  1.9852222 139.9305 2.087557e-12 1.481329e-08
## cg08688773 -0.4177857 137.5906 2.544023e-12 1.481329e-08
## cg07204662  3.3497656 133.9346 3.484853e-12 1.481329e-08
## cg25904964  1.8990540 132.7374 3.869137e-12 1.481329e-08

Now we see the probes (CpG positions) that show differential methylation along with associated statistics for them. Let’s first add a track showing the location of these CpGs to our session.

dmp_gr <- granges(gm_set[rownames(dmp),])
dmps_track <- epivizChart(dmp_gr, parent=epivizEnv, datasource_name="Differentially Methylated Positions") 

dmps_track
epivizEnv

In the epivizEnv, you can use the hovering feature to explore how DMPs spatially correlate to promoters, CpG Islands, or genes based on their expression from the heatmap.

Now let’s plot methylation beta to see the measurement that gives rise to the statistical inferences we just plotted across the genome for each tissue type and status.

First, we calculate mean methylation for each group of samples

betas <- getBeta(gratio_set)
pd <- pData(gratio_set)
fac <- paste(pd$Tissue, pd$Status, sep=":")
sample_indices <- split(seq(len=nrow(pd)), fac)

mean_betas <- sapply(sample_indices, function(ind) rowMeans(betas[, ind]))

head(mean_betas)
##            breast:cancer breast:normal colon:normal lung:normal
## cg23934071     0.5840358     0.5974998    0.4479496   0.5020181
## cg06848711     0.7241740     0.7649785    0.6893632   0.7491951
## cg22609331     0.7189007     0.7449193    0.6804898   0.7157142
## cg26807412     0.8095336     0.8297993    0.8277101   0.8159922
## cg26679884     0.2634846     0.3354599    0.3179806   0.3224416
## cg05985066     0.6635175     0.7191271    0.6096836   0.6966796

Now we have a matrix with mean methylation for each group of samples which we will use to create our track. Let’s create a GRanges object that maps each of these values to the corresponding probes genomic location. Finally, add these values as a line track.

cpg_gr <- granges(gm_set)
mcols(cpg_gr) <- mean_betas
colors <- brewer.pal(4, "Spectral")

beta_track <- epivizChart(cpg_gr, datasource_name="Percent Methylation", type="bp", parent=epivizEnv, colors=colors)

beta_track

Another useful track to plot for these DMPs is the p-value resulting from the statistical test. We will plot that as a StackedLineTrack. First, we add the p-value to the GRanges object we are using.

dmp_gr$pval <- -log10(dmp$pval)

This time, we are plotting things a bit differently. The plot method we have used so far does two things: first, it registers the data object we pass to it as a data source in epiviz, and then it adds a chart with the default chart type defined for that data type.

So, since we will be using a “StackedLineTrack” which is not a default chart type, we need to do these two steps in order.

pvals_track <- epivizChart(dmp_gr, parent=epivizEnv, chart="StackedLineTrack", columns="pval", type="bp")

pvals_track

Finally, let’s add an MA plot of BRCA methylation to visualize how methylation differences are distributed across probes. We can of course do this in R by adding methylation difference and average as additional columns to the “GenomicsRanges” object we are using.

dmp_gr$methylation_diff = mean_betas[, "breast:cancer"] - mean_betas[, "breast:normal"]
dmp_gr$methylation_avg = 0.5 * (mean_betas[, "breast:cancer"] + mean_betas[, "breast:normal"])

ma_plot <- epivizChart(dmp_gr, parent=epivizEnv, chart="ScatterPlot", columns=c("methylation_avg", "methylation_diff"), type="bp")

ma_plot

Now that we have added all of these tracks, let’s navigate to the TRIM29 gene location to check the results discussed above. There are two ways to do this. One, we can navigate to the genomic location on the environment

# epivizEnv$navigate("chr11", 44182278, 44631716) #ALX4
epivizEnv$navigate("chr11", 119819400, 120150600) #TRIM29

epivizEnv

(or) We can create a new epivizNavigation element, and assign it to the genomic location of the TRIM29 gene.

# initialize a region as a navigation element
epivizNav <- epivizEnv$init_region(chr="chr11", start=119819400, end=120150600)

epivizEnv

We can remove all the charts from the environment using -

epivizEnv$remove_all_charts()

0.6 Differentially Methylated Region Analysis

Let’s start by adding gene annotation track, CpG islands and heatmap for breast (tumor and cancer), lung, and colon tissues from the gene expression barcode datsest, similar to what we did at the beginning of this vignette

genes_track <- epivizChart(Homo.sapiens, parent=epivizEnv)
cgi_track <- epivizChart(cgi_gr, parent=epivizEnv, datasource_name="CpG Islands")
bcode_hmap <- epivizChart(bcode_eset, parent=epivizEnv, datasource_name="Gene Expression Barcode", chart="HeatmapPlot", settings=settings)

and add the promoter regions and average methylation tracks we added previously.

promoter_track <- epivizChart(promoter_regions, datasource_name="Promoters", parent=epivizEnv)
beta_track <- epivizChart(cpg_gr, datasource_name="Percent Methylation", type="bp", parent=epivizEnv, colors=colors)

Next, use Minfi’s bumphunter function to find differentially methylated regions, which are deviations in the methylation of relatively small CpG loci clusters at the gene promoter scale (1-2kb), between normal and tumor breast tissue. Bumphunting uses a cutoff value to determine the magnitude in methylation difference to identify differentially methylated regions (DMR).

# first subset to breast samples
gratio_set_breast <- gratio_set[, which(gratio_set$Tissue == "breast")]

# make a design matrix to use with bumphunter
status <- pData(gratio_set_breast)$Status
mod <- model.matrix(~status)

# cluster cpgs into regions holding potential dmrs
gr <- granges(gratio_set_breast)
chr <- as.factor(seqnames(gr))
pos <- start(gr)
cl <- clusterMaker(chr, pos, maxGap=500)

# find dmrs
bumps <- bumphunter(gratio_set_breast, mod, cluster=cl, cutoff=0.1, B=0)

Let’s add a track containg the resulting dmrs. We will color blocks in track depending on the kind of dmr we find. To do so we will create to datasources, one for hypo-methylated DMRs, and one for hyper-methylated DMRs.

# categorize dmrs by type
dmr_gr <- with(bumps$table, GRanges(chr, IRanges(start, end), area=area, value=value))
dmr_gr$type <- ifelse(abs(dmr_gr$value) < 0.1, "neither",
  ifelse(dmr_gr$value < 0, "hypo", "hyper"))
table(dmr_gr$type)
## 
## hyper  hypo 
##  3474  5111
# make a GRanges object for each dmr type
hyper_gr <- dmr_gr[dmr_gr$type == "hyper"]
hypo_gr <- dmr_gr[dmr_gr$type == "hypo"]

# add each of these as a datasource on epiviz
hypo_ds <- epivizEnv$data_mgr$add_measurements(hypo_gr, datasource_name="Hypo DMRs")
hyper_ds <- epivizEnv$data_mgr$add_measurements(hyper_gr, datasource_name="Hyper DMRs")

# add the track
measurements <- c(hypo_ds$get_measurements(), hyper_ds$get_measurements())
dmr_track <- epivizChart(chart="BlocksTrack", measurements=measurements, parent=epivizEnv)

dmr_track

Lets look at the entire workspace

epivizEnv

Exercise: Add an MA plot of breast normal and tumor gene expression from the gene expression barcode to visualize differences in gene expression that are important. Use the same method as in the DMP Visualization section, but with gene expression data for BRCA normal and tumor types rather than DNA methylation data.

Exercise: In order to compare the visualizations created in this protocol to the findings of prior research, let’s create multiple epivizNavigation elements to look closer at the genes of interest. In this case we will examine a subset DNA binding and transcription regulator genes, where differences in methylation have been found by Avraham et al. (ALX4, GATA5 and MGMT).

Gene of Interest Chromosome # Start End
ALX4 11 44282278 44331716
GATA5 20 61038553 61051026
MGMT 10 131265454 131565783

Another analysis of interest would be to navigate hyper-methylated promoters regions which could reveal new interesting findings.

hyper_promoters <- subsetByOverlaps(hyper_gr, promoter_regions)
o <- order(-hyper_promoters$area)[1:5]
top_promoters <- hyper_promoters[o,]
top_promoters <- top_promoters + 10000

# Exercise: add navigation elements for different regions/navigate the environment #

Let’s remove all charts from the environment, before the next section:

epivizEnv$remove_all_charts()

0.7 Block finding

Let’s start with the same basic workspace again.

genes_track <- epivizChart(Homo.sapiens, parent=epivizEnv)
cgi_track <- epivizChart(cgi_gr, parent=epivizEnv, datasource_name="CpG Islands")
bcode_hmap <- epivizChart(bcode_eset, parent=epivizEnv, datasource_name="Gene Expression Barcode", chart="HeatmapPlot")

The third analysis we will use from minfi will be based on finding larger regions of methylation differences (blocks) in breast cancer. As before, let’s add tracks with promoter regions and average methylation percentages.

promoter_track <- epivizChart(promoter_regions, datasource_name="Promoters", parent=epivizEnv)
beta_track <- epivizChart(cpg_gr, datasource_name="Percent Methylation", type="bp", parent=epivizEnv, colors=colors)

Let’s also add a track showing breast beta methylation difference across the genome

cpg_gr$breast_diff <- cpg_gr$`breast:cancer` - cpg_gr$`breast:normal`
diff_track <- epivizChart(cpg_gr, datasource_name="Methylation Beta Difference", type="bp",
  columns="breast_diff", parent=epivizEnv)

diff_track

And the gene expression MA plot as before using the epiviz UI.

In comparison to bumphunter, Minfi’s blockFinder function is used to find differences in methylation on a much larger scale than bump hunting (gene scale instead of promoter scale). Block finding uses the cpgCollapse function to create clusters of neighboring open sea loci, groups them into larger regions than bump hunting only would, and then uses bump hunting on the averages of these regions to detect large differentially methylated regions.

cl <- cpgCollapse(gratio_set_breast)
blocks <- blockFinder(cl$object, mod, cluster=cl$blockInfo$pns, cutoff = 0.1)

Let’s add a track with hypomethylated blocks identified.

blocks_gr <- with(blocks$table, GRanges(chr, IRanges(start, end), area=area, value=value))
blocks_gr$type <- ifelse(abs(blocks_gr$value) < 0.1, "neither",
  ifelse(blocks_gr$value < 0, "hypo", "hyper"))
table(blocks_gr$type)

hypo_blocks <- blocks_gr[blocks_gr$type == "hypo",]
hypo_track <- epivizChart(hypo_blocks, datasource_name="Breast Hypo Blocks", parent=epivizEnv)
hypo_track

Let’s navigate through the blocks ordered by width

o <- order(-width(hypo_blocks))[1:5]
slideshow_regions <- hypo_blocks[o,] * .5

# Exercise: add navigation elements for different regions/navigate the environment #

These blocks tend to occur in gene-poor regions. Let’s look at blocks that may overlap gene regions.

gene_regions <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene,columns="gene_id",
  single.strand.genes.only=TRUE)
hypo_gene_blocks <- subsetByOverlaps(hypo_blocks, gene.regions)
o <- order(-hypo_gene_blocks$area)[1:5]
slideshow_regions <- hypo_gene_blocks[o,] * .5

# Exercise: add navigation elements for different regions/navigate the environment #

0.8 On your own

  1. Add a heatmap of CpG methylation values for the clustered regions used in the blockFinder function. Hint: add object cl$object as a data source, then use the visualize for HeatmapPlot.

  2. Add tracks or plots for objects loaded through AnnotationHub.

0.9 References

  1. Avraham A, Cho SS, Uhlmann R, Polak ML, Sandbank J, Karni T, et al. (2014) Tissue Specific DNA Methylation in Normal Human Breast Epithelium and in Breast Cancer. PLoS ONE 9(3): e91805. doi:10.1371/journal.pone.0091805

  2. Timp W, Bravo HC, McDonald OG, Goggins M, Umbricht C, Zeiger M, Feinberg AP, Irizarry RA (2014) Large hypomethylated blocks as a universal defining epigenetic alteration in human solid tumors. Genome Med 6(8): 61. doi:10.1186/s13073-014-0061-y