Practical Session: Analysis of genome-scale count data in Bioconductor
Friday 30 July, BioC 2010 Workshop, Fred Hutchinson Cancer Research Center, Seattle, WA
Mark Robinson (1,2) and Davis McCarthy (1)
1. Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research
2. Epigenetics Laboratory, Garvan Institute of Medical Research
1 Session Description
High throughput sequencing experiments extract information in two distinct ways: from the sequence itself, or from the mapped position. For the latter, mapped reads are typically aggregated to counts at some level of interest, such as transcripts, promoters, or genomic regions. Several emerging tools within Bioconductor have been developed for the differential analysis of count data, including baySeq, DEGseq, DESeq and edgeR. The tutorial gives an overview of the theory behind these tools and illustrates their features through examples using public datasets. Topics include summarization, statistical models for count data, sharing information over the whole dataset, statistical testing, normalization and the use of generalized linear models for more complex experimental designs.
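To make the aggregation step concrete, here is a minimal base R sketch that counts mapped read start positions falling inside a set of regions. The positions, region names and coordinates are made up for illustration; real pipelines would normally start from alignment files and use dedicated tools.

```r
# Hypothetical mapped read start positions on a single chromosome
read_pos <- c(105, 130, 190, 250, 260, 275, 410, 415, 980)

# Hypothetical regions of interest (e.g. genes), as start/end coordinates
regions <- data.frame(
  name  = c("geneA", "geneB", "geneC"),
  start = c(100, 240, 400),
  end   = c(200, 300, 500),
  stringsAsFactors = FALSE
)

# Count the reads whose start position falls inside each region
counts <- vapply(
  seq_len(nrow(regions)),
  function(i) sum(read_pos >= regions$start[i] & read_pos <= regions$end[i]),
  integer(1)
)
names(counts) <- regions$name
print(counts)
```

The read at position 980 lies outside every region and is simply not counted, which is the usual fate of reads that do not overlap a feature of interest.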
2 Session outline
We plan to feature both discussion of the background theory and hands-on analysis of real datasets in this practical.
2.1 Talk: preliminary theory [40 mins]
Topics:
- Applications
- Summarization
- Statistical models for count data
- Normalization
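The last two topics can be previewed with a small base R sketch. It contrasts the Poisson variance (equal to the mean) with the negative binomial variance, mu + phi * mu^2, and then scales a toy count table to counts per million. All numbers, including the dispersion phi, are made up for illustration, and counts per million is only the simplest form of library-size scaling.

```r
# Mean-variance relationship: Poisson versus negative binomial (NB)
mu  <- c(10, 100, 1000)      # hypothetical mean counts
phi <- 0.1                   # hypothetical NB dispersion
var_pois <- mu               # Poisson: variance equals the mean
var_nb   <- mu + phi * mu^2  # NB: extra variance grows with the squared mean
print(data.frame(mu, var_pois, var_nb))

# Library-size scaling: counts per million (CPM)
counts <- matrix(c(10, 20, 70,
                   30, 60, 210),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("lib1", "lib2")))
lib_size <- colSums(counts)
cpm <- t(t(counts) / lib_size) * 1e6
print(cpm)
```

In this toy table lib2 is sequenced three times as deeply as lib1 but has the same relative composition, so the two CPM columns coincide.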
2.2 Practical: simple analysis [20 mins]
Analysis of the Marioni data using the basic (default) approaches in each of the four Bioconductor packages for genome-scale count data (baySeq, DEGseq, DESeq and edgeR).
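As a rough idea of what a default two-group analysis looks like in one of these packages, the sketch below runs a basic edgeR workflow. It is not the workshop script (marioni_DE_analysis.R): the count table is simulated, with made-up gene and sample names standing in for the Marioni data, and the analysis is skipped if edgeR is not installed.

```r
# Simulated counts stand in for the Marioni table (all values are made up)
set.seed(1)
n_genes <- 200
group <- factor(rep(c("Kidney", "Liver"), each = 5))
counts <- matrix(rnbinom(n_genes * 10, mu = 50, size = 10),
                 nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0(group, rep(1:5, 2))))

if (requireNamespace("edgeR", quietly = TRUE)) {
  y  <- edgeR::DGEList(counts = counts, group = group)  # counts plus group labels
  y  <- edgeR::calcNormFactors(y)                       # normalization factors
  y  <- edgeR::estimateCommonDisp(y)                    # one dispersion for all genes
  et <- edgeR::exactTest(y)                             # two-group exact test
  print(edgeR::topTags(et, n = 5))                      # top-ranked genes
} else {
  message("edgeR is not installed; skipping the analysis")
}
```

Because the simulated counts contain no true differences between the groups, any genes reported at the top of the table are there by chance.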
2.3 Talk: advanced theory and research topics [30 mins]
Topics:
- Sharing information over entire dataset
- Statistical testing
- Other considerations: error models, more complex designs, variance models
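As a simplified illustration of statistical testing for a single gene, the sketch below uses the Poisson special case: conditional on the gene's total count, the count in the first library is binomial, with success probability given by the relative library sizes. The counts and library sizes are made up, and this is only the zero-dispersion analogue of a negative binomial exact test, not the test implemented in any of the packages.

```r
# Hypothetical counts for one gene in two libraries, with their library sizes
k1 <- 30; n1 <- 1e6
k2 <- 60; n2 <- 1.5e6

# Under no differential expression:
#   k1 | (k1 + k2) ~ Binomial(k1 + k2, n1 / (n1 + n2))
p0  <- n1 / (n1 + n2)
res <- binom.test(k1, k1 + k2, p = p0)
print(res$p.value)

# Log2 fold change (library 2 versus library 1) after scaling by library size
log2fc <- log2((k2 / n2) / (k1 / n1))
print(log2fc)
```

Testing each gene in isolation like this ignores overdispersion, which is where sharing information across the whole dataset comes in.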
2.4 Practical: advanced analysis [30 mins]
More advanced analysis of the 't Hoen Tag-seq data (the focus will be on edgeR). For those more interested in the analysis of ChIP-seq, MeDIP or other count data, there will be time for practical experience with another dataset and for discussion.
3 Materials
All of the materials necessary for this session (data files and documented R code) will be provided in the custom package put together for the BioC 2010 practical session.
3.1 Documented R code files
3.1.1 R code for the Marioni data
The file is marioni_DE_analysis.R
3.1.2 R code for the 't Hoen Tag-seq data
The file is tHoen_TagSeq_analysis.R
3.2 Data files
The data we will use for examples in this session will be available in the custom package from Bioconductor put together for the BioC 2010 workshop. The data are briefly described below.
3.2.1 Marioni data
This dataset consists of one Rdata file with the table of counts. The data come from an early RNA-seq experiment carried out by Marioni et al. (2008), comparing gene expression levels in human kidney and liver cells. There are five technical replicates each for kidney and liver. Currently this dataset is also available from the WEHI Bioinf website.
3.2.2 't Hoen data
These data come from a Tag-seq (Long-SAGE like) experiment conducted by 't Hoen et al. (2008) to compare gene expression in mouse hippocampal tissue from wild-type mice and knockout mice with the gene DCLK knocked out. This dataset is larger (~40Mb) and consists of three kinds of file: a text file for each of the eight samples containing the observed counts for each tag (GSM272105.txt, GSM272106.txt, GSM272318.txt, GSM272319.txt, GSM272320.txt, GSM272321.txt, GSM272322.txt, GSM272323.txt); a 'targets file' providing file names and descriptions to be read into edgeR (Targets.txt); and a summary file giving information about the dataset as a whole (GSE10782_Dataset_Summary.txt).
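The way a targets file ties per-sample count files together can be sketched in base R. Everything here is an assumption made for illustration: the two-column tab-delimited layout (tag, count), the targets columns (files, group), and the tiny made-up files written to a temporary directory. In the practical itself this step is handled by edgeR, which provides the readDGE function for reading such files.

```r
# Write two tiny made-up tag-count files and a targets file to a temp directory
# (file names, tags and counts are hypothetical; the real files are GSM*.txt)
data_dir <- tempfile("tagseq_")
dir.create(data_dir)
write.table(data.frame(tag = c("AAAC", "AAAG", "AACT"), count = c(5, 2, 9)),
            file.path(data_dir, "sampleA.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(tag = c("AAAC", "AACT", "AAGG"), count = c(7, 1, 4)),
            file.path(data_dir, "sampleB.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(files = c("sampleA.txt", "sampleB.txt"),
                       group = c("WT", "KO")),
            file.path(data_dir, "Targets.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Read the targets file, then the tag counts for each listed sample
targets <- read.delim(file.path(data_dir, "Targets.txt"), stringsAsFactors = FALSE)
per_sample <- lapply(targets$files, function(f) {
  read.delim(file.path(data_dir, f), stringsAsFactors = FALSE)
})

# Build a tags x samples matrix; tags absent from a sample get a count of zero
all_tags <- sort(unique(unlist(lapply(per_sample, function(d) d$tag))))
counts <- sapply(per_sample, function(d) {
  x <- d$count[match(all_tags, d$tag)]
  x[is.na(x)] <- 0
  x
})
dimnames(counts) <- list(all_tags, targets$files)
print(counts)
```

Taking the union of tags matters for Tag-seq data, since many tags are observed in only some of the samples.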
3.3 PPT Presentation
We will be giving a presentation on the background and theory for analysing count data as part of the practical session. Our slides should be available after the workshop.
4 Setup in R
The following R code will ensure that you have the necessary R/Bioconductor packages installed for the practical session.
source("http://bioconductor.org/biocLite.R")
biocLite("baySeq")
biocLite("DEGseq")
biocLite("DESeq")
biocLite("edgeR")
install.packages("matrixStats")
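After installation, a quick check like the following can confirm which of the packages are actually available. This is a small convenience sketch rather than part of the workshop materials. Note also that on current Bioconductor releases the biocLite.R script has been superseded by the BiocManager package, so the installation commands may need adapting on a recent R.

```r
# Packages used in the practical session
pkgs <- c("baySeq", "DEGseq", "DESeq", "edgeR", "matrixStats")

# system.file() returns "" when a package is not installed
installed <- vapply(pkgs,
                    function(p) nzchar(system.file(package = p)),
                    logical(1))
print(installed)

if (!all(installed)) {
  message("Missing packages: ", paste(pkgs[!installed], collapse = ", "))
}
```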
5 References
Here are some references for edgeR and the other Bioconductor packages covered in our practical session, as well as some other important and useful references in this area.
- Robinson and Smyth, Biostatistics, 2008, 9(2):321-32.
- Robinson and Smyth, Bioinformatics, 2007, 23(21):2881-7.
- Robinson et al., Bioinformatics, 2010, 26(1):139-40.
- Bullard et al., BMC Bioinformatics, 2010, 11:94.
- Robinson and Oshlack, Genome Biology, 2010, 11(3):R25.
- Anders and Huber, Nature Precedings, 2010, http://dx.doi.org/10.1038/npre.2010.4282.1
- Wang et al., Bioinformatics, 2010, 26(1):136-8.
- Hardcastle, baySeq, http://www.bioconductor.org/packages/release/bioc/html/baySeq.html
- Oshlack and Wakefield, Biol Direct, 2009, 4:14.
Date: 2010-07-23 12:26:20 EST