Bioconductor provides software to help analyze diverse high-throughput genomic data. Common workflows include:
Sequence Analysis Import fasta, fastq, BAM, gff, bed, wig, and other sequence formats. Trim, transform, align, and manipulate sequences. Perform quality assessment, ChIP-seq, differential expression, RNA-seq, and other workflows. Access the Sequence Read Archive.
Oligonucleotide Arrays Import Affymetrix, Illumina, Nimblegen, Agilent, and other platforms. Perform quality assessment, normalization, differential expression, clustering, classification, gene set enrichment, genetical genomics and other workflows for expression, exon, copy number, SNP, methylation and other assays. Access GEO, ArrayExpress, Biomart, UCSC, and other community resources.
Annotation Resources Introduction to using gene, pathway, gene ontology, homology annotations and the AnnotationHub. Access GO, KEGG, NCBI, Biomart, UCSC, vendor, and other sources.
Annotating Genomic Ranges Represent common sequence data types (e.g., from BAM, gff, bed, and wig files) as genomic ranges for simple and advanced range-based queries.
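To make the idea of a range-based query concrete, here is a minimal, language-neutral sketch in Python. It is not the Bioconductor API: the `Range` record, the `find_overlaps` helper, and the toy coordinates are all hypothetical, and the sketch assumes 1-based, inclusive coordinates. It uses a simple all-pairs comparison; real range infrastructure would typically use an indexed structure for large inputs.

```python
from collections import namedtuple

# Hypothetical record type for this sketch; coordinates are assumed
# to be 1-based and inclusive at both ends.
Range = namedtuple("Range", ["chrom", "start", "end"])


def overlaps(a, b):
    """True when two ranges share at least one base on the same sequence."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def find_overlaps(query, subject):
    """Return (query_index, subject_index) for every overlapping pair."""
    return [
        (i, j)
        for i, q in enumerate(query)
        for j, s in enumerate(subject)
        if overlaps(q, s)
    ]


# Toy data, invented for illustration only.
reads = [Range("chr1", 100, 150), Range("chr1", 400, 450), Range("chr2", 100, 150)]
genes = [Range("chr1", 120, 300), Range("chr2", 140, 500)]

hits = find_overlaps(reads, genes)
print(hits)
```

Each pair in `hits` links a read to a gene it overlaps, which is the building block for questions such as "which reads fall in which annotated feature".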
Annotating Genomic Variants Read and write VCF files. Identify structural location of variants and compute amino acid coding changes for non-synonymous variants. Use SIFT and PolyPhen database packages to predict consequence of amino acid coding changes.
Changing genomic coordinate systems with rtracklayer::liftOver The liftOver facilities developed in conjunction with the UCSC browser track infrastructure are available for transforming data in GRanges format. This is illustrated here with an image of the NHGRI GWAS catalog that is, as of Oct. 31 2014, distributed with coordinates defined by genome build hg38 (GRCh38).
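The core idea behind a coordinate lift can be sketched as a lookup in a table of aligned blocks between two genome builds. The Python below is an illustrative simplification, not the rtracklayer implementation: the `lift_position` helper and the block table are hypothetical, the block coordinates are invented toy values rather than real chain data, and only forward-strand, ungapped blocks are handled.

```python
def lift_position(chrom, pos, chain_blocks):
    """Map a 0-based position from the source build to the target build.

    chain_blocks is a list of tuples
    (src_chrom, src_start, src_end, dst_chrom, dst_start)
    with half-open source intervals. Returns None when the position
    falls in a gap that has no counterpart in the target build.
    """
    for src_chrom, src_start, src_end, dst_chrom, dst_start in chain_blocks:
        if chrom == src_chrom and src_start <= pos < src_end:
            return dst_chrom, dst_start + (pos - src_start)
    return None


# Invented toy blocks: the first block is shifted by 50 bases,
# the second by 100, and positions 1000-1199 have no mapping.
blocks = [
    ("chr1", 0, 1000, "chr1", 50),
    ("chr1", 1200, 2000, "chr1", 1300),
]

print(lift_position("chr1", 10, blocks))
print(lift_position("chr1", 1100, blocks))
print(lift_position("chr1", 1500, blocks))
```

Positions that land in an unmapped gap return `None`, which mirrors the practical point that not every feature survives a change of genome build.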
High Throughput Assays Import, transform, edit, analyze and visualize flow cytometric, mass spec, HTqPCR, cell-based, and other assays.
RNA-Seq workflow: gene-level exploratory analysis and differential expression This lab will walk you through an end-to-end RNA-Seq differential expression workflow, using DESeq2 along with other Bioconductor packages. We will start from the FASTQ files, show how these were aligned to the reference genome, prepare gene expression values as a count matrix by counting the sequenced fragments, perform exploratory data analysis (EDA), perform differential gene expression analysis with DESeq2, and visually explore the results.
Mass spectrometry and proteomics This lab demonstrates how to access data from proteomics data repositories, parse various mass spectrometry data formats, identify MS2 spectra and analyse the search results, and use the high-level infrastructure for raw mass spectrometry and quantitative proteomics experiments, as well as for quantitative data processing and analysis.
Transcription Factor Binding Finding Candidate Binding Sites for Known Transcription Factors via Sequence Matching.
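One common way to find candidate sites by sequence matching is to score every window of a sequence against a position weight matrix (PWM). The Python sketch below illustrates that general technique under stated assumptions; it is not the code used in the workflow. The helper names, the uniform background of 0.25, the pseudocount, the score threshold, and the toy "TATA" count matrix are all hypothetical.

```python
import math

BASES = "ACGT"


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds scores.

    counts is a list with one dict per motif position mapping base -> count.
    A uniform background frequency is assumed for this sketch.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix


def scan(sequence, matrix, min_score):
    """Return (start, window, score) for windows scoring at least min_score."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= min_score:
            hits.append((start, window, score))
    return hits


# Toy motif resembling "TATA"; the counts are invented for illustration.
toy_counts = [{"T": 8}, {"A": 8}, {"T": 8}, {"A": 8}]
pwm = log_odds_matrix(toy_counts)
sites = scan("GGTATAGG", pwm, min_score=5.0)
print([(start, window) for start, window, _ in sites])
```

The threshold controls the trade-off between sensitivity and false positives; a real analysis would also scan the reverse complement strand, which this sketch omits.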
Cloud-enabled cis-eQTL search and annotation Bioconductor can be used to perform detailed analyses of relationships between DNA variants and mRNA abundance. Genotype (potentially imputed) and expression data are organized in packages prior to analysis, using very concise representations. SNP and probe filters can be specified at run time. Transcriptome-wide testing can be carried out using multiple levels of concurrency (chromosomes to nodes, genes to cores is a common approach). Default outputs of the cloud-oriented interface ciseqByCluster include FDR for all SNP-gene pairs in cis, along with locus-specific annotations of genetic and genomic contexts.
Differential Binding from ChIP-seq data This workflow describes an analysis pipeline for de novo detection of differential binding (DB) from ChIP-seq data, from read alignment to interpretation of putative DB regions. It will be based on the use of sliding windows in the csaw package, with statistical modelling performed using methods in the edgeR package. Analyses will be demonstrated on real histone mark and transcription factor ChIP-seq data.
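The sliding-window idea can be sketched as counting reads in fixed-width windows placed at regular intervals along a chromosome. The Python below is a simplified illustration, not the csaw implementation: the `window_counts` helper, its parameters, and the toy read positions are hypothetical, and a read is assigned by its start position only. The statistical modelling step is not shown.

```python
def window_counts(read_starts, chrom_length, width, spacing):
    """Count read start positions falling in each sliding window.

    Windows are half-open intervals [start, start + width) placed every
    `spacing` bases; positions are 0-based. Returns (window_start, count).
    """
    counts = []
    for win_start in range(0, chrom_length - width + 1, spacing):
        win_end = win_start + width
        n = sum(1 for pos in read_starts if win_start <= pos < win_end)
        counts.append((win_start, n))
    return counts


# Toy read start positions on a 50-base "chromosome", invented for illustration.
counts = window_counts([5, 12, 18, 40], chrom_length=50, width=20, spacing=10)
print(counts)
```

Because the windows overlap, a single read can contribute to more than one window; per-window counts from several samples would then form the count matrix passed to a statistical model.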
Variant Calling This presentation illustrates a typical variant calling workflow, starting with FASTQ data and working through alignment, filtering, tallying, and calling. QC issues such as alignment coverage, mappability, and problematic homopolymers are explored. Called variants are exported as a VCF file and compared against published genotypes for concordance. Final variants are annotated with coding consequence and disease association.
Nucleotide Tallies Managing sequence data of large cohorts for population level analysis has become increasingly difficult with current file formats such as BAM, VCF, BCF, GTF, etc. Many studies work exclusively on the level of preprocessed variant calls stored in VCF/MAF files simply because there is no way to look at the data with reasonable resource usage. This tutorial presents an HDF5 alternative for storing variant tallies from BAM files. This intermediate file format stores nucleotide tallies rather than alignments and provides efficient random access to cohort-level data. Once created, the tally files can be easily manipulated and used to create custom reports and plots.
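To show what a nucleotide tally is, the Python sketch below counts the bases observed at each position of a region from a set of aligned reads. It is only an in-memory illustration under stated assumptions: the `tally` helper and the toy reads are hypothetical, alignments are assumed to be ungapped with 0-based start positions, and nothing is written to HDF5.

```python
def tally(reads, region_start, region_length):
    """Count A/C/G/T observations per position within a region.

    reads is a list of (start, sequence) pairs for ungapped alignments,
    with 0-based start positions. Returns one dict of base counts per
    position of the region.
    """
    table = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(region_length)]
    for start, seq in reads:
        for offset, base in enumerate(seq):
            idx = start + offset - region_start
            if 0 <= idx < region_length and base in table[idx]:
                table[idx][base] += 1
    return table


# Toy alignments, invented for illustration.
toy_reads = [(0, "ACGT"), (2, "GTTA"), (3, "TT")]
table = tally(toy_reads, region_start=0, region_length=6)
for position, base_counts in enumerate(table):
    print(position, base_counts)
```

The table has a fixed size per position regardless of how many reads cover it, which is why tallies can be much cheaper to store and query than the alignments they summarise.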
Low-level analyses of single-cell RNA-sequencing data This workflow implements a low-level analysis pipeline for scRNA-seq data using scran, scater and other Bioconductor packages. It describes how to perform quality control on the libraries, normalization of cell-specific biases, basic data exploration and cell cycle phase identification. Procedures to detect highly variable genes, significantly correlated genes and subpopulation-specific marker genes are also shown. These analyses are demonstrated on a range of publicly available scRNA-seq data sets.
RNA-seq analysis is easy as 1-2-3 This workflow demonstrates how to analyse RNA-sequencing data using the edgeR, limma and Glimma packages. The edgeR package is first used to import, organise, filter and normalise the data, followed by the limma package with its voom method, linear modelling and empirical Bayes moderation to assess differential expression and perform gene set testing. This pipeline is further enhanced by the Glimma package which enables interactive exploration of the results so that individual samples and genes can be examined by the user.
Gene Expression Normalization Workflow This workflow elucidates a customizable strategy to identify the effects of technical and confounding factors on gene expression data and normalize it while preserving the underlying biological features of interest. The example analysis demonstrated here explores how certain technical covariates influence the interpretation of the impact of Coronary Artery Disease on peripheral blood gene expression.
Methylation Array Analysis Methylation in the human genome is known to be associated with development and disease. The Illumina Infinium methylation arrays are by far the most common way to interrogate methylation across the human genome. This Bioconductor workflow uses multiple packages for the analysis of methylation array data. Specifically, we demonstrate the steps involved in a typical differential methylation analysis pipeline including: quality control, filtering, normalization, data exploration and statistical testing for probe-wise differential methylation. We further outline other analyses such as differential methylation of regions, differential variability analysis, estimating cell type composition and gene ontology testing. Finally, we provide some examples of how to visualise methylation array data.
Gene-level RNA-seq differential expression and pathway analysis This workflow performs gene-level RNA-seq differential expression and pathway analysis using Rsubread and the edgeR quasi-likelihood pipeline.
TCGA Workflow: Analyze cancer genomics and epigenomics data Biotechnological advances in sequencing have led to an explosion of publicly available data via large international consortia such as The Cancer Genome Atlas (TCGA), The Encyclopedia of DNA Elements (ENCODE), and The NIH Roadmap Epigenomics Mapping Consortium (Roadmap). These projects have provided unprecedented opportunities to interrogate the epigenome of cultured cancer cell lines as well as normal and tumor tissues with high genomic resolution. The Bioconductor project offers more than 1,000 open-source software and statistical packages to analyze high-throughput genomic data. However, most packages are designed for specific data types (e.g. expression, epigenetics, genomics) and there is no single comprehensive tool that provides a complete integrative analysis of the resources and data provided by all three public projects. The need to integrate these different analyses was recently raised. In this workflow, we provide a series of biologically focused integrative analyses of different molecular data. We describe how to download, process and prepare TCGA data and, by harnessing several key Bioconductor packages, how to extract biologically meaningful genomic and epigenomic data. Using Roadmap and ENCODE data, we provide a work plan to identify biologically relevant functional epigenomic elements associated with cancer. To illustrate our workflow, we analyzed two types of brain tumors: low-grade glioma (LGG) versus high-grade glioma (glioblastoma multiforme or GBM).
See the HOWTO Creating Workflow Vignettes for information on contributing your own workflow.