---
title: "1. Introduction to R"
author: "Sonali Arora"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 2
vignette: >
  % \VignetteIndexEntry{1. Introduction to Bioconductor}
  % \VignetteEngine{knitr::rmarkdown}
---

```{r style, echo = FALSE, results = 'asis'}
BiocStyle::markdown()
options(width=100, max.print=1000)
options(useFancyQuotes=FALSE)
knitr::opts_chunk$set(
    eval=as.logical(Sys.getenv("KNITR_EVAL", "TRUE")),
    cache=as.logical(Sys.getenv("KNITR_CACHE", "TRUE")),
    error=FALSE)
```

Author: Sonali Arora (sarora@fredhutch.org)

Date: 20-22 July, 2015

The material in this course requires R version 3.2.1 and Bioconductor version 3.2.

## R

- http://r-project.org
- Open-source, statistical programming language; widely used in academia, finance, pharma, ...
- Core language, 'base' and > 4000 contributed packages
- Interactive sessions, scripts, packages

## Useful Functions in base R

- `dir`, `list.files` - List files.
- `read.table`, `scan` - Read data into R.
- `c`, `factor`, `data.frame`, `matrix` - Create vectors, data frames and matrices to store data.
- `summary`, `table`, `xtabs` - Summarize or cross-tabulate data.
- `plot` - Plot data to visualize it.
- `match`, `%in%`, `which` - Find elements of one vector in another.
- `split`, `cut` - Split or cut vectors.
- `strsplit`, `grep`, `sub` - Operate on character vectors.
- `lapply`, `sapply`, `mapply` - Apply a function to elements of lists.
- `t.test`, `lm`, `anova` - Compare two or several groups.
- `dist`, `hclust` - Cluster data.
- `biocLite`, `install.packages` - Install packages from online repositories (`biocLite` is Bioconductor's installer rather than part of base R).
- `traceback`, `debug`, `browser` - Debug errors.

## Getting help in R

- `?data.frame`
- `methods(lm)`, `methods(class=class(fit))`
- `?"plot"`
- `help(package="Biostrings")`
- `vignette(package="GenomicRanges")`
- StackOverflow; R-help mailing list

## Data types in R

- Vectors
    - logical, integer, numeric, character, ...
    - list() - contains other vectors (recursive)
    - factor(), NA - statistical concepts
    - Can be named - c(Seattle=1, Portland=2)
- matrix(), array()
    - a vector with a 'dim' attribute.
- data.frame()
    - like spreadsheets; list of equal length vectors.
    - Homogeneous types within a column, heterogeneous types across columns.
- Other classes
    - more complicated arrangement of vectors.
    - Examples - the value returned by lm(); the DNAStringSet class used to hold DNA sequences.
    - plain, 'accessor', 'generic', and 'method' functions
- Packages
    - base, recommended, contributed.

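The data types listed above can be explored directly at the R prompt. The chunk below is a minimal sketch using only base R; the object names (`lgl`, `lst`, `f`, `m`, `cities_df`) are made up for illustration and do not appear elsewhere in this course material.

```{r}
# Atomic vectors: every element has the same type
lgl <- c(TRUE, FALSE, NA)      # logical, including a missing value
int <- 1:3                     # integer
chr <- c("a", "b", "c")        # character

# A named numeric vector, as in the list above
cities <- c(Seattle = 1, Portland = 2)
cities["Portland"]             # subset by name

# A list can hold vectors of different types and lengths
lst <- list(flags = lgl, counts = int, label = "demo")
str(lst)

# A factor stores categories as integer codes plus a set of levels
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(f)
as.integer(f)                  # underlying codes: 1 2 1

# A matrix is a vector with a 'dim' attribute
m <- matrix(1:6, nrow = 2)
attributes(m)

# A data.frame is a list of equal-length vectors (its columns);
# each column has one type, but types may differ across columns
cities_df <- data.frame(city = names(cities), rank = unname(cities),
                        stringsAsFactors = FALSE)
sapply(cities_df, class)
is.list(cities_df)
```

`str()` and `class()` are usually the quickest way to check which of these types an unfamiliar object belongs to.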
## R programming concepts

- Functions - built-in (e.g., rnorm()); user-defined

```{r}
mean(1:10)
# draw 10 random values from a standard normal distribution
rnorm(10)
summary(rnorm(10))
```

- Subsetting - logical, numeric, character;

```{r}
data(iris)
# find those rows where Petal.Width is exactly 0.2
iris[iris$Petal.Width == 0.2, ]
# find those rows where Sepal.Length is less than 4.5
iris[iris$Sepal.Length < 4.5, ]
# find all rows belonging to setosa
setosa_iris <- iris[iris$Species == "setosa", ]
dim(setosa_iris)
head(setosa_iris)
```

- Iteration - over vector elements, lapply(), mapply(), apply()

```{r}
# drop the non-numeric column, i.e., the factor Species
iris <- iris[, !(names(iris) %in% "Species")]
dim(iris)
# find the mean of each of the 4 remaining numeric columns
lapply(iris, mean)    # simpler: colMeans(iris)
# simplify the result to a named vector
sapply(iris, mean)
# find the mean for each row
apply(iris, 1, mean)  # simpler: rowMeans(iris)
```

## R as a Statistical Computing Environment

```{r}
# define a vector
x <- rnorm(1000)
# vectorized calculation
y <- x + rnorm(1000, sd=.8)
# object construction
df <- data.frame(x=x, y=y)
# linear model
fit <- lm(y ~ x, df)
```

## Visualizing Data in R

```{r}
# base graphics: scatter plot with the fitted regression line
# (par() affects base graphics only; the ggplot2 figure is drawn separately)
par(mfrow=c(1,2))
plot(y ~ x, df, cex.lab=2)
abline(fit, col="red", lwd=2)

# the same plot with ggplot2
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point() + stat_smooth(method="lm")
```

## `sessionInfo()`

```{r sessionInfo}
sessionInfo()
```