Contents

1 Rants

1.1 Classic, tidy, rich data representations

Simple example: classic vs. tidy

aggregate(hp ~ cyl, mtcars, mean)                     # classic
##   cyl        hp
## 1   4  82.63636
## 2   6 122.28571
## 3   8 209.21429
library(tidyverse)
mtcars %>% group_by(cyl) %>% summarize(hp = mean(hp)) # tidy
## # A tibble: 3 x 2
##     cyl        hp
##   <dbl>     <dbl>
## 1     4  82.63636
## 2     6 122.28571
## 3     8 209.21429

Biological example

library(airway)
data(airway)
airway                                   # rich
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
hist(log10(rowMeans(assay(airway))))

library(reshape2)
library(tibble)
tbl <- as_tibble(melt(
    assay(airway), c("Gene", "Sample"), value.name="Count"
))
tbl                                      # tidy (incomplete -- no metadata)
## # A tibble: 512,816 x 3
##               Gene     Sample Count
##             <fctr>     <fctr> <int>
##  1 ENSG00000000003 SRR1039508   679
##  2 ENSG00000000005 SRR1039508     0
##  3 ENSG00000000419 SRR1039508   467
##  4 ENSG00000000457 SRR1039508   260
##  5 ENSG00000000460 SRR1039508    60
##  6 ENSG00000000938 SRR1039508     0
##  7 ENSG00000000971 SRR1039508  3251
##  8 ENSG00000001036 SRR1039508  1433
##  9 ENSG00000001084 SRR1039508   519
## 10 ENSG00000001167 SRR1039508   394
## # ... with 512,806 more rows
tbl %>% group_by(Gene) %>% summarize(mean(Count))
## # A tibble: 64,102 x 2
##               Gene `mean(Count)`
##             <fctr>         <dbl>
##  1 ENSG00000000003       741.875
##  2 ENSG00000000005         0.000
##  3 ENSG00000000419       534.875
##  4 ENSG00000000457       242.000
##  5 ENSG00000000460        58.375
##  6 ENSG00000000938         0.375
##  7 ENSG00000000971      6034.750
##  8 ENSG00000001036      1305.000
##  9 ENSG00000001084       615.125
## 10 ENSG00000001167       392.500
## # ... with 64,092 more rows

Context

  • Classic: sample x feature data.frame().
  • Tidy: long-form data.frame(); see the tidyverse
  • Rich: domain-specific, object oriented.

Vocabulary

  • Simple: extensive
  • Tidy: restricted ‘endomorphisms’
  • Rich: extensive, ‘meaningful’

Constraints (e.g., probes & samples)

  • Tidy: implicit
  • Simple, Rich: explicit

Flexibility

  • Simple, tidy: general-purpose
  • Rich: specialized

Programming contract

  • Simple, tidy: limited
  • Rich: strict

Lessons learned / best practices

  • Considerable value in semantically rich structures
  • Current implementations trade-off user and developer convenience
  • Endomorphism, simple vocabulary, consistent paradigm aid use

1.2 Scaling and performance

library(BiocFileCache)
library(TENxGenomics)              # https://github.com/mtmorgan/TENxGenomics
fl <- bfcrpath(BiocFileCache(), "1M_neurons_filtered_gene_bc_matrices_h5.h5")
## Warning: call dbDisconnect() when finished working with a connection
TENxGenomics(fl)                          # big!
## class: TENxGenomics 
## h5path: /home/mtmorgan/.cache/BiocFileCache/38e45fadc64 
## dim(): 27998 x 1306127
m = as.matrix(TENxMatrix(fl)[, 1:1000])   # first 1000 cells
sum(m == 0) / length(m)                   # % zeros
## [1] 0.9276541
sum(rowSums(m == 0) == ncol(m)) / nrow(m) # % genes with no expression
## [1] 0.4264947
hist(log10(1 + colSums(m)))               # reads per cell

  • Write efficient R code first (10 - 100x speedup, often cleaner and more robust)
  • Scale to cores, cluster, cloud when needed
    • 10x (cores) - 100x (large clusters / clouds) ‘faster’
    • 10x - 100x more complicated and fragile!
  • Large data requires different approach – illusion of in-memory, iteration

1.3 Visualization, interactivity, …

library(ggplot2)
library(plotly)
static <- mtcars %>% mutate(cyl = factor(cyl)) %>%
    ggplot(aes(x=qsec, y=mpg)) +
        geom_point(aes(color=cyl))
ggplotly(static)
  • plotly::ggplotly() makes interactive visualization more-or-less a no-brainer.
  • shiny is also ‘fun’ and ‘easy’ interactivity.
    • Especially valuable for throw-away applications, e.g., for use with a local collaborator for data exploration.
    • More ambitious problems must be built on top of robust, testable, functions and script-based work flows implemented as fully documented and tested R packages.

2 Olde school / neue school

2.1 Continuous integration

  • Commits trigger a script to build and check your package
  • Continuous integration on Linux and Windows platforms via (mostly) GitHub webhooks or integrations
    • Travis-CI: Linux setup for continuous integration
    • AppVeyor: Windows instance for package testing (available for 32 & 64 bit architectures)
    • Wercker: Docker-container-based integration option may work with already available Bioconductor Docker images
    • Codeship: Supports bitbucket
    • R-Hub: Integrated with R, still in development
  • The first three work with package-level YAML files to setup an integration environment. Cost-effective option for cross-platform testing.

2.2 Do I / Will I use…

devtools create(), load_all(), check()

  • Yes, until I run into mysterious S4 / namespace problems. These are often devtools limitations.

roxygen2 (e.g., devtools::document())

  • Yes, but I pay a lot of attention to consolidating documentation on fewer pages – not ‘one function, one man page’ – and end up writing important parts of the docs in Rd-style.

testthat (e.g., devtools::use_testthat(), devtools::test())

  • Yes, unit tests are essential, and this is a nice package.

Rcpp

  • No, but I feel like I should…
  • STL containers and algorithms can help a lot with formulating basic solutions.
  • Effective Rcpp use (given my current state of R’s C API) requires a deeper understanding of C++.

Pipes and lazy evaluation (e.g., magrittr, %>%)

  • No, not for robust code development.
  • Emphasizes ‘endomorphisms’ (function returns same object as first argument), which is probably good.

S3 classes or S4 ‘reference’ classes

  • Only under limited circumstances.
  • S3: internal and lightweight, or for interoperability.
  • S4 ‘reference’ classes – when reference-based semantics are appropriate – almost never with data objects; sometimes it seems like files and connections have reference-based semnatics, but I often try this and ‘get it wrong’…

Other class systems

  • No, R is weird enough.

git and GitHub

  • Yes, git’s local repositories and branching are great.
  • GitHub’s issue tracker and pull requests are useful for relatively small collections of repositories.

Continuous integration

  • Yes, but I’ll still rely on local R CMD build / R CMD check before pushing commits.

(R / Bioconductor) Docker and AMI containers

  • No, not for my daily work; it pays to have a fully function native installation.
  • AMIs are great for courses.

3 Acknowledgements

Research reported in this course was supported by the National Human Genome Research Institute and the National Cancer Institute of the National Institutes of Health under award numbers U41HG004059 and U24CA180996.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 633974)