Contents

1 Introduction

This describes some of the ‘upstream’ steps required for known-gene differential expression analysis, using the ultra-fast alignment work flow that goes from FASTQ files to count data without intermediate BAM files.

We follow the salmon tutorial, R-ified in the later stages. We assume a Linux operating system with folder ~/a/salmon/tutorial, and with salmon installed at ~/bin/salmon.

TUTORIAL <- "~/a/salmon/tutorial"
SALMON <- "~/bin/salmon"

2 Download

The tutorial requires FASTQ files representing (paired-end) sequenced samples. We retrieve them from the short read archive (SRA) using the wget command line tool.

#!/bin/bash

DIR=~/a/tutorial
SRA=ftp://ftp.sra.ebi.ac.uk/vol1/fastq

for i in `seq 28 40`; 
do 
  mkdir -p ${DIR}/data/DRR0161${i}; 
  cd ${DIR}/data/DRR0161${i}; 
  wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_1.fastq.gz; 
  wget -bqc ${SRA}/DRR016/DRR0161${i}/DRR0161${i}_2.fastq.gz; 
done

cd $DIR

We will align the sequences to a reference transcriptome, in this case from Ensembl for Arabidopsis thaliana. We download the sequence and index it.

#! /bin/bash

DIR=~/a/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
    ${ENSEMBL}/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz \
    -o ${DIR}/athal.fa.gz
salmon index -t ${DIR}/athal.fa.gz -i ${DIR}/athal_index

If we were interested, it’s easy enough to read this in to R

library(Biostrings)
readDNAStringSet(file.path(TUTORIAL, "athal.fa.gz"))
##   A DNAStringSet instance of length 41392
##         width seq                                        names               
##     [1]   462 ATGTCCCTTCTGTTTCAACA...GAAGAGATGTCGGCATTAA ATMG00010.1 cdna:...
##     [2]   324 ATGTTTCAGTTCGCTAAGTT...CTTTCTTTTCAACTCGTAA ATMG00030.1 cdna:...
##     [3]   948 ATGACAAAGCGTGAGTATAA...CAGACATTATGTCGACTAG ATMG00040.1 cdna:...
##     [4]   396 ATGTCTGGCGTTTACCTGAC...AAGCGGCGGGTTCGGGTAA ATMG00050.1 cdna:...
##     [5]   542 CCAATTTTTGGGCCAATTCC...AAAGTCAAGTCAAGAATAA ATMG00060.1 cdna:...
##     ...   ... ...
## [41388]   952 GCCATTCCACATTTTTTTAT...ATGAAAAGAAGTAGGAATT AT1G80970.2 cdna:...
## [41389]  1418 GAACATAGTTAGACCGTGTA...CTTTTGTTTGTTGAATCAC AT1G80980.1 cdna:...
## [41390]   690 ATGGATAGATTGAGTGACTT...TCGTGACGACTGCTGCTGA AT1G80990.1 cdna:...
## [41391]   276 ATGTTATGTGCGTGTTGCGT...TGCTTTTTCGGATGATTGA AT1G81001.1 cdna:...
## [41392]   739 AAAGAAACCTCTTCCTCTAA...ATTAGAATAACATTATTTT AT1G81020.1 cdna:...

We also need to know how transcripts map to genes since we’ll do a gene-level analysis. We download the annotation file (gff3) matching the transcript data base.

#! /bin/bash

DIR=~/a/tutorial
ENSEMBL=ftp://ftp.ensemblgenomes.org/pub/plants/release-28
curl \
    ${ENSEMBL}/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.28.gff3.gz\
    -o ${DIR}/Arabidopsis_thaliana.TAIR10.28.gff3.gz

3 Quantification

The tutorial bash script invokes salmon on each sample, using the paired-end FASTQ files.

#!/bin/bash

DIR=/home/ubuntu/tutorial
SALMON=/home/ubuntu/bin/salmon

for fn in $DIR/data/DRR0161{25..40};
do
    sample=`basename ${fn}`
    echo "Processing sample ${samp}"
    ${SALMON} quant -i athal_index -l A \
         -1 ${fn}/${sample}_1.fastq.gz \
         -2 ${fn}/${sample}_2.fastq.gz \
         -p 8 -o quants/${sample}_quant
done 

An R version to quantify one sample is

quantify = function(sample_file, output, salmon, doit = TRUE) {
    sample = basename(sample_file)
    samples = file.path(sample_file, paste0(sample, "_", 1:2, ".fastq.gz"))
    output = file.path(output, paste0(sample, "_quant"))
    args = c(
        "quant", "-i", index, "-l A",
        "-1", samples[1], "-2", samples[2],
        "-p", parallel::detectCores(),
        "-o", output
    )
    if (doit)
        system2(salmon, args)
    else {
        txt = paste(salmon, paste(args, collapse=" "))
        message(txt)
        invisible(0)
    }
}

Here we quantify all samples (set doit = TRUE to actually perform the counting step.

index = file.path(TUTORIAL, "athal_index")
output  = file.path(TUTORIAL, "quants")
data_dir = file.path(TUTORIAL, "data")
sample_files = normalizePath(dir(data_dir, full = TRUE))

args = list(salmon = SALMON, output = output, doit = FALSE)
Map(quantify, sample_files, MoreArgs = args)

4 Analysis in R

We need to provide a mapping between the transcripts that the reads were aligned to and the genes that we will perform the analysis on. We import the GFF annotation file, and extract the transcripts and their parents (genes) as a tibble.

4.1 Transcript - gene map

library(rtracklayer)
library(tidyverse)
file = file.path(TUTORIAL, "Arabidopsis_thaliana.TAIR10.29.gff3")
gff = import(file)
tx2gene = tibble(
    txid = gff$transcript_id,
    geneid = as.character(gff$Parent)
) %>% na.omit()

4.2 Input

To import the count data, we use the tximport package to import the data. We find the relevant count files, and provide names to identify each file (these names are propagated by the import function to the column names of the output count matrix).

library(tximport)
library(SummarizedExperiment)
files = dir(
    file.path(TUTORIAL, "quants"),
    pattern = ".sf", recursive = TRUE, full = TRUE
)
names(files) = sub("_quant", "", basename(dirname(files)))
counts = tximport(files, type = "salmon", tx2gene = tx2gene)
## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
## summarizing abundance
## summarizing counts
## summarizing length
names(counts)
## [1] "abundance"           "counts"              "length"             
## [4] "countsFromAbundance"

The input is a list with three identically-dimensioned matrices and a fourth element describing how the other elements were determine. We put the three matrices into a SummarizedExperiment, to connect up with the DESeq2 work flow

library(SummarizedExperiment)
se = SummarizedExperiment(counts[-4])

4.3 Experimental design

Next steps: add the experimental design as colData() on the SummarizedExperiment.