Contents

1 Extended Exercise 1: BRFSS Survey Data

We will explore a subset of data collected by the CDC through its extensive Behavioral Risk Factor Surveillance System (BRFSS) telephone survey. Check out the link for more information. We’ll look at a subset of the data.

  1. Use file.choose() to find the path to the file ‘BRFSS-subset.csv’

    path <- file.choose()
  2. Input the data using read.csv(), assigning to a variable brfss

    brfss <- read.csv(path)
  3. Use command like class(), head(), dim(), colnames(), summary() to explore the data.

    • What variables have been measured?

    • Can you guess at the units used for, e.g., Weight and Height?

  4. Use the $ operator to extract the ‘Sex’ column, and summarize the number of males and females in the survey using table(). Do the same for ‘Year’.

    table(brfss$Sex)
    ## 
    ## Female   Male 
    ##  12039   7961
  5. The xtabs() function performs cross-tabulation using a formula-like interface; summarize the number of males and female in each year of the study.

    xtabs(~ Year + Sex, brfss)
    ##       Sex
    ## Year   Female Male
    ##   1990   5718 4282
    ##   2010   6321 3679
  6. Use aggregate() to summarize the mean weight of each group. What about the median weight of each group?

    aggregate(Weight ~ Year + Sex, brfss, mean)
    ##   Year    Sex   Weight
    ## 1 1990 Female 64.81838
    ## 2 2010 Female 72.95424
    ## 3 1990   Male 81.17999
    ## 4 2010   Male 88.84657
  7. Create a subset of the data consisting of only the 1990 observations. Perform a t-test comparing the weight of males and females (“‘Weight’ as a function of ‘Sex’”, Weight ~ Sex)

    brfss_1990 = brfss[brfss$Year == 1990,]
    t.test(Weight ~ Sex, brfss_1990)
    ## 
    ##  Welch Two Sample t-test
    ## 
    ## data:  Weight by Sex
    ## t = -58.734, df = 9214, p-value < 2.2e-16
    ## alternative hypothesis: true difference in means is not equal to 0
    ## 95 percent confidence interval:
    ##  -16.90767 -15.81554
    ## sample estimates:
    ## mean in group Female   mean in group Male 
    ##             64.81838             81.17999

    What about differences between weights of males (or females) in 1990 versus 2010? Check out the help page ?t.test.formula. Is there a way of performing a t-test on brfss without explicitly creating the object brfss_1990?

  8. Use boxplot() to plot the weights of the Male individuals. Can you transform weight, e.g., sqrt(Weight) ~ Year? Interpret the results. Do similar boxplots for the t-tests of the previous question.

    boxplot(Weight ~ Year, brfss, subset = (Sex == "Male"),
            main="Males")

  9. Use hist() to plot a histogram of weights of the 1990 Female individuals.

    hist(brfss_1990[brfss_1990$Sex == "Female", "Weight"],
         main="Females, 1990", xlab="Weight" )

2 Extended Exercise 2: ALL Phenotypic Data

This data comes from an (old) Acute Lymphoid Leukemia microarray data set.

Choose the file that contains ALL (acute lymphoblastic leukemia) patient information

path <- file.choose()    # look for ALL-phenoData.csv
stopifnot(file.exists(path))
pdata <- read.csv(path)

Check out the help page ?read.delim for input options. The exercises use ?read.csv; Can you guess why? Explore basic properties of the object you’ve created, for instance…

class(pdata)
## [1] "data.frame"
colnames(pdata)
##  [1] "X"              "cod"            "diagnosis"      "sex"            "age"           
##  [6] "BT"             "remission"      "CR"             "date.cr"        "t.4.11."       
## [11] "t.9.22."        "cyto.normal"    "citog"          "mol.biol"       "fusion.protein"
## [16] "mdr"            "kinet"          "ccr"            "relapse"        "transplant"    
## [21] "f.u"            "date.last.seen"
dim(pdata)
## [1] 128  22
head(pdata)
##       X  cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal        citog
## 1 01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE      t(9;22)
## 2 01010 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE       FALSE  simple alt.
## 3 03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA         <NA>
## 4 04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE      t(4;11)
## 5 04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE      del(6q)
## 6 04008 4008 7/30/1997   M  17 B1        CR CR 9/27/1997   FALSE   FALSE       FALSE complex alt.
##   mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 1  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2      NEG           <NA> POS dyploid FALSE    TRUE      FALSE               REL      8/28/2000
## 3  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 4 ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 5      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997
## 6      NEG           <NA> NEG hyperd. FALSE    TRUE      FALSE               REL     12/15/1997
summary(pdata$sex)
##    F    M NA's 
##   42   83    3
summary(pdata$cyto.normal)
##    Mode   FALSE    TRUE    NA's 
## logical      69      24      35

Remind yourselves about various ways to subset and access columns of a data.frame

pdata[1:5, 3:4]
##   diagnosis sex
## 1 5/21/1997   M
## 2 3/29/2000   M
## 3 6/24/1998   F
## 4 7/17/1997   M
## 5 7/22/1997   M
pdata[1:5, ]
##       X  cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal       citog
## 1 01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE     t(9;22)
## 2 01010 1010 3/29/2000   M  19 B2        CR CR 6/27/2000   FALSE   FALSE       FALSE simple alt.
## 3 03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA        <NA>
## 4 04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE     t(4;11)
## 5 04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE     del(6q)
##   mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 1  BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 2      NEG           <NA> POS dyploid FALSE    TRUE      FALSE               REL      8/28/2000
## 3  BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 4 ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 5      NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997
head(pdata[, 3:5])
##   diagnosis sex age
## 1 5/21/1997   M  53
## 2 3/29/2000   M  19
## 3 6/24/1998   F  52
## 4 7/17/1997   M  38
## 5 7/22/1997   M  57
## 6 7/30/1997   M  17
tail(pdata[, 3:5], 3)
##      diagnosis  sex age
## 126  3/27/1998    M  30
## 127 10/23/1998    M  29
## 128       <NA> <NA>  NA
head(pdata$age)
## [1] 53 19 52 38 57 17
head(pdata$sex)
## [1] M M F M M M
## Levels: F M
head(pdata[pdata$age > 21,])
##        X  cod diagnosis sex age BT remission CR   date.cr t.4.11. t.9.22. cyto.normal        citog
## 1  01005 1005 5/21/1997   M  53 B2        CR CR  8/6/1997   FALSE    TRUE       FALSE      t(9;22)
## 3  03002 3002 6/24/1998   F  52 B4        CR CR 8/17/1998      NA      NA          NA         <NA>
## 4  04006 4006 7/17/1997   M  38 B1        CR CR  9/8/1997    TRUE   FALSE       FALSE      t(4;11)
## 5  04007 4007 7/22/1997   M  57 B2        CR CR 9/17/1997   FALSE   FALSE       FALSE      del(6q)
## 10 08001 8001 1/15/1997   M  40 B2        CR CR 3/26/1997   FALSE   FALSE       FALSE     del(p15)
## 11 08011 8011 8/21/1998   M  33 B3        CR CR 10/8/1998   FALSE   FALSE       FALSE del(p15/p16)
##    mol.biol fusion.protein mdr   kinet   ccr relapse transplant               f.u date.last.seen
## 1   BCR/ABL           p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>
## 3   BCR/ABL           p190 NEG dyploid FALSE    TRUE      FALSE               REL     10/15/1999
## 4  ALL1/AF4           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      1/23/1998
## 5       NEG           <NA> NEG dyploid FALSE    TRUE      FALSE               REL      11/4/1997
## 10  BCR/ABL           p190 NEG    <NA> FALSE    TRUE      FALSE               REL      7/11/1997
## 11  BCR/ABL      p190/p210 NEG dyploid FALSE   FALSE       TRUE BMT / DEATH IN CR           <NA>

It seems from below that there are 17 females over 40 in the data set. However, some individuals have NA for the age and / or sex, and these NA values propagate through some computations. Use table() to summarize the number of females over 40, and the number of samples for which this classification cannot be determined. When R encounters an NA value in a subscript index, it introduces an NA into the result. Observe this (rows of NA values introduced into the result) when subsetting using [ versus using the subset() function.

idx <- pdata$sex == "F" & pdata$age > 40
table(idx, useNA="ifany")
## idx
## FALSE  TRUE  <NA> 
##   108    17     3
dim(pdata[idx,])           # WARNING: 'NA' rows introduced
## [1] 20 22
tail(pdata[idx,])
##          X   cod  diagnosis  sex age   BT remission                 CR    date.cr t.4.11. t.9.22.
## 83   49006 49006  8/12/1998    F  43   B2        CR                 CR 11/19/1998      NA      NA
## 84   57001 57001  1/29/1997    F  53   B3      <NA> DEATH IN INDUCTION       <NA>   FALSE   FALSE
## 85   62001 62001 11/11/1997    F  50   B4       REF                REF       <NA>   FALSE    TRUE
## NA.1  <NA>  <NA>       <NA> <NA>  NA <NA>      <NA>               <NA>       <NA>      NA      NA
## 98   02020  2020  3/23/2000    F  48   T2      <NA> DEATH IN INDUCTION       <NA>   FALSE   FALSE
## NA.2  <NA>  <NA>       <NA> <NA>  NA <NA>      <NA>               <NA>       <NA>      NA      NA
##      cyto.normal         citog mol.biol fusion.protein  mdr   kinet   ccr relapse transplant  f.u
## 83            NA          <NA>  BCR/ABL           p210  NEG dyploid FALSE    TRUE      FALSE  REL
## 84          TRUE        normal      NEG           <NA>  NEG hyperd.    NA      NA         NA <NA>
## 85         FALSE t(9;22)+other  BCR/ABL           <NA>  NEG hyperd.    NA      NA         NA <NA>
## NA.1          NA          <NA>     <NA>           <NA> <NA>    <NA>    NA      NA         NA <NA>
## 98         FALSE  complex alt.      NEG           <NA>  NEG dyploid    NA      NA         NA <NA>
## NA.2          NA          <NA>     <NA>           <NA> <NA>    <NA>    NA      NA         NA <NA>
##      date.last.seen
## 83        4/26/1999
## 84             <NA>
## 85             <NA>
## NA.1           <NA>
## 98             <NA>
## NA.2           <NA>
dim(subset(pdata, idx))    # BETTER: no NA rows
## [1] 17 22
tail(subset(pdata,idx))
##        X   cod  diagnosis sex age BT remission                 CR    date.cr t.4.11. t.9.22.
## 63 28032 28032  9/26/1998   F  52 B1        CR                 CR 10/30/1998    TRUE   FALSE
## 71 30001 30001  1/16/1997   F  54 B3      <NA> DEATH IN INDUCTION       <NA>   FALSE    TRUE
## 83 49006 49006  8/12/1998   F  43 B2        CR                 CR 11/19/1998      NA      NA
## 84 57001 57001  1/29/1997   F  53 B3      <NA> DEATH IN INDUCTION       <NA>   FALSE   FALSE
## 85 62001 62001 11/11/1997   F  50 B4       REF                REF       <NA>   FALSE    TRUE
## 98 02020  2020  3/23/2000   F  48 T2      <NA> DEATH IN INDUCTION       <NA>   FALSE   FALSE
##    cyto.normal         citog mol.biol fusion.protein mdr   kinet   ccr relapse transplant  f.u
## 63       FALSE       t(4;11) ALL1/AF4           <NA> NEG dyploid  TRUE   FALSE      FALSE  CCR
## 71       FALSE t(9;22)+other  BCR/ABL           p190 NEG hyperd.    NA      NA         NA <NA>
## 83          NA          <NA>  BCR/ABL           p210 NEG dyploid FALSE    TRUE      FALSE  REL
## 84        TRUE        normal      NEG           <NA> NEG hyperd.    NA      NA         NA <NA>
## 85       FALSE t(9;22)+other  BCR/ABL           <NA> NEG hyperd.    NA      NA         NA <NA>
## 98       FALSE  complex alt.      NEG           <NA> NEG dyploid    NA      NA         NA <NA>
##    date.last.seen
## 63      5/16/2002
## 71           <NA>
## 83      4/26/1999
## 84           <NA>
## 85           <NA>
## 98           <NA>
## work-around for `[`: set NA values to FALSE
idx[is.na(idx)] <- FALSE
dim(pdata[idx,])
## [1] 17 22

Use the mol.biol column to subset the data to contain just individuals with ‘BCR/ABL’ or ‘NEG’, e.g.,

bcrabl <- pdata[pdata$mol.biol %in% c("BCR/ABL", "NEG"),]

The mol.biol column is a factor, and retains all levels even after subsetting. It is sometimes convenient to retain factor levels, but in our case we use droplevels() to removed unused levels

bcrabl$mol.biol <- droplevels(bcrabl$mol.biol)

The BT column is a factor describing B- and T-cell subtypes

levels(bcrabl$BT)
##  [1] "B"  "B1" "B2" "B3" "B4" "T"  "T1" "T2" "T3" "T4"

How might one collapse B1, B2, … to a single type B, and likewise for T1, T2, …, so there are only two subtypes, B and T? One strategy is to replace two-letter level (e.g., B1) with the single-letter level (e.g., B). Do this using substring() to select the first letter of level, and update the previous levels with the new value using levels<-.

table(bcrabl$BT)
## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  4  9 35 22  9  5  1 15  9  2
levels(bcrabl$BT) <- substring(levels(bcrabl$BT), 1, 1)
table(bcrabl$BT)
## 
##  B  T 
## 79 32

Use xtabs() (cross-tabulation) to count the number of samples with B- and T-cell types in each of the BCR/ABL and NEG groups

xtabs(~ BT + mol.biol, bcrabl)
##    mol.biol
## BT  BCR/ABL NEG
##   B      37  42
##   T       0  32

Use aggregate() to calculate the average age of males and females in the BCR/ABL and NEG treatment groups.

aggregate(age ~ mol.biol + sex, bcrabl, mean)
##   mol.biol sex      age
## 1  BCR/ABL   F 39.93750
## 2      NEG   F 30.42105
## 3  BCR/ABL   M 40.50000
## 4      NEG   M 27.21154

Use t.test() to compare the age of individuals in the BCR/ABL versus NEG groups; visualize the results using boxplot(). In both cases, use the formula interface. Consult the help page ?t.test and re-do the test assuming that variance of ages in the two groups is identical. What parts of the test output change?

t.test(age ~ mol.biol, bcrabl)
## 
##  Welch Two Sample t-test
## 
## data:  age by mol.biol
## t = 4.8172, df = 68.529, p-value = 8.401e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.13507 17.22408
## sample estimates:
## mean in group BCR/ABL     mean in group NEG 
##              40.25000              28.07042
boxplot(age ~ mol.biol, bcrabl)