Lists

Lists are a key data type in R, but beginners often avoid them because they are a bit harder to work with. They do, however, hold the key to many of R’s more advanced features.

The “apply” family functions

We now move on to the apply functions. We will look at three of the most popular: lapply(), vapply() and apply().

lapply()

lapply() ("list apply") takes each element of your list, calls a function on it, then returns a list of the results. It can be really useful for writing compact, understandable code.

It is often used where other languages would use a for loop, and it saves you from creating and filling a results container yourself. It's one reason some cite the use of for() loops in R code as bad R style, although there is still some scope for them.
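To make the comparison concrete, here is a minimal sketch of the same task written both ways. The list `scores` is made up for illustration and is not part of the tutorial's data.

```r
## a small, made-up list of numeric vectors
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

## for() version: create the container first, then fill it element by element
loop_result <- vector("list", length(scores))
names(loop_result) <- names(scores)
for (i in seq_along(scores)) {
  loop_result[[i]] <- sum(scores[[i]])
}

## lapply() version: one call, and the names are carried over for us
lapply_result <- lapply(scores, sum)

## both approaches give the same list
identical(loop_result, lapply_result)
## [1] TRUE
```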

Guide to using lapply()

Let's build up an lapply() call that will work on a list of data.frames.

  1. Choose what you want lapply() to iterate across. In this example we have a list of data.frames, a_list_of_df.

  2. Choose the function. This will operate on each element.

  3. Test the function on one element, i.e. the contents of a_list_of_df[[x]], NOT a_list_of_df[x]. Once it behaves as expected there, it should work on the entire list.

  4. Add any constant arguments to the end of the lapply() call, e.g. if you are using sum() you may want every call to ignore NAs: sum(x, na.rm = TRUE).

  5. Construct the full call: result <- lapply(a_list, myfunction, constant_arguments = TRUE)
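The steps above can be sketched on a tiny example. The names `toy_list` and `sum_x` are hypothetical, chosen only to show the difference between `[[ ]]` and `[ ]` and where a constant argument goes.

```r
## step 1: a made-up list of data.frames to iterate over
toy_list <- list(
  first  = data.frame(x = c(1, 2, NA)),
  second = data.frame(x = c(4, 5, 6))
)

## [[ ]] returns the element itself (a data.frame);
## [ ] returns a list of length one that contains it
class(toy_list[[1]])
class(toy_list[1])

## steps 2 and 3: write the function and test it on a single element
sum_x <- function(df, na.rm = FALSE) {
  sum(df$x, na.rm = na.rm)
}
sum_x(toy_list[[1]], na.rm = TRUE)

## steps 4 and 5: the constant argument goes after the function name
result <- lapply(toy_list, sum_x, na.rm = TRUE)
result
```

Testing on `toy_list[[1]]` first matters because that is exactly the object lapply() hands to the function on each iteration.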

Example

Here we would like to just check the number of rows for each data.frame, so the function chosen is nrow().

Testing the function chosen against our example, we see it works as expected:

nrow(a_list_of_df[[1]])
# 345

We can now apply this to all the data.frames in the list:

lapply(a_list_of_df, nrow)

Note that we pass the function itself, without parentheses (nrow, not nrow()).

If we want to pass extra arguments to the function, we supply them as named arguments to lapply().

Here we take advantage of the fact that a data.frame is a list of equal-length vectors:

## this applies mean to every column
lapply(web_data, mean, trim = 0.5)
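The same idea can be tried on a dataset that ships with R. This sketch uses the built-in `mtcars` rather than `web_data`, and the choice of columns and of `trim = 0.1` is arbitrary.

```r
## a data.frame is stored as a list of its columns
is.list(mtcars)

## trim = 0.1 is passed on to mean() for every selected column
trimmed <- lapply(mtcars[c("mpg", "hp")], mean, trim = 0.1)

## the result is a list with one element per column
names(trimmed)
```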

Using your own function

What if you want to supply your own function? You can predefine it before the lapply() and supply it:

my_func <- function(x){
  sum(x)
}

lapply(my_vector, my_func)

…or you can write it straight into the lapply() call itself, which is normally done when the function is small. A function defined this way is called an anonymous function, as it has no name:

lapply(my_vector, function(x){
  sum(x)
})

vapply()

vapply() is a stricter relative of lapply(). It works in the same way, but you also supply FUN.VALUE, a template of what each result should look like. If a result does not match the template's type and length, vapply() raises an error. Unlike lapply(), it returns a vector (or array) rather than a list.

This stops nasty surprises messing up your code later, and is recommended over lapply() when creating functions.

It also has the USE.NAMES argument, TRUE by default, which names the output after the input: existing names are kept, and a character vector without names is used as its own names.
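A short sketch of the difference, using a made-up list called `word_list`. The tryCatch() wrapper is only there so the example runs to the end; in real code you would normally let the error stop execution.

```r
## a made-up named list of character vectors
word_list <- list(first = "apple", second = c("kiwi", "fig"), third = character(0))

## lapply() returns a list, whatever the function gives back
lapply(word_list, length)

## vapply() returns a named integer vector, because FUN.VALUE
## promises exactly one integer per element
n_words <- vapply(word_list, length, FUN.VALUE = integer(1))
n_words

## if the function breaks that promise, vapply() stops with an error;
## here nchar() returns two values for the second element
broken <- tryCatch(
  vapply(word_list, nchar, FUN.VALUE = integer(1)),
  error = function(e) "vapply() raised an error"
)
broken
```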

apply()

apply() is similar but works over the rows or columns of a matrix or data.frame instead of list elements. MARGIN = 1 passes each row to your supplied function and MARGIN = 2 passes each column. The example below uses MARGIN = 2, so it sums every column of mtcars:

apply(mtcars, MARGIN = 2, sum)
##      mpg      cyl     disp       hp     drat       wt     qsec       vs 
##  642.900  198.000 7383.100 4694.000  115.090  102.952  571.160   14.000 
##       am     gear     carb 
##   13.000  118.000   90.000
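To see what MARGIN changes, here is a minimal sketch on a small made-up matrix `m`, where the row and column sums are easy to check by hand.

```r
## a small made-up matrix: 2 rows, 3 columns
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("row1", "row2"), c("a", "b", "c")))

## MARGIN = 1: the function receives one row at a time
row_sums <- apply(m, MARGIN = 1, sum)
row_sums

## MARGIN = 2: the function receives one column at a time
col_sums <- apply(m, MARGIN = 2, sum)
col_sums
```

Keep in mind that apply() converts a data.frame to a matrix first, so a data.frame with mixed column types is passed to your function as character values.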

Exercise

Write a function that iterates over the list of data.frames supplied below and outputs the sum of each column. The output should be a list of numeric vectors, each vector element being the sum of one column.

## don't touch this, it just creates your data
a_list_of_df <- lapply(1:10, function(x){
    data.frame(matrix(runif(1000), ncol = 10))
})

## perhaps you want to pass more arguments in, replace as you see fit
your_function <- function(a_data_frame){
  
}

your_result <- lapply(a_list_of_df, your_function)

Hint - use colSums

How would you change this to use vapply()?

Working with the result of lists

There are some handy functions for working with lists; the one I use most is Reduce(). It takes a function as its first argument and a list as its second, applies the function to the first two elements, then applies it again to that result and the next element, and so on.

As an example, you can take rbind(), which binds one data.frame to another, and apply it to a list of data.frames. Reduce() will rbind() the first two data.frames, then rbind() that result to the third, then that result to the fourth, and so on until all the data.frames are combined into one:

Reduce(rbind, list_of_df)
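A minimal sketch of this, using a made-up list `toy_dfs` in place of list_of_df. The second call uses accumulate = TRUE, which keeps every intermediate result and makes the step-by-step behaviour visible.

```r
## three small made-up data.frames with the same columns
toy_dfs <- list(
  data.frame(id = 1:2, value = c(10, 20)),
  data.frame(id = 3L,  value = 30),
  data.frame(id = 4:5, value = c(40, 50))
)

## equivalent to rbind(rbind(toy_dfs[[1]], toy_dfs[[2]]), toy_dfs[[3]])
combined <- Reduce(rbind, toy_dfs)
nrow(combined)
## [1] 5

## accumulate = TRUE returns the running results: 1, 1+2, 1+2+3, 1+2+3+4
running <- Reduce(`+`, list(1, 2, 3, 4), accumulate = TRUE)
running
```

For row-binding specifically, do.call(rbind, toy_dfs) combines the same rows in a single call; Reduce() is the more general tool when each step depends on the previous result.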

Example - create custom ANOVA

Let's make our own function that performs ANOVA on some data, using lapply() over a data.frame's columns. There are R functions that will do all these steps for you, but it's educational to build it up yourself.

First step is to create the data:

library(tidyverse)
anova_data <- web_data %>% 
  filter(deviceCategory == "desktop") %>% 
  select(date, sessions, channelGrouping) %>% 
  spread(channelGrouping, sessions) %>% 
  select(-date)

The first rows of anova_data look like this:

(Other)  Direct  Display  Email  Organic Search  Paid Search  Referral  Social  Video
     19     133      307     17             431          555       131      68     NA
    156    1003      196     43            1077         1060       226     158      3
     35    1470      235     29             696          489       179      66     90
     31    1794      321     70            1075          558       235      46    898
     27    1899      309     74            1004          478       218      47    461
     21    1972      204    299             974          494       246      47    418

Do the different groups differ significantly from each other? You can probably see for yourself in this dataset, but in cases such as A/B tests the answer is less obvious.

Recap - 1-way ANOVA

The steps for 1-way ANOVA were outlined earlier, and are repeated here:

  1. Calculate the mean for each column
  2. Calculate the mean for all columns combined
  3. Calculate sum of squares error
  4. Calculate sum of squares between groups
  5. Calculate mean square errors
  6. Determine the actual F-value
  7. Look up the critical F-value
  8. Compare actual and critical F-values

We create a function that will build this up from fundamentals, using lapply() to run each calculation on every column (each element of our list).

Calculate the mean per column

col_mean <- lapply(anova_data, mean, na.rm = TRUE)
## compare with colMeans(), which returns NA for Video because na.rm is not set
colMeans(anova_data)
##        (Other)         Direct        Display          Email Organic Search 
##       50.79812     1397.08920      416.37559      104.92958      733.07042 
##    Paid Search       Referral         Social          Video 
##      725.51643      353.30047      105.32394             NA

Calculate the sum of squares for each column

my_ss <- function(df_col){
  
  ## df_col is a column of the data.frame we pass into the lapply
  mean_col <- mean(df_col, na.rm = TRUE)
  
  ## squares error
  se <- (df_col - mean_col)^2
  
  ## sum of squares error
  sum(se, na.rm = TRUE)
}

ss_a <- lapply(anova_data, my_ss)
SS_e <- Reduce(sum, ss_a)
SS_e
## [1] 418843067

Calculate the total mean

totalMean <- sum(colSums(anova_data), na.rm = TRUE) / 
                   (ncol(anova_data) * nrow(anova_data))
totalMean
## [1] 431.8226

Calculate the sum of squares between groups

my_ssg <- function(df_col){
  
  mean_col <- mean(df_col, na.rm = TRUE)
  
  ((totalMean - mean_col)^2)*length(df_col)

}

ssg <- lapply(anova_data, my_ssg)
SS_b <- Reduce(sum, ssg)
SS_b
## [1] 313925827

Calculate mean square errors

df1 <- (nrow(anova_data)*ncol(anova_data)) - ncol(anova_data)
MSe <- SS_e / df1
MSe
## [1] 219519.4

Calculate mean square between groups

df2 <- ncol(anova_data) - 1
MSb <- SS_b / df2
MSb
## [1] 39240728

Determine the F-value

Fvalue <- MSb / MSe
Fvalue
## [1] 178.7574

Look up critical F-value

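## qf() is the quantile function: it returns the F-value below which 95% of the
## distribution lies. Its df1 is the numerator (between groups) degrees of
## freedom and its df2 the denominator (error), so our df2 and df1 swap places.
Fcrit <- qf(.95, df1 = df2, df2 = df1)
Fcrit
## approximately 1.94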

Compare the Actual vs critical F-Value

## If TRUE, we can reject null hypothesis that the means of the groups are equal
## The number of sessions does differ by channel
Fvalue > Fcrit
## [1] TRUE
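The same decision can also be phrased as a p-value. The sketch below uses hypothetical numbers (`f_stat`, `df_between` and `df_error` are made up, not taken from the web data) to show how the two F-distribution functions relate: qf() is the quantile function, which gives a critical value, and pf() is the cumulative distribution function, which gives a probability.

```r
## hypothetical values, chosen only for illustration
f_stat     <- 4.2
df_between <- 3    # number of groups - 1
df_error   <- 36   # number of observations - number of groups

## critical value: the F-value that cuts off the upper 5% of the distribution
f_crit <- qf(0.95, df1 = df_between, df2 = df_error)

## p-value: probability of an F-value at least as large as f_stat
## when the group means really are equal
p_value <- pf(f_stat, df1 = df_between, df2 = df_error, lower.tail = FALSE)

## the two ways of deciding agree
(f_stat > f_crit) == (p_value < 0.05)
## [1] TRUE
```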

How would we really do this?

anova_data2 <- web_data %>% 
  filter(deviceCategory == "desktop") %>% 
  select(sessions, channelGrouping)

anova_data2$channelGrouping <- as.factor(anova_data2$channelGrouping)

anova2 <- aov(sessions ~ channelGrouping , data = anova_data2, na.action = na.exclude)
summary(anova2)
##                   Df    Sum Sq  Mean Sq F value Pr(>F)    
## channelGrouping    8 309380887 38672611   176.1 <2e-16 ***
## Residuals       1907 418843067   219635                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

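The numbers are a little different because of how NAs are treated. The manual version counts every cell, including the missing ones, when computing the total mean and the degrees of freedom, and colSums() without na.rm = TRUE leaves the Video column out of the total; aov() simply drops the missing observations.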
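One way to line up a hand-rolled calculation with aov() is to count only the observed values. This is a minimal sketch on made-up data: `toy` and every other name below are hypothetical, and it is not a rewrite of the web data analysis above.

```r
## made-up wide data: one column per group, with one missing value
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, NA),
  c = c(7, 8, 9, 10)
)

## per-column counts and means that ignore NAs
n_obs     <- vapply(toy, function(col) sum(!is.na(col)), numeric(1))
col_means <- vapply(toy, mean, numeric(1), na.rm = TRUE)
grand     <- mean(unlist(toy), na.rm = TRUE)

## sums of squares, using only the observed values
ss_error <- sum(vapply(toy, function(col) {
  sum((col - mean(col, na.rm = TRUE))^2, na.rm = TRUE)
}, numeric(1)))
ss_between <- sum(n_obs * (col_means - grand)^2)

## degrees of freedom based on observed values, not on the number of cells
df_between <- ncol(toy) - 1
df_error   <- sum(n_obs) - ncol(toy)
f_manual   <- (ss_between / df_between) / (ss_error / df_error)

## the same test with aov(), on long-format data (columns: values, ind)
long  <- stack(toy)
f_aov <- summary(aov(values ~ ind, data = long))[[1]][["F value"]][1]

## TRUE when the manual F-value matches aov()
all.equal(f_manual, f_aov)
```

The key differences from the earlier steps are that each group is weighted by its number of non-missing values and that the error degrees of freedom are the number of observations minus the number of groups.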

Where to go from here - purrr

For more advanced usage, purrr is a tidyverse package which is very useful both for working with lists and for manipulating data.

Consult the cheatsheet of list-columns for some ideas, and this Jennifer Bryan presentation.