Learn to love lists

Lists

Lists are a key data type in R, but beginners often avoid them because they are a bit harder to work with. They do, however, hold the key to many of R’s more advanced features.

Navigating lists

[ ] vs [[ ]]

A key point is knowing the difference between [ ] and [[ ]] when subsetting lists.

Using [ ] always returns another list. Using [[ ]] returns the contents of a single list element.

a_list <- list(head(mtcars), letters, list(a=1, b=2))
str(a_list)

## List of 3
##  $ :'data.frame':    6 obs. of  11 variables:
##   ..$ mpg : num [1:6] 21 21 22.8 21.4 18.7 18.1
##   ..$ cyl : num [1:6] 6 6 4 6 8 6
##   ..$ disp: num [1:6] 160 160 108 258 360 225
##   ..$ hp  : num [1:6] 110 110 93 110 175 105
##   ..$ drat: num [1:6] 3.9 3.9 3.85 3.08 3.15 2.76
##   ..$ wt  : num [1:6] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:6] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:6] 0 0 1 1 0 1
##   ..$ am  : num [1:6] 1 1 1 0 0 0
##   ..$ gear: num [1:6] 4 4 4 3 3 3
##   ..$ carb: num [1:6] 4 4 1 1 2 1
##  $ : chr [1:26] "a" "b" "c" "d" ...
##  $ :List of 2
##   ..$ a: num 1
##   ..$ b: num 2

a_list[1]

## [[1]]
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

class(a_list[1])

## [1] "list"

a_list[[1]]

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

class(a_list[[1]])

## [1] "data.frame"

This helps you make sure you are working on what you expect when dealing with list elements.

Named lists

Lists can also have names, and it can be useful to add them so you can call specific elements by name using the list$name convention, which is equivalent to list[["name"]].

a_list <- list(df = head(mtcars), abc = letters, b_list = list(a=1, b=2))
str(a_list)

## List of 3
##  $ df    :'data.frame':  6 obs. of  11 variables:
##   ..$ mpg : num [1:6] 21 21 22.8 21.4 18.7 18.1
##   ..$ cyl : num [1:6] 6 6 4 6 8 6
##   ..$ disp: num [1:6] 160 160 108 258 360 225
##   ..$ hp  : num [1:6] 110 110 93 110 175 105
##   ..$ drat: num [1:6] 3.9 3.9 3.85 3.08 3.15 2.76
##   ..$ wt  : num [1:6] 2.62 2.88 2.32 3.21 3.44 ...
##   ..$ qsec: num [1:6] 16.5 17 18.6 19.4 17 ...
##   ..$ vs  : num [1:6] 0 0 1 1 0 1
##   ..$ am  : num [1:6] 1 1 1 0 0 0
##   ..$ gear: num [1:6] 4 4 4 3 3 3
##   ..$ carb: num [1:6] 4 4 1 1 2 1
##  $ abc   : chr [1:26] "a" "b" "c" "d" ...
##  $ b_list:List of 2
##   ..$ a: num 1
##   ..$ b: num 2

a_list["df"]

## $df
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

class(a_list["df"])

## [1] "list"

a_list[["df"]]

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

class(a_list[["df"]])

## [1] "data.frame"
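As a quick check of the equivalence between list$name and list[["name"]], this sketch compares the different ways of reaching a named element. It rebuilds the same a_list as above and uses only base R; the variable names same_contents, wrapped and nested_mpg are just illustrative.

```r
a_list <- list(df = head(mtcars), abc = letters, b_list = list(a = 1, b = 2))

## $ and [[ ]] both return the contents of the element
same_contents <- identical(a_list$df, a_list[["df"]])
same_contents

## [ ] returns a list of length one that wraps the element
wrapped <- a_list["df"]
is.list(wrapped) && length(wrapped) == 1

## [[ ]] and $ can be chained to reach inside an element
nested_mpg <- a_list[["df"]][["mpg"]]
identical(nested_mpg, a_list$df$mpg)
```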

Exercise

What is the code to extract the other elements of a_list ?

a_list <- list(df = head(mtcars), abc = letters, b_list = list(a=1, b=2))


a_list_letters <- ## your code here

a_list_b <- ## your code here

## These should be TRUE
class(a_list_letters) == "character"
class(a_list_b) == "list"

Viewing lists

str(x, max.level = , list.len = )

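str() is your friend as usual, but printing large lists can turn into console spam. You can limit this with the max.level and list.len arguments of str(), which give you more control over what is printed: max.level limits how many levels of nesting are shown, and list.len limits how many elements are shown per level.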
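A minimal sketch of both arguments, using a made-up nested list (the name nested and its contents are only for illustration):

```r
nested <- list(
  df     = head(mtcars),
  abc    = letters,
  b_list = list(a = 1, b = list(c = 2, d = list(e = 3)))
)

## the full structure, every level
str(nested)

## only the top level of the list
str(nested, max.level = 1)

## at most two elements per level
str(nested, list.len = 2)
```

Starting with max.level = 1 and increasing it step by step is often a comfortable way to explore an unfamiliar list.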

library(listviewer)

Another handy package for working with lists is listviewer. Its jsonedit() function, called as listviewer::jsonedit(your_list), gives you a nested structure to look through the list.

library(listviewer)
jsonedit(web_data)

The “apply” family functions

We move now to apply functions. We will look at the three most popular: lapply(), vapply() and apply()

lapply()

List apply will take each element of your list, call a function upon it, then return a list of the results. lapply() can be really useful to write compact, understandable code.

It is often used in place of what would be for loops in other languages, and offers benefits such as better memory management and vectorisation. It’s one reason some cite the use of for() loops in R code as bad R style, although there is still some scope for them.

Guide to using lapply()

Let’s build up an lapply() call that will work on a list of data.frames.

Choose what you want lapply() to iterate across. In this example we have a list of data.frames, a_list_of_df.
Choose the function. This will operate on each element.
Test the function on one element, i.e. the contents of a_list_of_df[[x]], NOT a_list_of_df[x]. Once you are happy with it, you know it should work on the entire list.
Add any constant arguments to the end of the lapply() call, e.g. if you are using sum() you may want every call to ignore NA values, as in sum(x, na.rm = TRUE).
Construct the full call: result <- lapply(a_list_of_df, my_function, constant_argument = TRUE)
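The steps above can be sketched with the sum() case mentioned in the list. The list a_list_of_vectors and its values are made up for illustration.

```r
## 1. the list to iterate over (made-up numbers, some of them missing)
a_list_of_vectors <- list(
  first  = c(1, 2, NA),
  second = c(10, NA, 30),
  third  = c(100, 200, 300)
)

## 2. and 3. choose the function and test it on the contents of one element
sum(a_list_of_vectors[[1]], na.rm = TRUE)

## 4. and 5. constant arguments go after the function in the lapply() call
result <- lapply(a_list_of_vectors, sum, na.rm = TRUE)
str(result)
```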

Example

Here we would like to just check the number of rows for each data.frame, so the function chosen is nrow().

Testing the function chosen against our example, we see it works as expected:

nrow(a_list_of_df[[1]])
# 345

We can now apply this to all the data.frames in the list:

lapply(a_list_of_df, nrow)

Note that we pass the function itself, without parentheses (nrow, not nrow()).

If we want to pass extra arguments to the function, we can supply them as named arguments to lapply().

Here we take advantage of the fact that a data.frame is a list of equal-length vectors:

## this applies mean to every column
lapply(web_data, mean, trim = 0.5)
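Since web_data is not defined in this section, here is a small runnable sketch of the same idea on the built-in mtcars dataset. The choice of columns and of trim = 0.1 is arbitrary.

```r
## a data.frame is a list whose elements are its columns
is.list(mtcars)
length(mtcars) == ncol(mtcars)

## so lapply() visits each column; trim is passed on to mean()
trimmed_means <- lapply(mtcars[c("mpg", "hp", "wt")], mean, trim = 0.1)
str(trimmed_means)
```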

Using your own function

What if you want to supply your own function? You can predefine it before the lapply() and supply it:

my_func <- function(x){
  sum(x)
}

lapply(my_vector, my_func)

…or you can add it straight into the lapply() call itself, which is normally done when the function is small. A function defined this way is called an anonymous function, as it has no name:

lapply(my_vector, function(x){
  sum(x)
})

vapply()

vapply() works in the same way as lapply(), but you also supply a FUN.VALUE, which is a template of what you expect each result to be. If a result does not match the type and length of that template, it will raise an error. Unlike lapply(), it returns a vector (or matrix) rather than a list.

This is good to stop any nasty surprises messing up your code later, and is recommended over lapply() when creating functions.

It also has the USE.NAMES argument (TRUE by default), which is useful for having the output carry the same names as the input list.
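A minimal sketch of FUN.VALUE in action, assuming a small made-up list of two built-in datasets (two_dfs is a hypothetical name):

```r
two_dfs <- list(cars = head(mtcars), flowers = iris)

## each result must be a single integer
row_counts <- vapply(two_dfs, nrow, FUN.VALUE = integer(1))
row_counts

## a template that does not match is an error, not a silent surprise
wrong_template <- tryCatch(
  vapply(two_dfs, nrow, FUN.VALUE = character(1)),
  error = function(e) "vapply() raised an error"
)
wrong_template
```

The result is a named integer vector rather than a list, and the names come from the input list.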

apply()

apply() is similar but works over the rows or columns of a matrix or data.frame instead of list elements. MARGIN = 1 passes each row to your supplied function and MARGIN = 2 passes each column. Here each element is a column of mtcars:

apply(mtcars, MARGIN = 2, sum)

##      mpg      cyl     disp       hp     drat       wt     qsec       vs 
##  642.900  198.000 7383.100 4694.000  115.090  102.952  571.160   14.000 
##       am     gear     carb 
##   13.000  118.000   90.000
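To make the role of MARGIN concrete, here is a small sketch on a made-up matrix (m and its values are only for illustration). When given a data.frame, apply() first converts it to a matrix, so the same rules hold.

```r
## 2 rows and 3 columns, filled column by column
m <- matrix(1:6, nrow = 2, dimnames = list(c("r1", "r2"), c("a", "b", "c")))
m

## MARGIN = 1: the function receives each row
row_totals <- apply(m, MARGIN = 1, sum)
row_totals

## MARGIN = 2: the function receives each column
col_totals <- apply(m, MARGIN = 2, sum)
col_totals
```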

Exercise

Write a function that iterates over the list of data.frames supplied below and outputs the sum of each column. The output should be a list of numeric vectors, with each vector element being the sum of one column.

## don't touch this, it just creates your data
a_list_of_df <- lapply(1:10, function(x){
    data.frame(matrix(runif(1000), ncol = 10))
})

## perhaps you want to pass more arguments in, replace as you see fit
your_function <- function(a_data_frame){
  
}

your_result <- lapply(a_list_of_df, your_function)

Hint - use colSums

How would you change this to use vapply ?

Working with the result of lists

There are some handy functions for working with lists. I mostly use Reduce(). It takes a function as its first argument and a list as its second, and applies the function in turn: the result of each call is passed, together with the next element, into the next call.

As an example, you can take rbind(), which binds one data.frame to another, and apply it to a list of data.frames. Reduce() will rbind() the first two data.frames, then take that result and rbind() it to the third, then take that result and rbind() it to the fourth, and so on until all the data.frames are combined into one:

Reduce(rbind, list_of_df)
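A runnable sketch of the same call, assuming a made-up list of three small data.frames with identical columns:

```r
list_of_df <- list(
  data.frame(id = 1:2, value = c("a", "b")),
  data.frame(id = 3:4, value = c("c", "d")),
  data.frame(id = 5:6, value = c("e", "f"))
)

## same as rbind(rbind(list_of_df[[1]], list_of_df[[2]]), list_of_df[[3]])
combined <- Reduce(rbind, list_of_df)
combined

## accumulate = TRUE keeps every intermediate result
running_total <- Reduce(`+`, list(1, 2, 3, 4), accumulate = TRUE)
running_total
```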

Example - create custom ANOVA

Let’s make our own function that performs ANOVA on some data using lapply() over a data.frame’s columns. There are R functions that will do all these steps for you, but it’s educational to build it up yourself.

First step is to create the data:

library(tidyverse)
anova_data <- web_data %>% 
  filter(deviceCategory == "desktop") %>% 
  select(date, sessions, channelGrouping) %>% 
  spread(channelGrouping, sessions) %>% 
  select(-date)

(Other)	Direct	Display	Email	Organic Search	Paid Search	Referral	Social	Video
19	133	307	17	431	555	131	68	NA
156	1003	196	43	1077	1060	226	158	3
35	1470	235	29	696	489	179	66	90
31	1794	321	70	1075	558	235	46	898
27	1899	309	74	1004	478	218	47	461
21	1972	204	299	974	494	246	47	418

Do the different groups differ significantly from each other? You can probably see for yourself in this dataset, but for cases such as AB tests this is less obvious.

Recap - 1-way ANOVA

The steps for 1-way ANOVA were outlined earlier, and are repeated here:

Calculate the mean for each column
Calculate the mean for all columns combined
Calculate sum of squares error
Calculate sum of squares between groups
Calculate mean square errors
Determine the actual F-value
Look up the critical F-value
Compare actual and critical F-values

We build this up from fundamentals, using lapply() to perform each calculation on every column (each element of our list).

Calculate the mean per column

col_mean <- lapply(anova_data, mean, na.rm = TRUE)
## compare with..
colMeans(anova_data)

##        (Other)         Direct        Display          Email Organic Search 
##       50.79812     1397.08920      416.37559      104.92958      733.07042 
##    Paid Search       Referral         Social          Video 
##      725.51643      353.30047      105.32394             NA

Calculate the sum of squares for each column

my_ss <- function(df_col){
  
  ## df_col is a column of the data.frame we pass into the lapply
  mean_col <- mean(df_col, na.rm = TRUE)
  
  ## squares error
  se <- (df_col - mean_col)^2
  
  ## sum of squares error
  sum(se, na.rm = TRUE)
}

ss_a <- lapply(anova_data, my_ss)
SS_e <- Reduce(sum, ss_a)
SS_e

## [1] 418843067

Calculate the total mean

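## note: colSums() is called without na.rm, so a column containing NAs (Video here)
## adds nothing to the numerator but is still counted in the denominator
totalMean <- sum(colSums(anova_data), na.rm = TRUE) / 
                   (ncol(anova_data) * nrow(anova_data))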
totalMean

## [1] 431.8226

Calculate the sum of squares between groups

my_ssg <- function(df_col){
  
  mean_col <- mean(df_col, na.rm = TRUE)
  
  ((totalMean - mean_col)^2)*length(df_col)

}

ssg <- lapply(anova_data, my_ssg)
SS_b <- Reduce(sum, ssg)
SS_b

## [1] 313925827

Calculate mean square errors

df1 <- (nrow(anova_data)*ncol(anova_data)) - ncol(anova_data)
MSe <- SS_e / df1
MSe

## [1] 219519.4

Calculate mean square between groups

df2 <- ncol(anova_data) - 1
MSb <- SS_b / df2
MSb

## [1] 39240728

Determine the F-value

Fvalue <- MSb / MSe
Fvalue

## [1] 178.7574

Look up critical F-value

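## qf() is the quantile function of the F distribution; pf() would return a probability instead
## the between-groups degrees of freedom (df2 above) come first, the error degrees of freedom (df1 above) second
Fcrit <- qf(.95, df1 = df2, df2 = df1)
Fcrit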

Compare the Actual vs critical F-Value

## If TRUE, we can reject null hypothesis that the means of the groups are equal
## The number of sessions does differ by channel
Fvalue > Fcrit

## [1] TRUE

How would we really do this?

anova_data2 <- web_data %>% 
  filter(deviceCategory == "desktop") %>% 
  select(sessions, channelGrouping)

anova_data2$channelGrouping <- as.factor(anova_data2$channelGrouping)

anova2 <- aov(sessions ~ channelGrouping , data = anova_data2, na.action = na.exclude)
summary(anova2)

##                   Df    Sum Sq  Mean Sq F value Pr(>F)    
## channelGrouping    8 309380887 38672611   176.1 <2e-16 ***
## Residuals       1907 418843067   219635                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The numbers are a little different due to differences in how NAs are treated.
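As a sanity check on the approach, the same steps can be run on data without any NAs, where the hand-built F-value should agree with aov(). This is a sketch using the built-in PlantGrowth dataset (30 plant weights in 3 groups of 10) rather than web_data; the variable names are illustrative.

```r
## one column per group, like anova_data above
wide <- unstack(PlantGrowth, weight ~ group)

## sum of squares error: squared deviations from each column's own mean
ss_e <- sum(unlist(lapply(wide, function(col) sum((col - mean(col))^2))))

## sum of squares between groups: column means against the overall mean
total_mean <- mean(unlist(wide))
ss_b <- sum(unlist(lapply(wide, function(col) {
  length(col) * (mean(col) - total_mean)^2
})))

## degrees of freedom, mean squares and F-value
df_between <- ncol(wide) - 1
df_error   <- nrow(wide) * ncol(wide) - ncol(wide)
f_manual   <- (ss_b / df_between) / (ss_e / df_error)

## the same test with aov()
f_aov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]
all.equal(f_manual, f_aov)

## critical value at the 5% level, and the comparison
f_crit <- qf(0.95, df1 = df_between, df2 = df_error)
f_manual > f_crit
```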

Where to go from here - purrr

For more advanced usage, purrr is a tidyverse package that is very useful for both working with lists and manipulating data.

Consult the cheatsheet of list-columns for some ideas, and this Jennifer Bryan presentation.