Lists are a key data type in R, but beginners often avoid them because they are a bit harder to work with. They do, however, hold the key to many of R’s more advanced features.
We now move on to the apply functions. We will look at the three most popular: lapply(), vapply() and apply().
lapply() (list apply) takes each element of your list, calls a function on it, then returns a list of the results. lapply() can be really useful for writing compact, understandable code.
It is often used to replace what would be for loops in other languages, and offers benefits such as allocating the result list for you rather than growing it step by step inside a loop. It’s one reason some cite the use of for() loops in R code as bad R style, although there is still some scope for them.
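As a rough illustration of that comparison, the sketch below computes the same result once with a for() loop and once with lapply(). The list `scores` and its values are invented for this example.

```r
# Toy data, invented for this example
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = 5)

# for() loop: create the output container first, then fill it element by element
loop_result <- vector("list", length(scores))
names(loop_result) <- names(scores)
for (i in seq_along(scores)) {
  loop_result[[i]] <- mean(scores[[i]])
}

# lapply(): the same computation in one call; names are carried over
apply_result <- lapply(scores, mean)

# Both approaches give the same named list
identical(loop_result, apply_result)
```

The loop is not wrong, but the lapply() version has fewer places for an indexing mistake to hide.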
Let’s build up an lapply() call that will work on a list of data.frames.
1. Choose what you want lapply() to iterate across. In this example we have a list of data.frames, a_list_of_df.
2. Choose the function. This will operate on each element.
3. Test the function on one element, i.e. the contents of a_list_of_df[[x]], NOT a_list_of_df[x]. Once you are happy it works, it should work on the entire list.
4. Add any constant arguments to the end of the lapply() call. For example, if you are using sum() you may want every sum() call to ignore NA values, e.g. sum(x, na.rm = TRUE).
5. Construct the full call: result <- lapply(a_list, myfunction, constant_arguments = TRUE)
Here we would like to just check the number of rows for each data.frame, so the function chosen is nrow().
Testing the function chosen against our example, we see it works as expected:
nrow(a_list_of_df[[1]])
# 345
We can now apply this to all the data.frames in the list:
lapply(a_list_of_df, nrow)
Note that we pass the function itself, without parentheses or arguments (nrow, not nrow()).
If we want to add arguments to the function, we can supply them as named arguments to lapply(), after the function name.
Here we take advantage of the fact that a data.frame is a list of equal-length vectors:
## this applies mean to every column
lapply(web_data, mean, trim = 0.5)
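The same steps can be tried on a small toy list. This is only a sketch: `a_list` and its values are invented for this example. It shows the difference between [[ ]] and [ ] when testing on one element, and a constant argument being passed through lapply().

```r
# Toy list, invented for this example
a_list <- list(first = c(1, NA, 3), second = c(4, 5, NA))

# [[ ]] extracts the element itself (here a numeric vector)...
class(a_list[[1]])
# ...while [ ] returns a list of length one, which sum() would reject
class(a_list[1])

# Test the function on one element, including the constant argument
sum(a_list[[1]], na.rm = TRUE)

# Then apply it to the whole list; na.rm = TRUE is passed on to every sum() call
totals <- lapply(a_list, sum, na.rm = TRUE)
totals
```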
What if you want to supply your own function? You can predefine it before the lapply() and supply it:
my_func <- function(x){
  sum(x)
}
lapply(my_vector, my_func)
…or you can define it straight inside the lapply() call itself. This is normally done when the function is small. A function defined this way is called an anonymous function, as it has no name:
lapply(my_vector, function(x){
  sum(x)
})
vapply() is an extension on top of lapply(). It works in the same way, but you also supply a FUN.VALUE, which is a template of what you expect each result to be. If a result does not match the type and length of that template, vapply() raises an error. Unlike lapply(), it returns a vector (or array) rather than a list.
This is good for stopping nasty surprises from messing up your code later, and it is recommended over lapply() when creating functions.
It also has the USE.NAMES argument (TRUE by default), which gives the output the same names as the input list.
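A minimal sketch of how this behaves, using a toy list invented for this example: the first call matches its template, the second deliberately does not, and the third shows the effect of USE.NAMES.

```r
# Toy list, invented for this example
a_list <- list(a = c(1, 2, 3), b = c(4, 5, 6))

# FUN.VALUE = numeric(1) says: every call must return exactly one number
col_means <- vapply(a_list, mean, FUN.VALUE = numeric(1))
col_means   # a named numeric vector, not a list

# range() returns two numbers per element, so the template is violated
# and vapply() stops with an error (caught here so the script keeps running)
bad <- tryCatch(
  vapply(a_list, range, FUN.VALUE = numeric(1)),
  error = function(e) "error: result did not match FUN.VALUE"
)
bad

# USE.NAMES = FALSE drops the names taken from the input
unnamed <- vapply(a_list, mean, FUN.VALUE = numeric(1), USE.NAMES = FALSE)
names(unnamed)
```

Changing the template to numeric(2) would make the range() call succeed and return a matrix with one column per list element.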
apply() is similar, but lets you work over the rows or columns of a data.frame or matrix instead of list elements. MARGIN = 1 passes each row to your supplied function, and MARGIN = 2 passes each column. Here each element is a column of mtcars, which is summed:
apply(mtcars, MARGIN = 2, sum)
## mpg cyl disp hp drat wt qsec vs
## 642.900 198.000 7383.100 4694.000 115.090 102.952 571.160 14.000
## am gear carb
## 13.000 118.000 90.000
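To make the MARGIN argument concrete, here is a small sketch on a matrix invented for this example. Note that apply() converts a data.frame to a matrix first (via as.matrix()), so it is best suited to all-numeric data.

```r
# Small matrix, invented for this example
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# MARGIN = 1: the function receives one row at a time
row_totals <- apply(m, MARGIN = 1, sum)

# MARGIN = 2: the function receives one column at a time
col_totals <- apply(m, MARGIN = 2, sum)

row_totals
col_totals
```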
Write a function that iterates over the list of data.frames supplied below and outputs the sum of each column. The output should be a list of numeric vectors, each vector element being the sum of one column.
## don't touch this, it just creates your data
a_list_of_df <- lapply(1:10, function(x){
  data.frame(matrix(runif(1000), ncol = 10))
})
## perhaps you want to pass more arguments in, replace as you see fit
your_function <- function(a_data_frame){
}
your_result <- lapply(a_list_of_df, your_function)
Hint: use colSums().
How would you change this to use vapply()?
There are some handy functions for working with lists; the one I use most is Reduce(). It takes a function as its first argument and a list as its second, and applies the function cumulatively: the result of each call is passed, together with the next element of the list, to the next call.
As an example, you can take rbind(), which binds one data.frame to another, and apply it to a list of data.frames. Reduce() will rbind() the first two data.frames, then take that result and rbind() it to the third, then take that result and rbind() it to the fourth, and so on, until you have combined all the data.frames into one:
Reduce(rbind, list_of_df)
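A small runnable sketch of this, with a toy `list_of_df` invented for this example. The accumulate = TRUE argument of Reduce() is used at the end to show the intermediate results of each step.

```r
# Toy data.frames, invented for this example
list_of_df <- list(
  data.frame(id = 1:2, value = c(10, 20)),
  data.frame(id = 3:4, value = c(30, 40)),
  data.frame(id = 5:6, value = c(50, 60))
)

# rbind() the first two, then bind that result to the third
combined <- Reduce(rbind, list_of_df)
nrow(combined)

# accumulate = TRUE keeps every intermediate result, here a running total
running_total <- Reduce(`+`, 1:4, accumulate = TRUE)
running_total
```

For the specific case of row-binding, do.call(rbind, list_of_df) is a common alternative that passes all the data.frames to a single rbind() call.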
Let’s make our own function that performs ANOVA on some data using lapply() over a data.frame’s columns. There are R functions that will do all these steps for you, but it’s educational to build it up yourself.
The first step is to create the data:
library(tidyverse)
anova_data <- web_data %>%
  filter(deviceCategory == "desktop") %>%
  select(date, sessions, channelGrouping) %>%
  spread(channelGrouping, sessions) %>%
  select(-date)
(Other) | Direct | Display | Email | Organic Search | Paid Search | Referral | Social | Video |
---|---|---|---|---|---|---|---|---|
19 | 133 | 307 | 17 | 431 | 555 | 131 | 68 | NA |
156 | 1003 | 196 | 43 | 1077 | 1060 | 226 | 158 | 3 |
35 | 1470 | 235 | 29 | 696 | 489 | 179 | 66 | 90 |
31 | 1794 | 321 | 70 | 1075 | 558 | 235 | 46 | 898 |
27 | 1899 | 309 | 74 | 1004 | 478 | 218 | 47 | 461 |
21 | 1972 | 204 | 299 | 974 | 494 | 246 | 47 | 418 |
Do the different groups differ significantly from each other? You can probably see for yourself in this dataset, but for cases such as A/B tests this is less obvious.
The steps for 1-way ANOVA were outlined earlier, and are worked through again here.
We build the calculation up from fundamentals, using lapply() to help us operate on each column (each element of our list):
col_mean <- lapply(anova_data, mean, na.rm = TRUE)
## compare with..
colMeans(anova_data)
## (Other) Direct Display Email Organic Search
## 50.79812 1397.08920 416.37559 104.92958 733.07042
## Paid Search Referral Social Video
## 725.51643 353.30047 105.32394 NA
my_ss <- function(df_col){
  ## df_col is a column of the data.frame we pass into the lapply
  mean_col <- mean(df_col, na.rm = TRUE)
  ## squared error
  se <- (df_col - mean_col)^2
  ## sum of squares error
  sum(se, na.rm = TRUE)
}
ss_a <- lapply(anova_data, my_ss)
SS_e <- Reduce(sum, ss_a)
SS_e
## [1] 418843067
totalMean <- sum(colSums(anova_data), na.rm = TRUE) /
(ncol(anova_data) * nrow(anova_data))
totalMean
## [1] 431.8226
my_ssg <- function(df_col){
  mean_col <- mean(df_col, na.rm = TRUE)
  ((totalMean - mean_col)^2)*length(df_col)
}
ssg <- lapply(anova_data, my_ssg)
SS_b <- Reduce(sum, ssg)
SS_b
## [1] 313925827
df1 <- (nrow(anova_data)*ncol(anova_data)) - ncol(anova_data)
MSe <- SS_e / df1
MSe
## [1] 219519.4
df2 <- ncol(anova_data) - 1
MSb <- SS_b / df2
MSb
## [1] 39240728
Fvalue <- MSb / MSe
Fvalue
## [1] 178.7574
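## qf() gives the critical value (quantile) of the F distribution, whereas pf() gives a probability;
## the numerator degrees of freedom are the between-group ones (df2 here)
Fcrit <- qf(.95, df1 = df2, df2 = df1)
Fcrit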
## If TRUE, we can reject null hypothesis that the means of the groups are equal
## The number of sessions does differ by channel
Fvalue > Fcrit
## [1] TRUE
anova_data2 <- web_data %>%
filter(deviceCategory == "desktop") %>%
select(sessions, channelGrouping)
anova_data2$channelGrouping <- as.factor(anova_data2$channelGrouping)
anova2 <- aov(sessions ~ channelGrouping , data = anova_data2, na.action = na.exclude)
summary(anova2)
## Df Sum Sq Mean Sq F value Pr(>F)
## channelGrouping 8 309380887 38672611 176.1 <2e-16 ***
## Residuals 1907 418843067 219635
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
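The numbers are a little different because of how NAs are treated: the manual calculation still counts NA cells in the group sizes and in the denominator of the overall mean, while aov() drops them.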
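To see the manual steps and aov() agree exactly, it helps to try them on data without missing values. The sketch below does this on a toy data.frame invented for this example, using vapply() for the per-column sums of squares and qf() for the critical value. The variable names are hypothetical and not part of the analysis above.

```r
# Toy data, invented for this example: three groups of four observations, no NAs
toy <- data.frame(
  a = c(4, 5, 6, 5),
  b = c(7, 8, 9, 8),
  c = c(1, 2, 3, 2)
)

n_per_group <- nrow(toy)
k <- ncol(toy)
grand_mean <- mean(unlist(toy))

# Within-group (error) sum of squares: one value per column, then summed
ss_within <- sum(vapply(toy, function(col) sum((col - mean(col))^2), numeric(1)))

# Between-group sum of squares
ss_between <- sum(vapply(
  toy,
  function(col) n_per_group * (mean(col) - grand_mean)^2,
  numeric(1)
))

df_between <- k - 1
df_within <- k * n_per_group - k

f_value <- (ss_between / df_between) / (ss_within / df_within)

# Critical value at the 5% level: numerator df first, then denominator df
f_crit <- qf(0.95, df1 = df_between, df2 = df_within)

# Cross-check against aov() on the same data in long format
long <- data.frame(
  values = unlist(toy, use.names = FALSE),
  ind = factor(rep(names(toy), each = nrow(toy)))
)
f_aov <- summary(aov(values ~ ind, data = long))[[1]][["F value"]][1]

all.equal(f_value, f_aov)
f_value > f_crit
```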
For more advanced usage, purrr is a tidyverse package which is very useful for both working with lists and manipulating data.
Consult the cheatsheet of list-columns for some ideas, and this Jennifer Bryan presentation.