The dplyr (pronounced DEE ply er) package is one of those packages that newcomers to R consistently do not know about, and that then confuses them with some (one) aspect of it…but that experienced R users seldom write a script without using.
dplyr has just a handful of functions, all of which are geared towards doing basic manipulation of data sets in a fairly straightforward manner. We’re not going to go into all of the details of using these functions, as there are plenty of write-ups on that (like this one). But, we will at least provide a brief description of the functions and, at a high level, what they do:
filter() – used to subset the rows of a data set
select() – used to subset the columns of a data set
arrange() – used to sort the rows of a data set
distinct() – used to select only distinct/unique rows in a data set
mutate() – used to add new columns that are based on calculations on data in other columns (e.g., creating a new column with a conversion rate by dividing the orders column by the sessions column)
summarise() – used to perform summary calculations (mean, max, etc.) on a set of data (this is generally used in conjunction with the group_by() function)
n() – used to count the number of rows in a data set (or subset)
sample_n() – used to return a sample from the data set
A key aspect of all of these functions is that the first argument is always the data set. There is some magic as to how this works in conjunction with the pipe (%>%), which we’re going to tackle next.
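To make the list above a little more concrete, here is a minimal sketch that calls several of these functions directly, without the pipe. The toy data frame and its column names (channel, sessions, orders) are made up purely for illustration. Notice that the data set is the first argument every time.

```r
library(dplyr)

# Hypothetical toy data: names and values are invented for this sketch
toy <- data.frame(
  channel  = c("Direct", "Direct", "Display", "Display"),
  sessions = c(100, 200, 400, 50),
  orders   = c(5, 20, 10, 5),
  stringsAsFactors = FALSE
)

# filter(): keep only the rows with at least 100 sessions
busy <- filter(toy, sessions >= 100)

# select(): keep only two of the columns
slim <- select(toy, channel, sessions)

# arrange(): sort the rows by sessions, largest first
sorted <- arrange(toy, desc(sessions))

# mutate(): add a conversion rate column (orders divided by sessions)
with_rate <- mutate(toy, conversion_rate = orders / sessions)

# group_by() + summarise() + n(): one summary row per channel
by_channel <- summarise(group_by(toy, channel),
                        total_sessions = sum(sessions),
                        rows = n())
```

Each call returns a new data frame and leaves toy itself untouched, which is what makes these functions easy to chain together.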
First off, the pipe is not something that Hadley Wickham created, nor is it actually functionality that he implemented natively within the package. However, once you load dplyr, the pipe is available to you, because dplyr imports it from the magrittr package and re-exports it, and that is where the pipe originated in R.
The pipe is, simply, a combination of three characters: %>%. Formally, all the pipe does is provide “forward application” of an object to a function.
Huh? That doesn’t help!
Okay, let’s try again: the pipe lets you string together a series of functions – passing the result of one function directly into another function in the sequence that you want them applied and without creating temporary variables. There are two keys to this:
By default, the function to the right of the %>% assumes the value it is receiving is the first argument for the function, so the first argument is simply omitted in each “downstream” function.
If you’re using a function where you don’t want the result of the previous function to be the first argument – you want it to be some other argument – then simply write the function as you normally would, but put a . in the position where you want the upstream function’s result to go.
The second bullet above is confusing…so just file it away. You’ll know it when you need it, and it will make perfect sense. We’re not going to need it here.
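Here is a minimal sketch of the default behavior described above, using a made-up numeric vector rather than real data. The nested version and the piped version are meant to do exactly the same thing. The only difference is the order in which you read them.

```r
library(dplyr)  # attaching dplyr makes %>% available

# Hypothetical values, chosen only to keep the arithmetic simple
x <- c(1, 4, 9)

# Nested: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped: read left to right; each result becomes the first
# argument of the next function, so that argument is omitted
piped <- x %>% sqrt() %>% mean() %>% round(1)

# Both approaches give the same value
identical(nested, piped)
```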
Let’s consider our web analytics data (web_data) that has sessions by date, device category, and channel:
head(web_data)
##         Date  Device Channel Sessions
## 1 2016-01-01 desktop (Other)       19
## 2 2016-01-01  mobile (Other)      112
## 3 2016-01-01  tablet (Other)       24
## 4 2016-01-01 desktop  Direct      133
## 5 2016-01-01  mobile  Direct      345
## 6 2016-01-01  tablet  Direct      126
Now, suppose we want to find the average number of sessions for mobile Display traffic, counting only the days when that traffic was greater than 2,000 sessions.
First, the longest way to do this (but, really, okay when you’re starting out – it works!):
# Get the subset of data that is display traffic
display_traffic <- web_data[web_data$Channel == "Display",]
# Get the subset of *that* traffic that is mobile
mobile_display <- display_traffic[display_traffic$Device == "mobile",]
# Get the subset of *that* traffic that is greater than 2 000 sessions
final_data <- mobile_display[mobile_display$Sessions > 2000,]
# Calculate the average sessions
avg <- mean(final_data$Sessions)
# Round to the nearest whole number and print the results
round(avg,0)
## [1] 3273
Option 1 was an extreme. So, we may realize that we can combine a bunch of the subsetting operations into a single command:
# Subset the data in one fell swoop
final_data <- web_data[(web_data$Channel == "Display" & web_data$Device == "mobile"
& web_data$Sessions > 2000),]
# Calculate, round, and print all at once
round(mean(final_data$Sessions),0)
## [1] 3273
We could get this all down to a single line (although we need line breaks to control the wrapping):
round(mean(web_data[(web_data$Channel == "Display" & web_data$Device == "mobile"
& web_data$Sessions > 2000),]$Sessions),0)
## [1] 3273
Does this start to remind you of something in Excel? The dreaded heavy nesting of functions!
With the pipe and some dplyr functions, we can perform these operations in a way that is both efficient and easy to read:
# We'll start to need the dplyr library
library(dplyr)
web_data %>% filter(Channel == "Display",
                    Device == "mobile",
                    Sessions > 2000) %>%
  summarise(mean(Sessions)) %>%
  round(0)
##   mean(Sessions)
## 1           3273
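As soon as you understand how the pipe works, the code above starts to be super-readable. You could read it like this: take web_data, then filter it down to the rows where Channel is "Display", Device is "mobile", and Sessions is greater than 2000, then summarise those rows by calculating the mean of Sessions, then round the result to 0 decimal places.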
Can you see how the %>% is really just a way to say, “with the result of what you just did…now do this?”
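If you want to convince yourself that the nested version and the piped version really are the same calculation, here is a sketch you can run without the real data. The toy_web_data data frame is a hypothetical stand-in: it borrows the column names from web_data, but the values are made up. Naming the summary column avg_sessions and extracting it with pull() are just one way to get a plain number back, not the only way.

```r
library(dplyr)

# Hypothetical stand-in for web_data: same column names, invented values
toy_web_data <- data.frame(
  Date     = as.Date(c("2016-01-01", "2016-01-01", "2016-01-02",
                       "2016-01-02", "2016-01-03")),
  Device   = c("mobile", "desktop", "mobile", "mobile", "mobile"),
  Channel  = c("Display", "Display", "Display", "Direct", "Display"),
  Sessions = c(2500, 3000, 1500, 4000, 3500),
  stringsAsFactors = FALSE
)

# Base R: everything nested into one expression
base_result <- round(mean(toy_web_data[toy_web_data$Channel == "Display" &
                                         toy_web_data$Device == "mobile" &
                                         toy_web_data$Sessions > 2000, ]$Sessions), 0)

# dplyr: the same steps, read from top to bottom
pipe_result <- toy_web_data %>%
  filter(Channel == "Display",
         Device == "mobile",
         Sessions > 2000) %>%                      # keep the matching rows
  summarise(avg_sessions = mean(Sessions)) %>%     # one-row data frame
  pull(avg_sessions) %>%                           # extract the column as a vector
  round(0)

# The two approaches agree
base_result == pipe_result
```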
Remember: Each of the functions above actually has a first argument that is “the data the function should act on.” For instance, if I wanted to round pi to 4 digits, I would use the full round() function:
round(pi,4)
## [1] 3.1416
But, in our example above, we simply used round(0). That’s because the first argument was omitted – it was just assumed to be the value resulting from the preceding function. We could have written the piped function this way, too:
# We'll start to need the dplyr library
library(dplyr)
web_data %>% filter(., Channel == "Display",
                    Device == "mobile",
                    Sessions > 2000) %>%
  summarise(., mean(Sessions)) %>%
  round(., 0)
##   mean(Sessions)
## 1           3273
It returns the same result, although you would never really write it this way. But, when using pipes with functions that are not dplyr functions, sometimes you don’t want the result of a function to be passed into the first argument. That is when the . comes in handy!
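As a sketch of when that happens, consider lm() from base R: its first argument is a formula, and the data comes second. The toy data frame below is hypothetical (made-up column names and values), and lm() is just one convenient example of a function that does not take the data first.

```r
library(dplyr)

# Hypothetical toy data, invented for this sketch
toy <- data.frame(day = 1:5, sessions = c(10, 12, 15, 19, 24))

# lm() expects the formula first, so the . marks where the
# piped-in data frame should go instead
piped_fit <- toy %>%
  filter(day > 1) %>%
  lm(sessions ~ day, data = .)

# The equivalent call without the pipe
direct_fit <- lm(sessions ~ day, data = filter(toy, day > 1))

# Both fits have the same coefficients
all.equal(coef(piped_fit), coef(direct_fit))
```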