Exercise: Tidy and dplyr

This exercise requires having a web_data data frame. You can either load up some sample data by completing the I/O Exercise (which is what is shown in the step-by-step instructions below), or, if you have access to a Google Analytics account, you can use your own data by following the steps on the Google Analytics API page. Or, if you have access to an Adobe Analytics account, then you can use your own data by following the Generating web_data steps on the Adobe Analytics API page.

Once you have a web_data data frame to work with, the command head(web_data) should return a table that, at least structurally, looks something like this:

date	channelGrouping	deviceCategory	sessions	pageviews	entrances	bounces
2016-01-01	(Other)	desktop	19	23	19	15
2016-01-01	(Other)	mobile	112	162	112	82
2016-01-01	(Other)	tablet	24	41	24	19

Our web_data data is already in a tidy format. When data is not in a tidy format when you get it, the tidyr package has functions to help make it that way. Some day, perhaps we’ll have an exercise to do this!

For this exercise, complete the following tasks with web_data:

Calculate the average pageviews per day by channel using the summarise() function in the dplyr package
Add two columns to web_data using the mutate() function: pvs_per_session and bounce_rate
Calculate the average bounce rate by deviceCategory (be careful with the order of your calculations!).

Start by loading the dplyr package:

library(dplyr)

Now, dive into the exercises!

1. Average Pageviews

To get the averages for each combination of deviceCategory and channelGrouping we simply use summarise() with group_by:

group_by(web_data, deviceCategory, channelGrouping) %>%
  summarise(mean(pageviews))

## # A tibble: 27 x 3
## # Groups:   deviceCategory [?]
##    deviceCategory channelGrouping `mean(pageviews)`
##             <chr>           <chr>             <dbl>
##  1        desktop         (Other)          108.4413
##  2        desktop          Direct         2220.2394
##  3        desktop         Display          576.0329
##  4        desktop           Email          246.8169
##  5        desktop  Organic Search         2902.8826
##  6        desktop     Paid Search         1927.7136
##  7        desktop        Referral         1401.0094
##  8        desktop          Social          368.0798
##  9        desktop           Video          584.2358
## 10         mobile         (Other)          193.7981
## # ... with 17 more rows

This function uses the group_by function to take the data set (web_data) and specify that it should be grouped together by two fields: deviceCategory and channelGrouping. That, in and of itself, isn’t enough, as this simply identifies the grouped rows. It doesn’t actually collapse them in any way (run the code above without the summarise() function to see).

So, in addition to grouping, we need to specify how we want to handle the multiple rows that we’ll collapse within the groups. For that, we use the summarise() function, tell it that we want to include pageviews as a metric, and tell it that we want to collapse the grouped rows by averaging (mean()) pageviews.

Note #1: The above uses the ‘pipe’ notation. The exact same result could be achieved by nesting the group_by() function inside the summarise() function (below). In simple cases, either way works, but the pipe notation generally makes for more readable code.

summarise(group_by(web_data, deviceCategory, channelGrouping),mean(pageviews))

Note #2: We could include multiple metrics in the summarise() function. For instance, if we wanted to get the total visits and the average pageviews, we could alter the code slightly:

group_by(web_data, deviceCategory, channelGrouping) %>%
  summarise(mean(pageviews),sum(sessions))

## # A tibble: 27 x 4
## # Groups:   deviceCategory [?]
##    deviceCategory channelGrouping `mean(pageviews)` `sum(sessions)`
##             <chr>           <chr>             <dbl>           <int>
##  1        desktop         (Other)          108.4413           10820
##  2        desktop          Direct         2220.2394          297580
##  3        desktop         Display          576.0329           88688
##  4        desktop           Email          246.8169           22350
##  5        desktop  Organic Search         2902.8826          156144
##  6        desktop     Paid Search         1927.7136          154535
##  7        desktop        Referral         1401.0094           75253
##  8        desktop          Social          368.0798           22434
##  9        desktop           Video          584.2358           92885
## 10         mobile         (Other)          193.7981           29115
## # ... with 17 more rows

Note #3: Both summarise() and summarize() do the same thing. On this site, we skew towards the Queen’s English either because one of the key contributors is British or because Hadley Wickham is from New Zealand. Or both. Either will work!

2. Add Two Columns

In this case, we’re actually going to modify the web_data object by adding a couple of calculated columns. This is, really, just like working with an Excel Table and adding columns that are based on existing columns in the table. The mutate() function takes a data set and then adds new columns as specified in the remaining parameters:

web_data <- mutate(web_data, pvs_per_session = pageviews / sessions, 
                  bounce_rate = bounces / entrances)

# Display the results: truncated here so that it fits nicely on the screen
head(web_data[c("date","channelGrouping","deviceCategory",
               "pvs_per_session","bounce_rate")])

##         date channelGrouping deviceCategory pvs_per_session bounce_rate
## 1 2016-01-01         (Other)        desktop        1.210526   0.7894737
## 2 2016-01-01         (Other)         mobile        1.446429   0.7321429
## 3 2016-01-01         (Other)         tablet        1.708333   0.7916667
## 4 2016-01-01          Direct        desktop        3.180451   0.4586466
## 5 2016-01-01          Direct         mobile        2.544928   0.5000000
## 6 2016-01-01          Direct         tablet        1.880952   0.6111111

3. Bounce Rate

This is a little bit of a trick question. Since we’ve already calculated the bounce rate and added that as a new column in our web_data data frame, we may be tempted to simply average those values:

group_by(web_data,deviceCategory) %>%
  summarise(mean(bounce_rate))

## # A tibble: 3 x 2
##   deviceCategory `mean(bounce_rate)`
##            <chr>               <dbl>
## 1        desktop           0.5614256
## 2         mobile           0.6041581
## 3         tablet           0.5981690

This would be incorrect, though, wouldn’t it? This would be the daily average bounce rate, but we want just the average bounce rate. Note that R is not going to point this out. We still have to use our analytical brains.

What we need to do is to first summarise the data and sum the metrics we need to use for the bounce rate calculation and then perform the bounce rate calculation. This is where the pipe really starts to come in handy for code readability:

group_by(web_data,deviceCategory) %>%
summarise(entrances = sum(entrances), 
          bounces = sum(bounces)) %>%
  mutate(bounce_rate = bounces / entrances)

## # A tibble: 3 x 4
##   deviceCategory entrances bounces bounce_rate
##            <chr>     <int>   <int>       <dbl>
## 1        desktop    914621  565252   0.6180177
## 2         mobile    789386  479515   0.6074531
## 3         tablet    204921  122656   0.5985526

Note how the results from the second calculation differ from those in the first calculation.

Also, a note on why we included “entrances =” in the summarise() function. If we hadn’t, then the column heading would have been sum(entrances) and sum(bounces). That would have been fine, but we then would have needed to change the mutate() function:

mutate(bounce_Rate = sum(bounces) / sum(entrances))

Either would work, but it’s sometimes easier to clean up names as we go along. If anything, this should be reminiscent of working with pivot tables in Excel: if we added sessions as a value summarised by summing it, then the value would show as SUM of sessions in the pivot table. Right?