This exercise requires having a web_data
data frame. You can either load up some sample data by completing the I/O Exercise (which is what is shown in the step-by-step instructions below), or, if you have access to a Google Analytics account, you can use your own data by following the steps on the Google Analytics API page. Or, if you have access to an Adobe Analytics account, then you can use your own data by following the Generating web_data steps on the Adobe Analytics API page.
web_data
data frame to work with, the command head(web_data)
should return a table that, at least structurally, looks something like this:
date | channelGrouping | deviceCategory | sessions | pageviews | entrances | bounces |
---|---|---|---|---|---|---|
2016-01-01 | (Other) | desktop | 19 | 23 | 19 | 15 |
2016-01-01 | (Other) | mobile | 112 | 162 | 112 | 82 |
2016-01-01 | (Other) | tablet | 24 | 41 | 24 | 19 |
Our web_data
data is already in a tidy format. When data is not in a tidy format when you get it, the tidyr
package has functions to help make it that way. Some day, perhaps we’ll have an exercise to do this!
For this exercise, complete the following tasks with web_data
:
summarise()
function in the dplyr
packagemutate()
function: pvs_per_session and bounce_rateStart by loading the dplyr
package:
library(dplyr)
Now, dive into the exercises!
To get the averages for each combination of deviceCategory and channelGrouping we simply use summarise()
with group_by
:
group_by(web_data, deviceCategory, channelGrouping) %>%
summarise(mean(pageviews))
## # A tibble: 27 x 3
## # Groups: deviceCategory [?]
## deviceCategory channelGrouping `mean(pageviews)`
## <chr> <chr> <dbl>
## 1 desktop (Other) 108.4413
## 2 desktop Direct 2220.2394
## 3 desktop Display 576.0329
## 4 desktop Email 246.8169
## 5 desktop Organic Search 2902.8826
## 6 desktop Paid Search 1927.7136
## 7 desktop Referral 1401.0094
## 8 desktop Social 368.0798
## 9 desktop Video 584.2358
## 10 mobile (Other) 193.7981
## # ... with 17 more rows
This function uses the group_by
function to take the data set (web_data
) and specify that it should be grouped together by two fields: deviceCategory and channelGrouping. That, in and of itself, isn’t enough, as this simply identifies the grouped rows. It doesn’t actually collapse them in any way (run the code above without the summarise()
function to see).
So, in addition to grouping, we need to specify how we want to handle the multiple rows that we’ll collapse within the groups. For that, we use the summarise()
function, tell it that we want to include pageviews as a metric, and tell it that we want to collapse the grouped rows by averaging (mean()
) pageviews.
Note #1: The above uses the ‘pipe’ notation. The exact same result could be achieved by nesting the group_by()
function inside the summarise()
function (below). In simple cases, either way works, but the pipe notation generally makes for more readable code.
summarise(group_by(web_data, deviceCategory, channelGrouping),mean(pageviews))
Note #2: We could include multiple metrics in the summarise()
function. For instance, if we wanted to get the total visits and the average pageviews, we could alter the code slightly:
group_by(web_data, deviceCategory, channelGrouping) %>%
summarise(mean(pageviews),sum(sessions))
## # A tibble: 27 x 4
## # Groups: deviceCategory [?]
## deviceCategory channelGrouping `mean(pageviews)` `sum(sessions)`
## <chr> <chr> <dbl> <int>
## 1 desktop (Other) 108.4413 10820
## 2 desktop Direct 2220.2394 297580
## 3 desktop Display 576.0329 88688
## 4 desktop Email 246.8169 22350
## 5 desktop Organic Search 2902.8826 156144
## 6 desktop Paid Search 1927.7136 154535
## 7 desktop Referral 1401.0094 75253
## 8 desktop Social 368.0798 22434
## 9 desktop Video 584.2358 92885
## 10 mobile (Other) 193.7981 29115
## # ... with 17 more rows
Note #3: Both summarise()
and summarize()
do the same thing. On this site, we skew towards the Queen’s English either because one of the key contributors is British or because Hadley Wickham is from New Zealand. Or both. Either will work!
In this case, we’re actually going to modify the web_data object by adding a couple of calculated columns. This is, really, just like working with an Excel Table and adding columns that are based on existing columns in the table. The mutate()
function takes a data set and then adds new columns as specified in the remaining parameters:
web_data <- mutate(web_data, pvs_per_session = pageviews / sessions,
bounce_rate = bounces / entrances)
# Display the results: truncated here so that it fits nicely on the screen
head(web_data[c("date","channelGrouping","deviceCategory",
"pvs_per_session","bounce_rate")])
## date channelGrouping deviceCategory pvs_per_session bounce_rate
## 1 2016-01-01 (Other) desktop 1.210526 0.7894737
## 2 2016-01-01 (Other) mobile 1.446429 0.7321429
## 3 2016-01-01 (Other) tablet 1.708333 0.7916667
## 4 2016-01-01 Direct desktop 3.180451 0.4586466
## 5 2016-01-01 Direct mobile 2.544928 0.5000000
## 6 2016-01-01 Direct tablet 1.880952 0.6111111
This is a little bit of a trick question. Since we’ve already calculated the bounce rate and added that as a new column in our web_data data frame, we may be tempted to simply average those values:
group_by(web_data,deviceCategory) %>%
summarise(mean(bounce_rate))
## # A tibble: 3 x 2
## deviceCategory `mean(bounce_rate)`
## <chr> <dbl>
## 1 desktop 0.5614256
## 2 mobile 0.6041581
## 3 tablet 0.5981690
This would be incorrect, though, wouldn’t it? This would be the daily average bounce rate, but we want just the average bounce rate. Note that R is not going to point this out. We still have to use our analytical brains.
What we need to do is to first summarise the data and sum the metrics we need to use for the bounce rate calculation and then perform the bounce rate calculation. This is where the pipe really starts to come in handy for code readability:
group_by(web_data,deviceCategory) %>%
summarise(entrances = sum(entrances),
bounces = sum(bounces)) %>%
mutate(bounce_rate = bounces / entrances)
## # A tibble: 3 x 4
## deviceCategory entrances bounces bounce_rate
## <chr> <int> <int> <dbl>
## 1 desktop 914621 565252 0.6180177
## 2 mobile 789386 479515 0.6074531
## 3 tablet 204921 122656 0.5985526
Note how the results from the second calculation differ from those in the first calculation.
Also, a note on why we included “entrances =” in the summarise()
function. If we hadn’t, then the column heading would have been sum(entrances) and sum(bounces). That would have been fine, but we then would have needed to change the mutate()
function:
mutate(bounce_Rate = sum(bounces) / sum(entrances))
Either would work, but it’s sometimes easier to clean up names as we go along. If anything, this should be reminiscent of working with pivot tables in Excel: if we added sessions as a value summarised by summing it, then the value would show as SUM of sessions in the pivot table. Right?