This example creates a Sankey chart to show how traffic flows from the homepage, split by channel grouping.

The steps to achieve this are:

1. Call the `google_analytics` function (named `google_analytics_4` in older versions of googleAnalyticsR) to get unique pageviews, split by the `secondPagePath` dimension. We apply a dimension filter to limit the results to sessions landing on our website homepage.
2. Load the `googleVis` package and build a `gvisSankey` plot. We set some options to colour traffic paths according to the referring `channelGrouping` for our visitors.

These examples were built thanks to the excellent articles on r-bloggers.com by Tony Hirst.
Be sure you’ve completed the steps on the Initial Setup page before running this code.
For the setup, we’re going to load a few libraries, load our specific Google Analytics credentials, and then authorize with Google.
# Load the necessary libraries. These libraries aren't all necessarily required for every
# example, but, for simplicity's sake, we're going ahead and including them in every example.
# The "typical" way to load these is simply with "library([package name])." But, the handy
# thing about using the approach below -- which uses the pacman package -- is that it will
# check that each package exists and actually install any that are missing before loading
# the package.
if (!require("pacman")) install.packages("pacman")
pacman::p_load(googleAnalyticsR,  # How we actually get the Google Analytics data
               tidyverse,         # Includes dplyr, ggplot2, and others; very key!
               devtools,          # Generally handy
               googleVis,         # Useful for some of the visualizations
               scales)            # Useful for some number formatting in the visualizations
# Authorize GA. Depending on if you've done this already and a .ga-httr-oauth file has
# been saved or not, this may pop you over to a browser to authenticate.
ga_auth(token = ".ga-httr-oauth")
# Set the view ID and the date range. If you want to, you can swap out the Sys.getenv()
# call and just replace that with a hardcoded value for the view ID. And, the start
# and end date are currently set to choose the last 30 days, but those can be
# hardcoded as well.
view_id <- Sys.getenv("GA_VIEW_ID")
start_date <- Sys.Date() - 31 # 30 days back from yesterday
end_date <- Sys.Date() - 1 # Yesterday
If that all runs with just some messages but no errors, then you’re set for the next chunk of code: pulling the data.
Now we’re ready to make our call to the `google_analytics` function (named `google_analytics_4` in older versions of googleAnalyticsR). We’ll pull the `uniquePageviews` metric, combined with the `channelGrouping` and `secondPagePath` dimensions (read about the `secondPagePath` dimension in the GA reporting documentation).
Before making the call, we need to build a `filter_clause_ga4` object containing one `dim_filter`, so that we get data only for sessions where our users landed on the homepage of our site. Mark Edmondson has written some very helpful documentation on the new filter clauses.
The code below builds a list which can be passed as an argument to our `google_analytics` request. We use a regular expression to identify the homepage. The code assumes the homepage shows up in your Pages report as simply “/”, but you can adjust the `expressions` argument if that’s not the case. (Of course, you can also run this entire example for any landing page you choose by adjusting that argument, or even create multiple Sankey charts, one for each of your top landing pages. Oh, the possibilities!)
# Create page filter object
page_filter <- dim_filter(
  dimension = "landingPagePath",
  operator = "REGEXP",
  expressions = "^/$")

homepage_filter <- filter_clause_ga4(list(page_filter))
# Now, we're ready to pull the data from the GA API. We build a `google_analytics` request,
# passing the `homepage_filter` to the `dim_filters` argument.
home_next_pages <- google_analytics(
  viewId = view_id,
  date_range = c(start_date, end_date),
  dimensions = c("secondPagePath", "channelGrouping"),
  metrics = "uniquePageviews",
  dim_filters = homepage_filter,
  max = -1,
  anti_sample = TRUE
)
# Go ahead and do a quick inspection of the data that was returned. This isn't required,
# but it's a good check along the way.
head(home_next_pages)
secondPagePath | channelGrouping | uniquePageviews
---|---|---
(not set) | (Other) | 4
(not set) | Direct | 354
(not set) | Display | 3
(not set) | Email | 1
(not set) | Organic Search | 337
(not set) | Referral | 213
What we have is a data frame containing unique pageviews per next page for visits which started on your homepage, split by traffic source. The data used here isn’t from a site with the world’s most diverse and interesting set of channels, but, with luck, your data will be!
We have a small problem in the number of possible next pages for our sessions (you don’t need to include this code; it’s just checking the number of unique values for the `secondPagePath` dimension in our data set):
length(unique(home_next_pages$secondPagePath))
## [1] 25
We should thin this down to a number which can be easily visualised, which is a two-step process:
# Build the data frame of top 10 pages:
top_10 <- home_next_pages %>%
  group_by(secondPagePath) %>%
  summarise(upvs = sum(uniquePageviews)) %>%
  top_n(10, upvs) %>%
  arrange(desc(upvs))

# Using this list of our top 10 pages, use the `semi_join` function from `dplyr` to restrict
# our data to pages & channels that have one of these top 10 pages as the second page viewed.
home_next_pages <- home_next_pages %>%
  semi_join(top_10, by = "secondPagePath")
# Check the data again. It's the same structure as it was originally, and the head() is likely
# identical. But, we know that, deeper in the data, the lower-volume pages have been removed.
head(home_next_pages)
secondPagePath | channelGrouping | uniquePageviews
---|---|---
(not set) | (Other) | 4
(not set) | Direct | 354
(not set) | Display | 3
(not set) | Email | 1
(not set) | Organic Search | 337
(not set) | Referral | 213
Now we have a data frame ready for plotting, using our top 10 pages. Again, you don’t need to include this code; we’re just showing that we’re now down to 10 unique values for `secondPagePath`:
# Only 10 unique URLs are in our results, now
length(unique(home_next_pages$secondPagePath))
## [1] 10
We’ll make use of the `gvisSankey` function to build our plot (read the function documentation).
# Reorder columns: the gvisSankey function doesn't take kindly
# to our df columns not being strictly ordered as from:to:weight
home_next_pages <- home_next_pages %>%
  select(channelGrouping, secondPagePath, uniquePageviews)

# Build the plot
s <- gvisSankey(home_next_pages)
plot(s)
Note how you can actually mouse over the different values to see additional details.
Our first chart is a nice enough start, but it’s pretty messy, and it’s hard to distinguish between traffic sources. We should try to colour the node links according to the source (`channelGrouping`).

You can control the appearance of your Sankey chart, including link colours, by passing `options` values in as a JSON object or as part of a list. I find it easier to write the values as JSON, for readability.
We have multiple possible channel groupings in our GA data. If we know how many we’ll include (which we could have addressed in our Data Munging by forcing just the top X channels), then we could define a list of colors that is exactly that long. For now, we’re going to define 8 colour values, even though that’s more than we actually need.
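As an aside, if you did want to force just the top X channels during the data munging step, a sketch might look like the following. It reuses the same `dplyr` pattern as the top-10 pages step above; the cutoff of 4 channels and the "Other channels" label are arbitrary choices for illustration, not part of the original example.

```r
# Find the top 4 channels by unique pageviews (assumes the tidyverse
# packages are loaded and home_next_pages exists, as earlier in this example).
top_channels <- home_next_pages %>%
  group_by(channelGrouping) %>%
  summarise(upvs = sum(uniquePageviews)) %>%
  top_n(4, upvs)

# Lump every other channel into an "Other channels" bucket, then
# re-aggregate so each channel/page pair appears only once.
home_next_pages <- home_next_pages %>%
  mutate(channelGrouping = if_else(
    channelGrouping %in% top_channels$channelGrouping,
    channelGrouping,
    "Other channels")) %>%
  group_by(channelGrouping, secondPagePath) %>%
  summarise(uniquePageviews = sum(uniquePageviews), .groups = "drop")
```

With the channels capped like this, a colour list of exactly X + 1 values would cover every source node.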
We can generate these colour values as hex codes using the ColorBrewer website. ColorBrewer helps ensure our colours can be differentiated and follow good practice for data visualisation.
# 8 values from ColorBrewer. Note the use of array notation
colour_opts <- '["#7fc97f", "#beaed4", "#fdc086", "#ffff99", "#386cb0", "#f0027f", "#bf5b17", "#666666"]'

# Set colorMode to 'source' to colour each link by its source node
opts <- paste0("{
  link: { colorMode: 'source',
          colors: ", colour_opts, " }
}")
# This colour list can now be passed as an option to our `gvisSankey` call. We pass them to the
# `options` argument for our plot.
s <- gvisSankey(home_next_pages,
                options = list(sankey = opts))
plot(s)
This is a bit more useful. Still messy, but if you build your own plot, you’ll notice that when you hover over each node, tooltips will appear to give you information about the source, destination, and volume of pageviews.
We may find it useful to limit the use of colour and focus on a subset of the data. Let’s highlight the second channel and wash out the colour for all other traffic sources.
# Neutral gray (#999999) for all sources except the second one.
colour_opts <- '["#999999", "#7fc97f", "#999999", "#999999", "#999999", "#999999", "#999999", "#999999"]'

opts <- paste0("{
  link: { colorMode: 'source',
          colors: ", colour_opts, " },
  node: { colors: '#999999' }
}")
# This colour list can now be passed as an option to our `gvisSankey` call.
s <- gvisSankey(home_next_pages,
                options = list(sankey = opts))
plot(s)
This is a bit easier to read. There is plenty more work that can be done, but hopefully this guide provides enough information to get you started.
Remember that the number of next pages can be controlled to your preference. It could also be interesting to classify traffic by segment rather than `channelGrouping`, and use the segment types as your sources.
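A minimal sketch of that segment idea is below, using googleAnalyticsR’s segment helpers. The built-in "All Users" segment ID (`gaid::-1`) is real, but it is only a placeholder here; swap in the `gaid::` IDs for the segments you actually want to compare (you can list them with `ga_segment_list()`).

```r
# Define one or more segments; "gaid::-1" is the built-in All Users segment.
# Replace/add segments of interest (e.g. mobile traffic, new users) as needed.
seg_all <- segment_ga4("All Users", segment_id = "gaid::-1")

# Same report as before, but with the "segment" dimension in place of
# channelGrouping; homepage_filter, view_id, and the dates come from earlier.
home_next_pages_seg <- google_analytics(
  viewId = view_id,
  date_range = c(start_date, end_date),
  dimensions = c("segment", "secondPagePath"),
  metrics = "uniquePageviews",
  dim_filters = homepage_filter,
  segments = list(seg_all),
  max = -1
)

# The segment column now plays the Sankey "source" role, so reorder
# to from:to:weight before plotting, just as with channelGrouping.
home_next_pages_seg <- home_next_pages_seg %>%
  select(segment, secondPagePath, uniquePageviews)
s <- gvisSankey(home_next_pages_seg)
plot(s)
```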
Full documentation on `googleVis` Sankey charts can be found at https://developers.google.com/chart/interactive/docs/gallery/sankey#controlling-colors
This site is a sub-site to dartistics.com