This exercise requires having a web_data data frame. You can either load some sample data by completing the I/O Exercise (which is what the step-by-step instructions below show) or, if you have access to a Google Analytics account, use your own data by following the steps on the Google Analytics API page. If you have access to an Adobe Analytics account, you can instead use your own data by following the Generating web_data steps on the Adobe Analytics API page.
Once you have a web_data data frame to work with, the command head(web_data) should return a table that, at least structurally, looks something like this:
date | channelGrouping | deviceCategory | sessions | pageviews | entrances | bounces |
---|---|---|---|---|---|---|
2016-01-01 | (Other) | desktop | 19 | 23 | 19 | 15 |
2016-01-01 | (Other) | mobile | 112 | 162 | 112 | 82 |
2016-01-01 | (Other) | tablet | 24 | 41 | 24 | 19 |
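If you just want to experiment with the indexing syntax before loading the full data set, a tiny stand-in can be typed in by hand. This is only a sketch: the three rows are copied from the table above, and storing date as a Date (rather than text) is an assumption about how your real data is typed.

```r
# Hand-typed stand-in for web_data, built from the three sample rows
# shown above. The real data set is far larger; the Date type for the
# date column is an assumption.
web_data <- data.frame(
  date            = as.Date(rep("2016-01-01", 3)),
  channelGrouping = rep("(Other)", 3),
  deviceCategory  = c("desktop", "mobile", "tablet"),
  sessions        = c(19, 112, 24),
  pageviews       = c(23, 162, 41),
  entrances       = c(19, 112, 24),
  bounces         = c(15, 82, 19),
  stringsAsFactors = FALSE
)

str(web_data)    # column names and types
head(web_data)   # first rows (here, all three)
```

With only three rows, the filtering steps later in the exercise will return very little, so treat this as a scratchpad rather than a replacement for the sample data.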
The step-by-step instructions for each component of this exercise are below. In summary, what we want to do is:
- View web_data in the Environment pane
- View the first few rows of web_data with the head() function
- View a single value in web_data using indices
- View an entire row, and an entire column, of web_data using indices
- Mimic the head() function using indices to display the first 6 rows of data in web_data
- View the web_data sessions column using $ notation
- View the first few values of the sessions column using both $ and indices
- Create a new data frame called mobile_data that includes only the rows from web_data where the deviceCategory value is mobile
- Display the rows in mobile_data where the sessions are greater than 2,000
- Get that same result directly from web_data (don’t create an intermediate mobile_data object)
- Display the rows where mobile is the deviceCategory, sessions is greater than 2000, but only display three columns: date, channelGrouping, and sessions
Instead of clicking on the object in the Environment tab, we can just type the object’s name in the console. Go ahead and do that:
web_data
Whoa! We’re not showing the results of that here, as it’s 10,000 rows of material quickly flashing past your eyes. Often, though, we just want a basic sense of the data structure, so viewing the first few rows will suffice. We can do that with the head() function (remember: ?head() will give you documentation on the function):
head(web_data)
## date channelGrouping deviceCategory sessions pageviews entrances
## 1 2016-01-01 (Other) desktop 19 23 19
## 2 2016-01-01 (Other) mobile 112 162 112
## 3 2016-01-01 (Other) tablet 24 41 24
## 4 2016-01-01 Direct desktop 133 423 133
## 5 2016-01-01 Direct mobile 345 878 344
## 6 2016-01-01 Direct tablet 126 237 126
## bounces
## 1 15
## 2 82
## 3 19
## 4 61
## 5 172
## 6 77
The above will likely look a bit better in your console than it does here. If you have a lot of columns, R will actually wrap the data in the console. It tends to be hard to digest that way, but that’s why we need to get comfortable with other ways of referencing subsets of a data frame!
Let’s look at a single value from this data set: the value in the second row and the fifth column:
web_data[2,5]
## [1] 162
Find this value in the data frame that you opened up from the environment (or just find it in the head() data you pulled above). Make sense?
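If it helps to double-check which cell the [row, column] pair points at, the sketch below looks up the name of column 5 and then pulls the same cell by position and by name. The data frame here is a small stand-in typed from the head() output above; it assumes your real web_data has the same column order.

```r
# Stand-in for web_data: the six rows shown by head() above
web_data <- data.frame(
  date            = rep("2016-01-01", 6),
  channelGrouping = rep(c("(Other)", "Direct"), each = 3),
  deviceCategory  = rep(c("desktop", "mobile", "tablet"), times = 2),
  sessions        = c(19, 112, 24, 133, 345, 126),
  pageviews       = c(23, 162, 41, 423, 878, 237),
  entrances       = c(19, 112, 24, 133, 344, 126),
  bounces         = c(15, 82, 19, 61, 172, 77),
  stringsAsFactors = FALSE
)

names(web_data)[5]         # name of the fifth column: "pageviews"
web_data[2, 5]             # row 2, column 5, by position: 162
web_data[2, "pageviews"]   # the same cell, with the column given by name
```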
Now, let’s look at the entire second row.
web_data[2,]
## date channelGrouping deviceCategory sessions pageviews entrances
## 2 2016-01-01 (Other) mobile 112 162 112
## bounces
## 2 82
Or, we could look at the entire 5th column (not shown here, but feel free to give it a try):
web_data[,5]
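One thing worth knowing when you try this: for a base R data frame, pulling a single column this way gives back a plain vector rather than a one-column data frame. The sketch below uses a small hypothetical stand-in for web_data and shows the drop = FALSE argument, which keeps the result as a data frame. (Tibbles from the tidyverse behave differently; this assumes an ordinary data.frame.)

```r
# Small stand-in for web_data (values taken from the sample rows above)
web_data <- data.frame(
  deviceCategory = c("desktop", "mobile", "tablet"),
  sessions       = c(19, 112, 24),
  pageviews      = c(23, 162, 41),
  stringsAsFactors = FALSE
)

col_vec <- web_data[, 3]                 # a plain numeric vector
col_df  <- web_data[, 3, drop = FALSE]   # still a data frame, one column

is.data.frame(col_vec)   # FALSE
is.data.frame(col_df)    # TRUE
```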
We can also look at ranges using these indices. To mimic head(web_data), we can simply specify we want to see all columns for the first six rows of the data:
web_data[1:6,]
## date channelGrouping deviceCategory sessions pageviews entrances
## 1 2016-01-01 (Other) desktop 19 23 19
## 2 2016-01-01 (Other) mobile 112 162 112
## 3 2016-01-01 (Other) tablet 24 41 24
## 4 2016-01-01 Direct desktop 133 423 133
## 5 2016-01-01 Direct mobile 345 878 344
## 6 2016-01-01 Direct tablet 126 237 126
## bounces
## 1 15
## 2 82
## 3 19
## 4 61
## 5 172
## 6 77
Or, if we wanted to look at just the second through fifth columns for the first six rows of data:
web_data[1:6,2:5]
## channelGrouping deviceCategory sessions pageviews
## 1 (Other) desktop 19 23
## 2 (Other) mobile 112 162
## 3 (Other) tablet 24 41
## 4 Direct desktop 133 423
## 5 Direct mobile 345 878
## 6 Direct tablet 126 237
The dicey thing about using numeric indices is that, if the structure of the data changes (e.g., the query of the API gets updated to add a dimension or a metric), the indices may suddenly start referencing the wrong thing.
Happily, we can use column names to prevent this. If you’ve worked with Excel tables, this will seem somewhat familiar.
Let’s look at just the sessions column:
web_data$sessions
Or, we can combine column names and indices. If we use a column name, then we don’t need to specify a column index, so there is only one value inside the [ ]:
web_data$sessions[1:5]
## [1] 19 112 24 133 345
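To make the risk with numeric indices concrete, here is a minimal sketch of what happens when the data gains a column. The country column and its "US" value are made up for illustration, standing in for a dimension added to the API query.

```r
# Trimmed-down stand-in for web_data
web_data <- data.frame(
  date      = rep("2016-01-01", 3),
  sessions  = c(19, 112, 24),
  pageviews = c(23, 162, 41),
  stringsAsFactors = FALSE
)

web_data[1:3, 2]          # column 2 is sessions

# Simulate an updated query that adds a (hypothetical) dimension up front
updated <- data.frame(country = "US", web_data, stringsAsFactors = FALSE)

updated[1:3, 2]           # column 2 is now date, not sessions
updated$sessions[1:3]     # the name still finds sessions
```

The index-based line silently returns the wrong column after the change, while the name-based line keeps working.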
It’s generally more efficient to make as few API calls as possible. That means that, often, we’re pulling a master data set, even though we only want to work on pieces of it at once. In this example, what if we wanted to look at just the mobile data? As a small twist, let’s not only isolate the mobile data, but also put that data into its own data frame called mobile_data:
mobile_data <- web_data[web_data$deviceCategory=="mobile",]
Double-click on the mobile_data object in your Environment to check out this data. (Or, perhaps, view the head() of this new object in your console!)
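It can help to see what the expression in the row position actually produces. The comparison returns one TRUE or FALSE per row, and the [ ] keep only the rows marked TRUE. The sketch below uses a six-row stand-in based on the head() output; is_mobile is just an illustrative name.

```r
# Stand-in for web_data: device and sessions from the first six rows
web_data <- data.frame(
  deviceCategory = rep(c("desktop", "mobile", "tablet"), times = 2),
  sessions       = c(19, 112, 24, 133, 345, 126),
  stringsAsFactors = FALSE
)

# One logical value per row
is_mobile <- web_data$deviceCategory == "mobile"
is_mobile        # FALSE TRUE FALSE FALSE TRUE FALSE
sum(is_mobile)   # number of matching rows: 2

# Using the logical vector in the row position keeps only the TRUE rows
mobile_data <- web_data[is_mobile, ]
mobile_data
```

Note that the original row numbers (2 and 5 here) are kept as row names, which is why the filtered output later in this exercise shows row labels like 8 and 33 rather than 1 and 2.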
What if we wanted to quickly get a list of dates and channels where the channel’s sessions for the day were greater than 2.000 (or 2,000, depending on which continent you are on)? We can perform this on our new mobile_data object:
mobile_data[mobile_data$sessions>2000,]
## date channelGrouping deviceCategory sessions pageviews
## 8 2016-01-01 Display mobile 3266 3772
## 33 2016-01-02 Display mobile 2375 2745
## 42 2016-01-02 Paid Search mobile 2270 4405
## 59 2016-01-03 Display mobile 2377 2697
## 86 2016-01-04 Display mobile 2535 2821
## 113 2016-01-05 Display mobile 2067 2386
## 1200 2016-02-14 Paid Search mobile 2225 4361
## 1227 2016-02-15 Paid Search mobile 10216 15883
## 1242 2016-02-16 Direct mobile 2352 13527
## 1251 2016-02-16 Organic Search mobile 3063 17671
## 1254 2016-02-16 Paid Search mobile 7151 12634
## 1281 2016-02-17 Paid Search mobile 3039 5357
## 3471 2016-05-09 Direct mobile 2694 17043
## 3582 2016-05-13 Display mobile 7955 13157
## 3608 2016-05-14 Display mobile 3151 4462
## 3635 2016-05-15 Display mobile 2459 3257
## 4540 2016-06-17 Social mobile 2088 2769
## 4567 2016-06-18 Social mobile 2112 3243
## entrances bounces
## 8 3253 2904
## 33 2366 2118
## 42 2257 1405
## 59 2369 2145
## 86 2522 2304
## 113 2059 1841
## 1200 2214 1428
## 1227 10155 7359
## 1242 2344 682
## 1251 3048 632
## 1254 7126 4936
## 1281 3028 2115
## 3471 2681 1149
## 3582 7946 7121
## 3608 3145 2855
## 3635 2453 2234
## 4540 2073 1809
## 4567 2098 1720
Could we have gotten this same result from our base web_data data set? We could, by combining criteria:
web_data[(web_data$sessions>2000 & web_data$deviceCategory=="mobile"),]
## date channelGrouping deviceCategory sessions pageviews
## 8 2016-01-01 Display mobile 3266 3772
## 33 2016-01-02 Display mobile 2375 2745
## 42 2016-01-02 Paid Search mobile 2270 4405
## 59 2016-01-03 Display mobile 2377 2697
## 86 2016-01-04 Display mobile 2535 2821
## 113 2016-01-05 Display mobile 2067 2386
## 1200 2016-02-14 Paid Search mobile 2225 4361
## 1227 2016-02-15 Paid Search mobile 10216 15883
## 1242 2016-02-16 Direct mobile 2352 13527
## 1251 2016-02-16 Organic Search mobile 3063 17671
## 1254 2016-02-16 Paid Search mobile 7151 12634
## 1281 2016-02-17 Paid Search mobile 3039 5357
## 3471 2016-05-09 Direct mobile 2694 17043
## 3582 2016-05-13 Display mobile 7955 13157
## 3608 2016-05-14 Display mobile 3151 4462
## 3635 2016-05-15 Display mobile 2459 3257
## 4540 2016-06-17 Social mobile 2088 2769
## 4567 2016-06-18 Social mobile 2112 3243
## entrances bounces
## 8 3253 2904
## 33 2366 2118
## 42 2257 1405
## 59 2369 2145
## 86 2522 2304
## 113 2059 1841
## 1200 2214 1428
## 1227 10155 7359
## 1242 2344 682
## 1251 3048 632
## 1254 7126 4936
## 1281 3028 2115
## 3471 2681 1149
## 3582 7946 7121
## 3608 3145 2855
## 3635 2453 2234
## 4540 2073 1809
## 4567 2098 1720
So far, we’ve been pulling all columns. But, we can also pull a subset of columns by passing a “vector” of column name values that we’ve “combined” with the c() function:
web_data[(web_data$sessions>2000 & web_data$deviceCategory=="mobile"),c("date","channelGrouping","sessions")]
## date channelGrouping sessions
## 8 2016-01-01 Display 3266
## 33 2016-01-02 Display 2375
## 42 2016-01-02 Paid Search 2270
## 59 2016-01-03 Display 2377
## 86 2016-01-04 Display 2535
## 113 2016-01-05 Display 2067
## 1200 2016-02-14 Paid Search 2225
## 1227 2016-02-15 Paid Search 10216
## 1242 2016-02-16 Direct 2352
## 1251 2016-02-16 Organic Search 3063
## 1254 2016-02-16 Paid Search 7151
## 1281 2016-02-17 Paid Search 3039
## 3471 2016-05-09 Direct 2694
## 3582 2016-05-13 Display 7955
## 3608 2016-05-14 Display 3151
## 3635 2016-05-15 Display 2459
## 4540 2016-06-17 Social 2088
## 4567 2016-06-18 Social 2112
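As one example of another way to write the same filter, base R’s subset() function lets you refer to columns without repeating web_data$ each time. This is a sketch on a small stand-in: the first four rows are taken from output shown earlier, and the last (desktop) row is made up so the device filter has something to exclude.

```r
# Stand-in for web_data; the final desktop row is hypothetical
web_data <- data.frame(
  date            = c("2016-01-01", "2016-01-01", "2016-01-01",
                      "2016-01-02", "2016-01-02"),
  channelGrouping = c("(Other)", "Direct", "Display",
                      "Paid Search", "Display"),
  deviceCategory  = c("mobile", "mobile", "mobile", "mobile", "desktop"),
  sessions        = c(112, 345, 3266, 2270, 2500),
  stringsAsFactors = FALSE
)

# Bracket syntax, as used above
with_brackets <- web_data[
  web_data$sessions > 2000 & web_data$deviceCategory == "mobile",
  c("date", "channelGrouping", "sessions")
]

# The same rows and columns via subset()
with_subset <- subset(
  web_data,
  sessions > 2000 & deviceCategory == "mobile",
  select = c(date, channelGrouping, sessions)
)

with_brackets
with_subset
```

subset() is convenient at the console; the bracket form is more explicit and is generally the safer choice inside functions and longer scripts.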
Believe it or not, we’ve only scratched the surface of the different ways we can access data within a data frame. Just from looking at the last example, you can see that the syntax can get loaded in a hurry. That’s where the console can come in very handy: experimenting with the different aspects of the data you’re trying to filter down to, and then combining them as warranted in your actual script.