Exercise: Accessing Data

This exercise requires having a web_data data frame. You can either load up some sample data by completing the I/O Exercise (which is what is shown in the step-by-step instructions below), or, if you have access to a Google Analytics account, you can use your own data by following the steps on the Google Analytics API page. Or, if you have access to an Adobe Analytics account, then you can use your own data by following the Generating web_data steps on the Adobe Analytics API page.

Once you have a web_data data frame to work with, the command head(web_data) should return a table that, at least structurally, looks something like this:

date	channelGrouping	deviceCategory	sessions	pageviews	entrances	bounces
2016-01-01	(Other)	desktop	19	23	19	15
2016-01-01	(Other)	mobile	112	162	112	82
2016-01-01	(Other)	tablet	24	41	24	19

The step-by-step instructions for each of the components of this exercised are below. But, what we want to do is:

Inspect the entire data set:

View the data as its own tab in RStudio by double-clicking web_data in the Environment pane
View the data in the console by typing web_data
View just the first six rows of data using the head() function

Use row/column indices to inspect specific values/sets in the data frame:

Display the element in the 2nd row and 5th column of web_data
Display the entire 2nd row of web_data
Display the entire 5th column of web_data
Mimic the head() function using indices to display the first 6 rows of data in web_data
Display the 1st through 6th rows and 2nd through 5th columns of web_data

Use column names (and indices) to inspect specific values/sets in the data frame:

Display the entire sessions column using $ notation
Display the first five values in the sessions column using both $ and indices

Filter/subset the data:

Create an object called mobile_data that includes only the rows from web_data where the deviceCategory value is mobile
Display all of the rows from mobile_data where the sessions are greater than 2.000
Display the same result as in the previous task, but do so using only web_data (don’t create an intermediate mobile_data object)
Display rows where mobile is the deviceCategory, sessions is greater than 2000, but only display three columns: date, channelGrouping, and sessions

1. Inspect the data

Instead of clicking on the object in the Environment tab, we can just type the object’s name in the console. Go ahead and do that:

web_data

Whoa! We’re not showing the results of that here, as it’s 10,000 rows of material quickly flashing past your eyes. But, often, we just want to get a basic sense of the data structure, so just viewing the first few rows of the data will suffice. We can do that with the head() function (remember: ?head() will give you documentation on the function):

##         date channelGrouping deviceCategory sessions pageviews entrances
## 1 2016-01-01         (Other)        desktop       19        23        19
## 2 2016-01-01         (Other)         mobile      112       162       112
## 3 2016-01-01         (Other)         tablet       24        41        24
## 4 2016-01-01          Direct        desktop      133       423       133
## 5 2016-01-01          Direct         mobile      345       878       344
## 6 2016-01-01          Direct         tablet      126       237       126
##   bounces
## 1      15
## 2      82
## 3      19
## 4      61
## 5     172
## 6      77

The above will likely look a bit better in your console than it does here. If you have a lot of columns, R will actually wrap the data in the console. It tends to be hard to digest that way, but that’s why we need to get comfortable with other ways of referencing subsets of a data frame!

2. Use Row/Column Indices

Let’s look at a single value from this data set: the value in the second row and the fifth column:

web_data[2,5]

## [1] 162

Find this value in the data frame that you opened up from the environment (or just find it in the head() data you pulled above). Make sense?

Now, let’s look at the entire second row.

web_data[2,]

##         date channelGrouping deviceCategory sessions pageviews entrances
## 2 2016-01-01         (Other)         mobile      112       162       112
##   bounces
## 2      82

Or, we could look at the entire 5th column (not shown here, but feel free to give it a try):

web_data[,5]

We can also look at ranges using these indices. To mimic head(web_data), we can simply specify we want to see all columns for the first six rows of the data:

web_data[1:6,]

##         date channelGrouping deviceCategory sessions pageviews entrances
## 1 2016-01-01         (Other)        desktop       19        23        19
## 2 2016-01-01         (Other)         mobile      112       162       112
## 3 2016-01-01         (Other)         tablet       24        41        24
## 4 2016-01-01          Direct        desktop      133       423       133
## 5 2016-01-01          Direct         mobile      345       878       344
## 6 2016-01-01          Direct         tablet      126       237       126
##   bounces
## 1      15
## 2      82
## 3      19
## 4      61
## 5     172
## 6      77

Or, if we wanted to look at just the second through fifth columns for the first six rows of data:

web_data[1:6,2:5]

##   channelGrouping deviceCategory sessions pageviews
## 1         (Other)        desktop       19        23
## 2         (Other)         mobile      112       162
## 3         (Other)         tablet       24        41
## 4          Direct        desktop      133       423
## 5          Direct         mobile      345       878
## 6          Direct         tablet      126       237

3. Use Column Names

The dicey thing about using numeric indices is that, if the structure of the data changes (e.g., the query of the API gets updated to add a dimension or a metric), the indices may suddenly start referencing the wrong thing.

Happily, we can use column names to prevent this. If you’ve worked with Excel tables, this will seem somewhat familiar.

Let’s look at just the sessions column:

web_data$sessions

Or, we can combine column names and indices. If we use a column name, then we don’t need to specify a column index, so there is only one value inside the [ ]s:

web_data$sessions[1:5]

## [1]  19 112  24 133 345

4. Filter/Subset the Data

It’s generally more efficient to do as few API calls as possible. That means that, often, we’re pulling a master data set, even though we only want to work on pieces of it at once. In this example, what if we wanted to look at just the mobile data. And, as a small twist, let’s not only isolate the mobile data, but let’s put that data into its own data frame calle mobile_data:

mobile_data <- web_data[web_data$deviceCategory=="mobile",]

Double-click on the mobile_data object in your Environment to check out this data. (Or, perhaps, view the head() of this new object in your console!)

What if we wanted to quickly get a list of dates and channels where the channel’s sessions for the day were greater than 2.000 (or 2,000, depending on which continent you are on)? We can perform this on our new mobile_data object:

mobile_data[mobile_data$sessions>2000,]

##            date channelGrouping deviceCategory sessions pageviews
## 8    2016-01-01         Display         mobile     3266      3772
## 33   2016-01-02         Display         mobile     2375      2745
## 42   2016-01-02     Paid Search         mobile     2270      4405
## 59   2016-01-03         Display         mobile     2377      2697
## 86   2016-01-04         Display         mobile     2535      2821
## 113  2016-01-05         Display         mobile     2067      2386
## 1200 2016-02-14     Paid Search         mobile     2225      4361
## 1227 2016-02-15     Paid Search         mobile    10216     15883
## 1242 2016-02-16          Direct         mobile     2352     13527
## 1251 2016-02-16  Organic Search         mobile     3063     17671
## 1254 2016-02-16     Paid Search         mobile     7151     12634
## 1281 2016-02-17     Paid Search         mobile     3039      5357
## 3471 2016-05-09          Direct         mobile     2694     17043
## 3582 2016-05-13         Display         mobile     7955     13157
## 3608 2016-05-14         Display         mobile     3151      4462
## 3635 2016-05-15         Display         mobile     2459      3257
## 4540 2016-06-17          Social         mobile     2088      2769
## 4567 2016-06-18          Social         mobile     2112      3243
##      entrances bounces
## 8         3253    2904
## 33        2366    2118
## 42        2257    1405
## 59        2369    2145
## 86        2522    2304
## 113       2059    1841
## 1200      2214    1428
## 1227     10155    7359
## 1242      2344     682
## 1251      3048     632
## 1254      7126    4936
## 1281      3028    2115
## 3471      2681    1149
## 3582      7946    7121
## 3608      3145    2855
## 3635      2453    2234
## 4540      2073    1809
## 4567      2098    1720

Could we have gotten this same result from our base web_data data set? We could – by combining criteria:

web_data[(web_data$sessions>2000 & web_data$deviceCategory=="mobile"),]

##            date channelGrouping deviceCategory sessions pageviews
## 8    2016-01-01         Display         mobile     3266      3772
## 33   2016-01-02         Display         mobile     2375      2745
## 42   2016-01-02     Paid Search         mobile     2270      4405
## 59   2016-01-03         Display         mobile     2377      2697
## 86   2016-01-04         Display         mobile     2535      2821
## 113  2016-01-05         Display         mobile     2067      2386
## 1200 2016-02-14     Paid Search         mobile     2225      4361
## 1227 2016-02-15     Paid Search         mobile    10216     15883
## 1242 2016-02-16          Direct         mobile     2352     13527
## 1251 2016-02-16  Organic Search         mobile     3063     17671
## 1254 2016-02-16     Paid Search         mobile     7151     12634
## 1281 2016-02-17     Paid Search         mobile     3039      5357
## 3471 2016-05-09          Direct         mobile     2694     17043
## 3582 2016-05-13         Display         mobile     7955     13157
## 3608 2016-05-14         Display         mobile     3151      4462
## 3635 2016-05-15         Display         mobile     2459      3257
## 4540 2016-06-17          Social         mobile     2088      2769
## 4567 2016-06-18          Social         mobile     2112      3243
##      entrances bounces
## 8         3253    2904
## 33        2366    2118
## 42        2257    1405
## 59        2369    2145
## 86        2522    2304
## 113       2059    1841
## 1200      2214    1428
## 1227     10155    7359
## 1242      2344     682
## 1251      3048     632
## 1254      7126    4936
## 1281      3028    2115
## 3471      2681    1149
## 3582      7946    7121
## 3608      3145    2855
## 3635      2453    2234
## 4540      2073    1809
## 4567      2098    1720

So far, we’ve been pulling all columns. But, we can also pull a subset of columns by passing a “vector” of column name values that we’ve “combined” with the c() function:

web_data[(web_data$sessions>2000 & web_data$deviceCategory=="mobile"),c("date","channelGrouping","sessions")]

##            date channelGrouping sessions
## 8    2016-01-01         Display     3266
## 33   2016-01-02         Display     2375
## 42   2016-01-02     Paid Search     2270
## 59   2016-01-03         Display     2377
## 86   2016-01-04         Display     2535
## 113  2016-01-05         Display     2067
## 1200 2016-02-14     Paid Search     2225
## 1227 2016-02-15     Paid Search    10216
## 1242 2016-02-16          Direct     2352
## 1251 2016-02-16  Organic Search     3063
## 1254 2016-02-16     Paid Search     7151
## 1281 2016-02-17     Paid Search     3039
## 3471 2016-05-09          Direct     2694
## 3582 2016-05-13         Display     7955
## 3608 2016-05-14         Display     3151
## 3635 2016-05-15         Display     2459
## 4540 2016-06-17          Social     2088
## 4567 2016-06-18          Social     2112

Believe it or not, we’ve only scratched the surface of the different ways we can access data within a data frame. Just from looking at the last example, you can see that the syntax can get loaded in a hurry. That’s where the console can come in very handy: experimenting with the different aspects of the data you’re trying to filter down to, and then combining them as warranted in your actual script.