We’ve touched on this subject already, but a more detailed look at the types – “classes” – of objects R uses is worthwhile.
There are several functions that you will seldom – if ever – use in a script, but that can come in very handy in the console when writing and debugging your code. To illustrate, we’re going to use part of the web_data data set that is used throughout this site. This data looks like this:
date | channelGrouping | deviceCategory | sessions |
---|---|---|---|
2016-01-01 | (Other) | desktop | 19 |
2016-01-01 | (Other) | mobile | 112 |
2016-01-01 | (Other) | tablet | 24 |
2016-01-01 | Direct | desktop | 133 |
2016-01-01 | Direct | mobile | 345 |
2016-01-01 | Direct | tablet | 126 |
The str() function provides the “structure” of an object. In addition to providing the class of the entire object and, as applicable, each component of the object, it includes some of the actual data within the object. As such, it can really be your go-to function, once you’re used to reading the output.
str(web_data)
## 'data.frame': 5732 obs. of 4 variables:
## $ date : chr "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
## $ channelGrouping: chr "(Other)" "(Other)" "(Other)" "Direct" ...
## $ deviceCategory : chr "desktop" "mobile" "tablet" "desktop" ...
## $ sessions : int 19 112 24 133 345 126 307 3266 1025 17 ...
What this output is saying is that web_data is a “data frame” with 5,732 rows (“obs.”) and 4 columns (“variables”). It then provides the class for each column of data within the data frame, as well as a few sample values.
The class() function provides just the class of the specified object. Let’s see what that looks like for web_data:
class(web_data)
## [1] "data.frame"
Notice how this is just a subset of what was returned by str() – it doesn’t get into the weeds of the underlying data within the overall object. In this case, it just tells you that, overall, web_data is a data frame.
Without worrying too much about the syntax (we’ll get to that very shortly), note that we can check the class of something inside the object. Let’s check the class for the sessions column in the data:
class(web_data$sessions)
## [1] "integer"
Can you find this same information in the str() output above?
You may be asking, “Why would I ever use class()?” The answer is really that, as you get more complex objects (data), the str() function can start to return a lot of information. If you’re really interested in determining the class of something deep inside a complex object, class() can be more concise and preferable.
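As a rough sketch of that idea, the code below applies class() to every column at once using sapply(). It rebuilds the six sample rows shown earlier as a small data frame called web_data_sample, a hypothetical name used only for this illustration (it is not the full web_data object).

```r
# A hypothetical six-row sample that mirrors the table shown earlier
web_data_sample <- data.frame(
  date = rep("2016-01-01", 6),
  channelGrouping = rep(c("(Other)", "Direct"), each = 3),
  deviceCategory = rep(c("desktop", "mobile", "tablet"), times = 2),
  sessions = c(19L, 112L, 24L, 133L, 345L, 126L),
  stringsAsFactors = FALSE
)

# The class of a single column
class(web_data_sample$sessions)
## [1] "integer"

# The class of every column at once, returned as a named character vector
column_classes <- sapply(web_data_sample, class)
column_classes
```

The result is much shorter than the str() output, because it leaves out the sample values.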
This topic is inherently circular, in that we’ve introduced different classes without actually explaining what they are. That’s what we’ll do now (but we’ll use the class() function to inspect sample data as we go, which is why we had to cover it first!).
There are many different types of classes, and you can make your own, so this is not a definitive list, but it will cover the major ones:
The numeric class is simply a number like 1, 2, or 3. Sometimes this will show up as integer (no decimals), and sometimes this will show up as double (has decimals). R will pick which one is appropriate, but you can force one or the other by using the as.integer() and as.double() functions (which the authors of this site cannot recall ever using, but it was gratifying that their assumptions that these functions existed – and what they were named – turned out to be true).
The metrics from your web analytics platform will almost always arrive as numeric class objects.
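Here is a small sketch of the integer/double distinction. The object names whole and plain are made up for this example. One detail worth noticing: class() reports a double as "numeric", while typeof() reports how the number is actually stored.

```r
# A whole number typed with an L suffix is stored as an integer
whole <- 5L
class(whole)
## [1] "integer"

# A number typed without the L suffix is stored as a double,
# which class() reports as "numeric"
plain <- 5
class(plain)
## [1] "numeric"
typeof(plain)
## [1] "double"

# Forcing one or the other
as.integer(3.7)   # the decimal part is dropped, not rounded
## [1] 3
class(as.double(whole))
## [1] "numeric"
```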
The character class is just what it says – a text-based string like hello or mobile or Paid Search. Of course, you can have numerical digits stored in a character class object… but you generally do not want that. (If you’re an Excel junkie, think of this as being one of those cases where you wind up needing to do a “convert text to numbers” operation.)
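To see why digits stored as text are a problem, here is a minimal sketch using three made-up values. The names sessions_text and sessions_number are just for illustration, and tryCatch() is used only so that the example keeps running after the error.

```r
# Digits stored as text
sessions_text <- c("19", "112", "24")
class(sessions_text)
## [1] "character"

# Arithmetic on text fails, so catch the error instead of stopping
result <- tryCatch(sum(sessions_text), error = function(e) "error")
result
## [1] "error"

# Converting to numbers first makes the arithmetic work
sessions_number <- as.numeric(sessions_text)
sum(sessions_number)
## [1] 155
```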
The logical class is R’s Boolean class: TRUE or FALSE. This may seem like it’s more of a corner case class, but it really isn’t – there are any number of operations in R which, under the hood, are actually generating a bunch of TRUE/FALSE flags. So, it’s good to know that these are a special class unto themselves – distinct from a character-class object storing the strings "TRUE" and "FALSE".
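A comparison is one common operation that produces those flags. The sketch below uses the six session counts from the sample table; the names sessions and over_100 are made up for the example.

```r
sessions <- c(19, 112, 24, 133, 345, 126)

# A comparison returns one TRUE/FALSE flag per element
over_100 <- sessions > 100
over_100
## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE
class(over_100)
## [1] "logical"

# The flags can be used to keep only the matching values
sessions[over_100]
## [1] 112 133 345 126

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(over_100)
## [1] 4

# The string "TRUE" is a different thing entirely
class("TRUE")
## [1] "character"
```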
Date-class objects are objects that store a (wait for it!) date. Things can actually get a little tricky here, since you cannot tell whether a value is a Date-class object or a character-class object simply by looking at the data. Yet, they are fundamentally different things (and are generally pretty easy to convert from one class to the other – and back, if that’s your thing).
# Get the current date and assign it to an object called `a_date`
a_date <- Sys.Date()
# Display the result
a_date
## [1] "2017-10-04"
# Check its class
class(a_date)
## [1] "Date"
So far, so good. But, let’s set the date as a character class instead – simply by creating it as a string:
# Create an object called `a_character` and assign a "date" to it
a_character <- "2017-04-17"
# Display the result
a_character
## [1] "2017-04-17"
Uh-oh! Compare that result to a_date above. Do you see any difference? There is none (apparently)! But, yet, there is! Let’s check the class of a_character:
# Check the class
class(a_character)
## [1] "character"
Things like finding date ranges or weekdays will work on a Date object, but not on a character. And, depending on the package and the function, you may need to pass in “dates” as either character class objects or as Date class objects.
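As a quick sketch of that difference, the code below tries the same kind of date arithmetic on a Date and on a character. Here a_date is set to a fixed date rather than Sys.Date() so that the results are repeatable, and tryCatch() is used only to keep the example running after the error.

```r
a_date <- as.Date("2017-04-17")
a_character <- "2017-04-17"

# Date arithmetic works on a Date object
a_date + 7
## [1] "2017-04-24"

# The number of days between two dates
days_between <- as.numeric(as.Date("2017-10-04") - a_date)
days_between
## [1] 170

# weekdays() returns the day name (in the language of your R session)
weekdays(a_date)

# The same arithmetic on the character version fails
attempt <- tryCatch(a_character + 7, error = function(e) "error")
attempt
## [1] "error"
```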
Factors are really nothing more than categorical variables… except factors are both brilliant and can be frustrating. The reason? Well, they look the same as character when printed, but they act quite differently.
Let’s take a look. Again, we’re not going to get into the details of the wherefore and the why just yet, but factors will almost certainly bite you in the tush at some point. Probably multiple times. At that point, it will become old hat for you to remember to include stringsAsFactors = FALSE in functions where that matters. But we’re getting wayyyyy ahead of ourselves.
Let’s define two objects (variables) – one as a factor, and one as a string (character):
# Create `a_factor` as a factor
a_factor <- factor("hello", levels = c("hello","goodbye"))
# Take a look at what we just created. Notice the "Levels:" get listed. That's
# curious, is it not?
a_factor
## [1] hello
## Levels: hello goodbye
# And, let's check the class of the object. No surprises!
class(a_factor)
## [1] "factor"
Now, let’s define our string object:
# Create `a_string`. Since we're assigning a string value to it and not telling R
# anything special, it's going to go ahead and create it as a character class.
a_string <- "hello"
# Take a look at what we just created.
a_string
## [1] "hello"
# And, check its class. See! Character!
class(a_string)
## [1] "character"
As an example, see what happens when we try to combine a string and a factor using the c() function:
c(a_factor, a_string)
## [1] "1" "hello"
Huh? Weird!
What’s going on? Well, since a_factor is a factor, it is actually stored as a number that indicates which of its possible levels (c("hello","goodbye")) it is. When it is combined with the character, the factor is first coerced into that number, as if via as.numeric(), and then the number is coerced into a character, as if via as.character().
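The two halves of a factor, the stored number and the label it points to, can be inspected directly. This is a small sketch that recreates a_factor; the name combined is made up for the example.

```r
a_factor <- factor("hello", levels = c("hello", "goodbye"))

# The number R stores internally: the position of "hello" in the levels
as.integer(a_factor)
## [1] 1

# The label that number points to
as.character(a_factor)
## [1] "hello"

# Converting to character first gives the combination you probably wanted
combined <- c(as.character(a_factor), "hello")
combined
## [1] "hello" "hello"
```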
The upshot of this all is to be very careful in making sure your variables are the class you expect them to be!
A classic mistake is to use data.frame() or read.csv() to make a data frame from your data without setting the stringsAsFactors = FALSE argument (I told you we’d get to this!). In versions of R before 4.0.0, leaving it out means that text columns are converted to factors by default; from R 4.0.0 onward, the default is FALSE.
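Here is a minimal sketch of the read.csv() case. To keep it self-contained, it reads a tiny CSV supplied as a text string (via the text argument) rather than from a file; the names csv_text and from_csv are made up for the example.

```r
# A tiny CSV supplied as text instead of a file
csv_text <- "channelGrouping,sessions
Direct,133
(Other),19"

# Being explicit about stringsAsFactors gives the same result in any R version
from_csv <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(from_csv$channelGrouping)
## [1] "character"
class(from_csv$sessions)
## [1] "integer"
```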
<whew> Still with us? Good!
So far, all we’ve covered are the atomic classes. Things get more fun (and more powerful) when we start digging into multi-classes!
These are objects in R that work with combinations of the classes above.
A vector is a combination of the atomic elements above. You can only combine elements that are all of the same type in a vector, and you create a vector using the c() function.
a_vector <- c("a","b","c","d")
a_vector
## [1] "a" "b" "c" "d"
class(a_vector)
## [1] "character"
str(a_vector)
## chr [1:4] "a" "b" "c" "d"
The class of the vector is the same as the element!
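If you do pass mixed types to c(), R does not complain; it quietly converts everything to a single common class. A short sketch (the name mixed is made up for the example):

```r
# A number, a string, and a logical all become character
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
class(mixed)
## [1] "character"

# Numbers and logicals together become numbers
c(1, TRUE, FALSE)
## [1] 1 1 0
```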
This hints at a powerful feature of R: vectorisation. Each of the atomic elements above is actually a vector of length 1, which means that anything you can do to one element can also be done to an entire vector of the same class all at once!
An example of this:
# Sum individual elements
sum(1,2,3,3,4,5,6)
## [1] 24
# Sum a vector
a_vector <- c(1,2,3,3,4,5,6)
sum(a_vector)
## [1] 24
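The same idea applies to operations that return one result per element. Here is a brief sketch with made-up vectors called sessions and devices:

```r
sessions <- c(19, 112, 24)

# Arithmetic is applied to every element
sessions * 2
## [1]  38 224  48

# So are many functions
devices <- c("desktop", "mobile", "tablet")
toupper(devices)
## [1] "DESKTOP" "MOBILE"  "TABLET"
nchar(devices)
## [1] 7 6 6
```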
Some useful shortcuts with vectors are below:
# Make a sequence
1:10
## [1] 1 2 3 4 5 6 7 8 9 10
# The lowercase letters
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
# The uppercase letters
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q"
## [18] "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
The most common and useful R class is the data frame. Remember our web_data object from earlier? Let’s take one more look at that to confirm that it’s a data frame:
class(web_data)
## [1] "data.frame"
str(web_data)
## 'data.frame': 5732 obs. of 4 variables:
## $ date : chr "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
## $ channelGrouping: chr "(Other)" "(Other)" "(Other)" "Direct" ...
## $ deviceCategory : chr "desktop" "mobile" "tablet" "desktop" ...
## $ sessions : int 19 112 24 133 345 126 307 3266 1025 17 ...
Data frames are most often used to represent tabular data, and many R functions are designed to operate on data frames.
Data frames can be manually created using the data.frame() function:
# Names before the `=`, values after it.
my_data_frame <- data.frame(numbers = 1:5,
                            letters = c("a","b","c","d","e"),
                            logic = c(TRUE, FALSE, FALSE, TRUE, TRUE))
class(my_data_frame)
## [1] "data.frame"
str(my_data_frame)
## 'data.frame': 5 obs. of 3 variables:
## $ numbers: int 1 2 3 4 5
## $ letters: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
## $ logic : logi TRUE FALSE FALSE TRUE TRUE
Each column can hold only one class, but different columns can have different classes.
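And, uh-oh, did you see that we wound up with our characters being turned into factors? (That is the default in versions of R before 4.0.0; newer versions keep them as characters.) We can avoid this in any version by including the stringsAsFactors = FALSE argument: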
# Names before the `=`, values after it.
my_data_frame <- data.frame(numbers = 1:5,
                            letters = c("a","b","c","d","e"),
                            logic = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                            stringsAsFactors = FALSE)
class(my_data_frame)
## [1] "data.frame"
str(my_data_frame)
## 'data.frame': 5 obs. of 3 variables:
## $ numbers: int 1 2 3 4 5
## $ letters: chr "a" "b" "c" "d" ...
## $ logic : logi TRUE FALSE FALSE TRUE TRUE
You can access the individual columns of a data frame using $ notation:
# The column of numbers
my_data_frame$numbers
## [1] 1 2 3 4 5
class(my_data_frame$numbers)
## [1] "integer"
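The $ notation is not the only way to reach a column. As a short sketch, double square brackets do the same job and also accept a column name stored in another object (column_name below is a made-up name for the example):

```r
my_data_frame <- data.frame(numbers = 1:5,
                            letters = c("a","b","c","d","e"),
                            logic = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                            stringsAsFactors = FALSE)

# Double square brackets with the column name
my_data_frame[["letters"]]
## [1] "a" "b" "c" "d" "e"

# Useful when the column name is itself stored in an object
column_name <- "logic"
my_data_frame[[column_name]]
## [1]  TRUE FALSE FALSE  TRUE  TRUE

# A single value: the second element of the numbers column
my_data_frame$numbers[2]
## [1] 2
```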
Data frames are a special case of the next multi-class – the list. Data frames, at their core, are just lists where all of the columns are equal length.
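You can check the “a data frame is really a list” claim for yourself. This sketch rebuilds my_data_frame so that it runs on its own; column_lengths is a made-up name for the example.

```r
my_data_frame <- data.frame(numbers = 1:5,
                            letters = c("a","b","c","d","e"),
                            logic = c(TRUE, FALSE, FALSE, TRUE, TRUE),
                            stringsAsFactors = FALSE)

# A data frame passes the "is this a list?" test
is.list(my_data_frame)
## [1] TRUE

# And it is stored as a list under the hood
typeof(my_data_frame)
## [1] "list"

# length() counts the list elements, which are the columns
length(my_data_frame)
## [1] 3

# Every column has the same length
column_lengths <- unname(sapply(my_data_frame, length))
column_lengths
## [1] 5 5 5
```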
A list is like a data frame, but it can carry variable lengths of objects. And, list elements can be anything, including data frames or even other lists! They can get really, really confusing. But, they also can be very handy.
my_list <- list(letters_data = letters,
                numbers_data = 1:5,
                all_data = my_data_frame,
                nested = list(LETTERS))
class(my_list)
## [1] "list"
str(my_list)
## List of 4
## $ letters_data: chr [1:26] "a" "b" "c" "d" ...
## $ numbers_data: int [1:5] 1 2 3 4 5
## $ all_data :'data.frame': 5 obs. of 3 variables:
## ..$ numbers: int [1:5] 1 2 3 4 5
## ..$ letters: chr [1:5] "a" "b" "c" "d" ...
## ..$ logic : logi [1:5] TRUE FALSE FALSE TRUE TRUE
## $ nested :List of 1
## ..$ : chr [1:26] "A" "B" "C" "D" ...
Just like data frames (because, at their core, data frames are actually lists) you can access individual elements in the list using the $ symbol:
extract <- my_list$all_data
class(extract)
## [1] "data.frame"
str(extract)
## 'data.frame': 5 obs. of 3 variables:
## $ numbers: int 1 2 3 4 5
## $ letters: chr "a" "b" "c" "d" ...
## $ logic : logi TRUE FALSE FALSE TRUE TRUE
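Digging into nested or unnamed elements takes a little more than $. The sketch below uses a trimmed-down version of my_list (without the data frame, so that it runs on its own) to show double square brackets, access by position, and the difference between single and double brackets.

```r
my_list <- list(letters_data = letters,
                numbers_data = 1:5,
                nested = list(LETTERS))

# Double square brackets do the same job as $
my_list[["numbers_data"]]
## [1] 1 2 3 4 5

# Unnamed elements are reached by position; chain the steps to dig deeper
my_list$nested[[1]][1:3]
## [1] "A" "B" "C"

# Single square brackets return a smaller list, not the element itself
class(my_list["numbers_data"])
## [1] "list"
class(my_list[["numbers_data"]])
## [1] "integer"
```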
If you find an R object is the wrong class for the function you need, what can you do? This is where coercion comes into play.
All the classes have an as.this function (as.character(), as.numeric(), as.Date(), and so on), which, when you pass an R object in, will try to change it to what you need. If the conversion is impossible, it will usually either throw an error or return NA with a warning (both of which are much better than failing silently!).
Some coercion functions are shown below:
# Quotes indicate characters
as.character(-1:3)
## [1] "-1" "0" "1" "2" "3"
# 0 is FALSE, everything else is TRUE
as.logical(-1:3)
## [1] TRUE FALSE TRUE TRUE TRUE
# Character to date
as.Date("2015-01-02")
## [1] "2015-01-02"
# If your dates are in format other than YYYY-MM-DD, then you need to include
# the `format` argument
as.Date("20150102", format = "%Y%m%d")
## [1] "2015-01-02"
as.Date("12-24-2016", format = "%m-%d-%Y")
## [1] "2016-12-24"
# To change factors to numeric, be careful to go via as.character first
numeric_factor <- factor(1, levels = 5:1)
numeric_factor
## [1] 1
## Levels: 5 4 3 2 1
# This gives the result as 5, because 1 is the fifth level (the levels are 5 4 3 2 1)
wrong_factor <- as.numeric(numeric_factor)
wrong_factor
## [1] 5
# But, if we use as.character, then we get what's expected/desired.
right_factor <- as.numeric(as.character(numeric_factor))
right_factor
## [1] 1
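It is also worth seeing what a failed coercion looks like. In this sketch, suppressWarnings() and tryCatch() are used only to keep the example running; the object names are made up for the illustration.

```r
# Text that is not a number becomes NA, with a warning
not_a_number <- suppressWarnings(as.numeric("abc"))
not_a_number
## [1] NA

# Text that cannot be read as a date stops with an error
date_attempt <- tryCatch(as.Date("not a date"),
                         error = function(e) "error")
date_attempt
## [1] "error"

# It is worth checking for NA values after coercing
is.na(not_a_number)
## [1] TRUE
```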
As you start to work with R, you will find that you are working with a range of classes, and, when it comes to multi-classes, you will find yourself wanting everything to be a vector or a data frame… until you hit a use case where, all of a sudden, you need more flexibility in the structure (including data nested within other data), at which point you will find yourself in list world. Lists can be maddening… until they’re not. Do you remember when you were first learning how to use pivot tables in Excel? It’s the same thing: kinda’ confusing at first, but kinda’ genius once you’re comfortable with them!