Once you are past the basics, you should be looking at creating functions for your common tasks. Why create functions?
A good rule of thumb is if you are performing the same task more than twice, look to create a function to do it.
The “good names” point is more powerful than it sounds - you will start to create building blocks for yourself that can in themselves be built upon.
For example, here are some functions you might create to download Google Analytics data and upload it to your own private database:
## returns dataframe of GA data if successful
my_data <- get_google_analytics_data(my_view_id = 12345)
## returns upload details if successful
upload_result <- upload_to_database(my_data)
## returns TRUE if successful
email_result <- email_result(upload_result, email = "me@mycompany.com")
With good naming, commenting code becomes (almost) unnecessary.
Later on, you may decide to generalise the above for any viewId:
upload_and_email <- function(view_id, email){
  my_data <- get_google_analytics_data(my_view_id = view_id)
  upload_result <- upload_to_database(my_data)
  email_result <- email_result(upload_result, email = email)
}
Now, using lapply() you can download and email several Google Analytics views with a couple of lines of R:
viewIds <- c(12345, 345453, 789002)
lapply(viewIds, upload_and_email, email = "me@mycompany.com")
As you abstract away inner functions, higher level thinking is encouraged, building on your past successes.
As ever, Hadley Wickham’s Advanced R book is my prime reference for R concepts, in particular its chapter on functions.
When writing a function, you are actually assigning a new type of object to a variable, just as you would for, say, data. However, once you have assigned a function, you can then get that function to operate on other objects by using () afterwards:
name_of_your_function <- function(one, two){ ## add as many arguments as you need
  ## your code goes here, and can refer to the arguments
  result <- one + two
  ## the function will return the last object declared in the function
  result
}
## call your function
name_of_your_function(one = 1, two = 2)
R has a shortcut that means you don't need to specify the name of an argument if it is supplied in the right position, so the above call can also be written:
name_of_your_function(1, two = 2)
You can also define functions with no arguments:
my_function <- function(){
  ## I just do stuff without needing arguments
  print(mtcars)
}
## call the function
my_function()
R has some unusual scoping behaviour compared to other languages. Variables you declare outside of a function can, by default, affect what happens inside it. This can cause confusing errors, so it's worth highlighting:
x <- 1
scope_example <- function(){
  ## this shouldn't work
  x
}
scope_example()
## [1] 1
This is linked to the concept of environments in R, and lexical scoping. It is an advanced topic, but briefly, it's important to know that R evaluates code in the context of the function, and if it doesn't find a particular object there it looks in the parent frame, all the way up to the global environment. As such, keep an eye on what is in your global environment versus what is inside a function: you may forget to define a variable in a function, R will then find it elsewhere, and that can be very confusing when debugging.
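For example, here is a small sketch of the kind of mistake this can hide (the variable names are made up for illustration):
total <- 100
add_tax <- function(amount){
  ## oops - we meant to use 'amount' here, but typed 'total'
  ## R doesn't complain, because it finds 'total' in the global environment
  total * 1.2
}
add_tax(10)
## [1] 120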
Often functions call other functions within them. If an inner function relies on arguments from the function above, you could laboriously copy all the arguments down to the function that needs them:
func1 <- function(a, b, c){
  my_thing <- func2(a = a, b = b, c = c)
}
…or you can use the ... construct, which will pass the arguments on for you:
func1 <- function(...){
  my_thing <- func2(...)
}
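To make that concrete, here is a minimal runnable sketch (wrap_paste is a made-up name for illustration) where ... forwards whatever arguments it receives straight through to paste():
wrap_paste <- function(...){
  ## everything passed to wrap_paste() gets forwarded to paste()
  paste(..., sep = "-")
}
wrap_paste("GA", "2017", "report")
## [1] "GA-2017-report"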
Create a function that takes a numeric vector and prints out the max, min, mean and median. You can use cat() to print out to the console. Here is a starting template:
my_summary <- function(x){
  ## your code here
  cat("Your result")
}
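If you want to compare notes afterwards, here is one possible solution (there are many valid ways to write it):
my_summary <- function(x){
  ## cat() prints each statistic to the console
  cat("Max:", max(x), "\n")
  cat("Min:", min(x), "\n")
  cat("Mean:", mean(x), "\n")
  cat("Median:", median(x), "\n")
}
my_summary(c(1, 5, 10, 20))
## Max: 20
## Min: 1
## Mean: 9
## Median: 7.5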
Once you have the basics, the best way to learn is to examine what others are doing.
Every function you use in R has its code available if you issue the function name with no brackets e.g.
library(googleAnalyticsR)
ga_account_list
## function ()
## {
## url <- "https://www.googleapis.com/analytics/v3/management/accountSummaries"
## acc_sum <- gar_api_generator(url, "GET", data_parse_function = parse_ga_account_summary)
## acc_sum()
## }
## <environment: namespace:googleAnalyticsR>
However, some are easier to read than others. Some functions, including many that are fundamental to base R, are written in C and are called primitives - the R function simply calls that underlying code. These functions won't return much of use:
sum
## function (..., na.rm = FALSE) .Primitive("sum")
But a lot of R functions are available on GitHub, which is rapidly becoming standard practice for R packages. All of Hadley Wickham's packages have their functions available there, for example - find them on GitHub and look within the R folder (we look more into navigating R package structure later).
Now that you are looking at R code, one thing worth mentioning is R methods. If you have some programming experience you may be familiar with the concept of object-oriented programming. R has several implementations of this, but the most popular is the S3 system, which we touch on briefly today.
In brief, you may see some references to UseMethod in some code. This acts as a signpost that decides what actual code to run against the passed-in object, judged on its class (remember those?). For instance, the same function could act differently if you pass it an object of class character or of class numeric.
UseMethod is the signpost, but where is the destination? R looks for functions that have the same name as the original function, but with a .classname suffix, e.g.
my_function <- function(obj){
  UseMethod("my_function")
}

my_function.data.frame <- function(obj){
  ## this function will act if obj is of class data.frame
}

my_function.character <- function(obj){
  ## this function will act if obj is of class character
}
## An end user will only need to remember one function
my_dataframe <- data.frame(blah = c(1,2,3), foo = c("a","b","c"))
my_character_vector <- c("a","b","c")
my_function(my_dataframe)
## NULL
my_function(my_character_vector)
## NULL
This offers several advantages for clean code, especially if you are used to this style of programming in other languages, but won’t be covered today. See here for more details if you are keen.
There are some good habits I have developed over time that help guard against too many bugs creeping into your code.
The basic premise is that you want to know as soon as possible when something is wrong, and to print an informative error message so you know what's up. A somewhat valid criticism of R is that its error messages are obscure. By creating your own, you can do something to help yourself and the users of your functions know what's wrong.
One big gotcha with R is you can assign names to anything, including names already used by base functions. This means you can do evil things like this:
## very evil
`+` <- `-`
3 + 1
## [1] 2
## go back again
rm(`+`)
3 + 1
## [1] 4
Whilst that is an extreme example, more common are variables such as c or data, which will happily accept your assignment and then throw an obscure error when you forget to assign them later on. It is best to avoid using these names altogether unless you really mean to.
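For example, if you forget to create an object called data and then try to subset it, R finds the base data() function instead and produces one of its more famously obscure errors:
## 'data' was never assigned, so R finds the base function data() instead
data[1, ]
## Error in data[1, ]: object of type 'closure' is not subsettable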
A key element of this is the use of the function stop(). This, as it says, stops the function and prints out an error message of your choosing:
sum_safe <- function(x, y){
  ## if x and y aren't both numeric
  if(!all(is.numeric(x), is.numeric(y))){
    ## raise an error
    stop("Need numerics for the sum!")
  }
  sum(x, y)
}
sum_safe(1,2)
## [1] 3
sum_safe("1",2)
## Error in sum_safe("1", 2): Need numerics for the sum!
A shortcut for this common task ("if this is not true, raise an error") is the stopifnot() function, although you can't set a custom error message:
sum_safe <- function(x, y){
  stopifnot(is.numeric(x), is.numeric(y))
  sum(x, y)
}
sum_safe(1,2)
## [1] 3
sum_safe("1",2)
## Error: is.numeric(x) is not TRUE
As always, Hadley has an alternative that gives better error messages than stopifnot() - assertthat:
library(assertthat)
sum_safe <- function(x, y){
  assert_that(is.numeric(x), is.numeric(y))
  sum(x, y)
}
sum_safe(1,2)
## [1] 3
sum_safe("1",2)
## Error: x is not a numeric or integer vector
You can also write your own checks with error messages, like these examples:
is_odd <- function(x) {
  assert_that(is.numeric(x), length(x) == 1)
  x %% 2 == 1
}
assert_that(is_odd(2))
# Error: is_odd(x = 2) is not TRUE
on_failure(is_odd) <- function(call, env) {
  paste0(deparse(call$x), " is even")
}
assert_that(is_odd(2))
# Error: 2 is even
Errors that fail as soon as something is wrong mean you will be closer to where the problem occurred when debugging.
Sometimes you don’t want to stop the program, but rather do something else if an error is detected.
A good use case for this is when fetching from an API, as you can't always guarantee the API will return what you expect. Wrapping your call in a try() command means that instead of an error you will get an object of class try-error. You can then test for this and react accordingly.
assertthat also adds some missing checks that can be useful, such as is.error(), which you can use with a try(); otherwise you can use the more verbose base R equivalent, inherits(x, "try-error"):
library(assertthat)
get_data_safe <- function(file_name){
  ## check the input is a character file name
  assert_that(is.character(file_name))
  read_file <- try(read.csv(file_name))
  if(is.error(read_file)){
    message("Something went wrong, lets try something else")
    read_file <- mtcars
  }
  ## return the file, or mtcars if it didn't find it
  read_file
}
my_data <- get_data_safe("nofile.csv")
## Warning in file(file, "rt"): cannot open file 'nofile.csv': No such file or
## directory
## Something went wrong, lets try something else
Armed with the above, a good habit is to always check that the inputs (and outputs if you like) are exactly what you expect. Since the majority of R errors are caused by unexpected types, this should help you mitigate against weird bugs.
As standard now, I always look to check the inputs at the beginning of any function I create, and give an error or warning if they are not what I expect.
library(assertthat)
extract_xy <- function(data_frame, column_name, row_number){
  assert_that(
    is.data.frame(data_frame),
    is.character(column_name),
    is.numeric(row_number)
  )
  data_frame[row_number, column_name]
}
extract_xy(mtcars, column_name = "mpg", row_number = 3)
## [1] 22.8
Another tool for this is the function match.arg(), which lets you limit the choices an argument can have to a vector of choices you provide. An example of how it is used is below:
library(assertthat)
extract_xy <- function(data_frame,
                       ## the accepted values go in the argument's default
                       column_name = c("mpg", "cyl", "disp"),
                       row_number){
  ## the syntax to make sure column_name is only from the accepted values
  column_name <- match.arg(column_name)
  assert_that(
    is.data.frame(data_frame),
    is.character(column_name), ## no real need for this now since we hardcoded the choices
    is.numeric(row_number)
  )
  data_frame[row_number, column_name]
}
extract_xy(mtcars, column_name = "mpg", row_number = 3)
## [1] 22.8
## errors as column_name is not in the match.arg vector
extract_xy(mtcars, column_name = "foo", row_number = 3)
## Error in match.arg(column_name): 'arg' should be one of "mpg", "cyl", "disp"
With good defensive programming techniques, you can be more confident that your functions are getting the data you expect, but you will still probably need to debug as you go. Getting a quick, iterative process for this is key, as unfortunately the time split is usually 90% of the code programmed in 20% of the time (this is the fun bit), with the remaining 80% of the time spent debugging 10% of your code. For the speedy delivery of useful programs, getting this debugging time down is essential.
Below are some tips to help with this:

- Use browser() to examine the state of a function where it's going wrong - use RStudio's breakpoints or insert the line browser() where you want the program to stop. You can then check variables in the environment of the function using RStudio's Environment pane, try executing lines to replicate errors, etc. (see the sketch after this list).
- Use message() or cat() commands to print out what the arguments are, to see if they are as you expect. Comment them out again afterwards as needed, although sometimes it's nice to leave them in for user feedback.
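Here is a minimal sketch of the browser() approach - the function and its bug are made up for illustration:
divide_metrics <- function(sessions, users){
  ## execution pauses here so you can inspect sessions and users interactively
  browser()
  sessions / users
}
## when run, R drops into debug mode at the browser() line
divide_metrics(100, 0)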
Rewrite the extract_xy function below so it also gives custom errors:

- if column_name is not in the data_frame: {column_name} is not in {data_frame}
- if row_number is not in the data_frame: {row_number} is not in {data_frame}
library(assertthat)
extract_xy <- function(data_frame,
                       column_name = c("mpg", "cyl", "disp"),
                       row_number){
  assert_that(
    ## insert your checks here
  )
  data_frame[row_number, column_name]
}
## this should give the custom errors
extract_xy(mtcars, column_name = "blah")
extract_xy(mtcars, row_number = 55)
extract_xy(mtcars, column_name = "blah", row_number = 55)
Compare with base R, where you instead get three different classes of result (an error, NAs and NULL):
mtcars[, "blah"]
## Error in `[.data.frame`(mtcars, , "blah"): undefined columns selected
mtcars[55, ]
## mpg cyl disp hp drat wt qsec vs am gear carb
## NA NA NA NA NA NA NA NA NA NA NA NA
mtcars[55, "blah"]
## NULL
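If you get stuck, here is one possible solution sketch - it assumes the custom messages are built with assert_that()'s msg argument, and your own approach may well differ:
library(assertthat)
extract_xy <- function(data_frame,
                       column_name = c("mpg", "cyl", "disp"),
                       row_number){
  ## custom error if the requested column(s) aren't in the data.frame
  assert_that(
    all(column_name %in% names(data_frame)),
    msg = paste(column_name, "is not in the data_frame")
  )
  ## custom error if the requested row doesn't exist in the data.frame
  assert_that(
    row_number <= nrow(data_frame),
    msg = paste(row_number, "is not in the data_frame")
  )
  data_frame[row_number, column_name]
}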