An introduction to functions in R
Published:
What you’ll learn
- What functions in R do, and how to write some of them.
- When to use functions!
I am not an expert on functions, and have been lucky to learn from well-written online resources. Hadley Wickham’s chapter on functions in Advanced R provides a helpful extension to this tutorial for those interested in writing more complex and useful functions. Wickham and Garrett Grolemund also write a chapter on functions in their introductory text, R for Data Science. Ken Rice and Thomas Lumley have an approachable slide deck that also includes an intro to Shiny in R — a useful way to display interactive outputs (a tutorial on Shiny is coming soon!).
What’s a function?
You use functions every single time you code, whether they are in base R, or a package you load in. For example, every time you have used the dplyr
library to mutate
a new column in a dataset, you are calling on a function. Lucky for you, the kind folks who created dplyr
have already coded out what exactly mutate
does, so all you have to do is call the function. Yet, there are some times when the right function just isn’t out there, and you’ll need to write one yourself. How would you even start?
Well, let’s start with the basic syntax of a function. Functions need a name, arguments, and a body. The name of the function is what term you will use to call it later. Arguments are values, dataframes, or other entities you feed into functions to be worked upon. The body of a function is a set of instructions telling R exactly what to do to your arguments.
Seems complicated? It can be! But it can also be really simple. See below:
# We first have to name our function.
# I named mine "multiply" Try to
# pick an informative name that you'll
# remember.
<- function(argument1, argument2)
multiply # We then need to tell our
# function what arguments we
# pass through it!
*argument2}
{argument1# what happens to the arguments
# we are calling the function here
# and setting argument1 = 1 and
# argument2 =2.
multiply(1,2)
## [1] 2
# I could have also written:
# multiply(argument1 = 1, argument2 = 2)
A function can be as easy as argument1*argument2! You can even nest functions within each other:
<- function(number1, number2)
add + number2}
{number1
add(multiply(1,2), 4)
## [1] 6
# What's going on here?
# number1 = multiply(1,2)
# Try to do the math manually
# to verify that the answer makes
# sense.
More complex functions can use if else statements, for loops, and a host of other fun R quirks! These function types can differ from the syntax we define above.
If else functions
If else functions can take on a different syntax than our standard function. See below:
= 7
x
ifelse(x > 6, TRUE, FALSE)
## [1] TRUE
We read the above like this: “If x is greater than 6, return TRUE. Else, return FALSE.” Because we have defined x = 7, we get TRUE in this case! This function also works if x is a list of many numbers, and will return TRUE and FALSE values for each number included in x — though the output is messy:
<- c(-5:19)
x # Let x be a vector ranging from -5 to 19.
# Feel free to print x to get a closer look.
ifelse(x > 6, TRUE, FALSE)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [25] TRUE
For loops
Let’s say I’m trying to brainstorm potential restaurants to bring a date and I need to decide on a type of food. Now, this is a nonsensical way for anyone but a social scientist to do this (just ask your date what they like?). I’ve made a list, called food
that contains types of food I’ve been craving.
<- c("Indian",
food "Chinese",
"Japanese",
"Laotian",
"Italian",
"Mexican")
# For each element in list food, print!
for(x in food){
print(x)
}
## [1] "Indian"
## [1] "Chinese"
## [1] "Japanese"
## [1] "Laotian"
## [1] "Italian"
## [1] "Mexican"
My little function prints my list again! Let’s see how we can adapt our function to be more useful. Assume I took my previous advice and decided to communicate with my date. They gave me a list of food they’ve been craving, and I’ve written it down in a list called food_beloved
. How do I figure out what types of food we both like? Well. . . we could write a function! We’ll need to use an if statement.
<- c("Indian",
food "Chinese",
"Japanese",
"Laotian",
"Italian",
"Mexican")
<- c("Laotian",
food_beloved "Chinese",
"French",
"Ethiopian")
# For each element in list food ---
# IF the element is also in food_beloved,
# print it!
for(x in food){
if (x %in% food_beloved){
print(x)
} }
## [1] "Chinese"
## [1] "Laotian"
Alright! Chinese or Laotian it is then. Dating aside, this type of function can be really useful when you need to iterate through a list or a column of a dataframe and do the same operation to each element. As you can see, that operation can be a type of filtering (like above), but it could also be a data transformation or changing a data type! A lot of functions written by extremely smart people already do this for us (take the case_when
function, for example), but it’s still good to know how to write them in case you’re doing some really newfangled stuff.
Now, I suppose I should mention that for loops have a terrible PR team. They get bad press because coders believe they are less efficient than functions that use lapply
, though this is only partially true (see Hadley Wickham’s post for more details). Let’s look at some example lapply
functions now.
Functions using lapply
I’ve missed my date and am trying to rewrite my function using lapply
. How would I do it? Well first, I would need to write a function the old-fashioned way — using the syntax we first established at the beginning of this tutorial. Then, I’ll need to let lapply
apply that function to all elements of a list (like food
!).
See if you can tell what’s going on below!
<- function(x){
firstdate if(x %in% food_beloved){
print(x)
}
}
invisible(lapply(food, firstdate))
## [1] "Chinese"
## [1] "Laotian"
# I use invisible to suppress lapply
# printing the output you see
# and additional output, but sometimes you
# want all that information!
# Try running this code without invisible(), and see
# what output you get!
Okay— new problem. You wrote a test but really messed up the wording of one question (Tocqueville is hard to spell, and you managed to beef it so badly no student recognized his name on the exam). You’re not cruel or unusual so you want to give students back points for this question. It was worth 5 points, and you have all your students’ grades in a handy list.
# You should really get them to come to office hours
<- c(50, 90, 85, 95, 71, 83, 90)
grades
<- function(x){
boost
# Give everyone 5 points back please!
+ 5
x
}
# Please apply the boost function
# to each element of grades
lapply(grades, boost)
## [[1]]
## [1] 55
##
## [[2]]
## [1] 95
##
## [[3]]
## [1] 90
##
## [[4]]
## [1] 100
##
## [[5]]
## [1] 76
##
## [[6]]
## [1] 88
##
## [[7]]
## [1] 95
Now, an astute reader may see some problems with this function. If a student got the bogus question correct by chance, they shouldn’t receive an additional five points, no matter how terrible your test-writing abilities are. You can specify all these limitations with if else statements. It is also worth nothing that if your grades come in data frame form, you could easily use dplyr
to fix this problem!
Now that you know a bit about functions, what are the reasons for and against using them while you code?
Reasons you should use functions
It can make code less tedious to write.
This is dependent upon your coding and/or drafting style. If tedium makes you unwilling to code, then it may make sense to sit and think about functions before brute forcing a problem.
It makes your code more intelligible because there’s less of it.
Have you ever looked back at old code only to realize it’s 1000 lines long and you didn’t annotate anything and oh my god how are you going to make this work? Yeah. Functions can help a bit with that! Nothing is a substitute for good documentation, but the longer your code is, often the harder it is to interpret. If a function can shorten your code without making it harder to understand, you should consider writing some! It will also make re-reading your code less painful.
You need to write one to accomplish your data wrangling task.
There are some tasks that just can’t reasonably be accomplished without a function! This is why smart people write R packages to help us (thank you dplyr
). But sometimes, even packages written by others are not enough — you need to write your own code. What if you need to load in a bunch of data saved on your computer stored in separate CSV files. Luckily for you, the data is named in an organized way that that is patterned (“data1.csv,” “data2.csv,” you get the picture). You don’t want to laboriously download each CSV one by one (that’s a line of code for each CSV!). Sometimes you can brute force a solution, but if you have enough literal or metaphorical CSVs, you’ll need to write a function instead. If you think a task will take you many hours of mindless typing, stop and consider writing a function instead! We use code to make our lives easier and less repetitive, so unless you want to go back to running regressions by hand, try and take the lazier way out — write a function!
Reasons you may not want to use functions
You’re drafting and it slows down your coding flow.
I’m of the strong opinion that everyone codes differently. If you are collaborating with someone, you may want to talk about best group coding practices, but it is okay if your style of coding differs from your peers. When I first write my code, I rarely use functions. Functions are added upon review or if I need to accomplish a task that would be incredibly tedious without a handy function. I think this is because of my writing and coding style! When I write (or code), I want to get a first draft out as soon as possible. After that, I can go back and edit for efficiency, clarity, and style. You may code differently and find it helpful to go slowly and be as meticulous as possible. That’s also totally fine! But tl;dr: if writing functions is making your workflow slow, save function writing for the editing phase!
It makes your code harder to understand, even with annotations.
Sometimes, consolidating code with functions can make the code difficult to read, especially if the functions are complex. Many R users will be familiar with what the mutate
function does, but they won’t be familiar with a function you wrote! Make sure to clearly document your functions and use annotations to explain how you’re writing them and why you are using them. I prize code intelligibility above most other things, but your mileage may vary.
The functions you’re writing are only being used once or twice, and don’t shorten the code all that much.
Functions should make your code easier to write, read, and understand. If you write a function only to use it once or twice, you might want to reconsider. Now, there are functions designed to only be used once (for example, take a function that loads a bunch of data based on a file name pattern). That is absolutely okay if it makes your life noticeably easier! However, writing a data wrangling function and only using it once usually defeats the point of writing the function!
An example: let’s consolidate our line chart code
We will be adapting code from my first tutorial, Plotting trends over time with the CES. Make sure you have the code from this tutorial at the ready (the entire script can be found at the bottom of the linked page).
If you look at the code from the tutorial in question, we repeat a lot of the same analysis on different subsets of data. For example, we use the survey package to analyze Democrats and Republicans’ attendance of political meetings, party donations, and putting up campaign signs. See this redundant code below:
# pol_meet By Party
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet ~year,
survey,
svymean, na.rm = TRUE))
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_rep ~year,
subset(survey, pid3 == 2),
svymean, na.rm = T)) %>%
mutate(party = "Republican")
<- as.data.frame(svyby(~pol_meet_recode,
pol_meet_dem ~year,
subset(survey, pid3 == 1),
svymean, na.rm = T)) %>%
mutate(party = "Democrat")
<- pol_meet_rep %>%
pol_meet_party bind_rows(pol_meet_dem)
# donate_candidate by Party
<- as.data.frame(
donate_candidate_rep svyby(~donate_candidate_recode,
~year,
subset(survey, pid3 == 2),
na.rm = T)) %>%
svymean, mutate(party = "Republican")
<- as.data.frame(
donate_candidate_dem svyby(~donate_candidate_recode,
~year,
subset(survey, pid3 == 1),
svymean, na.rm = T)) %>%
mutate(party = "Democrat")
<- donate_candidate_rep %>%
donate_candidate_party bind_rows(donate_candidate_dem)
# These nearly identical code chunks repeat
# for every variable we're interested in analyzing
# by party. Surely there's an easier way!
How might we consolidate this code? Luckily, writing a function will be quite easy once we see the pattern in our script. We’ve already done the hard work!
# All of this code uses work we've
# already done in the previous coding chunk
<- function(measure){
party_analysis
<- as.data.frame(
rep svyby(~measure,
~year,
subset(survey, pid3 == 2),
FUN = svymean,
keep.names = F,
na.rm = T)) %>%
mutate(party = "Republican")
<- as.data.frame(
dem svyby(~measure,
~year,
subset(survey, pid3 == 1),
FUN = svymean,
keep.names = F,
na.rm = T)) %>%
mutate(party = "Democrat")
<- rep %>%
party bind_rows(dem)
# Hey R! Return party when you're done with
# this function.
return(party)
}
party_analysis(pol_meet_recode)
## Error in eval(predvars, data, env): object 'pol_meet_recode' not found
Uh. . .okay. Let’s try this a different way?
party_analysis(ces_participation$pol_meet_recode)
## Error: Must subset elements with a valid subscript vector.
## i Logical subscripts must match the size of the indexed input.
## x Input has size 8932 but subscript has size 366893.
Oh no! Not only did our function break— it broke in two distinct ways when we tried to fix it! The first error we got when running party_analysis(pol_meet_recode)
was one that made some sense. pol_meet_recode
isn’t an object in our R environment. It’s a column in ces_participation
and part of a svyobject
we created earlier in our code. However, when we try to call it by linking it to its parent data frame (ces_participation$pol_meet_recode
), we get this super weird error about subscripts. What’s going on?
Reader, I did not know what was going on. But I found out! I learned for you! It turns out that functions can sometimes be really picky. Any function you write that uses a data frame column name as an argument is likely to break unless you use special notation (read more on this by Bryan Shalloway). This is because R functions need to know where to look for their arguments. They default to the R environment. . . and columns just aren’t there!
But it turns out Shalloway’s expert advice on functions using column names won’t help us here. At least, it won’t fix the whole thing. Our obstacle has to do with the survey
package, and any other packages that use formulas. The package takes arguments formatted in a very specific way ~variable
, which proves troubling for us. To get around it, we need to specify that the function takes a character string as.formula()
, and use the paste
function to prefix the relevant variable with a ~
. The eval and get functions are a bit more complicated. Essentially, I needed to tell R to get the character string I give it, evaluate it from the environment, and then slap a good ole tilde right in front of it before my function can run.
If any coding wizards see this tutorial and can think of a better way to do this, please let me know. I will include it in this tutorial and it will likely save everyone a big headache.
Until a genius comes by, the below workaround works well! Special thanks to this Stack Overflow answer by Ter for the coding approach. Without them, we would all be lost. (Also, Ter both asked and answered his own Stack Overflow question — which I feel deserves some kind of Nobel Prize).
<- function(measure){
party_analysis
<- as.data.frame(
rep svyby(
as.formula(
paste("~" , eval(get("measure")))),
~year,
subset(survey, pid3 == 2),
FUN = svymean,
keep.names = F,
na.rm = T)) %>%
mutate(party = "Republican")
<- as.data.frame(
dem svyby(
as.formula(
paste("~" , eval(get("measure")))),
~year,
subset(survey, pid3 == 1),
FUN = svymean,
keep.names = F,
na.rm = T)) %>%
mutate(party = "Democrat")
<- rep %>%
party bind_rows(dem)
return(party)
}
If this function actually works, we should be able to use it on any political participation measure, and really consolidate our code. Sounds excellent!
<- party_analysis(measure = "pol_meet_recode")
pol_meet_party_2 pol_meet_party_2
## year pol_meet_recode se party
## 1 2008 0.14114705 0.004002461 Republican
## 2 2010 0.17256445 0.004274271 Republican
## 3 2012 0.12910268 0.004079811 Republican
## 4 2014 0.11547733 0.003905134 Republican
## 5 2016 0.10185321 0.003571708 Republican
## 6 2018 0.11007146 0.003635247 Republican
## 7 2020 0.07620658 0.003123228 Republican
## 8 2008 0.14488431 0.004163190 Democrat
## 9 2010 0.13199639 0.003515533 Democrat
## 10 2012 0.11290912 0.003488991 Democrat
## 11 2014 0.12065632 0.003403810 Democrat
## 12 2016 0.12543229 0.003713778 Democrat
## 13 2018 0.14307637 0.003720745 Democrat
## 14 2020 0.08235351 0.002771559 Democrat
Hooray! Our function actually works. But how do we know it works as intended? There are many ways to test your functions, and the way you test them is dependent on what the function is meant to do. Actually, there is a whole package devoted to testing your function and your code, which you can read more about here.
We won’t use this package for this tutorial. Luckily, we know what the function is supposed to do, and we have a version of the code that does not use a function. pol_meet_party_2
should be the exact same data frame as pol_meet_party
, but how do we verify that? We can use anti_join
, which checks to see what does not match between two data frames. Just like other joins, anti_join
works by matching on certain columns. If you are trying to check if two data frames are identical, the function should be matching on all columns, which it will list for you in your console.
anti_join(pol_meet_party_2, pol_meet_party)
## Joining, by = c("year", "pol_meet_recode", "se", "party")
## [1] year pol_meet_recode se party
## <0 rows> (or 0-length row.names)
# anti-join returns no results! looks like
# we have two identical data frames!
That brings us to the end of our introductory tour of functions! This is a sprawling topic, and you’ll likely learn more about functions as new and terrifying data problems arise when you work on projects. And remember, if you write a host of functions you think will be useful to the coding community, you can publish them in a package yourself! Think how much more dismal coding would be without your favorite packages — you could really make somebody’s day!