12| Apply Functions to Vectors and Lists

Miles Robertson, 12.25.23 (edited 01.17.24)

Introduction

Sometimes, you have a function that you want to apply to every element in a vector or list. For example, in the chapter about for loops, we discussed a case where we wanted to capitalize every element in a vector of names and make sure each one had a certain suffix. As we did there, we can always apply a function to each element by using a for loop. However, there is a more concise way to do this, and it is often at least as fast. In many languages, this is referred to as mapping a function over a vector. In R, it is called applying. In this chapter, we will discuss how to apply functions to vectors and lists.

The Apply Functions

In R, there are a few different functions that can be used to apply a function to each element in a vector or list. These functions are lapply(), sapply(), vapply(), and replicate(), and they essentially do the same thing. Their differences are largely in the format of the output. In fact, sapply() and replicate() are wrappers around lapply(), meaning that they call lapply() under the hood and then try to simplify the result. vapply() behaves similarly, but requires you to state the type and length of each result up front. Generally, you will use only lapply() (the apply function that returns a list) and sapply() (the apply function that usually returns a vector, discussed more below).
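As a rough illustration of how the output formats differ, the sketch below applies sqrt() to the same three numbers with each function. The input values and variable names are chosen here purely for illustration.

```r
# Hypothetical input values, chosen so the square roots are easy to check
squares <- c(1, 4, 9)

roots.list <- lapply(squares, sqrt)                 # a list with three elements
roots.vector <- sapply(squares, sqrt)               # a numeric vector: 1 2 3
roots.checked <- vapply(squares, sqrt, numeric(1))  # each result must be a single number

# replicate() repeats an expression instead of walking over a vector
coin.flips <- replicate(5, sample(c("H", "T"), 1))  # a character vector of length 5
```

The vapply() call fails with an error if any result is not a single number, which can be useful when you want the output format to be predictable.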

Below, we will discuss a couple of ways that you can pass a function as an argument to these apply functions.

Using a Function Name

The first way to pass a function as an argument to an apply function is to simply pass the name of the function. For example, if we wanted to apply the sqrt() (square root) function to each element in a vector, we could do the following:

sapply(1:10, sqrt)

In this case, it would have been even simpler to just run sqrt(1:10), but this is a good starting point. Note that we did not include the parentheses after sqrt when used in sapply(). This is because we are not calling the function right then and there, but rather telling sapply() what to use on each element of the vector.
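To make the comparison concrete, here is a small sketch that computes the same square roots three ways: with a for loop, with sapply(), and with the vectorized call. The variable names are made up for this illustration.

```r
# The for loop version: create an empty vector, then fill it in
roots.loop <- numeric(10)
for (i in 1:10) {
    roots.loop[i] <- sqrt(i)
}

# The apply version: no parentheses after sqrt
roots.apply <- sapply(1:10, sqrt)

# The vectorized version
roots.vectorized <- sqrt(1:10)

identical(roots.loop, roots.apply)        # TRUE
identical(roots.apply, roots.vectorized)  # TRUE
```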

As mentioned before, lapply() returns a list of results, while sapply() usually returns a vector of results. Which you pick depends on what you want. For example, say you want a list of 10 vectors of random uniform numbers, but in a special structure where the first vector has 5 random numbers, the second has 6, and so on. This can be done quickly with lapply():

lapply(5:14, runif)

As you can see here, the output is a list of 10 vectors. In the sapply() example, each call to sqrt() returned a single number, so the results fit neatly into one vector. Here, each call to runif() returns a vector of a different length, so a list of vectors is the more appropriate output. In fact, if we tried to use sapply() in this case, it would give up on trying to return a vector and return a list of vectors instead.
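One way to check this structure is with lengths(), which reports the length of each element of a list. The sketch below also confirms that sapply() falls back to a list here. The variable names are chosen for illustration only.

```r
random.list <- lapply(5:14, runif)

length(random.list)   # 10, the number of vectors in the list
lengths(random.list)  # 5 6 7 8 9 10 11 12 13 14, the length of each vector

# sapply() cannot simplify vectors of different lengths, so it returns a list
attempt <- sapply(5:14, runif)
is.list(attempt)      # TRUE
```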

These functions also work with user-defined functions. For example, we can borrow the function from the previous chapter where we wanted to capitalize every element in a vector of names, and make sure it had a certain suffix:

fixDoctorName <- function (doctor.name) {
    # make all letters capitalized
    doctor.name <- toupper(doctor.name)
    # if the name doesn't end with " MD"...
    if (!endsWith(doctor.name, " MD")) {
        # ...add " MD" to the end
        doctor.name <- paste(doctor.name, "MD")
    }
    return(doctor.name)
}

Now, we can apply this function to a vector of names (this vector can be copied below):

names <- c("olivia bennett MD", "ETHAN HAYES MD", "Mia Rodriguez")

We can apply this function to each element in the vector using sapply():

sapply(names, fixDoctorName)
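When sapply() is given a character vector, it labels each result with the original element by default. The sketch below shows two ways to drop those labels if you do not want them. It repeats the function so that it can be run on its own, and it stores the names in a variable called doctor.names, which is a name chosen here so that the built-in names() function stays available.

```r
fixDoctorName <- function (doctor.name) {
    # make all letters capitalized
    doctor.name <- toupper(doctor.name)
    # if the name doesn't end with " MD", add it
    if (!endsWith(doctor.name, " MD")) {
        doctor.name <- paste(doctor.name, "MD")
    }
    return(doctor.name)
}

doctor.names <- c("olivia bennett MD", "ETHAN HAYES MD", "Mia Rodriguez")

fixed <- sapply(doctor.names, fixDoctorName)
names(fixed)   # the original, unfixed strings are used as labels
unname(fixed)  # "OLIVIA BENNETT MD" "ETHAN HAYES MD" "MIA RODRIGUEZ MD"

# Alternatively, ask sapply() not to add labels in the first place
fixed.plain <- sapply(doctor.names, fixDoctorName, USE.NAMES=FALSE)
```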

Note that in all the examples above, each element was the only argument given to the function being applied. However, you can also pass additional arguments to the function you are applying. For example, say we wanted to apply the round() function to each element in a vector, but we wanted to round to 3 decimal places. We can do this by passing the round() argument digits=3 to sapply():

sapply(1:10*pi, round, digits=3)

In case this line of code is confusing, I will write this same line of code out the "long way" below:

c(
    round(pi, digits=3),
    round(2*pi, digits=3),
    round(3*pi, digits=3),
    round(4*pi, digits=3),
    round(5*pi, digits=3),
    round(6*pi, digits=3),
    round(7*pi, digits=3),
    round(8*pi, digits=3),
    round(9*pi, digits=3),
    round(10*pi, digits=3)
)

As you hopefully see here, adding the digits=3 argument to sapply() has the effect of giving this argument to round() each time it is called. In these examples, we used the keyword digits to specify the argument, but you can also use the argument's position in the function. The round() function can have the number of decimals specified by position only, as described in the functions chapter: round(10.12345, 3). This means that the above sapply() call could also be written as follows:

sapply(1:10*pi, round, 3)

Perhaps this will be unsurprising to you, but I recommend that you specify this argument by name, rather than by position, for readability. In these apply functions, multiple arguments, not just one, can be passed to the function you are applying by position or keyword.
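As a sketch of passing more than one extra argument, the example below gives both min and max to runif() on every call, and then checks that naming digits and passing it by position give the same result. The variable names are hypothetical.

```r
# Two extra named arguments, min and max, are passed to every runif() call
bounded.draws <- lapply(3:5, runif, min=10, max=20)
lengths(bounded.draws)        # 3 4 5
range(unlist(bounded.draws))  # both values fall between 10 and 20

# Naming digits and passing it by position give the same result
by.name <- sapply(1:10*pi, round, digits=3)
by.position <- sapply(1:10*pi, round, 3)
identical(by.name, by.position)  # TRUE
```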

Using an Anonymous Function

In all the previous examples, each element of the vector or list was passed as the first argument to the function being applied. That is a constraint of the apply functions as we have used them so far: the elements of the vector or list are passed as the first argument of the function being applied. However, there is a way around this constraint. You can use an anonymous function, which is a function that is not given a name. This is done by defining a function "on the fly" using the function keyword, and not assigning that function to any variable name. For example, perhaps we want to make a list of vectors of random normal numbers, but we want to change the mean for each vector. We can do this as follows:

lapply(1:10, function (x) {rnorm(10, mean=x)})

This successfully gives us a list of ten vectors, each with ten random normal numbers, but with a different mean each time. Note that we did not assign the function function (x) {rnorm(10, mean=x)} to a variable name, which makes it an anonymous function. This function takes one argument, x, and returns a vector of ten random normal numbers with mean x. Since the first (and only) argument of the anonymous function is the element of the vector or list, this is an appropriate work-around for needing to pass each element of the vector as the first argument.
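A quick way to convince yourself that the means really differ is to apply mean() to the resulting list. This is only a sketch: the seed value and variable names are arbitrary choices made here, and the sample means will be near, not equal to, 1 through 10.

```r
# Fix the random seed so the sketch gives the same numbers each run (arbitrary value)
set.seed(1)

normal.list <- lapply(1:10, function (x) {rnorm(10, mean=x)})

lengths(normal.list)                   # ten vectors, each of length 10
sample.means <- sapply(normal.list, mean)
round(sample.means, 1)                 # close to 1, 2, ..., 10
```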

As mentioned in a previous chapter, there is a shorthand for creating functions in R (available in R 4.1.0 and later), where \ replaces the word function. Perhaps the only place where this shorthand is helpful is when using functions like lapply(), where you are passing a function as an argument. For example, the above code could be written as follows:

lapply(1:10, \(x) {rnorm(10, mean=x)})

Above, I mentioned that using sapply() instead of lapply() for one example ended with sapply() giving up on trying to return a vector and returning a list of vectors instead. Interestingly, in the case of the code shown immediately above, sapply() will actually return a matrix of values, where each column is a vector of random normal numbers with a different mean. This is because the "s" in sapply() stands for simplify: if every result is a single value, it returns a vector; if every result has the same length greater than one, it returns a matrix with one column per result; otherwise, it returns a list. In this case, every result has length ten, so the output cannot be a single vector and is simplified to a matrix instead.
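The simplification can be checked directly by looking at the dimensions of the output. In this sketch, the seed and the variable names are arbitrary, and the function keyword is used instead of the shorthand so that the code also runs on older versions of R.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

normal.matrix <- sapply(1:10, function (x) {rnorm(10, mean=x)})
is.matrix(normal.matrix)  # TRUE
dim(normal.matrix)        # 10 10, where column j holds the draws with mean j

# When every result has length one, sapply() returns a plain vector instead
single.draws <- sapply(1:10, function (x) {rnorm(1, mean=x)})
length(single.draws)      # 10
```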

To give one more example of an anonymous function used as an argument of the apply functions, say we want to delete the leading digit of each number in a vector. This can be done as follows:

sapply(c(101, 34, 555), \(x) {as.numeric(substr(x, 2, nchar(x)))})

The anonymous function uses the substr() and nchar() functions to convert each number to a string and chop off the first character. Finally, the as.numeric() function converts the string back to a number, which is now missing the leading digit. Technically, this example does not require sapply() at all, as the substr() and nchar() functions can be applied to a vector of numbers directly. However, this example is a good demonstration of how to use an anonymous function as an argument to an apply function.
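For comparison, here is a sketch of the same calculation done both ways, with sapply() and with the vectorized functions applied directly. The variable names are made up for this illustration.

```r
numbers <- c(101, 34, 555)

# One element at a time, using an anonymous function
with.sapply <- sapply(numbers, function (x) {as.numeric(substr(x, 2, nchar(x)))})

# All elements at once, since substr() and nchar() accept whole vectors
vectorized <- as.numeric(substr(numbers, 2, nchar(numbers)))

vectorized                          # 1 4 55
identical(with.sapply, vectorized)  # TRUE
```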

Conclusion

In this chapter, we discussed how the lapply() and sapply() functions can be used to apply functions to vectors and lists. We also discussed how to pass additional arguments to the function being applied, and how to use anonymous functions as arguments to the apply functions.

I will note that there are other ways to apply functions to vectors and lists, with one noteworthy omission being the mapply() function, which allows you to apply a function to multiple vectors or lists at once. However, I imagine that this chapter will get you started with understanding how to apply functions to vectors and lists, and you can learn more about the other ways to do this in the future through documentation and online research.
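To give a small taste of mapply(), the sketch below walks over two vectors in parallel. It is not covered in this chapter, so treat it as a starting point for your own reading. The function and variable names are hypothetical.

```r
# Raise each base to the matching exponent: 2^3, 3^2, 4^1
bases <- c(2, 3, 4)
exponents <- c(3, 2, 1)
powers <- mapply(function (base, exponent) {base^exponent}, bases, exponents)
powers  # 8 9 4

# Vary both the number of draws and the mean at the same time
draws <- mapply(function (n, mu) {rnorm(n, mean=mu)}, 1:3, c(0, 10, 100))
lengths(draws)  # 1 2 3
```

Note that, unlike the other apply functions, mapply() takes the function as its first argument and the vectors after it.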


Practice

Complete Function Mapping Prompts

Use the functions discussed above to complete the following prompts.

  1. Apply the seq() function to make a list of vectors, where each vector is a sequence of numbers from 1 to 10, with each vector having a length of 10 plus the index number.
  2. Apply the rep() function to make a 5-by-5 matrix of characters, where each column is a vector of the same character repeated 5 times, and each column has a different character. Get these characters from the letters vector.
  3. Apply the rnorm() function to make a list of 5 vectors, where each vector has 10 random normal numbers with the mean equal to the index number of the entry and a standard deviation of 1. Then, apply the mean() function to this list to get a vector that shows what the means of each vector actually turned out to be.

Redo Earlier Practice Problem

In a practice problem of an earlier chapter, I asked you to create a list where each element is a row of Pascal's triangle. Here, we will redo this problem using apply functions.

First, I will introduce the choose() function. This function takes two arguments, n and k, and returns the number of ways to choose k items from a set of n items. For example, say you have five friends and you want to choose two of them to go to the movies with you. There are choose(5, 2) ways to do this, or 10 ways. This is connected to Pascal's triangle, as the kth entry of the nth row of Pascal's triangle is choose(n, k), where both the rows and the entries within a row are counted starting from 0.
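As a quick check of this connection, the sketch below writes out row 4 of Pascal's triangle the long way, one choose() call per entry. The variable name is chosen for illustration.

```r
choose(5, 2)  # 10 ways to pick two friends out of five

# Row 4 of Pascal's triangle, with k counted from 0 to 4
pascal.row.4 <- c(
    choose(4, 0),
    choose(4, 1),
    choose(4, 2),
    choose(4, 3),
    choose(4, 4)
)
pascal.row.4  # 1 4 6 4 1
```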

Next, write a function called getPascalRow that takes one argument, n, and returns a vector of length n + 1 where each entry is choose(n, k), with k running from 0 to n. You can do this by applying the choose() function to a vector of numbers from 0 to n.

Now apply getPascalRow to a vector of numbers from 1 to 10 to get a list of vectors, where each vector is a row of Pascal's triangle.

Apply Functions to Columns of a Data Set

Imagine you have a data frame where each column is a basketball team and each of the 10 rows is the number of points scored in a game. I made a fake data set with this structure that you can copy below:

points.data <- data.frame(
    team1 = c(62, 110, 93, 121, 105, 83, 92, 101, 110, 109),
    team2 = c(91, 93, 65, 102, 93, 92, 98, 83, 95, 175),
    team3 = c(81, 96, 107, 111, 53, 92, 94, 106, 92, 120),
    team4 = c(110, 80, 92, 86, 78, 85, 87, 63, 72, 73),
    team5 = c(90, 93, 25, 50, 120, 92, 93, 92, 93, 92)
)

Using boxplot(points.data), you can get a rough visualization of the data. You may be interested in finding each team's average score. This can be done by simply running colMeans(points.data), which will return a vector of the average score for each team. This is equivalent to running sapply(points.data, mean), which applies the mean() function to each column of the data frame. However, the box plot shows that there are outliers for each team, and averages are very sensitive to outliers. Thus, an average may not be appropriate for our purposes, most notably since team 2 scored 175 points in one game and team 5 scored 25 points in another.
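The equivalence works because a data frame is a list of columns, so the apply functions visit one column at a time. The sketch below repeats the data set so it can be run on its own, and the result variable names are made up here.

```r
points.data <- data.frame(
    team1 = c(62, 110, 93, 121, 105, 83, 92, 101, 110, 109),
    team2 = c(91, 93, 65, 102, 93, 92, 98, 83, 95, 175),
    team3 = c(81, 96, 107, 111, 53, 92, 94, 106, 92, 120),
    team4 = c(110, 80, 92, 86, 78, 85, 87, 63, 72, 73),
    team5 = c(90, 93, 25, 50, 120, 92, 93, 92, 93, 92)
)

means.builtin <- colMeans(points.data)
means.applied <- sapply(points.data, mean)

# Both are numeric vectors labeled with the team names
all.equal(means.builtin, means.applied)  # TRUE
names(means.applied)                     # "team1" "team2" "team3" "team4" "team5"
```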

There is a concept of the trimmed mean, which is the mean of a vector after removing the top and bottom x percent of the values. This is useful for removing the influence of outliers on a measurement that aims to establish what is a typical value for the data set. For example, say you have a vector of 100 numbers called number.data, and you are looking to calculate the 5% trimmed mean. You could do so as follows:

percent.trim <- 0.05

number.data <- sort(number.data)
data.length <- length(number.data)
trim.length <- round(data.length * percent.trim)

lower.bound <- trim.length + 1
upper.bound <- data.length - trim.length

trimmed.mean <- mean(number.data[lower.bound:upper.bound])

This code first sorts the vector, then calculates the number of values to trim from the top and bottom of the vector. Then, it calculates the indices of the values to include in the trimmed mean, and finally calculates the trimmed mean. This code can be wrapped in a function called getTrimmedMean as follows:

getTrimmedMean <- function (number.data, percent.trim) {
    number.data <- sort(number.data)
    data.length <- length(number.data)
    trim.length <- round(data.length * percent.trim)

    lower.bound <- trim.length + 1
    upper.bound <- data.length - trim.length

    trimmed.mean <- mean(number.data[lower.bound:upper.bound])
    return (trimmed.mean)
}
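Before using this function on the basketball data, it may help to try it on a vector where the answer is easy to work out by hand. The sketch below uses a made-up vector of 100 numbers with one large outlier, and repeats the function so it can be run on its own. It also compares against the trim argument of R's built-in mean(), which happens to agree here but may differ in other cases because it rounds the number of trimmed values down.

```r
getTrimmedMean <- function (number.data, percent.trim) {
    number.data <- sort(number.data)
    data.length <- length(number.data)
    trim.length <- round(data.length * percent.trim)

    lower.bound <- trim.length + 1
    upper.bound <- data.length - trim.length

    return(mean(number.data[lower.bound:upper.bound]))
}

# Hypothetical data: the numbers 1 to 99, plus one outlier of 1000
example.data <- c(1:99, 1000)

mean(example.data)                  # 59.5, pulled upward by the outlier
getTrimmedMean(example.data, 0.05)  # 50.5, the mean of 6 through 95
mean(example.data, trim=0.05)       # 50.5, using the built-in trim argument
```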

Apply the getTrimmedMean function with a 10% trim to calculate the trimmed mean for each team in points.data. Compare these results to the output of colMeans(points.data) to see if it made a meaningful difference in the ranking between teams.