11| Set Operations

Miles Robertson, 12.25.23 (edited 01.17.24)

Introduction

Sets are a powerful concept in mathematics. Sets are collections of unique elements, where unique indicates that there are no duplicates. For example, {1, 2, 3} is a set, but {1, 2, 2, 3} is not a set because it contains a duplicate element. Set operations, specifically union, intersection, and difference, are useful in many applications, including data sets and probability. In this chapter, I will cover how to execute set operations in R.

Making a Set

In R, sets are not their own type of object, unlike in some other languages. Instead, you can remove all duplicates from a vector, which effectively makes the vector a set. This is done with the unique() function. See below for an example:

set.to.be <- c(1, 2, 2, 3)
unique(set.to.be)

The output of the above code is 1 2 3, which is a set. Note that the original vector is not changed, and the set version is not saved unless you assign it to a variable.
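To keep the set for later use, assign the result of unique() to a variable. The following is a minimal sketch; the name my.set is only an illustrative choice and does not appear elsewhere in this chapter:

```r
set.to.be <- c(1, 2, 2, 3)

# Save the de-duplicated vector under a new (illustrative) name
my.set <- unique(set.to.be)

# The original vector still contains the duplicate
length(set.to.be)  # 4
length(my.set)     # 3
```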

In mathematics, sets do not have a specific order. However, in R, since sets are just vectors, they do have an order. The outputs of the functions in this chapter are vectors, but their order should not be relied upon. If you need to use a set in a specific order, you should sort these vectors per your specifications.
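As a small sketch of why order should not be relied upon, the example below compares two vectors that hold the same elements in a different order. It uses setequal() and sort(), two base R functions that are not otherwise covered in this chapter, and the names a and b are chosen only for illustration:

```r
a <- unique(c(3, 1, 2, 3))  # elements are kept in first-seen order: 3 1 2
b <- c(1, 2, 3)

identical(a, b)  # FALSE, because the vectors are ordered differently
setequal(a, b)   # TRUE, because the same elements are present
sort(a)          # 1 2 3, an explicit order you can rely on
```

If you need to check whether two vectors represent the same set, setequal() is likely the safer comparison, since it ignores both order and duplicates.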

Finding the Union

The union of two sets is the set of all elements that exist in at least one of the two sets. In R, the union of two sets can be found with the union() function. See below for an example:

set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)
union(set1, set2)

The output of the above code is 1 2 3 4 5, which is the union of the two sets. Note that the order of the elements in the output is not necessarily the same as the order of the elements in the original sets.
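The inputs to union() do not need to be de-duplicated first. The sketch below, with the illustrative names x and y, shows that duplicates within and across the inputs are dropped from the result:

```r
x <- c(1, 2, 2, 3)  # contains a duplicate 2
y <- c(3, 3, 4)     # contains a duplicate 3, which also appears in x

c(x, y)      # plain concatenation keeps all seven values
union(x, y)  # 1 2 3 4
```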

Finding the Intersection

The intersection of two sets is the set of all elements that exist in both sets. In R, the intersection of two sets can be found with the intersect() function. See below for an example:

set1 <- c(4, 2, 3)
set2 <- c(3, 4, 5)
intersect(set1, set2)

The output of the above code is 4 3, which is the intersection of the two sets.
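The set functions are not limited to numbers. As a hedged illustration, the sketch below applies intersect() to two character vectors; the names class.a and class.b and their contents are made up for this example:

```r
# Hypothetical class rosters, for illustration only
class.a <- c("Ana", "Ben", "Cy")
class.b <- c("Cy", "Dee", "Ana")

# Names that appear on both rosters
both <- intersect(class.a, class.b)
both

# Swapping the arguments gives the same set of names
setequal(both, intersect(class.b, class.a))  # TRUE
```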

Finding the Difference

The difference of two sets is the set of all elements that exist in the first set but not the second set. You might have a certain degree of intuition about the above functions, but this operation might throw you off a bit at first. In R, the difference of two sets can be found with the setdiff() function. See below for an example:

set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)
setdiff(set1, set2)
setdiff(set2, set1)

The output of line 3 is 1 2, which contains the elements of set1 that are not in set2. The output of line 4 is 4 5, which contains the elements of set2 that are not in set1. As you can see, unlike the other operations above, the order of the arguments changes which elements are returned.
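One way to build intuition for setdiff() is to ask, for each element of the first vector, whether it also appears in the second. The sketch below does this with the %in% operator, which is base R but not otherwise covered in this chapter. For vectors without duplicates, as here, indexing this way gives the same elements as setdiff():

```r
set1 <- c(1, 2, 3)
set2 <- c(3, 4, 5)

# TRUE where an element of set1 also appears in set2
in.both <- set1 %in% set2
in.both  # FALSE FALSE TRUE

# Keep only the elements of set1 that are NOT in set2
set1[!in.both]       # 1 2
setdiff(set1, set2)  # 1 2
```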

Conclusion

Set operations are helpful in many cases in coding. In this chapter, I covered how to execute these set operations in R. These functions might not be the most commonly used functions in R, but you will certainly need them in some cases.


Practice

Use Set Operations on Built-In Data Sets

There are many data sets that come built-in with R. They are often used for pedagogical or testing purposes. To see the list of built-in data sets, run help(package = "datasets"). The following prompts will use those built-in data sets to practice using set functions: unique(), union(), intersect(), and setdiff(). To find out more about any data set, use a question mark followed by its name, e.g., ?airquality. You may benefit from reviewing the chapter about how to handle data structures.

  1. The airquality data set contains information about air quality in New York in a few months of 1973. One of its columns is Month, which contains the month of the year as a number. In a single line, make a vector that lists all the unique months that show up in the data set. E.g., if the Month column of the data set were c(4, 4, 4, 5, 5, 6, 6, 6), the vector sought after is c(4, 5, 6). You should find five unique months.
  2. The quakes data set contains information about earthquakes near Fiji since 1964. The stations column contains the number of stations that reported each earthquake. In a single line, make a vector that lists all the unique values that show up in the stations column. You should find 102 unique values.
  3. The CO2 data set contains information about the uptake rate of CO2 by twelve grass plants, six from Quebec and six from Mississippi. The Treatment column indicates if the plant was chilled or not, while the uptake column indicates the uptake rate of CO2 for each plant. Begin by creating an additional column called rounded.uptake that uses the round() function on the uptake column to round the uptake rates to the nearest integer. Then, find which integer-valued uptake rates occurred for both "chilled" and "nonchilled" plants. Note that the set functions require vectors for input. You should find 13 unique rounded uptake rate values that are found for both treatment types.
  4. The mdeaths and fdeaths data sets record the number of deaths of males and females, respectively, from diseases of the lung in the UK during the 1970s. These data sets look like matrices when printed, but is.matrix(mdeaths) gives FALSE. In reality, they are time series objects, or ts as named in R (note that is.ts(mdeaths) is TRUE). These are one-dimensional arrays that are positioned across months and years. For both data sets, entries 1 through 12 are the number of deaths across the twelve months of 1974. Access them using mdeaths[1:12] and fdeaths[1:12], and save them in the variables male.deaths and female.deaths, respectively. For both data sets, find the months that had above-average mortality for the year. To do this, begin by using [] to find the months that were above average, and use that with the built-in month.name vector to get the names of the months. Then, complete the following:
    • Find the months where death counts were above average for both males and females using intersect().
    • Find the months where death counts were above average for either males or females using union().
    • Find the months where death counts were above average for males but not for females using setdiff().
    • Find the months where death counts were above average for females but not for males using setdiff().
    Complete the above tasks again, but for the year 1975 (indices 13 through 24 for the two original data sets).
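
As a hint for the last prompt, the sketch below shows one possible way to pair [] with month.name. The counts are made up for illustration and are not taken from mdeaths or fdeaths, and the names toy.counts and above are hypothetical:

```r
# Made-up monthly counts, for illustration only (January through December)
toy.counts <- c(10, 30, 20, 40, 5, 5, 5, 5, 5, 5, 5, 45)

# Logical vector: TRUE for months above the yearly average (15 here)
above <- toy.counts > mean(toy.counts)

# Use the logical vector to pick out the matching month names
month.name[above]  # "February" "March" "April" "December"
```

The resulting character vectors can then be passed to intersect(), union(), and setdiff() like any other vectors.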