10| Graphing in Base R

Miles Robertson, 12.26.23 (edited 01.17.24)

Introduction

Graphing is a terrific way to visualize data. It can help demonstrate relationships between variables that are not possible to identify in a list of data. In this chapter, we will learn how to create graphs in R using the base R graphics package. It is my understanding that many people prefer to use the ggplot2 package, and there is room in this wide world for multiple opinions. However, knowing how to use the base R graphics is important given its continued prominence amongst data scientists.

Plotting and Analyzing Generated Data

Let's start with a simple example. We will create a scatter plot of the relationship between the number of hours studied and the grade received on a test. We will use the plot() function to create the plot. The first argument is the x-axis variable, and the second argument is the y-axis variable. Both are provided as vectors of data. We will also add a title to the plot using the main argument.

set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- rnorm(100, 60, 2)

plot(hours.studied, grade, main="Hours Studied vs. Grade")

Since both variables in this example were created randomly with no relationship between them, we observe that there is no clear relationship between the two variables. If there is any relationship, it is by chance alone. We can quantify the linear relationship between the two variables using the lm() statistical function:

linear.fit <- lm(grade ~ hours.studied)

Here, we are fitting a linear model to the data. The ~ symbol is used to indicate the dependent variable on the left and the independent variable on the right. To see the results of this model, we can use the summary() function:

summary(linear.fit)

This shows that the numbers of hours studied has virtually no predictive ability in explaining the grade received on the test. I will quickly note that passing linear.fit to the plot function will give a series of plots related to statistical analysis, which are revealed one at a time as the message "Hit <Return> to see next plot:" is printed to the console. These plots may be of interest to you, so I mention them here. However, I will focus this chapter on plots that you have more contol over.

Technically, instead of plot(hours.studied, grade), you can alternatively use plot(grade ~ hours.studied). This is because the ~ symbol is used to indicate the dependent variable (vertical axis) on the left and the independent variable (horizontal axis) on the right. This is the reverse order of the two-argument case, which is confusing. In this case, grade ~ hours.studied is a single argument, which is a formula. This can be verified using the class() function:

class(grade ~ hours.studied)

This formula object can be used for both lm() and plot(). However, the two-argument approach, i.e., plot(hours.studied, grade), is easier to read for the plot() function, and it is my opinion that you should reserve the use of formulas for statistical functions.

Now that we have performed statistical analysis on the data, we can add the regression line to the plot. We can do this using the abline() function. This function can be used a few different ways, the most basic way being that you can provide the slope and the intercept of the line you want to add to the plot (e.g., abline(a = 5, b = 3) will add a line with a slope of 3 and an intercept of 5). No matter the way it is used, this function is always used to put a straight line on an already-existing plot. Here, we can pass in the linear model object, linear.fit, to add the regression line, since linear.fit contains information about the slope and intercept of the line:

abline(linear.fit)

You will see that the regression line is almost completely horizontal. This is unsurprising, since variables with virtually no correlation will mean that the best guess of the value of the dependent variable is its mean in the absence of any other information.

We can adjust this generated data to actually include a relationship between the two variables. We will do this by adding a small amount of noise to the dependent variable, grade, that is correlated with the independent variable, hours.studied. See the code below for how this is done:

set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- 3 * hours.studied + rnorm(100, 50, 10)

plot(hours.studied, grade, main="Hours Studied vs. Grade")

Now, generally, the trend is that the average number of hours studied is 10, and the grade that people receive is roughly 3 times the number of hours studied plus 50. Plotting this shows a clear linear relationship between the two variables. We can add the regression line to the plot using the abline() function, as before:

linear.fit <- lm(grade ~ hours.studied)
abline(linear.fit)

Adding a confidence interval to the regression line is also possible, but in true R style, it is messy. We can do this using the predict() function. This function takes the linear model object as its first argument, and the interval="confidence" argument to indicate that we want a confidence interval. The level argument is used to indicate the confidence level, which is 95% by default, so we will not change it. Given that we need to draw new lines to this plot, we have to specify the horizontal axis values for which the confidence intervals will be plotted, which we will do by creating a vector of values from the minimum to the maximum of the hours.studied variable, with a total of 100 values. We will store this vector in the horizontal.axis.values variable. Annoyingly, the predict() function requires that the horizontal axis values be given in a data frame, so we will use the data.frame() function to create a data frame with the horizontal axis values as the hours.studied variable. Note that the name of the column in the data frame must match the name of the original horizontal axis data vector.

horizontal.axis.values <- data.frame(
    hours.studied = seq(
        min(hours.studied), 
        max(hours.studied), 
        length.out = 100
    )
)

confidence.bounds <- predict(
    linear.fit, 
    interval = "confidence", 
    newdata = horizontal.axis.values
)

This makes confidence.bounds a matrix with three columns. The first column is the predicted value of the dependent variable, the second column is the lower bound of the confidence interval, and the third column is the upper bound of the confidence interval. We can use this information to add the confidence interval to the plot. We will use the lines() function to add the confidence interval to the plot. The lines() function is similar to the abline() function, but its output need not be a straight line. Its first argument is the horizontal axis variable, and the second argument is the vertical axis variable, much like the plot() function.

Since we want to plot both the lower and upper bounds of the confidence interval, we can use a for loop to reduce code repetition. Here, we will use the horizontal.axis.values$hours.studied vector for the first argument, and either the second or third column of the confidence.bounds matrix for the second argument. We will use the lty = "dashed" argument to indicate that we want a dashed line, and the col = "red" argument to indicate the color we want for the lines.

for (i in 2:3) {
    lines(
        horizontal.axis.values$hours.studied, 
        confidence.bounds[, i], 
        lty = "dashed", 
        col = "red"
    )
}

Plotting Mathematical Functions

Above, we used the seq() function to create a vector of values from the minimum to the maximum of the hours.studied variable. Here, we will use a similar concept to plot mathematical functions. We will plot the function f(x) = sin(x) from x = -10 to x = 10.

x.values <- seq(-10, 10, length.out = 100)
y.values <- sin(x.values)

Now, we can plot the function using the plot() function:

plot(x.values, y.values)

As you can see, by default, the plot() function plots the points as circles. However, we can change this to a straight line using the type argument:

plot(x.values, y.values, type = "l") # l stands for line

Doing this gives us a much better idea of the shape of the function. You can see near the peaks and valleys of the function that there are slightly jagged edges. This is because the plot() function is just drawing straight lines between the 100 points we provided. We can increase the number of points to make the function appear smoother. On the other hand, we can decrease the number of points to make the function appear more jagged. We can change the length.out argument to adjust the number of points.

TODO: adjust the number of points

The function below "wraps" the code we used above to plot the function f(x) = sin(x). Its only argument is the number of points to use. The function returns nothing, and instead just plots the function. Try calling the function with different numbers of points to see how the function changes.

plotSin <- function (point.count) {
    x.values <- seq(-10, 10, length.out = point.count)
    y.values <- sin(x.values)
    plot(x.values, y.values, type = "l")
}

We can also plot multiple functions on the same plot. However, as you have seen above, every time you call the plot() function, it creates a new plot. To plot additional functions on an existing plot, we can use the lines() function as shown above. We will plot the functions f(x) = sin(x) and f(x) = 0.25 * x^2 - 1 on the same plot to demonstrate this:

x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(x.values, sin.values, type = "l")
lines(x.values, square.values, col = "red")

Adjusting Plot Appearance

As seen above, it is somewhat awkward to have to switch the function used to plot data from the initial plot() function to the lines() function. However, there is an important reason for this: the plot() function is used to create the axes and the plot area, while the lines() function is used to add lines to the axes. You will notice that even though our second function, f(x) = 0.25 * x^2 - 1, has a range of y = [-10, 10], the y-axis of the plot is still y = [-1, 1]. This is because the plot() function automatically adjusts the axes to fit the initial data. This is helpful for most cases, but sometimes, you might want more control over the bounds of the axes. For example, the code below sets the x-axis to be x = [-7, 7] and the y-axis to be y = [-10, 10] and plots the same two functions as above:

x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(
    x.values, 
    sin.values, 
    type = "l", 
    xlim = c(-7, 7), 
    ylim = c(-10, 10)
)
lines(x.values, square.values, col = "red")

You will note that by default, R borrows what you actually typed for the arguments of the x and y data for their respective axis titles. This feature is a grave sin in the eyes of the rest of the coding world. All other major programming languages simply evaluate what is typed, and does not keep track of how it was typed in, which improves memory management and program speed. However, R always finds a way to be special, so the code typed for arguments are used to label the axes by default. This is usually not a critical issue if you use a good variable name for the argument, but is disastrous if you put something more complicated in as the arguments of the function:

plot(c(1, 2, 3, 5, 4), tail(1:6 * 3, n=5))

Here, the x-axis is labeled c(1, 2, 3, 5, 4), and the y-axis is labeled tail(1:6 * 3, n=5), which hurts me in my bones. We can specify the labels of the axes using the xlab and ylab arguments:

plot(
    c(1, 2, 3, 5, 4), 
    tail(1:6 * 3, n=5),
    xlab = "X Values",
    ylab = "Y Values"
)

As shown above, the main argument is used to specify the title of the plot, which is always a good idea for anything other than a quick plot. There are many other things you can adjust about the plot, which you can learn about by typing ?plot into the console, and then clicking the "Generic X-Y Plotting" link that appears.

Colors, Legends, and Line Styling

Personally, I love making my code tidy and simple. So, when I have to plot several functions onto one plot, I like to use the plot() function only to create the axes, axis titles, and main title, and then use the lines() function to plot all the functions. For example, the code below plots the first few Taylor expansions of f(x) = sin(x) programmatically by using a for loop. I will first provide the function that makes the curves, which you can copy, followed by the code to make the plots, which you should type out. No need to understand what is happening in the sinExpansion function, just focus on how it is used.

sinExpansion <- function (curve.number, x.values) {
    total.expansion <- 0
    for (i in 1:curve.number) {
        term.num <- 2 * i - 1
        coefficient <- (-1) ^ (i + 1) / factorial(term.num)
        total.expansion <- total.expansion + 
            coefficient * x.values ^ term.num
    }
    return (total.expansion)
}

x.values <- seq(-10, 10, length.out = 100)
number.of.curves <- 8
curve.colors <- hcl.colors(number.of.curves, "BluGrn")

plot(
    NULL, # indicates that we do not want to plot any data
    xlim = c(-15, 10), # required if above is NULL
    ylim = c(-2, 2),   # required if above is NULL
    xlab = "X Values", 
    ylab = "Y Values", 
    main = "Taylor Expansions of sin(x)"
)

lines(x.values, sin(x.values), lty = "dashed", lwd = 2)

for (i in 1:number.of.curves) {
    lines(
        x.values,
        sinExpansion(i, x.values),
        col = curve.colors[i]
    )
}

legend(
    "bottomleft", 
    legend = c("sin(x)", paste("Curve", 1:number.of.curves)),
    col = c("black", curve.colors),
    lty = c("dashed", rep("solid", number.of.curves)),
    lwd = c(2, rep(1, number.of.curves)),
    cex = 0.7
)

There are a few important things that I used here that I want to discuss:

The hcl.colors() function is used to generate a vector of colors that are evenly spaced from each other. The first argument is the number of colors to generate, and the second argument is the color palette. I used the "BluGrn" color, which is a blue-green color scale. To find other palettes, you can type hcl.pals(), which returns all the possible color palette names. There are many other ways to control color in R, so do more research if you are unhappy with the color options.
As mentioned in the comments, the initial plot() function call doesn't need to have any data plotted (i.e., NULL can be given instead of data). However, if you do give NULL to the plot function, it does need to have the xlim and ylim arguments specified. Otherwise, R will not know what the bounds of the plot should be. It is not required to do plots this way, but I like the layout when I need to plot several lines on one graph.
The lty argument is used to specify the line type. The default is a "solid" line, but you can also use "dashed", or several other options. You can also use a number from 0 to 6 to specify a different type of line, but I can never remember what each number means, so I prefer to use the string options, and find them more readable.
The lwd argument is used to specify the line width. The default is 1, but you can use any number greater than 0. I used 2 for the dashed line to make it more visible.
The legend() function is used to add a legend to the plot. Unlike other languages, you must re-specify the colors and line types for each line. Since I had one line that was dashed, thickened, and black, I had to specify the col, tly, and lwd values for that line, and then specify the values for the other lines using rep(). If they were all the same, no such repetition would be necessary. E.g., if I removed the sin(x) line, I could have alternatively just used col = curve.colors, lty = "solid", and lwd = 1 for the legend() function.
In the legend() function, the cex argument (i.e., the character expansion factor) is used to specify the size of the text in the legend. The default is 1, but I wanted the text in the legend to be smaller so everything would fit.

Conclusion

In this chapter, I exposed you to many different features of the base plot() function in R. There are many other features that I did not cover, but you can find how to change your plot to your needs with a little research. Of course, ggplot2 is popular for many R users, and you should feel free to use it if you feel more comfortable with it. However, it is important to know how to use the base R graphics package, since it is still widely used.

Practice

Plot Mathematical Functions

Use plot() to show the following functions within the given x-axis range. For each, rerun the code to have 5, 50, 500 and 5000 steps used. To do this, use a for loop with iteration variable number.steps and the vector c(5, 50, 500, 5000). At the end of each iteration, use the line readline("Press Enter to see next plot: "). As implied, this will stop execution of the script until you press enter in the console. This will allow you to see the plot before the next one is generated. As an added challenge, make the title of the plot include how many steps were used.

$y = \sin(x)$, with $x$ from $-10$ to $10$.
$y = x^2 + x$, with $x$ from $0$ to $5$.
$y = x\ln(x)$, with $x$ from $0$ to $50$.
$y = 1 / \sin(x)$, with $x$ from $0$ to $1$.

Complete Plotting Prompts

Complete the following prompts using the plot() and lines() function.

Show ten functions on a single plot, where all the functions are of the form $y = x^2 + bx + 1$ with $b$ taking on the values $-5$ to $4$. Do so using a for loop. Add a legend with the legend function that shows which curve color represents which $b$ value. Make sure that the y-axis goes from $-6$ to $6$.
Show the function $y = cos(x)$ with $x$ from $0$ to $35$. Every peak of the function is $1$ unit tall and occurs at a multiple of $2\pi$ (e.g., $0$, $2\pi$, $4\pi$, $6\pi$, etc). Use a for loop and the text() function to add text that says the x-value of each peak just above itself. Inside the for loop, since the peaks always occur at multiples of $2\pi$ and at $1$ unit high, such text can be added as follows: text(2*pi, 1, "2pi"). Include it for all 6 peaks visible on the plot. Note: the text overlaps the line slightly since the text() function directly centers the text at the given coordinates on the plot. To avoid this issue, you may want to increase the y-axis bounds to be from $-1$ to $1.5$, and use 1.1 for the y-coordinate of the text.
Create a data frame with two columns and 50 rows. The first column should have random normal numbers with a mean and standard deviation of your choice, and the second should be three times the first column plus a little random noise (i.e., add random numbers that are normally distributed with a low mean and standard deviation). Plot this data frame as a scatter plot. Change the name of the plot title and axes titles to something interesting.

Use Related Plotting Functions

There are several plotting functions that are worth knowing about in addition to the plot() function. Each of the following prompts will introduce one of these functions and give you a prompt where you can practice using it.

hist(): This function plots a histogram, which is good for showing the distribution of one-dimensional data. Start by plotting a histogram of 100 random normal numbers like this:
```
random.numbers <- rnorm(100, 5, 2)
hist(random.numbers)
```
By default, this function just guesses how many bars to break the data into. However, you can specify the number of bars using the breaks argument. Try using breaks = 10 and breaks = 20 to see how the plot changes. You can alternatively specify the start and end points of each bar by giving a vector, like this: breaks = c(-6, -3, 0, 3, 6, 9, 12). With this information, use a for loop to plot a histogram for three sets: 10, 100, and 1000 random normal numbers. Use breaks from -10 to 15 with a step size of 1. As before, to avoid progression to subsequent graphs before you've seen the current one, use readline("Press Enter to see next plot: ") at the end of each iteration of the for loop.
boxplot(): This function plots a box-and-whisker plot, which is good for showing the distribution of one-dimensional data, but in a different way than histograms. Start by plotting a boxplot of 100 random normal numbers like this:
```
random.numbers <- rnorm(100, 5, 2)
boxplot(random.numbers, )
```
This shows how the data is distributed in terms of the median, quartiles, and outliers. To plot multiple box-and-whisker plots on the same plot, you can give a data frame or list of vectors to the boxplot() function. Use this function to plot the following two sets of data, with one plot for each data set:
```
data.set1 <- list(
    item1 = rnorm(10, 5, 2),
    item2 = rnorm(100, 5, 2),
    item3 = rnorm(1000, 5, 2)
)

data.set2 <- data.frame(
    col1 = rnorm(100, 5, 2),
    col2 = rnorm(100, 10, 4),
    col3 = rnorm(100, 15, 8)
)
```
barplot(): This function plots a bar for each value of a categorical variable in a data set. The type of data used for this plot type is different from that of the previous two: it must use two-dimensional data where one of those dimensions is a categorical variable. Here is an example of the function being used:
```
categorical.variable <- c("Green", "Brown", "Blue")
counts <- c(15, 30, 20)

barplot(
    counts, 
    names.arg = categorical.variable, 
    xlab = "Eye Color", 
    ylab = "Frequency"
)
```
Use this function to plot the following data set, where the color column is the categorical variable (i.e., used for names.arg) and the count column is the frequency:
```
shirt.color.data <- data.frame(
    color = c("red", "green", "blue", "black"),
    count = c(11, 21, 15, 34)
)
```
In many cases, there may be multiple measurements that you wish to report for the same categorical variable. For example, you may wish to report the height of three individual trees across four years. The following code shows such a data set:
```
tree.heights <- data.frame(
    tree1 = c(3.9, 4.0, 4.2, 4.5),
    tree2 = c(4.1, 4.1, 4.0, 4.1),
    tree3 = c(3.8, 4.0, 4.5, 4.9)
)

barplot(as.matrix(tree.heights), beside=TRUE)
```
Note that the data in this case must be a matrix, thus requiring the as.matrix() function, and that the beside=TRUE argument is used to indicate that the bars should be placed beside each other (as opposed to stacked on top of each other). Using this code snippet as a guide, make a barplot of a data set that tracks the number of people who like four different flavors of ice cream across three years. Use round(runif(3, 10, 1000)) to generate the data for each flavor you include. As an added challenge, add a legend that shows which color of bar represents which year.