Graphing is a terrific way to visualize data. It can help demonstrate relationships between variables that are not possible to identify in a list of data. In this chapter, we will learn how to create graphs in R using the base R graphics package. It is my understanding that many people prefer to use the ggplot2 package, and there is room in this wide world for multiple opinions. However, knowing how to use the base R graphics is important given its continued prominence amongst data scientists.
                    Let's start with a simple example. We will create a scatter plot of
                    the relationship between the number of hours studied and the grade
                    received on a test. We will use the plot() function to
                    create the plot. The first argument is the x-axis variable, and the
                    second argument is the y-axis variable. Both are provided as vectors
                    of data. We will also add a title to
                    the plot using the main argument.
                
set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- rnorm(100, 60, 2)
plot(hours.studied, grade, main="Hours Studied vs. Grade")
                    Since both variables in this example were created randomly with
                    no relationship between them, we observe that there is no clear
                    relationship between the two variables. If there is any relationship,
                    it is by chance alone. We can quantify the linear relationship between
                    the two variables using the lm() statistical function:
                
linear.fit <- lm(grade ~ hours.studied)
                    Here, we are fitting a linear model to the data. The ~
                    symbol is used to indicate the dependent variable on the left and
                    the independent variable on the right. To see the results of this
                    model, we can use the summary() function:
                
summary(linear.fit)
                    This shows that the numbers of hours studied has virtually no predictive
                    ability in explaining the grade received on the test. I will quickly note that
                    passing linear.fit to the plot function will
                    give a series of plots related to statistical analysis, which are revealed
                    one at a time as the message "Hit <Return> to see next plot:" is
                    printed to the console. These plots may be of interest to you, so I mention them here.
                    However, I will focus this chapter on plots that you have more contol over.
                
                    Technically, instead of plot(hours.studied, grade), you can alternatively
                    use plot(grade ~ hours.studied). This is because the ~ symbol
                    is used to indicate the dependent variable (vertical axis) on the left and the 
                    independent variable (horizontal axis) on
                    the right. This is the reverse order of the two-argument case, which is confusing. 
                    In this case, grade ~ hours.studied is a single argument, which
                    is a formula. This can be verified using the class() function:
                
class(grade ~ hours.studied)
                    This formula object can be used for both lm() and plot().
                    However, the two-argument approach, i.e., plot(hours.studied, grade), 
                    is easier to read for the plot() function,
                    and it is my opinion that you should reserve the use of formulas for statistical functions.
                
                    Now that we have performed statistical analysis on the data, we can add the
                    regression line to the plot. We can do this using the abline()
                    function. This function can be used a few different ways, the most basic way being that you can
                    provide the slope and the intercept of the line you want to add to the plot (e.g., 
                    abline(a = 5, b = 3) will add a line with a slope of 3 and an intercept of 5).
                    No matter the way it is used, this function is always used
                    to put a straight line on an already-existing plot. Here, we can pass in the
                    linear model object, linear.fit, to add the regression line, since
                    linear.fit contains information about the slope and intercept of the line:
                
abline(linear.fit)
You will see that the regression line is almost completely horizontal. This is unsurprising, since variables with virtually no correlation will mean that the best guess of the value of the dependent variable is its mean in the absence of any other information.
                    We can adjust this generated data to actually include a relationship between
                    the two variables. We will do this by adding a small amount of noise to the
                    dependent variable, grade, that is correlated with the independent
                    variable, hours.studied. See the code below for how this is done:
                
set.seed(1234)
hours.studied <- rnorm(100, 10, 2)
grade <- 3 * hours.studied + rnorm(100, 50, 10)
plot(hours.studied, grade, main="Hours Studied vs. Grade")
                    Now, generally, the trend is that the average number of hours studied is 10, and the
                    grade that people receive is roughly 3 times the number of hours studied plus 50.
                    Plotting this shows a clear linear relationship between the two variables. We can add
                    the regression line to the plot using the abline() function, as before:
                
linear.fit <- lm(grade ~ hours.studied)
abline(linear.fit)
                    Adding a confidence interval to the regression line is also possible, but in true R style,
                    it is messy. We can do this using the predict() function. This function takes the
                    linear model object as its first argument, and the interval="confidence"
                    argument to indicate that we want a confidence interval. The level
                    argument is used to indicate the confidence level, which is 95% by default,
                    so we will not change it. Given that we need to draw new lines to this plot, we have to
                    specify the horizontal axis values for which the confidence intervals will be plotted,
                    which we will do by creating a vector of values from the minimum to the maximum of the
                    hours.studied variable, with a total of 100 values. We will store this vector in
                    the horizontal.axis.values variable. Annoyingly, the predict()
                    function requires that the horizontal axis values be given in a data frame, so we will
                    use the data.frame() function to create a data frame with the horizontal
                    axis values as the hours.studied variable. Note that the name of the column
                    in the data frame must match the name of the original horizontal axis data vector.
                
horizontal.axis.values <- data.frame(
    hours.studied = seq(
        min(hours.studied), 
        max(hours.studied), 
        length.out = 100
    )
)
confidence.bounds <- predict(
    linear.fit, 
    interval = "confidence", 
    newdata = horizontal.axis.values
)
                    This makes confidence.bounds a matrix with three columns. 
                    The first column is the predicted
                    value of the dependent variable, the second column is the lower bound of the
                    confidence interval, and the third column is the upper bound of the confidence
                    interval. We can use this information to add the confidence interval to the
                    plot. We will use the lines() function to add the confidence
                    interval to the plot. The lines() function is similar to the
                    abline() function, but its output need not be a straight line. 
                    Its first argument is the horizontal axis variable, and the second
                    argument is the vertical axis variable, much like the plot() function.
                
                    Since we want to plot both the lower and upper bounds of the confidence interval,
                    we can use a for loop to reduce code repetition.
                    Here, we will use the horizontal.axis.values$hours.studied
                    vector for the first argument, and either the second or third column of the 
                    confidence.bounds matrix for the second argument. We will use
                    the lty = "dashed" argument to indicate that we want a dashed line, and
                    the col = "red" argument to indicate the color we want for the lines.
                
for (i in 2:3) {
    lines(
        horizontal.axis.values$hours.studied, 
        confidence.bounds[, i], 
        lty = "dashed", 
        col = "red"
    )
}
                    Above, we used the seq() function to create a vector of values
                    from the minimum to the maximum of the hours.studied variable.
                    Here, we will use a similar concept to plot mathematical functions. We will
                    plot the function f(x) = sin(x) from 
                    x = -10 to x = 10.
                
x.values <- seq(-10, 10, length.out = 100)
y.values <- sin(x.values)
                    Now, we can plot the function using the plot() function:
                
plot(x.values, y.values)
                    As you can see, by default, the plot() function plots the
                    points as circles. However, we can change this to a straight line using 
                    the type argument:
                
plot(x.values, y.values, type = "l") # l stands for line
                    Doing this gives us a much better idea of the shape of the function. You can see near the
                    peaks and valleys of the function that there are slightly jagged edges. This is because
                    the plot() function is just drawing straight lines between the 100 points
                    we provided. We can increase the number of points to make the function appear smoother.
                    On the other hand, we can decrease the number of points to make the function appear
                    more jagged. We can change the length.out argument to adjust the number of
                    points.
                
                        The function below "wraps" the code we used above to plot the function f(x) = sin(x).
                        Its only argument is the number of points to use. The function returns nothing, and instead
                        just plots the function. Try calling the function with different numbers of points to see
                        how the function changes.
                    
plotSin <- function (point.count) {
    x.values <- seq(-10, 10, length.out = point.count)
    y.values <- sin(x.values)
    plot(x.values, y.values, type = "l")
}
                    We can also plot multiple functions on the same plot. However, as you have seen above,
                    every time you call the plot() function, it creates a new plot. To plot
                    additional functions on an existing plot, we can use the lines() function
                    as shown above. We will plot the functions f(x) = sin(x) and
                    f(x) = 0.25 * x^2 - 1 on the same plot to demonstrate this:
                
x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(x.values, sin.values, type = "l")
lines(x.values, square.values, col = "red")
                    As seen above, it is somewhat awkward to have to switch the function used to plot data from the initial 
                    plot() function to the lines() function. However, there is an 
                    important reason for this: the plot() function is used to create the axes
                    and the plot area, while the lines() function is used to add lines to the
                    axes. You will notice that even though our second function, f(x) = 0.25 * x^2 - 1,
                    has a range of y = [-10, 10], the y-axis of the plot is still y = [-1, 1].
                    This is because the plot() function automatically adjusts the axes to fit the
                    initial data. This is helpful for most cases, but sometimes, you might want more control
                    over the bounds of the axes. For example, the code below sets the x-axis to be
                    x = [-7, 7] and the
                    y-axis to be y = [-10, 10]
                    and plots the same two functions as above:
                
x.values <- seq(-10, 10, length.out = 100)
sin.values <- sin(x.values)
square.values <- 0.25 * x.values ^ 2 - 10
plot(
    x.values, 
    sin.values, 
    type = "l", 
    xlim = c(-7, 7), 
    ylim = c(-10, 10)
)
lines(x.values, square.values, col = "red")
You will note that by default, R borrows what you actually typed for the arguments of the x and y data for their respective axis titles. This feature is a grave sin in the eyes of the rest of the coding world. All other major programming languages simply evaluate what is typed, and does not keep track of how it was typed in, which improves memory management and program speed. However, R always finds a way to be special, so the code typed for arguments are used to label the axes by default. This is usually not a critical issue if you use a good variable name for the argument, but is disastrous if you put something more complicated in as the arguments of the function:
plot(c(1, 2, 3, 5, 4), tail(1:6 * 3, n=5))
                    Here, the x-axis is labeled c(1, 2, 3, 5, 4), and the y-axis is labeled
                    tail(1:6 * 3, n=5), which hurts me in my bones. We can 
                    specify the labels of the axes using the xlab and ylab arguments:
                
plot(
    c(1, 2, 3, 5, 4), 
    tail(1:6 * 3, n=5),
    xlab = "X Values",
    ylab = "Y Values"
)
                    As shown above, the main argument is used to specify the title of the plot,
                    which is always a good idea for anything other than a quick plot. There are many other things
                    you can adjust about the plot, which you can learn about by typing ?plot into the
                    console, and then clicking the "Generic X-Y Plotting" link that appears.
                
                    Personally, I love making my code tidy and simple. So, when I have to plot several functions 
                    onto one plot, I like to use the plot() function only to create the axes, axis titles,
                    and main title, and then use the lines() function to plot all the functions.
                    For example, the code below plots the first few Taylor expansions 
                    of f(x) = sin(x)
                    programmatically by using a for loop. 
                    I will first provide the function that makes the curves, which you can copy, 
                    followed by the code to make the plots, which you should type out. 
                    No need to understand what is happening in 
                    the sinExpansion function, just focus on how it is used.
                
sinExpansion <- function (curve.number, x.values) {
    total.expansion <- 0
    for (i in 1:curve.number) {
        term.num <- 2 * i - 1
        coefficient <- (-1) ^ (i + 1) / factorial(term.num)
        total.expansion <- total.expansion + 
            coefficient * x.values ^ term.num
    }
    return (total.expansion)
}
x.values <- seq(-10, 10, length.out = 100)
number.of.curves <- 8
curve.colors <- hcl.colors(number.of.curves, "BluGrn")
plot(
    NULL, # indicates that we do not want to plot any data
    xlim = c(-15, 10), # required if above is NULL
    ylim = c(-2, 2),   # required if above is NULL
    xlab = "X Values", 
    ylab = "Y Values", 
    main = "Taylor Expansions of sin(x)"
)
lines(x.values, sin(x.values), lty = "dashed", lwd = 2)
for (i in 1:number.of.curves) {
    lines(
        x.values,
        sinExpansion(i, x.values),
        col = curve.colors[i]
    )
}
legend(
    "bottomleft", 
    legend = c("sin(x)", paste("Curve", 1:number.of.curves)),
    col = c("black", curve.colors),
    lty = c("dashed", rep("solid", number.of.curves)),
    lwd = c(2, rep(1, number.of.curves)),
    cex = 0.7
)
There are a few important things that I used here that I want to discuss:
hcl.colors() function is used to generate a vector of colors
                        that are evenly spaced from each other. The first argument is the number
                        of colors to generate, and the second argument is the color palette. I used
                        the "BluGrn" color, which is a blue-green color scale. To find other
                        palettes, you can type hcl.pals(), which returns all the possible
                        color palette names. There are many other ways to control color in R, so do
                        more research if you are unhappy with the color options.
                    plot() function call
                        doesn't need to have any data plotted (i.e., NULL can be given instead of data). 
                        However, if you do give NULL to the plot function, it 
                        does need to have the
                        xlim and ylim arguments specified. Otherwise, R
                        will not know what the bounds of the plot should be. It is not required 
                        to do plots this way, but I like the layout when I need to plot several lines
                        on one graph.
                    lty argument is used to specify the line type. The default
                        is a "solid" line, but you can also use "dashed", 
                        or several other options. You can also use a number from 0 to 6 to specify a
                        different type of line, but I can never remember what each number means,
                        so I prefer to use the string options, and find them more readable.
                    lwd argument is used to specify the line width. The default
                        is 1, but you can use any number greater than 0. I used 2
                        for the dashed line to make it more visible.
                    legend() function is used to add a legend to the plot. Unlike other
                        languages, you must re-specify the colors and line types for each line. Since
                        I had one line that was dashed, thickened, and black, I had to specify the
                        col, tly, and lwd values for that line, and
                        then specify the values for the other lines using rep(). If they were
                        all the same, no such repetition would be necessary. E.g., if I removed the 
                        sin(x) line, I could have alternatively just used col = curve.colors,
                        lty = "solid", and lwd = 1 for the legend() function.
                    legend() function, the cex argument (i.e., the character
                        expansion factor) is used to specify the size of the text in the legend. 
                        The default is 1, but I wanted the text in the legend to be smaller
                        so everything would fit.
                    
                    In this chapter, I exposed you to many different features of the base plot()
                    function in R. There are many other features that I did not cover, but you can find how
                    to change your plot to your needs with a little research. Of course, ggplot2
                    is popular for many R users, and you should feel free to use it if you feel more comfortable
                    with it. However, it is important to know how to use the base R graphics package, since
                    it is still widely used.
                
    Use plot() to show the following functions within the given x-axis range. 
    For each, rerun the code to have 5, 50, 500 and 5000 steps used. 
    To do this, use a for loop with iteration variable number.steps 
    and the vector c(5, 50, 500, 5000). At the end of each iteration, use the line
    readline("Press Enter to see next plot: "). As implied, this will stop execution
    of the script until you press enter in the console. This will allow you to see the plot before the next one is
    generated. As an added challenge, make the title of the plot include how many steps were used.
    Complete the following prompts using the plot() and lines() function.
for loop. Add a legend with the legend function that
        shows which curve color represents which $b$ value. Make sure that the y-axis goes from $-6$ to $6$.for 
        loop and the text() function to add text that says the x-value of each peak just above 
        itself. Inside the for loop, since the peaks always occur at multiples of $2\pi$ and at
        $1$ unit high, such text can be added as follows: text(2*pi, 1, "2pi"). 
        Include it for all 6 peaks visible on the plot. Note: the text overlaps the line slightly
        since the text() function directly centers the text at the given coordinates on the plot.
        To avoid this issue, you may want to increase the y-axis bounds to be from $-1$ to $1.5$, and use
        1.1 for the y-coordinate of the text.
    There are several plotting functions that are worth knowing about in addition to the 
    plot() function. Each of the following prompts will introduce one of these
    functions and give you a prompt where you can practice using it.
hist(): This function plots a histogram, which is good for showing the distribution of
        one-dimensional data. Start by plotting a histogram of 100 random normal numbers like this:
        random.numbers <- rnorm(100, 5, 2)
hist(random.numbers)
        By default, this function just guesses how many bars to break the data into. However, you can
        specify the number of bars using the breaks argument. Try using breaks = 10
        and breaks = 20 to see how the plot changes. You can alternatively specify the
        start and end points of each bar by giving a vector, like this: breaks = c(-6, -3, 0, 3, 6, 9, 12).
        With this information, use a for loop to plot a histogram for three sets: 
        10, 100, and 1000 random normal numbers.
        Use breaks from -10 to 15 with a step size of 1. As before, to avoid progression to subsequent graphs before
        you've seen the current one, use readline("Press Enter to see next plot: ") at the end of each
        iteration of the for loop.boxplot(): This function plots a box-and-whisker plot, 
        which is good for showing the distribution of
        one-dimensional data, but in a different way than histograms. 
         Start by plotting a boxplot of 100 random normal numbers like this:
        random.numbers <- rnorm(100, 5, 2)
boxplot(random.numbers, )
        This shows how the data is distributed in terms of the median, quartiles, and outliers.
        To plot multiple box-and-whisker plots on the same plot, you can give a data frame or list
        of vectors to the boxplot() function. Use this function to plot the following
        two sets of data, with one plot for each data set:
        data.set1 <- list(
    item1 = rnorm(10, 5, 2),
    item2 = rnorm(100, 5, 2),
    item3 = rnorm(1000, 5, 2)
)
data.set2 <- data.frame(
    col1 = rnorm(100, 5, 2),
    col2 = rnorm(100, 10, 4),
    col3 = rnorm(100, 15, 8)
)
barplot(): This function plots a bar for each value of a categorical variable in a data set.
        The type of data used for this plot type is different from that of the previous two: it must use two-dimensional
        data where one of those dimensions is a categorical variable. Here is an example of the function being used:
        categorical.variable <- c("Green", "Brown", "Blue")
counts <- c(15, 30, 20)
barplot(
    counts, 
    names.arg = categorical.variable, 
    xlab = "Eye Color", 
    ylab = "Frequency"
)
        Use this function to plot the following data set, where the color column 
        is the categorical variable (i.e., used for names.arg) and the count column is the frequency:
        shirt.color.data <- data.frame(
    color = c("red", "green", "blue", "black"),
    count = c(11, 21, 15, 34)
)
        In many cases, there may be multiple measurements that you wish to report for the same categorical variable.
        For example, you may wish to report the height of three individual trees across four years. The following
        code shows such a data set:
        tree.heights <- data.frame(
    tree1 = c(3.9, 4.0, 4.2, 4.5),
    tree2 = c(4.1, 4.1, 4.0, 4.1),
    tree3 = c(3.8, 4.0, 4.5, 4.9)
)
barplot(as.matrix(tree.heights), beside=TRUE)
        Note that the data in this case must be a matrix, thus requiring the as.matrix() function,
        and that the beside=TRUE argument is used to indicate that the bars should be placed 
        beside each other (as opposed to stacked on top of each other). Using this code snippet as a guide,
        make a barplot of a data set that tracks the number of people who like four different flavors of ice cream
        across three years. Use round(runif(3, 10, 1000)) to generate the data for each flavor you include.
        As an added challenge, add a legend that shows which color of bar represents which year.