03| Variables and Primitive Data

Miles Robertson, 12.19.23 (edited 01.10.24)

Introduction

When first learning coding, it is somewhat difficult to comprehend what coding languages even do. Robotics, a field closely related to computer science, has clearer operations: a computer controlling a robot might turn a motor, light up an LED, or otherwise interact with the physical world. But what does a computer do when it is not controlling a robot? This question is more difficult to intuitively understand.

For questions like this, it is helpful to look at simple examples. One simple computer that we're all familiar with is a calculator. A calculator is a computer that takes in numbers and operators (e.g., add, subtract, multiply, divide) and outputs another number. If we hit the buttons 1, +, 2, and = in succession, the calculator will output 3 on its screen. The calculator has code that gives it instructions on what to do with the inputs it receives. Calculators have no sense of what numbers are, but when given instructions, they are able to execute them. The code tells the calculator how to receive data (e.g., numbers), how to manipulate that data (e.g., how to add numbers), and how to present this manipulated data to a user (e.g., putting the result on the screen).

Although our laptops, desktop computers and phones are all much more complicated, the basic principle is the same. Everything coding languages can do can be summarized as data input, data manipulation, and data output. Calculators only deal with numbers and operators, but computers can deal with several other types of data. If coding languages only input, output, and manipulate data, you may question how they can do anything that is all that interesting. However, data can be manipulated in intricate ways that produce highly organized and complex results, from a spreadsheet of measurements to a video game.

In this section, I will introduce the most basic types of data that the R coding language uses, and how variables are used to do things with these data. In many coding languages, the simplest data types are called primitive data, in the sense that they are uncomplicated and have very little structure. Although R documentation usually does not use this term, it is a useful way to consider these basic data types. All other data types are simply organized collections of primitive data, so primitive data are the building blocks of all data.

A Simple Example

As discussed in the practice section of the Setup chapter, RStudio has a console (bottom left window) that can be used to run code. The console is a place where you can type in code and see its results immediately. This is different from .R files, usually edited in a window directly above the console, which are simply text files containing R code that is only run when instructed to do so. The console is a great place to experiment with code and see what it does.

TODO: follow along

In the console, you'll notice a > symbol. This is called a prompt, and indicates that R is ready to receive code. Follow along with the code below in your console.

We'll start very simple. Type this into the console after the > and press Enter:

1 + 2

You'll see [1] 3 appear in the console, which is the result of the code you just ran. The [1] is not important for now, but know that it accompanies output in the console. Here, the data input is 1 and 2, the data manipulation is +, and the data output is 3.

In this code, we found the answer to 1 + 2, but we did not save the result anywhere. If we wanted to use this result later, we would have to re-run the code. This is where variables come in. Variables are a way to store data so that it can be used later. A variable is given a value with the assignment operator <-, which is often read as "gets" (e.g., x <- 3 is read as "x gets 3"). For example, the following code assigns the value 3 to the variable x:

x <- 3

You can also use = to assign variables, but to be consistent, it is recommended to use <-. Now, if you type x into the console and press Enter, you'll see 3 appear. These variables can be used in the same way as the data input in the previous example. For example, the following code will output 5:

x + 2

Now that we have gone through this simple example, we'll talk about the specific types of data that R uses, and expound on variables.

R's Primitive Data Types

R has six basic data types. One of them is obscure and will be of no use to you, but the other five are of interest: (1) numeric, (2) integer, (3) complex, (4) logical and (5) character. We'll go through each of these in turn, and I'll give some examples of how they are used.

R is unique in that all of its simplest data types are always in atomic vectors (more commonly just referred to as vectors). Here, a "vector" is just a bunch of things put in some order, and "atomic" means that the data in the vector are the most basic data types. These vectors only contain data that are all of the same type. We'll talk about why vectors make R unique later, but here I give an early warning: R has some oddities that can be confusing or frustrating. Nonetheless, understanding the "behavior" of primitive data will give you a foundation for understanding more complex data types. It is a worthy goal to try to get a firm mental grasp on the primitive data types.

Number Types: numeric, integer, and complex

There are three different types of numbers in R, but they largely behave the same way, and only rarely will you need to distinguish between them. Let's start with the example in the box below.

TODO: make an atomic vector

In the console, type c(1,2.5,3) and press Enter. You'll see [1] 1.0 2.5 3.0 appear in the console. This is an atomic vector of length 3, containing the numbers 1, 2.5, and 3. The c stands for combine, and makes a vector from what is given in the parentheses.

The vector created in the box above is a numeric vector, the most common of R's primitive data types. Unsurprisingly, this indicates that the vector contains numbers. Even though you typed 1 and 3 without decimals, R still treats whole numbers typed out in your code (e.g., 1, 179, -14) like they're decimal numbers. If you want to make a vector of integers, you can use the L suffix. For example, c(1L, 2L, 3L) creates a vector of integers. Generally speaking, you'll mostly deal with numeric vectors in code you write, with one strange exception described in the next paragraph.

In some cases, you need to make a vector of sequential numbers. For example, you might want to make a vector of the numbers 5 through 10. You could type out c(5,6,7,8,9,10), but this is tedious and error-prone. Instead, you can use the : operator to make a vector of sequential numbers. 5:10 completes the goal much more succinctly and readably than the previous code. This, oddly, creates an integer vector, even though everywhere else in R you have to type L to make an integer. Luckily, integers and numerics act the same in almost all cases, so this is not a big deal. If you want to know what the type of a vector is, you can use the class() function. For example, class(5:10) returns "integer" and class(c(1,2,3)) returns "numeric".
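To see these types side by side, here is a small sketch you can paste into the console. The variable names are only illustrative and are not used elsewhere in this chapter.

```r
# Numbers typed without a suffix are stored as numeric (decimal) values.
typed.numbers <- c(5, 6, 7, 8, 9, 10)
class(typed.numbers)    # "numeric"

# The L suffix asks R for integers instead.
suffixed.numbers <- c(5L, 6L, 7L, 8L, 9L, 10L)
class(suffixed.numbers) # "integer"

# The : operator builds the same sequence and yields integers without any L.
sequence.numbers <- 5:10
class(sequence.numbers) # "integer"

# Despite the different types, the values compare as equal.
all(typed.numbers == sequence.numbers) # TRUE
```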

TODO: test out some curiosities

As you read through the previous paragraphs, you may have wondered about special cases with vectors that I did not mention. Perhaps you wondered one of the following:

What happens if I try to make a vector with data of different types (e.g., c(1L, 1.5, -20))?
What happens if I try to use negative numbers with the : operator (e.g., -5:5)? What about if I put the same number on both sides? If I put the larger number on the left side? If I try to use decimal numbers on one or both sides?
What happens if I try to make a vector with no elements (i.e., c())?

If you ever have questions like these, you're in luck! It is incredibly easy to test out these curiosities in the console. Understanding how R handles these edge cases can help you gain confidence in how to use the language. Test out the above questions in the console and come up with answers for yourself.

The next primitive data type is complex. Complex numbers are numbers that have both a real and imaginary part. For example, 1 + 2i is a complex number. You are unlikely to use these for yourself, but they are good to be aware of.
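If you are curious, the short sketch below shows a complex value in action. Re() and Im() are base R functions that are not otherwise covered in this chapter, and the variable name is only illustrative.

```r
# A complex value is written with an i after the imaginary part.
complex.number <- 1 + 2i
class(complex.number) # "complex"

# Re() and Im() pull out the real and imaginary parts.
Re(complex.number) # 1
Im(complex.number) # 2

# Arithmetic follows the usual rules for complex numbers.
complex.number * complex.number # -3+4i
```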

You can use the standard mathematical operators (+, -, *, /, ^) with numeric, integer or complex vectors. Try out the following examples in the console to see how they work (note that these operations are performed element-by-element when there is more than one number in the vector):

2 + 6
2 - 6
2 * 6
2 / 6
2 ^ 6

c(1, 2, 3) + 6
c(1, 2, 3) - 6
c(1, 2, 3) * 6
c(1, 2, 3) / 6
c(1, 2, 3) ^ 6

c(1, 2, 3) + c(4, 5, 6)
c(1, 2, 3) - c(4, 5, 6)
c(1, 2, 3) * c(4, 5, 6)
c(1, 2, 3) / c(4, 5, 6)
c(1, 2, 3) ^ c(4, 5, 6)

Boolean Type: logical

The next primitive data type is logical. Logical data are either TRUE or FALSE (alternatively, T or F, but the full name is clearer). In most other languages, these are referred to as "booleans".

These are used in many ways in R. For example, if you want to compare two values, like 1 and 2, you can use the "less than" operator (<) to see if the first value is less than the second with 1 < 2. This will return TRUE or FALSE, depending on whether the statement is true or false (it is true in this case, of course). You can also use the >, <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to) operators to compare values. See the code below for some examples, where comments indicate the return value of each line.

1 < 1  # FALSE
1 <= 1 # TRUE
1 >= 5 # FALSE
1 == 1 # TRUE
1 != 1 # FALSE
1 == 2 # FALSE
1 != 2 # TRUE
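Like the mathematical operators, comparisons are performed element-by-element when one side is a longer vector. The sketch below uses a made-up vector of plant heights to show this.

```r
plant.heights <- c(12, 30, 7, 18)

# Each element is compared with 15, giving one logical value per element.
plant.heights > 15  # FALSE  TRUE FALSE  TRUE
plant.heights == 7  # FALSE FALSE  TRUE FALSE

# The result is itself a logical vector.
class(plant.heights > 15) # "logical"
```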

You can also use the & (and) and | (or) operators to combine logical values. These take two logical values and return a single logical value. The & operator returns TRUE if both values are TRUE, and FALSE otherwise. The | operator returns TRUE if either value is TRUE, and FALSE otherwise. See the code below for some examples, where comments indicate the return value of each line.

TRUE & TRUE   # TRUE
TRUE & FALSE  # FALSE
FALSE & TRUE  # FALSE
FALSE & FALSE # FALSE
TRUE | TRUE   # TRUE
TRUE | FALSE  # TRUE
FALSE | TRUE  # TRUE
FALSE | FALSE # FALSE

These may seem unhelpful at first, but are pivotal for making more complex logical statements, which are common in coding. These generally make more sense in the context of variables, so I'll lean on the simple example above to explain them in the next box.

TODO: test out your understanding of logical operators

Begin by running the following code in the console to create the variable x and give it a value of 3:

x <- 3

Try to predict the outcome of each of the following lines. Check to see if your predictions were correct by typing them in the console.

0 < x & x < 20
0 < x | x < 20
x < 2 & x < 8
x < 2 | x < 8
-10 <= x | x == 21
x == 0 | x == 1

As a final note, you will have instances where it is useful to know if a value is inside another vector. For example, you may want to know if the number 2 is in the vector c(2,1,5,4). It is obvious in this case, but if the vector is very long, it can be difficult to tell. You can use the %in% operator to check if a value is in a vector. For example, 2 %in% c(2,1,5,4) returns TRUE, and 3 %in% c(2,1,5,4) returns FALSE.
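As a brief sketch, the code below repeats those two checks and then puts a vector on the left side of %in%, in which case each element is checked separately. The variable name allowed.values is only illustrative.

```r
allowed.values <- c(2, 1, 5, 4)

# A single value on the left gives a single TRUE or FALSE.
2 %in% allowed.values # TRUE
3 %in% allowed.values # FALSE

# A vector on the left is checked one element at a time.
c(1, 3, 5) %in% allowed.values # TRUE FALSE  TRUE
```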

String Type: character

The final primitive data type is character. Character data usually just look like words inside of quotation marks, but as the name suggests, these data are just any typed characters between quotation marks. Try out the following code in the console to see some examples:

"Hello, world!"
c("abc", "d", "&", "1", "2!!!", " ")

The first line has a single character value, and the second line is a vector of character values. Note that the second line has "1" in it, and not 1. The former is a character value, where it's treated as the character 1, and the latter is a numeric value, treated as the number 1. Additionally, the second line has " ", which is a character value that is nothing but a space. In most other languages, this data type is called a string, as in "a string of characters", but R refers to this type just as "character".

These are used in many ways in R. Most commonly, you will use character values to label things, like columns in a data set. Let's look at how we could push two character values together to make a single character value. We can do this with the paste() function:

paste("Hello", "world!")

This returns "Hello world!", which is a single character value. The paste() function takes any number of values, of any type, and combines them into a single character value. You'll notice it automatically added a space between the two character values. This can be controlled, but that will be discussed later.
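To illustrate the "any type" part, here is a small sketch that mixes character, numeric and logical values in one call. The variable names are made up for this example.

```r
sample.number <- 3

# paste() converts each value to text and joins them with spaces.
sample.label <- paste("Sample", sample.number, "measured:", TRUE)
sample.label        # "Sample 3 measured: TRUE"
class(sample.label) # "character"
```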

In some cases, you may want to put a message in the console as you're running code. For example, say your code is running a simulation that takes a long time to complete. You may want to show a message in the console to let you know how far along the code is. You can do this with the cat() function. For example, the following code shows the message "Hello, world!" in the console:

cat("Hello, world!")

This function does the trick in most cases, and even works with special characters, other data types or multiple values:

cat(13)
cat("\U1F600") # The \ indicates that what follows is
# a special character, in this case a smiley face emoji
cat("This line will print\non two lines") 
# The \n creates a new line
cat("I was born in", 1903)

There is another similar function in R, called print(), which can show more complicated data types than cat() can. However, print() displays only one value per call, and it shows character values as R stores them, with quotation marks and with escape sequences like \n left as typed instead of being turned into a new line. This makes it less useful for printing messages to the console.
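The difference between the two functions is easiest to see with a character value that contains \n. This is only a sketch, and message.text is a made-up name.

```r
message.text <- "Step 1 done\nStep 2 done"

# cat() interprets \n and writes the text on two lines, without quotes.
cat(message.text, "\n")

# print() shows the value as R stores it: one quoted string,
# with the \n left visible and a [1] in front.
print(message.text)
```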

Odds & Ends: Inf, NaN, NA, and NULL

There are a few values that come up with these primitive data types that are worth mentioning. The first is Inf, which is short for infinity. Running class(Inf) shows that it is a numeric value. In other words, R treats Inf as a special type of number. This is useful for indicating when a mathematical operation returns infinity, or an incredibly large number. Try out each of the following lines to see how Inf can appear:

1 / 0
-1 / 0
1.4e1000 # This is how you type 1.4 x 10^1000, 
# which is too big for R to handle, so it converts it to Inf

Similar to Inf, NaN is also a numeric value, which stands for "not a number". This is used to indicate when a mathematical operation returns a value that is not a number. The following line shows how NaN can appear:

0 / 0

If you see Inf or NaN in a result, it usually indicates that you've done some incorrect math, often by dividing by zero.
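When results like these are buried in a longer vector, base R's is.finite() and is.nan() functions can help you find them. These functions are not covered elsewhere in this chapter, so treat the following as a sketch with a made-up variable name.

```r
calculation.results <- c(1 / 0, -1 / 0, 0 / 0, 10 / 2)
calculation.results # Inf -Inf  NaN    5

# is.finite() is TRUE only for ordinary numbers.
is.finite(calculation.results) # FALSE FALSE FALSE  TRUE

# is.nan() picks out only the NaN entries.
is.nan(calculation.results)    # FALSE FALSE  TRUE FALSE
```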

It is helpful to have a placeholder value that indicates that a value is missing. R uses NA for this purpose. This is a logical value (see class(NA)), and will be a frequent occurrence when you import data sets with empty cells. Note that NA looks very similar to NaN, but they have very different meanings.
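As a minimal sketch of how NA behaves, imagine a made-up set of leaf counts where the second measurement was never recorded. The is.na() function used here is part of base R.

```r
leaf.counts <- c(14, NA, 9)

# is.na() reports which entries are missing.
is.na(leaf.counts) # FALSE  TRUE FALSE

# Arithmetic with a missing value gives a missing result.
leaf.counts + 1    # 15 NA 10

# Inside a numeric vector, NA does not change the vector's type.
class(leaf.counts) # "numeric"
```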

Finally, NULL is a special value that indicates that a variable has no value. Conceptually, you may think of this example to help understand: If I am not wearing a hat, you might say that my hat is NULL. This value is different from NA, which indicates that a variable has a missing value. If you run class(NULL), you'll see that it is of type "NULL", and is in fact the only value of this type. Using this is not common for statistical purposes, but disambiguation between this and the other values above will leave you less confused when you run across them.
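One way to see the difference between NA and NULL, sketched below, is to check how much space each takes up. NA fills a position in a vector, while NULL contributes nothing. The is.null() function is part of base R.

```r
# NULL has length zero: there is nothing there at all.
length(NULL)  # 0
is.null(NULL) # TRUE

# NA still occupies a position in a vector, but NULL does not.
length(c(1, NA, 3))   # 3
length(c(1, NULL, 3)) # 2
```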

Converting Between Types of Vectors

There are many instances where you will want to convert between data types. For example, you may want to convert a numeric vector to a character vector, or a character vector to a numeric vector. This is easy to do with the as.character(), as.numeric(), as.integer(), etc. functions. For example, the following code converts the numeric vector c(1,2,3) to a character vector:

as.character(c(1,2,3))
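A few more straightforward conversions are sketched below, with the return values noted in comments.

```r
# Character values that look like numbers can become numbers.
as.numeric(c("3.5", "10")) # 3.5 10.0

# Logical values convert to 1 (TRUE) and 0 (FALSE).
as.numeric(c(TRUE, FALSE)) # 1 0

# Converting a decimal number to an integer drops the decimal part.
as.integer(3.9) # 3

# Numbers can be turned into character values.
as.character(42) # "42"
```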

I'll leave it to you in the Practice section below to try out these functions, and to see how it handles weird cases, like trying to convert "hello" to a numeric value.

Factors

In some cases, scientific data might take on a limited number of values. For example, an experiment might have categorical treatments, like different food types for an herbivore's diet. In this case, the data are characters, but are still limited to a few values (e.g., "hay", "corn", or "alfalfa"). R has a special way to indicate this type of data, called a factor. Factors are a special type of vector that can only take on a limited number of values. For example, the following code creates a factor with the three values mentioned above:

as.factor(c("hay", "corn", "alfalfa"))

Running this in R shows that the factor has three "levels", which are the three values that the factor can take on. Factors are useful for statistical analyses, and are used in many of R's built-in functions.
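The sketch below builds a slightly longer factor with repeated values and inspects its levels with the base R functions levels() and nlevels(). The variable name diet.treatment is made up for this example.

```r
diet.treatment <- as.factor(c("hay", "corn", "alfalfa", "corn", "hay"))

# levels() lists the distinct values, sorted alphabetically by default.
levels(diet.treatment)  # "alfalfa" "corn"    "hay"
nlevels(diet.treatment) # 3

# The class is "factor", not "character".
class(diet.treatment)   # "factor"
```

Note that the factor has five entries but only three levels, since repeated values share a level.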

Everything's a Vector

At the beginning of this section, I said that R is unique when it comes to vectors. I mentioned that all of R's basic data types come in vectors. In fact, even solo numerics, logicals, characters, etc. are vectors of length 1. That means that there is no such thing as something that is of type numeric but is not a vector. All of the following examples are one-element vectors:

1
TRUE
"hello"

Other languages make the distinction between primitive data and vectors of that data, but R does not. In other languages, you can create vectors that themselves contain more vectors, but in R, you can only create atomic vectors that contain primitive data. The c() function mentioned before collects all the vectors it receives and makes a single vector out of them. All of the following examples create the same vector:

c(1, 2, 3)
c(c(1, 2), 3)
c(1, c(2, 3))

There are upsides and downsides to this behavior. However, it is generally useful for statistical purposes, and is one of the reasons R is so popular for statistics.
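You can confirm both claims from this section in the console. The following is a small sketch, and the variable names are only illustrative.

```r
# A lone value is a vector of length 1.
length(1)       # 1
is.vector(1)    # TRUE
length("hello") # 1 (one character value, however many letters it has)

# Nesting c() calls does not nest the result; it is flattened.
nested.vector <- c(c(1, 2), 3)
flat.vector <- c(1, 2, 3)
identical(nested.vector, flat.vector) # TRUE
```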

Variables

As discussed above in the initial example, variables are a way to store data so that it can be used later. They can store all types of data, and can be used in place of data in operations. Here, I'll discuss how variables work and the conventions you should follow when using them. Try out the following to see how variables work:

TODO: assign and use variables

Execute the following lines of code in the console, one by one. Some will have output to the console, and some will not. It may be helpful for you to pay attention to the Global Environment panel in the upper right window of RStudio, which shows the variables you have created. You'll be able to see how the value of x changes as you run the code.

x <- 3
x + 2
x <- 10
x + 2
x <- x + 1

Answer the following questions:

What is the value of x after each line?
What happened to x when line 3 was run?
What happened to x when line 5 was run?

As you were hopefully able to determine, variables in R hold their value until they are changed. In addition, a variable can be used in its own assignment (e.g., line 5 in the box above). This shows that R will first calculate whatever is on the right side of the assignment operator (<-), and then assign that value to the variable on the left side, which overrides the previous value.

Below, I will discuss several aspects of variables, including how to name them and use them.

RStudio's Global Environment

RStudio maintains a global environment. This term refers to the collection of variables currently available to R. Every time you assign a new variable, it is added to the global environment (e.g., the line x <- 3 adds x to the global environment, and then x can be accessed at any time). As mentioned in the box above, you can see what is in your global environment by looking at the top right window of RStudio, and can clear it by pressing the broom icon at the top of that window.

Having a global environment means that if you run a line of code that creates a variable, that variable will be stored until you close RStudio or clear the environment. This can be useful, but can also mean that you may think that code in your .R document is complete, but in reality, you created a variable at some point that is defined in the global environment but is not defined in your code. This is a common problem for new R users, and it is important to understand how the environment works in order to avoid this problem.
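The environment can also be inspected from the console. The sketch below uses the base R functions ls(), which lists the names that are currently defined, and rm(), which removes one. The variable plot.width is made up for this example.

```r
plot.width <- 8

# ls() lists the names currently defined in the environment.
"plot.width" %in% ls() # TRUE

# rm() removes a variable, much like clearing it from the Environment panel.
rm(plot.width)
"plot.width" %in% ls() # FALSE
```

After rm() runs, any later line that uses plot.width will fail with an "object not found" error, which is the same thing that happens when a script relies on a variable that was only ever created in the console.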

Variables Make Code Easier to Edit

Variables allow you to write code that is more adaptable. If you are doing a lot of calculations with one quantity, you're better off assigning it to a variable and using that variable in your calculations than typing the quantity every time you need it. This way, if you need to change the value of that quantity, you only need to change it in one place, and the rest of your code will automatically use the new value. This is much faster and less error-prone than having to change the value in every place it is used. Even if you don't plan on changing the value of a quantity, it is still good practice to assign it to a variable, and it makes your code more readable.

For example, say you are doing a calculation with the number of units of a product you are selling, the cost per unit, and the total cost. You could write the following code:

quantity <- 10
cost.per.unit <- 5
total.cost <- quantity * cost.per.unit

cat("Total cost for", quantity, "units: $", total.cost, "\n")

grams.conversion.factor <- 1000
grams.quantity <- quantity * grams.conversion.factor
cat(quantity, "units is equal to", grams.quantity, "grams\n")

pounds.conversion.factor <- 0.00220462 # pounds per gram
pounds.quantity <- grams.quantity * pounds.conversion.factor
cat(quantity, "units is equal to", pounds.quantity, "pounds\n")

This code is easy to edit, and easy to understand, even if you don't know R. There are no "magic numbers" (meaning numbers that appear in code without context or explanation), and any change in the quantity will automatically update the total cost, grams, and pounds. The alternative, retyping the number in each place quantity is used, is much more error-prone and difficult.

Good Variable Names Make Readable Code

Variable names in R can contain letters, numbers, underscores and periods, but cannot start with a number or an underscore. For example, my.variable is a valid variable name, but 1variable is not. Generally, it is a good idea in R to use all lowercase letters with variables, with words separated by periods.

Although it can be a bit of a challenge, it is pivotal to give variables names that are descriptive of what they represent. For example, temp is a bad variable name. What does it represent? Temperature? Temporary? If you knew it meant temporary, would you know what it was temporary for? An example of a better name would be total.cost, which is descriptive of what it represents. This is a simple example, but it is important to give variables descriptive names, especially when you are working with more complicated code.
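As a sketch of the difference, the two halves of the code below do the same calculation, a made-up temperature conversion. Only the second half tells the reader what is going on.

```r
# Hard to read: what do temp and x2 stand for?
temp <- 21.5
x2 <- temp * 9 / 5 + 32

# Easier to read: the names say what the values are.
temperature.celsius <- 21.5
temperature.fahrenheit <- temperature.celsius * 9 / 5 + 32
temperature.fahrenheit # 70.7
```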

There are generally sets of conventions that people follow when coding, often referred to as a style guide (Google's R style guide is one example). Although you don't have to follow the exact recommendations I gave above, I sincerely urge you to follow these rules when creating variables:

Make descriptive variable names.
Be consistent with how you name variables (e.g., periods vs underscores, lower or upper case, etc.)
Do not use abbreviated words in your variable names. Autocomplete in RStudio leaves you with no excuse to avoid long variable names.

Many, many people ignore conventions of coding, and instead aim to just make code that works. This is a bad idea. It is painful to read and edit messy code, and science is riddled with poor code that slows down scientific progress. The effort it takes to make code readable and easy to edit is well worth it. Even if you are the only person who will ever see your code, you will thank yourself later for using consistent conventions.

Conclusion

In this chapter, I introduced the most basic data types in R, and how to use variables to manipulate them. Although these data types are simple, they are the building blocks of all data in R, so it is important to understand them well. In the section below, you'll get a chance to practice using these data types and variables.

Practice

Use Atomic Vectors

I will briefly introduce some functionality of vectors that will be expounded upon in later chapters: (1) indexing with [], which gets the entry of a vector at the specified index, and (2) the length() function, which returns the length of a vector. With this information, run this code in your console:

vector.of.numbers <- c(8, 1, 8, 2)
length(vector.of.numbers)

vector.of.numbers[1]
vector.of.numbers[2]
vector.of.numbers[3]
vector.of.numbers[4]
vector.of.numbers[5]

vector.of.numbers[1] <- 4
length(vector.of.numbers)

vector.of.numbers[5] <- 10
length(vector.of.numbers)

Answer the following questions:

What's the length of the vector just after it is defined on line 1?
How do you interpret the output of lines 4-7?
How is line 8 different from lines 4-7? How do you explain its output?
What does line 10 do? How does the vector change?
What does line 13 do? How does the vector change?

Make Large Vectors from Scratch

As discussed above, the : operator is used to make a vector of sequential numbers, and operations (+, -, etc.) can be used on vectors to make new vectors. Using these two facts, make a vector of the numbers 1 through 100, and assign it to the variable large.vector. Then, use multiplication to change this vector to be all the even numbers up to 200. Finally, use the length() function to check that the vector is still of length 100.

Now, follow the same concept to generate a 101-element vector that contains the numbers 0 through 25, with steps of 0.25.

Clean Up Code

The following code is trying to find the perimeter and area of a circle given a radius of 5. The code works, but is poorly written. Edit it so that it is more readable. Specifically:

Remove magic numbers and replace them with the relevant variables (Note: constant values from formulas are not considered magic numbers).
Put a single space between each operator, variable, etc.
Give the variables descriptive names without abbreviations. Use a naming convention of your choice for all variables.

When you're done, the code should be easy to read and understand, even for someone who doesn't know R. This sort of clean-up editing is often referred to as refactoring.

circrad <- 5
circlePerm<-2*pi* 5
A <- 3.14 *5 ^ 2
cat("The perimeter of a circle with radius", circrad, "is", circlePerm, ".\n")
cat("The area of this circle is", A, ".\n")