Session Prep: Download the dataset we will be working with for this session: Session_1.txt
This dataset contains experimental data on the pharmacokinetics of theophylline, a drug used in the treatment of COPD and asthma. Here are the descriptions of each of the variables:
Subject - a number identifying the subject on whom the observation was made. The ordering is by increasing maximum concentration of theophylline observed.
Wt - weight of the subject (kg).
Dose - dose of theophylline administered orally to the subject (mg/kg).
Time - time since drug administration when the sample was drawn (hr).
conc - theophylline concentration in the serum sample (mg/L).
Just like last time, let’s read in our dataset, “Session_1.txt”. Last session, we learned two ways to import data:
Session_1 <- read.delim("~/Desktop/Session_1.txt")
Session_1 <- read.delim(file.choose())
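If you’d like to see what read.delim() does without the course file on hand, here is a minimal sketch. It writes a tiny tab-delimited file to a temporary location and reads it back. The three hand-typed rows and the names toy_path and toy are assumptions made for illustration; they are not the real Session_1.txt.

```r
# Toy example: a few hand-typed rows, not the real Session_1.txt contents.
toy_path <- tempfile(fileext = ".txt")
writeLines(c("Subject\tWt\tDose\tTime\tconc",
             "1\t79.6\t4.02\t0.00\t0.74",
             "1\t79.6\t4.02\t0.25\t2.84",
             "2\t72.4\t4.40\t0.00\t0.00"),
           toy_path)

# read.delim() expects tab-separated columns with a header row by default.
toy <- read.delim(toy_path)

str(toy)   # column names and types
nrow(toy)  # number of observations that were read
```

The first line of the file becomes the column names, which is why we can later write things like Session_1$Wt.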
Now let’s preview our data before moving on to R logical operators! You can preview a data frame by using the View() function. If you’re feeling really fancy, try using the tab-autocomplete feature to enter the object name:
View(Session_1)
To learn more about a function, you can use the help() function or the ? notation:
help(View)
?View
Here are the logical operators we will be using today:
operator | meaning |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or equal to |
== | exactly equal to |
!= | not equal to |
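Before applying these to the dataset, it can help to try each operator on plain numbers. The sketch below uses made-up values (a, b, and weights are hypothetical names) and also shows that a comparison on a vector is carried out element by element.

```r
a <- 3
b <- 7

a < b    # less than
a <= b   # less than or equal to
a > b    # greater than
a >= b   # greater than or equal to
a == b   # exactly equal to
a != b   # not equal to

# Comparisons are vectorized: each element is compared in turn.
weights <- c(79.6, 72.4, 54.6)
heavy <- weights > 70
heavy
```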
Now that we have the dataset uploaded, let’s give some of these operators a spin. But first, we need to figure out how to access some of our data. From last time, we learned how to list a particular column in a dataset. For example, if I wanted the “weight” (Wt) column of the Session_1 dataset, I would use the following command:
Session_1$Wt
And the output would look something like this:
[1] 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6
[12] 72.4 72.4 72.4 72.4 72.4 72.4 72.4 72.4 72.4 72.4 72.4
[23] 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5
[34] 72.7 72.7 72.7 72.7 72.7 72.7 72.7 72.7 72.7 72.7 72.7
[45] 54.6 54.6 54.6 54.6 54.6 54.6 54.6 54.6 54.6 54.6 54.6
[56] 80.0 80.0 80.0 80.0 80.0 80.0 80.0 80.0 80.0 80.0 80.0
[67] 64.6 64.6 64.6 64.6 64.6 64.6 64.6 64.6 64.6 64.6 64.6
[78] 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5 70.5
[89] 86.4 86.4 86.4 86.4 86.4 86.4 86.4 86.4 86.4 86.4 86.4
[100] 58.2 58.2 58.2 58.2 58.2 58.2 58.2 58.2 58.2 58.2 58.2
[111] 65.0 65.0 65.0 65.0 65.0 65.0 65.0 65.0 65.0 65.0 65.0
[122] 60.5 60.5 60.5 60.5 60.5 60.5 60.5 60.5 60.5 60.5 60.5
But what about getting access to a particular value? It’s easy! Just provide the row and column coordinates, and out pops your value. For example, if I wanted to get the value in row 3, column 4 of the Session_1 dataset, I would enter:
Session_1[3,4]
And the output should look like this:
[1] 0.57
Let’s break that last bit down for you. The bracket notation allows you to select specific rows and columns from a data frame. The basic syntax is data.frame[rows,columns]. If you leave out the rows or the columns, you’ll get all rows or all columns, respectively. Try these commands. What do you get?
Session_1[3,]
Session_1[,2]
You can also specify ranges of numbers using a colon. For example, if I wanted the values in rows 3-6 in column 4, I would write:
Session_1[3:6,4]
And the output should look like this:
[1] 0.57 1.12 2.02 3.82
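Here is the same bracket notation on a small made-up data frame, in case you want to experiment without touching Session_1. The data frame toy and its values are assumptions for illustration only.

```r
toy <- data.frame(Subject = c(1, 1, 1, 2, 2, 2),
                  Time    = c(0, 1, 2, 0, 1, 2),
                  conc    = c(0.0, 5.1, 4.2, 0.0, 6.3, 5.0))

single  <- toy[2, 3]         # one value: row 2, column 3
one_row <- toy[2, ]          # all columns of row 2 (still a data frame)
one_col <- toy[, 3]          # all rows of column 3 (a plain vector)
some    <- toy[2:3, 3]       # rows 2 to 3 of column 3
by_name <- toy[2:3, "conc"]  # same selection, using the column name
```

Note that selecting a whole row returns a data frame, while selecting a single column returns a plain vector.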
There are many other ways to select parts of your data, but that will do for now. Now let’s do some simple R logic!
Let’s start by checking if the value in row 1, column 1 is equal to row 2, column 1. To do this, we would enter the following:
Session_1[1,1] == Session_1[2,1]
And we should get the following output:
[1] TRUE
Turns out that the data in rows 1 and 2 both belong to Subject 1! Good thing we checked :-)
IMPORTANT: make sure you use two equal signs (==) when checking if two values are equal. A single equal sign is an assignment: it will overwrite the position on the left side in the dataset with the value on the right side!
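To see the difference between == and = safely, here is a small sketch on a throwaway data frame (toy is a made-up object, so nothing in Session_1 is changed).

```r
toy <- data.frame(Subject = c(1, 1, 2))

# Comparison: returns TRUE or FALSE and leaves the data untouched.
same_before <- toy[1, 1] == toy[3, 1]

# Assignment: a single = overwrites row 1 with the value from row 3.
toy[1, 1] = toy[3, 1]

same_after <- toy[1, 1] == toy[3, 1]
toy$Subject
```

After the assignment the two positions compare as equal, but only because the original value in row 1 has been lost.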
Now let’s try checking if two values are not equal. Let’s start by checking if the value in row 1, column 1 is not equal to the value in row 12, column 1. To do this, we would enter the following:
Session_1[1,1] != Session_1[12,1]
And we should get the following output:
[1] TRUE
Indeed, the data in row 12 belongs to subject 2. So when we compared the value in Session_1[1,1] (which was “1”) with the value in Session_1[12,1] (which was “2”), it is true that these values are not equal.
So now let’s do something a little more practical with the remaining logical operators. I have a question for you: over the course of this experiment, which subject had a higher maximum drug concentration in their bloodstream, subject 1 or subject 2? To answer this question, let’s combine two concepts you’ve already learned about: max() and selecting a range of data. We’ll also use the logical operator >= to check whether subject 1’s maximum is at least as high as subject 2’s. Believe it or not, we can do all of this in one line! See for yourself:
max(Session_1[1:11,5]) >= max(Session_1[12:22,5])
And we should get the following output:
[1] TRUE
Basically all we’re telling R to do is the following:
1. Find the maximum value in row 1 thru 11 in column 5 (which is the concentration data for subject 1)
2. Find the maximum value in row 12 thru 22 in column 5 (which is the concentration data for subject 2)
3. Compare the two maximum values to see if subject 1’s is at least as high as subject 2’s.
Now imagine going through thousands of rows of data by hand looking for maximum values to compare! Phew! But we’re not done yet! Let’s just say (for fun) that the toxic concentration of theophylline in the bloodstream is 10.00 mg/L. Do subject 1’s theophylline levels ever go past that value? Let’s check if he/she is safe (if subject 1’s maximum concentration is less than or equal to the toxic threshold of 10.00 mg/L):
max(Session_1[1:11,5]) <= 10.00
And we should get the following output:
[1] FALSE
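The two one-line checks above can also be written as named steps, which makes it easier to see what is being compared. This is only a sketch on made-up numbers: toy, max_1, max_2, and threshold are hypothetical names, and the threshold of 10 is the same just-for-fun value used in the text.

```r
toy <- data.frame(Subject = c(1, 1, 1, 2, 2, 2),
                  conc    = c(0.7, 10.5, 6.9, 0.0, 8.3, 4.1))

max_1 <- max(toy[1:3, 2])   # rows 1-3 hold subject 1
max_2 <- max(toy[4:6, 2])   # rows 4-6 hold subject 2
threshold <- 10             # hypothetical threshold, as in the text

max_1 >= max_2       # is subject 1's peak at least as high as subject 2's?
max_1 <= threshold   # does subject 1 stay at or below the threshold?
```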
Uh oh, looks like we’ve passed the threshold! Perhaps we should alert the researchers to monitor subject 1 for theophylline overdose symptoms (cardiotoxicity and neurotoxicity).
Finally, instead of specifying column numbers and row numbers (e.g. Session_1[2,5]), we can use logical criteria and column names. Using the above example, we can select the observations with concentrations at or above the threshold, along with the subjects they belong to:
Session_1[Session_1$conc >= 10, c("Subject","conc")]
Session_1[Session_1$conc >= 10, c(1,5)]
Notice that we used the threshold to select the rows (Session_1$conc >= 10) and we selected two columns (c("Subject","conc")). You may be wondering about the syntax for selecting multiple columns. You can use c() to specify either a vector of column names or a vector of column numbers. Additionally, to make your code more readable and reusable, you can assign the row and column selection criteria to objects:
whichRows <- Session_1$conc >= 10
whichCols <- c("Subject","conc")
Session_1[whichRows,whichCols]
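It can be helpful to look at what a row criterion such as whichRows actually contains. The sketch below uses a made-up four-row data frame (toy and selected are hypothetical names) to show that the criterion is just a vector with one TRUE or FALSE per row.

```r
toy <- data.frame(Subject = c(1, 1, 2, 2),
                  conc    = c(10.5, 6.9, 8.3, 11.4))

whichRows <- toy$conc >= 10
whichRows        # one TRUE/FALSE per row
sum(whichRows)   # TRUE counts as 1, so this is the number of matching rows

selected <- toy[whichRows, c("Subject", "conc")]
selected
```

Only the rows where the criterion is TRUE are kept, which is why the result has two rows here.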
Use what you’ve learned about logical operators and subsetting data frames with the bracket notation (answers at end).
1. Is the dose for subject 1 greater than, less than, or equal to the dose for subject 2?
2. Does subject 4 weigh more than subject 6?
3. Is the mean concentration for subject 5 greater than, less than, or equal to the mean concentration for subject 7?
4. Is 4 not equal to 5 (prove it to yourself)?
5. What is the average concentration for Subject 7?
6. What is the minimum concentration when the time is greater than 5 hours?
7. Which subject corresponds to the minimum concentration found in #6?
See Answers
In this section we will introduce 2 functions: sort() and order(). Briefly, this is what each function does:
sort(): returns the values from a data range in sorted order
order(): returns the row numbers that would put a data range in sorted order (the row numbers that the sorted values came from)
This distinction is very important as we will soon see.
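A tiny made-up vector shows the distinction at a glance. The name doses and its four values are assumptions for illustration, not taken from the dataset.

```r
doses <- c(4.40, 3.10, 5.86, 4.02)

sorted_values <- sort(doses)    # the values themselves, smallest first
sorted_rows   <- order(doses)   # positions of those values in the original vector

sorted_values
sorted_rows

# Indexing with the positions reproduces the sorted values.
doses[sorted_rows]
```

Here the smallest value (3.10) sits in position 2, so order() starts with 2 while sort() starts with 3.10.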
Let’s say we wanted a sorted list of the dosages given to the subjects. We would simply sort the specified column in our data set:
sort(Session_1$Dose)
And we should get the following output:
[1] 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10
[12] 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
[23] 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02
[34] 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40
[45] 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40
[56] 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53
[67] 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53
[78] 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92
[89] 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95
[100] 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30
[111] 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50
[122] 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86
Notice that these are the values that were contained in each row.
By the way, we can also sort the dose by decreasing value. We just need to set the decreasing argument to TRUE. Where was the decreasing argument in the first example? Well, if we don’t explicitly set it to TRUE, R will assume it to be FALSE. Therefore, the default for the sort() function is ascending order. Let’s try it out:
sort(Session_1$Dose, decreasing = TRUE)
And we should get the following output:
[1] 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86 5.86
[12] 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50 5.50
[23] 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30 5.30
[34] 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95 4.95
[45] 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92 4.92
[56] 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53
[67] 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53 4.53
[78] 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40
[89] 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40 4.40
[100] 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02
[111] 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00 4.00
[122] 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10 3.10
Now let’s say we wanted to sort a data frame (Session_1) according to a sorted list of values. Well, R doesn’t know how to do this right off the bat. For instance, in our output above, the first 11 values are all the same. Which row is which?? Fear not! We can simply tell R to sort the whole data frame by a particular order of rows that corresponds to our sorted list. But we’ll have to use a new function, order(). Let’s try it out using dose again.
order(Session_1$Dose)
And we should get the following output:
[1] 89 90 91 92 93 94 95 96 97 98 99 56 57
[14] 58 59 60 61 62 63 64 65 66 1 2 3 4
[27] 5 6 7 8 9 10 11 12 13 14 15 16 17
[40] 18 19 20 21 22 34 35 36 37 38 39 40 41
[53] 42 43 44 23 24 25 26 27 28 29 30 31 32
[66] 33 78 79 80 81 82 83 84 85 86 87 88 111
[79] 112 113 114 115 116 117 118 119 120 121 67 68 69
[92] 70 71 72 73 74 75 76 77 122 123 124 125 126
[105] 127 128 129 130 131 132 100 101 102 103 104 105 106
[118] 107 108 109 110 45 46 47 48 49 50 51 52 53
[131] 54 55
Notice that the output now is the row numbers that correspond to our sorted values for dose. Rows 89-99 correspond to value 3.10, just like in our sort() example.
Now let’s see it in action! We will be sorting the Session_1 data frame by the dose order (don’t forget the “,” at the end).
Session_1[order(Session_1$Dose),]
And we should get the whole data frame printed back, with its rows rearranged by increasing dose.
Check it out! Rows 89-99 are at the top of the list, just like promised :-) But what the heck was that comma? How did R know to apply that list of ordered rows to all my columns? Well, R is smart like that. If you simply leave the comma, with no specified columns, it will assume you are referencing all the columns in the data frame.
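The same pattern on a three-row made-up data frame makes the role of the comma easy to see. The names toy, row_order, and toy_sorted are hypothetical.

```r
toy <- data.frame(Subject = c(1, 2, 3),
                  Dose    = c(4.02, 4.40, 3.10))

row_order  <- order(toy$Dose)    # positions of the doses, smallest first
toy_sorted <- toy[row_order, ]   # the comma means "these rows, all columns"

toy_sorted
rownames(toy_sorted)   # the original row numbers travel with the rows
```

The row names printed on the left are the original row numbers, which is how you can tell where each sorted row came from.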
Here are a few more questions to practice sorting data
1. Sort the Session_1 data frame by the Time variable and store it in a new object called sorted.
2. Sort the Session_1 data frame by the Dose and conc variables and store it in a new object called sorted2.
3. Sort the Session_1 data frame by conc from the largest to smallest values and store it in a new object called sorted3.
See Answers
Last session we learned how to produce a simple scatter plot with our data. The function plot() takes the x and y coordinates as arguments. For our purposes today, let’s plot concentration vs. time data for subject 1 (rows 1-11):
plot(Session_1$Time[1:11],Session_1$conc[1:11])
Notice how we don’t have to specify the column in the brackets. That’s because we’re already in the column we want (it’s specified by the ‘Time’ or ‘conc’ after the ‘$’ sign).
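To convince yourself that the two notations pick out the same values, here is a small sketch on a made-up data frame (toy, via_dollar, and via_bracket are hypothetical names).

```r
toy <- data.frame(Time = c(0, 0.25, 0.57, 1.12),
                  conc = c(0.74, 2.84, 6.57, 10.50))

via_dollar  <- toy$Time[1:3]      # pick the column first, then elements 1-3
via_bracket <- toy[1:3, "Time"]   # pick rows 1-3 of the Time column in one step

identical(via_dollar, via_bracket)
```

Once a column has been pulled out with $, it is a plain vector, so a single index inside the brackets is all that is needed.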
To make a line plot, we can still use the plot() function; we just need to specify that it will be a line graph. Notice that we added the type argument in our function. Here, "l" stands for line. There are other options for other types of graphs too. Let’s try it out:
plot(Session_1$Time[1:11],Session_1$conc[1:11], type="l")
That looks like a pretty good graph so far, but let’s do one better. This time, let’s add in a nice title, as well as some axes labels. We can do this by specifying more arguments, this time using: main, xlab, ylab, which specify the title, x-axis, and y-axis, respectively. The code should look something like this:
plot(Session_1$Time[1:11],Session_1$conc[1:11], type="l", main="Concentration over Time", xlab="Time (hours)", ylab="Concentration (mg/L)")
Now that we’ve made our improved graph, let’s push the envelope one more time. This time we’re going to plot two sets of data on the same graph. This part gets a little tricky, so hold on to your hats! Let’s take it step by step. First, let’s plot our data just like we did before. This time, let’s also make it red using the “col” argument and setting it to “red”:
plot(Session_1$Time[1:11],Session_1$conc[1:11], type="l", main="Concentration over Time", xlab="Time (hours)", ylab="Concentration (mg/L)", col="red")
Now we want to add another line plot onto this graph. For this, we will need to introduce a new function, lines(). This function will allow you to add more data to your existing plot without erasing it! Let’s try it out with concentration data from subject 2. This time, let’s make the line blue:
lines(Session_1$Time[12:22],Session_1$conc[12:22], col="blue")
And our new and improved graph should look like this:
Looking pretty professional! But which line corresponds to which set of data? Sounds like we could use a legend…
Let’s add one in using the legend() function. Here we specify the position ("topright"), the title ("Subjects"), a vector containing the labels (c('Subject 1','Subject 2')), the line widths shown in the legend (lwd=c(1.0,1.0)), and the line colors, in the same order as the labels (col=c("red","blue")).
legend("topright",title="Subjects",c('Subject 1','Subject 2'),lwd=c(1.0,1.0),col=c("red","blue"))
The final graph should look like this:
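The plot(), lines(), and legend() steps can also be tried on made-up numbers. One thing worth knowing is that lines() does not rescale the axes, so a second curve that goes higher than the first would be cut off. The sketch below, which uses hypothetical vectors time, conc_a, and conc_b, works around that by setting ylim from the range of both curves before drawing.

```r
# Toy data: two made-up concentration curves measured at the same times.
time   <- c(0, 1, 2, 4, 8)
conc_a <- c(0.0, 6.0, 5.0, 3.5, 1.5)
conc_b <- c(0.0, 9.0, 7.5, 5.0, 2.0)

# lines() does not rescale the axes, so set ylim up front to cover both curves.
y_limits <- range(conc_a, conc_b)

plot(time, conc_a, type = "l", col = "red", ylim = y_limits,
     main = "Concentration over Time",
     xlab = "Time (hours)", ylab = "Concentration (mg/L)")
lines(time, conc_b, col = "blue")
legend("topright", title = "Subjects",
       legend = c("Subject A", "Subject B"),
       lwd = 1, col = c("red", "blue"))
```

Try removing the ylim argument to see the top of the blue curve disappear.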
You’ll learn more about plot customization in future sessions, but this should suffice for now. Let’s move on to our next type of graph, the box plot. Believe it or not, box plots are just as easy to make as line plots; they just use a different function, boxplot(). This time, we’re going to plot concentration data for subjects 1, 2, and 3, and make the boxes blue, green, and red, respectively. We’ve also added labels using the names argument.
boxplot(Session_1$conc[1:11],Session_1$conc[12:22],Session_1$conc[23:33],col=c("blue","green","red"),names=c("Subject 1","Subject 2","Subject 3"))
The next graph we’ll be creating is the bar plot using the function barplot(). In this example, we’ll be plotting the maximum concentrations for the first 5 subjects over the course of the experiment. We’ll be using the max() function to determine the maximum value for each subject’s concentration range. The barplot() function will take a vector that contains our data:
c(max(Session_1$conc[1:11]),max(Session_1$conc[12:22]),max(Session_1$conc[23:33]),max(Session_1$conc[34:44]),max(Session_1$conc[45:55]))
As well as a vector that contains our labels, in the same order, using the names argument (short for names.arg): names=c("Subject 1","Subject 2","Subject 3","Subject 4","Subject 5")
The code should look like this (we also threw in a title and a y-label):
barplot(c(max(Session_1$conc[1:11]),max(Session_1$conc[12:22]),max(Session_1$conc[23:33]),max(Session_1$conc[34:44]),max(Session_1$conc[45:55])), main="Maximum Concentration", names=c("Subject 1","Subject 2","Subject 3","Subject 4","Subject 5"),ylab="concentration (mg/L)")
Ok, those last few commands are a bit long. To make them easier to read, we can 1) store the values to be plotted in objects and 2) split them across a few lines:
maximums <- c(max(Session_1$conc[1:11]),
max(Session_1$conc[12:22]),
max(Session_1$conc[23:33]),
max(Session_1$conc[34:44]),
max(Session_1$conc[45:55]))
names <- c("Subject 1","Subject 2","Subject 3","Subject 4","Subject 5")
barplot(maximums,
main="Maximum Concentration",
names=names,
ylab="concentration (mg/L)")
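Typing row ranges such as 1:11 by hand works, but it depends on knowing exactly which rows belong to which subject. As an alternative sketch, the logical subsetting from earlier in this session can select each subject’s rows by value instead. The data frame toy and the object labels below are made up for illustration.

```r
toy <- data.frame(Subject = c(1, 1, 2, 2, 3, 3),
                  conc    = c(2.0, 10.5, 8.3, 4.0, 6.1, 7.9))

# Select each subject's rows by value instead of by row range.
maximums <- c(max(toy$conc[toy$Subject == 1]),
              max(toy$conc[toy$Subject == 2]),
              max(toy$conc[toy$Subject == 3]))
labels <- c("Subject 1", "Subject 2", "Subject 3")

barplot(maximums,
        main = "Maximum Concentration",
        names.arg = labels,
        ylab = "concentration (mg/L)")
```

This version keeps working even if the rows are sorted differently, because it never refers to row numbers.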
Awesome! Now let’s move on to our last type of plot, the histogram. For this data, we’ll just look at the distribution of concentrations over the experiment for all our subjects. We can do this by using the function hist().
It’s easy, the code should look like this:
hist(Session_1$conc,main="Concentrations",xlab="Concentration Range (mg/L)")
That’s not bad: we have the frequency of concentrations in each integer-wide bin up to 12. But what if we wanted a higher-resolution histogram? After all, not all the concentration values are integers. We can do this by increasing the number of bins using the breaks argument. Let’s modify our histogram to ask for 20 bins instead (R treats a single number given to breaks as a suggestion, so the exact count may differ).
hist(Session_1$conc,main="Concentrations",xlab="Concentration Range (mg/L)",breaks=20,xlim=c(0, 12))
Here we also changed the minimum/maximum boundaries in the x-axis to include our range of bins (without it, the x-axis would look a little wonky, see for yourself if you’re interested). We should get a histogram that looks like this:
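If you are curious which bins R actually chose, hist() can return its binning without drawing anything when plot = FALSE is set. The sketch below uses a made-up vector conc (not the dataset) and compares the default binning with a request for about 20 bins.

```r
# Toy data: made-up concentrations between 0 and 12.
conc <- c(0.0, 0.4, 1.2, 2.5, 3.3, 3.9, 4.8, 5.1, 6.6, 7.2, 8.4, 9.9, 11.4)

# plot = FALSE returns the binning without drawing anything.
coarse <- hist(conc, plot = FALSE)
fine   <- hist(conc, breaks = 20, plot = FALSE)

coarse$breaks   # bin edges R chose by default
fine$breaks     # bin edges after asking for about 20 bins
fine$counts     # number of values in each bin
```

Comparing the two sets of breaks shows how the bins get narrower when more of them are requested.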
Now let’s practice your plotting skills
(hint: use help() to look up the formula method (x~y) of making a box plot).
Advanced: Using the Session_1 data frame, make a line plot of time versus concentration for subjects 1, 5, 8, and 12. (hint: First use plot() for subject 1, then use lines() for the remaining subjects. Also use help() to figure out how to specify the x and y axis limits using the range() function)
And there we have it! You did it! Welcome to the wonderful world of R plotting :-) Next session you will learn how to customize your plots to your heart’s content. For now, digest these new methods and feel free to explore different types of arguments for these functions, as there is a whole lot more to these plots if you dig a little deeper!
Use what you’ve learned about logical operators and subsetting data frames with the bracket notation (answers at end).
Back to Quiz 1
1. Is the dose for subject 1 greater than, less than, or equal to the dose for subject 2?
Session_1[1,3] > Session_1[12,3]
[1] FALSE
Session_1[1,3] == Session_1[12,3]
[1] FALSE
Session_1[1,3] < Session_1[12,3]
[1] TRUE
2. Does subject 4 weigh more than subject 6?
Session_1[34,2] > Session_1[56,2]
[1] FALSE
Session_1[34,2] == Session_1[56,2]
[1] FALSE
Session_1[34,2] < Session_1[56,2]
[1] TRUE
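3. Is the mean concentration for subject 5 greater than, less than, or equal to the mean concentration for subject 7?
mean(Session_1[45:55,5]) > mean(Session_1[67:77,5])
[1] TRUE
mean(Session_1[45:55,5]) == mean(Session_1[67:77,5])
[1] FALSE
mean(Session_1[45:55,5]) < mean(Session_1[67:77,5])
[1] FALSE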
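4. Is 4 not equal to 5 (prove it to yourself)?
4 != 5
[1] TRUE
5. What is the average concentration for Subject 7?
mean(Session_1[Session_1$Subject == 7, "conc"])
[1] 3.910909
6. What is the minimum concentration when the time is greater than 5 hours?
minConc <- min(Session_1[Session_1$Time > 5, "conc"])
7. Which subject corresponds to the minimum concentration found in #6?
Session_1[Session_1$conc == minConc, "Subject"]
[1] 11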
Here are a few more questions to practice sorting data
Back to Quiz 2
1. Sort the Session_1 data frame by the Time variable and store it in a new object called sorted.
sorted <- Session_1[order(Session_1$Time),]
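2. Sort the Session_1 data frame by the Dose and conc variables and store it in a new object called sorted2.
sorted2 <- Session_1[order(Session_1$Dose,Session_1$conc),]
3. Sort the Session_1 data frame by conc from the largest to smallest values and store it in a new object called sorted3.
sorted3 <- Session_1[order(Session_1$conc, decreasing = TRUE),]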
Now let’s practice your plotting skills
Back to Quiz 3
1. Read this file into a data frame called m: Mel.txt
m <- read.delim(file.choose())
plot(m$age,m$thickness)
(hint: use help() to look up the formula method (x~y) of making a box plot).
statusColors <- c("darkcyan","hotpink","lawngreen")
boxplot(m$thickness ~ m$status, col=statusColors)
hist(m$thickness)
hist(m$age)
hist(m$age[m$sex == 1])
hist(m$age[m$sex == 0])
plot(m$age,m$thickness,
main="Chart title",
xlab="Age",
ylab="Thickness")
Advanced: Using the Session_1 data frame, make a line plot of time versus concentration for subjects 1, 5, 8, and 12. (hint: First use plot() for subject 1, then use lines() for the remaining subjects. Also use help() to figure out how to specify the x and y axis limits using the range() function). Note: this is an advanced solution and we’ll learn more about it in the next session, but try to understand the code as best you can. In particular, understand what paste("Subjects", plotSubjects), plotSubjects[1], and plotColors[3] are doing.
## Make some vectors that will help us plot
plotColors <- c("olivedrab","steelblue","darkmagenta","burlywood4")
plotSubjects <- c(1,5,8,12)
lineNames <- paste("Subjects", plotSubjects)
## Plot the first line for subject 1
plot(Session_1$Time[Session_1$Subject == plotSubjects[1]],
Session_1$conc[Session_1$Subject == plotSubjects[1]],
type="l",
col=plotColors[1],
xlim=range(Session_1$Time),
ylim=range(Session_1$conc),
main="Concentration over Time for Subjects 1, 5, 8, and 12",
xlab="Time",
ylab="Concentration")