Session Prep: Download the dataset we will be working with for this session: Session_1.txt
If you want, you can download a PDF version of the lesson here.
This dataset contains experimental data on the pharmacokinetics of theophylline, a drug used in the treatment of COPD and asthma. Here are the descriptions of each of the variables:
Subject - a number identifying the subject on whom the observation was made. The ordering is by increasing maximum concentration of theophylline observed.
Wt - weight of the subject (kg).
Dose - dose of theophylline administered orally to the subject (mg/kg).
Time - time since drug administration when the sample was drawn (hr).
conc - theophylline concentration in the serum sample (mg/L).
RStudio will give you a preview of the dataset that looks a lot like an Excel table:
If you look in the console window you’ll see that two commands were executed. The first line stores the dataset as a data.frame named Session_1. The second line opens a preview of the data.frame. Note: The file path will be different depending on where the Session_1.txt file exists on your computer.
Session_1 <- read.delim("/Desktop/Session_1.txt")
View(Session_1)
Read in the dataset again using the read.delim() function and a new function, file.choose(), and store it in a data.frame named s1:
s1 <- read.delim(file.choose())
You should now have two data.frames listed in your environment: s1 and Session_1
Optional: Check to see if the two data.frames are identical:
identical(s1,Session_1)
[1] TRUE
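As a minimal sketch of what read.delim() and identical() are doing here, the example below writes a few rows to a temporary tab-delimited file and reads that file back twice. The names demo_rows, demo_file, first_read and second_read are made up for illustration, and the temporary file is only a stand-in for Session_1.txt so that the code runs without the download.

```r
# Three rows of the theophylline data, typed in by hand for illustration
demo_rows <- data.frame(
  Subject = c(1L, 1L, 1L),
  Wt      = c(79.6, 79.6, 79.6),
  Dose    = c(4.02, 4.02, 4.02),
  Time    = c(0.00, 0.25, 0.57),
  conc    = c(0.74, 2.84, 6.57)
)

# Write a tab-delimited text file to a temporary location
# (hypothetical stand-in for Session_1.txt)
demo_file <- tempfile(fileext = ".txt")
write.table(demo_rows, demo_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read the same file twice, as was done with Session_1 and s1
first_read  <- read.delim(demo_file)
second_read <- read.delim(demo_file)

# Two data.frames read from the same file should be identical
identical(first_read, second_read)
```

Because both data.frames come from the same file, identical() should return TRUE, just as it does for s1 and Session_1.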
Optional: Use the View() function to display the s1 data.frame.
There are five primary functions we will use to explore a dataset:
summary()
head()
tail()
str()
View() #see exercise 1.1
The summary() function will display summary statistics on each variable in a data.frame. Summarize the s1 data.frame by entering the following in the console window:
summary(s1)
You should see this output in the console window:
Subject Wt Dose
Min. : 1.00 Min. :54.60 Min. :3.100
1st Qu.: 3.75 1st Qu.:63.58 1st Qu.:4.305
Median : 6.50 Median :70.50 Median :4.530
Mean : 6.50 Mean :69.58 Mean :4.626
3rd Qu.: 9.25 3rd Qu.:74.42 3rd Qu.:5.037
Max. :12.00 Max. :86.40 Max. :5.860
Time conc
Min. : 0.000 Min. : 0.000
1st Qu.: 0.595 1st Qu.: 2.877
Median : 3.530 Median : 5.275
Mean : 5.895 Mean : 4.960
3rd Qu.: 9.000 3rd Qu.: 7.140
Max. :24.650 Max. :11.400
Tip: RStudio has a feature called tab-complete that can dramatically reduce the amount of typing you’ll need to do. In the console window, if you type “sum” and then hit the tab key, you should see the summary() function listed as the second option. Use the down arrow to select summary and then hit enter. This works anytime you’re typing: if you type “s” and hit tab, you should see the s1 data.frame at the top of the list. This is like autocomplete on your phone or when searching the web. You’ll also notice that when you type an opening parenthesis, RStudio adds the closing parenthesis automatically.
The head() and tail() functions will display the first six lines and the last six lines of a data.frame in the console window, respectively. Display the first six lines of the s1 data.frame:
head(s1)
The output should look like this:
Subject Wt Dose Time conc
1 1 79.6 4.02 0.00 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.57 6.57
4 1 79.6 4.02 1.12 10.50
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
Note: The first column, which doesn’t have a label, indicates the row numbers. Here, we see that rows 1-6 are displayed.
Now display the last few lines of the s1 data.frame:
tail(s1)
You should see this as the output (note that rows 127-132 are displayed):
Subject Wt Dose Time conc
127 12 60.5 5.3 3.52 9.75
128 12 60.5 5.3 5.07 8.57
129 12 60.5 5.3 7.07 6.59
130 12 60.5 5.3 9.03 6.11
131 12 60.5 5.3 12.05 4.57
132 12 60.5 5.3 24.15 1.17
You can also indicate the number of lines to display with an additional argument to both the head() and tail() functions. To display the first or last 10 lines, try:
head(s1, n=10)
tail(s1, n=10)
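To see how the n argument behaves, here is a small sketch on a six-row data.frame built from the head(s1) output above. The names demo, first_two and last_two are hypothetical and are used only so the code runs without the downloaded file.

```r
# Six rows copied from the head(s1) output shown earlier
demo <- data.frame(
  Subject = rep(1L, 6),
  Wt      = rep(79.6, 6),
  Dose    = rep(4.02, 6),
  Time    = c(0.00, 0.25, 0.57, 1.12, 2.02, 3.82),
  conc    = c(0.74, 2.84, 6.57, 10.50, 9.66, 8.58)
)

first_two <- head(demo, n = 2)  # rows 1 and 2
last_two  <- tail(demo, n = 2)  # rows 5 and 6
first_two
last_two

# Asking for more rows than exist simply returns every row
nrow(head(demo, n = 10))
```

Note that tail() keeps the original row numbers, which is why the s1 output above is labeled 127-132 rather than 1-6.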
The str() function (sometimes pronounced “stir”) displays the structure of a data.frame. Apply this function to the s1 data.frame:
str(s1)
The output should look like this:
'data.frame': 132 obs. of 5 variables:
$ Subject: int 1 1 1 1 1 1 1 1 1 1 ...
$ Wt : num 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
$ Dose : num 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
$ Time : num 0 0.25 0.57 1.12 2.02 ...
$ conc : num 0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
Tip: In the console window, you can use the up and down arrows to recall the previous commands you’ve entered. Use the up arrow to recall the command you used to display the first 10 lines of the s1 data.frame.
These commands will generate error messages indicating that a function can’t be found:
sumary(s1)
view(s1)
Read.Delim(file.choose())
These commands will generate error messages indicating that an object can’t be found:
summary(session_1)
View(S1)
head(s2)
Optional: Correct and re-enter the commands above.
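Both kinds of error above come from the same cause: R matches names exactly, including capitalization. The sketch below is one way to check this without triggering an error. It uses exists(), which is not covered in this lesson, and a made-up object named s1_demo.

```r
# A small object with a hypothetical name, used only for this illustration
s1_demo <- c(0.74, 2.84, 6.57)

# exists() reports whether R can find something with exactly this name
exists("read.delim")  # correctly spelled function name
exists("Read.Delim")  # same letters, wrong capitalization
exists("s1_demo")     # the object created above
exists("S1_demo")     # wrong capitalization, so no such object

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(summary(S1_demo), error = function(e) conditionMessage(e))
msg
```

The first and third calls should return TRUE and the other two FALSE, and msg holds the text of the "object not found" error.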
All of the functions you’ve learned in Part 2 can be applied to a specific variable (i.e. a column) stored in a data.frame. The syntax for specifying a variable name is data.frame$variable.
s1$Time
The output should look like this:
[1] 0.00 0.25 0.57 1.12 2.02 3.82 5.10 7.03 9.05
[10] 12.12 24.37 0.00 0.27 0.52 1.00 1.92 3.50 5.02
[19] 7.03 9.00 12.00 24.30 0.00 0.27 0.58 1.02 2.02
[28] 3.62 5.08 7.07 9.00 12.15 24.17 0.00 0.35 0.60
[37] 1.07 2.13 3.50 5.02 7.02 9.02 11.98 24.65 0.00
[46] 0.30 0.52 1.00 2.02 3.50 5.02 7.02 9.10 12.00
[55] 24.35 0.00 0.27 0.58 1.15 2.03 3.57 5.00 7.00
[64] 9.22 12.10 23.85 0.00 0.25 0.50 1.02 2.02 3.48
[73] 5.00 6.98 9.00 12.05 24.22 0.00 0.25 0.52 0.98
[82] 2.02 3.53 5.05 7.15 9.07 12.10 24.12 0.00 0.30
[91] 0.63 1.05 2.02 3.53 5.02 7.17 8.80 11.60 24.43
[100] 0.00 0.37 0.77 1.02 2.05 3.55 5.05 7.08 9.38
[109] 12.10 23.70 0.00 0.25 0.50 0.98 1.98 3.60 5.02
[118] 7.03 9.03 12.12 24.08 0.00 0.25 0.50 1.00 2.00
[127] 3.52 5.07 7.07 9.03 12.05 24.15
Note: The number of lines will depend on the width of the console window, but you should always see 132 values for this particular data.frame. We can check this by looking at the numbers in brackets at the beginning of each line. On the first line, [1] indicates that 0.00 is the first value of the Time variable. On the last line of the output above, [127] indicates that 3.52 is the 127th value of the Time variable. Counting across the row, we can see that 24.15 is the 132nd value.
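The bracketed numbers are positions in the vector, and R lets you ask for a position directly with square brackets. Bracket indexing is not otherwise covered in this lesson, so treat this as an optional sketch. The vector time_demo is a hypothetical name holding the first 12 Time values copied from the output above.

```r
# First 12 values of the Time variable, copied from the console output above
time_demo <- c(0.00, 0.25, 0.57, 1.12, 2.02, 3.82, 5.10, 7.03, 9.05,
               12.12, 24.37, 0.00)

length(time_demo)  # how many values the vector holds
time_demo[1]       # the value printed next to [1]
time_demo[10]      # the value printed next to [10]
time_demo[12]      # counting two places past [10]
```

With the real data, s1$Time[132] would return the last value in the same way.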
Use the functions you learned earlier (summary(), head(), tail(), str()) to display information about the Time variable in the s1 data.frame:
summary(s1$Time)
head(s1$Time)
head(s1$Time, n=10)
tail(s1$Time)
tail(s1$Time, n=10)
str(s1$Time)
Here are a few additional functions that can be used to quickly summarize a variable. You may recognize some of them from Excel.
min()
max()
mean()
sum()
sd() #standard deviation
table() #frequency counts of each unique value
Apply each of these functions to the Time variable in the s1 data.frame (i.e. s1$Time):
min(s1$Time)
[1] 0
The output indicates that 0 is the minimum value.
max(s1$Time)
[1] 24.65
24.65 is the maximum value.
mean(s1$Time)
[1] 5.894621
5.894621 is the mean, or average, value.
sum(s1$Time)
[1] 778.09
778.09 is the sum of the values of the Time variable.
sd(s1$Time)
[1] 6.925952
The standard deviation is 6.925952.
table(s1$Time)
0 0.25 0.27 0.3 0.35 0.37 0.5 0.52 0.57 0.58
12 5 3 2 1 1 3 3 1 2
0.6 0.63 0.77 0.98 1 1.02 1.05 1.07 1.12 1.15
1 1 1 2 3 3 1 1 1 1
1.92 1.98 2 2.02 2.03 2.05 2.13 3.48 3.5 3.52
1 1 1 6 1 1 1 1 3 1
3.53 3.55 3.57 3.6 3.62 3.82 5 5.02 5.05 5.07
2 1 1 1 1 1 2 5 2 1
5.08 5.1 6.98 7 7.02 7.03 7.07 7.08 7.15 7.17
1 1 1 1 2 3 2 1 1 1
8.8 9 9.02 9.03 9.05 9.07 9.1 9.22 9.38 11.6
1 3 1 2 1 1 1 1 1 1
11.98 12 12.05 12.1 12.12 12.15 23.7 23.85 24.08 24.12
1 2 2 3 2 1 1 1 1 1
24.15 24.17 24.22 24.3 24.35 24.37 24.43 24.65
1 1 1 1 1 1 1 1
Yikes! You may have noticed that frequency counts are not always appropriate for variables with lots of unique values. Try using the table() function on the Subject variable in the s1 data.frame:
table(s1$Subject)
1 2 3 4 5 6 7 8 9 10 11 12
11 11 11 11 11 11 11 11 11 11 11 11
Here, the output is much more appropriate and we see that there are 11 observations for each subject.
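To make the relationship between these summary functions concrete, the sketch below recomputes the mean and standard deviation by hand on six conc values taken from the head(s1) output, and then counts a short made-up vector with table(). The names conc_demo, subject_demo and the intermediate variables are hypothetical, and the subject_demo values are invented purely to show how table() counts.

```r
# Six concentration values copied from the head(s1) output
conc_demo <- c(0.74, 2.84, 6.57, 10.50, 9.66, 8.58)

n       <- length(conc_demo)
total   <- sum(conc_demo)
average <- total / n                                     # what mean() computes
spread  <- sqrt(sum((conc_demo - average)^2) / (n - 1))  # what sd() computes

all.equal(average, mean(conc_demo))
all.equal(spread, sd(conc_demo))
min(conc_demo)
max(conc_demo)

# Invented subject labels: table() counts how often each unique value occurs
subject_demo <- c(1, 1, 1, 2, 2, 3)
counts <- table(subject_demo)
counts
length(counts)  # number of unique values
```

Note that sd() divides by n - 1 rather than n, which is the sample standard deviation.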
Advanced (optional): Earlier we used the identical() function to compare the s1 and Session_1 data.frames. Try to use this function to compare the conc variables in each of those data.frames. Are they identical?
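These commands will not return the expected results because the variable names are capitalized incorrectly:
summary(s1$time)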
head(s1$Conc)
min(s1$dose)
Optional: Try to correct and re-enter the above commands
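A misspelled variable name after the $ does not stop R with an "object not found" error; it quietly returns NULL, which the summary functions then handle in unhelpful ways. The sketch below illustrates this on a tiny data.frame with the hypothetical name demo, using names() and is.null(), which are not covered elsewhere in this lesson.

```r
# A tiny data.frame with two of the columns from the lesson
demo <- data.frame(
  Time = c(0.00, 0.25, 0.57),
  conc = c(0.74, 2.84, 6.57)
)

names(demo)         # the exact spelling of each variable name
is.null(demo$Time)  # correct capitalization: the column is found
is.null(demo$time)  # wrong capitalization: R returns NULL
summary(demo$time)  # summarizing NULL gives an empty-looking result
```

When a result looks empty or odd, checking names() against what you typed is a quick first step.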
Ok, one more error, which you may encounter when you’re working with larger datasets. Enter the following code:
vomit <- runif(100000) #100,000 random numbers between 0 and 1
vomit
A bunch of lines will spew across the console and you’ll get something like this:
[9973] 8.458023e-01 6.285847e-02 2.523069e-01 6.230917e-01 5.983609e-01 8.318507e-01
[9979] 6.905970e-01 1.566991e-01 7.889734e-01 8.119727e-01 6.632301e-02 5.608480e-02
[9985] 4.320653e-01 7.313621e-01 5.808307e-01 4.450045e-01 4.845054e-02 2.959875e-02
[9991] 1.983839e-01 4.610772e-01 5.963980e-01 1.450190e-01 6.470672e-01 3.714696e-01
[9997] 5.144367e-01 3.211310e-01 5.469483e-01 6.439149e-01
[ reached getOption("max.print") -- omitted 90000 entries ]
The point here is that sometimes your dataset or variable can contain more information than R will display on screen. Don’t be alarmed by this; it’s expected behavior!
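For large objects, the exploration functions from Part 2 are usually more useful than printing everything. The sketch below assumes a vector of the same size as the one above, stored under the hypothetical name big. The call to set.seed() is only there to make the random numbers repeatable.

```r
set.seed(1)            # only so the example gives the same numbers each run
big <- runif(100000)   # 100,000 random numbers between 0 and 1

length(big)    # confirm how many values there are without printing them
head(big)      # look at the first six values only
summary(big)   # minimum, quartiles, mean and maximum in one line

# The printing limit itself is an R option; its value depends on your setup
getOption("max.print")
```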
Now that you’re familiar with some basic commands and error messages, try to apply what you’ve learned by answering the following questions (answers at the end).
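1. What was the minimum value of the Dose variable in the s1 data.frame?
2. What is the maximum value of the Wt variable in the s1 data.frame?
3. What is the average value of the conc variable in the s1 data.frame?
4. How many unique values of the Subject variable are there in the s1 data.frame?
5. What is the 12th value of the Time variable in the s1 data.frame?
Here we will introduce the plot() function, which can be used to generate a scatterplot of two variables on a standard two-dimensional Cartesian coordinate system (i.e. x-axis and y-axis).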
As a reminder, the dataset we loaded contains experimental data on the pharmacokinetics of theophylline, a drug used in the treatment of COPD and asthma. Here are the descriptions of each of the variables we now have stored in the s1 data.frame:
Subject - a number identifying the subject on whom the observation was made. The ordering is by increasing maximum concentration of theophylline observed.
Wt - weight of the subject (kg).
Dose - dose of theophylline administered orally to the subject (mg/kg).
Time - time since drug administration when the sample was drawn (hr).
conc - theophylline concentration in the serum sample (mg/L).
Pharmacokinetics describes how a drug is metabolized and excreted by the body. Subjects were given a single dose and the concentration was measured in the serum over roughly 24 hours (max(s1$Time) returns 24.65). Let’s plot the concentration at each time point using the plot() function. The syntax is simple: plot(x,y), where x is the variable to be plotted on the x-axis (often time) and y is the variable to be plotted on the y-axis.
plot(s1$Time,s1$conc)
2. Plot Subject versus conc from the s1 data.frame:
plot(s1$Subject, s1$conc)
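By default plot() labels the axes with whatever you typed, such as s1$Time. The sketch below adds readable labels with the xlab, ylab and main arguments. So that it runs without the downloaded file, it uses Theoph, a dataset that ships with R. The assumption that Theoph holds the same measurements as Session_1.txt is mine, based on the matching variable names and row count, and is not stated in the lesson.

```r
# Theoph is built into R (datasets package); assumed here to match Session_1.txt
plot(Theoph$Time, Theoph$conc,
     xlab = "Time since drug administration (hr)",  # x-axis label
     ylab = "Theophylline concentration (mg/L)",    # y-axis label
     main = "Theophylline concentration over time") # plot title
```

The same arguments work with your own data, for example plot(s1$Time, s1$conc, xlab = "Time (hr)", ylab = "conc (mg/L)").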
In this final section you have the opportunity to practice all of the skills you’ve learned so far: from reading in a text file to exploring a dataset and finally making a graph.
Download this new dataset: Mel.txt
This dataset has data on 205 patients in Denmark with malignant melanoma. It contains the following columns:
time - survival time in days, possibly censored.
status - 1 died from melanoma, 2 alive, 3 dead from other causes.
sex - 1 = male, 0 = female.
age - age in years.
year - of operation (i.e. surgery).
thickness - tumour thickness in mm.
ulcer - 1 = presence, 0 = absence.
Congratulations! You’ve just finished your first lesson in data visualization using R.
Answers to the Part 3 questions:
What was the minimum value of the Dose variable in the s1 data.frame?
min(s1$Dose)
[1] 3.1
What is the maximum value of the Wt variable in the s1 data.frame?
max(s1$Wt)
[1] 86.4
What is the average value of the conc variable in the s1 data.frame?
mean(s1$conc)
[1] 4.960455
How many unique values of the Subject variable are there in the s1 data.frame?
table(s1$Subject)
1 2 3 4 5 6 7 8 9 10 11 12
11 11 11 11 11 11 11 11 11 11 11 11
What is the 12th value of the Time variable in the s1 data.frame? (There are many solutions; two are shown below.)
head(s1$Time, n=12)
[1] 0.00 0.25 0.57 1.12 2.02 3.82 5.10 7.03 9.05
[10] 12.12 24.37 0.00
s1$Time
[1] 0.00 0.25 0.57 1.12 2.02 3.82 5.10 7.03 9.05
[10] 12.12 24.37 0.00 0.27 0.52 1.00 1.92 3.50 5.02
[19] 7.03 9.00 12.00 24.30 0.00 0.27 0.58 1.02 2.02
[28] 3.62 5.08 7.07 9.00 12.15 24.17 0.00 0.35 0.60
[37] 1.07 2.13 3.50 5.02 7.02 9.02 11.98 24.65 0.00
[46] 0.30 0.52 1.00 2.02 3.50 5.02 7.02 9.10 12.00
[55] 24.35 0.00 0.27 0.58 1.15 2.03 3.57 5.00 7.00
[64] 9.22 12.10 23.85 0.00 0.25 0.50 1.02 2.02 3.48
[73] 5.00 6.98 9.00 12.05 24.22 0.00 0.25 0.52 0.98
[82] 2.02 3.53 5.05 7.15 9.07 12.10 24.12 0.00 0.30
[91] 0.63 1.05 2.02 3.53 5.02 7.17 8.80 11.60 24.43
[100] 0.00 0.37 0.77 1.02 2.05 3.55 5.05 7.08 9.38
[109] 12.10 23.70 0.00 0.25 0.50 0.98 1.98 3.60 5.02
[118] 7.03 9.03 12.12 24.08 0.00 0.25 0.50 1.00 2.00
[127] 3.52 5.07 7.07 9.03 12.05 24.15
Read in and view data
melData <- read.delim(file.choose())
View(melData)
What is the maximum tumor thickness?
max(melData$thickness)
[1] 17.42
What is the maximum survival time?
max(melData$time)
[1] 5565
What is the minimum survival time?
min(melData$time)
[1] 10
What is the average survival time?
mean(melData$time)
[1] 2152.8
In which year were the most operations (surgeries) performed?
table(melData$year)
1962 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974
1 1 11 10 20 21 21 19 27 41 31 1
1977
1
Plot the survival time by year
plot(melData$year, melData$time)
Plot the tumor thickness by age
plot(melData$age, melData$thickness)