Session Prep: Download the dataset we will be working with for this session: Session_1.txt

If you want, you can download a PDF version of the lesson here.

This dataset contains experimental data on the pharmacokinetics of theophylline, a drug used in the treatment of COPD and asthma. Here are the descriptions of each of the variables:

Subject - a number identifying the subject on whom the observation was made. The ordering is by increasing maximum concentration of theophylline observed.
Wt - weight of the subject (kg).
Dose - dose of theophylline administered orally to the subject (mg/kg).
Time - time since drug administration when the sample was drawn (hr).
conc - theophylline concentration in the serum sample (mg/L).

Part 1: Reading in a dataset from a text file

  1. Use the “Import Dataset” feature of RStudio: Import Dataset >> From Text File >> Browse to the Session_1.txt file

RStudio will give you a preview of the dataset that looks a lot like an Excel table:

If you look in the console window you’ll see that two commands were executed. The first line stores the dataset as a data.frame named Session_1. The second line opens a preview of the data.frame. Note: The file path will be different depending on where the Session_1.txt file exists on your computer.

Session_1 <- read.delim("/Desktop/Session_1.txt")
View(Session_1)
  2. Load the file again using the read.delim() function and a new function, file.choose(), and store it in a data.frame named s1.
s1 <- read.delim(file.choose())

You should now have two data.frames listed in your environment: s1 and Session_1
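
As a minimal sketch of what read.delim() does, the example below writes a few illustrative rows to a temporary file and reads them back. The temporary file and the names demo and demo2 are assumptions made for this example; they are not part of the lesson's files.

```r
# Write a tiny tab-delimited file to a temporary location.
# The rows are illustrative only and stand in for Session_1.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Subject\tWt\tconc",
             "1\t79.6\t0.74",
             "1\t79.6\t2.84",
             "2\t72.4\t0.00"),
           tmp)

# read.delim() assumes a header row and tab-separated columns by default.
demo <- read.delim(tmp)

# The same call with those defaults written out explicitly.
demo2 <- read.delim(tmp, header = TRUE, sep = "\t")

# Both calls should produce the same data.frame.
identical(demo, demo2)
```

Passing a path directly, instead of using file.choose(), makes the command repeatable without any clicking.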

Optional: Check to see if the two data.frames are identical:

identical(s1,Session_1)
[1] TRUE

Optional: Use the View() function to display the s1 data.frame.

Part 2: Exploring a dataset

There are five primary functions we will use to explore a dataset:

summary()
head()
tail()
str()
View() #see exercise 1.1
  1. The summary() function will display summary statistics on each variable in a data.frame. Summarize the s1 data.frame by entering the following in the console window:
summary(s1)

You should see this output in the console window:

    Subject            Wt             Dose      
 Min.   : 1.00   Min.   :54.60   Min.   :3.100  
 1st Qu.: 3.75   1st Qu.:63.58   1st Qu.:4.305  
 Median : 6.50   Median :70.50   Median :4.530  
 Mean   : 6.50   Mean   :69.58   Mean   :4.626  
 3rd Qu.: 9.25   3rd Qu.:74.42   3rd Qu.:5.037  
 Max.   :12.00   Max.   :86.40   Max.   :5.860  
      Time             conc       
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.595   1st Qu.: 2.877  
 Median : 3.530   Median : 5.275  
 Mean   : 5.895   Mean   : 4.960  
 3rd Qu.: 9.000   3rd Qu.: 7.140  
 Max.   :24.650   Max.   :11.400  
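
To see where the numbers reported by summary() come from, here is a small sketch on an illustrative vector (the values in x are made up for this example). Each entry can be reproduced with a dedicated function.

```r
# Illustrative values only; not taken from the dataset.
x <- c(2, 4, 4, 5, 7, 9)

# All six summary statistics at once.
summary(x)

# The same statistics computed one at a time.
min(x)
quantile(x, 0.25)   # 1st quartile
median(x)
mean(x)
quantile(x, 0.75)   # 3rd quartile
max(x)
```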

Tip: RStudio has a feature called tab-complete that can dramatically reduce the amount of typing you’ll need to do. In the console window, if you type “sum” and then hit the tab key, you should see the summary() function listed as the second option. Use the down arrow to select summary and then hit enter. This works anytime you’re typing: if you type “s” and hit tab, you should see the s1 data.frame at the top of the list. This is like autocomplete on your phone or when searching the web. You’ll also notice that when you type an opening parenthesis, RStudio adds the closing parenthesis automatically.

  2. The head() and tail() functions will display the first six lines and the last six lines of a data.frame in the console window, respectively. Display the first six lines of the s1 data.frame:
head(s1)

The output should look like this:

  Subject   Wt Dose Time  conc
1       1 79.6 4.02 0.00  0.74
2       1 79.6 4.02 0.25  2.84
3       1 79.6 4.02 0.57  6.57
4       1 79.6 4.02 1.12 10.50
5       1 79.6 4.02 2.02  9.66
6       1 79.6 4.02 3.82  8.58

Note: The first column, which doesn’t have a label, indicates the row numbers. Here, we see that rows 1-6 are displayed.

Now display the last few lines of the s1 data.frame:

tail(s1)

You should see this as the output (note that rows 127-132 are displayed):

    Subject   Wt Dose  Time conc
127      12 60.5  5.3  3.52 9.75
128      12 60.5  5.3  5.07 8.57
129      12 60.5  5.3  7.07 6.59
130      12 60.5  5.3  9.03 6.11
131      12 60.5  5.3 12.05 4.57
132      12 60.5  5.3 24.15 1.17

You can also indicate the number of lines to display with an additional argument to both the head() and tail() functions. To display the first or last 10 lines, try:

head(s1, n=10)
tail(s1, n=10)
  3. Another way to quickly preview a data.frame is with the str() function (sometimes pronounced “stir”). Apply this function to the s1 data.frame:
str(s1)

The output should look like this:

'data.frame':   132 obs. of  5 variables:
 $ Subject: int  1 1 1 1 1 1 1 1 1 1 ...
 $ Wt     : num  79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 79.6 ...
 $ Dose   : num  4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 4.02 ...
 $ Time   : num  0 0.25 0.57 1.12 2.02 ...
 $ conc   : num  0.74 2.84 6.57 10.5 9.66 8.58 8.36 7.47 6.89 5.94 ...
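
Each piece of the str() output can also be requested on its own. The sketch below uses a small illustrative data.frame (df and its values are assumptions for this example) to show the functions that report the number of rows, the number of columns, the variable names, and the type of each variable.

```r
# A small illustrative data.frame; 1L marks a value as an integer.
df <- data.frame(Subject = c(1L, 1L, 2L),
                 Wt      = c(79.6, 79.6, 72.4),
                 conc    = c(0.74, 2.84, 0.00))

nrow(df)            # number of observations (rows)
ncol(df)            # number of variables (columns)
names(df)           # variable names
sapply(df, class)   # type of each variable ("integer" or "numeric" here)
```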

Tip: In the console window, you can use the up and down arrows to recall the previous commands you’ve entered. Use the up arrow to recall the command you used to display the first 10 lines of the s1 data.frame.

  4. You may have already noticed that R code is case sensitive and requires exact spelling matches. To get a sense of the common error messages that occur when you’ve misspelled something or used the wrong case, try these commands:

These commands will generate error messages indicating that a function can’t be found:

sumary(s1)
view(s1)
Read.Delim(file.choose())

These commands will generate error messages indicating that an object can’t be found:

summary(session_1)
View(S1)
head(s2)

Optional: Correct and re-enter the commands above.
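
If you would like to look at the two kinds of error message side by side, one option is to capture them with tryCatch(). This is only a sketch: error_text is a hypothetical helper name invented for this example, and the exact wording of the messages may vary with your R version and language settings.

```r
# Hypothetical helper: run an expression and return the error message as text.
error_text <- function(expr) {
  tryCatch({
    expr          # the expression is evaluated here
    "no error"
  }, error = function(e) conditionMessage(e))
}

# A misspelled function name: R cannot find the function.
msg_function <- error_text(sumary(1:10))

# A name that was never created: R cannot find the object.
msg_object <- error_text(head(object_that_does_not_exist))

msg_function
msg_object
```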

Part 3: Exploring a specific variable in a data.frame.

All of the functions you’ve learned in Part 2 can be applied to a specific variable (i.e. a column) stored in a data.frame. The syntax for specifying a variable name is data.frame$variable.

  1. Using this syntax, we can display all of the values of the Time variable in the s1 data.frame:
s1$Time

The output should look like this:

  [1]  0.00  0.25  0.57  1.12  2.02  3.82  5.10  7.03  9.05
 [10] 12.12 24.37  0.00  0.27  0.52  1.00  1.92  3.50  5.02
 [19]  7.03  9.00 12.00 24.30  0.00  0.27  0.58  1.02  2.02
 [28]  3.62  5.08  7.07  9.00 12.15 24.17  0.00  0.35  0.60
 [37]  1.07  2.13  3.50  5.02  7.02  9.02 11.98 24.65  0.00
 [46]  0.30  0.52  1.00  2.02  3.50  5.02  7.02  9.10 12.00
 [55] 24.35  0.00  0.27  0.58  1.15  2.03  3.57  5.00  7.00
 [64]  9.22 12.10 23.85  0.00  0.25  0.50  1.02  2.02  3.48
 [73]  5.00  6.98  9.00 12.05 24.22  0.00  0.25  0.52  0.98
 [82]  2.02  3.53  5.05  7.15  9.07 12.10 24.12  0.00  0.30
 [91]  0.63  1.05  2.02  3.53  5.02  7.17  8.80 11.60 24.43
[100]  0.00  0.37  0.77  1.02  2.05  3.55  5.05  7.08  9.38
[109] 12.10 23.70  0.00  0.25  0.50  0.98  1.98  3.60  5.02
[118]  7.03  9.03 12.12 24.08  0.00  0.25  0.50  1.00  2.00
[127]  3.52  5.07  7.07  9.03 12.05 24.15

Note: The number of lines will depend on the width of the console window, but you should always see 132 values for this particular data.frame. We can check this by looking at the numbers in brackets at the beginning of each line. On the first line, [1] indicates that 0.00 is the first value of the variable Time. On the last line, [127] indicates that 3.52 is the 127th value of the Time variable. Counting across the row, we can see that 24.15 is the 132nd value.
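
Rather than counting across rows by eye, you can ask R for the number of values or for a value at a given position. A minimal sketch on an illustrative data.frame (df is an assumption for this example; with the lesson data you would use s1 instead):

```r
# A small illustrative data.frame.
df <- data.frame(Time = c(0.00, 0.25, 0.57, 1.12),
                 conc = c(0.74, 2.84, 6.57, 10.50))

df$Time           # all values of the Time variable
df[["Time"]]      # an equivalent way to write the same thing
length(df$Time)   # how many values there are: 4
df$Time[3]        # the 3rd value: 0.57
```

The number inside the square brackets is the same position counter that R prints at the start of each output line.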

  2. Now use the four functions you learned about in Part 2 (summary(), head(), tail(), str()) to display information about the Time variable in the s1 data.frame:
summary(s1$Time)
head(s1$Time)
head(s1$Time, n=10)
tail(s1$Time)
tail(s1$Time, n=10)
str(s1$Time)

Here are a few additional functions that can be used to quickly summarize a variable. You may recognize some of them from Excel.

min()
max()
mean()
sum()
sd() #standard deviation
table() #frequency counts of each unique value
  3. Try out these functions on the Time variable in the s1 data.frame (i.e. s1$Time):
min(s1$Time)
[1] 0

The output indicates that 0 is the minimum value.

max(s1$Time)
[1] 24.65

24.65 is the maximum value.

mean(s1$Time)
[1] 5.894621

5.894621 is the mean, or average, value.

sum(s1$Time)
[1] 778.09

778.09 is the sum of the values of the Time variable.

sd(s1$Time)
[1] 6.925952

The standard deviation is 6.925952.
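
For the curious, mean() and sd() can be reproduced by hand. The sketch below uses an illustrative vector (x is made up for this example) and assumes the usual sample standard deviation, which divides by n - 1 and is what sd() computes.

```r
# Illustrative values only.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# The mean is the sum divided by the number of values.
manual_mean <- sum(x) / n

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by (n - 1).
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

all.equal(manual_mean, mean(x))
all.equal(manual_sd, sd(x))
```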

table(s1$Time)

    0  0.25  0.27   0.3  0.35  0.37   0.5  0.52  0.57  0.58 
   12     5     3     2     1     1     3     3     1     2 
  0.6  0.63  0.77  0.98     1  1.02  1.05  1.07  1.12  1.15 
    1     1     1     2     3     3     1     1     1     1 
 1.92  1.98     2  2.02  2.03  2.05  2.13  3.48   3.5  3.52 
    1     1     1     6     1     1     1     1     3     1 
 3.53  3.55  3.57   3.6  3.62  3.82     5  5.02  5.05  5.07 
    2     1     1     1     1     1     2     5     2     1 
 5.08   5.1  6.98     7  7.02  7.03  7.07  7.08  7.15  7.17 
    1     1     1     1     2     3     2     1     1     1 
  8.8     9  9.02  9.03  9.05  9.07   9.1  9.22  9.38  11.6 
    1     3     1     2     1     1     1     1     1     1 
11.98    12 12.05  12.1 12.12 12.15  23.7 23.85 24.08 24.12 
    1     2     2     3     2     1     1     1     1     1 
24.15 24.17 24.22  24.3 24.35 24.37 24.43 24.65 
    1     1     1     1     1     1     1     1 

Yikes! You may have noticed that frequency counts are not always appropriate for variables with lots of unique values. Try using the table() function on the Subject variable in the s1 data.frame:

table(s1$Subject)

 1  2  3  4  5  6  7  8  9 10 11 12 
11 11 11 11 11 11 11 11 11 11 11 11 

Here, the output is much more appropriate and we see that there are 11 observations for each subject.
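
If you still want frequency counts for a variable with many unique values, one possible approach is to group the values into ranges first. The sketch below uses cut(), which is not one of the functions covered in this lesson, on the first 11 Time values printed earlier. The break points are arbitrary choices made for this example.

```r
# The first 11 Time values from the s1$Time output above.
time_hr <- c(0.00, 0.25, 0.57, 1.12, 2.02, 3.82, 5.10, 7.03, 9.05, 12.12, 24.37)

# Group the values into four ranges (break points chosen arbitrarily).
bins <- cut(time_hr, breaks = c(0, 1, 4, 12, 25), include.lowest = TRUE)

# Count how many values fall in each range: 3, 3, 3 and 2.
counts <- table(bins)
counts
```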

Advanced (optional): Earlier we used the identical() function to compare the s1 and Session_1 data.frames. Try to use this function to compare the conc variables in each of those data.frames. Are they identical?

  4. You may have noticed that variable names are also case sensitive. Try these commands to see what happens when you haven’t typed the variable name correctly (R typically returns NULL or a warning here rather than the values you wanted):
summary(s1$time)
head(s1$Conc)
min(s1$dose)

Optional: Try to correct and re-enter the above commands
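
A quick way to check the exact spelling and case of the variable names is names(). The sketch below uses an illustrative data.frame (df is an assumption for this example) and shows that asking for a name with the wrong case gives NULL rather than the column.

```r
# A small illustrative data.frame.
df <- data.frame(Time = c(0.00, 0.25), conc = c(0.74, 2.84))

names(df)               # the exact names: "Time" and "conc"
"time" %in% names(df)   # FALSE: lower-case "time" is not a column
is.null(df$time)        # TRUE: the wrong case returns NULL
is.null(df$Time)        # FALSE: the correct case returns the values
```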

Ok, one more error, which you may encounter when you’re working with larger datasets. Enter the following code:

vomit <- runif(100000)
vomit

A bunch of lines will spew across the console and you’ll get something like this:

[9973] 8.458023e-01 6.285847e-02 2.523069e-01 6.230917e-01 5.983609e-01 8.318507e-01  
 [9979] 6.905970e-01 1.566991e-01 7.889734e-01 8.119727e-01 6.632301e-02 5.608480e-02  
 [9985] 4.320653e-01 7.313621e-01 5.808307e-01 4.450045e-01 4.845054e-02 2.959875e-02  
 [9991] 1.983839e-01 4.610772e-01 5.963980e-01 1.450190e-01 6.470672e-01 3.714696e-01  
 [9997] 5.144367e-01 3.211310e-01 5.469483e-01 6.439149e-01  
 [ reached getOption("max.print") -- omitted 90000 entries ]  

The point here is that sometimes your dataset or variable can contain more information than R will display on screen. Don’t be alarmed by this; it’s expected behavior!
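
When a variable is too long to print in full, the functions from Part 2 are a more comfortable way to inspect it. A minimal sketch, where big is a name chosen for this example:

```r
# 100,000 random numbers between 0 and 1.
big <- runif(100000)

length(big)              # confirms that all 100,000 values are stored
getOption("max.print")   # the printing limit currently in effect
head(big)                # just the first six values
summary(big)             # a compact overview instead of every value
```

The values themselves will differ on every run because they are random.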

Part 3 Questions

Now that you’re familiar with some basic commands and error messages, try to apply what you’ve learned by answering the following questions (answers at the end).

  • What is the minimum value of the Dose variable in the s1 data.frame?
  • What is the maximum value of the Wt variable in the s1 data.frame?
  • What is the average value of the conc variable in the s1 data.frame?
  • How many unique values of the Subject variable are there in the s1 data.frame?
  • What is the 12th value of the Time variable in the s1 data.frame?
    See Answers

Part 4: Plotting your first plot

Here we will introduce the plot() function, which can be used to generate a scatterplot of two variables on a standard two-dimensional Cartesian coordinate system (i.e. x-axis and y-axis).

As a reminder, the dataset we loaded contains experimental data on the pharmacokinetics of theophylline, a drug used in the treatment of COPD and asthma. Here are the descriptions of each of the variables we now have stored in the s1 data.frame:

Subject - a number identifying the subject on whom the observation was made. The ordering is by increasing maximum concentration of theophylline observed.
Wt - weight of the subject (kg).
Dose - dose of theophylline administered orally to the subject (mg/kg).
Time - time since drug administration when the sample was drawn (hr).
conc - theophylline concentration in the serum sample (mg/L).

Pharmacokinetics describes how a drug is metabolized and excreted by the body. Subjects were given a single dose and the concentration was measured in the serum over 24.65 hours (the value returned by max(s1$Time)). Let’s plot the concentration at each time point using the plot() function. The syntax is simple: plot(x, y), where x is the variable to be plotted on the x-axis (often time) and y is the variable to be plotted on the y-axis.

  1. Plot Time versus conc from the s1 data.frame:
plot(s1$Time,s1$conc)

  2. Plot Subject versus conc from the s1 data.frame:

plot(s1$Subject, s1$conc)
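
By default, plot() labels the axes with whatever you typed (for example s1$Time). The sketch below shows one way to supply friendlier labels with the xlab, ylab and main arguments. It uses the first six rows shown by head(s1) earlier; the names time_hr and conc_mg_l and the label text are choices made for this example.

```r
# First six Time and conc values for subject 1, from the head(s1) output.
time_hr   <- c(0.00, 0.25, 0.57, 1.12, 2.02, 3.82)
conc_mg_l <- c(0.74, 2.84, 6.57, 10.50, 9.66, 8.58)

# Scatterplot with custom axis labels and a title.
plot(time_hr, conc_mg_l,
     xlab = "Time since dose (hr)",
     ylab = "Theophylline concentration (mg/L)",
     main = "Subject 1, first six samples")
```

With the full dataset, the same arguments can be added to plot(s1$Time, s1$conc).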

Part 5: Putting it all together.

In this final section you have the opportunity to practice all of the skills you’ve learned so far: from reading in a text file to exploring a dataset and finally making a graph.

Download this new dataset: Mel.txt

This dataset has data on 205 patients in Denmark with malignant melanoma. It contains the following columns:

time - survival time in days, possibly censored.
status - 1 died from melanoma, 2 alive, 3 dead from other causes.
sex - 1 = male, 0 = female.
age - age in years.
year - of operation (i.e. surgery).
thickness - tumour thickness in mm.
ulcer - 1 = presence, 0 = absence.

  1. Read in the Mel.txt file.
  2. View the dataset
  3. Summarize all of the variables in the dataset
  4. What is the maximum tumor thickness?
  5. What are the maximum, minimum, and average survival times?
  6. In which year were the most operations (surgeries) performed?
  7. Plot the survival time by year
  8. Plot the tumor thickness by age
  9. Optional: What else can you discover about this dataset?
    See Answers to part 5
    Preview: Next week we’ll show you additional methods for visualizing these two datasets, including box plots (sometimes called box-and-whisker plots if you’re really old).

Congratulations! You’ve just finished your first lesson in data visualization using R.

Answers From Part 3

What is the minimum value of the Dose variable in the s1 data.frame?

min(s1$Dose)
[1] 3.1

What is the maximum value of the Wt variable in the s1 data.frame?

max(s1$Wt)
[1] 86.4

What is the average value of the conc variable in the s1 data.frame?

mean(s1$conc)
[1] 4.960455

How many unique values of the Subject variable are there in the s1 data.frame?

table(s1$Subject)

 1  2  3  4  5  6  7  8  9 10 11 12 
11 11 11 11 11 11 11 11 11 11 11 11 
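
Counting the columns of the table() output works, but R can also count the unique values directly. A minimal sketch, where subject is a stand-in vector built to mimic the table above (12 subjects with 11 observations each); with the lesson data you would use s1$Subject instead:

```r
# Stand-in for s1$Subject: the numbers 1 to 12, each repeated 11 times.
subject <- rep(1:12, each = 11)

length(subject)           # 132 observations in total
unique(subject)           # each distinct subject number, listed once
length(unique(subject))   # the number of unique values: 12
```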

What is the 12th value of the Time variable in the s1 data.frame? (there are many solutions, two shown below)

head(s1$Time, n=12)
 [1]  0.00  0.25  0.57  1.12  2.02  3.82  5.10  7.03  9.05
[10] 12.12 24.37  0.00


s1$Time
  [1]  0.00  0.25  0.57  1.12  2.02  3.82  5.10  7.03  9.05
 [10] 12.12 24.37  0.00  0.27  0.52  1.00  1.92  3.50  5.02
 [19]  7.03  9.00 12.00 24.30  0.00  0.27  0.58  1.02  2.02
 [28]  3.62  5.08  7.07  9.00 12.15 24.17  0.00  0.35  0.60
 [37]  1.07  2.13  3.50  5.02  7.02  9.02 11.98 24.65  0.00
 [46]  0.30  0.52  1.00  2.02  3.50  5.02  7.02  9.10 12.00
 [55] 24.35  0.00  0.27  0.58  1.15  2.03  3.57  5.00  7.00
 [64]  9.22 12.10 23.85  0.00  0.25  0.50  1.02  2.02  3.48
 [73]  5.00  6.98  9.00 12.05 24.22  0.00  0.25  0.52  0.98
 [82]  2.02  3.53  5.05  7.15  9.07 12.10 24.12  0.00  0.30
 [91]  0.63  1.05  2.02  3.53  5.02  7.17  8.80 11.60 24.43
[100]  0.00  0.37  0.77  1.02  2.05  3.55  5.05  7.08  9.38
[109] 12.10 23.70  0.00  0.25  0.50  0.98  1.98  3.60  5.02
[118]  7.03  9.03 12.12 24.08  0.00  0.25  0.50  1.00  2.00
[127]  3.52  5.07  7.07  9.03 12.05 24.15
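
A third possible solution is to ask for the 12th value directly using square brackets. The sketch below copies the first 12 Time values from the output above into a stand-in vector (time_first12 is a name chosen for this example); with the lesson data you would type s1$Time[12].

```r
# The first 12 Time values, copied from the output above.
time_first12 <- c(0.00, 0.25, 0.57, 1.12, 2.02, 3.82,
                  5.10, 7.03, 9.05, 12.12, 24.37, 0.00)

# The number in square brackets picks out a single position.
time_first12[12]   # 0, the first sample of the second subject
```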


Answers From Part 5

Read in and view data

melData <- read.delim(file.choose())
View(melData)

What is the maximum tumor thickness?

max(melData$thickness)
[1] 17.42

What is the maximum survival time?

max(melData$time)
[1] 5565

What is the minimum survival time?

min(melData$time)
[1] 10


What is the average survival time?

mean(melData$time)
[1] 2152.8

In which year were the most operations (surgeries) performed?

table(melData$year)

1962 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 
   1    1   11   10   20   21   21   19   27   41   31    1 
1977 
   1 
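
Instead of scanning the table by eye, you can let R pick out the most frequent value. This is a sketch on an illustrative vector (the values in year are made up for this example); with the lesson data you would use melData$year.

```r
# Illustrative years only; not the real dataset.
year <- c(1971, 1972, 1972, 1973, 1972, 1971)

counts <- table(year)

# which.max() finds the position of the largest count;
# names() turns that position back into the year it belongs to.
names(which.max(counts))   # "1972"
```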

Plot the survival time by year

plot(melData$year, melData$time)

Plot the tumor thickness by age

plot(melData$age, melData$thickness)
