What is subsetting and why do I need it?

  • Most datasets have rows and columns.
    • Rows/records represent all values for a given subject.
    • Columns/fields represent a single measure across all subjects in the dataset.
    • You often want to examine or extract specific values to answer questions about your data.

First, pull in example data

orange_trees <- datasets::Orange
  • orange_trees is the object I am creating within my global environment.
  • I am assigning <- the Orange dataset from the datasets package to orange_trees.

To access an item from any package you have installed, without attaching the package first, type the package name, followed by two colons ::, and then the item you want (e.g., package::item).
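As a small illustration of the :: syntax, the sketch below calls median() from the stats package on a column of the Orange dataset, naming both packages explicitly. The choice of median() and the name median_age are only for illustration.

```r
# Call median() from the stats package on the age column of the
# Orange dataset, naming both packages explicitly with ::
median_age <- stats::median(datasets::Orange$age)
median_age
```

Both stats and datasets ship with R, so this should run in a fresh session without any library() calls.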

We can quickly examine the first 6 records with the head() function.

head(orange_trees)
  Tree  age circumference
1    1  118            30
2    1  484            58
3    1  664            87
4    1 1004           115
5    1 1231           120
6    1 1372           142
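head() returns six rows by default, and its n argument lets you ask for a different number. A minimal sketch (the name first_three is just for illustration):

```r
orange_trees <- datasets::Orange

# Ask for the first 3 rows instead of the default 6
first_three <- head(orange_trees, n = 3)
first_three
```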

Now, how do we get rows that meet a condition?

  • Base R has a simple structure for subsetting records.
YourObject[Rows, Columns]
  1. YourObject is your dataset.
  2. Rows are your records - what do you want to filter or subset your records by?
  3. Columns represent the columns you want to select and retain in your object
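To make the YourObject[Rows, Columns] pattern concrete, here is a short sketch using row numbers and column names. The object names first_rows, two_cols, and small are hypothetical and only used for this example.

```r
orange_trees <- datasets::Orange

# Rows 1 to 3, all columns (the column position is left empty)
first_rows <- orange_trees[1:3, ]

# All rows, only the age and circumference columns
# (the row position is left empty)
two_cols <- orange_trees[, c("age", "circumference")]

# Rows 1 to 3 and only those two columns
small <- orange_trees[1:3, c("age", "circumference")]
small
```

Leaving either position empty means "keep everything" for that dimension, which is the same idea used in the conditional examples that follow.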

Trying it with orange_trees

Let’s say we’re interested in getting trees whose circumference is above the mean circumference.

print(paste("The mean value for circumference is:", mean(orange_trees$circumference)))
[1] "The mean value for circumference is: 115.857142857143"
orange_trees[orange_trees$circumference > 115.85, ]
   Tree  age circumference
5     1 1231           120
6     1 1372           142
7     1 1582           145
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
20    3 1372           139
21    3 1582           140
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

We can see that those orange trees with a circumference greater than the mean circumference are returned.

Why are we typing orange_trees twice?

  • First, you’re accessing your object, orange_trees.
  • Second, within this object (brackets indicate within), you are specifying that you want to subset your object, orange_trees, by the column circumference.
  • Third, the $ extracts that specific column from the object, so the condition is checked against the circumference value in every row.

We have to specify the condition: rows with a circumference greater than the mean circumference value. We do this as we would in any other language, with a comparison operator.

  • Here, I am requesting circumference values greater than 115.85.

Notice the comma that follows this conditional expression. Leaving the column position after it empty indicates that you only want to subset records and keep all columns.
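It can help to look at what the conditional expression produces on its own. The sketch below stores it in a variable (above_mean is a name chosen for this example) and shows that it is a vector of TRUE/FALSE values, one per row, which the brackets then use to decide which rows to keep.

```r
orange_trees <- datasets::Orange

# The comparison returns one TRUE/FALSE value per row
above_mean <- orange_trees$circumference > mean(orange_trees$circumference)
head(above_mean)

# Number of rows where the condition is TRUE
sum(above_mean)

# Using the logical vector in the row position keeps only the TRUE rows
orange_trees[above_mean, ]
```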

We could also use this code to obtain the same result.

# If you want to save this to your global environment for future use,
# you could type something like the following:
# orange_trees_subset <- orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
   Tree  age circumference
5     1 1231           120
6     1 1372           142
7     1 1582           145
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
20    3 1372           139
21    3 1582           140
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

Subsetting by multiple conditions

Often, subsetting by just one field won’t cut it - you have two or three conditions that you need in your next dataset.

Let’s say that I want to get trees that are above the mean age and have a circumference less than the mean circumference.

Note: I use parentheses to surround each individual expression. It’s easier for me to read, and it can prevent some errors.

orange_trees[(orange_trees$age > mean(orange_trees$age)) & (orange_trees$circumference < mean(orange_trees$circumference)), ]
   Tree  age circumference
4     1 1004           115
18    3 1004           108
19    3 1231           115
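The same pattern works with R's other logical operators. As a sketch, the example below stores each condition in its own variable (older and thinner are illustrative names) and then combines them with & (and), | (or), and ! (not).

```r
orange_trees <- datasets::Orange

# Store each condition separately so the combinations are easier to read
older <- orange_trees$age > mean(orange_trees$age)
thinner <- orange_trees$circumference < mean(orange_trees$circumference)

# Both conditions must be TRUE
both <- orange_trees[older & thinner, ]

# At least one condition must be TRUE
either <- orange_trees[older | thinner, ]

# Older than the mean age, but NOT thinner than the mean circumference
older_not_thinner <- orange_trees[older & !thinner, ]
nrow(older_not_thinner)
```

Naming the conditions first is a style choice; it keeps long subsetting expressions short and lets you reuse a condition in more than one subset.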

You can get more detail about the R operators in R's built-in help pages, for example by running ?Comparison and ?Logic.

The subset() function

Don’t worry: if that looks like too much, you can fall back on the subset() function included in the base package.

subset(orange_trees, circumference > mean(circumference))
   Tree  age circumference
5     1 1231           120
6     1 1372           142
7     1 1582           145
11    2 1004           156
12    2 1231           172
13    2 1372           203
14    2 1582           203
20    3 1372           139
21    3 1582           140
25    4 1004           167
26    4 1231           179
27    4 1372           209
28    4 1582           214
32    5 1004           125
33    5 1231           142
34    5 1372           174
35    5 1582           177

This one is a lot more fluid and readable. You can, of course, combine multiple conditions.

Notice how you do not have to type orange_trees and the corresponding $ with the subset() function.

This is similar to how you can attach() a dataset and access names directly.

subset(orange_trees, age > mean(age) & circumference < mean(circumference))
   Tree  age circumference
4     1 1004           115
18    3 1004           108
19    3 1231           115
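subset() can also choose columns in the same call through its select argument. A minimal sketch, reusing the two conditions from above (the name old_thin is just for illustration):

```r
orange_trees <- datasets::Orange

# Filter rows and keep only two columns in one call
old_thin <- subset(
  orange_trees,
  age > mean(age) & circumference < mean(circumference),
  select = c(age, circumference)
)
old_thin
```

One caveat worth knowing: the help page for subset() describes it as a convenience function intended for interactive use, so for code inside functions or scripts the bracket form shown earlier is generally the safer choice.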