What is subsetting and why do I need it?

Most datasets have rows and columns.
- Rows/records represent all values for a given subject.
- Columns/fields represent a single measure across all subjects in the dataset.
- You often want to examine or extract specific values to answer questions about your data.

First, pull in example data

orange_trees <- datasets::Orange

orange_trees is the object I am creating within my global environment.
I am assigning (<-) the Orange dataset from the datasets package to orange_trees.

To access a function or object in any package you have installed, type the package name, then two colons (::), then the name of the item you want (e.g., package::item).
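As a quick illustration of the package::item pattern, here is a minimal sketch using packages that ship with R. The object names middle_value and first_rows are made up for this example.

```r
# Call a function from a specific package without attaching the package.
middle_value <- stats::median(c(1, 3, 5))
middle_value

# The same syntax works for datasets, as in the Orange assignment above.
first_rows <- utils::head(datasets::Orange, n = 3)
first_rows
```

Spelling out the package is optional for packages that are already attached, but it makes it obvious where each item comes from.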

We can quickly examine the first six records (the default) with the head() function.

head(orange_trees)

  Tree  age circumference
     1  118            30
     1  484            58
     1  664            87
     1 1004           115
     1 1231           120
     1 1372           142
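head() returns six rows unless you tell it otherwise. A small sketch of a few related ways to look at the data before subsetting it; the choice of n = 3 is arbitrary.

```r
orange_trees <- datasets::Orange

# Ask for a specific number of rows with the n argument.
head(orange_trees, n = 3)

# dim() reports the number of rows and columns in the full dataset.
dim(orange_trees)

# str() summarizes each column's type and its first few values.
str(orange_trees)
```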

Now, how do we get rows that meet a condition?

Base R has a simple structure for subsetting records.

YourObject[Rows, Columns]

YourObject is your dataset.
Rows are your records - what condition do you want to filter or subset your records by?
Columns are the columns you want to select and retain in your object.
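To make the [Rows, Columns] positions concrete, here is a minimal sketch that fills in each side of the comma in turn. The name age_and_size is hypothetical, chosen only for this example.

```r
orange_trees <- datasets::Orange

# Rows 1 to 3, all columns (nothing after the comma).
orange_trees[1:3, ]

# All rows (nothing before the comma), two columns selected by name.
age_and_size <- orange_trees[, c("age", "circumference")]
head(age_and_size)

# Rows and columns together: the first three rows of those two columns.
orange_trees[1:3, c("age", "circumference")]
```

Leaving one side of the comma empty means "keep everything" on that side.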

Trying it with `orange_trees`

Let’s say we’re interested in getting the records where a tree’s circumference is above the mean circumference.

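# Print the mean circumference so we know what threshold to use
print(paste("The mean value for circumference is:", mean(orange_trees$circumference)))
# Keep only the rows whose circumference is above that (rounded-down) mean
orange_trees[orange_trees$circumference > 115.85, ]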

[1] "The mean value for circumference is: 115.857142857143"
  Tree  age circumference
     1 1231           120
     1 1372           142
     1 1582           145
     2 1004           156
     2 1231           172
     2 1372           203
     2 1582           203
     3 1372           139
     3 1582           140
     4 1004           167
     4 1231           179
     4 1372           209
     4 1582           214
     5 1004           125
     5 1231           142
     5 1372           174
     5 1582           177

We can see that those orange trees with a circumference greater than the mean circumference are returned.

Why are we typing orange_trees twice?

First, you’re accessing your object, orange_trees.
Second, within this object (brackets indicate within), you are specifying that you want to subset orange_trees by the column circumference.
Third, the $ pulls that specific column out of the object, giving you one value per row to test.

We have to specify the condition - rows whose circumference is greater than the mean circumference value. We do this as we would in any other language - with a comparison operator.

Here, I am requesting circumference values greater than 115.85.

Notice the comma that follows this conditional expression - it indicates that you only want to subset records, keeping all columns.
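It can help to look at the condition on its own. The sketch below stores it in a variable (above_mean is a name invented for this example) to show that it is just a vector of TRUE/FALSE values, one per row.

```r
orange_trees <- datasets::Orange

# The comparison returns one TRUE/FALSE value per row.
above_mean <- orange_trees$circumference > mean(orange_trees$circumference)
head(above_mean)

# sum() counts the TRUE values, i.e. how many rows will be kept.
sum(above_mean)

# Passing the logical vector in the row position keeps only the TRUE rows.
orange_trees[above_mean, ]
```

Seen this way, the bracket expression is simply "keep the rows where this vector is TRUE, and all columns".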

We could also use this code to obtain the same result.

# If you want to save this to your global environment for future use,
# you could type something like the following:
# orange_trees_subset <- orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]

  Tree  age circumference
     1 1231           120
     1 1372           142
     1 1582           145
     2 1004           156
     2 1231           172
     2 1372           203
     2 1582           203
     3 1372           139
     3 1582           140
     4 1004           167
     4 1231           179
     4 1372           209
     4 1582           214
     5 1004           125
     5 1231           142
     5 1372           174
     5 1582           177
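If you plan to reuse the result, one option is to compute the mean once, store it, and save the subset as its own object. This is only a sketch; mean_circumference is an illustrative name, and orange_trees_subset follows the commented-out suggestion above.

```r
orange_trees <- datasets::Orange

# Compute the threshold once instead of hard-coding 115.85.
mean_circumference <- mean(orange_trees$circumference)

# Save the filtered rows for later use.
orange_trees_subset <- orange_trees[orange_trees$circumference > mean_circumference, ]

# Quick checks: how many rows were kept, and the smallest value among them.
nrow(orange_trees_subset)
min(orange_trees_subset$circumference)
```

Using the computed mean rather than a typed number means the code keeps working if the data changes.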

Subsetting by multiple conditions

Often, subsetting by just one field won’t cut it - you have two or three conditions that you need in your next dataset.

Let’s say that I want to get trees that are above the mean age and have a circumference less than the mean circumference.

Note: I use parentheses to surround each individual expression. It’s easier for me to read, and it can prevent some errors.

orange_trees[(orange_trees$age > mean(orange_trees$age)) & (orange_trees$circumference < mean(orange_trees$circumference)), ]

  Tree  age circumference
     1 1004           115
     3 1004           108
     3 1231           115
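Long conditions can be easier to read if each one is stored first and then combined. A minimal sketch, assuming the hypothetical names older, thinner and either; it also shows the | (or) operator next to & (and).

```r
orange_trees <- datasets::Orange

# Store each condition as its own logical vector.
older <- orange_trees$age > mean(orange_trees$age)
thinner <- orange_trees$circumference < mean(orange_trees$circumference)

# & keeps rows where both conditions are TRUE.
orange_trees[older & thinner, ]

# | keeps rows where at least one condition is TRUE.
either <- orange_trees[older | thinner, ]
nrow(either)
```

In this dataset every row is either older than the mean age or thinner than the mean circumference, so the | version keeps all 35 rows, while the & version keeps only the three shown above.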

You can get more detail about all the R operators here.

The `subset()` function

Don’t worry: if that looks like too much, you can fall back on the subset() function included in the base package.

subset(orange_trees, circumference > mean(circumference))

  Tree  age circumference
     1 1231           120
     1 1372           142
     1 1582           145
     2 1004           156
     2 1231           172
     2 1372           203
     2 1582           203
     3 1372           139
     3 1582           140
     4 1004           167
     4 1231           179
     4 1372           209
     4 1582           214
     5 1004           125
     5 1231           142
     5 1372           174
     5 1582           177

This one is a lot more fluid and readable. You can, of course, do multiple conditions.

Notice how you do not have to type orange_trees and the corresponding $ with the subset() function.

This is similar to how you can attach() a dataset and access names directly.

subset(orange_trees, age > mean(age) & circumference < mean(circumference))

  Tree  age circumference
     1 1004           115
     3 1004           108
     3 1231           115
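subset() can also choose columns through its select argument, which plays the role of the Columns position in the bracket syntax. A short sketch, with older_thin_trees as a made-up name:

```r
orange_trees <- datasets::Orange

# Filter rows with the condition, then keep only two columns with select.
older_thin_trees <- subset(
  orange_trees,
  age > mean(age) & circumference < mean(circumference),
  select = c(Tree, circumference)
)
older_thin_trees
```

As with the condition, column names inside select are written without quotes or the $ prefix.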