What is subsetting and why do I need it?
- Most datasets have rows and columns.
- Rows/records represent all values for a given subject.
- Columns/fields represent a single measure across all subjects in the dataset.
- You often want to examine or extract specific values to answer questions about your data.
First, pull in example data
orange_trees <- datasets::Orange
- orange_trees is the object I am creating within my global environment.
- I am assigning (<-) the Orange dataset from the datasets package to orange_trees.
To access a function or dataset in any package you have installed, type the package name, followed by two colons (::), and then the item you want (e.g., package::item).
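As a small illustration of the package::item pattern described above, the sketch below reaches into two packages that ship with R (stats and datasets). The numbers passed to median() are made up for the example.

```r
# Call a function from a specific package without attaching it via library().
middle_value <- stats::median(c(2, 4, 9))

# Datasets that ship inside a package can be reached the same way.
orange_trees <- datasets::Orange

# How many records did we pull in?
nrow(orange_trees)
```

Spelling out the package name is optional for packages that are already attached, but it makes clear where each item comes from.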
We can quickly examine the first 6 records with the head() function.
head(orange_trees)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
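head() returns six rows by default. If you want a different number, a minimal sketch using its n argument (and its counterpart tail()) could look like this; the object names first_three and last_two are only illustrative.

```r
orange_trees <- datasets::Orange

# Ask for a specific number of rows from the top of the dataset.
first_three <- head(orange_trees, n = 3)

# tail() works the same way from the bottom of the dataset.
last_two <- tail(orange_trees, n = 2)

first_three
last_two
```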
Now, how do we get rows that meet a condition?
- Base R has a simple structure for subsetting records.
YourObject[Rows, Columns]
- YourObject is your dataset.
- Rows are your records: what do you want to filter or subset your records by?
- Columns are the columns you want to select and retain in your object.
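To make the [Rows, Columns] structure concrete, here is a short sketch that fills in each position in turn. The row numbers and column choices are arbitrary picks for the example.

```r
orange_trees <- datasets::Orange

# Rows 1 to 3, keeping only the age and circumference columns.
rows_and_cols <- orange_trees[1:3, c("age", "circumference")]

# Rows position left empty: keep every row, but only one column.
# drop = FALSE keeps the result as a data frame rather than a plain vector.
one_col <- orange_trees[, "circumference", drop = FALSE]

# Columns position left empty: keep every column for the chosen rows.
some_rows <- orange_trees[c(1, 8, 15), ]
```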
Trying it with orange_trees
Let’s say we’re interested in getting trees whose circumference is above the mean circumference.
print(paste("The mean value for circumference is:", mean(orange_trees$circumference)))
orange_trees[orange_trees$circumference > 115.85, ]
[1] "The mean value for circumference is: 115.857142857143"
> orange_trees[orange_trees$circumference > 115.85, ]
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
We can see that those orange trees with a circumference greater than the mean circumference are returned.
Why are we typing orange_trees twice?
- First, you’re accessing your object, orange_trees.
- Second, within this object (brackets indicate within), you are specifying that you want to subset your object, orange_trees, by the column circumference.
- Third, the $ allows you to access that specific column and the rows associated with that column.
We have to specify the condition: trees with a circumference greater than the mean circumference value. We do this as we would in any other language, with a comparison operator.
- Here, I am requesting circumference values greater than 115.85.
Notice the comma that follows this conditional expression. Leaving the column position after it empty indicates that you only want to subset records and keep all columns.
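It can help to look at what the condition produces on its own. The sketch below stores it in a variable (is_above_mean, a name chosen just for this example) and then uses it in the row position, with and without a column selection after the comma.

```r
orange_trees <- datasets::Orange

# The condition evaluates to one TRUE/FALSE value per row.
is_above_mean <- orange_trees$circumference > mean(orange_trees$circumference)
head(is_above_mean)
# [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

# Counting the TRUE values tells us how many rows will be kept.
sum(is_above_mean)
# [1] 17

# Nothing after the comma: keep all columns for the matching rows.
above_mean <- orange_trees[is_above_mean, ]

# Something after the comma: keep only the named columns.
above_mean_small <- orange_trees[is_above_mean, c("Tree", "circumference")]
```

Rows where the vector is TRUE are returned; rows where it is FALSE are dropped.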
We could also use this code to obtain the same result.
# If you want to save this to your global environment for future use,
# you could type something like the following:
# orange_trees_subset <- orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
Subsetting by multiple conditions
Often, subsetting by just one field won’t cut it - you have two or three conditions that you need in your next dataset.
Let’s say that I want to get trees that are above the mean age and have a circumference less than the mean circumference.
Note: I use parentheses to surround each individual expression. It’s easier for me to read, and it can prevent some errors.
orange_trees[(orange_trees$age > mean(orange_trees$age)) & (orange_trees$circumference < mean(orange_trees$circumference)), ]
Tree age circumference
4 1 1004 115
18 3 1004 108
19 3 1231 115
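One way to keep long conditions readable is to name each one before combining them. This is only a sketch of that idea: the names older and thinner are invented for the example, and the ! (not) operator is shown as one further way to combine conditions.

```r
orange_trees <- datasets::Orange

# Name each condition so the final subset reads almost like a sentence.
older   <- orange_trees$age > mean(orange_trees$age)
thinner <- orange_trees$circumference < mean(orange_trees$circumference)

# & keeps rows where both conditions are TRUE.
older_and_thinner <- orange_trees[older & thinner, ]

# ! flips a condition: older trees that are NOT thinner than the mean.
older_not_thinner <- orange_trees[older & !thinner, ]

older_and_thinner
```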
You can get more detail about all the R operators here.
The subset() function
Don’t worry: if that looks like too much, you can fall back on the subset() function included in the base package.
subset(orange_trees, circumference > mean(circumference))
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
This one is a lot more fluid and readable. You can, of course, do multiple conditions.
Notice how you do not have to type orange_trees and the corresponding $ with the subset() function.
This is similar to how you can attach() a dataset and access names directly.
subset(orange_trees, age > mean(age) & circumference < mean(circumference))
Tree age circumference
4 1 1004 115
18 3 1004 108
19 3 1231 115
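subset() can also pick columns through its select argument, which plays the role of the Columns position in the bracket syntax. The sketch below runs the same filter both ways and checks that the same rows come back; the object names are illustrative only.

```r
orange_trees <- datasets::Orange

# subset(): the condition and the columns are written with bare column names.
with_subset <- subset(
  orange_trees,
  age > mean(age) & circumference < mean(circumference),
  select = c(Tree, circumference)
)

# Bracket syntax: the same request, spelled out with $ and quoted names.
with_brackets <- orange_trees[
  (orange_trees$age > mean(orange_trees$age)) &
    (orange_trees$circumference < mean(orange_trees$circumference)),
  c("Tree", "circumference")
]

# Both approaches return the same rows.
identical(rownames(with_subset), rownames(with_brackets))
```

One difference worth knowing: when a condition evaluates to NA for a row, subset() drops that row, while bracket subsetting returns a row of NAs in its place. The Orange data has no missing values, so the two agree here.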