What is subsetting and why do I need it?
- Most datasets have rows and columns.
- Rows/records represent all values for a given subject.
- Columns/fields represent a single measure across all subjects in the dataset.
- You often want to examine or extract specific values to answer questions about your data.
First, pull in example data
orange_trees <- datasets::Orange
- orange_trees is the object I am creating within my global environment.
- I am assigning (<-) the Orange dataset from the datasets package to orange_trees.
To access functions in any package you have installed, type the package name followed by two colons (::) and then the item you want (e.g., package::item).
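As a small sketch of the package::item pattern, the lines below call functions from the utils and stats packages by their full names. The object name first_three is just an illustrative choice, not something used elsewhere in this post.

```r
# head() lives in the utils package and Orange lives in the datasets package,
# so both can be named explicitly with ::
first_three <- utils::head(datasets::Orange, n = 3)
first_three

# The same pattern works for other packages, e.g. median() from stats
stats::median(datasets::Orange$age)
```

Packages such as utils, stats, and datasets are attached by default, so the prefix is optional for them, but it makes it obvious where each item comes from.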
We can quickly examine the first 6 records with the head() function.
head(orange_trees)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
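head() returns six rows unless told otherwise. If you want a different number, a minimal sketch using its n argument (and its counterpart tail(), which I am adding here for illustration) might look like this:

```r
orange_trees <- datasets::Orange

# Ask for only the first 3 records instead of the default 6
first_rows <- head(orange_trees, n = 3)
first_rows

# tail() works the same way, but from the bottom of the dataset
last_rows <- tail(orange_trees, n = 3)
last_rows

# nrow() tells you how many records there are in total
nrow(orange_trees)
```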
Now, how do we get rows that meet a condition?
- Base R has a simple structure for subsetting records.
YourObject[Rows, Columns]
- YourObject is your dataset.
- Rows are your records - what do you want to filter or subset your records by?
- Columns represent the columns you want to select and retain in your object.
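Before adding a condition, it may help to see the [Rows, Columns] structure with plain positions and names. This is only a sketch; the object names below are made up for illustration.

```r
orange_trees <- datasets::Orange

# Rows 1 to 3, every column (nothing after the comma)
first_three_rows <- orange_trees[1:3, ]

# Every row (nothing before the comma), only two columns
two_columns <- orange_trees[, c("age", "circumference")]

# Rows 1 to 3 and only two columns
small_piece <- orange_trees[1:3, c("age", "circumference")]
small_piece
```

Leaving one side of the comma empty means "keep everything" for that side.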
Trying it with orange_trees
Let’s say we’re interested in getting trees whose circumference is above the mean circumference.
print(paste("The mean value for circumference is:", mean(orange_trees$circumference)))
orange_trees[orange_trees$circumference > 115.85, ]
[1] "The mean value for circumference is: 115.857142857143"
> orange_trees[orange_trees$circumference > 115.85, ]
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
We can see that those orange trees with a circumference greater than the mean circumference are returned.
Why are we typing orange_trees twice?
- First, you’re accessing your object, orange_trees.
- Second, within this object (brackets indicate within), you are specifying that you want to subset your object, orange_trees, by the column circumference.
- Third, the $ allows you to access that specific column and the rows associated with that column.
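One way to see why the object name appears twice is to split the expression into two steps. In this sketch, is_above_mean is a name I made up: it holds one TRUE or FALSE per record, and the brackets then keep only the TRUE records.

```r
orange_trees <- datasets::Orange

# Step 1: compare every circumference value to the mean.
# The result is a logical vector with one entry per row.
is_above_mean <- orange_trees$circumference > mean(orange_trees$circumference)
head(is_above_mean)

# How many records meet the condition?
sum(is_above_mean)

# Step 2: use that logical vector in the Rows position
orange_trees[is_above_mean, ]
```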
We have to specify the condition - those that are greater than the mean circumference value. We do this as we would in any other language - with an operator.
- Here, I am requesting circumference values greater than 115.85.
Notice the comma that follows this conditional expression - it indicates that you only want to subset records and keep all columns.
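To see what the comma is doing, here is a sketch that filters the same records twice: once with nothing after the comma, and once with column names after it. The object names are hypothetical.

```r
orange_trees <- datasets::Orange
above_mean <- orange_trees$circumference > mean(orange_trees$circumference)

# Nothing after the comma: keep every column
all_columns <- orange_trees[above_mean, ]

# Column names after the comma: keep only those columns
two_columns <- orange_trees[above_mean, c("Tree", "circumference")]

ncol(all_columns)
ncol(two_columns)
```

Both results contain the same records; only the number of columns differs.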
We could also use this code to obtain the same result.
# If you want to save this to your global environment for future use,
# you could type something like the following:
# orange_trees_subset <- orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
orange_trees[orange_trees$circumference > mean(orange_trees$circumference), ]
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
Subsetting by multiple conditions
Often, subsetting by just one field won’t cut it - you have two or three conditions that you need in your next dataset.
Let’s say that I want to get trees that are above the mean age and have a circumference less than the mean circumference.
Note: I use parentheses to surround each individual expression. It’s easier for me to read, and can prevent some errors.
orange_trees[(orange_trees$age > mean(orange_trees$age)) & (orange_trees$circumference < mean(orange_trees$circumference)), ]
Tree age circumference
4 1 1004 115
18 3 1004 108
19 3 1231 115
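If the one-line version is hard to read, one option is to store each condition under its own name first and then combine them. This is only a sketch; older and thinner are names I chose for illustration, and the | (or) operator is shown for contrast with & (and).

```r
orange_trees <- datasets::Orange

# Each condition becomes its own logical vector
older   <- orange_trees$age > mean(orange_trees$age)
thinner <- orange_trees$circumference < mean(orange_trees$circumference)

# & keeps records where BOTH conditions are TRUE
both_conditions <- orange_trees[older & thinner, ]
both_conditions

# | keeps records where AT LEAST ONE condition is TRUE
either_condition <- orange_trees[older | thinner, ]
nrow(either_condition)
```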
You can get more detail about all the R operators here.
The subset() function
Don’t worry, if that looks like too much, you can default to the subset() function included in the base package.
subset(orange_trees, circumference > mean(circumference))
Tree age circumference
5 1 1231 120
6 1 1372 142
7 1 1582 145
11 2 1004 156
12 2 1231 172
13 2 1372 203
14 2 1582 203
20 3 1372 139
21 3 1582 140
25 4 1004 167
26 4 1231 179
27 4 1372 209
28 4 1582 214
32 5 1004 125
33 5 1231 142
34 5 1372 174
35 5 1582 177
This one is a lot more fluid and readable. You can, of course, do multiple conditions.
Notice how you do not have to type orange_trees and the corresponding $ with the subset() function.
This is similar to how you can attach() a dataset and access names directly.
subset(orange_trees, age > mean(age) & circumference < mean(circumference))
Tree age circumference
4 1 1004 115
18 3 1004 108
19 3 1231 115
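subset() can also pick columns through its select argument, which plays the role of the Columns position in the bracket syntax. A minimal sketch, with older_thinner_trees as a made-up object name:

```r
orange_trees <- datasets::Orange

# Filter records and keep only the Tree and circumference columns.
# Column names are written without quotes or $ inside subset().
older_thinner_trees <- subset(
  orange_trees,
  age > mean(age) & circumference < mean(circumference),
  select = c(Tree, circumference)
)
older_thinner_trees
```

This should return the same three records as above, just without the age column.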