Overview
While we often generate binary variables from categorical variables, you might run into a situation where you want to do the opposite.
In other words, you may want to generate a single categorical variable from mutiple binary variables.
I sometimes do this to ensure my fields are being flagged correctly.
Creating a categorical variable
Assume that we have the following dataset:
data have;
infile datalines delimiter = ',' dsd;
input rx_hdc rx_warf rx_hep rx_gabp rx_qtp rx_nap 3.;
datalines;
1,0,0,0,0
0,1,0,0,0
0,0,1,0,0
0,0,0,1,0
0,1,0,0,0
0,1,0,0,0
0,0,0,0,1
0,1,0,0,0
0,0,0,0,1
;
run;
First, let me address the major assumption in this dataset – that is, the fields are mutually exclusive.
We could go through and do something like this:
data want;
set have;
if rx_hdc = 1 then prescription = 'hdc';
else if rx_hep = 1 then prescription = 'hep';
else if rx_warf = 1 then prescription = 'warf';
...
...
run;
This will certainly get what you need. However, if you have a lot of binary variables, this may get a bit tedious.
A clear solution is through an ARRAY
. But how can we extract the variable names dynamically without typing them?
Note: I assume your binary variables have a common prefix. If they do not, you can specify them with a double dash (e.g.,
ARRAY _rx [*] rx_hdc -- rx_nap
). This also assumes your fields are in a particular order.
data want_single;
set have;
array _rx [*] rx:;
do i = 1 to dim(_rx);
if _rx[i] = 1 then prescription = strip(tranwrd(vname(_rx[i]), 'rx_', ''));
end;
run;
I’ve covered arrays in a previous post. If that post wasn’t helpful or didn’t explain enough, I encourage you to examine the documentation.
You may wonder what is happening inside the DO
loop. If the value of the variable within the ARRAY
is equal to 1, then we do the following (from inside to outside) for the variable prescription
:
- We are extracting the variable name (
VNAME
) for that instance where our variable within theARRAY
is equal to 1.- This avoids having to manually set the
prescription
value.
- This avoids having to manually set the
- Next, we use the
TRANWRD
function to replace the prefixrx_
with a blank''
. - Finally, we remove any leading or trailing whitespace with the
STRIP
function.- Found this out the hard way. This is relevant for my next example.
Formatting your categorical variable
But what if we want a more descriptive format? We can use PROC FORMAT
before we run the DATA
step above, and then use PUT
to format it accordingly.
proc format;
value $ prescrip
'hdc' = 'Hydrocodone'
'hep' = 'Heparin'
'warf' = 'Warfarin'
'gabp' = 'Gabapentin'
'qtp' = 'Quetiapine'
'nap' = 'Naproxen';
run;
data want_single_fmt;
set have;
array _rx [*] rx:;
do i = 1 to dim(_rx);
if _rx[i] = 1 then prescription = put(strip(tranwrd(vname(_rx[i]), 'rx_', '')), $prescrip.);
end;
run;
If you do not use the STRIP
function, the variable will not format due to leading whitespace.
Multiple categorical values
Okay, but what if we have multiple binary variables that are equal to 1?
Fair point, and this one is a little more involved. The structure of this DATA
step is similar to the one I made in my last post.
Here’s our dataset with multiple binary variables flagged.
data have_multiple;
infile datalines delimiter = ',' dsd;
input rx_hdc rx_warf rx_hep rx_gabp rx_qtp rx_nap 3.;
datalines;
1,0,0,0,1,1
0,1,0,0,0,1
0,0,1,0,0,0
0,0,0,1,0,0
0,1,0,1,0,0
0,1,0,0,0,1
0,0,0,0,1,0
0,1,0,1,0,1
0,1,0,0,1,1
;
run;
And this is the corresponding code.
data want_multiple;
set have_multiple;
length combined $100.;
array _rx [*] rx:;
do i = 1 to dim(_rx);
if _rx[i] = 1 then do;
prescription = put(strip(tranwrd(vname(_rx[i]), 'rx_', '')), $prescrip.);
combined = catx(', ', combined, prescription);
end;
end;
run;
Two differences:
- We’re creating a new variable called
combined
and setting its length to 100. You need to determine the correct length beforehand. This will hold all of our variables. - Instead of just doing a usual
DO
loop, we need to explicitly tell SAS to do the following when_rx[i]
is equal to 1:- Create the prescription value like we did earlier.
- Concatenate
combined
(blank at the beginning) andprescription
each time_rx[i]
is equal to 1.
Code is available for download here.