Overview

While we often generate binary variables from categorical variables, you might run into a situation where you want to do the opposite.

In other words, you may want to generate a single categorical variable from mutiple binary variables.

I sometimes do this to ensure my fields are being flagged correctly.

Creating a categorical variable

Assume that we have the following dataset:

data have;
infile datalines delimiter = ',' dsd;
input rx_hdc rx_warf rx_hep rx_gabp rx_qtp rx_nap 3.;
datalines;
1,0,0,0,0
0,1,0,0,0
0,0,1,0,0
0,0,0,1,0
0,1,0,0,0
0,1,0,0,0
0,0,0,0,1
0,1,0,0,0
0,0,0,0,1
;
run;

Alt Text

First, let me address the major assumption in this dataset – that is, the fields are mutually exclusive.

We could go through and do something like this:

data want;
    set have;
        if rx_hdc = 1 then prescription = 'hdc';
            else if rx_hep = 1 then prescription = 'hep';
            else if rx_warf = 1 then prescription = 'warf';
    ...
    ...
run;

This will certainly get what you need. However, if you have a lot of binary variables, this may get a bit tedious.

A clear solution is through an ARRAY. But how can we extract the variable names dynamically without typing them?

Note: I assume your binary variables have a common prefix. If they do not, you can specify them with a double dash (e.g., ARRAY _rx [*] rx_hdc -- rx_nap). This also assumes your fields are in a particular order.

data want_single;
	set have;
	array _rx [*] rx:;
	do i = 1 to dim(_rx);
		if _rx[i] = 1 then prescription = strip(tranwrd(vname(_rx[i]), 'rx_', ''));
	end;
run;

Alt Text

I’ve covered arrays in a previous post. If that post wasn’t helpful or didn’t explain enough, I encourage you to examine the documentation.

You may wonder what is happening inside the DO loop. If the value of the variable within the ARRAY is equal to 1, then we do the following (from inside to outside) for the variable prescription:

  1. We are extracting the variable name (VNAME) for that instance where our variable within the ARRAY is equal to 1.
    • This avoids having to manually set the prescription value.
  2. Next, we use the TRANWRD function to replace the prefix rx_ with a blank ''.
  3. Finally, we remove any leading or trailing whitespace with the STRIP function.
    • Found this out the hard way. This is relevant for my next example.

Formatting your categorical variable

But what if we want a more descriptive format? We can use PROC FORMAT before we run the DATA step above, and then use PUT to format it accordingly.

proc format;
    value $ prescrip
        'hdc' = 'Hydrocodone'
        'hep' = 'Heparin'
        'warf' = 'Warfarin'
        'gabp' = 'Gabapentin'
        'qtp' = 'Quetiapine'
        'nap' = 'Naproxen';
run;

data want_single_fmt;
	set have;
	array _rx [*] rx:;
	do i = 1 to dim(_rx);
		if _rx[i] = 1 then prescription = put(strip(tranwrd(vname(_rx[i]), 'rx_', '')), $prescrip.);
	end;
run;

Alt Text

If you do not use the STRIP function, the variable will not format due to leading whitespace.

Multiple categorical values

Okay, but what if we have multiple binary variables that are equal to 1?

Fair point, and this one is a little more involved. The structure of this DATA step is similar to the one I made in my last post.

Here’s our dataset with multiple binary variables flagged.

data have_multiple;
infile datalines delimiter = ',' dsd;
input rx_hdc rx_warf rx_hep rx_gabp rx_qtp rx_nap 3.;
datalines;
1,0,0,0,1,1
0,1,0,0,0,1
0,0,1,0,0,0
0,0,0,1,0,0
0,1,0,1,0,0
0,1,0,0,0,1
0,0,0,0,1,0
0,1,0,1,0,1
0,1,0,0,1,1
;
run;

Alt Text

And this is the corresponding code.

data want_multiple;
	set have_multiple;
	length combined $100.;
	array _rx [*] rx:;
	do i = 1 to dim(_rx);
		if _rx[i] = 1 then do;
			prescription = put(strip(tranwrd(vname(_rx[i]), 'rx_', '')), $prescrip.);
			combined = catx(', ', combined, prescription);
		end;
	end;
run;

Alt Text

Two differences:

  1. We’re creating a new variable called combined and setting its length to 100. You need to determine the correct length beforehand. This will hold all of our variables.
  2. Instead of just doing a usual DO loop, we need to explicitly tell SAS to do the following when _rx[i] is equal to 1:
    • Create the prescription value like we did earlier.
    • Concatenate combined (blank at the beginning) and prescription each time _rx[i] is equal to 1.

Code is available for download here.