Have you ever encountered a single variable containing multiple values separated by some delimiter?
Maybe you want to split it into multiple columns. That seems like a common task on the forums.
This is what I’m talking about:
id string
1 Gabapentin,Escitalopram,Fluoxetine
2 Solotol Hydrochloride
3 Hydrocodone,Neurontin,Topamax,Carbamazepine
I’m not sure how common this is in claims data, but it certainly happens with some messy data. With any type of non-structured data, it’s going to get dirty pretty quick.
Let’s cut to the chase. Say we have this:
data have;
infile datalines delimiter = "|" truncover;
input id string $100.;
datalines;
1|Gabapentin,Escitalopram,Fluoxetine
2|Solotol Hydrochloride
3|Hydrocodone,Neurontin,Topamax,Carbamazepine
;
It’s the same thing that I posted up top since I’m a bit lazy. I intentionally added a drug with a space delimiter to make this a bit more difficult. Using things like COUNTW and SCAN will use a default set of delimiters (!,., |
, etc.) to parse your value if you do not explicitly request your delimter. |
R has some pretty convenient functions to clean up variables like string
above, namely tidyr::separate.
So how are we going to parse this string? Here’s our code:
data want;
set have;
do i = 1 to countw(string, ","); /* #1 */
drug = scan(string, i, ","); /* #2 */
output; /* #3 */
end; /* #4 */
run;
As per usual, let’s run through the numbered comments above.
- Here, I tell SAS to begin a
DO
loop.- I tell SAS that I want a variable
i
to represent the number of iterations it needs to go through for a given observation’sstring
value. You could name this anything you want (e.g., x, y, z, gogators, etc.). - I want SAS to do it from the first (1) instance
to
the end condition of myDO
loop. - The end condition of my loop uses the
COUNTW
function.COUNTW
counts the number of words in a given variable. Here, I am requesting that it counts the number of words before the comma delimiter. - This becomes a dynamic
DO
loop that executes depending on the number of words in a given variable. You don’t have to set an explicit numbered range – SAS does it for you. SAS assigns this number to our variablei
that is automatically created in theDO
loop.
- I tell SAS that I want a variable
- Within my
DO
loop, I want to create a variable calleddrug
that captures thei
‘th word for a given observation’sstring
.- We also use the
SCAN
function to extract thedrug
of interest fromstring
. Using thei
variable, we are telling SAS the starting position of the string that we want to extract, and the","
represents the end of the string that we wantSCAN
to stop at.
- We also use the
- The explicit
OUTPUT
statement outputs thei
‘th word to our dataset. We must request this explicitOUTPUT
because aDO
loop will onlyOUTPUT
the last instance without the explicit reference. - The
END
statement reverts back to theDO
loop in #1.- If there is another word to process (another word bounded by a comma), then it continues processing
string
. - Simlutaneously, the
i
variable is incremented by 1. Ifi
is not equal to the dynamic upper limit (set byCOUNTW
), then it continues processing. - If it has reached the upper limit and there are no more words to process, the
DO
loop stops executing.
- If there is another word to process (another word bounded by a comma), then it continues processing
Alright, that’s a lot of information. What did we end up getting? Sometimes it makes more sense to jsut see what happened instead of reading a block of text.
id string i drug
1 Gabapentin,Escitalopram,Fluoxetine 1 Gabapentin
1 Gabapentin,Escitalopram,Fluoxetine 2 Escitalopram
1 Gabapentin,Escitalopram,Fluoxetine 3 Fluoxetine
2 Solotol Hydrochloride 1 Solotol Hydrochloride
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 1 Hydrocodone
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 2 Neurontin
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 3 Topamax
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 4 Carbamazepine
Ugh, now we’ve gone from wide to long, and that may not be what we wanted. Is this what we’re left with?
An easy PROC TRANSPOSE
can take care of this.
proc transpose
data = want /* #1 */
out = want_tp (drop = _name_) /* #2 */
prefix = drug; /* #3 */
by id; /* #4 */
var drug; /* #5 */
run;
Again, let’s break down those comments.
- I’m transposing our
want
dataset. - I want to output a new dataset,
want_tp
, containing my transposed table values.- Further, I want to drop the automatic
_NAME_
variable that SAS creates when you transpose. This is the name of the variable that you transposed.
- Further, I want to drop the automatic
- I use the
PREFIX =
argument to give the parsed columns a newly prefixed name. Here, I set it to drug, so each column will be labeleddrug1
.drug2
,drug3
, etc. - We are transposing by
ID
. We want our values to be collapsed into one observation per ID. - Finally, we are transposing the value
drug
. We want the values fordrug
to be collapsed to a givenid
.
This gives us this:
id drug1 drug2 drug3 drug4
1 Gabapentin Escitalopram Fluoxetine
2 Solotol Hydrochloride
3 Hydrocodone Neurontin Topamax Carbamazepine
Again, I don’t think we have this issue too much with claims data, but you never know what you might run into. This is a good tool to have in your back pocket.
Please let me know what things you’d like to see more of.
If you have any questions or comments, please email me at michaelqmaguire2@cop.ufl.edu, message me on Teams, or stop by my office at HPNP 2330.
Script for this post can be located and downloaded here.