Have you ever encountered a single variable containing multiple values separated by some delimiter?
Maybe you want to split it into multiple columns. That seems like a common task on the forums.
This is what I’m talking about:
id string
1 Gabapentin,Escitalopram,Fluoxetine
2 Solotol Hydrochloride
3 Hydrocodone,Neurontin,Topamax,Carbamazepine
I’m not sure how common this is in claims data, but it certainly happens with some messy data. With any type of non-structured data, it’s going to get dirty pretty quick.

Let’s cut to the chase. Say we have this:
data have;
infile datalines delimiter = "|" truncover;
input id string $100.;
datalines;
1|Gabapentin,Escitalopram,Fluoxetine
2|Solotol Hydrochloride
3|Hydrocodone,Neurontin,Topamax,Carbamazepine
;
It’s the same thing that I posted up top since I’m a bit lazy. I intentionally added a drug with a space delimiter to make this a bit more difficult. Using things like COUNTW and SCAN will use a default set of delimiters (!,., |
, etc.) to parse your value if you do not explicitly request your delimter. |
R has some pretty convenient functions to clean up variables like string above, namely tidyr::separate.
So how are we going to parse this string? Here’s our code:
data want;
set have;
do i = 1 to countw(string, ","); /* #1 */
drug = scan(string, i, ","); /* #2 */
output; /* #3 */
end; /* #4 */
run;
As per usual, let’s run through the numbered comments above.
- Here, I tell SAS to begin a
DOloop.- I tell SAS that I want a variable
ito represent the number of iterations it needs to go through for a given observation’sstringvalue. You could name this anything you want (e.g., x, y, z, gogators, etc.). - I want SAS to do it from the first (1) instance
tothe end condition of myDOloop. - The end condition of my loop uses the
COUNTWfunction.COUNTWcounts the number of words in a given variable. Here, I am requesting that it counts the number of words before the comma delimiter. - This becomes a dynamic
DOloop that executes depending on the number of words in a given variable. You don’t have to set an explicit numbered range – SAS does it for you. SAS assigns this number to our variableithat is automatically created in theDOloop.
- I tell SAS that I want a variable
- Within my
DOloop, I want to create a variable calleddrugthat captures thei‘th word for a given observation’sstring.- We also use the
SCANfunction to extract thedrugof interest fromstring. Using theivariable, we are telling SAS the starting position of the string that we want to extract, and the","represents the end of the string that we wantSCANto stop at.
- We also use the
- The explicit
OUTPUTstatement outputs thei‘th word to our dataset. We must request this explicitOUTPUTbecause aDOloop will onlyOUTPUTthe last instance without the explicit reference. - The
ENDstatement reverts back to theDOloop in #1.- If there is another word to process (another word bounded by a comma), then it continues processing
string. - Simlutaneously, the
ivariable is incremented by 1. Ifiis not equal to the dynamic upper limit (set byCOUNTW), then it continues processing. - If it has reached the upper limit and there are no more words to process, the
DOloop stops executing.
- If there is another word to process (another word bounded by a comma), then it continues processing
Alright, that’s a lot of information. What did we end up getting? Sometimes it makes more sense to jsut see what happened instead of reading a block of text.
id string i drug
1 Gabapentin,Escitalopram,Fluoxetine 1 Gabapentin
1 Gabapentin,Escitalopram,Fluoxetine 2 Escitalopram
1 Gabapentin,Escitalopram,Fluoxetine 3 Fluoxetine
2 Solotol Hydrochloride 1 Solotol Hydrochloride
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 1 Hydrocodone
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 2 Neurontin
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 3 Topamax
3 Hydrocodone,Neurontin,Topamax,Carbamazepine 4 Carbamazepine
Ugh, now we’ve gone from wide to long, and that may not be what we wanted. Is this what we’re left with?

An easy PROC TRANSPOSE can take care of this.
proc transpose
data = want /* #1 */
out = want_tp (drop = _name_) /* #2 */
prefix = drug; /* #3 */
by id; /* #4 */
var drug; /* #5 */
run;
Again, let’s break down those comments.
- I’m transposing our
wantdataset. - I want to output a new dataset,
want_tp, containing my transposed table values.- Further, I want to drop the automatic
_NAME_variable that SAS creates when you transpose. This is the name of the variable that you transposed.
- Further, I want to drop the automatic
- I use the
PREFIX =argument to give the parsed columns a newly prefixed name. Here, I set it to drug, so each column will be labeleddrug1.drug2,drug3, etc. - We are transposing by
ID. We want our values to be collapsed into one observation per ID. - Finally, we are transposing the value
drug. We want the values fordrugto be collapsed to a givenid.
This gives us this:
id drug1 drug2 drug3 drug4
1 Gabapentin Escitalopram Fluoxetine
2 Solotol Hydrochloride
3 Hydrocodone Neurontin Topamax Carbamazepine

Again, I don’t think we have this issue too much with claims data, but you never know what you might run into. This is a good tool to have in your back pocket.
Please let me know what things you’d like to see more of.
If you have any questions or comments, please email me at michaelqmaguire2@cop.ufl.edu, message me on Teams, or stop by my office at HPNP 2330.
Script for this post can be located and downloaded here.