Splitting Strings with Delimiters in SAS

Have you ever encountered a single variable containing multiple values separated by some delimiter?

Maybe you want to split it into multiple columns. That seems like a common task on the forums.

This is what I’m talking about:

id  string
 Gabapentin,Escitalopram,Fluoxetine
 Solotol Hydrochloride
 Hydrocodone,Neurontin,Topamax,Carbamazepine

I’m not sure how common this is in claims data, but it certainly happens with some messy data. With any type of non-structured data, it’s going to get dirty pretty quick.

Alt Text

Let’s cut to the chase. Say we have this:

data have;
infile datalines delimiter = "|" truncover;
input id string $100.;
datalines;
1|Gabapentin,Escitalopram,Fluoxetine
2|Solotol Hydrochloride
3|Hydrocodone,Neurontin,Topamax,Carbamazepine
;

It’s the same thing that I posted up top since I’m a bit lazy. I intentionally added a drug with a space delimiter to make this a bit more difficult. Using things like COUNTW and SCAN will use a default set of delimiters (!,., , etc.) to parse your value if you do not explicitly request your delimter.

R has some pretty convenient functions to clean up variables like string above, namely tidyr::separate.

So how are we going to parse this string? Here’s our code:

data want;
	set have;
	do i = 1 to countw(string, ","); /* #1 */
		drug = scan(string, i, ","); /* #2 */
		output; /* #3 */
	end; /* #4 */
run;

As per usual, let’s run through the numbered comments above.

Here, I tell SAS to begin a DO loop.
- I tell SAS that I want a variable i to represent the number of iterations it needs to go through for a given observation’s string value. You could name this anything you want (e.g., x, y, z, gogators, etc.).
- I want SAS to do it from the first (1) instance to the end condition of my DO loop.
- The end condition of my loop uses the COUNTW function. COUNTW counts the number of words in a given variable. Here, I am requesting that it counts the number of words before the comma delimiter.
- This becomes a dynamic DO loop that executes depending on the number of words in a given variable. You don’t have to set an explicit numbered range – SAS does it for you. SAS assigns this number to our variable i that is automatically created in the DO loop.
Within my DO loop, I want to create a variable called drug that captures the i‘th word for a given observation’s string.
- We also use the SCAN function to extract the drug of interest from string. Using the i variable, we are telling SAS the starting position of the string that we want to extract, and the "," represents the end of the string that we want SCAN to stop at.
The explicit OUTPUT statement outputs the i‘th word to our dataset. We must request this explicit OUTPUT because a DO loop will only OUTPUT the last instance without the explicit reference.
The END statement reverts back to the DO loop in #1.
- If there is another word to process (another word bounded by a comma), then it continues processing string.
- Simlutaneously, the i variable is incremented by 1. If i is not equal to the dynamic upper limit (set by COUNTW), then it continues processing.
- If it has reached the upper limit and there are no more words to process, the DO loop stops executing.

Alright, that’s a lot of information. What did we end up getting? Sometimes it makes more sense to jsut see what happened instead of reading a block of text.

id 	string 	                                        i 	drug
Gabapentin,Escitalopram,Fluoxetine 	        1 	Gabapentin
Gabapentin,Escitalopram,Fluoxetine 	        2 	Escitalopram
Gabapentin,Escitalopram,Fluoxetine 	        3 	Fluoxetine
Solotol Hydrochloride 	                        1 	Solotol Hydrochloride
Hydrocodone,Neurontin,Topamax,Carbamazepine 	1 	Hydrocodone
Hydrocodone,Neurontin,Topamax,Carbamazepine 	2 	Neurontin
Hydrocodone,Neurontin,Topamax,Carbamazepine 	3 	Topamax
Hydrocodone,Neurontin,Topamax,Carbamazepine 	4 	Carbamazepine

Ugh, now we’ve gone from wide to long, and that may not be what we wanted. Is this what we’re left with?

Alt Text

An easy PROC TRANSPOSE can take care of this.

proc transpose
	data = want /* #1 */
	out = want_tp (drop = _name_) /* #2 */
	prefix = drug; /* #3 */
		by id; /* #4 */
		var drug; /* #5 */
run;

Again, let’s break down those comments.

I’m transposing our want dataset.
I want to output a new dataset, want_tp, containing my transposed table values.
- Further, I want to drop the automatic _NAME_ variable that SAS creates when you transpose. This is the name of the variable that you transposed.
I use the PREFIX = argument to give the parsed columns a newly prefixed name. Here, I set it to drug, so each column will be labeled drug1. drug2, drug3, etc.
We are transposing by ID. We want our values to be collapsed into one observation per ID.
Finally, we are transposing the value drug. We want the values for drug to be collapsed to a given id.

This gives us this:

id 	drug1 	              drug2 	        drug3 	    drug4
Gabapentin            Escitalopram      Fluoxetine 	 
Solotol Hydrochloride	  	  	 
Hydrocodone           Neurontin 	Topamax     Carbamazepine

Alt Text

Again, I don’t think we have this issue too much with claims data, but you never know what you might run into. This is a good tool to have in your back pocket.

Please let me know what things you’d like to see more of.

If you have any questions or comments, please email me at michaelqmaguire2@cop.ufl.edu, message me on Teams, or stop by my office at HPNP 2330.

Script for this post can be located and downloaded here.