Locating Specific Variables Across Data Sets Using SAS

Overview

This post is doesn’t contain anything novel, but I think it is a valuable technique to familiarize yourself with files.

I was recently asked to use all Medicare files to complete a task. Given that this was my first time using Medicare files, I was a bit confused – how do I know which files contain what we need?

Specifically, I needed to find ICD codes across all files within a year.

Some traditional techniques include:

1. Running `PROC CONTENTS` on the entire library.
2. Opening individual files within the library and clicking through all the columns.

Of course, you may be in some trouble if your data sets contain hundreds of columns, which seems to be quite normal with healthcare data sets.

I really don’t like doing (2), and it doesn’t help that my mouse at work has some erratic behaviors that makes clicking/scrolling difficult.

Simulate the Issue

Let’s create a fake library and fake data sets to replicate the problem.

options minoperator;

%macro fake_claims;							/*	Begin macro that will create 10 data sets with differing numbers of DX's.			*/

%do yr = 10 %to 20;							/* 	Setting yr macro variable to iterate from 10 (2010) to 20 (2020). 					*/

	%if &yr. in (10 11 12 13 14 15) %then %do;
	data 		work.cc_ot_svc_&yr (drop = i); 	/* 	Below creates the actual data sets.													*/
		array 	dx [2] 3. icd_dgns_cd_1-icd_dgns_cd_2;
		length 	year $4.;
	    do  	pt = 1 to 5000;					/* 	Create a record for each patient													*/
		do		i = 1 to dim(dx);				/* 	Do the following for each diagnosis in the array.									*/
				dx[i] = rand("bernoulli", .4);	/*	Generate a random number between 0 and 1. Just fake placeholders.					*/
				year = catt("20", &yr.);
		end; 									/*	End the <do i = 1 to dim(dx)> do statement											*/
		output;									/* 	Necessary to output information for each patient.									*/
		end;									/*	End the <do pt = 1 to 5000> do statement.											*/
	run;										/* 	Run the data step.																	*/
	%end;
		%else %if &yr. in (16 17 18) %then %do;
			data 		work.cc_ot_svc_&yr (drop = i); 	/* 	Below creates the actual data sets.													*/
				array 	dx [2] 3. icd_prcd_cd_1-icd_prcd_cd_2;
				length 	year $4.;
				
			    do  	pt = 1 to 5000;					/* 	Create a record for each patient													*/
				do		i = 1 to dim(dx);				/* 	Do the following for each diagnosis in the array.									*/
						dx[i] = rand("bernoulli", .4);	/*	Generate a random number between 0 and 1. Just fake placeholders.					*/
						year = catt("20", &yr.);
				end; 									/*	End the <do i = 1 to dim(dx)> do statement											*/
				output;									/* 	Necessary to output information for each patient.									*/
				end;									/*	End the <do pt = 1 to 5000> do statement.											*/    
			run;										/* 	Run the data step.																	*/
		%end;
			%else %do;
                    data 		work.cc_ot_svc_&yr (drop = i); 	/* 	Below creates the actual data sets.													*/
                        array 	dx [2] 3. cpt_1-cpt_2;
                        length 	year $4.;
                        
                        do  	pt = 1 to 5000;					/* 	Create a record for each patient													*/
                        do		i = 1 to dim(dx);				/* 	Do the following for each diagnosis in the array.									*/
                                dx[i] = rand("bernoulli", .4);	/*	Generate a random number between 0 and 1. Just fake placeholders.					*/
                                year = catt("20", &yr.);
                        end; 									/*	End the <do i = 1 to dim(dx)> do statement											*/
                        output;									/* 	Necessary to output information for each patient.									*/
                        end;									/*	End the <do pt = 1 to 5000> do statement.											*/	    
                    run;										/* 	Run the data step.																	*/
            %end;
%end;										/* 	End the %DO processing above.														*/
%mend fake_claims;							/*	End the macro.																		*/
/* Execute the macro */
%fake_claims;

If you’re not running the code, I have basically created ten datasets that either get ICD diagnosis codes, ICD procedure codes, or CPT codes. They’re just a binary variable here and not an actual code.

So let’s say someone tells you they need you to use all files that have ICD codes. You could scroll through a PROC CONTENTS, click and view the data set and view its contents, or some other method that I’m not aware of.

Solution

My preferred way is through SAS’s Dictionary Tables. Let’s get the data sets that have ICD diagnosis codes.

proc sql;
	select
				distinct memname
	from
				dictionary.columns
	where
				upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;

Here, I am extracting from the dictionary tables (not to be confused with dictionary.tables) the data set names in the WORK library that are diagnosis codes.

Member Name 
CC_OT_SVC_10 
CC_OT_SVC_11 
CC_OT_SVC_12 
CC_OT_SVC_13 
CC_OT_SVC_14 
CC_OT_SVC_15 
CC_OT_SVC_16 
CC_OT_SVC_17 
CC_OT_SVC_18 

To extract this information, you have to specify the dictionary table you want to use, along with the types of information you would like to see.

Here, I specify libname = 'WORK' to tell SAS to look for data sets in the WORK library. I also use the UPCASE function to convert the value of libname to uppercase.

SAS converts certain fields to uppercase in dictionary tables. By specifying UPCASE or LOWCASE, you can avoid this hassle.

Also, I recommend this resource to examine all the dictionary tables. It contains a lot of information about the metadata of a data set.

We can use the INTO: clause to create a macro to store these data set names instead of creating an HTML listing.

proc sql noprint;
	select
				distinct memname
					into: datasets separated by ' '
	from
				dictionary.columns
	where
				upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;

The NOPRINT option suppresses the HTML listing of the data sets.

If you have more than one value that you’re placing in the macro variable, you need to use the SEPARATED BY argument and specify a delimiter.

You can examine the macro variable in the log by using the %PUT macro statement:

%put &datasets.;

Produces this in the log:

CC_OT_SVC_10 CC_OT_SVC_11 CC_OT_SVC_12 CC_OT_SVC_13 CC_OT_SVC_14 CC_OT_SVC_15

You can then iteratively extract each data set from the macro variable and do the necessary processing.

%macro extract_icd ();
	%do i = 1 %to %sysfunc(countw(&datasets., " "));
	%let dsn = %scan(&datasets., &i., " ");
	title "10 observations from &dsn.";
		proc sql inobs = 10;
			select
						*
			from
						&dsn.
			where
						icd_dgns_cd_1 = 1 and icd_dgns_cd_2 = 1;
		quit;
	title;
	%end;
%mend extract_icd;

%extract_icd;

See my post regarding Creating Data Driven Macros Using SAS for more details on using %SYSFUNC(COUNTW()) in your macro processing.

This results in the following (for 2010):

icd_dgns_cd_1	icd_dgns_cd_2	year	pt
1	1	2010	7
1	1	2010	18
1	1	2010	34
1	1	2010	40
1	1	2010	42
1	1	2010	43
1	1	2010	53
1	1	2010	54
1	1	2010	72
1	1	2010	79

You can even create some data sets if you want:

proc sql;
	create table 	dsn_and_variables as
		select
					memname,
					name
		from
					dictionary.columns
		where
					upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;

Which results in this:

memname	name
CC_OT_SVC_10	icd_dgns_cd_1
CC_OT_SVC_10	icd_dgns_cd_2
CC_OT_SVC_11	icd_dgns_cd_1
CC_OT_SVC_11	icd_dgns_cd_2
CC_OT_SVC_12	icd_dgns_cd_1
CC_OT_SVC_12	icd_dgns_cd_2
CC_OT_SVC_13	icd_dgns_cd_1
CC_OT_SVC_13	icd_dgns_cd_2
CC_OT_SVC_14	icd_dgns_cd_1
CC_OT_SVC_14	icd_dgns_cd_2
CC_OT_SVC_15	icd_dgns_cd_1
CC_OT_SVC_15	icd_dgns_cd_2

Program is located here.