Overview
This post is doesn’t contain anything novel, but I think it is a valuable technique to familiarize yourself with files.
I was recently asked to use all Medicare files to complete a task. Given that this was my first time using Medicare files, I was a bit confused – how do I know which files contain what we need?
Specifically, I needed to find ICD codes across all files within a year.
Some traditional techniques include:
1. Running `PROC CONTENTS` on the entire library.
2. Opening individual files within the library and clicking through all the columns.
Of course, you may be in some trouble if your data sets contain hundreds of columns, which seems to be quite normal with healthcare data sets.
I really don’t like doing (2), and it doesn’t help that my mouse at work has some erratic behaviors that makes clicking/scrolling difficult.
Simulate the Issue
Let’s create a fake library and fake data sets to replicate the problem.
options minoperator;
%macro fake_claims; /* Begin macro that will create 10 data sets with differing numbers of DX's. */
%do yr = 10 %to 20; /* Setting yr macro variable to iterate from 10 (2010) to 20 (2020). */
%if &yr. in (10 11 12 13 14 15) %then %do;
data work.cc_ot_svc_&yr (drop = i); /* Below creates the actual data sets. */
array dx [2] 3. icd_dgns_cd_1-icd_dgns_cd_2;
length year $4.;
do pt = 1 to 5000; /* Create a record for each patient */
do i = 1 to dim(dx); /* Do the following for each diagnosis in the array. */
dx[i] = rand("bernoulli", .4); /* Generate a random number between 0 and 1. Just fake placeholders. */
year = catt("20", &yr.);
end; /* End the <do i = 1 to dim(dx)> do statement */
output; /* Necessary to output information for each patient. */
end; /* End the <do pt = 1 to 5000> do statement. */
run; /* Run the data step. */
%end;
%else %if &yr. in (16 17 18) %then %do;
data work.cc_ot_svc_&yr (drop = i); /* Below creates the actual data sets. */
array dx [2] 3. icd_prcd_cd_1-icd_prcd_cd_2;
length year $4.;
do pt = 1 to 5000; /* Create a record for each patient */
do i = 1 to dim(dx); /* Do the following for each diagnosis in the array. */
dx[i] = rand("bernoulli", .4); /* Generate a random number between 0 and 1. Just fake placeholders. */
year = catt("20", &yr.);
end; /* End the <do i = 1 to dim(dx)> do statement */
output; /* Necessary to output information for each patient. */
end; /* End the <do pt = 1 to 5000> do statement. */
run; /* Run the data step. */
%end;
%else %do;
data work.cc_ot_svc_&yr (drop = i); /* Below creates the actual data sets. */
array dx [2] 3. cpt_1-cpt_2;
length year $4.;
do pt = 1 to 5000; /* Create a record for each patient */
do i = 1 to dim(dx); /* Do the following for each diagnosis in the array. */
dx[i] = rand("bernoulli", .4); /* Generate a random number between 0 and 1. Just fake placeholders. */
year = catt("20", &yr.);
end; /* End the <do i = 1 to dim(dx)> do statement */
output; /* Necessary to output information for each patient. */
end; /* End the <do pt = 1 to 5000> do statement. */
run; /* Run the data step. */
%end;
%end; /* End the %DO processing above. */
%mend fake_claims; /* End the macro. */
/* Execute the macro */
%fake_claims;
If you’re not running the code, I have basically created ten datasets that either get ICD diagnosis codes, ICD procedure codes, or CPT codes. They’re just a binary variable here and not an actual code.
So let’s say someone tells you they need you to use all files that have ICD codes. You could scroll through a PROC CONTENTS
, click and view the data set and view its contents, or some other method that I’m not aware of.
Solution
My preferred way is through SAS’s Dictionary Tables. Let’s get the data sets that have ICD diagnosis codes.
proc sql;
select
distinct memname
from
dictionary.columns
where
upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;
Here, I am extracting from the dictionary tables (not to be confused with dictionary.tables
) the data set names in the WORK
library that are diagnosis codes.
Member Name
CC_OT_SVC_10
CC_OT_SVC_11
CC_OT_SVC_12
CC_OT_SVC_13
CC_OT_SVC_14
CC_OT_SVC_15
CC_OT_SVC_16
CC_OT_SVC_17
CC_OT_SVC_18
To extract this information, you have to specify the dictionary table you want to use, along with the types of information you would like to see.
Here, I specify libname = 'WORK'
to tell SAS to look for data sets in the WORK
library. I also use the UPCASE
function to convert the value of libname
to uppercase.
SAS converts certain fields to uppercase in dictionary tables. By specifying
UPCASE
orLOWCASE
, you can avoid this hassle.
Also, I recommend this resource to examine all the dictionary tables. It contains a lot of information about the metadata of a data set.
We can use the INTO:
clause to create a macro to store these data set names instead of creating an HTML listing.
proc sql noprint;
select
distinct memname
into: datasets separated by ' '
from
dictionary.columns
where
upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;
The NOPRINT
option suppresses the HTML listing of the data sets.
If you have more than one value that you’re placing in the macro variable, you need to use the SEPARATED BY
argument and specify a delimiter.
You can examine the macro variable in the log by using the %PUT
macro statement:
%put &datasets.;
Produces this in the log:
CC_OT_SVC_10 CC_OT_SVC_11 CC_OT_SVC_12 CC_OT_SVC_13 CC_OT_SVC_14 CC_OT_SVC_15
You can then iteratively extract each data set from the macro variable and do the necessary processing.
%macro extract_icd ();
%do i = 1 %to %sysfunc(countw(&datasets., " "));
%let dsn = %scan(&datasets., &i., " ");
title "10 observations from &dsn.";
proc sql inobs = 10;
select
*
from
&dsn.
where
icd_dgns_cd_1 = 1 and icd_dgns_cd_2 = 1;
quit;
title;
%end;
%mend extract_icd;
%extract_icd;
See my post regarding Creating Data Driven Macros Using SAS for more details on using
%SYSFUNC(COUNTW())
in your macro processing.
This results in the following (for 2010):
icd_dgns_cd_1 | icd_dgns_cd_2 | year | pt |
---|---|---|---|
1 | 1 | 2010 | 7 |
1 | 1 | 2010 | 18 |
1 | 1 | 2010 | 34 |
1 | 1 | 2010 | 40 |
1 | 1 | 2010 | 42 |
1 | 1 | 2010 | 43 |
1 | 1 | 2010 | 53 |
1 | 1 | 2010 | 54 |
1 | 1 | 2010 | 72 |
1 | 1 | 2010 | 79 |
You can even create some data sets if you want:
proc sql;
create table dsn_and_variables as
select
memname,
name
from
dictionary.columns
where
upcase(libname) = 'WORK' and upcase(name) contains ('ICD_DGNS');
quit;
Which results in this:
memname | name |
---|---|
CC_OT_SVC_10 | icd_dgns_cd_1 |
CC_OT_SVC_10 | icd_dgns_cd_2 |
CC_OT_SVC_11 | icd_dgns_cd_1 |
CC_OT_SVC_11 | icd_dgns_cd_2 |
CC_OT_SVC_12 | icd_dgns_cd_1 |
CC_OT_SVC_12 | icd_dgns_cd_2 |
CC_OT_SVC_13 | icd_dgns_cd_1 |
CC_OT_SVC_13 | icd_dgns_cd_2 |
CC_OT_SVC_14 | icd_dgns_cd_1 |
CC_OT_SVC_14 | icd_dgns_cd_2 |
CC_OT_SVC_15 | icd_dgns_cd_1 |
CC_OT_SVC_15 | icd_dgns_cd_2 |
Program is located here.