Overview
Here are the general issues:
- The records have a sequenced pattern (A, B, C, etc.), but each individual ID does not have every part of the sequence. In other words, some ID’s have missing observations for parts of the sequence.
- You need to fill any missing pattern in the sequence with the previous record’s value.
Most of the work that I’ve completed here has involved constructing time-series data sets. Accordingly, I’m quite familiar with their structure, and it is pretty natural for this example.
Here’s the example data set:
data have;
infile datalines delimiter = ",";
input year_qtr :yymmn6. state $ num_dspnd;
format year_qtr yymmd7.;
datalines;
201701,FL,24780
201702,FL,29854
201704,FL,24131
201701,IL,28392
201704,IL,16488
201801,FL,23774
201802,FL,23040
201803,FL,20945
201804,FL,19439
201801,IL,15597
201802,IL,17375
201803,IL,17781
201804,IL,17639
201901,FL,15716
201902,FL,15465
201901,IL,17290
201902,IL,17223
201903,IL,17334
201904,IL,16837
202001,FL,13568
202004,FL,14181
202003,IL,16790
202004,IL,9673
;
run;
This dataset represents drugs dispensed in Florida and Illinois by year-qtr
and state
.
Looking closely at the first ten observations, you can see that Florida has missing observations for quarter 2, 3, and 4 during 2017, 2018, and 2019, respectively.
year_qtr | state | num_dspnd |
---|---|---|
2017-01 | FL | 24780 |
2017-03 | FL | 24974 |
2017-04 | FL | 24131 |
2018-01 | FL | 23774 |
2018-02 | FL | 23040 |
2018-04 | FL | 19439 |
2019-01 | FL | 15716 |
2019-02 | FL | 15465 |
2019-03 | FL | 14450 |
2020-01 | FL | 13568 |
Creating the missing records
We essentially need to expand the year_qtr
field such that each state
shows each year and quarter.
One tool that I really like in these kinds of applications is PROC FREQ
. Most of us are familiar with how PROC FREQ
produces frequencies, plots, and cross-tabulations for variables of interest.
PROC FREQ
has an interesting little option on the TABLES
statement that is buried in the documentation – SPARSE
.
The SAS documentation explains it succinctly:
SPARSE
includes all possible combinations of variable levels in theLIST
table andOUT=
data set
And what is LIST
?
LIST
displays two-way to n-way tables in list format
In other words, instead of a 2x2 table (or any other iteration), we get each variable combination specified in our TABLES
statement output as an individual row.
So how do we employ it on our data set?
proc freq data = have noprint;
tables year_qtr * state / list sparse out = have_1 (drop = percent count);
run;
Looking at the first ten observations:
state | state |
---|---|
2017-01 | FL |
2017-01 | IL |
2017-02 | FL |
2017-02 | IL |
2017-03 | FL |
2017-03 | IL |
2017-04 | FL |
2017-04 | IL |
2018-01 | FL |
2018-01 | IL |
Some potentially confusing parts of this code:
NOPRINT
is used to suppress the SAS HTML output from the results window. This is preferential.TABLES year_qtr * state
is asking for a cross-tabulation of theyear_qtr
variable with thestate
variable.LIST
is requesting that we convert the 2x2 table into row-column format. We want each pair ofyear_qtr
andstate
to be output.SPARSE
, as described above, includes all possible combinations ofyear_qtr
andstate
.- An important assumption here is that all your
year_qtr
(or whatever variable you’re requesting) values are represented in at least one pair.
- An important assumption here is that all your
OUT=
is requesting a data set to be created that contains the output fromTABLES year_qtr * state
and theLIST
option.- Finally, I drop variables automatically created with
PROC FREQ
–PERCENT
andCOUNT
.- If you omit this portion, you will see that SAS creates a
COUNT
column with 1’s representing that the combination is present, and 0 representing that the combination was missing. Percent is practically non-sensical.
- If you omit this portion, you will see that SAS creates a
Remember, quarter 2 for Florida was missing, and now it is present!
Getting our final results
Now that we have all possible combinations, we need to MERGE
it back with our original dataset that has the number of drugs dispensed by year_qtr
and state
.
First, since the output for PROC FREQ
doesn’t seem to stay how we want it, we need to sort it before we can merge it. While we could use PROC SQL
to avoid ordering here, we’re going to need to use RETAIN
and BY-GROUP
processing to get our final results.
proc sort data = have_1;
by state year_qtr;
run;
Now we can MERGE
. First, I’m going to show you what the data set looks like after the MERGE
, and then I will show you the full code that includes the final results.
Of note, not specifying an
IN
option with theMERGE
statement implies aFULL JOIN
.
data want;
merge have have_1;
by state year_qtr;
run;
First ten observations:
year_qtr | state | num_dspnd |
---|---|---|
2017-01 | FL | 24780 |
2017-02 | FL | . |
2017-03 | FL | 24974 |
2017-04 | FL | 24131 |
2018-01 | FL | 23774 |
2018-02 | FL | 23040 |
2018-03 | FL | . |
2018-04 | FL | 19439 |
2019-01 | FL | 15716 |
2019-02 | FL | 15465 |
Now our original num_dspnd
values are back, and those year_qtr
values that do not have a match for a given state
have a missing num_dspnd
value.
For those courageous enough to do it in one step:
data want;
merge have have_1;
by state year_qtr;
retain num_dspnd_final;
if first.state then call missing(num_dspnd_final);
if not missing(num_dspnd) then num_dspnd_final = num_dspnd;
run;
Which produces this (first 10 observations):
year_qtr | state | num_dspnd | num_dspnd_final |
---|---|---|---|
2017-01 | FL | 24780 | 24780 |
2017-02 | FL | . | 24780 |
2017-03 | FL | 24974 | 24974 |
2017-04 | FL | 24131 | 24131 |
2018-01 | FL | 23774 | 23774 |
2018-02 | FL | 23040 | 23040 |
2018-03 | FL | . | 23040 |
2018-04 | FL | 19439 | 19439 |
2019-01 | FL | 15716 | 15716 |
2019-02 | FL | 15465 | 15465 |
This takes subtle manipulation of the PDV. See this paper if you’re unfamiliar with SAS’s program data vector. Here’s what happened in the last step:
- We
MERGE
onstate
andyear_qtr
. - We tell SAS that we want to
RETAIN
the value of a variable callednum_dspnd_final
. By usingRETAIN
, we are holding the value ofnum_dspnd_final
in the PDV when additional records are read. - We then instruct SAS that if the variable is the
FIRST
value forstate
, then we want to call ournum_dspnd_final
value missing.- This ensures that we do not retain values across different
state
values. We do not want Illinois toRETAIN
a Florida value.
- This ensures that we do not retain values across different
- Finally, we tell SAS to populate the
num_dspnd_final
field withnum_dspnd
if it is not missing.- If it is missing, nothing will be done, and the prior value will be held in the PDV by the
RETAIN
statement.
- If it is missing, nothing will be done, and the prior value will be held in the PDV by the
Additionally, you can DROP
the num_dspnd
variable, but I just left it for transparency.
This is also applicable if you have a lot of missing records where you need the last non-missing observation’s records carried forward. For example, this data set.
data test;
input id var;
datalines;
1 2
1 .
1 .
1 .
1 4
1 .
1 .
2 3
2 .
2 5
3 1
;
run;
Same code, different variables:
data test_final;
set test;
by id;
retain locf;
if first.id then call missing(locf);
if not missing(var) then locf = var;
run;
id | var | locf |
---|---|---|
1 | 2 | 2 |
1 | . | 2 |
1 | . | 2 |
1 | . | 2 |
1 | 4 | 4 |
1 | . | 4 |
1 | . | 4 |
2 | 3 | 3 |
2 | . | 3 |
2 | 5 | 5 |
3 | 1 | 1 |
Program is here.