Overview
Here are the general issues:
- The records have a sequenced pattern (A, B, C, etc.), but each individual ID does not have every part of the sequence. In other words, some ID’s have missing observations for parts of the sequence.
- You need to fill any missing pattern in the sequence with the previous record’s value.
Most of the work that I’ve completed here has involved constructing time-series data sets. Accordingly, I’m quite familiar with their structure, and it is pretty natural for this example.
Here’s the example data set:
data have;
infile datalines delimiter = ",";
input year_qtr :yymmn6. state $ num_dspnd;
format year_qtr yymmd7.;
datalines;
201701,FL,24780
201702,FL,29854
201704,FL,24131
201701,IL,28392
201704,IL,16488
201801,FL,23774
201802,FL,23040
201803,FL,20945
201804,FL,19439
201801,IL,15597
201802,IL,17375
201803,IL,17781
201804,IL,17639
201901,FL,15716
201902,FL,15465
201901,IL,17290
201902,IL,17223
201903,IL,17334
201904,IL,16837
202001,FL,13568
202004,FL,14181
202003,IL,16790
202004,IL,9673
;
run;
This dataset represents drugs dispensed in Florida and Illinois by year-qtr and state.
Looking closely at the first ten observations, you can see that Florida has missing observations for quarter 2, 3, and 4 during 2017, 2018, and 2019, respectively.
| year_qtr | state | num_dspnd |
|---|---|---|
| 2017-01 | FL | 24780 |
| 2017-03 | FL | 24974 |
| 2017-04 | FL | 24131 |
| 2018-01 | FL | 23774 |
| 2018-02 | FL | 23040 |
| 2018-04 | FL | 19439 |
| 2019-01 | FL | 15716 |
| 2019-02 | FL | 15465 |
| 2019-03 | FL | 14450 |
| 2020-01 | FL | 13568 |
Creating the missing records
We essentially need to expand the year_qtr field such that each state shows each year and quarter.
One tool that I really like in these kinds of applications is PROC FREQ. Most of us are familiar with how PROC FREQ produces frequencies, plots, and cross-tabulations for variables of interest.
PROC FREQ has an interesting little option on the TABLES statement that is buried in the documentation – SPARSE.
The SAS documentation explains it succinctly:
SPARSEincludes all possible combinations of variable levels in theLISTtable andOUT=data set
And what is LIST?
LISTdisplays two-way to n-way tables in list format
In other words, instead of a 2x2 table (or any other iteration), we get each variable combination specified in our TABLES statement output as an individual row.
So how do we employ it on our data set?
proc freq data = have noprint;
tables year_qtr * state / list sparse out = have_1 (drop = percent count);
run;
Looking at the first ten observations:
| state | state |
|---|---|
| 2017-01 | FL |
| 2017-01 | IL |
| 2017-02 | FL |
| 2017-02 | IL |
| 2017-03 | FL |
| 2017-03 | IL |
| 2017-04 | FL |
| 2017-04 | IL |
| 2018-01 | FL |
| 2018-01 | IL |
Some potentially confusing parts of this code:
NOPRINTis used to suppress the SAS HTML output from the results window. This is preferential.TABLES year_qtr * stateis asking for a cross-tabulation of theyear_qtrvariable with thestatevariable.LISTis requesting that we convert the 2x2 table into row-column format. We want each pair ofyear_qtrandstateto be output.SPARSE, as described above, includes all possible combinations ofyear_qtrandstate.- An important assumption here is that all your
year_qtr(or whatever variable you’re requesting) values are represented in at least one pair.
- An important assumption here is that all your
OUT=is requesting a data set to be created that contains the output fromTABLES year_qtr * stateand theLISToption.- Finally, I drop variables automatically created with
PROC FREQ–PERCENTandCOUNT.- If you omit this portion, you will see that SAS creates a
COUNTcolumn with 1’s representing that the combination is present, and 0 representing that the combination was missing. Percent is practically non-sensical.
- If you omit this portion, you will see that SAS creates a
Remember, quarter 2 for Florida was missing, and now it is present!
Getting our final results
Now that we have all possible combinations, we need to MERGE it back with our original dataset that has the number of drugs dispensed by year_qtr and state.
First, since the output for PROC FREQ doesn’t seem to stay how we want it, we need to sort it before we can merge it. While we could use PROC SQL to avoid ordering here, we’re going to need to use RETAIN and BY-GROUP processing to get our final results.
proc sort data = have_1;
by state year_qtr;
run;
Now we can MERGE. First, I’m going to show you what the data set looks like after the MERGE, and then I will show you the full code that includes the final results.
Of note, not specifying an
INoption with theMERGEstatement implies aFULL JOIN.
data want;
merge have have_1;
by state year_qtr;
run;
First ten observations:
| year_qtr | state | num_dspnd |
|---|---|---|
| 2017-01 | FL | 24780 |
| 2017-02 | FL | . |
| 2017-03 | FL | 24974 |
| 2017-04 | FL | 24131 |
| 2018-01 | FL | 23774 |
| 2018-02 | FL | 23040 |
| 2018-03 | FL | . |
| 2018-04 | FL | 19439 |
| 2019-01 | FL | 15716 |
| 2019-02 | FL | 15465 |
Now our original num_dspnd values are back, and those year_qtr values that do not have a match for a given state have a missing num_dspnd value.
For those courageous enough to do it in one step:
data want;
merge have have_1;
by state year_qtr;
retain num_dspnd_final;
if first.state then call missing(num_dspnd_final);
if not missing(num_dspnd) then num_dspnd_final = num_dspnd;
run;
Which produces this (first 10 observations):
| year_qtr | state | num_dspnd | num_dspnd_final |
|---|---|---|---|
| 2017-01 | FL | 24780 | 24780 |
| 2017-02 | FL | . | 24780 |
| 2017-03 | FL | 24974 | 24974 |
| 2017-04 | FL | 24131 | 24131 |
| 2018-01 | FL | 23774 | 23774 |
| 2018-02 | FL | 23040 | 23040 |
| 2018-03 | FL | . | 23040 |
| 2018-04 | FL | 19439 | 19439 |
| 2019-01 | FL | 15716 | 15716 |
| 2019-02 | FL | 15465 | 15465 |
This takes subtle manipulation of the PDV. See this paper if you’re unfamiliar with SAS’s program data vector. Here’s what happened in the last step:
- We
MERGEonstateandyear_qtr. - We tell SAS that we want to
RETAINthe value of a variable callednum_dspnd_final. By usingRETAIN, we are holding the value ofnum_dspnd_finalin the PDV when additional records are read. - We then instruct SAS that if the variable is the
FIRSTvalue forstate, then we want to call ournum_dspnd_finalvalue missing.- This ensures that we do not retain values across different
statevalues. We do not want Illinois toRETAINa Florida value.
- This ensures that we do not retain values across different
- Finally, we tell SAS to populate the
num_dspnd_finalfield withnum_dspndif it is not missing.- If it is missing, nothing will be done, and the prior value will be held in the PDV by the
RETAINstatement.
- If it is missing, nothing will be done, and the prior value will be held in the PDV by the
Additionally, you can DROP the num_dspnd variable, but I just left it for transparency.
This is also applicable if you have a lot of missing records where you need the last non-missing observation’s records carried forward. For example, this data set.
data test;
input id var;
datalines;
1 2
1 .
1 .
1 .
1 4
1 .
1 .
2 3
2 .
2 5
3 1
;
run;
Same code, different variables:
data test_final;
set test;
by id;
retain locf;
if first.id then call missing(locf);
if not missing(var) then locf = var;
run;
| id | var | locf |
|---|---|---|
| 1 | 2 | 2 |
| 1 | . | 2 |
| 1 | . | 2 |
| 1 | . | 2 |
| 1 | 4 | 4 |
| 1 | . | 4 |
| 1 | . | 4 |
| 2 | 3 | 3 |
| 2 | . | 3 |
| 2 | 5 | 5 |
| 3 | 1 | 1 |

Program is here.