Overview

Here are the general issues:

  1. The records have a sequenced pattern (A, B, C, etc.), but each individual ID does not have every part of the sequence. In other words, some ID’s have missing observations for parts of the sequence.
  2. You need to fill any missing pattern in the sequence with the previous record’s value.

Most of the work that I’ve completed here has involved constructing time-series data sets. Accordingly, I’m quite familiar with their structure, and it is pretty natural for this example.

Here’s the example data set:

data have;
infile datalines delimiter = ",";
input year_qtr :yymmn6. state $ num_dspnd;
format year_qtr yymmd7.;
datalines;
201701,FL,24780
201702,FL,29854
201704,FL,24131
201701,IL,28392
201704,IL,16488
201801,FL,23774
201802,FL,23040
201803,FL,20945
201804,FL,19439
201801,IL,15597
201802,IL,17375
201803,IL,17781
201804,IL,17639
201901,FL,15716
201902,FL,15465
201901,IL,17290
201902,IL,17223
201903,IL,17334
201904,IL,16837
202001,FL,13568
202004,FL,14181
202003,IL,16790
202004,IL,9673
;
run;

This dataset represents drugs dispensed in Florida and Illinois by year-qtr and state.

Looking closely at the first ten observations, you can see that Florida has missing observations for quarter 2, 3, and 4 during 2017, 2018, and 2019, respectively.

year_qtr state num_dspnd
2017-01 FL 24780
2017-03 FL 24974
2017-04 FL 24131
2018-01 FL 23774
2018-02 FL 23040
2018-04 FL 19439
2019-01 FL 15716
2019-02 FL 15465
2019-03 FL 14450
2020-01 FL 13568

Creating the missing records

We essentially need to expand the year_qtr field such that each state shows each year and quarter.

One tool that I really like in these kinds of applications is PROC FREQ. Most of us are familiar with how PROC FREQ produces frequencies, plots, and cross-tabulations for variables of interest.

PROC FREQ has an interesting little option on the TABLES statement that is buried in the documentation – SPARSE.

The SAS documentation explains it succinctly:

SPARSE includes all possible combinations of variable levels in the LIST table and OUT= data set

And what is LIST?

LIST displays two-way to n-way tables in list format

In other words, instead of a 2x2 table (or any other iteration), we get each variable combination specified in our TABLES statement output as an individual row.

So how do we employ it on our data set?

proc freq data = have noprint;
	tables year_qtr * state / list sparse out = have_1 (drop = percent count);
run;

Looking at the first ten observations:

state state
2017-01 FL
2017-01 IL
2017-02 FL
2017-02 IL
2017-03 FL
2017-03 IL
2017-04 FL
2017-04 IL
2018-01 FL
2018-01 IL

Some potentially confusing parts of this code:

  1. NOPRINT is used to suppress the SAS HTML output from the results window. This is preferential.
  2. TABLES year_qtr * state is asking for a cross-tabulation of the year_qtr variable with the state variable.
  3. LIST is requesting that we convert the 2x2 table into row-column format. We want each pair of year_qtr and state to be output.
  4. SPARSE, as described above, includes all possible combinations of year_qtr and state.
    • An important assumption here is that all your year_qtr (or whatever variable you’re requesting) values are represented in at least one pair.
  5. OUT= is requesting a data set to be created that contains the output from TABLES year_qtr * state and the LIST option.
  6. Finally, I drop variables automatically created with PROC FREQPERCENT and COUNT.
    • If you omit this portion, you will see that SAS creates a COUNT column with 1’s representing that the combination is present, and 0 representing that the combination was missing. Percent is practically non-sensical.

Remember, quarter 2 for Florida was missing, and now it is present!

Getting our final results

Now that we have all possible combinations, we need to MERGE it back with our original dataset that has the number of drugs dispensed by year_qtr and state.

First, since the output for PROC FREQ doesn’t seem to stay how we want it, we need to sort it before we can merge it. While we could use PROC SQL to avoid ordering here, we’re going to need to use RETAIN and BY-GROUP processing to get our final results.

proc sort data = have_1;
    by state year_qtr;
run;

Now we can MERGE. First, I’m going to show you what the data set looks like after the MERGE, and then I will show you the full code that includes the final results.

Of note, not specifying an IN option with the MERGE statement implies a FULL JOIN.

data want;
	merge have have_1;
	by state year_qtr;
run;

First ten observations:

year_qtr state num_dspnd
2017-01 FL 24780
2017-02 FL .
2017-03 FL 24974
2017-04 FL 24131
2018-01 FL 23774
2018-02 FL 23040
2018-03 FL .
2018-04 FL 19439
2019-01 FL 15716
2019-02 FL 15465

Now our original num_dspnd values are back, and those year_qtr values that do not have a match for a given state have a missing num_dspnd value.

For those courageous enough to do it in one step:

data want;
	merge have have_1;
	by state year_qtr;
	retain num_dspnd_final;
		if first.state then call missing(num_dspnd_final);
		if not missing(num_dspnd) then num_dspnd_final = num_dspnd;
run;

Which produces this (first 10 observations):

year_qtr state num_dspnd num_dspnd_final
2017-01 FL 24780 24780
2017-02 FL . 24780
2017-03 FL 24974 24974
2017-04 FL 24131 24131
2018-01 FL 23774 23774
2018-02 FL 23040 23040
2018-03 FL . 23040
2018-04 FL 19439 19439
2019-01 FL 15716 15716
2019-02 FL 15465 15465

This takes subtle manipulation of the PDV. See this paper if you’re unfamiliar with SAS’s program data vector. Here’s what happened in the last step:

  1. We MERGE on state and year_qtr.
  2. We tell SAS that we want to RETAIN the value of a variable called num_dspnd_final. By using RETAIN, we are holding the value of num_dspnd_final in the PDV when additional records are read.
  3. We then instruct SAS that if the variable is the FIRST value for state, then we want to call our num_dspnd_final value missing.
    • This ensures that we do not retain values across different state values. We do not want Illinois to RETAIN a Florida value.
  4. Finally, we tell SAS to populate the num_dspnd_final field with num_dspnd if it is not missing.
    • If it is missing, nothing will be done, and the prior value will be held in the PDV by the RETAIN statement.

Additionally, you can DROP the num_dspnd variable, but I just left it for transparency.

This is also applicable if you have a lot of missing records where you need the last non-missing observation’s records carried forward. For example, this data set.

data test;
input id var;
datalines;
1 2
1 .
1 .
1 .
1 4
1 .
1 .
2 3
2 .
2 5
3 1
;
run;

Same code, different variables:

data test_final;
	set test;
	by id;
	retain locf;
		if first.id then call missing(locf);
		if not missing(var) then locf = var;
run;
id var locf
1 2 2
1 . 2
1 . 2
1 . 2
1 4 4
1 . 4
1 . 4
2 3 3
2 . 3
2 5 5
3 1 1

Alt Text

Program is here.