We often need to calculate how many days have passed from an index date or prior date. This is a common procedure required with any data containing dates, and it is a frequently asked question on the SAS forums.
First! Do you not have a local SAS license? No problem! SAS is freely available here. You will need to create an account.
Back on track–one way to accomplish this is through the LAG
function in SAS. LAG
grabs the previous record’s value and stores it in a new field.
Let’s first create a fake data set that mimics medical claim records. Do not be afraid to run the code below.
/* The DATA step below creates patient ID's and dates. */
data have;
do patid = 1 to 10000;
do date = "01JAN2000"d to "31DEC2020"d;
output;
end;
end;
format date mmddyy10.;
run;
/* The PROC below selects a sample of 5000 people and outputs a data set called have_sampled. */
/* Note the seed option is set to 1225 (numbers of UF COP address). This will create a consistent data set. */
proc surveyselect data = have
out = have_sampled
seed = 1225
sampsize = 5000;
run;
To show what went on, let’s run a PROC PRINT
.
/* NOOBS suppresses default printing of OBS column. */
proc print data = have_sampled (obs = 10) noobs;
run;
/* Data below */
patid date
3 08/24/2012
6 04/18/2020
7 12/06/2014
7 02/16/2017
8 01/18/2003
9 04/13/2004
10 01/30/2010
11 03/13/2000
14 05/30/2012
15 06/22/2011
You can see that patient 7 has two records. This is an instance where we would want to calculate how many days there are between two dates for patient 7.
Before you can do that, though, your data need to be sorted by ‘patid’ and ‘date’ in ascending order to get the earliest record on top for a given patient.
Use PROC SORT
to sort your data.
proc sort data = have_sampled;
by patid date;
run;
Now that our data are sorted, we need to invoke by-group processing to make SAS recognize that we want our data cased by ‘patid’.
By-group processing is invoked via the BY
statement. A corollary of the BY
statement is the FIRST.
and LAST.
statements which SAS internally marks as the first and last records for a given patient. I use it below to clear out the lagged date value from other records.
NOTE Even though you have invoked by-group processing, it doesn’t necessarily mean that SAS automatically recognizes that previous values could be for another patient.
Accordingly, you must invoke the MISSING
function to clear these result if it’s the first record for a given patient.
data have_sampled_days_diff;
set have_sampled;
format lag_date mmddyy10.;
by patid; /* Here is the BY statement - it tells SAS we want results cased by 'patid'. */
lag_date = lag(date); /* Invoking the LAG function to take previous record REGARDLESS if it's the same patient. */
if first.patid then call missing(lag_date); /* If it's the first record for a given patient , clear out the value for 'lag_date'--they cannot have a lag value if it's the first record. */
if not missing(lag_date) then do; /* This IF-THEN-DO prevents a NOTE in the log showing that missing values were generated. */
difference = intck("days", lag_date, date, "c"); /* The INTCK function can compute days, years, etc. You first call the time-unit you want, followed by the earlier date or lagged date, and then the current date value. The "c" option requests SAS to compute continuous days to get the complete number of days between the two dates. */
end; /* You MUST end the IF-THEN-DO statement. */
run;
Again, let’s run a small PROC PRINT
to see exactly what happened.
proc print data = have_sampled_days_diff (obs = 10) noobs;
run;
patid date lag_date difference
3 08/24/2012 . .
6 04/18/2020 . .
7 12/06/2014 . .
7 02/16/2017 12/06/2014 803
8 01/18/2003 . .
9 04/13/2004 . .
10 01/30/2010 . .
11 03/13/2000 . .
14 05/30/2012 . .
15 06/22/2011 . .
We see that patient 7 had 803 days between their visit in 2014 and their visit in 2017. What about if a patient has multiple records?
I’m going to write a quick PROC SQL
query since it’s relatively easy to get this information. You don’t need to know anything about PROC SQL
, this just pulls individuals with more than two records.
/* This pulls all records for a given patient where they have more than two records. */
proc sql outobs = 19;
select *
from have_sampled_days_diff
group by patid
having count(patid) > 2
order by patid, date;
quit;
/* The data */
patid date lag_date difference
158 02/20/2003 . .
158 03/03/2015 02/20/2003 4394
158 11/06/2020 03/03/2015 2075
178 10/07/2007 . .
178 03/08/2008 10/07/2007 153
178 04/28/2010 03/08/2008 781
325 02/16/2000 . .
325 01/09/2001 02/16/2000 328
325 03/03/2009 01/09/2001 2975
385 02/04/2006 . .
385 08/18/2016 02/04/2006 3848
385 01/19/2018 08/18/2016 519
501 03/14/2017 . .
501 03/25/2018 03/14/2017 376
501 01/23/2019 03/25/2018 304
504 06/22/2011 . .
504 03/02/2012 06/22/2011 254
504 06/14/2014 03/02/2012 834
504 12/09/2015 06/14/2014 543
As you can see, we are consistently calculating the days difference between the current record and the prior record for a given patient. Adding one to the total is common here, but should be discussed or determined on a case-by-case basis.
TLDR:
- Sort your data by ID and your date.
- Create a new data set in a
DATA
step.SET
the sorted data set. - Use
BY
statement to case on ‘patid’. - Use
LAG
to get previous value. This gets the last record no matter the ‘patid’! - Use
FIRST
andMISSING
to clear out the value obtained from the previous step for the patient’s first record. - Use
IF-THEN
processing to minimize notes in the log and only compute the days difference metric on those records that have values. - Use
INTCK
to calculate the number of days between the patient’s current record’s date and the last date. - Make your decision as to what you need to do!
Also, here are some additional resources that may be helpful if you want to truly understand what is going on underneath the hood.
- The program data vector (PDV)
- By-group processing
- Fun little blog on INTCK and its counterpart, INTNX.
- Some alright SAS documentation on
LAG