How do you aggregate in SAS? You often have to use some PROC
’s to get what you need.
What is aggregating?
Aggregating, if you’re not aware, is simply rolling up information from a lot of records into a more concise record.
Think of it like attempting to sum the number of antidepressants dispensed across year-quarter intervals with a PDE file.
Most beneficiaries taking antidepressants will several/hundreds of records per month over a given period. How can we get them down to one record?
Translating from other languages
If you have a Stata background, I’m talking about using the
collapse
command.
If you have an SPSS background, I’m talking about the
AGGREGATE
command.
If you have an R background, I’m talking about the
aggregate
,dplyr::summarize()
, or any other variant of aggregation within R.
Methods of Aggregating
Note: I try to be careful when using the word ‘count’ with aggregating. While counting is a form of aggregation, it can get confusing depending on what research question you are trying to examine. Counting in SAS typically refers to row-by-row processing/indexing, while aggregating usually involves some single measure for a given classification.
Also, I am using a simulated data set of drugs from the DE-SynPUF PDE file. I run two PROC SURVEYSELECT
statements with different seeds to get two data sets, and then I pull from each one 1,250 times for a total of 2,500 runs.
I also use Redbook to get the drug names based on the NDC number and remove records that do not have a match in Redbook.
This creates a data set with 12,225,787 records.
Since the size of this data set is too large to post, I will not be able to make this reproducible, but I will make code available that uses data sets in the SASHELP
library.
Here are the first ten observations:
PDE_ID | SRVC_DT | PROD_SRVC_ID | QTY_DSPNSD_NUM | DAYS_SUPLY_NUM | PTNT_PAY_AMT | TOT_RX_CST_AMT | year_qtr | gennme |
---|---|---|---|---|---|---|---|---|
233694491634414 | 20081115 | 00088179730 | 30.000 | 30 | 10.00 | 10.00 | 2008Q4 | Diltiazem Hydrochloride |
233524489010534 | 20100420 | 52959054604 | 10.000 | 30 | 0.00 | 10.00 | 2010Q2 | Lamivudine/Zidovudine |
233444491537554 | 20081002 | 54569469800 | 30.000 | 30 | 0.00 | 60.00 | 2008Q4 | Valsartan |
233244491270323 | 20100806 | 63874022060 | 30.000 | 30 | 10.00 | 10.00 | 2010Q3 | Naloxone Hydrochloride/Pentazocine Hydrochloride |
233364489627783 | 20081003 | 53746019405 | 270.000 | 10 | 10.00 | 10.00 | 2008Q4 | Naproxen Sodium |
233814492055785 | 20081125 | 00603030521 | 90.000 | 90 | 0.00 | 80.00 | 2008Q4 | Docusate Calcium |
233924489706141 | 20080221 | 00349700910 | 30.000 | 10 | 0.00 | 10.00 | 2008Q1 | Probenecid |
233404490150323 | 20090822 | 00574006945 | 120.000 | 30 | 30.00 | 10.00 | 2009Q3 | Dextrose |
233964488612114 | 20091006 | 55953013270 | 60.000 | 30 | 0.00 | 10.00 | 2009Q4 | Captopril |
233964490921001 | 20080531 | 00378054477 | 90.000 | 30 | 0.00 | 20.00 | 2008Q2 | Hydrochlorothiazide/Quinapril Hydrochloride |
PROC SQL
The first and most popular is probably PROC SQL
. It’s intuitive, versatile, and efficient with large data sets.
proc sql;
create table aggregate_proc_sql_no_count as
select
year_qtr,
gennme,
sum(qty_dspnsd_num) as total_rx_prescribed
from
base_file
group by
year_qtr, formatted_name;
quit;
And if we wanted to get a value count, we just use the COUNT
function.
proc sql;
create table aggregate_proc_sql_w_count as
select
year_qtr, /* #1 */
gennme, /* #2 */
count(gennme) as freq, /* #3 */
sum(qty_dspnsd_num) as total_rx_prescribed /* #3 */
from
want_year_qtr
group by
year_qtr, gennme; /* #4 */
quit;
year_qtr | gennme | freq | total_rx_prescribed |
---|---|---|---|
2008Q1 | Acetaminophen | 10000 | 528920 |
2008Q1 | Acetaminophen/Aspirin/Caffeine | 2451 | 220590 |
2008Q1 | Acetaminophen/Butalbital | 2451 | 24510 |
2008Q1 | Acetaminophen/Codeine Phosphate | 98 | 3920 |
2008Q1 | Acetaminophen/Diphenhydramine Hydrochloride | 2500 | 150000 |
2008Q1 | Acetaminophen/Hydrocodone Bitartrate | 12549 | 1337750 |
2008Q1 | Acetaminophen/Magnesium Salicylate/Pamabrom | 49 | 2940 |
2008Q1 | Acetaminophen/Oxycodone Hydrochloride | 49 | 490 |
2008Q1 | Acetaminophen/Propoxyphene Napsylate | 2549 | 34310 |
2008Q1 | Acetazolamide | 2549 | 76470 |
- Select the
year_qtr
variable which is the date the drug was dispensed formatted using theYYQ6.
format. - Select the
gennme
variable which represents the generic drug name. COUNT
the number of instances of the generic name.- Finally, group the results by
year_qtr
andgennme
We get this (first ten observations):
The issue with the PROC SQL
is that it is slower than other SAS procedures when aggregating. Odd, right? But here was the amount of time it took to run the above code:
NOTE: PROCEDURE SQL used (Total process time):
real time 5.37 seconds
cpu time 7.76 seconds
PROC MEANS
But most SAS PROC’s require sorting, right?
While you can sort and use the BY
statement in PROC MEANS
, you can completely avoid the PROC SORT
by using the CLASS
argument in PROC MEANS
. Here’s what I mean:
proc means
data = want_year_qtr
noprint /* #1 */
nway; /* #2 */
class year_qtr formatted_name; /* #3 */
var qty_dspnsd_num; /* #4 */
output out = aggregate_proc_means /* #5 */
sum = / autoname; /* 6 & 7 */
run;
NOPRINT
suppresses the default HTML listing. You want to add this so SAS isn’t trying to produce a report containing all the observations.NWAY
tells SAS to aggregate to the maximum number of elements in yourCLASS
statement. This is an oversimplification, but you can read about the formal definition here- You can use a
WAYS
statement instead, but you have to explicitly define how many levels you want the information aggregated to.WAYS
0 will produce overall statistics for the data set, whileWAYS
1 will produce overall statistics byyear_qtr
, and so on.
- You can use a
CLASS
tells SAS what level of detail you want.- I am requesting SAS to calculate statistics for the
qty_dspnsd_num
variable and that variable only.- You can request additional variables here. If you omit this statement, you are going to be returned summary statistics for all numeric variables in the data set.
- I am outputting a data set to my
WORK
library calledaggregate_proc_means
SUM =
tells SAS that I want to do with the field listed no theVAR
statement. You can choose a variety of statistics here - mean, median, etc. The right side of the argument typically specifies what you want to aggregate. I could addsum(qty_dspnsd_num) = sum
if I want to explicitly request that it aggregates a given field. Omitting this implies that you want toSUM
the field on theVAR
statement. The left side is your summary statistic request, and the right side is the variable name that you want.- The
/ autoname
tells SAS that you want to access certain options. Here, I am requesting that SAS automatically names the variables once they’re aggregated. This means the summarized output variable will be prefixed by the requested statistic and the variable name (Sum_qty_dspnsd_num
).
The CLASS
statement essentially, in my eyes, classifies or cases the records to the level of detail that we desire. In other words, we don’t want patient-level aggregations, we want it cased or rolled-up by year_qtr
and formatted_name
.
Also, it automatically creates two variables on the output data set: _TYPE_
and _FREQ_
. _TYPE_
represents the number of classifications. Notice how _FREQ_
gives the same results as the COUNT
function from PROC SQL
.
You can drop these fields on your
OUT =
statement by specifyingoutput out = your_ds_here (drop = _:)
This results in this:
year_qtr | formatted_name | TYPE | FREQ | QTY_DSPNSD_NUM_Sum |
---|---|---|---|---|
2008Q1 | Acetaminophen | 3 | 10000 | 528920.000 |
2008Q1 | Acetaminophen/Aspirin/Caffeine | 3 | 2451 | 220590.000 |
2008Q1 | Acetaminophen/Butalbital | 3 | 2451 | 24510.000 |
2008Q1 | Acetaminophen/Codeine Phosphate | 3 | 98 | 3920.000 |
2008Q1 | Acetaminophen/Diphenhydramine Hydrochloride | 3 | 2500 | 150000.000 |
2008Q1 | Acetaminophen/Hydrocodone Bitartrate | 3 | 12549 | 1337750.000 |
2008Q1 | Acetaminophen/Magnesium Salicylate/Pamabrom | 3 | 49 | 2940.000 |
2008Q1 | Acetaminophen/Oxycodone Hydrochloride | 3 | 49 | 490.000 |
2008Q1 | Acetaminophen/Propoxyphene Napsylate | 3 | 2549 | 34310.000 |
2008Q1 | Acetazolamide | 3 | 2549 | 76470.000 |
Notice how PROC MEANS
retains the original formatting/specificity of the field.
And the amount of time it took on my system (i7-10857H, 32GB RAM, Windows 10):
NOTE: PROCEDURE MEANS used (Total process time):
real time 1.02 seconds
cpu time 4.07 seconds
This is much quicker than PROC SQL
, and you don’t have to use PROC SORT
!
Summary
Overall, while we do like PROC SQL
due to it’s translatability and intuitive syntax, you will see that some base SAS procedures can be significantly faster.
This isn’t always the case, but you sometimes have to play around and see.
Also, I would be remiss if I didn’t mention some other ways of aggregating, namely: PROC UNIVARIATE
and PROC SUMMARY
.
Additionally, the times that I posted are from a single run. Times can certainly vary across multiple runs, but the times will be normally distributed across hundreds of runs – think Central Limit Theorem!
You can find an example program here. Again, since it isn’t practical to post the entire data set, this will just show you and provide you a way to examine how these PROC
’s work.
I encourage you to play around with the CLASS
statement to see how you can modify the level of aggregation.