How do you aggregate in SAS? You often have to use some PROC’s to get what you need.
What is aggregating?
Aggregating, if you’re not aware, is simply rolling up information from a lot of records into a more concise record.
Think of it like attempting to sum the number of antidepressants dispensed across year-quarter intervals with a PDE file.
Most beneficiaries taking antidepressants will several/hundreds of records per month over a given period. How can we get them down to one record?
Translating from other languages
If you have a Stata background, I’m talking about using the
collapsecommand.
If you have an SPSS background, I’m talking about the
AGGREGATEcommand.
If you have an R background, I’m talking about the
aggregate,dplyr::summarize(), or any other variant of aggregation within R.
Methods of Aggregating
Note: I try to be careful when using the word ‘count’ with aggregating. While counting is a form of aggregation, it can get confusing depending on what research question you are trying to examine. Counting in SAS typically refers to row-by-row processing/indexing, while aggregating usually involves some single measure for a given classification.
Also, I am using a simulated data set of drugs from the DE-SynPUF PDE file. I run two PROC SURVEYSELECT statements with different seeds to get two data sets, and then I pull from each one 1,250 times for a total of 2,500 runs.
I also use Redbook to get the drug names based on the NDC number and remove records that do not have a match in Redbook.
This creates a data set with 12,225,787 records.
Since the size of this data set is too large to post, I will not be able to make this reproducible, but I will make code available that uses data sets in the SASHELP library.
Here are the first ten observations:
| PDE_ID | SRVC_DT | PROD_SRVC_ID | QTY_DSPNSD_NUM | DAYS_SUPLY_NUM | PTNT_PAY_AMT | TOT_RX_CST_AMT | year_qtr | gennme |
|---|---|---|---|---|---|---|---|---|
| 233694491634414 | 20081115 | 00088179730 | 30.000 | 30 | 10.00 | 10.00 | 2008Q4 | Diltiazem Hydrochloride |
| 233524489010534 | 20100420 | 52959054604 | 10.000 | 30 | 0.00 | 10.00 | 2010Q2 | Lamivudine/Zidovudine |
| 233444491537554 | 20081002 | 54569469800 | 30.000 | 30 | 0.00 | 60.00 | 2008Q4 | Valsartan |
| 233244491270323 | 20100806 | 63874022060 | 30.000 | 30 | 10.00 | 10.00 | 2010Q3 | Naloxone Hydrochloride/Pentazocine Hydrochloride |
| 233364489627783 | 20081003 | 53746019405 | 270.000 | 10 | 10.00 | 10.00 | 2008Q4 | Naproxen Sodium |
| 233814492055785 | 20081125 | 00603030521 | 90.000 | 90 | 0.00 | 80.00 | 2008Q4 | Docusate Calcium |
| 233924489706141 | 20080221 | 00349700910 | 30.000 | 10 | 0.00 | 10.00 | 2008Q1 | Probenecid |
| 233404490150323 | 20090822 | 00574006945 | 120.000 | 30 | 30.00 | 10.00 | 2009Q3 | Dextrose |
| 233964488612114 | 20091006 | 55953013270 | 60.000 | 30 | 0.00 | 10.00 | 2009Q4 | Captopril |
| 233964490921001 | 20080531 | 00378054477 | 90.000 | 30 | 0.00 | 20.00 | 2008Q2 | Hydrochlorothiazide/Quinapril Hydrochloride |
PROC SQL
The first and most popular is probably PROC SQL. It’s intuitive, versatile, and efficient with large data sets.
proc sql;
create table aggregate_proc_sql_no_count as
select
year_qtr,
gennme,
sum(qty_dspnsd_num) as total_rx_prescribed
from
base_file
group by
year_qtr, formatted_name;
quit;
And if we wanted to get a value count, we just use the COUNT function.
proc sql;
create table aggregate_proc_sql_w_count as
select
year_qtr, /* #1 */
gennme, /* #2 */
count(gennme) as freq, /* #3 */
sum(qty_dspnsd_num) as total_rx_prescribed /* #3 */
from
want_year_qtr
group by
year_qtr, gennme; /* #4 */
quit;
| year_qtr | gennme | freq | total_rx_prescribed |
|---|---|---|---|
| 2008Q1 | Acetaminophen | 10000 | 528920 |
| 2008Q1 | Acetaminophen/Aspirin/Caffeine | 2451 | 220590 |
| 2008Q1 | Acetaminophen/Butalbital | 2451 | 24510 |
| 2008Q1 | Acetaminophen/Codeine Phosphate | 98 | 3920 |
| 2008Q1 | Acetaminophen/Diphenhydramine Hydrochloride | 2500 | 150000 |
| 2008Q1 | Acetaminophen/Hydrocodone Bitartrate | 12549 | 1337750 |
| 2008Q1 | Acetaminophen/Magnesium Salicylate/Pamabrom | 49 | 2940 |
| 2008Q1 | Acetaminophen/Oxycodone Hydrochloride | 49 | 490 |
| 2008Q1 | Acetaminophen/Propoxyphene Napsylate | 2549 | 34310 |
| 2008Q1 | Acetazolamide | 2549 | 76470 |
- Select the
year_qtrvariable which is the date the drug was dispensed formatted using theYYQ6.format. - Select the
gennmevariable which represents the generic drug name. COUNTthe number of instances of the generic name.- Finally, group the results by
year_qtrandgennme
We get this (first ten observations):
The issue with the PROC SQL is that it is slower than other SAS procedures when aggregating. Odd, right? But here was the amount of time it took to run the above code:
NOTE: PROCEDURE SQL used (Total process time):
real time 5.37 seconds
cpu time 7.76 seconds
PROC MEANS
But most SAS PROC’s require sorting, right?

While you can sort and use the BY statement in PROC MEANS, you can completely avoid the PROC SORT by using the CLASS argument in PROC MEANS. Here’s what I mean:
proc means
data = want_year_qtr
noprint /* #1 */
nway; /* #2 */
class year_qtr formatted_name; /* #3 */
var qty_dspnsd_num; /* #4 */
output out = aggregate_proc_means /* #5 */
sum = / autoname; /* 6 & 7 */
run;
NOPRINTsuppresses the default HTML listing. You want to add this so SAS isn’t trying to produce a report containing all the observations.NWAYtells SAS to aggregate to the maximum number of elements in yourCLASSstatement. This is an oversimplification, but you can read about the formal definition here- You can use a
WAYSstatement instead, but you have to explicitly define how many levels you want the information aggregated to.WAYS0 will produce overall statistics for the data set, whileWAYS1 will produce overall statistics byyear_qtr, and so on.
- You can use a
CLASStells SAS what level of detail you want.- I am requesting SAS to calculate statistics for the
qty_dspnsd_numvariable and that variable only.- You can request additional variables here. If you omit this statement, you are going to be returned summary statistics for all numeric variables in the data set.
- I am outputting a data set to my
WORKlibrary calledaggregate_proc_means SUM =tells SAS that I want to do with the field listed no theVARstatement. You can choose a variety of statistics here - mean, median, etc. The right side of the argument typically specifies what you want to aggregate. I could addsum(qty_dspnsd_num) = sumif I want to explicitly request that it aggregates a given field. Omitting this implies that you want toSUMthe field on theVARstatement. The left side is your summary statistic request, and the right side is the variable name that you want.- The
/ autonametells SAS that you want to access certain options. Here, I am requesting that SAS automatically names the variables once they’re aggregated. This means the summarized output variable will be prefixed by the requested statistic and the variable name (Sum_qty_dspnsd_num).
The CLASS statement essentially, in my eyes, classifies or cases the records to the level of detail that we desire. In other words, we don’t want patient-level aggregations, we want it cased or rolled-up by year_qtr and formatted_name.
Also, it automatically creates two variables on the output data set: _TYPE_ and _FREQ_. _TYPE_ represents the number of classifications. Notice how _FREQ_ gives the same results as the COUNT function from PROC SQL.
You can drop these fields on your
OUT =statement by specifyingoutput out = your_ds_here (drop = _:)
This results in this:
| year_qtr | formatted_name | TYPE | FREQ | QTY_DSPNSD_NUM_Sum |
|---|---|---|---|---|
| 2008Q1 | Acetaminophen | 3 | 10000 | 528920.000 |
| 2008Q1 | Acetaminophen/Aspirin/Caffeine | 3 | 2451 | 220590.000 |
| 2008Q1 | Acetaminophen/Butalbital | 3 | 2451 | 24510.000 |
| 2008Q1 | Acetaminophen/Codeine Phosphate | 3 | 98 | 3920.000 |
| 2008Q1 | Acetaminophen/Diphenhydramine Hydrochloride | 3 | 2500 | 150000.000 |
| 2008Q1 | Acetaminophen/Hydrocodone Bitartrate | 3 | 12549 | 1337750.000 |
| 2008Q1 | Acetaminophen/Magnesium Salicylate/Pamabrom | 3 | 49 | 2940.000 |
| 2008Q1 | Acetaminophen/Oxycodone Hydrochloride | 3 | 49 | 490.000 |
| 2008Q1 | Acetaminophen/Propoxyphene Napsylate | 3 | 2549 | 34310.000 |
| 2008Q1 | Acetazolamide | 3 | 2549 | 76470.000 |
Notice how PROC MEANS retains the original formatting/specificity of the field.
And the amount of time it took on my system (i7-10857H, 32GB RAM, Windows 10):
NOTE: PROCEDURE MEANS used (Total process time):
real time 1.02 seconds
cpu time 4.07 seconds
This is much quicker than PROC SQL, and you don’t have to use PROC SORT!
Summary
Overall, while we do like PROC SQL due to it’s translatability and intuitive syntax, you will see that some base SAS procedures can be significantly faster.
This isn’t always the case, but you sometimes have to play around and see.
Also, I would be remiss if I didn’t mention some other ways of aggregating, namely: PROC UNIVARIATE and PROC SUMMARY.
Additionally, the times that I posted are from a single run. Times can certainly vary across multiple runs, but the times will be normally distributed across hundreds of runs – think Central Limit Theorem!
You can find an example program here. Again, since it isn’t practical to post the entire data set, this will just show you and provide you a way to examine how these PROC’s work.
I encourage you to play around with the CLASS statement to see how you can modify the level of aggregation.