Aggregating in SAS - Efficiency of two approaches

How do you aggregate in SAS? You often have to use some PROC’s to get what you need.

What is aggregating?

Aggregating, if you’re not aware, is simply rolling up information from a lot of records into a more concise record.

Think of it like attempting to sum the number of antidepressants dispensed across year-quarter intervals with a PDE file.

Most beneficiaries taking antidepressants will several/hundreds of records per month over a given period. How can we get them down to one record?

Translating from other languages

If you have a Stata background, I’m talking about using the collapse command.

If you have an SPSS background, I’m talking about the AGGREGATE command.

If you have an R background, I’m talking about the aggregate, dplyr::summarize(), or any other variant of aggregation within R.

Methods of Aggregating

Note: I try to be careful when using the word ‘count’ with aggregating. While counting is a form of aggregation, it can get confusing depending on what research question you are trying to examine. Counting in SAS typically refers to row-by-row processing/indexing, while aggregating usually involves some single measure for a given classification.

Also, I am using a simulated data set of drugs from the DE-SynPUF PDE file. I run two PROC SURVEYSELECT statements with different seeds to get two data sets, and then I pull from each one 1,250 times for a total of 2,500 runs.

I also use Redbook to get the drug names based on the NDC number and remove records that do not have a match in Redbook.

This creates a data set with 12,225,787 records.

Since the size of this data set is too large to post, I will not be able to make this reproducible, but I will make code available that uses data sets in the SASHELP library.

Here are the first ten observations:

PDE_ID	SRVC_DT	PROD_SRVC_ID	QTY_DSPNSD_NUM	DAYS_SUPLY_NUM	PTNT_PAY_AMT	TOT_RX_CST_AMT	year_qtr	gennme
233694491634414	20081115	00088179730	30.000	30	10.00	10.00	2008Q4	Diltiazem Hydrochloride
233524489010534	20100420	52959054604	10.000	30	0.00	10.00	2010Q2	Lamivudine/Zidovudine
233444491537554	20081002	54569469800	30.000	30	0.00	60.00	2008Q4	Valsartan
233244491270323	20100806	63874022060	30.000	30	10.00	10.00	2010Q3	Naloxone Hydrochloride/Pentazocine Hydrochloride
233364489627783	20081003	53746019405	270.000	10	10.00	10.00	2008Q4	Naproxen Sodium
233814492055785	20081125	00603030521	90.000	90	0.00	80.00	2008Q4	Docusate Calcium
233924489706141	20080221	00349700910	30.000	10	0.00	10.00	2008Q1	Probenecid
233404490150323	20090822	00574006945	120.000	30	30.00	10.00	2009Q3	Dextrose
233964488612114	20091006	55953013270	60.000	30	0.00	10.00	2009Q4	Captopril
233964490921001	20080531	00378054477	90.000	30	0.00	20.00	2008Q2	Hydrochlorothiazide/Quinapril Hydrochloride

`PROC SQL`

The first and most popular is probably PROC SQL. It’s intuitive, versatile, and efficient with large data sets.

proc sql;
	create table			aggregate_proc_sql_no_count as
		select
					year_qtr,
					gennme,		
					sum(qty_dspnsd_num) as total_rx_prescribed
		from
					base_file
		group by
					year_qtr, formatted_name;
quit;

And if we wanted to get a value count, we just use the COUNT function.

proc sql;
	create table			aggregate_proc_sql_w_count as
		select
					year_qtr, /* #1 */
					gennme, /* #2 */
                    			count(gennme) as freq, /* #3 */
					sum(qty_dspnsd_num) as total_rx_prescribed /* #3 */
		from
					want_year_qtr
		group by
					year_qtr, gennme; /* #4 */
quit;

year_qtr	gennme	freq	total_rx_prescribed
2008Q1	Acetaminophen	10000	528920
2008Q1	Acetaminophen/Aspirin/Caffeine	2451	220590
2008Q1	Acetaminophen/Butalbital	2451	24510
2008Q1	Acetaminophen/Codeine Phosphate	98	3920
2008Q1	Acetaminophen/Diphenhydramine Hydrochloride	2500	150000
2008Q1	Acetaminophen/Hydrocodone Bitartrate	12549	1337750
2008Q1	Acetaminophen/Magnesium Salicylate/Pamabrom	49	2940
2008Q1	Acetaminophen/Oxycodone Hydrochloride	49	490
2008Q1	Acetaminophen/Propoxyphene Napsylate	2549	34310
2008Q1	Acetazolamide	2549	76470

Select the year_qtr variable which is the date the drug was dispensed formatted using the YYQ6. format.
Select the gennme variable which represents the generic drug name.
COUNT the number of instances of the generic name.
Finally, group the results by year_qtr and gennme

We get this (first ten observations):

The issue with the PROC SQL is that it is slower than other SAS procedures when aggregating. Odd, right? But here was the amount of time it took to run the above code:

NOTE: PROCEDURE SQL used (Total process time):
      real time           5.37 seconds
      cpu time            7.76 seconds

`PROC MEANS`

But most SAS PROC’s require sorting, right?

Alt Text

While you can sort and use the BY statement in PROC MEANS, you can completely avoid the PROC SORT by using the CLASS argument in PROC MEANS. Here’s what I mean:

proc means 
	data = want_year_qtr
	noprint /* #1 */
    	nway; /* #2 */
		class year_qtr formatted_name; /* #3 */
		var qty_dspnsd_num; /* #4 */
	    	output out = aggregate_proc_means /* #5 */
	    		sum = / autoname; /* 6 & 7 */
run;

NOPRINT suppresses the default HTML listing. You want to add this so SAS isn’t trying to produce a report containing all the observations.
NWAY tells SAS to aggregate to the maximum number of elements in your CLASS statement. This is an oversimplification, but you can read about the formal definition here
- You can use a WAYS statement instead, but you have to explicitly define how many levels you want the information aggregated to. WAYS 0 will produce overall statistics for the data set, while WAYS 1 will produce overall statistics by year_qtr, and so on.
CLASS tells SAS what level of detail you want.
I am requesting SAS to calculate statistics for the qty_dspnsd_num variable and that variable only.
- You can request additional variables here. If you omit this statement, you are going to be returned summary statistics for all numeric variables in the data set.
I am outputting a data set to my WORK library called aggregate_proc_means
SUM = tells SAS that I want to do with the field listed no the VAR statement. You can choose a variety of statistics here - mean, median, etc. The right side of the argument typically specifies what you want to aggregate. I could add sum(qty_dspnsd_num) = sum if I want to explicitly request that it aggregates a given field. Omitting this implies that you want to SUM the field on the VAR statement. The left side is your summary statistic request, and the right side is the variable name that you want.
The / autoname tells SAS that you want to access certain options. Here, I am requesting that SAS automatically names the variables once they’re aggregated. This means the summarized output variable will be prefixed by the requested statistic and the variable name (Sum_qty_dspnsd_num).

The CLASS statement essentially, in my eyes, classifies or cases the records to the level of detail that we desire. In other words, we don’t want patient-level aggregations, we want it cased or rolled-up by year_qtr and formatted_name.

Also, it automatically creates two variables on the output data set: _TYPE_ and _FREQ_. _TYPE_ represents the number of classifications. Notice how _FREQ_ gives the same results as the COUNT function from PROC SQL.

You can drop these fields on your OUT = statement by specifying output out = your_ds_here (drop = _:)

This results in this:

year_qtr	formatted_name	TYPE	FREQ	QTY_DSPNSD_NUM_Sum
2008Q1	Acetaminophen	3	10000	528920.000
2008Q1	Acetaminophen/Aspirin/Caffeine	3	2451	220590.000
2008Q1	Acetaminophen/Butalbital	3	2451	24510.000
2008Q1	Acetaminophen/Codeine Phosphate	3	98	3920.000
2008Q1	Acetaminophen/Diphenhydramine Hydrochloride	3	2500	150000.000
2008Q1	Acetaminophen/Hydrocodone Bitartrate	3	12549	1337750.000
2008Q1	Acetaminophen/Magnesium Salicylate/Pamabrom	3	49	2940.000
2008Q1	Acetaminophen/Oxycodone Hydrochloride	3	49	490.000
2008Q1	Acetaminophen/Propoxyphene Napsylate	3	2549	34310.000
2008Q1	Acetazolamide	3	2549	76470.000

Notice how PROC MEANS retains the original formatting/specificity of the field.

And the amount of time it took on my system (i7-10857H, 32GB RAM, Windows 10):

NOTE: PROCEDURE MEANS used (Total process time):
      real time           1.02 seconds
      cpu time            4.07 seconds

This is much quicker than PROC SQL, and you don’t have to use PROC SORT!

Summary

Overall, while we do like PROC SQL due to it’s translatability and intuitive syntax, you will see that some base SAS procedures can be significantly faster.

This isn’t always the case, but you sometimes have to play around and see.

Also, I would be remiss if I didn’t mention some other ways of aggregating, namely: PROC UNIVARIATE and PROC SUMMARY.

Additionally, the times that I posted are from a single run. Times can certainly vary across multiple runs, but the times will be normally distributed across hundreds of runs – think Central Limit Theorem!

You can find an example program here. Again, since it isn’t practical to post the entire data set, this will just show you and provide you a way to examine how these PROC’s work.

I encourage you to play around with the CLASS statement to see how you can modify the level of aggregation.