How do you aggregate in SAS? You often have to use some PROC’s to get what you need.

What is aggregating?

Aggregating, if you’re not aware, is simply rolling up information from a lot of records into a more concise record.

Think of it like attempting to sum the number of antidepressants dispensed across year-quarter intervals with a PDE file.

Most beneficiaries taking antidepressants will several/hundreds of records per month over a given period. How can we get them down to one record?

Translating from other languages

If you have a Stata background, I’m talking about using the collapse command.

If you have an SPSS background, I’m talking about the AGGREGATE command.

If you have an R background, I’m talking about the aggregate, dplyr::summarize(), or any other variant of aggregation within R.

Methods of Aggregating

Note: I try to be careful when using the word ‘count’ with aggregating. While counting is a form of aggregation, it can get confusing depending on what research question you are trying to examine. Counting in SAS typically refers to row-by-row processing/indexing, while aggregating usually involves some single measure for a given classification.

Also, I am using a simulated data set of drugs from the DE-SynPUF PDE file. I run two PROC SURVEYSELECT statements with different seeds to get two data sets, and then I pull from each one 1,250 times for a total of 2,500 runs.

I also use Redbook to get the drug names based on the NDC number and remove records that do not have a match in Redbook.

This creates a data set with 12,225,787 records.

Since the size of this data set is too large to post, I will not be able to make this reproducible, but I will make code available that uses data sets in the SASHELP library.

Here are the first ten observations:

PDE_ID SRVC_DT PROD_SRVC_ID QTY_DSPNSD_NUM DAYS_SUPLY_NUM PTNT_PAY_AMT TOT_RX_CST_AMT year_qtr gennme
233694491634414 20081115 00088179730 30.000 30 10.00 10.00 2008Q4 Diltiazem Hydrochloride
233524489010534 20100420 52959054604 10.000 30 0.00 10.00 2010Q2 Lamivudine/Zidovudine
233444491537554 20081002 54569469800 30.000 30 0.00 60.00 2008Q4 Valsartan
233244491270323 20100806 63874022060 30.000 30 10.00 10.00 2010Q3 Naloxone Hydrochloride/Pentazocine Hydrochloride
233364489627783 20081003 53746019405 270.000 10 10.00 10.00 2008Q4 Naproxen Sodium
233814492055785 20081125 00603030521 90.000 90 0.00 80.00 2008Q4 Docusate Calcium
233924489706141 20080221 00349700910 30.000 10 0.00 10.00 2008Q1 Probenecid
233404490150323 20090822 00574006945 120.000 30 30.00 10.00 2009Q3 Dextrose
233964488612114 20091006 55953013270 60.000 30 0.00 10.00 2009Q4 Captopril
233964490921001 20080531 00378054477 90.000 30 0.00 20.00 2008Q2 Hydrochlorothiazide/Quinapril Hydrochloride

PROC SQL

The first and most popular is probably PROC SQL. It’s intuitive, versatile, and efficient with large data sets.

proc sql;
	create table			aggregate_proc_sql_no_count as
		select
					year_qtr,
					gennme,		
					sum(qty_dspnsd_num) as total_rx_prescribed
		from
					base_file
		group by
					year_qtr, formatted_name;
quit;

And if we wanted to get a value count, we just use the COUNT function.

proc sql;
	create table			aggregate_proc_sql_w_count as
		select
					year_qtr, /* #1 */
					gennme, /* #2 */
                    			count(gennme) as freq, /* #3 */
					sum(qty_dspnsd_num) as total_rx_prescribed /* #3 */
		from
					want_year_qtr
		group by
					year_qtr, gennme; /* #4 */
quit;
year_qtr gennme freq total_rx_prescribed
2008Q1 Acetaminophen 10000 528920
2008Q1 Acetaminophen/Aspirin/Caffeine 2451 220590
2008Q1 Acetaminophen/Butalbital 2451 24510
2008Q1 Acetaminophen/Codeine Phosphate 98 3920
2008Q1 Acetaminophen/Diphenhydramine Hydrochloride 2500 150000
2008Q1 Acetaminophen/Hydrocodone Bitartrate 12549 1337750
2008Q1 Acetaminophen/Magnesium Salicylate/Pamabrom 49 2940
2008Q1 Acetaminophen/Oxycodone Hydrochloride 49 490
2008Q1 Acetaminophen/Propoxyphene Napsylate 2549 34310
2008Q1 Acetazolamide 2549 76470
  1. Select the year_qtr variable which is the date the drug was dispensed formatted using the YYQ6. format.
  2. Select the gennme variable which represents the generic drug name.
  3. COUNT the number of instances of the generic name.
  4. Finally, group the results by year_qtr and gennme

We get this (first ten observations):

The issue with the PROC SQL is that it is slower than other SAS procedures when aggregating. Odd, right? But here was the amount of time it took to run the above code:

NOTE: PROCEDURE SQL used (Total process time):
      real time           5.37 seconds
      cpu time            7.76 seconds

PROC MEANS

But most SAS PROC’s require sorting, right?

Alt Text

While you can sort and use the BY statement in PROC MEANS, you can completely avoid the PROC SORT by using the CLASS argument in PROC MEANS. Here’s what I mean:

proc means 
	data = want_year_qtr
	noprint /* #1 */
    	nway; /* #2 */
		class year_qtr formatted_name; /* #3 */
		var qty_dspnsd_num; /* #4 */
	    	output out = aggregate_proc_means /* #5 */
	    		sum = / autoname; /* 6 & 7 */
run;
  1. NOPRINT suppresses the default HTML listing. You want to add this so SAS isn’t trying to produce a report containing all the observations.
  2. NWAY tells SAS to aggregate to the maximum number of elements in your CLASS statement. This is an oversimplification, but you can read about the formal definition here
    • You can use a WAYS statement instead, but you have to explicitly define how many levels you want the information aggregated to. WAYS 0 will produce overall statistics for the data set, while WAYS 1 will produce overall statistics by year_qtr, and so on.
  3. CLASS tells SAS what level of detail you want.
  4. I am requesting SAS to calculate statistics for the qty_dspnsd_num variable and that variable only.
    • You can request additional variables here. If you omit this statement, you are going to be returned summary statistics for all numeric variables in the data set.
  5. I am outputting a data set to my WORK library called aggregate_proc_means
  6. SUM = tells SAS that I want to do with the field listed no the VAR statement. You can choose a variety of statistics here - mean, median, etc. The right side of the argument typically specifies what you want to aggregate. I could add sum(qty_dspnsd_num) = sum if I want to explicitly request that it aggregates a given field. Omitting this implies that you want to SUM the field on the VAR statement. The left side is your summary statistic request, and the right side is the variable name that you want.
  7. The / autoname tells SAS that you want to access certain options. Here, I am requesting that SAS automatically names the variables once they’re aggregated. This means the summarized output variable will be prefixed by the requested statistic and the variable name (Sum_qty_dspnsd_num).

The CLASS statement essentially, in my eyes, classifies or cases the records to the level of detail that we desire. In other words, we don’t want patient-level aggregations, we want it cased or rolled-up by year_qtr and formatted_name.

Also, it automatically creates two variables on the output data set: _TYPE_ and _FREQ_. _TYPE_ represents the number of classifications. Notice how _FREQ_ gives the same results as the COUNT function from PROC SQL.

You can drop these fields on your OUT = statement by specifying output out = your_ds_here (drop = _:)

This results in this:

year_qtr formatted_name TYPE FREQ QTY_DSPNSD_NUM_Sum
2008Q1 Acetaminophen 3 10000 528920.000
2008Q1 Acetaminophen/Aspirin/Caffeine 3 2451 220590.000
2008Q1 Acetaminophen/Butalbital 3 2451 24510.000
2008Q1 Acetaminophen/Codeine Phosphate 3 98 3920.000
2008Q1 Acetaminophen/Diphenhydramine Hydrochloride 3 2500 150000.000
2008Q1 Acetaminophen/Hydrocodone Bitartrate 3 12549 1337750.000
2008Q1 Acetaminophen/Magnesium Salicylate/Pamabrom 3 49 2940.000
2008Q1 Acetaminophen/Oxycodone Hydrochloride 3 49 490.000
2008Q1 Acetaminophen/Propoxyphene Napsylate 3 2549 34310.000
2008Q1 Acetazolamide 3 2549 76470.000

Notice how PROC MEANS retains the original formatting/specificity of the field.

And the amount of time it took on my system (i7-10857H, 32GB RAM, Windows 10):

NOTE: PROCEDURE MEANS used (Total process time):
      real time           1.02 seconds
      cpu time            4.07 seconds

This is much quicker than PROC SQL, and you don’t have to use PROC SORT!

Summary

Overall, while we do like PROC SQL due to it’s translatability and intuitive syntax, you will see that some base SAS procedures can be significantly faster.

This isn’t always the case, but you sometimes have to play around and see.

Also, I would be remiss if I didn’t mention some other ways of aggregating, namely: PROC UNIVARIATE and PROC SUMMARY.

Additionally, the times that I posted are from a single run. Times can certainly vary across multiple runs, but the times will be normally distributed across hundreds of runs – think Central Limit Theorem!

You can find an example program here. Again, since it isn’t practical to post the entire data set, this will just show you and provide you a way to examine how these PROC’s work.

I encourage you to play around with the CLASS statement to see how you can modify the level of aggregation.