Performance of Freq of Categorical Data — SAS vs Python
In this article, we will compare the performance of the SAS Proc Freq procedure against equivalent frequency statistics computed with Python libraries.

This is the 2nd article in the SAS-vs-Python comparison (the 1st was “Can we get SAS Proc Freq with Python?”) and the 1st in the Performance Measurement series.
Disclaimer:
- I have used SAS University Edition here; the SAS EG used by businesses may perform better.
- Every time you run these queries, the time taken will differ slightly; I have tried to report averages that are as representative as possible.
- Even though I wanted to use a huge file, SAS UE doesn’t allow files larger than 10 MB. That said, I will rerun these queries on a 10 MB file in the future and update the results.
Applications/Resources Used:
- SAS University Edition — For writing SAS Code
- Jupyter Notebook — For writing python code
- president_county_candidate.csv (1.47 MB) file from the US Election 2020 Dataset — For performing data operations
- Pandas — For data frame operations
1. Normal Proc Freq
SAS
Importing the CSV
FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';
PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;
Running the Proc Freq
proc freq data=Gov_C_SAS;
tables state;
run;
Python
Importing Pandas and Reading the CSV
import pandas as pd
data = pd.read_csv("../input/us-election-2020/governors_county.csv")
Matching the Default Ordering
Sort the index just to match SAS’s default (alphabetical) ordering.
datax = data['state'].value_counts().sort_index()
Series To Dataframe
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
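The Series-to-DataFrame steps above can be wrapped into a small reusable helper. The sketch below assumes a hypothetical function name, sas_style_freq, and uses a tiny in-memory sample in place of the article's CSV:

```python
import pandas as pd

def sas_style_freq(series):
    """Build a Proc Freq-style frequency table (hypothetical helper)."""
    counts = series.value_counts().sort_index()  # match SAS default ordering
    total = counts.values.sum()
    return pd.DataFrame({
        series.name: counts.index,
        'Frequency': counts.values,
        'Percent': ((counts.values / total) * 100).round(2),
        'Cumulative Frequency': counts.values.cumsum(),
        'Cumulative Percent': ((counts.values.cumsum() / total) * 100).round(2),
    })

# Small stand-in for governors_county.csv
sample = pd.DataFrame({'state': ['Delaware', 'Delaware', 'Vermont']})
print(sas_style_freq(sample['state']))
```

Wrapping the logic in a function makes it easy to reuse across the later sections, which repeat the same dictionary construction.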
Total Time Consumed — Simple Proc Freq

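The article does not show how the Python timings were captured; a minimal sketch of one way to do it, using the standard library's perf_counter around the value_counts call (with a small in-memory sample standing in for the CSV):

```python
import time
import pandas as pd

# Small stand-in for governors_county.csv
data = pd.DataFrame({'state': ['Delaware'] * 3 + ['Vermont'] * 2})

start = time.perf_counter()
datax = data['state'].value_counts().sort_index()
elapsed = time.perf_counter() - start
print(f"value_counts took {elapsed:.6f} seconds")
```

For more stable numbers, the same call can be repeated many times (e.g. with the timeit module) and averaged, which matches the article's note about averaging runs.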
2. Sort the rows from most frequent to least frequent
SAS: Specify additional option “order=freq”, default is “order=data”.
proc freq data=Gov_C_SAS order=freq;
tables state;
run;
Python: Just drop the sort_index() method.
datax = data['state'].value_counts()
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay
Total Time Consumed

3. Additional options NOCUM, NOPERCENT
SAS: Specify the nopercent and nocum options after the variable name, separated by a forward slash.
proc freq data = Gov_C_SAS;
tables state /nopercent nocum;
run;
Python: Just drop the last two columns while converting to a data frame.
datax = data['state'].value_counts().sort_index()
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values
})
datay
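An arguably more idiomatic way to get the same two-column table is to chain rename_axis and reset_index instead of building the DataFrame by hand. A sketch, again on a small in-memory sample:

```python
import pandas as pd

# Small stand-in for governors_county.csv
data = pd.DataFrame({'state': ['Delaware', 'Vermont', 'Delaware']})

# rename_axis names the index so reset_index turns it into a 'state' column
datay = (data['state'].value_counts()
         .sort_index()
         .rename_axis('state')
         .reset_index(name='Frequency'))
print(datay)
```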
Total Time Consumed

4. Creating a Cross Tab
SAS: Use var1*var2 in the tables statement, dropping the additional percentages to keep the output simple.
proc freq data=Gov_C_SAS;
tables county*state / norow nocol nopercent;
run;
Python: Using pandas crosstab method.
datab = pd.crosstab(data.county, data.state, margins=True, margins_name="Total")
datab
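The cell percentages that Proc Freq prints by default (suppressed above with norow/nocol/nopercent) can also be reproduced on the pandas side with crosstab's normalize argument. A sketch on a small sample:

```python
import pandas as pd

# Small stand-in for governors_county.csv
data = pd.DataFrame({
    'county': ['Kent', 'Kent', 'Sussex'],
    'state': ['Delaware', 'Delaware', 'Delaware'],
})

# normalize='all' expresses each cell as a share of the grand total
pct = pd.crosstab(data.county, data.state, normalize='all') * 100
print(pct.round(2))
```

normalize also accepts 'index' and 'columns' for row and column percentages, mirroring Proc Freq's row and column percent output.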
Total Time Consumed

That’s it for the comparison. We saw that pandas works quite fast compared to SAS (University Edition) for computing various frequencies on categorical columns. Do let me know your thoughts below.