Performance of Categorical Frequencies — SAS vs Python
In this article, we will compare the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We will use the US Election 2020 Dataset and measure the time taken for various operations on the data using SAS and Python.
This is the 2nd article for a comparison of SAS and Python, the 1st one is “Can we get SAS Proc Freq with Python?” and the 1st in the Performance Measurement series.
Disclaimer:
It’s important to note that the performance may vary based on the system configuration and data size.
- I have used SAS University Edition here, the SAS EG used by the businesses may perform better.
- Every time you would try to run these queries, the time taken will be different. Although, I have tried to keep the average time as close as possible.
- Even though I wanted to use a huge file, SAS UE doesn’t allow more than 10 MB of the file. With that said, I will still rerun these queries in the future on a 10 MB file and will update these results (if there’s enough interest).
Applications/Resources Used:
- SAS University Edition — For writing SAS Code
- Jupyter Notebook — For writing python code
- president_county_candidate.csv(1.47 MB) file from US Election 2020 Dataset — For performing data operations
- Pandas — For data frame operations
Methodology
We will perform the following operations on the dataset:
Basic Proc Freq
: We will use Proc Freq in SAS and Pandas library in Python to obtain the frequency distribution of categorical data.Sorting rows
: We will sort the rows from most frequent to least frequent and compare the time taken in SAS and Python.Additional options
: We will compare the performance of SAS and Python when using additional options such as NOCUM and NOPERCENT.Creating a Cross Tab
: We will create a cross tab using SAS and Pandas to compare the performance of both tools.
1. Basic Proc Freq
In SAS, we imported the CSV file and used Proc Freq to obtain the frequency distribution of categorical data. In Python, we imported the Pandas library and read the CSV file. We then used value_counts() and sort_index() methods to obtain the frequency distribution. The time taken by SAS was longer than Python.
SAS
FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';
PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;
proc freq data=Gov_C_SAS;
tables state;
run;
Python
import pandas as pd
data = pd.read_csv("../input/us-election-2020/governors_county.csv");
datax = data['state'].value_counts().sort_index()
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
Total Time Consumed — Simple Proc Freq
2. Sort the rows from most frequent to least frequent
In SAS, we used the “order=freq” option to sort the rows from most frequent to least frequent. In Python, we simply dropped the sort_index() method. The time taken by SAS was again longer than Python.
SAS
proc freq data=Gov_C_SAS order=freq;
tables state;
run;
Python
datax = data[‘state’].value_counts()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values,
‘Percent’: ((datax.values/datax.values.sum())*100).round(2),
‘Cumulative Frequency’: datax.values.cumsum(),
‘Cumulative Percent’: ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay
Total Time Consumed
3. Additional Options
In SAS, we used the “nopercent” and “nocum” options to exclude cumulative frequency and percentage from the output. In Python, we dropped the last two columns while converting to a data frame. The time taken by both SAS and Python was comparable.
SAS
proc freq data = Gov_C_SAS;
tables state /nopercent nocum;
run;
Python
datax = data[‘state’].value_counts().sort_index()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values
})
datay
Total Time Consumed
4. Creating a Cross Tab
In SAS, we used the “county*state” option to create a cross tab between county and state variables. In Python, we used the crosstab() method from Pandas. The time taken by SAS was longer than Python.
SAS
proc freq data=Gov_C_SAS;
tables county*state / norow nocol nopercent;
run;
Python
datab = pd.crosstab(data.county, data.state, margins=True, margins_name=”Total”)
datab
Total Time Consumed
In conclusion, we compared the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We found that Python libraries, particularly Pandas, performed better than SAS in most cases. However, the performance may vary based on the system configuration and data size.