Performance of Categorical Frequencies — SAS vs Python

Pradeep Singh
4 min readJan 24, 2021

--

In this article, we will compare the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We will use the US Election 2020 Dataset and measure the time taken for various operations on the data using SAS and Python.

Photo by Harley-Davidson on Unsplash

This is the 2nd article for a comparison of SAS and Python, the 1st one is “Can we get SAS Proc Freq with Python?” and the 1st in the Performance Measurement series.

Disclaimer:

It’s important to note that the performance may vary based on the system configuration and data size.

  1. I have used SAS University Edition here, the SAS EG used by the businesses may perform better.
  2. Every time you would try to run these queries, the time taken will be different. Although, I have tried to keep the average time as close as possible.
  3. Even though I wanted to use a huge file, SAS UE doesn’t allow more than 10 MB of the file. With that said, I will still rerun these queries in the future on a 10 MB file and will update these results (if there’s enough interest).

Applications/Resources Used:

  • SAS University Edition — For writing SAS Code
  • Jupyter Notebook — For writing python code
  • president_county_candidate.csv(1.47 MB) file from US Election 2020 Dataset — For performing data operations
  • Pandas — For data frame operations

Methodology

We will perform the following operations on the dataset:

  1. Basic Proc Freq: We will use Proc Freq in SAS and Pandas library in Python to obtain the frequency distribution of categorical data.
  2. Sorting rows: We will sort the rows from most frequent to least frequent and compare the time taken in SAS and Python.
  3. Additional options: We will compare the performance of SAS and Python when using additional options such as NOCUM and NOPERCENT.
  4. Creating a Cross Tab: We will create a cross tab using SAS and Pandas to compare the performance of both tools.

1. Basic Proc Freq

In SAS, we imported the CSV file and used Proc Freq to obtain the frequency distribution of categorical data. In Python, we imported the Pandas library and read the CSV file. We then used value_counts() and sort_index() methods to obtain the frequency distribution. The time taken by SAS was longer than Python.

SAS

FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';
PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;

proc freq data=Gov_C_SAS;
tables state;
run;

Python

import pandas as pd

data = pd.read_csv("../input/us-election-2020/governors_county.csv");

datax = data['state'].value_counts().sort_index()

datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})

Total Time Consumed — Simple Proc Freq

2. Sort the rows from most frequent to least frequent

In SAS, we used the “order=freq” option to sort the rows from most frequent to least frequent. In Python, we simply dropped the sort_index() method. The time taken by SAS was again longer than Python.

SAS

proc freq data=Gov_C_SAS order=freq;
tables state;
run;

Python

datax = data[‘state’].value_counts()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values,
‘Percent’: ((datax.values/datax.values.sum())*100).round(2),
‘Cumulative Frequency’: datax.values.cumsum(),
‘Cumulative Percent’: ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay

Total Time Consumed

3. Additional Options

In SAS, we used the “nopercent” and “nocum” options to exclude cumulative frequency and percentage from the output. In Python, we dropped the last two columns while converting to a data frame. The time taken by both SAS and Python was comparable.

SAS

proc freq data = Gov_C_SAS;
tables state /nopercent nocum;
run;

Python

datax = data[‘state’].value_counts().sort_index()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values
})
datay

Total Time Consumed

4. Creating a Cross Tab

In SAS, we used the “county*state” option to create a cross tab between county and state variables. In Python, we used the crosstab() method from Pandas. The time taken by SAS was longer than Python.

SAS

proc freq data=Gov_C_SAS;
tables county*state / norow nocol nopercent;
run;

Python

datab = pd.crosstab(data.county, data.state, margins=True, margins_name=”Total”)
datab

Total Time Consumed

In conclusion, we compared the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We found that Python libraries, particularly Pandas, performed better than SAS in most cases. However, the performance may vary based on the system configuration and data size.

--

--

Pradeep Singh

MLOps Engineer @ Genpact / psrajput.com / Running (10k in 59.12, 5k in 26.15) / Cricket / Trekking / Chess