# Performance of Categorical Frequencies — SAS vs Python

In this article, we will compare the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We will use the US Election 2020 Dataset and measure the time taken for various operations on the data using SAS and Python.

This is the 2nd article for a comparison of SAS and Python, the 1st one is “Can we get SAS Proc Freq with Python?” and the 1st in the Performance Measurement series.

## Disclaimer:

It’s important to note that the performance may vary based on the system configuration and data size.

- I have used SAS University Edition here, the SAS EG used by the businesses may perform better.
- Every time you would try to run these queries, the time taken will be different. Although, I have tried to keep the average time as close as possible.
- Even though I wanted to use a huge file, SAS UE doesn’t allow more than 10 MB of the file. With that said, I will still rerun these queries in the future on a 10 MB file and will update these results (if there’s enough interest).

# Applications/Resources Used:

- SAS University Edition — For writing SAS Code
- Jupyter Notebook — For writing python code
- president_county_candidate.csv(1.47 MB) file from US Election 2020 Dataset — For performing data operations
- Pandas — For data frame operations

# Methodology

We will perform the following operations on the dataset:

: We will use Proc Freq in SAS and Pandas library in Python to obtain the frequency distribution of categorical data.**Basic Proc Freq**

: We will sort the rows from most frequent to least frequent and compare the time taken in SAS and Python.**Sorting rows**

: We will compare the performance of SAS and Python when using additional options such as NOCUM and NOPERCENT.**Additional options**

: We will create a cross tab using SAS and Pandas to compare the performance of both tools.**Creating a Cross Tab**

# 1. Basic Proc Freq

In SAS, we imported the CSV file and used Proc Freq to obtain the frequency distribution of categorical data. In Python, we imported the Pandas library and read the CSV file. We then used value_counts() and sort_index() methods to obtain the frequency distribution. The time taken by SAS was longer than Python.

## SAS

`FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';`

PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;

GETNAMES=YES;

RUN;

proc freq data=Gov_C_SAS;

tables state;

run;

## Python

`import pandas as pd`

data = pd.read_csv("../input/us-election-2020/governors_county.csv");

datax = data['state'].value_counts().sort_index()

datay = pd.DataFrame({

'state': datax.index,

'Frequency': datax.values,

'Percent': ((datax.values/datax.values.sum())*100).round(2),

'Cumulative Frequency': datax.values.cumsum(),

'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)

})

## Total Time Consumed — Simple Proc Freq

# 2. Sort the rows from most frequent to least frequent

In SAS, we used the “order=freq” option to sort the rows from most frequent to least frequent. In Python, we simply dropped the sort_index() method. The time taken by SAS was again longer than Python.

**SAS**

*proc freq data=Gov_C_SAS order=freq;*

tables state;

run;

**Python**

*datax = data[‘state’].value_counts()*

datay = pd.DataFrame({

‘state’: datax.index,

‘Frequency’: datax.values,

‘Percent’: ((datax.values/datax.values.sum())*100).round(2),

‘Cumulative Frequency’: datax.values.cumsum(),

‘Cumulative Percent’: ((datax.values.cumsum()/datax.values.sum())*100).round(2)

})

datay

## Total Time Consumed

# 3. Additional Options

In SAS, we used the “nopercent” and “nocum” options to exclude cumulative frequency and percentage from the output. In Python, we dropped the last two columns while converting to a data frame. The time taken by both SAS and Python was comparable.

**SAS**

*proc freq data = Gov_C_SAS;*

tables state /nopercent nocum;

run;

**Python**

*datax = data[‘state’].value_counts().sort_index()*

datay = pd.DataFrame({

‘state’: datax.index,

‘Frequency’: datax.values

})

datay

## Total Time Consumed

# 4. Creating a Cross Tab

In SAS, we used the “county*state” option to create a cross tab between county and state variables. In Python, we used the crosstab() method from Pandas. The time taken by SAS was longer than Python.

**SAS**

*proc freq data=Gov_C_SAS;*

tables county*state / norow nocol nopercent;

run;

*Python*

*Python*

*datab = pd.crosstab(data.county, data.state, margins=True, margins_name=”Total”)*

datab

## Total Time Consumed

In conclusion, we compared the performance of the SAS Proc Freq procedure and Python libraries for analyzing categorical data. We found that Python libraries, particularly Pandas, performed better than SAS in most cases. However, the performance may vary based on the system configuration and data size.