Performance of Freq of Categorical Data — SAS vs Python

Photo by Glenn on Unsplash

Disclaimer:

  1. I have used SAS University Edition here, the SAS EG used by the businesses may perform better.
  2. Every time you would try to run these queries, the time taken will be different. Although, I have tried to keep the average time as close as possible.
  3. Even though I wanted to use a huge file, SAS UE doesn’t allow more than 10 MB of the file. With that said, I will still rerun these queries in the future on a 10 MB file and will update these results.

Applications/Resources Used:

  • SAS University Edition — For writing SAS Code
  • Jupyter Notebook — For writing python code
  • president_county_candidate.csv(1.47 MB) file from US Election 2020 Dataset — For performing data operations
  • Pandas — For data frame operations

1. Normal Proc Freq

SAS

Importing the CSV

FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;
proc freq data=Gov_C_SAS;
tables state;
run;

Python

Importing Pandas and Reading the CSV

import pandas as pddata = pd.read_csv(“../input/us-election-2020/governors_county.csv”);
datax = data[‘state’].value_counts().sort_index()
datay = pd.DataFrame({
'state': datax.index,
'Frequency': datax.values,
'Percent': ((datax.values/datax.values.sum())*100).round(2),
'Cumulative Frequency': datax.values.cumsum(),
'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})

Total Time Consumed — Simple Proc Freq

2. Sort the rows from most frequent to least frequent

SAS: Specify additional option “order=freq”, default is “order=data”.

proc freq data=Gov_C_SAS order=freq;
tables state;
run;
datax = data[‘state’].value_counts()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values,
‘Percent’: ((datax.values/datax.values.sum())*100).round(2),
‘Cumulative Frequency’: datax.values.cumsum(),
‘Cumulative Percent’: ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay

Total Time Consumed

3. Additional options NOCUM, NOPERCENT

SAS: Specify nopercent and nocum options after the var name separated by a forward slash.

proc freq data = Gov_C_SAS;
tables state /nopercent nocum;
run;
datax = data[‘state’].value_counts().sort_index()
datay = pd.DataFrame({
‘state’: datax.index,
‘Frequency’: datax.values
})
datay

Total Time Consumed

4. Creating a Cross Tab

SAS: var1*var2 and dropping additional details to keep it simple.

proc freq data=Gov_C_SAS;
tables county*state / norow nocol nopercent;
run;
datab = pd.crosstab(data.county, data.state, margins=True, margins_name=”Total”)
datab

Total Time Consumed

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store