Performance of Freq of Categorical Data — SAS vs Python


Disclaimer:

  1. I used SAS University Edition here; the SAS Enterprise Guide used in businesses may perform better.
  2. The time taken differs on every run of these queries, so I have tried to report an average that is as representative as possible.
  3. I wanted to use a much larger file, but SAS UE does not allow files over 10 MB. That said, I will rerun these queries on a 10 MB file in the future and update the results.

Applications/Resources Used:

  • SAS University Edition — For writing SAS code
  • Jupyter Notebook — For writing Python code
  • president_county_candidate.csv (1.47 MB) file from the US Election 2020 dataset — For performing data operations
  • Pandas — For data frame operations

1. Normal Proc Freq

SAS

FILENAME Gov_C '/folders/myfolders/Assignments/governors_county.csv';
PROC IMPORT DATAFILE=Gov_C DBMS=CSV OUT=WORK.Gov_C_SAS;
GETNAMES=YES;
RUN;
proc freq data=Gov_C_SAS;
tables state;
run;

Python

import pandas as pd

data = pd.read_csv("../input/us-election-2020/governors_county.csv")
datax = data['state'].value_counts().sort_index()
datay = pd.DataFrame({
    'state': datax.index,
    'Frequency': datax.values,
    'Percent': ((datax.values/datax.values.sum())*100).round(2),
    'Cumulative Frequency': datax.values.cumsum(),
    'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay

Total Time Consumed — Simple Proc Freq
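The runtimes above were measured informally. As a minimal sketch of how the pandas side of this comparison can be timed with the standard library, here is the same frequency table built over a small made-up sample (standing in for governors_county.csv, which is not bundled here):

```python
import time
import pandas as pd

# Hypothetical stand-in for the governors_county.csv 'state' column
data = pd.DataFrame({"state": ["Texas", "Texas", "Utah", "Vermont", "Utah", "Texas"]})

start = time.perf_counter()
datax = data["state"].value_counts().sort_index()
datay = pd.DataFrame({
    "state": datax.index,
    "Frequency": datax.values,
    "Percent": ((datax.values / datax.values.sum()) * 100).round(2),
    "Cumulative Frequency": datax.values.cumsum(),
    "Cumulative Percent": ((datax.values.cumsum() / datax.values.sum()) * 100).round(2),
})
elapsed = time.perf_counter() - start  # wall-clock time for the whole pipeline

print(f"Elapsed: {elapsed:.6f} s")
print(datay)
```

`time.perf_counter()` is a high-resolution wall clock; wrapping only the lines being compared keeps the CSV read out of the measurement, which is one reasonable way to make the SAS and Python numbers comparable.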

2. Sort the rows from most frequent to least frequent

proc freq data=Gov_C_SAS order=freq;
tables state;
run;
datax = data['state'].value_counts()
datay = pd.DataFrame({
    'state': datax.index,
    'Frequency': datax.values,
    'Percent': ((datax.values/datax.values.sum())*100).round(2),
    'Cumulative Frequency': datax.values.cumsum(),
    'Cumulative Percent': ((datax.values.cumsum()/datax.values.sum())*100).round(2)
})
datay
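Note that on the Python side no extra sorting step is needed here: pandas' value_counts() already returns categories ordered from most to least frequent, which is exactly what ORDER=FREQ requests in SAS. A tiny illustration on made-up data:

```python
import pandas as pd

# value_counts() sorts descending by count out of the box
s = pd.Series(["Utah", "Texas", "Texas", "Utah", "Texas"])
datax = s.value_counts()
print(datax)  # Texas (3) first, then Utah (2)
```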

Total Time Consumed

3. Additional options NOCUM, NOPERCENT

proc freq data = Gov_C_SAS;
tables state /nopercent nocum;
run;
datax = data['state'].value_counts().sort_index()
datay = pd.DataFrame({
    'state': datax.index,
    'Frequency': datax.values
})
datay
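An equivalent frequency-only table (counts without percentages or cumulative columns, mirroring NOCUM NOPERCENT) can also be built with groupby(...).size(), which returns 'state' as a proper column directly. A sketch on made-up data, not from the benchmark above:

```python
import pandas as pd

# Hypothetical sample; groupby keys come back sorted ascending by default,
# matching value_counts().sort_index()
data = pd.DataFrame({"state": ["Utah", "Texas", "Texas", "Utah", "Texas"]})
datay = data.groupby("state").size().reset_index(name="Frequency")
print(datay)
```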

Total Time Consumed

4. Creating a Cross Tab

proc freq data=Gov_C_SAS;
tables county*state / norow nocol nopercent;
run;
datab = pd.crosstab(data.county, data.state, margins=True, margins_name="Total")
datab
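The NOROW NOCOL NOPERCENT options above strip the percentage rows from each cell, leaving raw counts, which is what a plain pd.crosstab gives. If you do want the percentages that PROC FREQ prints by default, crosstab's normalize argument can reproduce them; a sketch on a small hypothetical sample:

```python
import pandas as pd

# Made-up county/state pairs for illustration only
data = pd.DataFrame({
    "county": ["A", "A", "B", "B"],
    "state":  ["Texas", "Utah", "Texas", "Texas"],
})

counts = pd.crosstab(data.county, data.state)                       # raw cell counts
pct = (pd.crosstab(data.county, data.state, normalize="all") * 100).round(2)  # cell % of grand total
print(counts)
print(pct)
```

normalize="index" and normalize="columns" likewise correspond to PROC FREQ's row and column percentages.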

Total Time Consumed


Web Dev @ psrajput.com | Writer @ eduonline24.in | Analyst @ Genpact
