10 Pandas Functions That Will Improve Your Data Analysis Workflow
Pandas is a widely used and powerful Python library that facilitates data manipulation and analysis. Whether you are a data scientist, analyst, or someone who works with data regularly, mastering Pandas can significantly improve your work efficiency. In this article, we will discuss 10 essential Pandas functions that you can use daily to streamline your data analysis workflow.
1. read_csv()
Reading data stored in files is a fundamental task in data analysis. Pandas simplifies this process through its read_csv() function.
This function reads data from a CSV file and creates a DataFrame, which is a powerful data structure for working with tabular data.
Additionally, you can customize the import process by specifying parameters such as the delimiter, header, encoding, and more.
# Reading a CSV file using read_csv()
import pandas as pd
df = pd.read_csv('example.csv')
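To make the customization options concrete, here is a minimal sketch. The semicolon-delimited text and the column names are made up for illustration, and an in-memory buffer stands in for a real file so the snippet runs as-is. With a file path, you would also pass encoding (for example encoding='utf-8') the same way.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited data standing in for a file on disk.
raw = io.StringIO(
    "id;name;score\n"
    "1;Ana;9.5\n"
    "2;Ben;7.0\n"
)

# sep sets the delimiter, header=0 says the first row holds the column names,
# and usecols limits the import to the columns you actually need.
df_custom = pd.read_csv(raw, sep=";", header=0, usecols=["id", "score"])
print(df_custom)
```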
2. head()
The head() function displays the first few rows of a DataFrame (five by default). It’s particularly useful when you’re working with large datasets and want to quickly preview the data.
To use the head() function, you simply need to call it on your DataFrame variable.
For example, after reading a CSV file using the read_csv() function and storing it in a variable “df”, you can call the head() function on it to display the first five rows of the DataFrame:
# Displaying the first 5 rows of the DataFrame
print(df.head())
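head() also accepts the number of rows to return. The short sketch below uses a made-up ten-row DataFrame (df_demo is a hypothetical name) to show a custom row count and a negative count, which returns everything except the last rows.

```python
import pandas as pd

# Hypothetical ten-row DataFrame used only for this illustration.
df_demo = pd.DataFrame({"value": range(10)})

# Ask for a specific number of rows instead of the default five.
first_three = df_demo.head(3)

# A negative n returns all rows except the last |n| rows.
all_but_last_two = df_demo.head(-2)

print(first_three)
print(len(all_but_last_two))
```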
3. info()
The info() function prints a concise summary of a DataFrame, including the data type of each column, the number of non-null values, and memory usage.
This function is useful for checking the completeness of your dataset and identifying potential issues with data types or missing values.
To use the info() function, simply call it on your DataFrame variable:
# Displaying the summary of the DataFrame
df.info()
This prints the summary of the DataFrame stored in the variable “df”. Note that info() writes its report directly to the output and returns None, so wrapping it in print() would add a stray "None" line after the summary.
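If you would rather capture the summary as text (for a log file, say), info() accepts a buf argument. The following is a small sketch on a made-up DataFrame with a few missing values. The names df_demo, buffer, and summary are illustrative only.

```python
import io
import pandas as pd

# Hypothetical DataFrame with one missing value in each column.
df_demo = pd.DataFrame({
    "name": ["Ana", None, "Cy"],
    "score": [9.5, 7.0, None],
})

# Redirect the report into a text buffer instead of printing it.
buffer = io.StringIO()
result = df_demo.info(buf=buffer)  # info() itself returns None
summary = buffer.getvalue()

print(summary)
```

The non-null counts in the captured text make it easy to spot columns with missing values.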
4. describe()
The describe() function generates descriptive statistics for a DataFrame. By default it covers only the numeric columns and reports the count, mean, standard deviation, minimum, maximum, and quartile values. This function is useful for quickly understanding the distribution and range of your data.
To use the describe() function, simply call it on your DataFrame variable:
# Generating descriptive statistics for the DataFrame
print(df.describe())
This will print the descriptive statistics for each numeric column in the DataFrame stored in the variable “df”.
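To include non-numeric columns as well, you can pass include="all". Here is a minimal sketch on made-up data (df_demo and its columns are hypothetical) that compares the two calls.

```python
import pandas as pd

# Hypothetical data with one text column and one numeric column.
df_demo = pd.DataFrame({
    "category": ["a", "b", "a", "c"],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Default: only numeric columns are summarized.
numeric_stats = df_demo.describe()

# include="all" adds rows such as unique, top, and freq for text columns.
all_stats = df_demo.describe(include="all")

print(numeric_stats)
print(all_stats)
```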
5. dropna()
The dropna() function is a helpful function in Pandas that removes rows or columns with missing values from a DataFrame. This function is useful for preparing your data for analysis or visualization.
To use the dropna() function, simply call it on your DataFrame variable:
# Removing rows with missing values
df = df.dropna()
This will remove all rows with missing values from the DataFrame stored in the variable “df”.
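dropna() takes a few parameters that control what counts as "missing enough" to drop. The sketch below uses a made-up DataFrame (all names are illustrative) to show the subset, axis, and thresh parameters side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two columns.
df_demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [9.99, np.nan, 4.50, np.nan],
    "note": [np.nan, np.nan, "ok", "check"],
})

# Default: drop every row that has at least one missing value.
rows_any = df_demo.dropna()

# Only consider the "price" column when deciding which rows to drop.
rows_price = df_demo.dropna(subset=["price"])

# Drop columns (instead of rows) that contain missing values.
cols_full = df_demo.dropna(axis="columns")

# Keep rows that have at least two non-missing values.
rows_thresh = df_demo.dropna(thresh=2)

print(rows_any, rows_price, cols_full, rows_thresh, sep="\n\n")
```

Each call returns a new DataFrame and leaves df_demo untouched, which is why the original example reassigns the result to df.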
6. groupby()
The groupby() function is a powerful function in Pandas that is used to group data by one or more columns and perform aggregations on each group. This function is useful for calculating summary statistics by group.
To use the groupby() function, simply call it on your DataFrame variable, specifying the column(s) to group by:
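# Grouping data by the "category" column and calculating the mean of each numeric column
df.groupby('category').mean(numeric_only=True)
This will group the data in the DataFrame stored in the variable “df” by the “category” column and calculate the mean of each numeric column within each group. Passing numeric_only=True matters in pandas 2.0 and later, where mean() raises a TypeError if the DataFrame still contains non-numeric columns such as strings.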
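When you need more than one statistic per group, agg() with named aggregations keeps the result readable. This is a small sketch on made-up sales data. The column and result names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical sales data.
df_demo = pd.DataFrame({
    "category": ["a", "b", "a", "b"],
    "region": ["north", "north", "south", "south"],
    "sales": [10, 20, 30, 40],
})

# Select one column before aggregating to avoid touching text columns.
mean_sales = df_demo.groupby("category")["sales"].mean()

# Named aggregation: each keyword becomes a column in the result.
summary = df_demo.groupby("category").agg(
    total_sales=("sales", "sum"),
    max_sale=("sales", "max"),
)

print(mean_sales)
print(summary)
```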
7. pivot_table()
The pivot_table() function is a helpful function in Pandas that is used to create a pivot table from a DataFrame. A pivot table allows you to summarize the data in a tabular format, making it easier to understand and analyze.
To use the pivot_table() function, pass your DataFrame to pd.pivot_table() (or call df.pivot_table()), specifying the columns to use as rows and columns, and the values to aggregate:
# Creating a pivot table with "category" as rows, "date" as columns, and "sales" as values
pd.pivot_table(df, index='category', columns='date', values='sales')
This will create a pivot table from the DataFrame stored in the variable “df”, with “category” as rows, “date” as columns, and the mean of “sales” in each cell, since pivot_table() aggregates with the mean by default.
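To use a different aggregation or fill in empty cells, you can pass aggfunc and fill_value. The sketch below assumes a small made-up dataset in which one category has no sales in one month.

```python
import pandas as pd

# Hypothetical data: category "b" has no sales in 2024-02.
df_demo = pd.DataFrame({
    "category": ["a", "a", "b", "a"],
    "date": ["2024-01", "2024-02", "2024-01", "2024-01"],
    "sales": [10, 20, 30, 50],
})

# Sum sales per cell and show 0 instead of NaN where no rows exist.
table = pd.pivot_table(
    df_demo,
    index="category",
    columns="date",
    values="sales",
    aggfunc="sum",
    fill_value=0,
)

print(table)
```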
8. merge()
The merge() function is a useful function in Pandas that is used to merge two DataFrames based on a common column or index. This function is useful for combining data from multiple sources into a single DataFrame for analysis.
To use the merge() function, pass both DataFrames to pd.merge(), specifying the common column or index to merge on:
# Merging two DataFrames on the "id" column
pd.merge(df1, df2, on='id')
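This will merge the two DataFrames stored in the variables “df1” and “df2” on the “id” column. By default, merge() performs an inner join, so only rows whose “id” appears in both DataFrames are kept.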
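The how parameter controls which rows survive the merge, and indicator=True adds a column showing where each row came from. Here is a minimal sketch with two made-up DataFrames that share only some ids.

```python
import pandas as pd

# Hypothetical tables that overlap on ids 2 and 3 only.
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [7.0, 8.5, 6.0]})

# Default inner join: keep only ids present in both tables.
inner = pd.merge(df1, df2, on="id")

# Left join: keep every row of df1; unmatched scores become NaN.
# indicator=True adds a "_merge" column describing each row's origin.
left = pd.merge(df1, df2, on="id", how="left", indicator=True)

print(inner)
print(left)
```

Checking the "_merge" column after a join is a quick way to see how many rows failed to match.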
9. apply()
The apply() function is a versatile function in Pandas that is used to apply a function to each row or column of the DataFrame. It’s particularly useful for data cleaning and transformation. This function can take a Python function or a lambda function as input.
For example, let’s say you have a DataFrame whose string columns use inconsistent capitalization, and you want to convert all of their values to lowercase. You can achieve this using the apply() function with a lambda function:
# create a sample DataFrame
df = pd.DataFrame({
'Name': ['John', 'JANE', 'Jim', 'Janet'],
'Age': [25, 30, 35, 40],
'Salary': ['$5000', '$6000', '$7000', '$8000']
})
# apply a lambda function to convert all string columns to lowercase
df[['Name', 'Salary']] = df[['Name', 'Salary']].apply(lambda x: x.str.lower())
print(df)
This will output:
    Name  Age Salary
0   john   25  $5000
1   jane   30  $6000
2    jim   35  $7000
3  janet   40  $8000
In this example, the lambda function is applied to the ‘Name’ and ‘Salary’ columns using the apply() function to convert all the values to lowercase. The ‘Salary’ strings contain no letters, so they come out unchanged.
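apply() can also work on a single column or row by row. The sketch below reuses a shortened version of the sample data. The new column names SalaryValue and Label are made up for illustration.

```python
import pandas as pd

# Shortened version of the sample data from above.
df_demo = pd.DataFrame({
    "Name": ["John", "Jane"],
    "Age": [25, 30],
    "Salary": ["$5000", "$6000"],
})

# Series.apply: run a function on each value of one column.
df_demo["SalaryValue"] = df_demo["Salary"].apply(lambda s: int(s.lstrip("$")))

# DataFrame.apply with axis=1: run a function on each row.
df_demo["Label"] = df_demo.apply(
    lambda row: f"{row['Name']} ({row['Age']})", axis=1
)

print(df_demo)
```

Row-wise apply() is flexible but can be slow on large DataFrames, so built-in vectorized methods such as the .str accessor are usually worth trying first.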
10. plot()
The plot() function is used to create various types of plots, including line plots, bar plots, and scatter plots, directly from a DataFrame. This function is built on top of the Matplotlib library and provides a convenient way to visualize data. You can customize the plot by specifying various parameters such as the title, axis labels, and more.
import pandas as pd
import matplotlib.pyplot as plt
# Create a DataFrame
data = {'year': [2015, 2016, 2017, 2018, 2019, 2020],
'sales': [100, 120, 150, 180, 200, 220]}
df = pd.DataFrame(data)
# Create a line plot
df.plot(x='year', y='sales', kind='line', title='Sales over time')
# Display the plot
plt.show()
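Switching the plot type is mostly a matter of changing the kind argument. As a rough sketch, the snippet below draws a bar chart from a shortened, made-up version of the sales data and sets the axis labels on the Matplotlib Axes object that plot() returns.

```python
import pandas as pd

# Shortened, hypothetical version of the sales data.
df_demo = pd.DataFrame({
    "year": [2018, 2019, 2020],
    "sales": [180, 200, 220],
})

# kind="bar" draws one bar per row; plot() returns a Matplotlib Axes.
ax = df_demo.plot(x="year", y="sales", kind="bar",
                  title="Sales by year", legend=False)

# Customize the axis labels through the returned Axes object.
ax.set_xlabel("Year")
ax.set_ylabel("Sales")

# In an interactive session, call matplotlib.pyplot.show() to display it.
```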
These 10 Pandas functions are just the tip of the iceberg when it comes to the functionality that Pandas provides. By mastering these functions, you'll be able to streamline your data analysis workflow and make more informed decisions based on your data.