When it comes to data analysis, Pandas is one of the most powerful and widely used Python libraries for data manipulation, cleaning, and preprocessing.
Thanks to its features, in fact, we can work with tabular data, retrieving them from SQL databases or Excel spreadsheets, for example.
Then, we can manipulate and clean the data to prepare them for further analysis, such as plotting, and for Machine Learning.
In this article, we’ll introduce Pandas and the concept of a Pandas data frame.
Then, we’ll show some of the features of Pandas with hands-on practical examples.
Introduction to Pandas and data frames
Introducing Pandas
What actually is Pandas?
Well, this is what the developers say on the Pandas website about their mission:
“pandas aims to be the fundamental high-level building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language.”
So, we’re dealing with a library that aims to be the first one to come to mind when we think of the words “data analysis”.
And, to be honest, for Python users it largely is.
How to install Pandas
There are a couple of ways to install Pandas on your machine.
If you already have Python installed, you can install Pandas like so:
$ pip install pandas
The other way is to install Anaconda on your machine. This is my favorite way because it will install Python, Pandas, and many of the most common libraries for Data Science and Machine Learning.
So, by installing Anaconda, you’ll rarely need to install other Python libraries in the future.
To discover how to install it on your machine, you can check out their website here.
Understanding Pandas series and Pandas data frames
A data frame is a table, somewhat like a spreadsheet, where we can organize and analyze data and information.
In other words, we can think of a data frame as a container that holds and organizes data in columns and rows like so:
So, a data frame is an ordered container of data that can be in the form of text or numbers, organized in columns and rows.
In particular, any Pandas column is also called a “Pandas series”. So, another way to see a Pandas data frame is that it is an ordered collection of Pandas series.
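To make the relationship between a data frame and its series concrete, here is a minimal sketch. The column names and values are invented for illustration only.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 45]})

# Selecting a single column returns a Series
ages = df["age"]
print(type(df).__name__)    # DataFrame
print(type(ages).__name__)  # Series

# A Series carries its values together with an index and a name
print(ages.name)            # age
print(ages.tolist())        # [31, 45]
```

Each column we select this way keeps the same row index as the data frame it comes from.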
Creating and visualizing a Pandas data frame
To work with tabular data we have two possibilities:
- The data have been created somewhere and stored in a file. We can open them in Pandas and use them.
- We can create a data frame ourselves and use it immediately.
Here we’re showing the second option, while the first will be shown in the next paragraph.
In Pandas, we can create a data frame by passing a Python dictionary to pd.DataFrame(): each key becomes a column name, and each list holds the values of that column. So, suppose we want to create a data frame that stores the times, in seconds, measured for some people over several running trials. We can do it like so:
import pandas as pd
# Create data frame
times = pd.DataFrame({"Jhon":[20, 18, 36], "Simon":[15, 21, 19], "Karen":[22, 19, 16]})
# Show the first rows (head() returns 5 rows by default)
times.head()
First, we import Pandas as “pd”. Then, we create a data frame with the pd.DataFrame() constructor, and we call the method head() to display it. By default, head() shows the first five rows; our data frame has only three, so all of them are shown. This is what we obtain:
The data frame. Image by Author.
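As a quick check of how head() behaves, here is a small sketch on a hypothetical 12-row data frame (the column name "n" is made up for the example). Without an argument, head() returns the first five rows; passing a number returns that many.

```python
import pandas as pd

# Hypothetical frame with 12 rows, used only to show head()
numbers = pd.DataFrame({"n": range(12)})

first_default = numbers.head()  # no argument: first 5 rows
first_ten = numbers.head(10)    # explicit argument: first 10 rows

print(len(first_default))  # 5
print(len(first_ten))      # 10
```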
Now, let’s use some actual data to show some features of Pandas.
Data manipulation in Pandas: a hands-on tutorial with actual data
The best way to learn Pandas is by getting some data and putting our hands on the keyboard.
To show some of its features, we’ll get some data related to the world population (the file is downloadable from Kaggle here) to analyze it.
As we can see, the data are in the form of a CSV file. This is one of the most typical formats.
Supposing we’ve renamed the file “population.csv”, this is how we can open it and show the first five rows:
import pandas as pd
# Read CSV
population = pd.read_csv("population.csv")
# Show head
population.head()
And we get:
The data frame of our data. Image by author.
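If you don’t have the Kaggle file at hand, you can still try read_csv() on a small in-memory sample. The sketch below assumes invented country names and figures (they are not the real data), and uses io.StringIO so that the text behaves like a file.

```python
import io
import pandas as pd

# Invented sample in CSV form; not the Kaggle data
csv_text = """Country,Continent,2022 Population
Alpha,Europe,100
Beta,Europe,250
Gamma,Asia,75
"""

# read_csv() accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (3, 3)
print(list(sample.columns))  # ['Country', 'Continent', '2022 Population']
```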
So, we have a data frame with “a lot” of columns. But how many columns does it have? And how many rows?
To show this data, we can type the following:
# Show shape
population.shape
>>>
(234, 17)
So, our data frame has 234 rows and 17 columns.
One of the things we have to deal with when analyzing data is null values.
To check if we have any of them, we type the following:
# Show Null values
population.isnull().sum()
>>>
Rank 0
CCA3 0
Country/Territory 0
Capital 0
Continent 0
2022 Population 0
2020 Population 0
2015 Population 0
2010 Population 0
2000 Population 0
1990 Population 0
1980 Population 0
1970 Population 0
Area (km²) 0
Density (per km²) 0
Growth Rate 0
World Population Percentage 0
dtype: int64
And the result shows us that, for each column, we have 0 cells with null values. So, we can proceed with our analysis without worrying about nulls.
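Our data set happens to have no nulls, so here is a minimal sketch of what the same check looks like when some values are missing. The data are invented, and dropna() and fillna() are shown only as two common options, not as the right choice for every analysis.

```python
import numpy as np
import pandas as pd

# Invented data with one missing entry per column
df = pd.DataFrame({
    "city": ["Rome", "Oslo", None],
    "temp": [21.5, np.nan, 18.0],
})

# Count the null values in each column
null_counts = df.isnull().sum()
print(null_counts)

# Option 1: drop every row that contains at least one null
without_nulls = df.dropna()

# Option 2: fill the numeric gaps with the column mean
filled = df.fillna({"temp": df["temp"].mean()})
```

Which option makes sense depends on the data: dropping rows loses information, while filling values changes the distribution.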
Now, suppose we’re not interested in all the columns of the data frame because we believe some of them provide data we’re not interested in. We can create another data frame that has only the columns we want to analyze.
For example, say that we’re not interested in the following columns: “Rank” and “CCA3”. We can create another data frame without these two by typing the following:
population = population.drop(["Rank", "CCA3"], axis=1)
So, with the drop() method, we drop the columns we’re not interested in, but we also need to specify the axis. In Pandas, axis=1 refers to the columns, while axis=0 refers to the rows (and we use it when we want to drop rows).
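To see the two axes side by side, here is a small sketch on a toy data frame (the names "a", "b", and "c" are placeholders). It also shows the columns= and index= keywords, which do the same job without an axis argument.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

no_b = df.drop(["b"], axis=1)        # drop the column labelled "b"
no_first_row = df.drop([0], axis=0)  # drop the row with index label 0

# Equivalent keyword forms
same_cols = df.drop(columns=["b"])
same_rows = df.drop(index=[0])

print(list(no_b.columns))        # ['a', 'c']
print(list(no_first_row.index))  # [1, 2]
```

Note that drop() returns a new data frame: the original is left unchanged unless we reassign it.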
Also, the column name “Country/Territory” is cumbersome, because we need to write the whole name every time we filter on it. So let’s rename it:
population = population.rename(columns={"Country/Territory":"Country"})
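One detail worth knowing about rename(): by default, a key that matches no column is silently ignored. The sketch below uses an invented two-column frame and a deliberately misspelled key ("Countri") to show this, along with the errors="raise" option that turns the mismatch into a KeyError.

```python
import pandas as pd

# Toy frame, invented for illustration
df = pd.DataFrame({"Country/Territory": ["Alpha", "Beta"], "Area": [10, 20]})

renamed = df.rename(columns={"Country/Territory": "Country"})
print(list(renamed.columns))    # ['Country', 'Area']

# A misspelled key is ignored by default: nothing changes
unchanged = df.rename(columns={"Countri": "Country"})
print(list(unchanged.columns))  # ['Country/Territory', 'Area']

# With errors="raise", the same typo is reported
raised = False
try:
    df.rename(columns={"Countri": "Country"}, errors="raise")
except KeyError:
    raised = True
print(raised)                   # True
```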
Now, suppose we want to plot the population trend over time for the three Asian countries with the highest growth rate. In other words: we want to identify the top three Asian countries by growth rate and plot how their population has changed over the years.
We can do it like so:
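import matplotlib.pyplot as plt
# Select the countries in Asia
asia = population[population["Continent"] == "Asia"]
# Select the top 3 countries per growth rate in Asia
top_3_countries = asia.sort_values(by="Growth Rate", ascending=False).head(3)
# Select the columns with population values
population_columns = [col for col in asia.columns if col.endswith("Population")]
asia_population_df = asia[population_columns]
# Order columns from 1970 to 2022
asia_population_df = asia_population_df.iloc[:, ::-1]
# Transpose: the yearly columns become rows
asia_population_df = asia_population_df.transpose()
# Use the country names as column labels
asia_population_df.columns = asia["Country"]
# Keep only the top 3 countries and create the plot
top_3_population_df = asia_population_df[top_3_countries["Country"]]
top_3_population_df.plot(figsize=(10, 6))
# Label title and axes
plt.title("Population in Asia with the max growth rate")
plt.xlabel("Year")
plt.ylabel("Population")
# Show legend
plt.legend(loc="upper left")
# Rotate x-axis values
plt.xticks(rotation="vertical")
# Show plot
plt.show()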
And we get:
The trend we were searching for. Image by Author.
So, in the above code we have:
- Selected the rows where the Continent column is “Asia” with population[population["Continent"] == "Asia"].
- Sorted the data frame with the method sort_values(), choosing “Growth Rate” as the column, and then selected the first three rows with the method head(3).
- Taken all the columns whose name ends with “Population”, using the string method endswith(), and created another data frame called asia_population_df.
- Reversed the order of the columns with iloc, so that they run in chronological order.
- Transposed the data frame with the method transpose(). This means that the columns containing the population data for each year have now become rows: this way, the years end up on the horizontal axis of the plot.
- Used the country names as column labels with asia["Country"].
- Selected the top three countries with [top_3_countries["Country"]], as we wanted.
- Used Matplotlib to plot the data.
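The reshaping steps listed above are easier to follow on a tiny example. Here is a minimal sketch on an invented two-country frame (the names and figures are placeholders, not the Kaggle data) that shows what the data frame looks like just before plotting.

```python
import pandas as pd

# Invented figures for illustration; not the Kaggle data
toy = pd.DataFrame({
    "Country": ["Alpha", "Beta"],
    "2022 Population": [120, 80],
    "1970 Population": [60, 50],
})

# Keep only the population columns and reverse them into chronological order
pop_cols = [col for col in toy.columns if col.endswith("Population")]
by_year = toy[pop_cols].iloc[:, ::-1]

# Transpose: the yearly columns become rows, one column per country
by_year = by_year.transpose()
by_year.columns = toy["Country"]

print(by_year)
```

After these steps, each row is a year and each column is a country, which is the layout that plot() turns into one line per country.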
Now, suppose we want to show the population trend over the years related to the 5 Countries with the maximum value of “World Population Percentage”. We can do it like so:
import matplotlib.pyplot as plt
# Sort for pop. percentage
population = population.sort_values(by="World Population Percentage", ascending=False)
# Top 5
top_5_countries = population.head(5)
filtered_df = population[population["Country"].isin(top_5_countries["Country"])]
# Select the column with population values
population_columns = [col for col in filtered_df.columns if col.endswith("Population")]
population_df = filtered_df[population_columns]
# Order columns from 1970 to 2022
population_df = population_df.iloc[:, ::-1]
# Transpose: the yearly columns become rows
population_df = population_df.transpose()
# Use the country names as column labels
population_df.columns = filtered_df["Country"]
# Plot
population_df.plot.line(figsize=(10, 6))
plt.title("Population trend over the years: top 5 Countries per World Population Percentage")
plt.xlabel("Year")
plt.ylabel("Population")
# Rotate x-axis values
plt.xticks(rotation = 'vertical')
# Show legend
plt.legend(loc="upper left")
# Show plot
plt.show()
And we get:
The trend we were searching for. Image by Author.
The code is almost identical to the previous one. The only difference is that here we haven’t filtered by continent: we’ve selected the top 5 Countries in the world by World Population Percentage, and we’ve plotted their population trend over the years.
This plot, in particular, shows how fast the population of China and India has grown over the years.
Conclusions
In this article, we’ve seen how we can use Pandas to manipulate the data to get insights from them, by plotting specific graphs.
The best way to learn Pandas is to get some data we’re interested in and explore them with curiosity. This hands-on approach is one of the most effective ways to learn it.