python pandas dataframe

In this tutorial, we will explore Python Pandas DataFrames in depth, covering the essential concepts and techniques for efficient data manipulation and analysis. Once you understand the core components of a DataFrame, you will be able to use Pandas to its full potential in your data analysis projects.

Introduction to DataFrames

DataFrames are flexible data structures in the Pandas library for handling and analyzing two-dimensional data, where information is organized in rows and columns. A Pandas DataFrame can store a different data type in each column and provides essential functionality for data science tasks.

To work with DataFrames, first import the Pandas library:

import pandas as pd

Throughout this tutorial, we will cover Python dataframe examples and explore various dataframe features to help you become proficient in working with DataFrames.

Creating DataFrames

There are several ways to create a DataFrame in Pandas:

  • From a Python dictionary
  • From a NumPy array
  • From a list of dictionaries
  • From a list of lists
  • From a CSV file
  • From other DataFrames

In this section, we’ll explore different methods of creating DataFrames and provide Python pandas dataframe examples.

From a Python Dictionary

To create a DataFrame from a dictionary, each key-value pair represents a column name and its corresponding data. The data can be a list, NumPy array, or other iterable.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"]
}

df = pd.DataFrame(data)
print(df)

The output will be:

            Name  Auditory   Creator
0   Seter Design     10000  Vasiukou
1  Seter Develop    100000    Hontar
2    Seter Mngmt     50000    Hontar

From a NumPy Array

You can create a DataFrame from a NumPy array. Along with the data itself, you can pass index (row labels) and columns (column labels). Both are optional; if you omit them, Pandas falls back to default integer labels.

import pandas as pd
import numpy as np

data = np.random.randint(0, 100, size=(3, 3))
index = ["Row1", "Row2", "Row3"]
columns = ["Column1", "Column2", "Column3"]

df = pd.DataFrame(data, index=index, columns=columns)
print(df)

The output will look something like this:

      Column1  Column2  Column3
Row1       56       36       94
Row2       89       92       61
Row3       52       10       75

From a List of Dictionaries

A list of dictionaries can be used to create a DataFrame, with each dictionary representing a row of data. The keys in the dictionaries will be used as column names.

import pandas as pd

data = [
    {"total_bill": 16.99, "smoker": "No", "time": "Dinner"},
    {"total_bill": 14.29, "smoker": "Yes", "time": "Lunch"},
    {"total_bill": 26.99, "smoker": "No", "time": "Dinner"},
]

df = pd.DataFrame(data)
print(df)

The output will be:

   total_bill smoker    time
0       16.99     No  Dinner
1       14.29    Yes   Lunch
2       26.99     No  Dinner

From a List of Lists

Another way to create a DataFrame is from a list of lists, where each inner list represents a row of data.

import pandas as pd

data = [
    ["Sun", "Dinner", 2],
    ["Mon", "Dinner", 4],
    ["Tue", "Lunch", 1]
]

column_names = ["day", "time", "size"]

df = pd.DataFrame(data, columns=column_names)
print(df)

The output will be:

   day    time  size
0  Sun  Dinner     2
1  Mon  Dinner     4
2  Tue   Lunch     1

From a CSV File

You can create a DataFrame from a CSV file using the read_csv function, as demonstrated in our previous article, Effortlessly Import Data with Pandas: A Guide to CSV.
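For a quick illustration, the sketch below reads CSV text from an in-memory buffer so that it runs without a file on disk. The CSV content is made up for this example; in practice you would usually pass a file path to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical CSV content; normally this would live in a file
csv_text = """day,time,size
Sun,Dinner,2
Mon,Dinner,4
Tue,Lunch,1
"""

# read_csv accepts a path or any file-like object
df_csv = pd.read_csv(io.StringIO(csv_text))

print(df_csv)
```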

From Other DataFrames

To create a DataFrame from another DataFrame, you can use the copy() method, which returns a new DataFrame with the same data and structure as the original. By default the copy is deep, so changes to the new DataFrame do not affect the original. This is useful when you want a separate DataFrame for further manipulation.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"]
}

original_df = pd.DataFrame(data)
new_df = original_df.copy()

print(new_df)

The output will be:

            Name  Auditory   Creator
0   Seter Design     10000  Vasiukou
1  Seter Develop    100000    Hontar
2    Seter Mngmt     50000    Hontar
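To see that the copy really is independent, the following sketch changes one value in the copy and then checks the original. It uses a shortened version of the data above.

```python
import pandas as pd

original_df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop"],
    "Auditory": [10000, 100000],
})
new_df = original_df.copy()

# Change a value in the copy only
new_df.loc[0, "Auditory"] = 0

print(original_df.loc[0, "Auditory"])  # 10000 (unchanged)
print(new_df.loc[0, "Auditory"])       # 0
```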

Selecting Data in DataFrames

Pandas DataFrames provide various ways to access dataframe columns, row labels, and column labels. You can also select multiple rows or filter the data based on certain conditions.

You can learn more about selecting data in DataFrames in our dedicated article.

Selecting Columns

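To select a single column, you can use the column name in square brackets or dot notation. Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method, so square brackets are the safer choice.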

import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Carol"],
    "Age": [25, 30, 35],
    "Gender": ["F", "M", "F"]
}

df = pd.DataFrame(data)

# Using square brackets
name_column = df["Name"]

# Using dot notation
age_column = df.Age

print("Name column:\n", name_column)
print("Age column:\n", age_column)

The output will be:

Name column:
 0    Alice
1      Bob
2    Carol
Name: Name, dtype: object
Age column:
 0    25
1    30
2    35
Name: Age, dtype: int64
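The sketch below shows two cases where dot notation does not give you the column and square brackets are needed. The column names "total bill" and "count" are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "total bill": [16.99, 14.29],  # contains a space, so df.total bill is a syntax error
    "count": [2, 3],               # same name as the DataFrame.count method
})

# Square brackets work for any column name
bill = df["total bill"]
count_col = df["count"]

# Dot notation resolves to the method, not the column
print(callable(df.count))  # True
print(count_col.tolist())  # [2, 3]
```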

To select multiple columns, pass a list of column names inside the square brackets.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"]
}

df = pd.DataFrame(data)

selected_columns = df[["Name", "Creator"]]
print(selected_columns)

The output will be:

            Name   Creator
0   Seter Design  Vasiukou
1  Seter Develop    Hontar
2    Seter Mngmt    Hontar

Selecting Rows

To select rows based on their index, you can use the iloc and loc attributes.

iloc selects rows by integer position, while loc selects rows by index label. With the default integer index the labels and positions coincide, as in the example below.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"]
}

df = pd.DataFrame(data)

# Selecting a row by integer index using iloc
row1 = df.iloc[1]

# Selecting a row by label-based index using loc
row2 = df.loc[2]

print("Row 1:\n", row1)
print("Row 2:\n", row2)

The output will be:

Row 1:
 Name        Seter Develop
Auditory           100000
Creator            Hontar
Name: 1, dtype: object
Row 2:
 Name        Seter Mngmt
Auditory          50000
Creator          Hontar
Name: 2, dtype: object
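The difference between the two becomes visible once the index is not the default 0, 1, 2. The sketch below uses string labels, which are an assumption made for this example, and also shows that slices behave differently: iloc excludes the end position, while loc includes the end label.

```python
import pandas as pd

df = pd.DataFrame(
    {"Auditory": [10000, 100000, 50000]},
    index=["design", "develop", "mngmt"],  # hypothetical string labels
)

by_position = df.iloc[0]      # first row, whatever its label is
by_label = df.loc["develop"]  # the row labeled "develop"

# Slicing: iloc stops before position 2, loc includes the label "develop"
first_two_by_position = df.iloc[0:2]
first_two_by_label = df.loc["design":"develop"]

print(first_two_by_position)
print(first_two_by_label)
```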

Selecting Data Based on Conditions

You can select data from a DataFrame based on conditions. This is useful when you want to filter data based on specific criteria.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"]
}

df = pd.DataFrame(data)

# Selecting rows where Auditory is greater than 20000
auditory_greater_than_20k = df[df["Auditory"] > 20000]

print(auditory_greater_than_20k)

The output will be:

            Name  Auditory Creator
1  Seter Develop    100000  Hontar
2    Seter Mngmt     50000  Hontar

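You can also combine multiple conditions with the & (and) and | (or) operators. Wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and ==.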

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
    "Gender": ["M", "M", "F"]
}

df = pd.DataFrame(data)

# Selecting female rows where Auditory is greater than 30000
female_auditory_greater_than_30k = df[(df["Auditory"] > 30000) & (df["Gender"] == "F")]

print(female_auditory_greater_than_30k)

The output will be:

          Name  Auditory Creator Gender
2  Seter Mngmt     50000  Hontar      F
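A few other common filtering patterns are sketched below: the | operator, negation with ~, membership tests with isin(), and the string-based query() method. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})

# OR: rows with a small audience or created by Vasiukou
small_or_vasiukou = df[(df["Auditory"] < 20000) | (df["Creator"] == "Vasiukou")]

# NOT: rows whose creator is not Hontar
not_hontar = df[~(df["Creator"] == "Hontar")]

# isin: rows whose name appears in a given collection
picked = df[df["Name"].isin(["Seter Design", "Seter Mngmt"])]

# query: the same kind of filter written as a string expression
large = df.query("Auditory > 20000")

print(large)
```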

Modifying DataFrames

When working with DataFrames, you may encounter missing values or missing data. You can handle these cases by dropping or filling them with appropriate values.

We have written a separate article about data manipulation; check it out for more detail.
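As a minimal sketch of the two options mentioned above, the example below drops rows with missing values and, alternatively, fills them. The missing value and the choice of the column mean as the fill value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, np.nan, 50000],  # one missing value, added for illustration
})

# Option 1: drop every row that contains a missing value
df_dropped = df.dropna()

# Option 2: keep all rows and fill the gap, here with the column mean
df_filled = df.fillna({"Auditory": df["Auditory"].mean()})

print(df_dropped)
print(df_filled)
```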

Adding Columns

To add a new column to a DataFrame, simply assign a new column name with the desired values.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Adding a new column 'Gender'
df["Gender"] = ["M", "M", "F"]

print(df)

The output will be:

            Name  Auditory   Creator Gender
0   Seter Design     10000  Vasiukou      M
1  Seter Develop    100000    Hontar      M
2    Seter Mngmt     50000    Hontar      F

Updating Columns

To update an existing column, simply reassign the values for that column.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Updating the 'Auditory' column by adding 50000 to each value
df["Auditory"] = df["Auditory"] + 50000

print(df)

The output will be:

            Name  Auditory   Creator
0   Seter Design     60000  Vasiukou
1  Seter Develop    150000    Hontar
2    Seter Mngmt    100000    Hontar

Deleting Columns

To delete a column from a DataFrame, you can use the drop method with the axis parameter set to 1.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Deleting the 'Creator' column
df = df.drop("Creator", axis=1)

print(df)

The output will be:

            Name  Auditory
0   Seter Design     10000
1  Seter Develop    100000
2    Seter Mngmt     50000
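An equivalent spelling that some readers find clearer is the columns keyword, which avoids having to remember which axis number means columns. A short sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})

# Same effect as df.drop("Creator", axis=1)
df_no_creator = df.drop(columns="Creator")

# Several columns can be dropped at once by passing a list
df_name_only = df.drop(columns=["Auditory", "Creator"])

# drop returns a new DataFrame; df itself still has all three columns
print(list(df_no_creator.columns))  # ['Name', 'Auditory']
print(list(df_name_only.columns))   # ['Name']
```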

Adding Rows

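To add rows to a DataFrame, use the pd.concat function, which takes a list of DataFrames to combine. A new row given as a dictionary must first be wrapped in a DataFrame. The older DataFrame.append method was deprecated in Pandas 1.4 and removed in Pandas 2.0.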

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Adding a new row using a dictionary
new_row = {"Name": "Seter AI", "Auditory": 40000, "Creator": "Smith"}
df = pd.concat([df, pd.DataFrame(new_row, index=[0])], ignore_index=True)

print(df)

The output will be:

            Name  Auditory   Creator
0   Seter Design     10000  Vasiukou
1  Seter Develop    100000    Hontar
2    Seter Mngmt     50000    Hontar
3       Seter AI     40000     Smith

Updating Rows

To update a row in a DataFrame, you can use the loc attribute with the desired index and reassign the values for that row.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Updating the second row with new values
df.loc[1] = ["Seter AI", 40000, "Smith"]

print(df)

The output will be:

           Name  Auditory   Creator
0  Seter Design     10000  Vasiukou
1      Seter AI     40000     Smith
2   Seter Mngmt     50000    Hontar

Deleting Rows

To delete a row from a DataFrame, you can use the drop method with the desired index.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Deleting the first row
df = df.drop(0)

print(df)

The output will be:

            Name  Auditory Creator
1  Seter Develop    100000  Hontar
2    Seter Mngmt     50000  Hontar

DataFrame Operations

Pandas DataFrames support various arithmetic operations and allow you to perform computations on data frame elements. You can also manipulate index values, find duplicate index values, and reshape the data by pivoting or melting. Check out our article for more information on sorting and manipulating Pandas DataFrames.
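The sketch below touches each of the operations just mentioned: column arithmetic, melting a wide table into long form, pivoting it back, and detecting duplicate index labels. The quarterly figures and all variable names are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop"],
    "Q1": [100, 200],  # hypothetical quarterly figures
    "Q2": [150, 250],
})

# Element-wise arithmetic between columns
df["Total"] = df["Q1"] + df["Q2"]

# Melt: one row per (Name, Quarter) pair instead of one column per quarter
long_df = df.melt(
    id_vars="Name", value_vars=["Q1", "Q2"],
    var_name="Quarter", value_name="Value",
)

# Pivot: back to one column per quarter
wide_df = long_df.pivot(index="Name", columns="Quarter", values="Value")

# Duplicate index labels: True marks the second and later occurrences
dup = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "a"])
dup_mask = dup.index.duplicated()

print(long_df)
print(wide_df)
print(dup_mask)  # [False False  True]
```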

Sorting DataFrames

To sort a DataFrame, you can use the sort_values method with the by parameter specifying the column to sort by.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Sorting by 'Auditory' in descending order
df_sorted = df.sort_values(by="Auditory", ascending=False)

print(df_sorted)

The output will be:

            Name  Auditory   Creator
1  Seter Develop    100000    Hontar
2    Seter Mngmt     50000    Hontar
0   Seter Design     10000  Vasiukou
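sort_values also accepts a list of columns, with a matching list of sort directions. The sketch below sorts by Creator first and then, within each creator, by Auditory in descending order.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})

# Creator ascending, then Auditory descending within each creator
df_sorted = df.sort_values(by=["Creator", "Auditory"], ascending=[True, False])

print(df_sorted)
print(df_sorted.index.tolist())  # [1, 2, 0]
```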

Aggregating DataFrames

To aggregate data in a DataFrame, you can use the groupby method with the desired columns to group by and the desired function to aggregate.

import pandas as pd

data = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

df = pd.DataFrame(data)

# Grouping by 'Creator' and aggregating by average 'Auditory'
df_grouped = df.groupby("Creator").agg({"Auditory": "mean"})

print(df_grouped)

The output will be:

          Auditory
Creator           
Hontar     75000.0
Vasiukou   10000.0
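When you need several statistics at once, named aggregation lets you choose the output column names. In the sketch below, the names projects, total_auditory, and max_auditory are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})

# Each keyword is an output column: (input column, aggregation function)
summary = df.groupby("Creator").agg(
    projects=("Name", "count"),
    total_auditory=("Auditory", "sum"),
    max_auditory=("Auditory", "max"),
)

print(summary)
```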

Merging DataFrames

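To merge two DataFrames, you can use the merge function with the columns to merge on and the type of merge. By default, merge performs an inner join, so only rows whose key appears in both DataFrames are kept. That is why "Seter Mngmt" and "Seter AI" are missing from the output below.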

import pandas as pd

data1 = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

data2 = {
    "Name": ["Seter Design", "Seter Develop", "Seter AI"],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merging on 'Name' column
df_merged = pd.merge(df1, df2, on="Name")

print(df_merged)

The output will be:

            Name  Auditory   Creator         City
0   Seter Design     10000  Vasiukou     New York
1  Seter Develop    100000    Hontar  Los Angeles
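Other merge types are selected with the how parameter. The sketch below reuses the same data with a left merge and an outer merge; indicator=True adds a _merge column recording which side each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})
df2 = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter AI"],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Left merge: keep every row of df1; unmatched rows get NaN in City
left = pd.merge(df1, df2, on="Name", how="left")

# Outer merge: keep rows from both sides and record where each came from
outer = pd.merge(df1, df2, on="Name", how="outer", indicator=True)

print(left)
print(outer)
```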

Join DataFrames

To combine two DataFrames on their index, you can use the join method:

import pandas as pd

data1 = {
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
}

data2 = {
    "Name": ["Seter Design", "Seter Develop", "Seter AI"],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Joining on 'Name' column
df_joined = df1.set_index("Name").join(df2.set_index("Name"))

print(df_joined)

This code joins the two DataFrames on their index, which set_index has set to the “Name” column. The resulting DataFrame contains the columns from both inputs. By default, join performs a left join: every row of df1 is kept, and rows with no match in df2 (here “Seter Mngmt”) get NaN in the City column.
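As a sketch of how the join type affects the result, the example below runs the same join twice, once with the default and once with how="inner". The variable names df_left and df_inner are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
})
df2 = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter AI"],
    "City": ["New York", "Los Angeles", "Chicago"],
})

# Default join: keeps all three rows of df1
df_left = df1.set_index("Name").join(df2.set_index("Name"))

# Inner join: keeps only names present in both DataFrames
df_inner = df1.set_index("Name").join(df2.set_index("Name"), how="inner")

print(df_left)
print(df_inner)
```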

DataFrame Methods:

FUNCTION            DESCRIPTION
index               Attribute holding the index (row labels) of the DataFrame
insert()            Inserts a column into a DataFrame at a given position
add()               Returns the element-wise addition of the DataFrame and other (binary operator add)
sub()               Returns the element-wise subtraction of the DataFrame and other (binary operator sub)
mul()               Returns the element-wise multiplication of the DataFrame and other (binary operator mul)
div()               Returns the element-wise floating division of the DataFrame and other (binary operator truediv)
unique()            Series method that returns the unique values in a column
nunique()           Returns the number of unique values in each column
value_counts()      Counts the number of times each unique value occurs within a Series
columns             Attribute holding the column labels of the DataFrame; assigning to it renames the columns
axes                Attribute holding a list of the DataFrame's axes (row labels and column labels)
isnull()            Returns a Boolean mask marking null values, useful for extracting rows with null values
notnull()           Returns a Boolean mask marking non-null values, useful for extracting rows with non-null values
between()           Series method that marks values falling within a given range, useful for filtering rows
isin()              Marks values that exist in a given collection, useful for filtering rows
dtypes              Attribute holding a Series with the data type of each column; its index is the DataFrame's columns
astype()            Converts a Series or DataFrame to a specified data type
values              Attribute holding a NumPy representation of the DataFrame: only the values, without the axis labels
sort_values()       Sorts a DataFrame in ascending or descending order by the given column or columns
sort_index()        Sorts a DataFrame by its index labels rather than its values; useful after combining DataFrames, when the index may be out of order
loc[]               Indexer that retrieves rows based on index label
iloc[]              Indexer that retrieves rows based on index position
ix[]                Legacy indexer that accepted either labels or positions; removed in Pandas 1.0, use loc[] or iloc[] instead
rename()            Changes index labels or column names
drop()              Deletes rows or columns from a DataFrame
pop()               Removes a single column from a DataFrame and returns it as a Series
sample()            Pulls out a random sample of rows or columns from a DataFrame
nsmallest()         Pulls out the rows with the smallest values in a column
nlargest()          Pulls out the rows with the largest values in a column
shape               Attribute holding a tuple that represents the dimensionality of the DataFrame
ndim                Attribute holding an int with the number of axes: 1 for a Series, 2 for a DataFrame
dropna()            Drops rows or columns that contain null values, with several options for what counts as droppable
fillna()            Replaces NaN values with a value of your choice
rank()              Ranks the values in a Series or DataFrame
query()             Alternate string-based syntax for extracting a subset of a DataFrame
copy()              Creates an independent copy of a Pandas object
duplicated()        Returns a Boolean Series marking rows that duplicate an earlier row
drop_duplicates()   Removes duplicate rows from a DataFrame
set_index()         Sets the DataFrame index (row labels) using one or more existing columns
reset_index()       Resets the index of a DataFrame to the default integers from 0 to the number of rows minus 1
where()             Keeps values where a condition holds; by default, values that do not satisfy the condition are replaced with NaN
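A few of the entries above in action, as a minimal sketch on the data used throughout this tutorial. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
    "Auditory": [10000, 100000, 50000],
    "Creator": ["Vasiukou", "Hontar", "Hontar"],
})

top_two = df.nlargest(2, "Auditory")      # rows with the two largest audiences
n_creators = df["Creator"].nunique()      # number of distinct creators
counts = df["Creator"].value_counts()     # occurrences of each creator
masked = df["Auditory"].where(df["Auditory"] > 20000)  # NaN where the condition fails

print(df.shape)    # (3, 3)
print(n_creators)  # 2
print(masked)
```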

Frequently asked questions

What is the difference between a DataFrame and a Series?

A DataFrame is a two-dimensional, size-mutable data structure with labeled axes (rows and columns) whose columns can hold different data types. It can be thought of as a table of data, similar to a spreadsheet. A Series, on the other hand, is a one-dimensional labeled array whose values share a single data type. It can be thought of as a single column of a DataFrame or a single row of data.

When should I use a DataFrame instead of a NumPy array?

While both DataFrames and NumPy arrays are used for handling data, DataFrames are more suitable when working with heterogeneous data types, missing values, or labeled axes (rows and columns). NumPy arrays are better suited for numerical operations, homogeneous data types, and when performance is critical.

Can Pandas read and write formats other than CSV?

Yes, Pandas supports various data storage formats such as Excel, JSON, HTML, HDF5, Parquet, Stata, SAS, and SQL databases, as well as Google BigQuery through the pandas-gbq package. You read these formats with functions such as read_excel(), read_json(), and read_sql(), and write them with DataFrame methods such as to_excel(), to_json(), and to_sql().
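As one example of a round trip, the sketch below writes a small DataFrame to a JSON string and reads it back. It uses an in-memory buffer so that it runs without touching the file system; the data is made up.

```python
import io

import pandas as pd

df = pd.DataFrame({"day": ["Sun", "Mon"], "size": [2, 4]})

# Serialize to a JSON string, one object per row
json_text = df.to_json(orient="records")

# Read it back; read_json accepts a path or a file-like object
df_back = pd.read_json(io.StringIO(json_text))

print(json_text)
print(df_back)
```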

How do I handle missing data in a DataFrame?

Pandas provides several methods for handling missing data, such as isna(), notna(), fillna(), dropna(), and interpolate(). Depending on your needs, you can fill missing values with a specified value, drop rows or columns that contain missing values, or interpolate missing values from their neighbors.
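A minimal sketch of isna() and interpolate() on made-up numbers: the gaps are first counted and then filled by linear interpolation, which is the default method.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Count the missing values in the column
n_missing = int(df["value"].isna().sum())

# Linear interpolation fills each gap from its neighbors
filled = df["value"].interpolate()

print(n_missing)        # 2
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```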

How do I check the data types of the columns in a DataFrame?

Use the dtypes attribute, which returns the data type of every column:

df.dtypes

How do I replace values in a DataFrame based on a condition?

Use the loc[] indexer with a Boolean condition to select the values, then assign the replacement:

df.loc[df['column1'] < 0, 'column1'] = 0
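Here is the same replacement as a runnable sketch on made-up data, together with one alternative: Series.mask, which returns a new Series instead of modifying the DataFrame in place.

```python
import pandas as pd

df = pd.DataFrame({"column1": [-2, 5, -1, 3]})

# Alternative: mask returns a new Series and leaves df untouched
replaced = df["column1"].mask(df["column1"] < 0, 0)

# In-place replacement with loc, as in the line above
df.loc[df["column1"] < 0, "column1"] = 0

print(replaced.tolist())       # [0, 5, 0, 3]
print(df["column1"].tolist())  # [0, 5, 0, 3]
```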

This FAQ chapter provides additional insights and addresses common concerns related to Pandas DataFrames. By complementing the main article, it should help users better understand and work with Pandas DataFrames.

Conclusion

In this article, we have covered the basics of Pandas DataFrames, including creating, selecting, modifying, and aggregating data. We also discussed common DataFrame operations, such as sorting and merging. With these skills, you are ready to work with DataFrames in Pandas and explore the vast array of data manipulation and analysis techniques available in Python.
