Data selection and filtering are essential data manipulation techniques for the data scientist, that allow you to extract specific data from a dataset. In this article, we will cover how to filter data in Pandas using various techniques.
To learn more about importing data into Pandas, take a look at our article on importing CSV files with Pandas for a detailed guide.
Filtering Data in Pandas Dataframe
Filtering Data Using Boolean Indexing
Boolean indexing is a powerful and flexible technique that allows you to filter data based on a condition. It involves passing a Boolean array of the same length as the DataFrame or Series being filtered to the indexing operator, which returns the subset of data where the Boolean array is True.
Note: For a more in-depth look at Pandas DataFrames and their features, check out our article on understanding Pandas DataFrames for a comprehensive overview.”
Filter Rows Based on a Condition
To filter rows based on a condition, we create a Boolean array by applying a condition to one or more columns of the DataFrame. We can then pass this Boolean array to the indexing operator to select only the rows where the condition is True.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [10000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"]
}
df = pd.DataFrame(data)
# Selecting rows where Auditory is greater than 20000
auditory_greater_than_20k = df[df["Auditory"] > 20000]
print(auditory_greater_than_20k)
The output will be:
Name Auditory Creator 1 Seter Develop 100000 Hontar 2 Seter Mngmt 50000 Hontar
In this example, we create a Boolean array by comparing the “Auditory” column of the DataFrame to the value 20000. We then pass this Boolean array to the indexing operator of the DataFrame to select only the rows where the condition is True.
Filter Rows Based on Multiple Conditions
We can also filter rows based on multiple conditions by combining Boolean arrays using logical operators such as &
(and), |
(or), and ~
(not).
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
# Selecting female rows where Auditory is greater than 30000
female_auditory_greater_than_30k = df[(df["Auditory"] > 30000) & (df["Gender"] == "F")]
# Selecting rows where the auditory is greater than or equal to 50000 or the gender is "F"
female_or_auditory = df[(df["Auditory"] >= 50000) | (df["Gender"] == "F")]
# Selecting rows where the auditory is greater than 50000 and the creator is not Hontar
auditory_and_not_creator = df[(df["Auditory"] > 50000) & ~(df["Creator"] == "Hontar")]
print(female_auditory_greater_than_30k)
print(female_or_auditory)
print(auditory_and_not_creator)
The output will be:
Name Auditory Creator Gender 2 Seter Mngmt 50000 Hontar F Name Auditory Creator Gender 0 Seter Design 100000 Vasiukou M 1 Seter Develop 100000 Hontar M 2 Seter Mngmt 50000 Hontar F Name Auditory Creator Gender 0 Seter Design 100000 Vasiukou M
In this example, the code selects the rows where the “Auditory” is greater than 30000 and the “Gender” is “F”. This is done by creating a Boolean array with the condition “(df[“Auditory”] > 30000) & (df[“Gender”] == “F”)”, and then passing it to the DataFrame using square brackets.
Next, the code selects the rows where the “Auditory” is greater than or equal to 50000 or the “Gender” is “F”. This is done using the |
operator to combine the two conditions into a single Boolean array.
Finally, the code selects the rows where the “Auditory” is greater than 50000 and the “Creator” is not “Hontar”. This is done using the &
and ~
operators to combine and negate the two conditions, respectively.
Filter Columns Based on a Condition
We can also filter columns based on a condition by passing a Boolean array to the indexing operator of the DataFrame.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
# 1. Filter columns where at least one element has a value greater than 70000 (applies to numeric columns)
numeric_columns = df.select_dtypes(include=[pd.np.number])
filtered_columns = numeric_columns.loc[:, (numeric_columns > 70000).any()]
print("Filtered columns with at least one element greater than 70000:\n", filtered_columns)
# 2. Filter columns where all elements are of type string
filtered_columns = df.loc[:, df.applymap(lambda x: isinstance(x, str)).all()]
print("\nFiltered columns with all elements of type string:\n", filtered_columns)
# 3. Filter columns where the column name contains the letter 'e'
filtered_columns = df.loc[:, [col for col in df.columns if 'e' in col]]
print("\nFiltered columns with column names containing the letter 'e':\n", filtered_columns)
# 4. Filter columns where the number of unique values is greater than or equal to 2
filtered_columns = df.loc[:, df.nunique() >= 2]
print("\nFiltered columns with the number of unique values greater than or equal to 2:\n", filtered_columns)
The output will be:
Filtered columns with at least one element greater than 70000: Auditory 0 100000 1 100000 2 50000 Filtered columns with all elements of type string: Name Creator Gender 0 Seter Design Vasiukou M 1 Seter Develop Hontar M 2 Seter Mngmt Hontar F Filtered columns with column names containing the letter 'e': Name Creator Gender 0 Seter Design Vasiukou M 1 Seter Develop Hontar M 2 Seter Mngmt Hontar F Filtered columns with the number of unique values greater than or equal to 2: Name Auditory Creator Gender 0 Seter Design 100000 Vasiukou M 1 Seter Develop 100000 Hontar M 2 Seter Mngmt 50000 Hontar F
- Filtering columns where at least one element has a value greater than 70000 (applies to numeric columns).
- Filtering columns where all elements are of type string.
- Filtering columns where the column name contains the letter ‘e’.
- Filtering columns where the number of unique values is greater than or equal to 2.
Filtering Data with query() Function
The query()
function allows you to filter data by specifying a query string. The query string should be a valid Python expression that evaluates to a boolean value. Here’s an example:
# Filter data where Age is greater than 30 and Gender is 'M'
filtered_data = df.query("Age > 30 and Gender == 'M'")
print(filtered_data)
This code will return all rows where the Age is greater than 30 and Gender is ‘M’. The query string is passed as an argument to the query()
function.
Filtering Data with isin() Function
The isin()
function is used to filter data based on a list of values. It returns a boolean mask that can be used to filter the DataFrame. Here’s an example:
# Filter data where Name is 'Alice', 'Bob', or 'Carol'
name_filter = df["Name"].isin(["Alice", "Bob", "Carol"])
filtered_data = df[name_filter]
print(filtered_data)
This code will return all rows where the Name is ‘Alice’, ‘Bob’, or ‘Carol’. The isin()
function returns a boolean mask that is used to filter the original DataFrame.
Advantages of Using query() and isin() Functions
Using the query()
and isin()
functions offers several advantages over other filtering methods. First, these functions provide a more concise and readable syntax for complex filtering operations. They also allow you to filter data without creating intermediate variables, which can help reduce memory usage and improve performance.
Additionally, the query()
function allows you to reference columns directly in the query string, which can make it easier to write complex queries. The isin()
function also provides an easy way to filter data based on a list of values, which is a common task in data analysis.
Selecting Data in Pandas Dataframe
Selecting Data Using Indexing Operators
In addition to filtering data, we can also use indexing operators to select specific rows and columns from a DataFrame or Series. There are several indexing operators available in Pandas, each with its own syntax and use cases.
Selecting Rows by Index Label
The .loc
operator is used to select rows and columns by the label. To select a single row by label, we pass the row label to the .loc
operator.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data, index=["A", "B", "C"])
# 1. Select row with label "A"
selected_row = df.loc["A"]
print(selected_row)
# 2. Select row with label "B"
selected_row = df.loc["B"]
print(selected_row)
# 3. Select multiple rows using labels "A" and "C"
selected_rows = df.loc[["A", "C"]]
print(selected_rows)
# 4. Select rows with labels between "A" and "C" (inclusive)
selected_rows = df.loc["A":"C"]
print(selected_rows)
The output will be:
Name Seter Design Auditory 100000 Creator Vasiukou Gender M Name: A, dtype: object Name Seter Develop Auditory 100000 Creator Hontar Gender M Name: B, dtype: object Name Auditory Creator Gender A Seter Design 100000 Vasiukou M C Seter Mngmt 50000 Hontar F Name Auditory Creator Gender A Seter Design 100000 Vasiukou M B Seter Develop 100000 Hontar M C Seter Mngmt 50000 Hontar F
- Select row with label “A”: In this example, we retrieve the row associated with the index label “A” from the DataFrame.
- Select row with label “B”: In this example, we retrieve the row associated with the index label “B” from the DataFrame.
- Select multiple rows using labels “A” and “C”: In this example, we retrieve multiple rows associated with the index labels “A” and “C” from the DataFrame.
- Select rows with labels between “A” and “C” (inclusive): In this example, we retrieve a range of rows from the DataFrame, starting with the row associated with index label “A” and ending with the row associated with index label “C”, including both of these rows.
Selecting Rows by Index Position
The .iloc
operator is used to select rows and columns by position. To select a single row by position, we pass the row position to the .iloc
operator.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data, index=["A", "B", "C"])
# Select row at position 2
selected_row = df.iloc[2]
print(selected_row)
The output will be:
Name Seter Mngmt Auditory 50000 Creator Hontar Gender F Name: C, dtype: object
In this example, we select the row at position 2 (i.e., the third row) by passing the position to the .iloc
operator of the DataFrame. The resulting subset DataFrame contains only the row at position 2.
Selecting Columns by Label or Position
We can also use the .loc
and .iloc
operators to select specific columns from a DataFrame. To select a single column by label, we pass the column label to the .loc
operator.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
# Select column "Name" by label
selected_column = df.loc[:, "Name"]
print(selected_column)
The output will be:
0 Seter Design 1 Seter Develop 2 Seter Mngmt Name: Name, dtype: object
In this example, we select the “Name” column by passing the label “Name” to the .loc
operator of the DataFrame. The resulting subset Series contains only the “Name” column.
To select a single column by position, we pass the column position to the .iloc
operator.
Select column at position 1
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
selected_column = df.iloc[:, 1]
print(selected_column)
The output will be:
0 100000 1 100000 2 50000 Name: Auditory, dtype: int64
In this example, we select the column at position 1 (i.e., the second column) by passing the position 1 to the `.iloc` operator of the DataFrame. The resulting subset Series contains only the “Age” column.
We can also select multiple columns by passing a list of column labels or positions to the `.loc` or `.iloc` operator.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
# Select columns "Name" and "Gender" by label
selected_columns = df.loc[:, ["Name", "Gender"]]
print(selected_columns)
The output will be:
Name Gender 0 Seter Design M 1 Seter Develop M 2 Seter Mngmt F
In this example, we select the “Name” and “Gender” columns by passing a list of labels ["Name", "Gender"]
to the .loc
operator of the DataFrame. The resulting subset DataFrame contains only the “Name” and “Gender” columns.
We can also select multiple columns by passing a list of positions to the .iloc
operator.
import pandas as pd
data = {
"Name": ["Seter Design", "Seter Develop", "Seter Mngmt"],
"Auditory": [100000, 100000, 50000],
"Creator": ["Vasiukou", "Hontar", "Hontar"],
"Gender": ["M", "M", "F"]
}
df = pd.DataFrame(data)
# Select columns at positions 0 and 2
selected_columns = df.iloc[:, [0, 2]]
print(selected_columns)
The output will be:
Name Creator 0 Seter Design Vasiukou 1 Seter Develop Hontar 2 Seter Mngmt Hontar
In this example, we select the columns at positions 0 and 2 (i.e., the first and third columns) by passing a list of positions [0, 2]
to the .iloc
operator of the DataFrame. The resulting subset DataFrame contains only the “Name” and “Gender” columns.
Selecting Multiple Rows
In this section, we will cover how to select multiple rows in a Pandas DataFrame using various methods. This is a useful technique when you want to analyze a specific subset of your data.
Using the `.loc[]` and `.iloc[]` methods
The `.loc[]` method is used for selecting rows based on index labels, while `.iloc[]` is used for selecting rows based on index positions. To select multiple rows, you can pass a list of labels or index positions.
import pandas as pd
# Sample DataFrame
data = {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]}
df = pd.DataFrame(data)
# Select multiple rows using index labels with .loc[]
rows_labels = [0, 2]
selected_rows_loc = df.loc[rows_labels]
# Select multiple rows using index positions with .iloc[]
rows_positions = [1, 3]
selected_rows_iloc = df.iloc[rows_positions]
print("Rows selected using .loc[]:")
print(selected_rows_loc)
print("\nRows selected using .iloc[]:")
print(selected_rows_iloc)
Using slicing
You can also use slicing to select a range of rows. This works with both .loc[]
and .iloc[]
methods. When using slicing, you need to specify the start and end index of the range. Note that when using .loc[]
, the end index is inclusive, while with .iloc[]
, the end index is exclusive.
# Select a range of rows using slicing with .loc[]
start_label, end_label = 1, 3
selected_rows_loc_slice = df.loc[start_label:end_label]
# Select a range of rows using slicing with .iloc[]
start_position, end_position = 1, 3
selected_rows_iloc_slice = df.iloc[start_position:end_position]
print("Rows selected using slicing with .loc[]:")
print(selected_rows_loc_slice)
print("\nRows selected using slicing with .iloc[]:")
print(selected_rows_iloc_slice)
Using boolean indexing
Another way to select multiple rows is by using boolean indexing. You can pass a boolean series or array with the same length as the DataFrame to filter rows based on a specific condition.
# Select rows where column 'A' is greater than 2
condition = df['A'] > 2
selected_rows_boolean = df[condition]
print("Rows selected using boolean indexing:")
print(selected_rows_boolean)
Frequently asked questions
Conclusion
In this article, we have covered the basics of data selection and filtering in Pandas. We have learned how to filter data based on conditions, how to select specific rows and columns using indexing operators, and how to combine multiple filtering conditions using logical operators.
Pandas offers a rich set of tools for data manipulation and analysis, and mastering these tools is essential for anyone working with data in Python. We hope this article has provided you with a solid foundation in Pandas data selection and filtering, and that you will be able to apply these techniques to your own data analysis projects.