Understanding hypothesis testing with Covid-19 case study (Z-test and t-test)
2020-05-31
Exploratory Data Analysis (EDA) example: Road safety dataset case study
2020-07-01
Show all

Pandas data selection using .loc and .iloc

8 mins read

When it comes to select data on a DataFrame, Pandas loc and iloc are two top favorites. They are quick, fast, easy to read, and sometimes interchangeable.

In this post, we’ll explore the differences between loc and iloc, take a looks at their similarities, and check how to perform data selection with them. We will go over the following topics:

  1. Differences between loc and iloc
  2. Selecting via a single value
  3. Selecting via a list of values
  4. Selecting a range of data via slice
  5. Selecting via conditions and callable
  6. loc and iloc are interchangeable when labels are 0-based integers

1. Differences between loc and iloc

The main distinction between loc and iloc is:

  • loc is label-based, which means that you have to specify rows and columns based on their row and column labels.
  • iloc is integer position-based, so you have to specify rows and columns by their integer position values (0-based integer position).

Here are some differences and similarities between loc and iloc :

For demonstration, we create a DataFrame and load it with the Day column as the index.

df = pd.read_csv('data/data.csv', index_col=['Day'])

2. Selecting via a single value

Both loc and iloc allow input to be a single value. We can use the following syntax for data selection:

  • loc[row_label, column_label]
  • iloc[row_position, column_position]

For example, let’s say we would like to retrieve Friday’s temperature value.

With loc, we can pass the row label 'Fri' and the column label 'Temperature'.

# To get Friday's temperature
df.loc['Fri', 'Temperature']
10.51

The equivalent iloc statement should take the row number 4 and the column number 1 .

# The equivalent `iloc` statement
df.iloc[4, 1]
10.51

We can also use : to return all data. For example, to get all rows:

# To get all rows
df.loc[:, 'Temperature']
Day
Mon    12.79
Tue    19.67
Wed    17.51
Thu    14.44
Fri    10.51
Sat    11.07
Sun    17.50
Name: Temperature, dtype: float64

# The equivalent `iloc` statement
df.iloc[:, 1]

And to get all columns:

# To get all columns
df.loc['Fri', :]
Weather        Shower
Temperature     10.51
Wind               26
Humidity           79
Name: Fri, dtype: object

# The equivalent `iloc` statement
df.iloc[4, :]

Note that the above 2 outputs are Series. loc and iloc will return a Series when the result is 1-dimensional data.

3. Selecting via a list of values

We can pass a list of labels to loc to select multiple rows or columns:

# Multiple rows
df.loc[['Thu', 'Fri'], 'Temperature']
Day
Thu    14.44
Fri    10.51
Name: Temperature, dtype: float64

# Multiple columns
df.loc['Fri', ['Temperature', 'Wind']]
Temperature    10.51
Wind              26
Name: Fri, dtype: object

Similarly, a list of integer values can be passed to iloc to select multiple rows or columns. Here are the equivalent statements using iloc:

df.iloc[[3, 4], 1]Day
Thu    14.44
Fri    10.51
Name: Temperature, dtype: float64

df.iloc[4, [1, 2]]
Temperature    10.51
Wind              26
Name: Fri, dtype: object

All the above outputs are Series because their results are 1-dimensional data.

The output will be a DataFrame when the result is 2-dimensional data, for example, to access multiple rows and columns

# Multiple rows and columns
rows = ['Thu', 'Fri']
cols=['Temperature','Wind']
df.loc[rows, cols]

The equivalent iloc statement is:

rows = [3, 4]
cols = [1, 2]
df.iloc[rows, cols]

4. Selecting a range of data via slice

Slice (written as start:stop:step) is a powerful technique that allows selecting a range of data. It is very useful when we want to select everything in between two items.

loc with slice

With loc, we can use the syntax A:B to select data from label A to label (Both A and B are included):

# Slicing column labels
rows=['Thu', 'Fri']
df.loc[rows, 'Temperature':'Humidity' ]
# Slicing row labels
cols = ['Temperature', 'Wind']
df.loc['Mon':'Thu', cols]

We can use the syntax A:B:S to select data from label A to label with step size S (Both A and B are included):

# Slicing with step
df.loc['Mon':'Fri':2 , :]

iloc with slice

With iloc, we can also use the syntax n:m to select data from position n (included) to position m (excluded). However, the main difference here is that the endpoint (m) is excluded from the iloc result.

For example, selecting columns from positions 0 up to 3 (excluded):

df.iloc[[1, 2], 0 : 3]

Similarly, we can use the syntax n:m:s to select data from position n (included) to position m (excluded) with step size s. Notes that the endpoint m is excluded.

df.iloc[0:4:2, :]

5. Selecting via conditions and callable

Conditions

loc with conditions

Often we would like to filter the data based on conditions. For example, we may need to find the rows where humidity is greater than 50.

With loc, we just need to pass the condition to the loc statement.

# One condition
df.loc[df.Humidity > 50, :]

Sometimes, we may need to use multiple conditions to filter our data. For example, find all the rows where humidity is more than 50 and the weather is Shower:

## multiple conditions
df.loc[
    (df.Humidity > 50) & (df.Weather == 'Shower'), 
    ['Temperature','Wind'],
]

iloc with conditions

For iloc, we will get a ValueError if pass the condition straight into the statement:

# Getting ValueError
df.iloc[df.Humidity > 50, :]

We get the error because iloc cannot accept a boolean Series. It only accepts a boolean list. We can use the list() function to convert a Series into a boolean list.

# Single condition
df.iloc[list(df.Humidity > 50)]

Similarly, we can use list() to convert the output of multiple conditions into a boolean list:

## multiple conditions
df.iloc[
    list((df.Humidity > 50) & (df.Weather == 'Shower')), 
    :,
]

Callable function

loc with callable

loc accepts a callable as an indexer. The callable must be a function with one argument that returns valid output for indexing.

For example to select columns

# Selecting columns
df.loc[:, lambda df: ['Humidity', 'Wind']]

And to filter data with a callable:

# With condition
df.loc[lambda df: df.Humidity > 50, :]

iloc with callable

iloc can also take a callable as an indexer.

df.iloc[lambda df: [0,1], :]

To filter data with callable, iloc will require list() to convert the output of conditions into a boolean list:

df.iloc[lambda df: list(df.Humidity > 50), :]

6. loc and iloc are interchangeable when labels are 0-based integers

For demonstration, let’s create a DataFrame with 0-based integers as headers and index labels.

df = pd.read_csv(
    'data/data.csv', 
    header=None, 
    skiprows=[0],
)

With header=None, the Pandas will generate 0-based integer values as headers. With skiprows=[0], those headers WeatherTemperature, etc we have been using will be skipped.

Now, loc, a label-based data selector, can accept a single integer and a list of integer values. For example:

df.loc[1, 2]
19.67

df.loc[1, [1, 2]]
1    Sunny
2    19.67
Name: 1, dtype: object

The reason they are working is that those integer values (1 and 2) are interpreted as labels of the index. This use is not an integer position along with the index and is a bit confusing.

In this case, loc and iloc are interchangeable when selecting via a single value or a list of values.

df.loc[1, 2] == df.iloc[1, 2]
True

df.loc[1, [1, 2]] == df.iloc[1, [1, 2]]
1    True
2    True
Name: 1, dtype: bool

Note that loc and iloc will return different results when selecting via slice and conditions. They are essentially different because:

  • slice: endpoint is excluded from iloc result, but included in loc
  • conditions: loc accepts boolean Series, but iloc can only accept a boolean list.

Conclusion

Finally, here is a summary

loc is label based and allowed inputs are:

  • A single label 'A' or 2 (Note that 2 is interpreted as a label of the index.)
  • A list of labels ['A', 'B', 'C'] or [1, 2, 3] (Note that 1, 2, 3 are interpreted as labels of the index.)
  • A slice with labels 'A':'C' (Both are included)
  • Conditions, a boolean Series or a boolean array
  • callable function with one argument

iloc is integer position based and allowed inputs are:

  • An integer e.g. 2.
  • A list or array of integers [1, 2, 3].
  • A slice with integers 1:7(the endpoint 7 is excluded)
  • Conditions, but only accept a boolean array
  • callable function with one argument

loc and iloc are interchangeable when the labels of Pandas DataFrame are 0-based integers

I hope this article will help you to save time in learning Pandas data selection. I recommend you to check out the documentation to know about other things you can do.

Source:

https://towardsdatascience.com/how-to-use-loc-and-iloc-for-selecting-data-in-pandas-bd09cb4c3d79

Amir Masoud Sefidian
Amir Masoud Sefidian
Machine Learning Engineer

Comments are closed.