[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

pandas loc on str lower for column comparison

Sayth Renshaw <flebber.crue at gmail.com> writes:

> Hi
> I am importing 4 columns into a dataframe from a spreadsheet.
> My goal is to create a 5th column with TRUE or False if column 4 (str) matches column 3.
> Trying to leverage this answer https://stackoverflow.com/a/35940955/461887
> This is my code 
> import pandas as pd
> xls = pd.ExcelFile("Melbourne.xlsx")
> df = xls.parse('Sheet1', skiprows= 4)
> df1 = df[['UID','Name','New Leader','Current Team', 'New Team']]
> df1['Difference'] = df1['Current
> Team'].str.lower().str.replace('s/+',"") == df1['New
> Team'].str.lower().str.replace('s/+',"")
> Which gives this error
> C:\Users\u369811\AppData\Local\Continuum\anaconda3\lib\site-packages\ipykernel_launcher.py:6:
> SettingWithCopyWarning:
> A value is trying to be set on a copy of a slice from a DataFrame.

I think what happens is that df1 is not a copy of the subset of df that you want, but it is a VIEW on df instead. This means that df1 shares memory with df (for memory savings reasons). But if you would change this view by adding a column, where should it be put? In df? If so, where?

So the correct way to do this is to make df1 a copy rather than a view.

df1 = df.loc[:, ('UID','Name','New Leader','Current Team', 'New Team')]

And than it should work.
Except that the str.replace is wrong for what you want.  It just replaces the literal string "s/+" with an empty string instead of white space. This was wrong in the stackoverflow post.
To replace whitespace it should be str.replace('\\s+',"", regex=True). But simpler is to use str.strip():

df1['Difference'] = df1['Current Team'].str.lower().str.strip() == df1['New Team'].str.lower().str.strip()
Piet van Oostrum <piet-l at vanoostrum.org>
WWW: http://piet.vanoostrum.org/
PGP key: [8DAE142BE17999C4]