osdir.com


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

pandas split and melt()


Sayth Renshaw wrote:

> Hi
> 
> Having fun with pandas filtering a work excel file.
> My current script opens selected and filters the data and saves as excel.
> 
> import pandas as pd
> import numpy as np
> 
> log = pd.read_excel("log_dump_py.xlsx")
> df = log.filter(items=['Completed', 'Priority', 'Session date',
> 'Consultant', 'Coach',
>        'Delivery Method', 'Focus of Coaching', 'Leader', 'Site',
>        'Coaching Description','Motor/Property',
>        ],)
> completed_tasks = df.loc[(df['Completed'] == 'Yes') &
> (df['Motor/Property'] == 'Motor') & (df['Delivery Method'] == 'Group
> Coaching')] print(completed_tasks.head(n=5))
> completed_tasks.to_excel("filtered_logs.xlsx")
> 
> This leaves me with a set of several columns. The main column of concern
> for this example is a consultant
> 
> Session date	Consultant
> 2019-06-21 11:15:00	WNEWSKI, Joan;#17226;#BALIN, 
Jock;#18139;#DUNE,
> Colem;#17230;
> 
> How can I split the consultant column, keep only names and drop the
> numbers and for every session date create a line with data and consultants
> name?
> 
> NB. There are varied amounts of consultants so splitting across columns is
> uneven. if it was even melt seems like it would be good
> https://dfrieds.com/data-analysis/melt-unpivot-python-pandas
> 
> 
> Thanks
> 
> Sayth

Since I didn't find a cool shortcut I decided to use brute force:

$ cat pandas_explode_column.py
import pandas as pd

df = pd.DataFrame(
    [
        [
            "2019-06-21 11:15:00",
            "WNEWSKI, Joan;#17226;#BALIN, Jock;#18139;#DUNE, Colem;#17230;"
        ],
        [
            "2019-06-22 10:00:00", "Doe, John;#42;Robbins, Rita;"
        ]
    ],
    columns=["Session date", "Consultant"]
)

def explode_consultants(consultants):
        consultants = (c.lstrip("#") for c in consultants.split(";"))
        return (c for c in consultants if c.strip("0123456789"))

def explode_column(df, column, split):
    for _index, row in df.iterrows():
        for part in split(row[column]):
            yield [part if c == column else row[c] for c in df.columns]

def explode(df, column, split):
    return pd.DataFrame(
        explode_column(df, "Consultant", split), columns=df.columns
    )

df2 = explode(df, "Consultant", explode_consultants)

print(df)
print(df2)
$ python3 pandas_explode_column.py
          Session date                                         Consultant
0  2019-06-21 11:15:00  WNEWSKI, Joan;#17226;#BALIN, Jock;#18139;#DUNE...
1  2019-06-22 10:00:00                       Doe, John;#42;Robbins, Rita;

[2 rows x 2 columns]
          Session date     Consultant
0  2019-06-21 11:15:00  WNEWSKI, Joan
1  2019-06-21 11:15:00    BALIN, Jock
2  2019-06-21 11:15:00    DUNE, Colem
3  2019-06-22 10:00:00      Doe, John
4  2019-06-22 10:00:00  Robbins, Rita

[5 rows x 2 columns]
$