Advanced data cleaning

This project was about preparing weekly sales data from wholesalers selling various products. The main problem was that the data came as multiple datasets inside ONE csv file.

What needs to be done?

As seen below there are 2 empty lines between each dataset and 6 rows (headlines) with info about the next dataset. The information from 5 out of these 6 rows needs to be extrated and kept as columns in the final dataset. (The first headline is not interesting, so that can be skipped)

The sales for each product is recorded in 6 corresponding to 6 KPIs; Sales value, Sales units, Transactions, Buying customer, Promo sales and Average price per sales unit. So the number of columns are 6 x [n of products] plus an aggregate of all the products (“All products”, first 6 columns after the Time column). We want to change this “wide” format into a long format so we only have 13 columns in the end. One column for date info (Time), one for product info, the 6 KPIs and the 5 columns extracted from the headlines. Notice there are two header rows after the headlines (multiindex), where the first contains the product info and the second contains the KPI info.

The approach

The approach I took to go from this csv containing multiple datasets to one big dataset in long format, was the following:

Locate each dataset and save them seperately as dataframes in a dictionnary.
Extract the info from the five headlines in each dataframe.
Concat the header info into one row and melt the data into long format for each dataframe.
Stack all the dataframes into one big final dataset.

Let’s begin

Firstly, if any value is missing it means there are no sales for that specific product in that given week. So let us replace these with 0. This also helps us to locate where each new dataset begins, as all other values in the first column (col_0) will be either headlines or date info.

df = df.fillna(0)

Next, we can locate all the row indexes where the first column (col_0) is 0:

indexes = df.index[df['col_0'] == 0].tolist()[::2]

When slicing the list with “::2”, we take every other value in the list. Remember there are 2 empty rows between each dataset in the csv file. We just need one of them to be able to navigate.

Now we can slice the csv files based on these row indexes and save each dataframe to a dictionnary.

all_df = {}
all_df['df_1'] = df.iloc[:indexes[0]]
df_num = 2
index_before = indexes[0]

for index in indexes[1:]:
    all_df['df_{}'.format(df_num)] = df.iloc[index_before + 2:index]
    df_num += 1
    index_before = index

We loop through the index list created before and keep track the index used in the former loop, so we are sure we also move the starting row for the next dataframe (index_before).

For each dataframe we need to save the info for the 5 headlines every time as these may change. We loop through each dataframe in the dictionnary and save the text after the “:” to a column named the same as the text before the “:”.

for df in all_df.keys():
    # Adding header info to values in designated columns
    all_df[df]['Geography'] = all_df[df].iloc[1,0].split(':')[1]
    all_df[df]['Customer Type'] = all_df[df].iloc[2,0].split(':')[1]
    all_df[df]['Purchase Channel'] = all_df[df].iloc[3,0].split(':')[1]
    all_df[df]['National Account'] = all_df[df].iloc[4,0].split(':')[1]
    all_df[df]['Supplier'] = all_df[df].iloc[5,0].split(':')[1]
    all_df[df] = all_df[df][6:]

    # Adding header info column names to the second line as these are the column names we are going to use
    extra_cols = ['Geography', 'Customer Type', 'Purchase Channel', 'National Account', 'Supplier']
    for item in range(5):
        all_df[df].iloc[1,-5:][item] = extra_cols[item]

    # Copy second row to column names
    all_df[df].columns = all_df[df].iloc[1]

    # Drop second row as it has been copied to the column names
    all_df[df] = all_df[df].drop(all_df[df].index[1])

The last three sections above does a bit of house keeping to make sure we can get rid of the multiindex in the column names in the next part. We want the column names for these 5 new columns to be in the first level of the mulitiindex.

Now we are ready to change the format of all the dataframes from wide to long format by using the pd.melt() function. We dont want to melt the 5 new columns as these contain info about each entire dataframe, so we save these in a seperate dataframe. This is also so we can fix the multiindex in the rest of the dataframe. We concat the two column rows into one with “XXXX” in between so we have something to split on later. Then we melt everything in each dataframe, but keep the “Time” column intact.

The product info and KPIs are now stored as column values in the “variable” column generated by the melt function. We split this column using the “XXXX” into two columns called “product” and “column”. Now we can pivot the data so the KPI info in the “column” column becomes our header. Finally, we add the 5 columns we seperated in the beginning of the loop.

for df in all_df.keys():
    # Splitting the last 5 coloumns from the dataframe
    df_melt = all_df[df].iloc[:, :-5].copy()
    last_five = all_df[df].iloc[:, -5:].copy()

    # Adding the information in the first row to the coloumn names so we can do the melt
    df_melt.columns = df_melt.iloc[0] + 'XXXX' + df_melt.columns
    df_melt.rename(columns={'TimeXXXXTime': 'Time'}, inplace=True)

    # Melting the dataframe, but keeping the Time coloumn as is
    df_melt = df_melt.iloc[1:, :].melt(id_vars='Time')

    # Spliting the variable coloumn name on the XXXX and putting the values into Product and Column coloumns
    df_melt['product'] = df_melt['variable'].apply(lambda x : x.split('XXXX')[0]).values
    df_melt['column'] = df_melt['variable'].apply(lambda x : x.split('XXXX')[1]).values

    # Converting the value coloumn to floats so the pivot will run faster
    df_melt['value'] = pd.to_numeric(df_melt['value'])

    # Pivoting the data into the right format
    df_melt = df_melt.pivot_table(index=['Time', 'product'], columns='column', values='value', aggfunc='sum').reset_index()

    # Adding the last 5 coloumns back into the melted and pivoted dataframe
    for column_name, value in last_five.iloc[0].reset_index().values:
        df_melt[column_name] = value
    all_df[df] = df_melt

The melt function is usually able to handle multiindexes, so it is not necessary to apply an ad hoc solution using “XXXX” to split on etc. However, as I recall it caused me some problems in this case, so I had to be creative.

Finally, we loop through all the dataframes in the dictionnary and stack them on top of eachother into one big dataframe with about 2.3 million rows and 13 columns.

# Stacking all the dataframes into one big dataframe
stacked_df = all_df['df_1']
for df in all_df.keys():
    if df == 'df_1':
        continue
    stacked_df = pd.concat([stacked_df, all_df[df]])

The result

Screeshot of the final dataset in Dataiku:

Further work could be to use some regex to split the info in the “product” column into multiple new columns like product_id, count_per_pack, package_type, brand/description and packsize.