Basic use of Labelizer using Pandas in Python

Python provides data scientists with machine learning (ML) libraries such as scikit-learn. Most models cannot be trained directly on categorical data (such as Gender or US State), so these features must be pre-processed. This processing is known as encoding, and there are two options: label encoding and binary (binarizer) encoding.

Note: Labels are mutually exclusive, not orthogonal: a label encoder places every category on a single numeric axis. The results of a label encoder will need to be normalized later, while the output of a binarizer needs no further processing.

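Label encoding will convert a column into distinct numerical values. The order of these values carries no meaning (scikit-learn simply numbers the sorted unique values); all that matters is that each unique category has a unique number. For the same reason, label encoding does not preserve the ranking of ordinal values such as low/medium/high. Ordinal values must be converted manually for best results.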
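The difference between label encoding and a manual ordinal conversion can be sketched as follows. The column names, sample values and the `priority_rank` mapping are made up for illustration and are not part of the code in this write-up:

```python
import pandas
from sklearn import preprocessing

# Hypothetical sample data, for illustration only.
df = pandas.DataFrame({
    "state": ["TX", "CA", "NY", "CA"],
    "priority": ["medium", "low", "high", "low"],
})

# LabelEncoder numbers the sorted unique values: CA -> 0, NY -> 1, TX -> 2.
state_encoder = preprocessing.LabelEncoder()
df["state_code"] = state_encoder.fit_transform(df["state"])

# On an ordinal column the codes follow alphabetical order
# (high -> 0, low -> 1, medium -> 2), which loses the real ranking.
priority_encoder = preprocessing.LabelEncoder()
df["priority_label"] = priority_encoder.fit_transform(df["priority"])

# Manual conversion: an explicit mapping (an assumed ranking) keeps the order.
priority_rank = {"low": 0, "medium": 1, "high": 2}
df["priority_rank"] = df["priority"].map(priority_rank)

print(df)
```

Here `priority_label` puts "high" below "low", while `priority_rank` keeps the intended order.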

Bin/Binary/Binarizer encoding will turn one column of categorical values into a collection of columns, each holding values of 1 or 0. The number of columns created is equal to the number of distinct categories. It is important to give these columns appropriate names that are distinct and, ideally, indicative of the original category. There is a small ~~bug~~ feature with Binarizer in that a column with exactly two categorical values is converted to a single column of 1 or 0 (it's not entirely wrong, but sometimes you don't want to have to add the other category's column back in later). If you don't want this behavior, reprocess_binary_encoded_data is defined in the code below:


import pandas
from sklearn import preprocessing

def encode_with_label_binarizer(dataset, test_dataset, categorical_columns):
    created_columns = []
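    # Fit on train and test together so both share one encoding.
    # Note: concat does not raise on mismatched columns, it fills them with NaN.
    combined = pandas.concat([dataset, test_dataset])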

    for col in [x for x in combined.columns if x in categorical_columns]:
        if str(combined.dtypes[col]) != 'float64' and str(combined.dtypes[col]) != 'int64':
            label_encoder = preprocessing.LabelBinarizer()
            categorical_data_encoder = label_encoder.fit(combined[col])
            cols = categorical_data_encoder.classes_
            enc_data = reprocess_binary_encoded_data(categorical_data_encoder.transform(dataset[col]), cols)
            temp_df = pandas.DataFrame(data=enc_data, columns=categorical_data_encoder.classes_)
            for c in temp_df.columns:
                dataset.insert(dataset.columns.get_loc(col), "%s_%s" % (col, c), temp_df[c].values)
                created_columns.append("%s_%s" % (col, c))

            del dataset[col]

            test_enc_data = reprocess_binary_encoded_data(categorical_data_encoder.transform(test_dataset[col]), cols)
            test_temp_df = pandas.DataFrame(data=test_enc_data, columns=categorical_data_encoder.classes_)
            for c in test_temp_df.columns:
                test_dataset.insert(test_dataset.columns.get_loc(col), "%s_%s" % (col, c), test_temp_df[c].values)

            del test_dataset[col]

    return dataset, test_dataset, created_columns

	
def reprocess_binary_encoded_data(enc_data, cols):
    # LabelBinarizer emits a single 0/1 column when there are exactly two
    # classes; expand it to one column per class so it lines up with cols.
    if enc_data.shape[1] == 1 and len(cols) == 2:
        # A 0 marks cols[0] and anything else marks cols[1].
        return [[1, 0] if row[0] == 0 else [0, 1] for row in enc_data]

    # Any other shape already has one column per class.
    return enc_data

	
def encode_with_label_encoder(dataset, test_dataset, categorical_columns):
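    # Fit on train and test together so both share one encoding.
    # Note: concat does not raise on mismatched columns, it fills them with NaN.
    combined = pandas.concat([dataset, test_dataset])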

    for col in [x for x in combined.columns if x in categorical_columns]:
        label_encoder = preprocessing.LabelEncoder()
        categorical_data_encoder = label_encoder.fit(combined[col])
        dataset[col] = categorical_data_encoder.transform(dataset[col])
        test_dataset[col] = categorical_data_encoder.transform(test_dataset[col])

    return dataset, test_dataset, categorical_columns
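
Both functions fit the encoder on the concatenation of the training and test data. A minimal sketch of the likely reason, using made-up data in which one category appears only in the test set: an encoder fitted on the training data alone raises a ValueError when it meets an unseen label.

```python
import pandas
from sklearn import preprocessing

# Hypothetical train/test split in which "NY" appears only in the test data.
train = pandas.DataFrame({"state": ["CA", "TX", "CA"]})
test = pandas.DataFrame({"state": ["TX", "NY"]})

# Fitting on the training data alone leaves "NY" unknown to the encoder.
train_only = preprocessing.LabelEncoder().fit(train["state"])
try:
    train_only.transform(test["state"])
    unseen_label_failed = False
except ValueError:
    unseen_label_failed = True

# Fitting on the combined data gives both sets one shared mapping.
combined = pandas.concat([train, test])
shared = preprocessing.LabelEncoder().fit(combined["state"])
train_codes = shared.transform(train["state"])
test_codes = shared.transform(test["state"])

print(unseen_label_failed, list(shared.classes_), train_codes, test_codes)
```

The trade-off is that the test set has to be available at encoding time.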

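In both cases the original column does not survive, and both functions modify the DataFrames passed in. encode_with_label_binarizer inserts the new columns at the original column's position and then deletes it, while encode_with_label_encoder overwrites the column with the encoded integers.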
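
The two-category behavior of the binarizer described earlier can be checked directly. This is a minimal sketch with made-up category values. The `numpy.hstack` expansion at the end is an assumed, vectorized alternative to reprocess_binary_encoded_data, not the code used above:

```python
import numpy
from sklearn import preprocessing

# Three categories: one 0/1 column per class, in sorted class order
# (classes_ is ['blue', 'green', 'red']).
three = preprocessing.LabelBinarizer().fit(["red", "green", "blue"])
three_encoded = three.transform(["green", "red"])

# Two categories: a single column, where 1 marks the second class ("yes").
two = preprocessing.LabelBinarizer().fit(["no", "yes"])
two_encoded = two.transform(["yes", "no", "yes"])

# Expand the single column back to one column per class:
# the first column is the complement, the second is the original.
two_expanded = numpy.hstack([1 - two_encoded, two_encoded])

print(three_encoded.shape, two_encoded.shape, two_expanded.shape)
```

With three categories the result has three columns. With two it has one, and the expansion restores the second.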