
Feature Engineering

Feature engineering is one of the most important parts of model building. Collecting relevant features, and creating new ones from existing ones, is often what determines how well a model predicts.

They can be classified broadly as follows.

Type              Description
Aggregations      recompute a column by summarizing neighbouring values, e.g., those before or after each row
Transformations   change a feature into something more meaningful, e.g., an address into its spatial coordinates
Decompositions    break a feature into several, e.g., time series decomposition
Interactions      create a new feature by combining two or more features

Feature engineering usually requires a good understanding of the domain in order to generate useful features. Below are some non-exhaustive examples to get you started.
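As a rough illustration of three of these categories, the sketch below builds one aggregation, one transformation and one interaction with pandas. The toy data and every column name are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name here is illustrative
df = pd.DataFrame({
    "customer": ["a", "a", "b", "b"],
    "price": [10.0, 20.0, 5.0, 15.0],
    "quantity": [1, 2, 4, 1],
})

# Aggregation: mean price per customer, broadcast back onto each row
df["cust_mean_price"] = df.groupby("customer")["price"].transform("mean")

# Transformation: log-scale a skewed numeric feature
df["log_price"] = np.log1p(df["price"])

# Interaction: combine two features into a new one
df["revenue"] = df["price"] * df["quantity"]
```

Which of these actually help depends on the problem, so treat them as candidates to validate rather than as guaranteed improvements.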

Decomposition

Datetime Breakdown

Very often, the date and the time of day interact strongly with your other predictors. The function below pulls those components out of a timestamp column.

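import pandas as pd

def extract_time(df):
    """Add hour, month, day and day-of-week columns derived from df['timestamp']."""
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['mth'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['dayofweek'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
    return df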
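A quick way to check the breakdown is to apply the same `.dt` accessors to a couple of known timestamps. The two dates below are arbitrary examples, and the `is_weekend` flag is an extra illustration rather than part of the function above.

```python
import pandas as pd

# Two arbitrary timestamps used only to illustrate the extracted parts
df = pd.DataFrame({"timestamp": ["2021-03-15 08:30:00", "2021-12-25 18:00:00"]})
df["timestamp"] = pd.to_datetime(df["timestamp"])

df["hour"] = df["timestamp"].dt.hour
df["mth"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df["dayofweek"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6

# A derived flag built on top of dayofweek: weekend vs. weekday
df["is_weekend"] = (df["dayofweek"] >= 5).astype(int)
```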

To flag public holidays, use the holidays package.

import holidays
# build the calendar once instead of once per row
us_holidays = holidays.US()
train['holiday'] = train['timestamp'].apply(lambda x: int(x in us_holidays))

Time Series Decomposition

This is a popular way to decompose a time series into three components: trend (long-term), seasonality (short-term) and residuals (noise). The components can be combined in two ways:

Type             Description
Additive         each component is added to the other components to give the overall forecast value
Multiplicative   each component is multiplied by the other components to give the overall forecast value

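Usually, an additive model is used when the size of the seasonal variation stays roughly constant over time. A multiplicative model fits better when the seasonal swings grow or shrink with the level of the series.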
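Written out, with $y_t$ the observed value, $T_t$ the trend, $S_t$ the seasonal component and $R_t$ the residual (symbols chosen here for illustration), the two models are:

```latex
\begin{align*}
\text{Additive:} \quad & y_t = T_t + S_t + R_t \\
\text{Multiplicative:} \quad & y_t = T_t \times S_t \times R_t
\end{align*}
```

For strictly positive series, taking logs turns the multiplicative model into an additive one, since $\log y_t = \log T_t + \log S_t + \log R_t$.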

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# final2 needs a DatetimeIndex with a set frequency; otherwise pass period=...
res = sm.tsa.seasonal_decompose(final2['avg_mth_elect'],
                                model='multiplicative')

# plot
res.plot()

# set decomposed parts into dataframe
decomp=pd.concat([res.observed, res.trend, res.seasonal, res.resid], axis=1)
decomp.columns = ['avg_mth','trend','seasonal','residual']
decomp.head()

Automated Feature Engineering

FeatureTools

FeatureTools is extremely useful when you have a base table plus other tables that are related to it.

We first create an EntitySet, which is like a database. Then we add entities, i.e., individual tables that each have a unique id, and define the relationships between them.

import featuretools as ft

def make_entityset(data):
    # pre-1.0 FeatureTools API; from 1.0 these methods are add_dataframe / normalize_dataframe
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                            entity_id='recordings',
                            index='index',
                            time_index='time')

    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='engines',
                        index='engine_no')

    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='cycles',
                        index='time_in_cycles')
    return es

es = make_entityset(data)
es

We then use Deep Feature Synthesis (dfs) to generate features automatically.

Primitives are the types of new features to extract from the datasets. They are either aggregations (data is combined) or transformations (data is changed via a function). The full list is available via ft.primitives.list_primitives(). External primitives, such as those from tsfresh, or custom calculations can also be plugged into FeatureTools.

# target_entity must name an entity in the EntitySet; 'engines' is assumed here
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='engines',
                                       agg_primitives=['last', 'max', 'min'],
                                       trans_primitives=[],
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)
# see all old & new features created
feature_matrix.columns
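To make the `last`, `max` and `min` aggregation primitives concrete, the sketch below computes the same kind of per-engine summaries by hand in pandas. It is a stand-in written without FeatureTools, and the `sensor_1` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical recordings table: one row per engine per cycle
recordings = pd.DataFrame({
    "engine_no": [1, 1, 1, 2, 2],
    "time_in_cycles": [1, 2, 3, 1, 2],
    "sensor_1": [0.5, 0.7, 0.6, 1.2, 1.1],
})

# Hand-written equivalent of the 'last', 'max' and 'min' aggregations:
# one row per engine, one column per aggregation of the child rows
engine_features = (
    recordings.sort_values("time_in_cycles")
    .groupby("engine_no")["sensor_1"]
    .agg(["last", "max", "min"])
)
```

Deep Feature Synthesis automates this pattern across every eligible column and relationship, which is why the feature count grows quickly as primitives and depth are added.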

FeatureTools appears to be a very powerful automated feature extractor.

tsfresh

tsfresh is a feature extraction package for time series. It can extract more than 1200 different features and then keep only those deemed relevant. In essence, it is a univariate feature extractor.

To extract all possible features...

import pandas as pd
from tsfresh import extract_features

def list_union_df(fault_list):
    """Convert a list of fault samples, each holding a single signal,
    into one dataframe with an id for each fault sample.
    Data transformation prior to feature extraction.
    """
    # convert nested list into dataframe
    dflist = []
    # give an id field for each fault sample
    for sample_id, sample in enumerate(fault_list):
        df = pd.DataFrame(sample)
        df['id'] = sample_id
        dflist.append(df)

    df = pd.concat(dflist)
    return df

df = list_union_df(fault_list)

# tsfresh
extracted_features = extract_features(df, column_id='id')
# delete columns which only have one value for all rows
for i in extracted_features.columns:
    col = extracted_features[i]
    if len(col.unique()) == 1:
        del extracted_features[i]
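The key point in the preparation step is the long format: one row per observation, with an id column saying which sample the row belongs to. A minimal sketch of that shape, using made-up signal values and an illustrative `value` column name, looks like this:

```python
import pandas as pd

# Hypothetical fault samples: each inner list is one signal of arbitrary length
fault_list = [[0.1, 0.3, 0.2], [1.0, 0.9]]

frames = []
for sample_id, sample in enumerate(fault_list):
    frame = pd.DataFrame({"value": sample})
    frame["id"] = sample_id                         # which sample the row belongs to
    frame["time_steps"] = list(range(len(sample)))  # restarts at 0 for every sample
    frames.append(frame)

long_df = pd.concat(frames, ignore_index=True)
```

Samples may have different lengths, because features are computed per id rather than per row. The `time_steps` column is optional here and only matters when the rows need an explicit ordering.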

To generate only relevant features...

from tsfresh import extract_relevant_features

# y: the target vector, a pd.Series indexed by sample id
#    its length is the number of samples, not the length of the entire time series
# column_sort: the column that orders the rows within each sample;
#    time_steps restarts for every sample
# fdr_level: false discovery rate, 0.05 by default
#    the expected percentage of irrelevant features among those kept
#    lower it to retain fewer created features, raise it to retain more

features_filtered_direct = extract_relevant_features(
    timeseries, y,
    column_id='id',
    column_sort='time_steps',
    fdr_level=0.05)
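A common stumbling block is the shape of `y`: it holds one label per sample, not one per row. The sketch below builds a tiny pair of inputs and checks that they line up before any call to tsfresh. The data and labels are invented, and the tsfresh call itself is left out because two samples are far too few for a meaningful relevance test.

```python
import pandas as pd

# Hypothetical long-format input: two samples with ids 0 and 1
timeseries = pd.DataFrame({
    "id": [0, 0, 0, 1, 1],
    "time_steps": [0, 1, 2, 0, 1],
    "value": [0.1, 0.3, 0.2, 1.0, 0.9],
})

# One label per sample, indexed by the same ids used in the 'id' column
y = pd.Series([0, 1], index=[0, 1])

# Sanity checks to run before extract_relevant_features(timeseries, y, ...)
ids = timeseries["id"].unique()
same_length = len(y) == len(ids)
same_ids = set(y.index) == set(ids)
```

If either check is false, fix the index of `y` first; mismatched ids are easier to diagnose here than inside the extraction call.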