Feature Engineering
Feature Engineering is one of the most important parts of model building. Collecting and creating relevant features from existing ones is often the key determinant of good predictive performance.
Features can be broadly classified as follows.
Type | Description |
---|---|
Aggregations | summary statistics (e.g., mean, min, max, count) computed over a group or window of rows of a feature |
Transformations | change a feature into something more meaningful, e.g., an address into its spatial coordinates |
Decompositions | break a feature into several features, e.g., time series decomposition |
Interactions | a new feature created by combining two or more features, e.g., their product or ratio |
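As a toy illustration of each category (hypothetical sales data, made up for this sketch; pandas only):

```python
import numpy as np
import pandas as pd

# hypothetical sales data, invented for illustration
df = pd.DataFrame({
    'store': ['A', 'A', 'B', 'B'],
    'price': [10.0, 20.0, 30.0, 40.0],
    'qty':   [2, 1, 5, 3],
    'date':  pd.to_datetime(['2021-01-01', '2021-01-02',
                             '2021-01-01', '2021-01-02']),
})

# Aggregation: per-store mean price, broadcast back to each row
df['store_avg_price'] = df.groupby('store')['price'].transform('mean')

# Transformation: log-scale a skewed column
df['log_price'] = np.log1p(df['price'])

# Decomposition: break the date into parts
df['dayofweek'] = df['date'].dt.dayofweek

# Interaction: revenue as price x quantity
df['revenue'] = df['price'] * df['qty']
```

Each new column can then be fed to the model alongside the originals.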
Feature engineering usually requires a good understanding of the domain in order to generate useful features. Below are some non-exhaustive examples to get you started.
Decomposition
Datetime Breakdown
Very often, the various dates and times of day have a strong relationship with your target. Here’s a helper to pull those values out.
import pandas as pd

def extract_time(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['mth'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['dayofweek'] = df['timestamp'].dt.dayofweek
    return df
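A quick check of the helper on a single toy timestamp (the function is repeated here only so the snippet runs on its own; it assumes the column is named `timestamp`):

```python
import pandas as pd

def extract_time(df):
    df['timestamp'] = pd.to_datetime(df['timestamp'])
    df['hour'] = df['timestamp'].dt.hour
    df['mth'] = df['timestamp'].dt.month
    df['day'] = df['timestamp'].dt.day
    df['dayofweek'] = df['timestamp'].dt.dayofweek
    return df

# one made-up timestamp: a Monday morning
toy = extract_time(pd.DataFrame({'timestamp': ['2021-03-15 08:30:00']}))
```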
To flag holidays, use the package holidays.
import holidays

us_holidays = holidays.US()  # build the calendar once instead of per row
train['holiday'] = train['timestamp'].apply(lambda x: 0 if us_holidays.get(x) is None else 1)
Time Series Decomposition
This is a popular decomposition method for time series, whereby the series is split into trend (long-term movement), seasonality (repeating short-term cycles), and residuals (noise). There are two ways the components can recombine:
Type | Description |
---|---|
Additive | the components are summed to form the series: observed = trend + seasonality + residual |
Multiplicative | the components are multiplied to form the series: observed = trend × seasonality × residual |
Usually, an additive model is used when the size of the seasonal variation stays roughly constant over time, and a multiplicative model when it grows or shrinks with the level of the series.
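The two models differ only in how the components recombine. A minimal numpy check with synthetic components (the numbers are made up, not from any real series):

```python
import numpy as np

# synthetic components, invented for illustration
trend = np.array([10.0, 11.0, 12.0, 13.0])

# additive: seasonality and residual are offsets around zero
seasonal_a = np.array([1.0, -1.0, 1.0, -1.0])
resid_a = np.array([0.1, -0.1, 0.0, 0.2])
additive = trend + seasonal_a + resid_a        # y_t = T_t + S_t + R_t

# multiplicative: seasonality and residual are factors around one
seasonal_m = np.array([1.1, 0.9, 1.1, 0.9])
resid_m = np.array([1.01, 0.99, 1.0, 1.02])
multiplicative = trend * seasonal_m * resid_m  # y_t = T_t * S_t * R_t
```

In the multiplicative series the seasonal swing scales with the trend level, which is why it suits series whose seasonal amplitude grows over time.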
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline

res = sm.tsa.seasonal_decompose(final2['avg_mth_elect'],
                                model='multiplicative')
# plot
res.plot()

# set decomposed parts into a dataframe
decomp = pd.concat([res.observed, res.trend, res.seasonal, res.resid], axis=1)
decomp.columns = ['avg_mth', 'trend', 'seasonal', 'residual']
decomp.head()
Automated Feature Engineering
FeatureTools
FeatureTools is extremely useful when you have a base table together with other tables that are related to it.
We first create an EntitySet, which acts like a database. We then create entities, i.e., individual tables each with a unique index, and define the relationships between them.
import featuretools as ft

def make_entityset(data):
    es = ft.EntitySet('Dataset')
    es.entity_from_dataframe(dataframe=data,
                             entity_id='recordings',
                             index='index',
                             time_index='time')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='engines',
                        index='engine_no')
    es.normalize_entity(base_entity_id='recordings',
                        new_entity_id='cycles',
                        index='time_in_cycles')
    return es

es = make_entityset(data)
es
We then use something called Deep Feature Synthesis (dfs) to generate features automatically.
Primitives are the types of new features to be extracted from the dataset. They can be aggregation (data is combined across rows) or transformation (data is changed via a function) extractors. The full list can be found via ft.primitives.list_primitives(). External primitives like tsfresh, or custom calculations, can also be plugged into FeatureTools.
feature_matrix, feature_names = ft.dfs(entityset=es,
                                       target_entity='engines',
                                       agg_primitives=['last', 'max', 'min'],
                                       trans_primitives=[],
                                       max_depth=2,
                                       verbose=1,
                                       n_jobs=3)

# see all old & new features created
feature_matrix.columns
FeatureTools appears to be a very powerful auto-feature extractor.
tsfresh
tsfresh is a feature extraction package for time series. It can extract more than 1200 different features, and filter out features that are deemed irrelevant. In essence, it is a univariate feature extractor.
To extract all possible features...
import pandas as pd
from tsfresh import extract_features

def list_union_df(fault_list):
    """Convert a list of faults, each with a single signal value,
    into a dataframe with an id for each fault sample.

    Data transformation prior to feature extraction.
    """
    dflist = []
    # give an id field to each fault sample
    for a, i in enumerate(fault_list):
        df = pd.DataFrame(i)
        df['id'] = a
        dflist.append(df)
    # convert the nested list into a single dataframe
    df = pd.concat(dflist)
    return df

df = list_union_df(fault_list)
# tsfresh
extracted_features = extract_features(df, column_id='id')

# delete columns which only have one value for all rows
for i in list(extracted_features.columns):
    if extracted_features[i].nunique() == 1:
        del extracted_features[i]
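The same constant-column filter can also be written as a single pandas selection. A sketch on a made-up toy frame:

```python
import pandas as pd

# toy extracted-features frame, invented for illustration
extracted = pd.DataFrame({
    'feat_a': [1.0, 2.0, 3.0],
    'feat_b': [0.0, 0.0, 0.0],   # constant -> carries no information
    'feat_c': [5.0, 5.0, 7.0],
})

# keep only columns with more than one distinct value
filtered = extracted.loc[:, extracted.nunique() > 1]
```

This avoids mutating the dataframe while looping over its columns.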
To generate only relevant features...
from tsfresh import extract_relevant_features

# y is the target vector;
#   its length = the number of samples in the timeseries,
#   not the length of the entire timeseries
# column_sort = for each sample in the timeseries,
#   the time_steps column will restart
# fdr_level = false discovery rate, default 0.05;
#   it is the expected percentage of irrelevant features:
#   tune down to retain fewer created features, tune up to retain more
features_filtered_direct = extract_relevant_features(
                                timeseries, y,
                                column_id='id',
                                column_sort='time_steps',
                                fdr_level=0.05)