Feature Engineering is the process of applying domain knowledge of the data to create new features that allow Machine Learning algorithms to work better, or to work at all. Creating new features is an iterative process.
New features can be created from the given data or come from external sources. For instance, in a dataset of sales per day, we can create one variable from the given data, the Trend, and another from an external source, the temperature per day. There are no fixed steps to follow when creating features; the only thing to keep in mind is that, once a feature is created, it must be defined as the correct type (Text, Categorical or Numerical). Boolean variables are another kind of feature that usually works well. For instance, in the same sales dataset, we can create a feature Is_Weekend with values 0 and 1 (False and True respectively).
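As a minimal sketch of the three kinds of features mentioned above, the snippet below uses pandas on a small made-up dataset. The column names (Date, Sales, Temperature), all the values, and the choice to express the Trend as "days since the first record" are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical daily sales data; names and values are invented for this example.
sales = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=14, freq="D"),  # starts on a Monday
    "Sales": [120, 130, 125, 140, 150, 210, 220,
              128, 135, 131, 146, 155, 215, 230],
})

# Feature created from the given data: a simple trend, assumed here to be
# the number of days elapsed since the first record (Numerical).
sales["Trend"] = (sales["Date"] - sales["Date"].min()).dt.days

# Boolean feature: 1 for Saturday and Sunday, 0 otherwise.
# pandas numbers weekdays from Monday=0 to Sunday=6.
sales["Is_Weekend"] = (sales["Date"].dt.dayofweek >= 5).astype(int)

# Feature from an external source: made-up daily temperatures joined by date.
temperature = pd.DataFrame({
    "Date": sales["Date"],
    "Temperature": [4.0, 5.5, 3.0, 6.0, 7.5, 8.0, 6.5,
                    5.0, 4.5, 3.5, 6.0, 7.0, 9.0, 8.5],
})
sales = sales.merge(temperature, on="Date", how="left")

print(sales.head(7))
```

Storing Is_Weekend as an integer rather than as text keeps the feature type explicit, which is the point made above about defining each feature with the correct type.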
The first thing we must remember is that not every feature, created or given, is usable. Imagine that, in the Sales dataset, we want to predict sales and we have another field, Customers. Customers and Sales are highly correlated, so it might be a mistake to use Customers as a feature. Additionally, this feature might not be available at prediction time: when predicting next-day sales, we will not know how many customers will visit our store. This is known as Data Leakage: providing features that already describe the label, which results in a high dependence on a single variable and in a performance that looks high during training and validation but will not hold at prediction time.
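The check described above can be sketched as follows. The data is invented, and the list `unavailable_at_prediction` is a hypothetical helper name used here only to make the decision explicit; it is not part of any library.

```python
import pandas as pd

# Hypothetical data: Customers is only known once the day is over.
df = pd.DataFrame({
    "Customers": [50, 55, 52, 60, 65, 90, 95],
    "Is_Weekend": [0, 0, 0, 0, 0, 1, 1],
    "Sales": [120, 130, 125, 140, 150, 210, 220],
})

# A correlation with the label that is close to 1 is a warning sign of leakage.
corr = df["Customers"].corr(df["Sales"])
print(f"Correlation between Customers and Sales: {corr:.3f}")

# Columns that will not be known when the prediction is made
# (an assumption we state by hand, based on domain knowledge).
unavailable_at_prediction = ["Customers"]

# Build the feature matrix without the label and without the leaky columns.
X = df.drop(columns=["Sales"] + unavailable_at_prediction)
y = df["Sales"]
print(list(X.columns))
```

A high correlation alone does not prove leakage; the deciding question is whether the value will be available at the moment of prediction.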
On the other hand, not every feature created or given has to be used. Some features might be too random or noisy, which can cause our model to perform poorly and fail to generalize. For this reason, it is important to check the Feature Importance after the model is created. It provides information on which features have the most impact on the result and helps us decide which ones to keep to improve model performance. Furthermore, there are algorithms that help to choose features (e.g. the Boruta algorithm). However, those algorithms are recommended only when the number of features is large.
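One possible way to inspect Feature Importance is sketched below with scikit-learn's RandomForestRegressor and its impurity-based `feature_importances_` attribute. The synthetic data, the feature names, and the choice of a random forest are all assumptions for illustration; other models and importance measures would work as well.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: two informative features and one column of pure noise.
rng = np.random.default_rng(0)
n = 500
is_weekend = rng.integers(0, 2, size=n)
temperature = rng.normal(20, 5, size=n)
random_noise = rng.normal(size=n)

# Sales depend on the weekend flag and the temperature, not on the noise column.
sales = 100 + 80 * is_weekend + 2 * temperature + rng.normal(0, 5, size=n)

X = np.column_stack([is_weekend, temperature, random_noise])
feature_names = ["Is_Weekend", "Temperature", "Random_Noise"]

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, sales)

# Impurity-based importances sum to 1; higher means more impact on the result.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

In this sketch the noise column should receive the lowest score, which is the kind of signal that suggests a feature can be dropped.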
Writer: Adrián López
Reviewer: César Hernández