This is our first post about Advanced Analytics. It introduces Machine Learning and the steps you need to follow to run successful, sustainable projects. In upcoming pills, we will share what we have learned from the projects we have carried out.
Define project & Explore data
Given a set of needs to address, we first choose the topic to model. Next, we perform Exploratory Data Analysis (EDA) to find correlations, anomalies or association rules that will help us clean the dataset.
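As a rough illustration of this exploration step, the sketch below computes a correlation between two columns and flags anomalous values with a z-score rule. The column names, the values and the 2.5 threshold are made up for the example; a real project would normally rely on a data analysis library rather than hand-written helpers.

```python
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def zscore_outliers(xs, threshold=3.0):
    """Indices of values lying more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return []
    return [i for i, x in enumerate(xs) if abs(x - mu) / sigma > threshold]

# Hypothetical toy columns, invented for this example.
temperature = [20, 22, 24, 26, 28, 30]
sales = [200, 220, 240, 260, 280, 300]
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 500]

print(pearson(temperature, sales))               # close to 1: strong linear relation
print(zscore_outliers(readings, threshold=2.5))  # position of the suspicious value
```

In a sample this small a single extreme value also inflates the standard deviation, which is why the example uses a threshold below the usual 3.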
Once the data has been explored and cleaned, we are in a position to think of new variables that may help the model we will create. This is called Feature Engineering. New features do not have to be derived from the given data; they can also come from other sources. If, after this step, our dataset still does not have enough features to work with, we may go back to the first step and redefine our objective.
To test a feature’s usefulness, we split the data, train some models with and without the feature, and compare their performance. If adding a feature makes the evaluation worse, it is not a good feature.
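A minimal sketch of this check, assuming a deliberately simple 1-nearest-neighbour model and invented data: the same model is evaluated on the Test rows once without the candidate feature and once with it, and the two errors are compared. The function names and the dataset are hypothetical.

```python
from statistics import mean

def nearest_neighbour_predict(train_X, train_y, x):
    """Return the target of the Train row closest to x (1-nearest neighbour)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_X]
    return train_y[distances.index(min(distances))]

def holdout_mae(train_X, train_y, test_X, test_y):
    """Mean absolute error of the model on the Test rows."""
    return mean(abs(nearest_neighbour_predict(train_X, train_y, x) - y)
                for x, y in zip(test_X, test_y))

def keep_columns(X, cols):
    """Keep only the listed column positions of every row."""
    return [[row[c] for c in cols] for row in X]

# Invented data: column 0 is an existing feature, column 1 is the candidate.
train_X = [[1, 0], [2, 0], [3, 0], [1, 1], [2, 1], [3, 1]]
train_y = [1, 2, 3, 11, 12, 13]
test_X = [[1.1, 1], [2.9, 0]]
test_y = [11.1, 2.9]

mae_without = holdout_mae(keep_columns(train_X, [0]), train_y,
                          keep_columns(test_X, [0]), test_y)
mae_with = holdout_mae(train_X, train_y, test_X, test_y)

print("without candidate:", mae_without)
print("with candidate:   ", mae_with)
print("keep feature:", mae_with < mae_without)
```

Here the error drops sharply once the candidate column is included, so the feature would be kept. With real data the comparison is noisier, so it is worth repeating over several splits.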
Another option during the Feature Engineering phase is to run a Feature Elimination/Selection algorithm which, given a set of features, returns the subset of the n best ones, where n is the number of features we want to keep.
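One of the simplest ways to pick the n best features is a filter method that ranks each feature by the absolute value of its correlation with the target. The sketch below assumes that approach. It is not necessarily the algorithm used in our projects, and wrapper methods such as recursive elimination work differently. The feature names and values are invented.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def top_n_features(columns, target, n):
    """Names of the n columns most correlated (in absolute value) with the target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical housing data, invented for this example.
columns = {
    "area_m2": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 4],
    "door_number": [7, 1, 9, 3, 5],
}
price = [150, 180, 240, 300, 360]

print(top_n_features(columns, price, n=2))  # ['area_m2', 'rooms']
```

A ranking like this scores each feature in isolation, so it can miss features that are only useful in combination with others.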
Once the new features have been added to the data, we must split our dataset into a Train dataset and a Test dataset. We will train the model on the Train dataset and test it on the Test dataset. A common split is 80% of the data for training and 20% for testing. If the data depends on time, the split should be chronological (train on the earlier records, test on the later ones); otherwise, we may split randomly.
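The two ways of splitting can be sketched as follows. The function names are hypothetical, and the chronological version assumes the rows are already sorted from oldest to newest.

```python
import random

def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows: the earliest part is Train, the latest part is Test."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def random_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then split; the seed makes the split repeatable."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))  # stand-in for 10 records, oldest first

train, test = chronological_split(rows)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]

train_r, test_r = random_split(rows)
print(len(train_r), len(test_r))  # 8 2
```

Splitting time-dependent data at random would let the model see records from the future of the rows it is tested on, which makes the evaluation look better than it really is.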
Next, we train different models on the Train dataset in order to find the best one, trying different settings to find those that fit our data best.
After creating each model, we evaluate it by predicting on the Test dataset and checking how good the predictions are. This step is the most important one, because it is where we decide whether a model is good or not, taking great care that the model is not overfitting the data.
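To make the idea of trying several settings and watching for overfitting concrete, here is a small sketch. It assumes a k-nearest-neighbours regressor on one feature and invented data, and compares the error on Train and on Test for a few values of k.

```python
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k Train points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)

def mae(train, data, k):
    """Mean absolute error of the k-NN model on `data`."""
    return mean(abs(knn_predict(train, x, k) - y) for x, y in data)

# Invented (feature, target) pairs: roughly y = x plus some noise.
train = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5), (7, 8), (8, 7), (9, 10)]
test = [(2.4, 2.9), (4.4, 3.9), (6.4, 6.9)]

results = {
    k: {"train": mae(train, train, k), "test": mae(train, test, k)}
    for k in (1, 3, 5)
}
for k, scores in results.items():
    print(k, scores)
```

With k = 1 the model reproduces the Train data perfectly (error 0) yet has the highest error on Test of the three settings. A large gap between the two errors is the typical sign of overfitting.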
The last step of the Modelling part is obtaining the final model. Once we have found the best tuning for a model, we retrain it on the full dataset (Train+Test) so that it learns from all the available data.
Production and model update
Finally, the model is ready to predict future events, so we can feed it new data and start showing the predictions. However, as time goes by, the model's performance will degrade. For that reason, the model must be updated when its predictions start to fail too often.
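One possible way to decide when predictions fail too often is to track the average error over the most recent predictions and flag the model for an update once it crosses a threshold. The class below is a hypothetical sketch of that idea; the window size and threshold are arbitrary and would have to be chosen for each project.

```python
from collections import deque

class ErrorMonitor:
    """Tracks the mean absolute error over the most recent predictions."""

    def __init__(self, window=3, threshold=2.0):
        self.errors = deque(maxlen=window)  # old errors drop out automatically
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store the error of one prediction once the real value is known."""
        self.errors.append(abs(predicted - actual))

    def needs_retraining(self):
        """True when the window is full and its mean error exceeds the threshold."""
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough evidence yet
        return sum(self.errors) / len(self.errors) > self.threshold

monitor = ErrorMonitor(window=3, threshold=2.0)

# Invented (predicted, actual) pairs: the model is accurate at first...
for predicted, actual in [(10, 10), (11, 10), (12, 11)]:
    monitor.record(predicted, actual)
print(monitor.needs_retraining())  # False

# ...and then starts to drift.
for predicted, actual in [(15, 10), (16, 10)]:
    monitor.record(predicted, actual)
print(monitor.needs_retraining())  # True
```

Waiting for a full window keeps a single bad prediction from triggering an update.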
Author: Adrian Lopez
Reviewer: Enric García