Investing effort in building the features that best describe your problem almost always beats model tuning. In this pill, we will walk through the phases you may face when dealing with data and share some recommendations.

Generally speaking, it is best to take a top-down approach. Instead of inspecting the data and deriving possible use cases from it, start by identifying a pain point and then look for the data needed to address it.

Data Acquisition

The first step in a Machine Learning project is defining the question we want to answer: our goal. Only then should we start gathering data, whether internal or external, in order to build a first dataset. If the collected data is not enough to create a model or to analyze the problem, we need to reformulate our problem so that the available data can support our results. If we do not want to reformulate the problem, we will have to create a process to collect the missing data automatically.

Data Exploration

It is crucial at this point to perform an Exploratory Data Analysis (EDA). It gives us a first understanding of how the variables are related, their distributions, and so on. Moreover, we can start making assumptions based on what we see in the EDA. As we mentioned in the previous post, a good EDA has a high impact on the Feature Engineering phase.
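A minimal EDA sketch with pandas, using a small hypothetical dataset of user sessions (the column names and values are illustrative, not from any real data):

```python
import pandas as pd

# Hypothetical dataset for illustration: a few user sessions.
df = pd.DataFrame({
    "duration_s": [320, 45, 610, 120, 95, 300],
    "pages_viewed": [12, 3, 20, 6, 4, 11],
    "converted": [1, 0, 1, 0, 0, 1],
})

# First look: size, summary statistics, and pairwise relationships.
print(df.shape)                     # (rows, columns)
print(df.describe())                # count, mean, std, quartiles per column
print(df.corr(numeric_only=True))  # linear correlations between variables
```

Even these three calls already suggest hypotheses, for instance whether longer sessions tend to view more pages, which is exactly the kind of assumption that later guides Feature Engineering.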

Data Cleaning

Assuming that our dataset arrives in good condition to be explored or modeled is a big mistake. We need to homogenize the dataset first. To begin with, we must check that every cell of each variable has the same format: if we have a variable called Duration that is expressed in seconds, we need to check that there is no cell with values such as 5:20 instead of 320.
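A minimal sketch of this kind of homogenization, assuming a hypothetical Duration column that mixes plain seconds with mm:ss strings:

```python
def to_seconds(value):
    """Normalize a duration that may be '5:20' (mm:ss) or plain seconds."""
    text = str(value).strip()
    if ":" in text:
        minutes, seconds = text.split(":")
        return int(minutes) * 60 + int(seconds)
    return int(float(text))

raw = ["320", "5:20", 45, "1:05"]
clean = [to_seconds(v) for v in raw]
print(clean)  # [320, 320, 45, 65]
```

After this pass, every cell holds the same unit and type, which is the precondition for any later analysis or modeling.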

On the other hand, it is also important to deal with missing values. Note that some algorithms cannot handle them, so we may have to modify our dataset depending on the algorithm we will use, or even create more than one version of the dataset.

To deal with missing values, we first need to homogenize them so that they all have the same representation: every missing cell in the dataset should be encoded the same way. We will find two kinds of missing values:

  • Meaningful missing: The fact that a value is missing adds information.
  • Meaningless missing: The fact that a value is missing is accidental.
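The homogenization step can be sketched with pandas, assuming a hypothetical raw column where missing values appear under several guises (the sentinel strings are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical raw column: missing values encoded in several ways.
s = pd.Series(["12", "N/A", "", "na", "7", "-"])

# Map every placeholder to the same representation: np.nan.
clean = s.replace(["N/A", "", "na", "-"], np.nan)
print(clean.isna().sum())  # 4
```

Once everything missing is np.nan, tools such as isna(), dropna(), and fillna() see the full picture instead of only part of it.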

Depending on the type of missing value, we will proceed in one of the following ways:

  • Drop rows with missing values if there is a small percentage.
  • Drop features with a high percentage of missing values.
  • Impute missing values with static content: median, mode, mean…
  • Use Machine Learning techniques to impute missing values.
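The first three strategies above can be sketched with pandas; the DataFrame and the 50% threshold are purely illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":          [25, np.nan, 31, 40, np.nan, 22],
    "city":         ["A", "B", np.nan, "A", "B", "A"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0, np.nan, np.nan],
})

# Drop features with a high percentage of missing values (here > 50%).
df = df.loc[:, df.isna().mean() <= 0.5]

# Impute the rest with static content: median for numeric columns,
# mode for categorical ones.
df["age"] = df["age"].fillna(df["age"].median())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df.isna().sum().sum())  # 0
```

Model-based imputation (the fourth strategy) follows the same pattern but predicts each missing cell from the other features, e.g. with scikit-learn's IterativeImputer.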

Data Transformations

In order to improve the efficiency of algorithms, it is sometimes recommended to apply transformations to some variables. For instance, many algorithms work better if the data is normally distributed, so it is common to apply the z-score to a feature to standardize it. Another common transformation is to filter some features in order to remove outliers and keep their values within a sensible range.
Lastly, we may have data that is given as numbers but is not truly numerical, for instance, the day of the month. In these cases, we must change its type to categorical, because treating it as a numerical feature would be an error.
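The three transformations just described can be sketched together with pandas; the income values and the 2-standard-deviation cutoff are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "income": [29_000, 30_000, 31_000, 32_000, 30_000,
               29_000, 31_000, 28_000, 30_000, 500_000],  # one outlier
    "day_of_month": [1, 15, 28, 7, 15, 3, 22, 9, 30, 11],
})

# Z-score standardization: zero mean, unit variance.
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Simple outlier filter: keep rows within 2 standard deviations.
df = df[df["income_z"].abs() < 2].copy()

# Numbers that are not truly numerical become categorical.
df["day_of_month"] = df["day_of_month"].astype("category")
print(df["day_of_month"].dtype)  # category
```

Marking day_of_month as categorical prevents a model from assuming, say, that day 30 is "thirty times" day 1, which is exactly the error the text warns about.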

Author: Adrián López
Reviewer: César Hernández