EDA vs Data Preprocessing: What's the Difference?

By Mbali Kalirane on 04 Apr 2024

Both Exploratory Data Analysis (EDA) and data preprocessing play important roles in the data cleaning pipeline, but they serve different purposes. In this article, you will learn the difference between EDA and data preprocessing, and how each process is used in data preparation. You’ll understand the steps involved in each process, and you’ll learn how to implement these steps in Python.

What is Exploratory Data Analysis (EDA)?
What is Data Preprocessing?
Example of EDA vs Data Preprocessing
The Process of EDA
The Process of Data Preprocessing
Conclusion

What is Exploratory Data Analysis (EDA)?

The aim of EDA is to understand the characteristics and structure of your data. It’s the process of analyzing your data to find patterns, evaluate how the data is distributed, and spot potential errors.

What is Data Preprocessing?

Data preprocessing is the process of fixing the errors discovered during EDA. It consists of a variety of data cleaning steps, such as removing missing values from the dataset or fixing incorrectly written values. Data preprocessing prepares the data for model training: for algorithms to process your data effectively, it must be clean and well prepared.

Example of EDA vs Data Preprocessing

Let’s take an example of a classification dataset, which aims to determine the likelihood of a heart attack based on a patient’s health information and characteristics. Before modelling this data, EDA has to be performed to identify issues in the data. Preprocessing then has to be applied to fix the identified issues. Let’s take a look at the steps for performing EDA and preprocessing on this data.

The Process of EDA

The EDA consists of finding the characteristics of the dataset. This includes checking the following:

Viewing the data
Dataset proportions
Missing values
Data types
Duplicate values
Inconsistent values
Outliers
Data distribution

Viewing the Data

Suppose the dataset is loaded into a pandas DataFrame named data.

Firstly, you need to view your data. For this, you can use data.head().

Check the Dataset's Proportions

Evaluate the number of rows and columns in the dataset.

You can use the .shape attribute to determine the dataset’s proportions. Note that it is an attribute rather than a method, so it takes no parentheses. Thus, you’ll have:

data.shape

The output, (918, 12), shows that the dataset has 918 rows and 12 columns.

Check for Missing Values

Evaluate the dataset for any missing values.

You can check for missing values using isnull().values.any(). This tells you whether the dataset has any missing values by returning either True or False. Thus, applying this to the dataset, you’ll have:

data.isnull().values.any()

The call returns False, so there are no missing values in the dataset.

Check for Data Types

Check the data type of each variable and identify variables which need to be converted to numerical types.

You can use dtypes for checking the data type of each variable. Applying this to the data, you’ll have:

data.dtypes

As can be seen from the above output, Sex, ChestPainType, RestingECG, ExerciseAngina and ST_Slope have non-numeric data types, and will have to be converted to numeric data types.

Check for Duplicate Values

Check if any instances or rows have been duplicated.

For this, you can use the method .duplicated(), as follows:
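A minimal sketch of this check, assuming data is a pandas DataFrame. The small frame built here is a hypothetical stand-in for the heart dataset, and the wording of the printed message is illustrative:

```python
import pandas as pd

# Hypothetical stand-in for the heart dataset (values are illustrative only).
data = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ChestPainType": ["ATA", "NAP", "ATA", "ASY"],
})

# duplicated() flags every row that repeats an earlier row; sum() counts them.
n_duplicates = int(data.duplicated().sum())

if n_duplicates > 0:
    message = f"The dataset contains {n_duplicates} duplicate row(s)."
else:
    message = "The dataset contains no duplicate rows."
print(message)
```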

This code generates a statement saying whether the dataset contains duplicate rows or not.

From the above output it’s clear that the dataset contains no duplicate rows.

Check for Inconsistent Values

Inspect the categorical variables to identify any incorrectly written, misspelled or repetitive data.

For this, you can use a for loop to check the unique values of each categorical column in the dataset.

From the output, you can see that there do not appear to be any inconsistent values in any categorical column of the dataset.
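One way to write such a loop is sketched here, assuming data is a pandas DataFrame. The frame is a hypothetical stand-in with a few of the dataset's columns and made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the heart dataset (values are illustrative only).
data = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ExerciseAngina": ["N", "N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat"],
})

# Treat every non-numeric column as categorical.
categorical_columns = data.select_dtypes(exclude="number").columns

# Collect and print the distinct values of each categorical column.
unique_values = {}
for column in categorical_columns:
    unique_values[column] = sorted(data[column].unique())
    print(f"{column}: {unique_values[column]}")
```

Misspellings or differently capitalised labels (for example "m" next to "M") would show up as extra entries in these lists.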

Check for Outliers

Check for the presence of outliers in the numerical variables of the dataset.

To do this, you can use boxplots as follows:
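A possible sketch using matplotlib, assuming data is a pandas DataFrame. The frame is a hypothetical stand-in: the RestingBP and Cholesterol column names and all values are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the heart dataset (column names and values are illustrative).
data = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58],
    "RestingBP": [140, 160, 130, 138, 150, 120, 130, 200],
    "Cholesterol": [289, 180, 283, 214, 195, 339, 237, 603],
})

numeric_columns = data.select_dtypes(include="number").columns
n_columns = len(numeric_columns)

# Draw one boxplot per numeric column, side by side.
fig, axes = plt.subplots(1, n_columns, figsize=(4 * n_columns, 4), squeeze=False)
for ax, column in zip(axes[0], numeric_columns):
    ax.boxplot(data[column].to_numpy())
    ax.set_title(column)
fig.tight_layout()
```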

The outliers are represented by the hollow circles in each boxplot. As the boxplots show, there are outliers in each of the numerical columns except Age.

Check the Data's Distribution

Evaluate the balance of the data, particularly the target variable. Evaluating the balance of the target variable is important for identifying potential biases in the data, especially for classification problems.

You can use a bar chart to evaluate the proportion of each class in the target variable:
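A minimal sketch with matplotlib, assuming data is a pandas DataFrame. The target column name HeartDisease is an assumption, and the toy values are chosen only to mirror a split of roughly 55% to 45%:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; remove in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical target column: 11 positive and 9 negative cases.
data = pd.DataFrame({"HeartDisease": [1] * 11 + [0] * 9})

# Count the rows in each class, ordered by class label.
class_counts = data["HeartDisease"].value_counts().sort_index()

# One bar per class.
fig, ax = plt.subplots()
ax.bar(list(class_counts.index.astype(str)), class_counts.to_numpy())
ax.set_xlabel("HeartDisease")
ax.set_ylabel("Number of rows")
```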

To get a more concrete understanding of the proportions of each class in the target variable, you can calculate the percentages for each class as follows:
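A minimal sketch of this calculation, assuming data is a pandas DataFrame. The target column name HeartDisease is an assumption, and the toy values are chosen only to mirror a split of roughly 55% to 45%:

```python
import pandas as pd

# Hypothetical target column: 11 positive and 9 negative cases.
data = pd.DataFrame({"HeartDisease": [1] * 11 + [0] * 9})

# normalize=True returns class proportions; multiply by 100 for percentages.
class_percentages = (
    data["HeartDisease"].value_counts(normalize=True).mul(100).round(1)
)
print(class_percentages)
```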

As can be seen from the above output, the target column is only slightly imbalanced: around 55% of the rows fall in one class and about 45% in the other. This is a relatively balanced distribution between the classes. Thus, applying a method to balance the data may be unnecessary.

The Process of Data Preprocessing

After applying EDA to our dataset, we can see that the dataset presents several issues that need to be fixed and require preprocessing.

Issues Identified:

The problems identified in the data are as follows:

There are outliers
There are non-numerical variables
There’s a slight imbalance in the data

Remove Outliers

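Remove outliers from the dataset.

To do this, you can use the interquartile range (IQR), which is the difference between the third quartile (Q3) and the first quartile (Q1). The IQR is used to calculate lower and upper limits, commonly Q1 - 1.5 × IQR and Q3 + 1.5 × IQR, and any values beyond these limits are identified as outliers and removed.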
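One way to apply this is sketched here, assuming data is a pandas DataFrame. The helper name remove_outliers_iqr, the 1.5 factor (a common convention, not stated in this article) and the toy values are all assumptions:

```python
import pandas as pd

# Hypothetical stand-in for the heart dataset (column names and values are illustrative).
data = pd.DataFrame({
    "Age": [40, 49, 37, 48, 54, 39, 45, 58],
    "Cholesterol": [289, 180, 283, 214, 195, 339, 237, 603],
})

def remove_outliers_iqr(df, columns, factor=1.5):
    """Keep rows whose values lie within [Q1 - factor*IQR, Q3 + factor*IQR] in every listed column."""
    keep = pd.Series(True, index=df.index)
    for column in columns:
        q1 = df[column].quantile(0.25)
        q3 = df[column].quantile(0.75)
        iqr = q3 - q1
        # Rows outside the limits for this column are marked for removal.
        keep &= df[column].between(q1 - factor * iqr, q3 + factor * iqr)
    return df[keep]

cleaned = remove_outliers_iqr(data, ["Age", "Cholesterol"])
print(len(data), "rows before,", len(cleaned), "rows after")
```

In this toy frame only the row with a Cholesterol value of 603 falls outside the limits, so one row is dropped.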

Thus, the output shows that the outliers have been removed from the data.

Convert Categorical Data to Numeric Data

The dataset contains the following categorical variables, which you need to convert into numeric form:

Sex
ChestPainType
RestingECG
ExerciseAngina
ST_Slope

In this example, Sex, ExerciseAngina and ST_Slope are treated as nominal variables (categories with no inherent order), and ChestPainType and RestingECG are treated as ordinal variables (categories with an order).

Thus, for nominal variables, you can use pandas' get_dummies as the encoding method, and for ordinal variables, you can use scikit-learn's OrdinalEncoder.

Encoding Nominal Data

Using get_dummies on nominal data:
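A minimal sketch, assuming data is a pandas DataFrame. The frame is a hypothetical stand-in with made-up category values, and dtype=int is chosen here so the new columns hold 0 and 1:

```python
import pandas as pd

# Hypothetical stand-in for the heart dataset (values are illustrative only).
data = pd.DataFrame({
    "Age": [40, 49, 37, 48],
    "Sex": ["M", "F", "M", "F"],
    "ExerciseAngina": ["N", "N", "Y", "N"],
    "ST_Slope": ["Up", "Flat", "Up", "Down"],
})

nominal_columns = ["Sex", "ExerciseAngina", "ST_Slope"]

# Replace each nominal column with one 0/1 indicator column per category.
encoded = pd.get_dummies(data, columns=nominal_columns, dtype=int)
print(encoded.columns.tolist())
```

Passing drop_first=True would drop one indicator per variable, which avoids carrying a redundant column for each encoded variable.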

Encoding Ordinal Data

Using OrdinalEncoder on ordinal data:
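A minimal sketch with scikit-learn, assuming data is a pandas DataFrame. The category labels are made-up stand-ins. By default OrdinalEncoder numbers the categories in sorted (alphabetical) order, so if a specific order matters it would have to be supplied through the categories parameter:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in for the heart dataset (category labels are illustrative only).
data = pd.DataFrame({
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "RestingECG": ["Normal", "ST", "LVH", "Normal"],
})

ordinal_columns = ["ChestPainType", "RestingECG"]

# With default settings, each column's categories are sorted and mapped to 0, 1, 2, ...
encoder = OrdinalEncoder()
encoded_values = encoder.fit_transform(data[ordinal_columns])

encoded = pd.DataFrame(encoded_values, columns=ordinal_columns, index=data.index)
print(encoded)
```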

Thus, we have the following output of encoded data:

Balance the Data

Balance the target column. To do this, you can use an upsampling method as follows:
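One possible approach is random oversampling of the minority class with pandas, sketched here for a binary target. The target column name HeartDisease, the toy values and the random_state are assumptions, and this is only one of several upsampling methods:

```python
import pandas as pd

# Hypothetical stand-in: 11 rows in one class and 9 in the other.
data = pd.DataFrame({
    "Age": list(range(30, 50)),
    "HeartDisease": [1] * 11 + [0] * 9,
})

target = "HeartDisease"
class_counts = data[target].value_counts()
majority = data[data[target] == class_counts.idxmax()]
minority = data[data[target] == class_counts.idxmin()]

# Sample the minority class with replacement until it matches the majority class size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine both classes and shuffle the rows.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced[target].value_counts())
```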

Conclusion

EDA is about exploring your dataset to understand its structure, issues and characteristics. Data preprocessing is about fixing the issues found in the data, and cleaning the data in preparation for the modelling stage.

In the above analysis, I identified the following problems from applying EDA: the presence of outliers, non-numerical data, and imbalanced data. During data preprocessing, I removed the outliers, encoded the non-numerical data and balanced the data.

However, it’s important to note that the order of some data cleaning processes needs to be carefully considered. For example, the removal of outliers from a dataset can impact data balance, and vice versa. Thus, data balancing will need to be applied after the removal of outliers.

Also, careful consideration is needed when choosing between the two approaches. If outliers come from measurement errors or anomalies, removing them may be necessary, irrespective of class imbalance. However, when faced with significant class imbalance, such as a 90:10 ratio, addressing this issue takes precedence.

In cases where both issues exist, employing both data balancing techniques and outlier removal may be necessary. After applying outlier removal, you may need to check your data balance again.
