By - Manish Jangid
Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is a fundamental process in data science and statistics. It lets analysts make sense of a dataset by summarizing its main characteristics, often through visual methods. This process helps in understanding the structure of the data, detecting anomalies, checking assumptions, and uncovering patterns. In this post, we will go through the theory behind EDA and work through a small example to illustrate its application.
What is EDA?
EDA is the practice of examining a dataset before applying any machine learning model or formal statistical technique. Its main goals are to:
Understand the underlying structure of the data.
Detect outliers and anomalies.
Check assumptions required by statistical methods.
Identify patterns and relationships between variables.
Generate hypotheses for further analysis.
EDA can be broadly categorized into:
Descriptive Statistics: Summarizing the basic features of the data.
Univariate Analysis: Analyzing each variable individually.
Bivariate Analysis: Examining the relationships between two variables.
Multivariate Analysis: Analyzing more than two variables simultaneously.
Example:
To illustrate these concepts, we will use the well-known Titanic dataset, which contains information about the ship's passengers.
Step 1: Importing Libraries and Loading the Dataset
First, we need to import the necessary libraries and load the dataset into a Pandas DataFrame.
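A minimal sketch of this step, assuming the data is available as a CSV file with the Kaggle-style column names used in this post ('Survived', 'Pclass', 'Sex', and so on). The helper name load_titanic and the three in-memory rows are illustrative assumptions; the rows are made up so that the snippet runs without the real file.

```python
import io

import pandas as pd


def load_titanic(source) -> pd.DataFrame:
    """Read the passenger table from a CSV path, URL, or file-like object."""
    return pd.read_csv(source)


# Stand-in data: three made-up rows, NOT real passenger records.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.28\n"
    "3,1,3,female,,7.92\n"
)

df = load_titanic(sample_csv)
print(df.shape)  # (3, 6)
```

With the real file on disk, the equivalent call would be load_titanic("titanic.csv"), where the path is a placeholder for wherever you saved the dataset.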

Step 2: Overview of the Dataset
Let's start by taking a look at the first few rows of the dataset to understand its structure.
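As a small sketch, the first rows can be pulled out with DataFrame.head. The wrapper name preview is a hypothetical helper introduced here only to keep the snippet self-contained.

```python
import pandas as pd


def preview(df: pd.DataFrame, n: int = 5) -> pd.DataFrame:
    """Return the first n rows of the DataFrame (5 by default)."""
    return df.head(n)
```

With the dataset from Step 1 in df, print(preview(df)) shows the first five passengers along with every column name.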

We can also get a summary of the dataset, including the column names, data types, and non-null counts, which reveal where values are missing.
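Calling df.info() prints this kind of summary directly. The sketch below builds a comparable table by hand so it can be inspected or sorted; the function name summarize_columns is an assumption for illustration.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: data type, non-null count, and missing count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "non_null": df.notna().sum(),
            "missing": df.isna().sum(),
        }
    )
```

For numeric columns, df.describe() adds the usual descriptive statistics (mean, standard deviation, quartiles) on top of this.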


Step 3: Handling Missing Values
Identifying and handling missing values is an essential part of EDA. Let's check for missing values in our dataset.
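A minimal sketch of the check, built on df.isnull().sum(). The helper name missing_report and the added percentage column are assumptions made for readability, not part of the original walkthrough.

```python
import pandas as pd


def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """List the columns that contain missing values, most affected first."""
    counts = df.isnull().sum()
    report = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(df) * 100).round(1),
        }
    )
    # Keep only the columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)
```

The percentage is often more useful than the raw count when deciding whether a column is worth keeping.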

From the output, we can see that the 'Age', 'Cabin', and 'Embarked' columns have missing values. We need to decide how to handle these.
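One common approach is to fill missing values in a numeric column such as 'Age' with the column's median or mean; for a categorical column such as 'Embarked', the most frequent value plays the same role.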
Since 'Cabin' has many missing values, we can drop this column.
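The choices above could be sketched as follows. Using the median for 'Age' is one of the two options the text mentions; filling 'Embarked' with its most frequent value is an assumption added here, since a median is not defined for a text column. The function name handle_missing is hypothetical.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy; the input DataFrame is left unchanged."""
    cleaned = df.copy()

    # Numeric column: fill gaps with the median, which is robust to outliers.
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

    # Assumption: categorical column filled with its most frequent value.
    most_common_port = cleaned["Embarked"].mode()[0]
    cleaned["Embarked"] = cleaned["Embarked"].fillna(most_common_port)

    # 'Cabin' is mostly empty, so drop the column entirely.
    return cleaned.drop(columns=["Cabin"])
```

Working on a copy keeps the raw data available, which makes it easy to compare results under a different imputation strategy later.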

Step 4: Univariate Analysis
Univariate analysis involves examining the distribution of individual variables.
Analyzing Categorical Variables
For a categorical variable such as 'Survived', a bar chart of the value counts shows how the passengers are distributed across its categories.
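A possible sketch of the chart for 'Survived', using pandas with matplotlib. It assumes the 0/1 coding of the Kaggle version of the dataset (0 = did not survive, 1 = survived); the function name plot_survival_counts is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_survival_counts(df: pd.DataFrame):
    """Count each value of 'Survived' and draw the counts as a bar chart."""
    counts = df["Survived"].value_counts().sort_index()

    fig, ax = plt.subplots()
    counts.plot(kind="bar", rot=0, ax=ax)
    # Assumption: 0 = did not survive, 1 = survived.
    ax.set_xlabel("Survived (0 = no, 1 = yes)")
    ax.set_ylabel("Number of passengers")
    ax.set_title("Distribution of 'Survived'")
    return counts, ax
```

Call plt.show() after the function to display the figure when running outside a notebook. For a numeric variable such as 'Age', a histogram (df["Age"].plot(kind="hist")) is the usual counterpart.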

Step 5: Bivariate Analysis
Bivariate analysis involves examining the relationship between two variables.
For example, a bar chart of 'Survived' grouped by 'Sex' shows how survival differs between male and female passengers.
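One way to examine this pair of variables is a cross-tabulation plus a grouped bar chart, sketched below. The function name survival_by_sex is an assumption, and the survival-rate calculation relies on 'Survived' being coded as 0/1.

```python
import matplotlib.pyplot as plt
import pandas as pd


def survival_by_sex(df: pd.DataFrame):
    """Cross-tabulate 'Sex' against 'Survived' and chart the counts."""
    counts = pd.crosstab(df["Sex"], df["Survived"])

    # The mean of a 0/1 column is the share of survivors in each group.
    rates = df.groupby("Sex")["Survived"].mean()

    fig, ax = plt.subplots()
    counts.plot(kind="bar", rot=0, ax=ax)
    ax.set_ylabel("Number of passengers")
    ax.set_title("Survival by sex")
    return counts, rates, ax
```

The counts table answers "how many", while the rates answer "what proportion", which is usually the fairer comparison when the groups differ in size.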

Step 6: Multivariate Analysis
Multivariate analysis involves examining the relationships between three or more variables.
For example, plotting the survival rate for each combination of 'Sex' and 'Pclass' shows how the two variables relate to 'Survived' together.
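A sketch of a three-variable view, assuming 'Survived' is coded 0/1 so that its mean within a group is that group's survival rate. The pivot-table layout and the name survival_by_sex_and_class are illustrative choices, not necessarily the plot used originally.

```python
import matplotlib.pyplot as plt
import pandas as pd


def survival_by_sex_and_class(df: pd.DataFrame):
    """Survival rate for every combination of passenger class and sex."""
    table = df.pivot_table(
        index="Pclass", columns="Sex", values="Survived", aggfunc="mean"
    )

    fig, ax = plt.subplots()
    table.plot(kind="bar", rot=0, ax=ax)  # one bar group per class
    ax.set_xlabel("Passenger class")
    ax.set_ylabel("Survival rate")
    ax.set_ylim(0, 1)
    return table, ax
```

Each row of the returned table is a passenger class and each column a sex, so differences can be read both across and down.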

Step 7: Correlation Analysis
Correlation analysis helps us understand the relationship between numerical variables. We need to ensure that only numeric columns are included in the correlation matrix.
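A minimal sketch of this step: select_dtypes keeps only the numeric columns, corr computes pairwise Pearson correlations (the pandas default), and a plain matplotlib heatmap displays them. The function names and the colour map are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd


def numeric_correlations(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation matrix over the numeric columns only."""
    return df.select_dtypes(include="number").corr()


def plot_correlations(corr: pd.DataFrame):
    """Draw the correlation matrix as a heatmap on a fixed -1..1 scale."""
    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
    positions = range(len(corr.columns))
    ax.set_xticks(positions)
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticks(positions)
    ax.set_yticklabels(corr.columns)
    fig.colorbar(image, ax=ax)
    return ax
```

Fixing the colour scale to the range -1 to 1 keeps weak correlations visually weak, instead of stretching the colours to whatever range the data happens to cover.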

The correlation matrix reveals mostly weak relationships between the numerical variables in the Titanic dataset. Notably, there is a slight positive correlation between Parch and Fare, suggesting that passengers travelling with more parents or children tended to pay higher fares. SibSp and Parch are also positively correlated, indicating that passengers with siblings or spouses aboard also tended to have parents or children with them. Overall, no numerical variable correlates strongly with survival. Because the matrix covers only numeric columns, it says nothing about categorical variables such as 'Sex', which were examined in the earlier steps.
In this small example, we performed a basic EDA on the Titanic dataset. We started by loading and examining the dataset, then handled missing values and carried out univariate, bivariate, and multivariate analyses. Finally, we analyzed the correlations between numerical variables. EDA is a crucial step in understanding a dataset and preparing it for further analysis or modeling.