Machine Learning EDA Notebooks: Unveiling Insights Through Exploration

Exploratory Data Analysis (EDA) is a critical first step in the data science workflow. It uses a range of techniques to inspect, summarize, and visualize data in order to uncover trends, patterns, and relationships, and its findings inform all subsequent analysis.

Introduction to Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a method of analyzing datasets to understand their main characteristics. It involves summarizing data features, detecting patterns, and uncovering relationships through visual and statistical techniques. Data pre-processing, feature engineering, and EDA are fundamental early steps after data collection. However, they go beyond simply visualizing, plotting, and manipulating data: they also assess data quality and lay the groundwork for building models. EDA is not just about creating aesthetically pleasing visualizations; it is about preparing the data and the visualizations that can answer the most relevant questions, leading to effective data-driven decision-making.

EDA helps assess assumptions and statistical models, ensuring the accuracy and reliability of the analyses. By visually representing the data through graphs and charts, EDA helps identify patterns, trends, and relationships between variables, extracting valuable insights that might go unnoticed in raw data.

Why is EDA Important?

  • Understanding Data: EDA helps data professionals understand the data, its structure, the relationship among features, and other relevant information.
  • Identifying Issues: EDA can help you identify outliers, missing values, anomalies, and unknown relationships in the dataset.
  • Informing Decisions: EDA helps answer the most relevant questions, leading to effective data-driven decision-making.

Essential Steps in EDA

1. Data Collection and Initial Exploration

Sources of Data

There are various ways to load the data. You can use a file uploader or establish a connection to different databases and warehouses.

Collect Basic Information

Once the dataset is loaded, you can take a first glance at it: the column names, the shape of the data, the data type of each column, and the number of non-null values per column, all via the info() method.
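
As a minimal sketch (using a tiny, made-up DataFrame, since the article does not ship a dataset):

```python
import pandas as pd

# Tiny, made-up dataset purely for illustration
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["NY", "LA", "NY", None],
})

df.info()          # column names, dtypes, and non-null counts
print(df.shape)    # (rows, columns)
print(df.dtypes)
```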

Descriptive Statistics

Next, you can check statistical values such as the mean, median, max, min, standard deviation, and IQR for the numerical features using the describe() method. These summary statistics let analysts quickly detect outliers, evaluate data dispersion, and grasp the overall shape of the data. Descriptive statistics serve as a starting point for deeper analysis, helping to form hypotheses, guide further exploration, and communicate key insights to stakeholders.
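
For instance, describe() also returns the quartiles needed to compute the IQR; a sketch on a toy Series with one planted outlier:

```python
import pandas as pd

# Toy numeric feature with one obvious outlier (200)
prices = pd.Series([10, 12, 11, 13, 200], name="price")

stats = prices.describe()          # count, mean, std, min, quartiles, max
iqr = stats["75%"] - stats["25%"]  # interquartile range
print(stats)
print("IQR:", iqr)
```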

2. Data Cleaning

Next comes the data cleaning stage, where we process the data to remove outliers, handle missing values, drop redundant data, and standardize feature values so they share the same scale.

Dealing with Outliers

Outliers can ruin your entire statistical analysis if not handled properly. They can result from data entry errors, sampling problems, or natural variation. The two most common methods of identifying outliers are the IQR (already seen with the describe() method) and the Z-score. Once identified, they need to be handled appropriately. Domain knowledge is important here: sometimes outliers convey crucial information, and by removing them you could lose it.

The first step in handling outliers is to visualize the features that might contain them. This gives you an overall picture of each feature and its values. The most common way to visualize outliers is with box plots.
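
A sketch of both steps (box plot plus the 1.5×IQR rule) on invented data; the income column is purely illustrative:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen, no GUI needed
import matplotlib.pyplot as plt

# Invented income values; 300 is the planted outlier
df = pd.DataFrame({"income": [42, 45, 47, 50, 52, 55, 300]})

# Step 1: visualize -- points beyond the whiskers are potential outliers
df.boxplot(column="income")
plt.savefig("income_boxplot.png")

# Step 2: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)
```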

Handling Missing Values

When dealing with real-world data, it is quite common to have missing data (NaN values) in the dataset. These NaN values affect the reliability of analysis and modeling. There are several ways to handle these missing values, three of the widely used ones are:

  • Dropping the rows that contain NaN values. This is often a poor approach, as it can discard information that is crucial for the analysis.
  • Replacing the NaN values with a measure of central tendency (mean, median, or mode). However, this approach assumes the values are missing at random.
  • Using machine learning algorithms to predict the missing values from the rest of the features in the dataset.
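
The first two strategies can be sketched as follows (the age and salary columns are made up for illustration):

```python
import pandas as pd

# Made-up data with one NaN in each column
df = pd.DataFrame({
    "age": [25.0, None, 41.0, 38.0],
    "salary": [50.0, 60.0, None, 55.0],
})

# Strategy 1: drop rows containing any NaN (risks losing information)
dropped = df.dropna()

# Strategy 2: impute with a measure of central tendency per column
imputed = df.fillna({"age": df["age"].median(), "salary": df["salary"].mean()})

print(dropped.shape)               # rows with NaN are gone
print(imputed.isna().sum().sum())  # no missing values remain
```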

Data Normalization and Standardization

Another common problem with real-world datasets is that feature values do not always share the same scale. This can be due to measurement errors, different measurement units for the same feature, the nature of the features, and so on. When features have different scales, those with larger magnitudes may dominate the learning process, and ML models tend to converge more slowly than they would with features on a common scale.

The solution to this problem is standardizing the values in a feature to bring them all to the same scale. The most common approaches are Min Max Scaling and Standard Scaling.
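
Both approaches are available in scikit-learn; a minimal sketch on invented numbers:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two invented features on very different scales
df = pd.DataFrame({"km_driven": [10_000, 50_000, 90_000], "car_age": [1, 5, 9]})

# Min-max scaling: rescales each feature to the [0, 1] range
minmax = MinMaxScaler().fit_transform(df)

# Standard scaling: zero mean and unit variance per feature
standard = StandardScaler().fit_transform(df)

print(minmax)
print(standard)
```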

Removing Duplicate Rows

Datasets often contain duplicate rows; removing them ensures data integrity.
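
With pandas this is a one-liner; a small sketch on toy data:

```python
import pandas as pd

# Toy data where the second and third rows are exact duplicates
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

print(df.duplicated().sum())    # count of fully duplicated rows
deduped = df.drop_duplicates()  # keeps the first occurrence of each row
print(deduped.shape)
```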

3. Feature Engineering

In real-world scenarios, your data may have a great many features (thousands of them), very few, or just the right amount. In any case, you may need to manipulate the dataset: removing existing features, creating new features from existing ones, or combining features. This process of manipulating the data is known as feature engineering. The need for it differs from use case to use case, but there is a high likelihood that your dataset will require feature engineering in one way or another.

Some common feature engineering operations are creating new features, reducing the number of features with dimensionality reduction techniques such as PCA, and encoding categorical variables for effective ML and deep learning modeling.
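
A brief sketch of two of these operations, one-hot encoding and PCA, on an invented table (the column names are hypothetical):

```python
import pandas as pd
from sklearn.decomposition import PCA

# Invented table with two numeric features and one categorical feature
df = pd.DataFrame({
    "height_m": [1.2, 2.9, 1.8, 3.3],
    "weight_kg": [110, 310, 190, 350],
    "color": ["black", "brown", "black", "brown"],
})

# Handle the categorical variable via one-hot encoding
encoded = pd.get_dummies(df, columns=["color"])

# Reduce the two numeric features to a single principal component
pca = PCA(n_components=1)
component = pca.fit_transform(df[["height_m", "weight_kg"]])

print(encoded.columns.tolist())
print(component.shape)
```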

4. Visual Question Answering

Analyzing the data to answer visual questions involves several types of analysis, each suited to answering different questions about the data. These analyses fall broadly into three categories:

  • Univariate Analysis: This type of analysis deals with a single feature at a time and reveals the descriptive properties of that feature. The data types covered by univariate analysis are numeric, ordered categorical, and unordered categorical. The most common plots and graphs for univariate analysis are bar charts, histograms, and pie charts.
  • Bivariate Analysis: Bivariate analysis explores the relationship between two features in a dataset: numeric-numeric, numeric-categorical, and categorical-categorical pairs. Scatter plots, histograms, box plots, heat maps, and bar graphs are the most common graphs used for bivariate analysis.
  • Multivariate Analysis: This type of analysis aims to identify and recognize patterns across a set of features. Heatmaps and correlation matrices are popular techniques for multivariate analysis.
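
A small multivariate sketch, a correlation matrix rendered as a Seaborn heatmap over invented columns:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import seaborn as sns

# Invented columns chosen so the correlations are easy to see
df = pd.DataFrame({
    "car_age": [1, 3, 5, 7, 9],
    "km_driven": [8_000, 30_000, 52_000, 71_000, 95_000],
    "price": [18_000, 15_000, 12_500, 9_000, 6_000],
})

# Multivariate view: correlation matrix rendered as a heatmap
corr = df.corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap.png")
print(corr)
```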

Machine Learning EDA Notebook Examples

1. Bear Species Classification

This project demonstrates how to build a complete machine learning project for classifying bear species using physical characteristics and image analysis.

Data

  • Snowflake Account: Access to a Snowflake account.
  • Tabular Data: The first portion is a table of physical measurements for each bear, indexed by a unique ID. These physical measurements are key differentiators between bear species; for example, the absence of a shoulder hump and a straight "Roman nose" facial profile are key features of the American Black Bear.
  • Image Data (images/ folder): The second portion is a collection of images, where each image corresponds to a unique ID from the tabular data (e.g., ABB_01, EUR_01, GRZ_01, and KDK_01).

Key Steps

  • Cortex AI Feature Extraction: The core of this notebook is using the AI_COMPLETE SQL function (from Snowflake Cortex AI) to perform image analysis.
  • Data Exploration: With a clean dataset in Snowflake, the next step is to explore it.
  • Feature Relationships: Interactive scatter plots to explore the relationship between any two selected features (e.g., Claw_Length_cm vs. Body_Weight_kg).
  • Feature Scaling: Apply preprocessing to the data.
  • Hyperparameter Tuning: Perform a grid search by training multiple Random Forest models with different combinations of hyperparameters (n_estimators, max_depth, etc.).
  • Register Model: After training the final model with the best parameters, you will register it in the Snowflake Model Registry, giving it a name and version.
  • Streamlit Application: A standalone Streamlit application is created to allow for easy interaction with the model.

2. Online Retail Data Analysis

Let’s work with a case study based on the Online Retail data set, which is available through the UCI Machine Learning Repository. This is a transnational data set containing all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts.

Business Scenario

The management team expects to spend less time on projection models and gain more accuracy in revenue forecasting.

Key Steps

  1. Leverage the data to calculate the monthly revenue of the online retail store.
  2. During the Exploratory Data Analysis (EDA) process, data integrity issues are sometimes identified.
  3. After extracting data, it is important to include quality-assurance checks even on the first pass through the project workflow. The quality-assurance step must check for duplicates and missing values. Missing values are generally handled according to the category of missingness: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random).
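
Step 1 (monthly revenue) can be sketched with a pandas groupby; the column names follow the UCI Online Retail schema, but the rows here are invented:

```python
import pandas as pd

# Invented rows in the shape of the UCI Online Retail schema
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-12-05", "2010-12-20", "2011-01-15"]),
    "Quantity": [2, 1, 4],
    "UnitPrice": [10.0, 25.0, 5.0],
})

# Revenue per line item, then summed per calendar month
df["Revenue"] = df["Quantity"] * df["UnitPrice"]
monthly = df.groupby(df["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()
print(monthly)
```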

3. Used Car Price Prediction

In this dataset, we analyze used car prices and use EDA to identify the factors that influence them.

Data Exploration

  • data.info() shows that the variables Mileage, Engine, Power, Seats, New_Price, and Price have missing values.
  • Numeric variables such as Mileage and Power have the datatypes float64 and int64.
  • Using nunique(), the number of unique values in each column, together with the data description, lets us identify the continuous and categorical columns in the data.

Feature Engineering

Feature engineering refers to the process of using domain knowledge to select and transform the most relevant variables from raw data when creating a predictive model using machine learning or statistical modeling.

  • Work with the variables Year and Name in the dataset. Raw car names will not be great predictors of price in the current data, but we can process this column to extract useful information such as brand and model names.
  • Some variable names are neither relevant nor easy to understand.
  • Some data may contain entry errors, and some variables may need data type conversion. In this example, the brand names ‘Isuzu’ and ‘ISUZU’ duplicate each other, and ‘Mini’ and ‘Land’ look incorrect.

Key Findings

  • The average kilometers driven for used cars is ~58k KM. The range shows a huge gap between the min and max; the max value of 650,000 KM is evidence of an outlier.
  • The minimum value of Mileage is 0, yet cars are not sold with 0 mileage, so these values look like data entry errors.
  • The average number of seats in a car is 5.
  • The maximum price of a used car is 160k, which is strikingly high for a used car.
  • ~82% of cars are first-owner cars.
  • The Price and Kilometers-Driven variables are highly skewed and on a larger scale.
  • Car age positively correlates with kilometers driven, as the age of the car increases, the kilometers driven also increase.

Missing Data Imputation

  • We cannot impute the data with a simple mean/median alone; we need business knowledge or common-sense insight about the data. Domain knowledge adds value to the imputation.
  • We observed earlier that some observations have zero Mileage, which looks like a data entry issue. We can fix this by treating the zeros as null values and then filling them with the mean value of Mileage, since the mean and median are nearly the same for this variable.
  • Imputation for Seats is similar: we can assume that cars of the same brand and model have nearly the same Engine, Mileage, Power, and number of seats.
  • In general, there are no perfect rules for imputing missing values in a dataset; each method may perform better on some datasets and worse on others.
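
The group-wise idea for Seats can be sketched with a pandas groupby-transform; the brands and values below are made up:

```python
import pandas as pd

# Made-up rows: one Maruti is missing its seat count
df = pd.DataFrame({
    "Brand": ["Maruti", "Maruti", "Maruti", "BMW"],
    "Seats": [5.0, None, 5.0, 4.0],
})

# Fill each missing value with the median Seats of the same brand
df["Seats"] = df.groupby("Brand")["Seats"].transform(lambda s: s.fillna(s.median()))
print(df)
```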

4. Fake News Detection

The goal is to build a fake news detection system that can detect the truthfulness of snippets of text from different sources including political debates, social media platforms, etc.

Data Acquisition

  • Use an existing publicly-available dataset.
  • The Liar Liar Dataset released in ACL 2017 has the characteristics we want. It has a little under 13K labelled statements from various contexts off of POLITIFACT.COM. The original dataset actually has a 6-way labelling scheme from PANTS-FIRE to TRUE, though for our purposes we consolidate the labels to a binary classification setting (TRUE/FALSE).

EDA Steps

  1. Data Cleaning: Be critical of your data and strive to eliminate as much noise as possible.
  2. Basic Statistics: Start your EDA with some basic statistics about your corpus.
  3. Label Distribution: Check the label distribution to see if the dataset is balanced.
  4. Affiliation Analysis: Analyze the label distributions between Republican and Democrat statements.
  5. N-gram Analysis: Compute n-gram measures such as the most frequently seen unigrams and bigrams in the data.
  6. Topic Modelling: Use latent semantic analysis (LSA) to find the words that are most representative of a topic.
  7. Sentiment Analysis: Apply sentiment analysis to understand the emotional content and tone of a statement.

Key Insights

  • The labels are roughly equally-distributed, though the proportion of extremely false statements (PANTS-FIRE) is quite a bit smaller than the others.
  • The distribution is slightly skewed toward the True examples.
  • There are slightly more Republican-affiliated statements than Democratic ones, and there is a heavy long-tail for these affiliations.
  • Democrat statements are more often True than False, especially compared to Republican statements.
  • Obama, Hillary Clinton, Donald Trump, and other political figures and phrases show up a lot in both the True and False statements.
  • In the True data, Topic 2 seems to revolve around healthcare related terms, Topic 3 deals with jobs, and Topic 4 roughly deals with taxes.
  • There is a slight difference in the sentiment scores between True and False statements.

5. Food and Weather Dataset Analysis

This example demonstrates a practical implementation of EDA on the Food and Weather dataset using Python. This dataset contains information about the weather and famous food places in different cities.

Key Questions to Answer

  • What are the most popular venues?
  • How does humidity affect visit times?
  • Which popular venues do people stay at the longest?

Steps

  1. Install and Import Packages: Use libraries like Pandas, NumPy, Matplotlib, Seaborn, and Pgeocode.
  2. Data Collection and Initial Exploration: Load the data from a Snowflake Warehouse.
  3. Data Cleaning: Handle outliers and missing values.
  4. Feature Engineering: Create new features such as coordinates (lat and long) for each venue.
  5. Visual Question Answering:
    • What are the Most Popular Venues? Identify which zip codes are the most popular for a venue of choice.
    • How does Humidity Affect Visit Times? Create a scatter plot to identify the relation between humidity and the average amount of time people stay at a venue.
    • Which Popular Venues do People Stay at the Longest? Identify the most popular venues where people stay the longest.
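
The humidity-versus-visit-time scatter plot might be sketched as follows; the column names and values are hypothetical, since the actual schema isn't shown here:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical columns standing in for the food-and-weather data
df = pd.DataFrame({
    "humidity": [30, 45, 60, 75, 90],
    "avg_minutes_stayed": [55, 50, 42, 35, 28],
})

# Scatter plot of humidity against average stay duration
df.plot.scatter(x="humidity", y="avg_minutes_stayed")
plt.savefig("humidity_vs_stay.png")
print(df.corr().loc["humidity", "avg_minutes_stayed"])
```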

Tools for EDA in Python

  • Pandas: For data manipulation and analysis.
  • NumPy: For numerical computing.
  • Matplotlib: For creating static, interactive, and animated visualizations in Python.
  • Seaborn: A Python data visualization library based on Matplotlib.
  • Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.
  • Plotly: A graphing library that makes interactive, publication-quality graphs.
  • Tableau: A powerful data visualization tool.
  • Power BI: A business analytics service by Microsoft.

Best Practices for EDA

  • Start with a clear process: Have a well-defined, process-based approach.
  • Focus on answering relevant questions: Prepare the data and visualizations that can answer the most relevant questions.
  • Be critical of your data: Strive to eliminate as much noise as possible.
  • Use domain knowledge: If you have domain knowledge, it will add value to the imputation.
  • Document your findings: Make sure that the data summaries, key findings, investigative process, conclusions are made clear.

tags: #machine #learning #eda #notebooks #examples
