Deep Learning Notebook Examples: A Practical Guide

Deep learning models are typically large and computationally demanding, yet they are increasingly being adapted for mobile deployment. This article explores practical deep learning examples, focusing on training models and understanding the nuances of model development, deployment, and performance evaluation.

Integrating TensorFlow Lite with ArcGIS API for Python

One innovative approach involves integrating the training of TensorFlow Lite models with the ArcGIS API for Python. This integration allows for the creation of deep learning models that are both compact and suitable for mobile deployment. By leveraging the TensorFlow Lite framework, developers can train models for specific applications, such as classifying plant species, and create deployable files for direct inferencing on mobile devices.

Workflow Overview

The example workflow focuses on training a deep learning model to classify plant species and create corresponding files to be deployed for direct inferencing.

Prerequisites

To successfully execute this workflow, certain prerequisites must be met:

  • Training Dataset: A labeled image dataset of various plant species is required. The dataset should be appropriately sized, with consideration given to the available memory resources.
  • ArcGIS Notebook Server Manager: Administrative access to ArcGIS Notebook Server Manager is needed to adjust the memory limit of the notebook environment. The memory limit should be set to at least 15GB, depending on the size of the training data. Standard and Advanced notebook environments have default memory limits of 4GB and 6GB, respectively.
  • Raster Server (Conditional): If direct access to the training dataset is unavailable, a Raster Server is needed to generate suitable training data in the required format.

Data Preparation and Visualization

Preparing Data with ArcGIS API for Python

The prepare_data() function within the ArcGIS API for Python plays a crucial role in preparing data for deep learning workflows. This function automates the data preparation process by reading training samples and applying various transformations and augmentations to the training data. These augmentations are vital for training robust models with limited data and preventing overfitting.

For detailed information on the prepare_data() function parameters, refer to the arcgis.learn API reference.

Visualizing Data Samples

After preparing the data, the show_batch() function can be used to visualize samples, providing a visual check of the prepared data.

Model Loading and Training

Loading the Model Architecture

The Feature Classifier model in arcgis.learn is designed to determine the class of each feature. It requires specific parameters:

  • backbone: An optional string specifying the backbone convolutional neural network model used for feature extraction. The default is "resnet34." Supported backbones include the ResNet family and specified Timm models (experimental support) from backbones().
  • backend: An optional string controlling the backend framework, with 'pytorch' as the default.

An example of loading the model is:

model = FeatureClassifier(data, backbone="MobileNetV2", backend="tensorflow")

Determining the Learning Rate

The ArcGIS API for Python employs fast.ai's learning rate finder, exposed through the lr_find() method, to identify an optimal learning rate for training a robust model. Once the learning rate is established, it can be reused as a fixed value for subsequent retraining runs.
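
The idea behind a learning rate range test can be sketched independently of fast.ai. The toy example below, a hypothetical illustration rather than the actual lr_find() implementation, scans candidate rates on a simple quadratic loss and picks the one that reduces the loss most after a single gradient step:

```python
# Illustrative learning-rate range test on a toy quadratic loss f(w) = w**2.
# One gradient step is taken per candidate rate; a good rate sits where the
# loss drops fastest, before larger rates start to diverge.
import numpy as np

def lr_range_test(w0=10.0, rates=None):
    if rates is None:
        rates = np.logspace(-4, 0, 20)   # candidate rates from 1e-4 to 1
    losses = []
    for lr in rates:
        w = w0 - lr * 2 * w0             # one gradient step (grad of w**2 is 2w)
        losses.append(w ** 2)            # loss after the step
    best = rates[int(np.argmin(losses))]
    return rates, np.array(losses), best

rates, losses, best = lr_range_test()
```

Real learning rate finders apply the same principle over mini-batches of actual training data rather than a closed-form loss.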

Model Fitting

The fit() method is used to train the model, with the epoch parameter defining the number of times the model is exposed to the entire training dataset. Each epoch allows the model to learn and adjust its weights based on the data. While the example suggests running the model for three epochs for testing purposes, it is recommended to start with 25 epochs for a more accurate model suitable for deployment.

Model Validation and Deployment

Visualizing Results

To validate the model's results within the notebook, the show_results() method can be used to compare the model's predictions with random ground truth images.

model.show_results(rows=4, thresh=0.2)

Saving the Model

Once the accuracy of the trained model is confirmed, it can be saved for future deployment. By default, the model is saved as a .dlpk file in the models subfolder within the training data folder.

model.save("Plant-identification-25-tflite", framework="tflite")

Deploying the Model

The saved .dlpk file can be deployed with other datasets and shared within an organization.

Practical Deep Learning with Jupyter Notebook and Google Colab

Embarking on deep learning projects requires hands-on experience. The sections below trace a natural progression: starting small in Jupyter Notebook, scaling up with Google Colab, and moving to high-performance computing for the largest workloads.

Starting Small with Jupyter Notebook

Jupyter Notebook is a great starting point for beginners because of its interactive coding environment. It allows you to execute code block by block, making debugging and testing easier. It is also an excellent way to understand the limits of your PC's resources before transitioning to more powerful computing options, especially for bioinformatics problems.

Begin by practicing basic machine learning (ML) concepts with datasets such as:

  • Iris Dataset: A classic dataset containing measurements of iris flowers (sepal length, sepal width, petal length, petal width) used for classification and regression tasks.
  • MNIST Dataset: Handwritten digit images for classification.
  • CIFAR Dataset: A more complex dataset containing images of animals, cars, and other objects for multi-class classification.

The focus should be on understanding the mechanics of model training, including loading data and designing architectures using frameworks like Keras and PyTorch. Deep learning is an iterative process that involves tweaking parameters to observe their impact on results.
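
Those mechanics, loading a dataset, splitting it, fitting a model, and scoring it, can be practiced end to end in a few lines. The sketch below uses scikit-learn's small neural network on the Iris dataset as a lightweight stand-in for a Keras or PyTorch model:

```python
# Minimal end-to-end training loop on the Iris dataset: load data, split it,
# fit a small neural network, and evaluate on held-out samples.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_train, y_train)               # the model adjusts its weights here
acc = clf.score(X_test, y_test)         # accuracy on unseen data
```

Swapping the classifier, the hidden-layer sizes, or the split ratio and rerunning is exactly the kind of iteration the paragraph above describes.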

Tutorials may contain outdated methods, highlighting the dynamic nature of deep learning frameworks and the need for adaptability.

Scaling Up with Google Colab

Google Colab, a cloud-based platform with access to GPUs, facilitates the training of larger models. Colab is easy to set up and supports seamless dataset imports.

Switching to Colab becomes necessary when dealing with larger datasets or encountering memory and storage limitations in Jupyter Notebook. Error messages such as "Segmentation fault (core dumped)" often indicate the need for more computational resources.

Scaling Up Further with HPC

Colab is excellent, but it is not a High-Performance Computing (HPC) cluster. For extensive datasets, such as the entire human genome sequence, HPC clusters or cloud infrastructure become essential.

HPCs enable hyperparameter tuning, in which variables such as learning rate, network depth, and batch size are adjusted to optimize model performance. This tuning process is crucial for ensuring that the final model performs well on the specific data at hand.
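
A tuning sweep of this kind is often expressed as a grid: every combination of the hyperparameters above becomes one independent training job. A minimal sketch (the values shown are illustrative, not recommendations):

```python
# Build a hyperparameter grid: each combination of learning rate, depth, and
# batch size becomes one training job that can run independently on a cluster.
from itertools import product

learning_rates = [1e-4, 1e-3, 1e-2]
depths = [2, 4]
batch_sizes = [32, 64]

jobs = [
    {"lr": lr, "depth": d, "batch_size": bs}
    for lr, d, bs in product(learning_rates, depths, batch_sizes)
]
# On an HPC system, each dict would typically be submitted as one array-task
# and its validation metrics logged for later comparison.
```

Starting with a small grid like this keeps the number of jobs (here 3 × 2 × 2 = 12) manageable while still revealing which hyperparameters matter.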

HPC environments lack the interactive workflow of notebooks or Colab, so a solid understanding of the code's functionality is required up front. Automation is a key advantage, allowing many models to be trained without manual adjustments. It is advisable to start with a limited number of hyperparameters to minimize potential errors and to ensure metrics are tracked properly.

Understanding the Training Process

Data Splitting

Before training, the dataset is divided into three parts:

  • Training set (~60-80%): The model learns from this data.
  • Validation set (~10-20%): Used during training to check how well the model generalizes and to guide adjustments; it helps detect and prevent overfitting.
  • Test set (~10-20%): A final, unseen dataset to evaluate model performance.

This division prevents the model from memorizing the data instead of learning underlying patterns.
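
One common way to produce the three-way split described above is two successive calls to scikit-learn's train_test_split: first carve off the test set, then split the remainder into training and validation portions (the 60/20/20 ratio here matches the ranges given above):

```python
# 60/20/20 train/validation/test split in two stages.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # placeholder features
y = np.arange(1000) % 2              # placeholder labels

# Stage 1: hold out 20% as the final, unseen test set.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Stage 2: 25% of the remaining 80% = 20% of the total for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0)
```

The test set is touched exactly once, at the very end, which is what keeps its evaluation honest.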

Loss Function & Optimization

A loss function quantifies the error in the model's predictions. The optimizer, such as Adam or SGD, adjusts the model's parameters to minimize this loss using gradient descent.

Common loss functions include cross-entropy loss for classification problems and mean squared error (MSE) for regression problems.
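
Both loss functions are simple enough to write out by hand. The sketch below implements them in plain NumPy for a single prediction:

```python
# Cross-entropy (classification) and mean squared error (regression),
# written out explicitly for single examples.
import numpy as np

def cross_entropy(probs, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -np.log(probs[true_class])

def mse(y_pred, y_true):
    """Mean of squared differences between predictions and targets."""
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

# A confident, correct classification gives a low cross-entropy loss.
ce = cross_entropy(np.array([0.7, 0.2, 0.1]), true_class=0)
# Regression error is averaged over all predicted values.
err = mse([2.5, 0.0], [3.0, -0.5])
```

In practice the framework's built-in versions (e.g. Keras or PyTorch loss classes) are used, since they also supply the gradients the optimizer needs.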

Handling Data Imbalance

Imbalanced datasets, common in bioinformatics, can bias the model. Strategies to address this include:

  • Oversampling the minority class (duplicating rare samples)
  • Undersampling the majority class (removing common samples)
  • Using weighted loss functions (penalizing incorrect minority-class predictions more heavily)

Note that these strategies should not all be applied at once. Try them individually and select whichever has the least negative impact on training time and model performance.
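
For the weighted-loss strategy, the per-class weights are typically set inversely proportional to class frequency. scikit-learn computes them directly, and most frameworks accept the result as a class_weight argument:

```python
# Compute inverse-frequency class weights for a 90/10 imbalanced label set.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 90 + [1] * 10)    # 90 majority samples, 10 minority

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# "balanced" uses n_samples / (n_classes * count_per_class):
# the rare class gets weight 5.0, the common class about 0.56, so a mistake
# on a minority sample costs roughly 9x more in the loss.
```

These weights are then passed to the loss function (or the model's fit call) so the optimizer pays proportionally more attention to rare classes.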

Performance Metrics

Evaluating Model Performance

Accuracy alone is not always a reliable metric, especially with imbalanced datasets. Key metrics include:

  • F1 Score: Balances precision and recall.
  • AUC (Area Under the Curve): Measures the model's ability to distinguish between classes.
  • MCC (Matthews Correlation Coefficient): Recommended for imbalanced datasets, considering true positives, true negatives, false positives, and false negatives.
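
All three metrics are available in scikit-learn. The small imbalanced example below shows why accuracy alone misleads: the classifier misses half the positives, yet accuracy still reads 90%:

```python
# Accuracy vs F1 vs MCC vs AUC on a small imbalanced example
# (8 negatives, 2 positives; the model misses one of the two positives).
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.3, 0.4, 0.9])

acc = accuracy_score(y_true, y_pred)    # 0.9 -- looks strong
f1 = f1_score(y_true, y_pred)           # ~0.67 -- tells a different story
mcc = matthews_corrcoef(y_true, y_pred)
auc = roc_auc_score(y_true, scores)     # uses raw scores, not hard labels
```

Note that AUC is computed from the model's continuous scores rather than the thresholded predictions, which is why it can stay high even when hard predictions miss positives.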

Learning Curves Visualization

Learning curves, which plot training and validation loss over time, provide insights into model performance. Ideally, both losses should decrease and stabilize. A significant gap between the curves indicates overfitting, where the model memorizes training data but fails to generalize to new data.
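
The overfitting signal described above, validation loss bottoming out and rising while training loss keeps falling, can be checked numerically from the two loss histories (the values below are illustrative):

```python
# Detect the classic overfitting pattern from recorded loss histories.
import numpy as np

train_loss = np.array([1.00, 0.60, 0.40, 0.25, 0.15, 0.08])
val_loss   = np.array([1.10, 0.70, 0.55, 0.50, 0.52, 0.60])

gap = val_loss - train_loss
# Overfitting: validation loss has risen above its minimum while the
# train/validation gap has widened since training began.
overfitting = bool(val_loss[-1] > val_loss.min() and gap[-1] > gap[0])
```

Here validation loss is lowest at epoch 3 and climbs afterwards while the gap widens, exactly the divergent-curves picture that signals the model is memorizing rather than generalizing.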
