Julia for Machine Learning: A Comprehensive Tutorial
Julia is a high-performance, dynamically typed programming language particularly well suited to numerical analysis and computational science. This article guides you through using Julia for machine learning. It aims to be accessible to both beginners and experienced practitioners, providing a solid foundation for building machine-learning solutions in Julia.
Setting Up Your Julia Environment
Before diving into machine learning, it's crucial to set up your Julia environment correctly.
Installation
It is highly recommended to install Julia natively on your machine. Download the Julia installer for your operating system and run it. After a successful installation, you will be able to run the julia command to enter the Julia REPL (Read-Eval-Print Loop), where you can write and run Julia code. To exit the REPL, call exit(). You can also write your code in any text editor and save it to files with the .jl extension. Visual Studio Code (VSCode) with the Julia extension is a popular choice for Julia development.
Package Management
Julia's package manager, Pkg, makes it easy to install and manage external libraries. To install a package, use the Pkg.add("PackageName") command.
```julia
using Pkg
Pkg.add("DataFrames")
Pkg.add("Plots")
Pkg.add("ScikitLearn")
```
Using Packages
Once a package is installed, you can use it in your Julia script or notebook with the using command.
```julia
using DataFrames
using Plots
using ScikitLearn
```
Including Helper Code
For many projects, you might receive helper code in the form of a Julia (.jl) file. To include this code in your Julia script or notebook, use the include("filename.jl") command. For example, include("readclassjson.jl") will include the code from the file readclassjson.jl.
Essential Julia Syntax and Concepts for Machine Learning
Julia has a simple syntax. If you're familiar with Python, it will be easy to start writing Julia. This tutorial covers only the features needed for machine learning and for solving the Titanic Kaggle competition.
Basic Linear Algebra
Basic linear algebra is built into the Julia standard library: every 1D array is a vector, and 2D arrays behave like matrices, much as NumPy arrays do in Python. You do not need any additional packages for this.
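A minimal sketch of the built-in linear algebra (the values are made up for illustration):

```julia
# Vectors and matrices are built into base Julia; no extra packages needed.
v = [1.0, 2.0, 3.0]      # a 1D array is a vector
A = [1.0 2.0; 3.0 4.0]   # a 2D array is a matrix

w = 2 .* v               # element-wise operations use the broadcasting dot
d = v' * v               # inner (dot) product: 1 + 4 + 9 = 14
B = A * A                # matrix multiplication

println(d)               # 14.0
```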
DataFrames.jl: Working with Tabular Data
To work with datasets, you have to install the external DataFrames.jl package. The Julia package manager is implemented as the Pkg module, so you import it and then use its add method to install any required packages. The DataFrames module provides the DataFrame type that you will use to construct datasets and manipulate data frame objects. Note that array indexing in Julia starts at 1, not at 0 as in most other languages.
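A small sketch of constructing and indexing a DataFrame; the column names and values here are made up for the example:

```julia
# using Pkg; Pkg.add("DataFrames")   # one-time installation
using DataFrames

# A small illustrative data frame.
df = DataFrame(Name=["Alice", "Bob", "Carol"], Age=[29, 35, 41])

println(size(df))        # (3, 2)
println(df[1, :Name])    # indexing starts at 1, so this prints "Alice"
```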
Selecting Data
You select data with df[&lt;rows&gt;, &lt;columns&gt;], specifying the range of rows to select in &lt;rows&gt; and the range of columns in &lt;columns&gt;. You can also use conditions to specify row ranges.
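A sketch of the selection syntax, using a made-up data frame:

```julia
using DataFrames

df = DataFrame(Name=["Alice", "Bob", "Carol"], Age=[29, 35, 41], Survived=[1, 0, 1])

first_two = df[1:2, :]              # rows 1-2, all columns
name_age  = df[:, [:Name, :Age]]    # all rows, two selected columns
over_30   = df[df.Age .> 30, :]     # a condition as the row selector

println(over_30.Name)               # ["Bob", "Carol"]
```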
Sorting Data
To sort data in a data frame, you can use the sort function.
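For example, with an illustrative data frame:

```julia
using DataFrames

df = DataFrame(Name=["Alice", "Bob", "Carol"], Age=[41, 29, 35])

by_age      = sort(df, :Age)            # ascending by the Age column
by_age_desc = sort(df, :Age, rev=true)  # descending order

println(by_age.Name)                    # ["Bob", "Carol", "Alice"]
```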
More Complex Data Extraction
You can use the select function for more complex data extraction from frames.
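A sketch of select with a derived column, using the source =&gt; function =&gt; target form (the data is made up):

```julia
using DataFrames

df = DataFrame(Name=["Alice", "Bob"], Age=[29, 35])

# Keep Name, and derive a new column from Age.
out = select(df, :Name, :Age => (a -> a .+ 1) => :AgeNextYear)

println(out.AgeNextYear)    # [30, 36]
```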
Grouping and Combining Data
You can use the groupby and combine functions to group data and show summary information for each group. The former groups data by the specified field or fields, and the latter adds summary columns, such as the number of rows in each group or the average value of some column in the group. You can also use a custom function to calculate summary data.
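This can be sketched as follows, with made-up data:

```julia
using DataFrames, Statistics

df = DataFrame(Sex=["male", "female", "female", "male"], Age=[22, 38, 26, 35])

groups = groupby(df, :Sex)                                   # split by a field
stats  = combine(groups, nrow => :Count, :Age => mean => :MeanAge)

println(stats)    # one row per group, with Count and MeanAge columns
```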
Plots.jl: Data Visualization
Using Plots.jl, you can create a lot of different graphs to analyze your data, similar to Matplotlib or Seaborn in Python.
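A minimal sketch of the kinds of plots you might create during data analysis (the data here is generated for illustration):

```julia
using Plots

x = 1:10
scatter(x, x .^ 2, label="squares")    # scatter plot
plot!(x, x .^ 2, label="line")         # add a line to the same axes
histogram(randn(1000), bins=30)        # distribution of a numeric column
savefig("example_plot.png")            # save the current figure to a file
```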
A Practical Example: Solving the Titanic Kaggle Competition
"Titanic - Machine Learning from Disaster" is one of the first educational machine learning problems that you might see in many books, articles, or courses. In this task, you are provided with a dataset of data about Titanic passengers. Each passenger data includes an ID, name, sex, ticket cost, ticket class, cabin number, port of embarkation, and number of family members. For passengers in this dataset, it's known whether they survived or not, and the result is recorded in the "Survived" column. If the passenger survived, the value is 1, if not then 0. Formally, this is called a labeled or training dataset. There is also the second dataset with the same data about other passengers but without the "Survived" column. In other words, this dataset contains only the features matrix but does not have the labels vector. This is called the testing dataset. The task is to train a machine learning model on the training dataset and use this model to predict the "Survived" column values in the testing dataset.
Data Exploration and Preparation
Let's walk through the key steps involved in tackling the Titanic competition using Julia.
Loading the Data
First, load the training dataset.
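Loading a CSV file is typically done with CSV.jl. The snippet below writes a tiny stand-in file so the example is self-contained; with the real Kaggle data you would simply call CSV.read("train.csv", DataFrame):

```julia
using CSV, DataFrames

# A tiny stand-in for train.csv (two rows, three columns).
write("train_sample.csv", "PassengerId,Survived,Age\n1,0,22\n2,1,38\n")

train_df = CSV.read("train_sample.csv", DataFrame)
println(size(train_df))     # (2, 3)
first(train_df, 5)          # preview the first rows
```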
Inspecting the Data
As you can see, this dataset has 891 rows and 12 columns, with basic data about passengers such as "Name", "Sex", and "Age". The summary table shows information about each column: the min, max, mean, and median of its data.
Feature Engineering
The basic goal of data preparation is to transform these columns into a features matrix and a labels vector. The labels vector is ready: this is the "Survived" column with numeric values. Let's look at the nmissing and eltype for each column. The nmissing shows the number of missing values in the corresponding column, and the eltype shows the type of its values. The matrix should contain only numbers, but there are many columns of "string" data type. Also, the matrix should not have missing values, but there are missing values in the Age, Cabin, and Embarked columns.
Handling Missing Values
As the previous table shows, the Age, Embarked, and Cabin columns contain missing values. Age has 177 missing values. Removing these rows is not a good idea, because we would lose about 20% of the dataset. Instead, let's fill them with something, for example the median value; the median age is 28, as displayed in the description table. The Cabin column contains 687 missing values, which is more than 50% of the dataset. There is too little data in this column to be useful for predictions, and it's difficult to guess what the missing values should be when more data is missing than exists.
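The two fixes described above (fill Age with its median, drop Cabin entirely) can be sketched like this, on a small made-up data frame:

```julia
using DataFrames, Statistics

df = DataFrame(Age=[22, missing, 38, missing],
               Cabin=[missing, "C85", missing, missing])

# Fill missing ages with the median of the observed values.
med = median(skipmissing(df.Age))
df.Age = coalesce.(df.Age, med)

# Drop the Cabin column entirely: it is too sparse to be useful.
select!(df, Not(:Cabin))

println(df.Age)    # no missing values remain
```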
Encoding Categorical Variables
As explained before, all data should be encoded as numbers, but Name, PassengerId, Sex, Ticket, and Embarked are strings. The Name and PassengerId values are unique for each passenger, so the ML model can't use them to split the data into categories or classify it. For the other string columns, we need to encode all text values as numbers. To do that, we first need to discover all unique values of these columns. Grouping the dataset by the Embarked column shows all possible values and their counts: only "S", "C", and "Q" appear, so it's easy to encode them as S=1, C=2, and Q=3. The Ticket column, however, has 680 different values, which is more than 50% of the data, while we only need to predict two categories: survived or not survived. Without additional processing to reduce the number of categories, we're not sure this column can help the model make good predictions. Although this goes beyond our current basic model, as additional practice you can experiment with this column to improve prediction results: for example, try grouping tickets into more general categories and encoding each category with a unique number.
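The encoding step can be sketched as follows; the particular integer codes are an arbitrary but consistent choice:

```julia
using DataFrames

df = DataFrame(Sex=["male", "female", "female"], Embarked=["S", "C", "Q"])

# Mapping tables from string categories to integer codes.
sex_code      = Dict("male" => 1, "female" => 2)
embarked_code = Dict("S" => 1, "C" => 2, "Q" => 3)

df.Sex      = [sex_code[s] for s in df.Sex]
df.Embarked = [embarked_code[e] for e in df.Embarked]

println(df)    # all values are now numeric
```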
Now all the string data is categorized, and all values replaced with numbers. You can see that all columns contain only numeric data and there are no missing values in them. Now, we're ready to train a machine learning model on our dataset.
Exploratory Data Analysis (EDA)
Here we see that 340 passengers survived. Interestingly, twice as many women as men survived in the training dataset. Assuming that the training and testing datasets are distributed similarly, a machine learning model trained on this data is likely to predict that women in first or second class had a much higher chance of survival than others.
Now it really looks like a matrix - or, to be more precise, like a system of algebraic linear equations written in matrix form. Data in matrix format is exactly what most machine learning algorithms expect to get as an input.
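Converting a fully numeric data frame to a matrix is a one-liner; the columns here are stand-ins for the prepared Titanic features:

```julia
using DataFrames

df = DataFrame(Pclass=[3, 1, 2], Sex=[1, 2, 2], Age=[22.0, 38.0, 26.0])

X = Matrix(df)      # the features matrix most ML algorithms expect

println(size(X))    # (3, 3)
```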
Model Training and Evaluation
For machine learning, we will use the SciKitLearn.jl library, which replicates the scikit-learn library for Python. It provides an interface to commonly used machine learning models like Logistic Regression, Decision Tree, or Random Forest. SciKitLearn.jl is not a single package but a rich ecosystem of packages, and you need to select which of them to install and import; a list of supported models is available in the SciKitLearn.jl documentation. Some of them are native Julia models, others are imported from Python. SciKitLearn.jl also has many tools to tune the learning process and evaluate results.
For this "Titanic" task, we will use the RandomForestClassifier model from the DecisionTree.jl package. Usually it works well for classification problems. Also, we will use the Cross Validation module to calculate accuracy of model predictions from the SciKitLearn.CrossValidation package. Then we will implement the training process. First we need to split the training dataset into features matrix and labels vector. Then we need to create the RandomForestClassifier model and train it using this data.
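The training step can be sketched as follows. The X and y here are synthetic stand-ins for the Titanic features matrix and labels vector, since the real values come from the prepared data frame:

```julia
using DecisionTree   # provides RandomForestClassifier with a fit!/predict interface

# Synthetic stand-ins: 200 samples, 3 features, label depends on the first feature.
X = rand(200, 3)
y = Int.(X[:, 1] .> 0.5)

model = RandomForestClassifier(n_trees=100)
fit!(model, X, y)

train_preds = predict(model, X)
accuracy = sum(train_preds .== y) / length(y)
println(accuracy)    # high on the data the model was trained on
```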
The cross validation splits the X and y arrays into 5 parts (folds) and returns an array with the accuracy for each of them. The minimum function then selects the worst accuracy from this array, meaning all the others are at least as good. The achieved accuracy is above 0.78, i.e. more than 78%, on our training data. You can try to improve this value by selecting different models or by tuning their hyperparameters. For example, you can increase the number of trees (n_trees) from 100 to 1000, or reduce it to 10, and see how the accuracy changes.
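The cross-validation step described above can be sketched like this, again with synthetic stand-in data:

```julia
using DecisionTree
using ScikitLearn.CrossValidation: cross_val_score

# Synthetic stand-ins for the Titanic features matrix and labels vector.
X = rand(200, 3)
y = Int.(X[:, 1] .> 0.5)

model = RandomForestClassifier(n_trees=100)

accuracies = cross_val_score(model, X, y, cv=5)   # one accuracy per fold
worst = minimum(accuracies)                       # the most pessimistic estimate
println(worst)
```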
Making Predictions on the Test Set
Now that the model is ready, it's time to apply it to the data from the test.csv file, which does not have the "Survived" labels. The first thing to do is save the PassengerId column to a separate variable. Then you can see the same problems with the data: missing values and string columns. You need to apply exactly the same transformations as you did to the training dataset, except for removing rows: Kaggle requires predictions for every row, so you can only fill in missing values. Fortunately, the Embarked column has no missing values here, so there is no need to fix it. This dataset does have a single missing value in the Fare column, which we did not see in the training set.
Generating the Submission File
Now it's time to submit it to Kaggle. The competition requires a CSV file with two columns: "PassengerId" and "Survived". You already have all this data. The first line of this code constructs the submit_df data frame with the PassengerId column that was saved previously and the Survived column with predictions for each passenger ID. The second line saves this submit_df to the submission.csv file. Finally, go to the Kaggle competition page, press the "Submit Predictions" button, upload the submission.csv file, and see your result. The prediction accuracy is 0.76555 which is more than 76% and is close to the accuracy that we got on the training dataset. Not bad for the first time, but you can keep going: play with data, try different models, change their hyperparameters, surf the Internet for articles and Jupyter notebooks of other people who solved the Titanic competition before.
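The two lines described above can be sketched as follows; the IDs and predictions are stand-in values for those computed earlier in the notebook:

```julia
using CSV, DataFrames

# Stand-ins for the saved PassengerId column and the model's predictions.
passenger_ids = [892, 893, 894]
predictions   = [0, 1, 0]

submit_df = DataFrame(PassengerId=passenger_ids, Survived=predictions)
CSV.write("submission.csv", submit_df)    # the file to upload to Kaggle
```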
Deploying Your Model
It's fun to play around with machine learning on your computer, but that alone does not have much impact on real-world problems. Customers usually do not have Jupyter notebooks and do not train models themselves. They need a simple tool that helps them make decisions based on predictions from the data they have. That is why what really matters is how your models work in production. First, you need to save the model from the notebook to a file. For this you can use the JLD2.jl module, which serializes Julia objects to an HDF5-compatible format (well known to Python data scientists) and saves them to a file.
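The save/load round trip can be sketched like this; the Dict is a stand-in object, since in the real workflow you would save the trained RandomForestClassifier:

```julia
using JLD2

# Stand-in for the trained model object.
model = Dict("n_trees" => 100)

@save "titanic_model.jld2" model     # serialize to an HDF5-compatible file

model = nothing
@load "titanic_model.jld2" model     # restore it, e.g. inside the web application
println(model["n_trees"])            # 100
```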
We're done with our work with Jupyter Notebook now. You should write all the following code as a separate application. Now you can create a text file titanic.jl which will contain the code of the web application that you will write soon. Use any text editor for this - VS Code with the Julia extension is a good choice.
This code imports the required modules first. As you can see, just two modules are required to run the prediction process: JLD2 to load the model object, and DecisionTree to run the predict function for the RandomForest…
Neural Networks with Flux.jl: A Quickstart
Flux.jl is the most popular deep learning framework in Julia. It provides a very elegant way of programming neural networks. Unfortunately, since Julia is still not as popular as Python, there aren't as many tutorials and guides on how to use it. Also, Julia is improving very fast, so things can change a lot in a short amount of time.
Building a Simple Classification Neural Network
The goal of this tutorial is to build a simple classification neural network. This will be enough for anyone who is interested in using Flux: after learning the very basics, the rest is mostly a matter of altering network architectures and loss functions.
Generating a Dataset
Instead of importing data from somewhere, let’s do everything self-contained.
```julia
function generate_real_data(train_size)
    r  = rand(train_size)
    θ  = 2π * rand(train_size)
    x1 = r .* cos.(θ)
    x2 = r .* sin.(θ)
    return vcat(x1', x2')
end

function generate_fake_data(train_size)
    r  = rand(train_size) .+ 0.5
    θ  = 2π * rand(train_size)
    x1 = r .* cos.(θ)
    x2 = r .* sin.(θ) .+ 0.5
    return vcat(x1', x2')
end

# Creating our data
train_size = 5000
real = generate_real_data(train_size)
fake = generate_fake_data(train_size)

# Visualizing
scatter(real[1, 1:500], real[2, 1:500])
scatter!(fake[1, 1:500], fake[2, 1:500])
```
Defining the Neural Network Architecture
The creation of Neural Network architectures with Flux.jl is very direct and clean (cleaner than any other Library I know). Here is how you do it:
```julia
function NeuralNetwork()
    return Chain(
        Dense(2, 25, relu),
        Dense(25, 1, x -> σ.(x))
    )
end
```
The code is very self-explanatory. The first layer is a dense layer with input 2, output 25, and relu as the activation function. The second is a dense layer with input 25, output 1, and a sigmoid activation function. The Chain ties the layers together.
Training the Model
Next, let’s prepare our model to be trained.
```julia
# Organizing the data in batches
X = hcat(real, fake)
Y = vcat(ones(train_size), zeros(train_size))
data = Flux.Data.DataLoader((X, Y'), batchsize=100, shuffle=true);

# Defining our model, optimization algorithm and loss function
m = NeuralNetwork()
opt = Descent(0.05)
loss(x, y) = sum(Flux.Losses.binarycrossentropy(m(x), y))
```
In the code above, we first organize our data into one single dataset. We use the DataLoader function from Flux, which helps us create the batches and shuffle our data. Then, we instantiate our model and define the loss function and the optimization algorithm. In this example, we are using gradient descent for optimization and binary cross-entropy for the loss function.
Everything is ready, and we can start training the model.
```julia
using Statistics: mean

ps = Flux.params(m)
epochs = 20

for i in 1:epochs
    Flux.train!(loss, ps, data, opt)
end

println(mean(m(real)), " ", mean(m(fake)))   # print the mean prediction per class
```
Visualizing the Results
Finally, the model is trained, and we can visualize its performance against the dataset.
```julia
scatter(real[1, 1:100], real[2, 1:100], zcolor=m(real)')
scatter!(fake[1, 1:100], fake[2, 1:100], zcolor=m(fake)', legend=false)
```
Additional Resources for Learning Julia
- Introduction to Julia for mathematics undergraduates.
- Julia Programming: A Hands-On Tutorial, and Numerical Computing in Julia by Martín D. Maas.
- Zero2Hero intensive workshop by George Datseris.
- From zero to Julia! by Aurelio Amerio.
- Programming in Julia (Quantitative Economics) - by Jesse Perla, Thomas J. Sargent, and John Stachurski.
- Julia language: a concise tutorial by Antonello Lobianco.
- Basics of Projects Example by Rob Farrow.
- Programación básica en Julia and Claves para programar en Julia by Helios De Rosario.
- Grundlagen der Programmiersprache Julia and Statistik mit Julia by Georg Kindermann.
tags: #Julia #MachineLearning #Tutorial

