Stock Prediction Using Machine Learning Methods

Introduction

The stock market is a dynamic marketplace where investors buy and sell ownership interests, securities, and company shares. It is a key component of the larger equity capital markets, facilitating the issuance and trading of shares in publicly listed companies. Companies go public to raise capital for expansion or debt repayment by issuing stocks on stock exchanges, allowing investors to become shareholders and benefit from the company’s growth and profits. The stock market’s significance has led to a strong interest in accurate market predictions and trend evaluations, helping participants make informed decisions. This interest has grown, attracting individual traders, market participants, and data analysts focusing on machine learning and artificial intelligence, all aiming to gain insights into market movements and make well-informed financial decisions.

This research investigates the intricate dynamics of stock markets against the backdrop of this growing interest. Both trading and investing in the stock market carry inherent risks, since share values are affected by a wide variety of factors. These include company performance, such as revenue and sales, as well as external influences such as governmental regulations, macroeconomic indicators, and the ever-changing relationship between supply and demand. Because of this complexity, extensive research is needed to design and develop software that applies advanced learning techniques such as artificial intelligence, machine learning, deep learning, and neural networks to forecast prices. Such forecasts are valuable because they can reduce the likelihood of monetary loss while increasing the possibility of monetary gain from investing. Traders can use these algorithms for algorithmic trading strategies, stock forecasting, and setting up trades with target prices and stop-loss levels.

This research aims to enhance predictive accuracy in the financial sector by exploring the application of machine learning algorithms for stock price prediction. Accurate stock price forecasting is critical for investors, financial analysts, and other market participants as it informs investment strategies and decision-making processes. To achieve this, the study employs a structured approach that integrates Agile Scrum methodologies with the OSEMN (Obtain, Scrub, Explore, Model, and iNterpret) framework. This combination ensures a flexible and iterative development process while maintaining a systematic and comprehensive approach to data analysis and model development.

The research evaluates the performance of six distinct algorithms: Linear Forecast, Naive Forecast, Simple Moving Average (SMA) with windows of 5 and 20, ARIMA (Autoregressive Integrated Moving Average), and Long Short-Term Memory (LSTM) with a 30-day prediction horizon. These algorithms are assessed using the Mean Absolute Error (MAE) metric, providing a robust measure of their predictive accuracy. By comparing these models, the research seeks to identify the most effective machine learning techniques for stock price prediction, ultimately contributing to more reliable and actionable financial insights.

Justification

Problem Statement

In the volatile stock market, traders and financial organizations face many obstacles in predicting stock values. Market instability and uncertainty make accurate prediction difficult. Economic, political, and technical factors, together with investor mood, all affect the stock market, and these varied and unpredictable influences complicate stock price changes.

The enormous volume and diversity of financial data complicate the issue further. Market indicators, news stories, social media sentiment, company financial statements, and historical price data all feed into the stock market. This data comes in many types and structures, making integration and processing difficult. Meaningful insights and powerful prediction models depend on data quality and consistency. However, the sheer volume and variety of financial data make it difficult to sort through the noise and discover relevant information, leaving traders in need of effective methods and tools to make sense of it.

The complex, non-linear interactions between the variables that drive stock prices are another difficulty. Such nuanced relationships violate the basic assumptions of typical linear models, so more advanced methods are required to identify these hidden patterns.

Another challenge is determining trade entry and exit points. Trading at the right time maximizes profits and minimizes losses, but price fluctuations and external factors make these points hard to identify. Without appropriate tools and methods, traders cannot execute transactions confidently and precisely. Trading in fast-changing markets also requires real-time data analysis: traders need up-to-date data and insights to make sound judgments. Traditional manual analysis may not process real-time data quickly enough, putting traders at a disadvantage in a dynamic market.

Market participants also suffer from errors in manual prediction. Given the stock market’s complexity and many influences, manual predictions tend to be biased and inaccurate. Predictions based on gut feeling or preconceived beliefs can lead to inconsistent results and ineffective trading tactics. Without dependable analytical tools and objective procedures, traders may struggle to anticipate market moves and sustain profits. Investor bias should therefore be considered when predicting stock values, since preconceived beliefs or emotional biases may influence traders’ financial decisions.

Proposed Solution

Using advanced machine learning methods, this research project aims to address the complexities of stock market analysis and enhance predictive accuracy. By analyzing historical price data, the research seeks to uncover underlying patterns and trends, providing vital insights into market behavior. The heterogeneity and complexity of financial data, which includes fundamental indicators and social media sentiment, present significant challenges. To overcome these obstacles, the study employs robust modeling techniques, such as Long Short-Term Memory (LSTM) models, which can help separate significant signals from background noise.

This research evaluates six distinct models: Linear Forecast, which applies a straightforward linear approach to future price prediction; Naive Forecast, which assumes that future prices will be the same as the most recent price; Simple Moving Average (SMA) with windows of 5 and 20, which smooths price data to identify trends; ARIMA (Autoregressive Integrated Moving Average), a comprehensive time series model combining autoregression, differencing, and moving average components; and LSTM (Long Short-Term Memory) with a 30-day prediction horizon, known for capturing temporal correlations and complex patterns in sequential data. These models are assessed using the Mean Absolute Error (MAE) metric to determine their predictive accuracy. By identifying the key variables driving stock price fluctuations and automating calculations, this research aims to provide traders with crucial information to inform their investment decisions, allowing them to focus on other strategic aspects of their investment methods.
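
As a rough illustration of the two simplest baselines and the evaluation metric, the sketch below computes MAE, the average of |actual - predicted|, for a one-step naive forecast and a one-step moving-average forecast. It is a minimal sketch rather than the study’s code: the function names, the window of 3, and the closing prices are all hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def naive_forecast(prices):
    """Predict each day's price as the previous day's price."""
    return prices[:-1]


def sma_forecast(prices, window):
    """Predict each day's price as the mean of the preceding `window` prices."""
    return [mean(prices[i - window:i]) for i in range(window, len(prices))]


# Hypothetical closing prices, for illustration only.
closes = [100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0]

# Each forecast is compared with the actual prices it was meant to predict.
naive_mae = mean_absolute_error(closes[1:], naive_forecast(closes))
sma3_mae = mean_absolute_error(closes[3:], sma_forecast(closes, 3))
```

Note that the two scores here cover slightly different date ranges, because the moving average needs a full window before it can produce its first prediction. For a fair comparison, all models should be scored on the same test dates.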

The typical process for creating a machine learning model to predict stock prices includes several stages: gathering historical data through an API, pre-processing the data, developing a forecasting model, and assessing the model’s performance. During pre-processing, rows with zero values are removed, duplicates are eliminated, and features are scaled. Key features are then selected, and valid data is chosen for forecasting stock prices. This article examines several popular machine learning and deep learning approaches, such as linear regression, moving average, naive forecasting, ARIMA (autoregressive integrated moving average), and LSTM (long short-term memory). The mean absolute error (MAE) is used to evaluate the performance of the price forecasting models.
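
The pre-processing steps described above could be sketched with Pandas roughly as follows. This is an assumption-laden illustration: the lowercase column names, the `preprocess` helper, and the choice of min-max scaling are hypothetical, since the text does not say which scaler or column names were used.

```python
import pandas as pd


def preprocess(df, price_cols=("open", "high", "low", "close")):
    """Drop duplicate rows and zero-priced rows, then min-max scale price columns."""
    cols = list(price_cols)
    cleaned = df.drop_duplicates()
    # Keep only rows where every price column is non-zero.
    cleaned = cleaned[(cleaned[cols] != 0).all(axis=1)].reset_index(drop=True)
    lo, hi = cleaned[cols].min(), cleaned[cols].max()
    scaled = cleaned.copy()
    # Min-max scaling to [0, 1]; assumes each column has at least two distinct values.
    scaled[cols] = (cleaned[cols] - lo) / (hi - lo)
    return scaled


# Hypothetical rows: one exact duplicate and one row with a zero open price.
raw = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-04"],
    "open": [10.0, 11.0, 11.0, 0.0, 12.0],
    "high": [12.0, 13.0, 13.0, 14.0, 16.0],
    "low": [9.0, 10.0, 10.0, 11.0, 11.0],
    "close": [11.0, 12.0, 12.0, 13.0, 15.0],
})
clean = preprocess(raw)
```

For simplicity the scaling statistics here are computed over the whole table. When a model is evaluated on held-out data, it is safer to compute them from the training portion only.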

Research Aim and Objective

The research aims to build six distinct machine learning models leveraging historical stock prices and assess their predictive accuracy using the Mean Absolute Error (MAE) metric. Through a comprehensive literature review, existing studies, research papers, and scholarly articles will be explored to identify methodologies and evaluate their effectiveness in predicting stock prices. By addressing limitations and proposing solutions for algorithmic prediction, the study seeks to enhance the efficacy of these models. Clean and structured datasets will be prepared to facilitate model training and evaluation. Subsequently, the most accurate model identified will be implemented into a user-friendly web application, forming the foundation of an accurate full-stack price prediction system.

Research Questions

How does the accuracy of different machine learning algorithms compare in predicting stock prices?
What are the key factors that significantly influence the performance of machine learning algorithms in stock price prediction?
What are the ethical considerations and dilemmas associated with using machine learning algorithms for stock price prediction?

Materials and Methods

Materials

Data Collection

The research used different datasets for different purposes. Historical financial data for NABIL bank’s stock was obtained from the NepseAlpha platform, while real-time and historical data for various stocks were fetched using the Yahoo Finance API. The NABIL dataset was primarily used for training and testing the models, while the Yahoo Finance API was used to fetch real-time stock data for the web application.

NepseAlpha

NepseAlpha provides data for all stocks listed on NEPSE, the Nepal Stock Exchange. The study uses the dataset for NABIL, a banking stock, for model training and testing. The dataset contains the date, open price, high price, low price, close price, percent change, and volume for the stock.

Yahoo Finance

As there is no API that allows real-time data integration for NEPSE, Yahoo Finance was used to fetch real-time data on foreign stocks for the web application. The returned data consists of the date, open price, high price, low price, close price, volume, dividends, and stock splits.

Methods

Data Processing and Analysis

The project used several tools and technologies to process and analyze data for stock price prediction. Python was used for data collection and storage, as its versatility and libraries make it well suited to data manipulation and analysis. NumPy was used for numerical manipulation and analysis, while Pandas was used for data cleaning and preprocessing, producing structured, cleaned data for the later analysis stages. NumPy and Pandas were also used for feature engineering, creating new features from the raw data such as day return, log day return, and weekday.
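
The three engineered features named above can be derived from a date and a closing price. The following is a minimal sketch under the assumption that the table has `date` and `close` columns. Those column names, the `add_features` helper, and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def add_features(df):
    """Add day return, log day return, and weekday columns to a price table."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    previous_close = out["close"].shift(1)
    # Simple return: (today - yesterday) / yesterday.
    out["day_return"] = out["close"] / previous_close - 1
    # Log return: ln(today / yesterday); log returns add up across days.
    out["log_day_return"] = np.log(out["close"] / previous_close)
    # Day of week as an integer, Monday = 0.
    out["weekday"] = out["date"].dt.dayofweek
    return out


# Hypothetical prices; 2024-01-01 was a Monday.
sample = pd.DataFrame({
    "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    "close": [100.0, 110.0, 99.0],
})
features = add_features(sample)
```

The first row has no previous close, so both return columns are NaN there. Such rows are typically dropped before model training.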

The exploratory data analysis (EDA) process was carried out using Python’s data visualization libraries Matplotlib and Seaborn. These libraries allowed for the creation of informative visualizations, which gave a deeper understanding of the data distribution and helped identify potential correlations between features and stock price movements.

The combination of these tools and technologies ensured a comprehensive and efficient data phase. By leveraging the power of Python and its libraries, the project was able to process and analyze large volumes of financial data from NepseAlpha and Yahoo Finance API effectively. The resulting clean and structured dataset, enriched with engineered features and insights from EDA, laid a solid groundwork for the subsequent stages of machine learning model development and web application creation.

Building Models

In the model-building phase of the project, the primary objective was to develop, evaluate, and compare six distinct algorithms for stock price prediction: Naive, Simple Moving Average with a 5-day (one trading week) window (SMA 5), Simple Moving Average with a 20-day (one trading month) window (SMA 20), Linear Regression, Autoregressive Integrated Moving Average (ARIMA), and Long Short-Term Memory (LSTM). These algorithms were selected for their suitability for time-series prediction tasks. The Python libraries Scikit-learn, TensorFlow, and Keras were used to implement and train the models. Scikit-learn provided an extensive collection of machine learning models, while TensorFlow and Keras offered a robust platform for developing and configuring deep learning models like LSTM.
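
Sequence models such as LSTM are usually trained on fixed-length sliding windows of past prices. The sketch below shows one way to build such windows with NumPy. The `make_windows` helper and its `lookback` and `horizon` parameters are hypothetical illustrations, and the study’s actual window configuration is not specified beyond the 30-day horizon mentioned earlier.

```python
import numpy as np


def make_windows(series, lookback, horizon=1):
    """Build (X, y) pairs from a 1-D series.

    X[i] holds `lookback` consecutive values; y[i] is the value
    `horizon` steps after the end of that window.
    """
    values = np.asarray(series, dtype=float)
    n_samples = len(values) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series too short for this lookback and horizon")
    X = np.stack([values[i:i + lookback] for i in range(n_samples)])
    first_target = lookback + horizon - 1
    y = values[first_target:first_target + n_samples]
    # Add a trailing feature axis: (samples, timesteps, features).
    return X[..., np.newaxis], y


# Toy series 0..9 so the windows are easy to inspect.
X, y = make_windows(range(10), lookback=3)
```

The trailing axis gives the `(samples, timesteps, features)` shape that Keras recurrent layers expect as input.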

The machine learning process began with preparing a dataset divided into training and testing sets. Each algorithm was meticulously configured by setting appropriate hyperparameters, such as window size, epochs, learning rates, and activation functions, to ensure accurate predictions and capture underlying patterns in stock price data. The Mean Absolute Error (MAE) metric was used to evaluate the performance of each algorithm. An iterative and interactive approach was adopted during the experimentation process using Jupyter Notebook, allowing for easy modification of hyperparameters and rapid comparison of the algorithms’ performance on the dataset.
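
For time series, the train/test split is normally chronological rather than random, so that the model is always evaluated on dates later than those it was fitted on. The sketch below pairs such a split with a straight-line trend fit as one possible reading of the linear forecast, scored with MAE. The 80/20 ratio, the helper names, and the prices are assumptions for illustration and are not taken from the study.

```python
import numpy as np


def chronological_split(values, train_fraction=0.8):
    """Split a series in time order: earliest part for training, rest for testing."""
    cut = int(len(values) * train_fraction)
    return values[:cut], values[cut:]


def linear_trend_forecast(train, steps):
    """Fit price = slope * t + intercept on the training data and extrapolate."""
    t = np.arange(len(train))
    slope, intercept = np.polyfit(t, train, deg=1)
    future_t = np.arange(len(train), len(train) + steps)
    return slope * future_t + intercept


# Hypothetical prices: a clean upward trend followed by two test points.
prices = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 27.0, 27.0])
train, test = chronological_split(prices)
forecast = linear_trend_forecast(train, len(test))
mae = float(np.mean(np.abs(test - forecast)))
```

Shuffling before splitting would let information from future prices leak into training, which tends to make the reported error look better than it would be in live use.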

The machine learning models were primarily tested on the NABIL dataset. Testing the algorithms on additional datasets gave an indication of their generalizability across different stocks and market conditions. After a comprehensive evaluation, the best-performing model was selected for further refinement and integration into the web application for stock price prediction. Techniques like hyperparameter tuning and cross-validation were applied to optimize the model’s performance and ensure its ability to generalize well to unseen stock price data. This rigorous evaluation and comparison highlighted the strengths and weaknesses of each algorithm for stock price prediction.

Exploratory Data Analysis (EDA)

EDA is an approach to analyzing data using visual techniques. Analysis of the Tesla stock price data shows that the 'Close' and 'Adj Close' columns contain identical values in every row. The distribution plots of the OHLC data show two peaks, which means the prices have clustered in two distinct regions. The boxplots show that only the volume data contains outliers, while the remaining columns are free of them.

Feature Engineering

Feature engineering derives valuable new features from the existing ones. A quarter is defined as a group of three months. Every company prepares its quarterly results and publishes them so that people can analyze the company’s performance, which makes quarter ends a potentially useful signal. Some additional columns were also added to help in training the model. The target feature, which the model is trained to predict, is a signal indicating whether or not to buy.
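
One way such a quarter flag and buy signal might be constructed is sketched below. The definitions are assumptions, since the text does not spell them out: `is_quarter_end` marks months divisible by three, and `target` is 1 when the next day’s close is higher than the current close. The column names and sample prices are hypothetical.

```python
import pandas as pd


def add_quarter_and_target(df):
    """Add a quarter-end flag and a next-day 'buy' target to a price table."""
    out = df.copy()
    out["date"] = pd.to_datetime(out["date"])
    # Assumed definition: March, June, September, and December close a quarter.
    out["is_quarter_end"] = (out["date"].dt.month % 3 == 0).astype(int)
    # Assumed definition: buy (1) if tomorrow's close exceeds today's close.
    out["target"] = (out["close"].shift(-1) > out["close"]).astype(int)
    # The last row has no following day, so its target is undefined; drop it.
    return out.iloc[:-1]


# Hypothetical prices around the end of a quarter.
sample = pd.DataFrame({
    "date": ["2024-02-28", "2024-03-01", "2024-03-04", "2024-04-01"],
    "close": [10.0, 12.0, 11.0, 13.0],
})
labeled = add_quarter_and_target(sample)
```

Because the target looks one day ahead, it must never be used as an input feature, only as the label.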

After selecting the features to train the model on, the data should be normalized because normalized data leads to stable and fast training of the model.
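
A common form of normalization is standardization to zero mean and unit variance. The sketch below assumes that approach, which the text does not confirm, and fits the statistics on the training rows only so that validation data does not influence them. The helper names and numbers are hypothetical.

```python
import numpy as np


def fit_standardizer(train):
    """Compute per-feature mean and standard deviation from training rows."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    # Guard against constant features to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize(x, mean, std):
    """Apply previously fitted statistics to any set of rows."""
    return (x - mean) / std


# Hypothetical two-feature data.
train_rows = np.array([[1.0, 10.0], [3.0, 30.0]])
valid_rows = np.array([[5.0, 20.0]])

mean, std = fit_standardizer(train_rows)
train_scaled = standardize(train_rows, mean, std)
valid_scaled = standardize(valid_rows, mean, std)
```

Reusing the training statistics for the validation rows keeps both sets on the same scale and avoids leaking validation information into training.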

Model Training and Validation

Among the three models trained, XGBClassifier has the highest performance, but it is prone to overfitting, as the difference between the training and validation accuracy is too high.
