Improving stock market prediction

yfinance-multi-data-plot.png

In a previous post I trained a machine learning model (RandomForestClassifier) to predict the behavior of a stock's share price, determining whether the price would rise or fall.

However, the performance of the model was very poor. With an accuracy of about 50%, it was as effective as making decisions by tossing a coin 😖.

As promised, I now bring an improved model.

Python's scikit-learn library includes a huge number of tools: many machine learning models, pre-processing algorithms, neural network models, score metrics, and more. An interesting tool is the VotingClassifier.

The VotingClassifier in scikit-learn is a meta-estimator that combines multiple machine learning models to improve prediction accuracy. It uses either hard voting, which selects the class label with the majority votes, or soft voting, which considers the predicted probabilities from each model.
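The difference between the two voting modes can be sketched with a toy example. The three probabilities below are made up for illustration; they are not from the post's models:

```python
# Toy sketch of hard vs. soft voting, using made-up class-1
# probabilities from three hypothetical models.
probs = [0.9, 0.4, 0.45]

# Hard voting: each model casts one vote for its predicted class,
# and the majority wins.
hard_votes = [1 if p > 0.5 else 0 for p in probs]          # [1, 0, 0]
hard_prediction = 1 if sum(hard_votes) > len(hard_votes) / 2 else 0

# Soft voting: average the predicted probabilities, then threshold.
mean_prob = sum(probs) / len(probs)                        # ~0.583
soft_prediction = 1 if mean_prob > 0.5 else 0

print(hard_prediction, soft_prediction)  # 0 1
```

Note that the two modes can disagree: here one confident model is outvoted under hard voting, but its high probability tips the average under soft voting.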

The code below uses yfinance to download the financial data of Microsoft (MSFT) from 2022 to 2025, flattens the multi-index columns into single-level feature names, and casts the time index to datetime. Next, it engineers features that are relevant for financial analysis (returns, a simple moving average, and the target) and drops the rows with NA cells. We then divide the data into predictors and response, scale the predictors, and split the data into train and test sets (in a 9:1 ratio).

At this stage, we instantiate three models:

- a RandomForestClassifier,
- a GradientBoostingClassifier, and
- an MLPClassifier (a neural network),

and use VotingClassifier to combine them into our ensemble_model.

Finally, we train our ensemble_model and use the trained model to predict the test data.

import yfinance as yf
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, \
    GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, \
    recall_score, f1_score


def retrieve_clean_data(symbol: str, start_date: str, end_date: str):
    """
    Retrieve the stock market data:
    - symbol: Stock abbreviation, e.g. "AAPL", "MSFT"
    - start_date: in format YYYY-MM-DD
    - end_date: in format YYYY-MM-DD

    Returns a pandas DataFrame
    """
    data = yf.download(symbol, start=start_date, end=end_date)

    if isinstance(data.columns, pd.MultiIndex):
        data.columns = data.columns.get_level_values(0)

    data = data.rename(columns={
        "Open": "open",
        "High": "high",
        "Low": "low",
        "Close": "close",
        "Volume": "volume"
    })

    data.index = pd.to_datetime(data.index)
    return data.sort_index()


def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """
    Feature Engineering. Creates the features:
    - return: decimal equivalent of percentage change
    - ma10: moving average over 10 days
    - target: variable indicating raising (1) or falling (0)
      of the share price.
    """
    df['return'] = df['close'].pct_change()
    df['ma10'] = df['close'].rolling(window=10).mean()
    df["target"] = (df["return"] > 0).astype(int)
    df.dropna(inplace=True)
    return df


def data_preparation(df: pd.DataFrame, test_size: float = 0.1):
    """
    Takes the retrieved data, creates the important features,
    defines the predictors and target, and
    returns the split data.
    """
    df = feature_engineering(df)
    features = ["open", "high", "low", "close", "volume", "ma10"]
    X = df[features]
    y = df["target"]

    # Scaling
    scale = StandardScaler()
    X_scale = scale.fit_transform(X)

    # Train-Test split
    X_train, X_test, y_train, y_test = train_test_split(
        X_scale, y, test_size=test_size, shuffle=False
    )
    return X_train, X_test, y_train, y_test


def main():
    symbol = "MSFT"
    start_date = "2022-01-01"
    end_date = "2026-01-01"
    df = retrieve_clean_data(symbol, start_date, end_date)
    X_train, X_test, y_train, y_test = data_preparation(df)

    # Model definition and training
    rf_model = RandomForestClassifier(
        n_estimators=100,
    )
    gb_model = GradientBoostingClassifier(
        n_estimators=10,
        learning_rate=0.1
    )
    nn_model = MLPClassifier(
        hidden_layer_sizes=(50, 25),
        activation="relu",
        solver="adam",
        max_iter=100
    )
    ensemble_model = VotingClassifier(
        estimators=[
            ('rf', rf_model),
            ('gb', gb_model),
            ('nn', nn_model)
        ], voting='soft')
    ensemble_model.fit(X_train, y_train)

    # Evaluation
    y_hat = ensemble_model.predict(X_test)

    accuracy = accuracy_score(y_test, y_hat)
    precision = precision_score(y_test, y_hat)
    recall = recall_score(y_test, y_hat)
    f1 = f1_score(y_test, y_hat)

    print(f'ML Accuracy: {accuracy:.4f}, Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}, F1 Score: {f1:.4f}')


if __name__ == "__main__":
    main()

The results of this strategy:

ML Accuracy: 0.6400, Precision: 0.6143
Recall: 0.8269, F1 Score: 0.7049
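
To make the four reported metrics concrete, they can be computed by hand from true-positive, false-positive, and false-negative counts. The tiny label vectors below are made up for illustration; they are not the post's actual predictions:

```python
# Hand-computed versions of the four reported metrics on a tiny
# made-up set of labels (1 = price rose, 0 = price fell).
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # 3
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # 1
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision = tp / (tp + fp)  # of predicted "up" days, how many really rose
recall = tp / (tp + fn)     # of actual "up" days, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```

The high recall of the ensemble (0.8269) means it catches most of the up days, while the lower precision (0.6143) means a fair share of its "up" calls are wrong.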

💥 Note that the accuracy improved by about 25% 🤯

If you can, try it! And let me know your results.

Author: Oscar Castillo-Felisola

Created: 2026-04-02 Thu 14:59