Improving stock market prediction
In a previous post I trained a machine learning model (RandomForestClassifier) to predict the behavior of a stock market share price, determining whether the price would rise or fall.
However, the performance of the model was very poor. With an accuracy of about 50%, it was as effective as making decisions by tossing a coin 😖.
As promised, I now bring an improved model.
The scikit-learn Python library includes a huge amount of tools: many machine learning models, pre-processing algorithms, neural network models, scoring metrics, and more. One interesting tool is the VotingClassifier.
The VotingClassifier in scikit-learn is a meta-estimator that combines multiple machine learning models to improve prediction accuracy. It uses either hard voting, which selects the class label with the majority of votes, or soft voting, which considers the predicted probabilities from each model.
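To make the two voting modes concrete, here is a minimal sketch on a synthetic dataset (not the stock data; the three member models are arbitrary choices for illustration):

```python
# Toy illustration of hard vs. soft voting with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem
X, y = make_classification(n_samples=300, random_state=0)

members = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ('dt', DecisionTreeClassifier(random_state=0)),
]

# Hard voting: each member casts one vote for a class label
hard = VotingClassifier(estimators=members, voting='hard').fit(X, y)
# Soft voting: the members' predicted probabilities are averaged
soft = VotingClassifier(estimators=members, voting='soft').fit(X, y)

print(hard.predict(X[:5]))        # majority vote of the class labels
print(soft.predict(X[:5]))        # argmax of the averaged probabilities
print(soft.predict_proba(X[:5]))  # averaged probabilities (soft voting only)
```

Note that only the soft-voting ensemble exposes predict_proba; in hard mode the members vote with labels, not probabilities.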
The code below uses yfinance to download the financial data of Microsoft (MSFT) from 2022 to 2025, flattens the multi-index columns into single-level feature names, and casts the time index to datetime. Next, it engineers features relevant for financial analysis (returns, a simple moving average, and the target) and drops the rows with NA cells. We then divide the data into predictors and response, scale the predictors, and split the data into train and test sets (at a 9:1 ratio).
At that stage, we instantiate three models:
- RandomForestClassifier,
- GradientBoostingClassifier,
- MLPClassifier (Multi-layer Perceptron classifier),

and use VotingClassifier to combine them into our ensemble_model.
Finally, we train our ensemble_model and use the trained model to predict the test data.
```python
import yfinance as yf
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, \
    GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, \
    recall_score, f1_score


def retrieve_clean_data(symbol: str, start_date: str, end_date: str):
    """
    Retrieve the stock market data:
    - symbol: stock ticker, e.g. "AAPL", "MSFT"
    - start_date: in format YYYY-MM-DD
    - end_date: in format YYYY-MM-DD
    Returns a pandas DataFrame.
    """
    data = yf.download(symbol, start=start_date, end=end_date)
    # Flatten the multi-index columns returned by yfinance
    if isinstance(data.columns, pd.MultiIndex):
        data.columns = data.columns.get_level_values(0)
    data = data.rename(columns={
        "Open": "open", "High": "high", "Low": "low",
        "Close": "close", "Volume": "volume"
    })
    data.index = pd.to_datetime(data.index)
    return data.sort_index()


def feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
    """
    Feature engineering. Creates the features:
    - return: decimal equivalent of the percentage change
    - ma10: moving average over 10 days
    - target: 1 if the share price rose, 0 if it fell
    """
    df['return'] = df['close'].pct_change()
    df['ma10'] = df['close'].rolling(window=10).mean()
    df["target"] = (df["return"] > 0).astype(int)
    df.dropna(inplace=True)
    return df


def data_preparation(df: pd.DataFrame, test_size: float = 0.1):
    """
    Takes the retrieved data, creates the relevant features,
    defines the predictors and target, and returns the split data.
    """
    df = feature_engineering(df)
    features = ["open", "high", "low", "close", "volume", "ma10"]
    X = df[features]
    y = df["target"]

    # Scaling (note: fitting on the full set leaks test-set statistics;
    # in a stricter setup, fit the scaler on the train split only)
    scale = StandardScaler()
    X_scale = scale.fit_transform(X)

    # Train-test split (chronological order preserved: no shuffling)
    X_train, X_test, y_train, y_test = train_test_split(
        X_scale, y, test_size=test_size, shuffle=False
    )
    return X_train, X_test, y_train, y_test


def main():
    symbol = "MSFT"
    start_date = "2022-01-01"
    end_date = "2026-01-01"

    df = retrieve_clean_data(symbol, start_date, end_date)
    X_train, X_test, y_train, y_test = data_preparation(df)

    # Model definition and training
    rf_model = RandomForestClassifier(n_estimators=100)
    gb_model = GradientBoostingClassifier(
        n_estimators=10, learning_rate=0.1
    )
    nn_model = MLPClassifier(
        hidden_layer_sizes=(50, 25),
        activation="relu",
        solver="adam",
        max_iter=100
    )
    ensemble_model = VotingClassifier(
        estimators=[
            ('rf', rf_model),
            ('gb', gb_model),
            ('nn', nn_model)
        ],
        voting='soft'
    )
    ensemble_model.fit(X_train, y_train)

    # Evaluation
    y_hat = ensemble_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_hat)
    precision = precision_score(y_test, y_hat)
    recall = recall_score(y_test, y_hat)
    f1 = f1_score(y_test, y_hat)
    print(f'ML Accuracy: {accuracy:.4f}, Precision: {precision:.4f}')
    print(f'Recall: {recall:.4f}, F1 Score: {f1:.4f}')


if __name__ == "__main__":
    main()
```
The results of this strategy:
```
ML Accuracy: 0.6400, Precision: 0.6143
Recall: 0.8269, F1 Score: 0.7049
```
💥 Note that the accuracy improved by about 25% over the coin-toss baseline 🤯
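To check how much the ensemble actually contributes beyond its members, each fitted base model is available on the trained VotingClassifier via its named_estimators_ attribute. A minimal sketch of that comparison, demonstrated on a synthetic dataset so it runs standalone (the model names mirror the script above, but the data is not the stock data):

```python
# Compare the ensemble's accuracy against each of its fitted members.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, \
    GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, shuffle=False
)

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, random_state=1)),
        ('gb', GradientBoostingClassifier(n_estimators=10, random_state=1)),
    ],
    voting='soft'
).fit(X_train, y_train)

# named_estimators_ maps each name to its fitted clone
for name, model in ensemble.named_estimators_.items():
    print(name, accuracy_score(y_test, model.predict(X_test)))
print('ensemble', accuracy_score(y_test, ensemble.predict(X_test)))
```

If one member consistently drags the ensemble down, dropping it (or switching to weighted voting via the weights parameter) is a natural next experiment.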
If you can, try it! And let me know your results.