This is not investment advice.
This notebook is based on a blog post from QuantInsti: https://blog.quantinsti.com/decision-tree/.
The article tries to predict the stock price one day ahead using a decision tree algorithm and technical indicators.
Here we will expand on that article by:
My tests show that predicting the price itself gives very noisy results (at least with random forests), but an alternative approach, predicting the trend, gives much cleaner predictions.
So the question is: can random forests predict the stock trend tens of days ahead with reasonable accuracy?
Pretty much YES: the trend predictions turn out to be much less noisy than we might expect.
Also, as we will see in the final prediction, the random forest advises buying for the whole duration of an uptrend, for as long as it thinks the uptrend will continue. So the model will not give us a buy signal at one specific
Note: It would also be interesting to try something similar with the XGBoost algorithm, as described in this Kaggle kernel: https://www.kaggle.com/mtszkw/using-xgboost-for-stock-trend-prices-prediction
Note: Normalized volume has not been used, but it might be an interesting predictor to try.
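The notebook does not say how normalized volume would be defined. As one possible reading, the sketch below divides each day's volume by its trailing mean; the function name `normalized_volume` and the 20-row default window are assumptions made for illustration, not something taken from the original article.

```python
import pandas as pd

def normalized_volume(volume, window=20):
    """Volume divided by its trailing mean over `window` rows (NaN until the window fills)."""
    # hypothetical definition: values above 1 mean heavier trading than the recent average
    return volume / volume.rolling(window).mean()

# tiny made-up series, window shortened so the result is easy to check by hand
volume = pd.Series([100.0, 200.0, 300.0, 400.0, 800.0])
result = normalized_volume(volume, window=2)
print(result.tolist())
```

With the real data this could be applied as `normalized_volume(df['Volume'])` and, if it turns out to be useful, added to the list of predictors.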
!which python3
import talib as ta
import joblib
import pandas as pd
#suppress 'SettingWithCopy' warning
pd.set_option('mode.chained_assignment', None)
#!pip install pandas_datareader
#!pip3 install seaborn
import seaborn as sns
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
# ___library_import_statements___
import pandas as pd
# for pandas_datareader, otherwise it might have issues, sometimes there is some version mismatch
pd.core.common.is_list_like = pd.api.types.is_list_like
# make pandas to print dataframes nicely
pd.set_option('expand_frame_repr', False)
import pandas_datareader.data as web
import numpy as np
import matplotlib.pyplot as plt
import datetime
import time
#newest yahoo API
import yfinance as yahoo_finance
#optional
#yahoo_finance.pdr_override()
%matplotlib inline
# was giving me some warnings
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
# ___variables___
#ticker = 'AAPL'
#ticker = 'TSLA'
#ticker = 'FB'
#ticker = 'MSFT'
#ticker = 'NFLX'
#ticker = 'GOOGL'
#ticker = 'BIDU'
#ticker = 'AMZN'
#ticker = 'IBM'
start_time = datetime.datetime(1980, 1, 1)
#end_time = datetime.datetime(2019, 1, 20)
end_time = datetime.datetime.now().date().isoformat() # today
def get_data(ticker):
    # yahoo gives only daily historical data
    connected = False
    while not connected:
        try:
            df = web.get_data_yahoo(ticker, start=start_time, end=end_time)
            connected = True
            print('connected to yahoo')
        except Exception as e:
            # wait and retry; note that this keeps looping until a download succeeds
            print("download error: " + str(e))
            time.sleep(5)
    # use numerical integer index instead of date
    df = df.reset_index()
    #print(df.head(5))
    return df
#df = get_data(ticker)
#df.head()
#df.tail()
#df.shape
def compute_technical_indicators(df):
    close = df['Adj Close'].values
    # exponential moving averages; the column name always matches the period
    # (the original computed EMA20 with timeperiod=10)
    for period in [5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200]:
        df['EMA' + str(period)] = ta.EMA(close, timeperiod=period)
    # Bollinger bands, parabolic SAR and RSI
    df['upperBB'], df['middleBB'], df['lowerBB'] = ta.BBANDS(close, timeperiod=20, nbdevup=2, nbdevdn=2, matype=0)
    df['SAR'] = ta.SAR(df['High'].values, df['Low'].values, acceleration=0.02, maximum=0.2)
    df['RSI'] = ta.RSI(close, timeperiod=14)
    return df
#df = compute_technical_indicators(df)
def compute_features(df):
    # computes features for forest decisions
    # +1 if the price is above the given EMA, -1 otherwise
    for period in [5, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150, 200]:
        df['aboveEMA' + str(period)] = np.where(df['Adj Close'] > df['EMA' + str(period)], 1, -1)
    df['aboveUpperBB'] = np.where(df['Adj Close'] > df['upperBB'], 1, -1)
    df['belowLowerBB'] = np.where(df['Adj Close'] < df['lowerBB'], 1, -1)
    df['aboveSAR'] = np.where(df['Adj Close'] > df['SAR'], 1, -1)
    df['oversoldRSI'] = np.where(df['RSI'] < 30, 1, -1)
    df['overboughtRSI'] = np.where(df['RSI'] > 70, 1, -1)
    # very important - cleanup NaN values, otherwise prediction does not work
    df = df.fillna(0).copy()
    return df
def plot_train_data(df):
    # plot price; `ticker` is read from the notebook's global scope
    plt.figure(figsize=(15,2.5))
    plt.title('Stock data ' + str(ticker))
    plt.plot(df['Date'], df['Adj Close'])
    #plt.title('Price chart (Adj Close) ' + str(ticker))
    plt.show()
    return None
def define_target_condition(df):
    # price higher later - bad predictive results
    #df['target_cls'] = np.where(df['Adj Close'].shift(-34) > df['Adj Close'], 1, 0)
    # price above trend multiple days later
    # note: for the last 55 rows the shifted values are NaN, the comparison is False,
    # and the target is therefore 0 even though the true outcome is not known yet
    df['target_cls'] = np.where(df['Adj Close'].shift(-55) > df.EMA150.shift(-55), 1, 0)
    # important, remove NaN values
    df = df.fillna(0).copy()
    return df
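To make the target definition concrete, here is a small sketch of the same idea on a toy series: label a row 1 if the price `horizon` rows later is above the EMA at that later row. It is only an illustration under assumptions: pandas `ewm` stands in for `ta.EMA` (the two are seeded differently, so values will not match TA-Lib exactly), the function name is hypothetical, and, unlike `define_target_condition`, rows whose future is still unknown are marked NaN instead of 0.

```python
import pandas as pd

def label_price_above_future_ema(close, horizon, ema_span):
    """1.0 where close[t + horizon] > EMA[t + horizon], 0.0 where not, NaN where the future is unknown."""
    # assumption: pandas ewm is used as a stand-in for ta.EMA
    ema = close.ewm(span=ema_span, adjust=False).mean()
    future_close = close.shift(-horizon)
    future_ema = ema.shift(-horizon)
    label = (future_close > future_ema).astype(float)
    # keep the last `horizon` rows unlabeled rather than silently calling them 0
    return label.where(future_close.notna())

# made-up prices, short horizon and span so the result can be checked by hand
close = pd.Series([10.0, 11.0, 12.0, 11.0, 9.0, 8.0, 9.0, 12.0])
labels = label_price_above_future_ema(close, horizon=2, ema_span=3)
print(labels.tolist())  # [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, nan, nan]
```

In the notebook the corresponding settings are `horizon=55` and a 150-period EMA. Dropping the NaN rows before training would keep the last 55 rows from entering the training set with a label that is not really known.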
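You can set the 'warm_start' parameter to True in the model. A later call to .fit on the same classifier object then keeps the trees that were already grown and adds new ones, but only if n_estimators is increased between the calls. A classifier that is created anew for every round starts from scratch, regardless of this parameter.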
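A minimal sketch of incremental training with `warm_start`, on synthetic data standing in for the per-ticker feature tables. The helper `make_batch` and the value `trees_per_batch = 20` are made up for the demo; the point is only that the same classifier object is reused and `n_estimators` is raised before each additional `fit`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def make_batch(n_rows=200, n_features=5):
    """Synthetic stand-in for one ticker's feature matrix and target."""
    X = rng.normal(size=(n_rows, n_features))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

trees_per_batch = 20  # hypothetical value, chosen only to keep the demo fast
clf = RandomForestClassifier(n_estimators=trees_per_batch, warm_start=True, random_state=0)

tree_counts = []
for batch_number in range(3):
    X, y = make_batch()
    if batch_number > 0:
        # without this increase, another fit() on a warm-started forest adds no trees
        clf.n_estimators += trees_per_batch
    clf.fit(X, y)
    tree_counts.append(len(clf.estimators_))

print(tree_counts)  # [20, 40, 60]
```

Each batch contributes its own group of trees, so the earlier trees never see the later data. Whether that is preferable to concatenating all tickers into one training table is a design choice that would need to be tested.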
def splitting_and_training(df):
    # __predictors__
    predictors_list = ['aboveSAR','aboveUpperBB','belowLowerBB','RSI','oversoldRSI','overboughtRSI',
                       'aboveEMA5','aboveEMA10','aboveEMA15','aboveEMA20','aboveEMA30','aboveEMA40','aboveEMA50',
                       'aboveEMA60','aboveEMA70','aboveEMA80','aboveEMA90',
                       'aboveEMA100']
    # __features__
    X = df[predictors_list].fillna(0)
    # __targets__
    y_cls = df.target_cls.fillna(0)
    # __train test split__
    from sklearn.model_selection import train_test_split
    y = y_cls
    X_cls_train, X_cls_test, y_cls_train, y_cls_test = train_test_split(X, y, test_size=0.3, random_state=432, stratify=y)
    print(X_cls_train.shape, y_cls_train.shape)
    print(X_cls_test.shape, y_cls_test.shape)
    # __RANDOM FOREST __ - retrainable - warm_start
    from sklearn.ensemble import RandomForestClassifier
    # Create a random forest classifier; a new one is built on every call,
    # so warm_start=True has no effect unless the same object is fitted again
    clf = RandomForestClassifier(n_estimators=500, criterion='gini', max_depth=20, min_samples_leaf=10,
                                 n_jobs=-1, warm_start=True)
    # __ACTUAL TRAINING __
    clf = clf.fit(X_cls_train, y_cls_train)
    # __making accuracy report__
    # ideally should be getting better with each round
    y_cls_pred = clf.predict(X_cls_test)
    from sklearn.metrics import classification_report
    report = classification_report(y_cls_test, y_cls_pred)
    print(report)
    return clf
# get trained classifier
#clf = splitting_and_training(df)
#plot_train_data(df)
#df
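`train_test_split` shuffles rows, so training and test rows come from the same time periods, and neighbouring days have nearly identical features and targets. That can make the test report look better than performance on truly later data. One common alternative for time series is a chronological split; the sketch below is an assumption-laden illustration (the function name, the `gap` argument and the demo table are all hypothetical), not the method used in this notebook.

```python
import pandas as pd

def chronological_split(df, test_fraction=0.3, gap=55):
    """Split rows in time order, leaving `gap` rows unused between train and test."""
    n_test = int(len(df) * test_fraction)
    split_at = len(df) - n_test
    # rows just before the split have targets that look into the test period, so skip them
    train = df.iloc[: max(split_at - gap, 0)]
    test = df.iloc[split_at:]
    return train, test

# made-up table with 1000 rows in time order
demo = pd.DataFrame({"feature": range(1000), "target_cls": [i % 2 for i in range(1000)]})
train, test = chronological_split(demo)
print(len(train), len(test))  # 645 300
```

The gap of 55 rows mirrors the 55-day look-ahead of the target, so that no training label depends on prices from the test period.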
def predict_timeseries(df, clf):
    # same predictors, in the same order, as used for training
    predictors_list = ['aboveSAR','aboveUpperBB','belowLowerBB','RSI','oversoldRSI','overboughtRSI',
                       'aboveEMA5','aboveEMA10','aboveEMA15','aboveEMA20','aboveEMA30','aboveEMA40','aboveEMA50',
                       'aboveEMA60','aboveEMA70','aboveEMA80','aboveEMA90',
                       'aboveEMA100']
    # predict all rows in one call instead of looping row by row
    # (this also avoids chained assignment into df['Buy'][i])
    df['Buy'] = clf.predict(df[predictors_list].fillna(0))
    print(df.head())
    return df
def plot_stock_prediction(df, ticker):
    # plot values and significant levels
    plt.figure(figsize=(20,7))
    plt.title('Predictive model ' + str(ticker))
    plt.plot(df['Date'], df['Adj Close'], label='Adj Close', alpha=0.2)
    plt.plot(df['Date'], df['EMA10'], label='EMA10', alpha=0.2)
    plt.plot(df['Date'], df['EMA20'], label='EMA20', alpha=0.2)
    plt.plot(df['Date'], df['EMA30'], label='EMA30', alpha=0.2)
    plt.plot(df['Date'], df['EMA40'], label='EMA40', alpha=0.2)
    plt.plot(df['Date'], df['EMA50'], label='EMA50', alpha=0.2)
    plt.plot(df['Date'], df['EMA100'], label='EMA100', alpha=0.2)
    plt.plot(df['Date'], df['EMA150'], label='EMA150', alpha=0.99)
    plt.plot(df['Date'], df['EMA200'], label='EMA200', alpha=0.2)
    # Buy is 0 or 1, so the marker sits on the price when Buy == 1 and on zero otherwise
    plt.scatter(df['Date'], df['Buy']*df['Adj Close'], label='Buy', marker='^', color='magenta', alpha=0.15)
    #plt.scatter(df.index, df['sell_sig'], label='Sell', marker='v')
    plt.legend()
    plt.show()
    return None
def save_model(clf):
    import joblib
    joblib.dump(clf, "./random_forest.joblib")
    return None
# training stock data
tickers = ['SPY', 'F', 'IBM', 'GE', 'AAPL', 'ADM']
# other optional tickers for more training:
# 'XOM', 'GM','MMM','KO','PEP','SO','GS',
# 'HAS','PEAK','HPE','HLT','HD','HON','HRL','HST','HPQ','HUM','ILMN',
# 'INTC','ICE','INTU','ISRG','IVZ','IRM','JNJ','JPM','JNPR','K','KMB',
# 'KIM', 'KMI','KSS','KHC', 'KR', 'LB', 'LEG', 'LIN', 'LMT','LOW',
# 'MAR', 'MA','MCD','MDT', 'MRK', 'MET', 'MGM', 'MU','MSFT', 'MAA',
# 'MNST', 'MCO','MS', 'MSI',
# 'MMM', 'ABT','ACN','ATVI','ADBE','AMD','A','AKAM','ARE','GOOG','AMZN','AAL',
# 'AMT', 'AMGN','AIV','AMAT','ADM', 'AVB','BAC', 'BBY', 'BIIB', 'BLK', 'BA','BXP',
# 'BMY', 'AVGO','CPB','COF','CAH', 'CCL', 'CAT', 'CBOE', 'CBRE','CNC', 'CNP', 'SCHW','CVX',
# 'CMG', 'CI','CSCO','C','CLX', 'CME', 'KO', 'CTSH', 'CL', 'CMCSA', 'ED', 'COST','CCI',
# 'CVS', 'DAL','DLR', 'D','DPZ', 'DTE', 'DUK', 'DRE', 'EBAY', 'EA', 'EMR', 'ETR', 'EFX', 'EQIX',
# 'EQR', 'ESS', 'EL','EXC', 'EXPE','XOM', 'FFIV','FB','FRT', 'FDX', 'FE','GPS', 'GRMN',
# 'IT', 'GD', 'GE','GIS', 'GM','GS', 'GWW', 'HAL'
# ]
for ticker in tickers:
    df = get_data(ticker)
    plot_train_data(df)
    df = compute_technical_indicators(df)
    df = compute_features(df)
    df = define_target_condition(df)
    clf = splitting_and_training(df)
    # the same file is overwritten on every iteration, so it ends up holding
    # the classifier trained on the last ticker in the list
    save_model(clf)
    # commenting out saves time during training
    #df = predict_timeseries(df, clf)
    #plot_stock_prediction(df, ticker)
Here the model performs trend predictions on an unseen dataset (a ticker it has not seen during training or testing).
#ticker='BP'
#ticker='ABBV'
#ticker='GILD'
#ticker='NGG'
ticker='BPY'
# load classifier, no need to initialize the loaded_rf
loaded_clf = joblib.load("./random_forest.joblib")
clf = loaded_clf
new_df = get_data(ticker)
new_df = compute_technical_indicators(new_df)
new_df = compute_features(new_df)
new_df=define_target_condition(new_df)
new_df = predict_timeseries(new_df, clf)
The plots below show predictions on the unseen dataset. When the triangle overlay is on the price data, it means buy. When the triangle is on the zero level, it means don't buy. This model gives only long signals, but the same approach could be extended to sell signals as well.
So a Buy signal means that the model thinks that in n days (here 55 days) the price will be above a specific Exponential Moving Average (here it was trained on the price being above the 150-day EMA in 55 days).
We see that the model gives some false positive signals (of course it does), but not that many. It sometimes expects a trend reversal too early, but if we use it as an investing advisor for long-term holds or long-term swing trades, the signals provided by the model look very useful.
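The share of false positives can also be put into numbers instead of being read off the chart. The sketch below compares the `Buy` column with `target_cls` and reports precision and recall, ignoring the last `horizon` rows because their true outcome is not known yet. The function name and the demo table are made up for illustration.

```python
import pandas as pd

def buy_signal_precision_recall(df, horizon):
    """Precision and recall of 'Buy' against 'target_cls', skipping rows with an unknown future."""
    known = df.iloc[:-horizon] if horizon > 0 else df
    predicted = known["Buy"] == 1
    actual = known["target_cls"] == 1
    true_pos = int((predicted & actual).sum())
    false_pos = int((predicted & ~actual).sum())
    false_neg = int((~predicted & actual).sum())
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else float("nan")
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else float("nan")
    return precision, recall

# made-up signals: the last two rows are ignored because horizon=2
demo = pd.DataFrame({
    "Buy":        [1, 1, 0, 1, 0, 1, 1, 0],
    "target_cls": [1, 0, 0, 1, 1, 1, 0, 0],
})
precision, recall = buy_signal_precision_recall(demo, horizon=2)
print(precision, recall)  # 0.75 0.75
```

On the notebook's data this would be called as `buy_signal_precision_recall(new_df, horizon=55)`. A precision of 0.75 would mean that one in four Buy signals was a false positive.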
plot_stock_prediction(new_df, ticker)
# zoom in on the data
temp_df = new_df[-700:]
plot_stock_prediction(temp_df, ticker)
Wisdom of the trees (and forests):
Trend is your friend until the end.