My idea for this experiment was to get a year's NRL stats, do some machine learning on the first part of the season, and see how accurately I could then predict the results for the last part of the season.
I can then apply my learnings to other sports and future seasons and maybe even try a few $1 bets to see how much money I can win or lose.
I have been learning Python, so I will use that. If you don’t know Python, it is best to work through a few beginner tutorials to get up and running, then come back.
I got my initial stats from here: Historical NRL Results and Odds Data
The main fields included in the data are the ones below, plus a whole pile of betting stats which I ignored.
It took me a long time to get my head around this process. Although there are a large number of tutorials, articles and videos out there, actually making it work was not easy.
Importing the Data
I took the 2016 season to begin with and reduced the dataset to something very simple.
I will add features over time; for now, our base model takes in HomeTeam, AwayTeam and their scores, as in the image below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

nrl = pd.read_csv("C:\\Users\\rob\\Documents\\Data Science NRL Project\\2016FullDataSet_2.csv", index_col='Date', parse_dates=True)
nrl.head(20)
Now that our data is imported we can start cleaning it up.
Clean up Data
Create a new field for whether or not the home team won. This will be a simple true/false value. Luckily, draws are rare in the NRL due to the extra time rules, so for simplicity I will leave them out for now.
nrl["HomeWin"] = nrl["AwayScore"] < nrl["HomeScore"]
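To see how this comparison treats a drawn game, here is a small sketch on made-up scores. The three rows are invented for illustration and are not taken from the 2016 data, and the Draw column is an extra that the article does not use. A draw comes out as False, the same as a home loss.

```python
import pandas as pd

# Invented scores for illustration only; not real 2016 results
toy = pd.DataFrame({
    "HomeScore": [24, 12, 18],
    "AwayScore": [10, 30, 18],  # the last row is a draw
})

# Same comparison as in the article
toy["HomeWin"] = toy["AwayScore"] < toy["HomeScore"]

# Hypothetical extra flag that makes drawn games easy to find or drop later
toy["Draw"] = toy["AwayScore"] == toy["HomeScore"]

print(toy)
```

If draws ever matter for the model, rows where Draw is True could be filtered out before training.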
We can’t feed the classification model a bunch of text for the team names (believe me, I tried), so we need to replace the text with numbers.
nrl.describe()
nrlteamlist = nrl.HomeTeam.unique()
list(nrlteamlist)
# Output:
# 'Melbourne Storm', 'Cronulla Sharks', 'Canberra Raiders', 'North QLD Cowboys',
# 'Penrith Panthers', 'Brisbane Broncos', 'New Zealand Warriors', 'Wests Tigers',
# 'St George Dragons', 'Canterbury Bulldogs', 'Parramatta Eels', 'Newcastle Knights',
# 'Gold Coast Titans', 'Manly Sea Eagles', 'South Sydney Rabbitohs', 'Sydney Roosters'

# Need to convert team names to numbers
TEAM_IDS = {
    'Melbourne Storm': 1,
    'Cronulla Sharks': 2,
    'Canberra Raiders': 3,
    'North QLD Cowboys': 4,
    'Penrith Panthers': 5,
    'Brisbane Broncos': 6,
    'New Zealand Warriors': 7,
    'Wests Tigers': 8,
    'St George Dragons': 9,
    'Canterbury Bulldogs': 10,
    'Parramatta Eels': 11,
    'Newcastle Knights': 12,
    'Gold Coast Titans': 13,
    'Manly Sea Eagles': 14,
    'South Sydney Rabbitohs': 15,
    'Sydney Roosters': 16,
}

def tran_teamname(x):
    # Returns None for any name that is not in the mapping
    return TEAM_IDS.get(x)

# Drop the scores so the model cannot see the result it is meant to predict
nrl = nrl.drop(columns=['HomeScore', 'AwayScore'])
nrl['HomeTeam'] = nrl['HomeTeam'].apply(tran_teamname)
nrl['AwayTeam'] = nrl['AwayTeam'].apply(tran_teamname)
nrl
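If a team name in the CSV is spelled differently from the ones tran_teamname expects, the function returns None and pandas stores a missing value. Below is a minimal sketch of a check for that. It uses a cut-down stand-in for the function and invented rows, and the misspelled name is deliberate.

```python
import pandas as pd

def tran_teamname(x):
    # Cut-down stand-in for the article's function: two teams only
    ids = {'Melbourne Storm': 1, 'Cronulla Sharks': 2}
    return ids.get(x)  # None for any other name

# Invented rows; 'Cronulla Shark' is misspelled on purpose
games = pd.DataFrame({
    "HomeTeam": ["Melbourne Storm", "Cronulla Shark"],
    "AwayTeam": ["Cronulla Sharks", "Melbourne Storm"],
})

encoded = games.copy()
for col in ["HomeTeam", "AwayTeam"]:
    encoded[col] = games[col].apply(tran_teamname)

# Rows where either team failed to map show up as missing values
unmapped = games[encoded[["HomeTeam", "AwayTeam"]].isna().any(axis=1)]
print(unmapped)
```

Running a check like this before training is cheap, and most classifiers will refuse missing values anyway.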
Fit to classification
Now we need to fit our creation to a classifier. The data needs to be in arrays, so we declare the classifier model that we want to use, convert X and y to numpy arrays, and split the data into training and testing sets.
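The stated aim of this experiment is to learn from the first part of the season and predict the last part. A shuffled split mixes early and late games, so one alternative, sketched here on invented rows, is to sort by the Date index and cut at a fixed position. The 70/30 cut point and the sample rows are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Invented, already-encoded games indexed by date (not real results)
dates = pd.date_range("2016-03-03", periods=10, freq="7D")
games = pd.DataFrame({
    "HomeTeam": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "AwayTeam": [2, 3, 4, 5, 6, 7, 8, 9, 10, 1],
    "HomeWin": [True, False, True, True, False, True, False, True, True, False],
}, index=pd.Index(dates, name="Date"))

games = games.sort_index()   # oldest game first
cut = int(len(games) * 0.7)  # assumed: first 70% of games used for training

train, test = games.iloc[:cut], games.iloc[cut:]

X_train = np.array(train.drop(columns=["HomeWin"]))
y_train = np.array(train["HomeWin"])
X_test = np.array(test.drop(columns=["HomeWin"]))
y_test = np.array(test["HomeWin"])

# Every training game is earlier than every test game
print(train.index.max(), "<", test.index.min())
```

With a split like this, the model never sees a game played after the ones it is asked to predict.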
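from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# First attempt: a support vector classifier
clf = svm.SVC(gamma=0.001, C=100.)

# Features are the two team numbers; the label is whether the home team won
X = np.array(nrl.drop(columns=['HomeWin']))
y = np.array(nrl['HomeWin'])

# test_size=0.75 holds out 75% of the games for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75)
X_test

# This reassignment replaces the SVC above, so the decision tree is the model that gets trained
clf = DecisionTreeClassifier(random_state=14)
clf.fit(X_train, y_train)
# Notebook output shown in the original post:
# svm.SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
#         decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
#         max_iter=-1, probability=False, random_state=None, shrinking=True,
#         tol=0.001, verbose=True)

Run prediction and get accuracy

clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(accuracy)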
In this example we are getting around 66% accuracy. Not bad, but plenty of room for improvement. My next task will be to add features to our data in an attempt to increase that percentage.
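To judge an accuracy figure like this, it helps to compare it against a model that ignores the teams and always predicts the more common outcome. Below is a sketch using scikit-learn's DummyClassifier on invented labels. The home-win rate here is made up, not measured from the 2016 season.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels: True means the home team won (not real results)
y_train = np.array([True, True, False, True, False, True, True, False, True, False])
y_test = np.array([True, False, True, True, False])

# This strategy ignores the features, so placeholder zeros are enough
X_train = np.zeros((len(y_train), 2))
X_test = np.zeros((len(y_test), 2))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Accuracy from always predicting the majority class seen in training
baseline_accuracy = baseline.score(X_test, y_test)
print(baseline_accuracy)
```

Running the same idea on the real X_train, X_test, y_train and y_test shows how much of the score comes from home advantage alone and how much from the model.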
I hope you manage to get a basic model going. Let us know of any ideas you might have for features we can add to improve the score.
I found this video series very helpful: