Predicting NRL Rugby League Results With Data Science (Python and Pandas)

My idea for this experiment was to get a year's worth of NRL stats, do some machine learning on the first part of the season, and see how accurately I could then predict the results for the last part of the season.

I can then apply my learnings to other sports and future seasons and maybe even try a few $1 bets to see how much money I can win or lose.

I have been learning Python, so I will use that. If you don't know Python yet, it is best to work through a few beginner tutorials to get up and running, then come back.

I got my initial stats from here: Historical NRL Results and Odds Data

The main fields in the data are the ones used below, plus a whole pile of betting stats which I ignored.

It took me a long time to get my head around this process. Although there are a large number of tutorials, articles and videos out there, actually making it work was not easy.

Importing the Data

I took the 2016 season to begin with and reduced the dataset to something very simple.

I will add features over time; however, the base model takes in just HomeTeam, AwayTeam and their scores, as in the image below.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load the 2016 season, using the Date column as the index
nrl = pd.read_csv("C:\\Users\\rob\\Documents\\Data Science NRL Project\\2016FullDataSet_2.csv", index_col='Date', parse_dates=True)

# Have a quick look at the first 20 rows
nrl.head(20)
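My CSV was already trimmed down, but if you are starting from the full download you could do the reduction in pandas instead. This is just a sketch and it assumes those column names exist in the file:

# Keep only the columns the base model needs (column names are assumed here)
nrl = nrl[['HomeTeam', 'AwayTeam', 'HomeScore', 'AwayScore']]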

Now that our data is imported we can start cleaning it up.

Clean up Data

Create a new field for whether or not the home team won. This will be a simple True/False. Luckily there is seldom a draw in the NRL due to the extra-time rules, so for simplicity I will just leave that case out for now.

nrl["HomeWin"] = nrl["AwayScore"] < nrl["HomeScore"]
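If you did want to be strict and drop the odd drawn game rather than count it as a home loss, a one-liner like this would do it (just a sketch):

# Optional: remove drawn games entirely (run this before the score columns are dropped)
nrl = nrl[nrl['HomeScore'] != nrl['AwayScore']]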

We can't feed the classification model a bunch of text for the team names (believe me I tried), so we need to replace the text with numbers.

nrl.describe()


nrlteamlist = nrl.HomeTeam.unique()

list(nrlteamlist)

['Melbourne Storm',
 'Cronulla Sharks',
 'Canberra Raiders',
 'North QLD Cowboys',
 'Penrith Panthers',
 'Brisbane Broncos',
 'New Zealand Warriors',
 'Wests Tigers',
 'St George Dragons',
 'Canterbury Bulldogs',
 'Parramatta Eels',
 'Newcastle Knights',
 'Gold Coast Titans',
 'Manly Sea Eagles',
 'South Sydney Rabbitohs',
 'Sydney Roosters']

# Need to convert team names to numbers

def tran_teamname(x):
    if x == 'Melbourne Storm':
        return 1
    if x == 'Cronulla Sharks':
        return 2
    if x == 'Canberra Raiders':
        return 3
    if x == 'North QLD Cowboys':
        return 4
    if x == 'Penrith Panthers':
        return 5
    if x == 'Brisbane Broncos':
        return 6
    if x == 'New Zealand Warriors':
        return 7
    if x == 'Wests Tigers':
        return 8
    if x == 'St George Dragons':
        return 9
    if x == 'Canterbury Bulldogs':
        return 10
    if x == 'Parramatta Eels':
        return 11
    if x == 'Newcastle Knights':
        return 12
    if x == 'Gold Coast Titans':
        return 13
    if x == 'Manly Sea Eagles':
        return 14
    if x == 'South Sydney Rabbitohs':
        return 15
    if x == 'Sydney Roosters':
        return 16

# Drop the score columns so the model can't just read the result straight off them
nrl = nrl.drop(['HomeScore', 'AwayScore'], axis=1)

nrl['HomeTeam'] = nrl['HomeTeam'].apply(tran_teamname)
nrl['AwayTeam'] = nrl['AwayTeam'].apply(tran_teamname)

nrl
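As a side note, you can avoid hand-coding that mapping by letting pandas build it for you. This is only an alternative sketch and would replace the apply calls above; it assumes the team columns still contain the original text and that every team appears at least once as the home team:

# Alternative sketch: build the name-to-number mapping automatically
team_codes = {name: i + 1 for i, name in enumerate(sorted(nrl['HomeTeam'].unique()))}

nrl['HomeTeam'] = nrl['HomeTeam'].map(team_codes)
nrl['AwayTeam'] = nrl['AwayTeam'].map(team_codes)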

Fit a classifier

Now we need to fit our data to a classifier. The data needs to be in arrays, so we declare the classifier model we want to use, convert X and y to numpy arrays, and then split the data into training and testing sets.

from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# First attempt was a support vector machine
clf = svm.SVC(gamma=0.001, C=100.)

# Features are everything except the HomeWin label we are trying to predict
X = np.array(nrl.drop(['HomeWin'], axis=1))
y = np.array(nrl['HomeWin'])

# Split the season into training and testing sets (75% held back for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75)

X_test

# Switched to a simple decision tree classifier, which is what gets used from here on
clf = DecisionTreeClassifier(random_state=14)

clf.fit(X_train, y_train)


Run prediction and get accuracy

# Predict the held-back games and see how often we got it right
clf.predict(X_test)

accuracy = clf.score(X_test, y_test)

print(accuracy)
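One caveat before reading too much into a single number: with only one train/test split the score can move around a fair bit from run to run. A quick sanity check (a sketch using scikit-learn's cross_val_score) is to average the accuracy over several splits:

from sklearn.model_selection import cross_val_score

# Average accuracy across 5 different splits gives a more stable estimate
scores = cross_val_score(DecisionTreeClassifier(random_state=14), X, y, cv=5)
print(scores.mean())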

In this example we are getting around 66% prediction accuracy. Not bad, but there is plenty of room for improvement. My next task will be to add features to the data in an attempt to increase that accuracy.
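One feature idea to get started with (a rough sketch only, and it assumes the rows are sorted by date): a flag for whether each team won its previous game, which gives the model a crude sense of current form.

# Sketch: "won last game" form flags; assumes the rows are in date order
last_result = {}
home_form, away_form = [], []
for _, row in nrl.iterrows():
    home_form.append(last_result.get(row['HomeTeam'], False))
    away_form.append(last_result.get(row['AwayTeam'], False))
    last_result[row['HomeTeam']] = bool(row['HomeWin'])
    last_result[row['AwayTeam']] = not row['HomeWin']
nrl['HomeWonLast'] = home_form
nrl['AwayWonLast'] = away_form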

I hope you manage to get a basic model going, and let me know of any ideas you have for features we could add to improve the score.

Resources

I found this video series very helpful:

Scikit-Learn Website

Intro to Scikit-Learn (oreilly.com)

Rob StGeorge

Senior SQL Server Database Administrator residing in Auckland, NZ
