
US Open predictions

Updated: Jun 16, 2023

US Open @ LACC


With the US Open upon us, I'm interested in creating several machine learning models to see if I can predict success.


The US Open is known for extremely punishing rough, so I was thinking that perhaps this tournament, compared to others, might be better predicted by success in certain areas (e.g. off the tee [is it more important to hit fairways?] or around the green [playing well from the rough, or staying out of the rough altogether]) than by putting, for example.


I have strokes gained data from previous US Opens (2019, 2021, 2022) from https://www.kaggle.com/datasets/robikscube/pga-tour-golf-data-20152022, and have scraped average strokes gained for the 2023 season thus far from https://www.pgatour.com/stats/strokes-gained.


If you don't know what strokes gained is, you can visit my post on the Masters tournament, but briefly, it is a measure of how a specific player is doing compared to the field. Are they gaining strokes on the field (i.e. performing better): off the tee, on the approach, around the green, or putting?
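As a toy illustration (made-up numbers here, and glossing over the shot-by-shot baselines the real stat is built from), strokes gained boils down to a difference between the field and the player:

# toy example: hypothetical putting numbers, not real tour data
field_avg_putts = 29.0   # field's average putts per round
player_putts = 27.5      # our player's putts that round

sg_putting = field_avg_putts - player_putts
print(sg_putting)        # 1.5 strokes gained putting on the field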


Below, I plotted the zero-order correlations of all strokes gained variables (as well as a continuous "finish" variable) from the 2019, 2021 and 2022 US Opens:

This correlation plot is almost exactly what I would expect. The individual strokes gained categories don't relate that strongly to each other (rs=|0.01-0.09|), with the exception of strokes gained tee to green and strokes gained total. I'm not 100% sure how strokes gained tee to green and total are calculated, but they are definitely composed of the other categories (e.g. strokes gained tee to green = strokes gained off the tee + on the approach + around the green). As such, strokes gained total correlates very highly with strokes gained tee to green (r=0.80), while strokes gained tee to green does not correlate strongly with strokes gained putting. The weak correlations among the individual categories are good news, as they show those predictors are largely independent of each other (important for a linear analysis)! However, they also suggest we should be leery of including strokes gained tee to green and strokes gained total alongside their component categories in our models, as they might induce multicollinearity. Finally, strokes gained off the tee, on approach, around the green and putting each correlate moderately with finish (rs=|0.21-0.42|).
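For anyone curious, a plot like this takes only a few lines with pandas and seaborn; here is a minimal sketch, assuming a data frame named train with my columns (the tee-to-green, total and finish column names are my guesses):

import seaborn as sns
import matplotlib.pyplot as plt

cols = ["sg_ott", "sg_app", "sg_arg", "sg_putt", "sg_t2g", "sg_total", "finish"]
corr = train[cols].corr()  # zero-order (Pearson) correlations

sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdBu_r", vmin=-1, vmax=1)
plt.show()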


This plot shows average total strokes gained by finish over the last three US Opens. Strokes gained total has the clearest association with finish, which makes sense (green = lower finish number = better). The red line represents a top-10 finish at a US Open. For the most part, people finishing in the top 10 are also positive in the strokes gained categories (but not always!)


Can we predict who will make the cut?


Golf tournaments make a cut after the first two rounds: at the US Open, the top 60 (and ties) move on to the weekend, and everyone outside the top 60 goes home. I wanted to see if I could predict with some confidence who will make the cut at this year's US Open, based on who made the cut, and what their strokes gained were, at previous US Opens.
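(If you're following along, the binary outcome can be derived from finish position in one line; a sketch assuming a numeric "finish" column, with ties already reflected in the position numbers:)

# hypothetical derivation: position 60 or better makes the cut
train_class["finish_class"] = (train_class["finish"] <= 60).astype(int)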


Here, I used the US open strokes gained data from 2019, 2021 and 2022 as training data. These data serve to train a model, and help decide how important each strokes gained variable is for making or missing the cut.


Using `glmer()` in R


Running a logistic regression model in which strokes gained off the tee, on approach, around the green and putting predict whether someone made the cut or not (binary, 0/1), we find every strokes gained variable significantly increases one's odds of making the cut.

library(lme4)

# mixed-effects logistic regression: random intercepts and strokes gained
# slopes per season, plus a random intercept per player
mod <- glmer(finish_class ~ sg_ott + sg_app + sg_putt + sg_arg +
               (sg_ott + sg_app + sg_putt + sg_arg | season) + (1 | `player id`),
             data = train, family = binomial)
summary(mod)

We also included random intercepts and slopes for season, and a random intercept for player. This allows each season and each player to have its own intercept (or "starting point") for the data, meaning there is some variability between each season's and each player's baseline strokes gained. Allowing random intercepts helps account for the non-independence of observations induced by having the same players appear across multiple years.


I also allowed each strokes gained category to have its own slope (or effect on making the cut) per season, because the course changes year to year, and thus strokes gained in certain areas may carry more weight than others depending on the course.


Now that we have a model, we can take a testing data set (average strokes gained for players from the 2023 season thus far) and predict whether they will make the cut, based on the coefficients extracted from the model above.

# predicted probability of making the cut for each 2023 player
# (allow.new.levels lets us predict for players not seen in training)
test$preds <- predict(mod, newdata = test, type = "response", allow.new.levels = TRUE)

Here, we are actually predicting the probability that someone makes the cut, but right now we are more interested in whether they make the cut or not (a binary outcome). I just rounded the "preds" variable to get 0s and 1s instead of probabilities.

The above graph plots those who are projected to make or miss the cut against their average total strokes gained from the 2023 season thus far. For the most part, it shows that those who are predicted to make the cut are mostly positive in strokes gained, and those who are projected to miss the cut are negative -- which makes sense.


Using scikit-learn in Python

In the interest of seeing how machine learning models compare across methods (R and Python), I ran a model using scikit-learn with the same data to predict who will make the cut in this year's US Open.


I started again by looking at the data. Here, I plotted strokes gained off the tee against strokes gained putting for those who made and missed the cut in previous US Opens. The red points are those who made the cut, and the blue are those who missed it. The two groups look fairly similar, which makes me wonder if it will be hard to predict who finishes where. Perhaps this is because the US Open draws a highly competitive field where all the players are performing at a high clip... it might be hard to build a model with high prediction accuracy!

The steps to run this model in Python are slightly different, and a bit more explicit. First, I split the training data into training and test data sets to build and tune a model. This means I am not yet touching the 2023 strokes gained data; instead, I am using the 2019, 2021 and 2022 US Open strokes gained data to ensure my model is predicting with good accuracy.

import numpy as np
from sklearn.model_selection import train_test_split

# split data into x (predictors) and y (outcome)
x = train_class.drop("finish_class", axis=1)
y = train_class.finish_class

np.random.seed(22)

# split into training and testing data (again, here just using the old training data)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

I tried fitting both a logistic regression and a random forest classifier model. Logistic regression is a statistical method for predicting binary classes: the outcome or target variable is dichotomous, meaning there are only two possible classes (for example, it can be used for cancer detection problems). It computes the probability of an event occurring.

It is a generalized linear model for categorical targets: it uses the log of the odds as the dependent variable, modeling the log odds of the event as a linear function of the predictors and converting that into a probability via the logit function.
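Put concretely, the model turns a linear combination of the strokes gained variables into a probability through the logistic function; here is a minimal sketch with made-up coefficients (not the fitted values from my model):

import numpy as np

def cut_probability(sg, coefs, intercept):
    # logistic regression by hand: linear predictor (log odds) -> probability
    z = intercept + np.dot(coefs, sg)  # the logit
    return 1 / (1 + np.exp(-z))        # inverse logit (logistic function)

# illustrative only: sg = [ott, app, arg, putt]
print(cut_probability(sg=[0.5, 0.3, 0.1, -0.2],
                      coefs=[0.9, 1.1, 0.6, 0.7],
                      intercept=0.2))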


A random forest is a supervised machine learning estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. It consists of a large number of individual decision trees operating as an ensemble: each individual tree spits out a class prediction, and the class with the most votes becomes the model's prediction.
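To make the voting idea concrete, here is a small sketch on stand-in data (not my strokes gained data) showing how the individual trees' predictions get aggregated:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# each individual tree makes its own class prediction for one observation...
votes = np.array([tree.predict(X[:1])[0] for tree in rf.estimators_])
print(votes.mean())       # fraction of trees predicting class 1

# ...and the forest aggregates them (scikit-learn averages tree probabilities)
print(rf.predict(X[:1]))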


The logistic regression model had a much worse accuracy score (58%) than the random forest classifier model (82%), so I decided to focus on the random forest.

## random forest
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)
RF = RandomForestClassifier()
RF.fit(x_train, y_train)
RF.score(x_test, y_test)  ### better

Scikit-learn makes it easy to "tune" a model, or adjust its hyperparameters. I did this using GridSearchCV, which takes parameters associated with the model and searches across a grid of different candidate values to find which combination makes the model perform best.

# candidate hyperparameters for the random forest
# these are just input values I am trying here for this model.
# tbh I am not 100% sure how to know what to try but this is what I learned

rf_grid = {'bootstrap': [True],
           'max_depth': [80, 90, 100, 110],
           'max_features': [2, 3],
           'min_samples_leaf': [3, 4, 5],
           'min_samples_split': [8, 10, 12],
           'n_estimators': [100, 200, 300, 1000]}

from sklearn.model_selection import GridSearchCV

gs_rf = GridSearchCV(estimator = RandomForestClassifier(),
                     param_grid = rf_grid,
                     cv = 3,
                     n_jobs = -1,
                     verbose = 2)

# fit
gs_rf.fit(x_train,y_train)

gs_rf.best_params_

After using grid search, you can simply ask what the best parameters were, and specify a model with those parameters.

clf= RandomForestClassifier(bootstrap=True,
        max_depth=100,max_features=2,
        min_samples_leaf=5,
        min_samples_split=12,
        n_estimators=100,
        random_state=22)
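(Equivalently, you can unpack the winners straight from the grid search object rather than retyping them; the extra random_state here is just for reproducibility:)

clf = RandomForestClassifier(**gs_rf.best_params_, random_state=22)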

Using a cross-validation method that splits the training data into different training/test sets and computes a model performance score for each split, I found this model had an average accuracy of 82%. I was satisfied with that score, so I computed actual predictions from the model on the (still) training data, where accuracy was even higher!
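That score came from something like the following; the number of folds here is my assumption:

from sklearn.model_selection import cross_val_score

# average accuracy across several train/test splits of the 2019-2022 data
cv_scores = cross_val_score(clf, x, y, cv=5)
print(cv_scores.mean())  # ~0.82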


We had ~88% accuracy from that model, as evidenced by the confusion matrix below:
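For reference, a matrix like that can be produced in a couple of lines; a sketch, here scoring the held-out split of the old data (from_predictions requires a recent scikit-learn):

from sklearn.metrics import ConfusionMatrixDisplay, accuracy_score

clf.fit(x_train, y_train)
preds = clf.predict(x_test)

print(accuracy_score(y_test, preds))                    # ~0.88
ConfusionMatrixDisplay.from_predictions(y_test, preds)  # plots the matrix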

Finally, reading in the actual testing data (2023 strokes gained data), we make predictions.

us_open_preds= clf.predict(test)

We won't be able to evaluate these until the US Open cut is made, but we can compare them to the results from R!


Comparing methods

I looked across the predictions from R and Python, and found they correlated at 0.81. That is not terrible agreement across methods!


The below plot shows who was predicted to make the cut or not in both models.

If someone has a 2 on the y axis (the confidence number), they were predicted to make the cut in both models. If someone has a 1, they were predicted to make the cut in only one of the models. Finally, if someone has a zero (no bar), they were predicted to miss the cut in both models.
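Both the agreement number and the confidence bars are easy to compute once the two sets of predictions sit in one data frame; a sketch with hypothetical frame and column names:

import pandas as pd

# hypothetical: one binary prediction column per method, keyed by player
both = preds_r.merge(preds_py, on="player")   # columns pred_r, pred_py

print(both["pred_r"].corr(both["pred_py"]))   # the ~0.81 agreement above

# 2 = cut made in both models, 1 = in one model, 0 = in neither
both["confidence"] = both["pred_r"] + both["pred_py"]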

Can we predict actual finish?


Using lmer() in R

Just as we used logistic regression above to predict whether someone makes the cut or not, we can use linear regression to predict where someone will finish in the 2023 US open based on their past data.


I ran a simple mixed model, again with random intercepts and slopes, using strokes gained to predict a continuous "finish" variable.

library(lme4)

mod <- lmer(finish2 ~ sg_ott + sg_app + sg_putt + sg_arg +
              (sg_ott + sg_app + sg_putt + sg_arg | season) + (1 | `player id`),
            data = linear)

Results showed that each additional stroke gained in each area (off the tee, on the approach, around the green and putting) decreases one's expected finish by about 11 to 12 places (looking at the fixed effects estimates). These effects are significant and show how strokes gained impact expected finish! For example, each additional stroke gained off the tee (sg_ott) improves one's predicted finish by 12.29 places.


Using a machine learning training function in R:

library(caret)

set.seed(2)

# center and scale predictors; evaluate with repeated 10-fold cross-validation
preProcess <- c("center", "scale")
trControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# run the model (train() defaults to a random forest when no method is given)
model <- train(finish2 ~ sg_putt + sg_arg + sg_app + sg_ott, data = linear,
               preProcess = preProcess, trControl = trControl)

# predict 2023 finishes
test$finish_pred <- predict(model, test)

We find the model is not predicting anyone to finish first (the lowest predicted finish is in fact 16th), but perhaps that has to do with not having enough data to train the model on. Remember, we only have data from three previous US Opens, which means there are only three first-place finishers.


However, a simple linear prediction model here suggests Scottie will finish the lowest, followed by Jon Rahm, Tyrrell Hatton, Xander, Rory and Tony Finau. Not too bad if you ask me! I've interestingly heard a lot of talk about Tyrrell this week; I wouldn't have had him in my top 10, but if the data says so... maybe! Xander was a bit of a surprise since he has not done much lately, but he does tend to hang around in majors.


Using random forest regression in Python

Again, I wanted to see how these results converge across methods, so I ran a few models using scikit-learn in Python. Here we are going to try a random forest regressor. This model is similar to the random forest classifier (it also uses decision trees), but now we are predicting continuous values rather than a binary outcome.

# call model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_jobs=-1, random_state=22)

# fit
model.fit(train.drop("finish2", axis=1), train.finish2)

# note: scored on the same data the model was fit to, so this is optimistic
model.score(train.drop("finish2", axis=1), train.finish2)

A simple model call had a good score (keeping in mind it was scored on the data it was trained on, so it is optimistic), so I decided to tune that model and run with it.


Similar to before with scikit-learn, we separate the training data into predictors (x) and outcome (y) to tune the model.

# split training data into x and y

x_train, y_train= train.drop("finish2", axis=1), train.finish2

rf_grid = {'bootstrap': [True],
        'max_depth': [80, 90, 100, 110],
        'max_features': [2, 3],
        'min_samples_leaf': [3, 4, 5],
        'min_samples_split': [8, 10, 12],
        'n_estimators': [100,150,278]}

gs_rf = GridSearchCV(estimator = RandomForestRegressor(), 
        param_grid = rf_grid,
        cv = 3, 
        n_jobs = -1, 
        verbose = 2)
    
# fit
gs_rf.fit(x_train,y_train)

gs_rf.score(x_train,y_train) 

gs_rf.best_params_

The tuned model scored an R^2 of 92% (again on the training data). That satisfied me, so I used the suggested best parameters to create a model. Then, I used that model to predict finish from the 2023 strokes gained data thus far.

import pandas as pd

rf_model = RandomForestRegressor(n_estimators=278,
                                 min_samples_leaf=3,
                                 min_samples_split=8,
                                 max_features=3,
                                 max_depth=90,
                                 n_jobs=-1,
                                 random_state=22,
                                 bootstrap=True)

rf_model.fit(x_train, y_train)

# read in the 2023 strokes gained data and predict finish
test = pd.read_csv("usopen_test.csv")
test_preds = rf_model.predict(test)

On first glance, this model looks similar to the linear regression model from R. It is predicting Jon Rahm to finish lowest in the US Open, followed by Scottie, Tyrrell Hatton and Xander. Granted, the lowest predicted finish is still only 14, but we can still track how the model does!


Comparing methods

Across the R and Python linear machine learning models, the predicted finishes correlated at 0.99! That is near-perfect convergence across methods.


Here are how the results stack up:

The top 10 or so predictions are uncannily similar. A nice proof of concept for the models -- but now we have to see how they actually perform. Only one thing left to do but sit back and enjoy the US Open from LACC!


But lastly, I selected the 33 players who were predicted to make the cut in both models and to finish in the top 40 in both models. Their 2023 strokes gained (players plotted in order of projected finish) are plotted here:
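(The selection itself is just a filter across the two models' outputs; a sketch with hypothetical column names for the merged predictions:)

# keep players predicted to make the cut in both models and
# projected inside the top 40 by both linear models
picks = both[(both["cut_r"] == 1) & (both["cut_py"] == 1) &
             (both["finish_r"] <= 40) & (both["finish_py"] <= 40)]
picks = picks.sort_values("finish_py")  # order by projected finish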

At the end of the day, I think if Scottie can get his putter going (he leads every other strokes gained category but is dead last in putting), it will be hard to beat him. On top of that, he's won a major before and has been playing well for four rounds at a time the last few weeks. Otherwise, I'd say keep an eye on Xander (just because I'm curious/surprised he's up there so high!) and Max Homa, the LA kid, who plays well at LACC and has solid strokes gained.


***but of course the biggest caveat is that I do not have strokes gained data in 2023 for LIV players.... so let's see if any of these guys can beat Brooks!***
