Looking at similarities between NBA players from the 2015-2016 season
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data, CRAN v2.1.0
library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1
library(scales) # Scale Functions for Visualization, CRAN v1.1.1
library(ggfortify) # Data Visualization Tools for Statistical Analysis Results, CRAN v0.4.11
library(gghighlight) # Highlight Lines and Points in 'ggplot2', CRAN v0.3.1
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3
library(gt) # Easily Create Presentation-Ready Display Tables, CRAN v0.2.2
For this task I analyzed player statistics from the 2015-2016 National Basketball Association (NBA) season. Each observation in the dataset holds one player's per-game statistics. I chose PCA to see how players differ across 11 features that are considered important to a basketball player's success.
nba_players <- read_csv(here("_texts", "NBA_PCA", "data", "nba_players.csv")) %>%
  clean_names() %>%
  separate(player, into = c("player", "html"), sep = "\\\\") %>% # clean the player name column
  dplyr::filter(mp > 18) %>% # filter for players who played over 18 minutes a game (out of a possible 48)
  dplyr::filter(g > 30) %>% # filter for players who played over 30 games (out of a possible 82)
  drop_na(age, fga, e_fg_percent, ft_percent, trb:pts) # drop observations with missing values

nba_players_pca <- nba_players %>%
  dplyr::select(age, fga, e_fg_percent, ft_percent, trb:pts) %>% # select the features for pca
  scale() %>% # scale the features
  prcomp() # run pca
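Standardizing before PCA matters here because the features are on different scales (age in years, shooting percentages, per-game counts). The sketch below illustrates the idea on a stand-in dataset: it uses the built-in mtcars data rather than the NBA file, and the object names are illustrative only. It shows that scaling first and then calling prcomp() gives the same components as asking prcomp() to scale internally, and what can happen when scaling is skipped.

```r
# Stand-in data (an assumption for illustration, not the NBA data):
# four numeric columns from the built-in mtcars dataset.
features <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# Approach used above: standardize first, then run PCA.
pca_manual <- prcomp(scale(features))

# Equivalent one-step form: let prcomp() center and scale internally.
pca_builtin <- prcomp(features, center = TRUE, scale. = TRUE)

# The component standard deviations should agree, and the loadings
# should agree up to sign and floating-point error.
all.equal(pca_manual$sdev, pca_builtin$sdev)
all.equal(abs(unname(pca_manual$rotation)), abs(unname(pca_builtin$rotation)))

# Without scaling, the variable with the largest raw variance (hp here)
# dominates the first component.
pca_unscaled <- prcomp(features)
round(pca_unscaled$rotation[, "PC1"], 3)
```

Either form should work for the NBA features; the explicit scale() call simply makes the standardization step visible in the pipeline.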
# Quick look at the data
nba_players %>%
  dplyr::select(player, pos, age, fga, e_fg_percent, ft_percent, trb:pts) %>%
  filter(player %in% sample(player, size = 5)) %>%
  gt() %>%
  tab_header(
    title = "Statistics from a Random Sample of Five Players",
    subtitle = "From the 2015-2016 NBA regular season"
  ) %>%
  fmt_percent(
    columns = vars(e_fg_percent, ft_percent),
    decimals = 1
  ) %>%
  tab_style(
    style = list(
      cell_text(style = "italic"),
      cell_borders(
        sides = "right", # spell out the full argument name rather than rely on partial matching
        color = "black",
        weight = px(2)
      )
    ),
    locations = cells_body(
      columns = 1
    )
  ) %>%
  cols_label(
    pos = "position"
  )
Statistics from a Random Sample of Five Players
From the 2015-2016 NBA regular season

player | position | age | fga | e_fg_percent | ft_percent | trb | ast | stl | blk | tov | pf | pts |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Norris Cole | PG | 27 | 10.8 | 43.9% | 80.0% | 3.4 | 3.7 | 0.8 | 0.1 | 1.7 | 2.3 | 10.6 |
Danilo Gallinari | SF | 27 | 13.2 | 47.2% | 86.8% | 5.3 | 2.5 | 0.8 | 0.4 | 1.5 | 1.6 | 19.5 |
Paul Millsap | PF | 30 | 13.2 | 50.5% | 75.7% | 9.0 | 3.3 | 1.8 | 1.7 | 2.4 | 2.9 | 17.1 |
Chris Paul | PG | 30 | 15.1 | 51.7% | 89.6% | 4.2 | 10.0 | 2.1 | 0.2 | 2.6 | 2.5 | 19.5 |
J.R. Smith | SG | 30 | 11.0 | 53.5% | 63.4% | 2.8 | 1.7 | 1.1 | 0.3 | 0.8 | 2.6 | 12.4 |
autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "pos" # organize colors based off position
) +
  labs(title = "Biplot for PCA",
       caption = "Biplot of NBA players' basic statistics from the 2015-2016 NBA season.\nColors are organized by position.",
       colour = "Position") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13)
  )
A few observations from the above biplot:
The first 2 principal components (PCs) capture about 60% of the variance in the dataset's variables. Because the first 2 PCs account for a fairly large share of the variance, the biplot is a reasonable place to explore relationships between the features.
Field goal attempts (fga), turnovers (tov), and points (pts) are highly correlated. This falls in line with conventional wisdom: a player who takes more field goal attempts has more opportunities to score points. Taking more shots also usually implies that the player has the ball a greater percentage of the time, which leads to more turnovers. These 3 variables, along with steals (stl) and assists (ast), load most heavily on the first PC.
Effective field goal percentage (e_fg_percent) and free throw percentage (ft_percent) are negatively correlated. One possible explanation is that players with a higher effective field goal percentage tend to be centers, who are known for poor free throw percentages.
Looking at the players by position in multivariate space, we can see that players at the same position tend to be similar to one another. For instance, most of the centers (C) are located in the bottom half of the biplot, while most point guards (PG) are located in the top half.
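The observations about variance captured and about which variables drive the first PC can also be checked numerically from the prcomp() result rather than read off the plot. The following is a minimal sketch on stand-in data (built-in mtcars, not the NBA file; all object names are illustrative), so the numbers it prints say nothing about the NBA analysis itself.

```r
# Stand-in data (an assumption for illustration): numeric columns from mtcars.
features <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC: squared component
# standard deviations divided by their total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Share of variance captured jointly by the first two PCs.
sum(var_explained[1:2])

# Loadings on PC1, ordered by absolute size: the variables at the top
# contribute most to the first component.
pc1_loadings <- pca$rotation[, "PC1"]
pc1_loadings[order(abs(pc1_loadings), decreasing = TRUE)]

# A relationship suggested by the biplot can be compared with the raw
# correlation, since two PCs only approximate the full data.
cor(features$hp, features$wt)
```

With the NBA objects defined earlier, the same expressions would apply to nba_players_pca in place of pca.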
Below is the same biplot, but with the 5 best players of that season highlighted (according to the MVP voting).
autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player"
) +
  labs(title = "Biplot for PCA",
       subtitle = "Top 5 players in MVP Voting are Highlighted",
       caption = "Biplot highlighting some of the best players for the 2015-2016 NBA season") +
  gghighlight(player %in% c("Kawhi Leonard", "Stephen Curry", "LeBron James",
                            "Russell Westbrook", "Kevin Durant")) + # top 5 players in MVP voting
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13),
        plot.subtitle = element_text(size = 11)
  )
plotly to see Similarities Between Players
Lastly, to see which players are similar to one another, I made an interactive plot where you can hover over each data point to reveal the name of the player.
nba_pca_plot <- autoplot(nba_players_pca,
                         data = nba_players,
                         loadings = TRUE,
                         loadings.label = TRUE,
                         loadings.colour = "khaki2",
                         loadings.label.colour = "black",
                         loadings.label.fontface = "bold",
                         colour = "player", # map colour to player so the tooltip can show each name
                         colour.show.legend = FALSE
) +
  labs(title = "Interactive Biplot") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        legend.position = "none",
        plot.title = element_text(face = "bold", size = 13)
  )
ggplotly(nba_pca_plot, tooltip = "player") # interactive plot
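Hovering works well for browsing, but similarity can also be ranked directly by measuring distances between observations in PC space. The sketch below is one possible way to do that, again on stand-in data (built-in mtcars, with car names playing the role of player names). The helper nearest_neighbors() is hypothetical and not part of any package, and it uses the raw PC scores, which may differ by a per-axis scaling from the coordinates that autoplot() draws.

```r
# Stand-in data (an assumption for illustration): mtcars rows stand in
# for players, and row names stand in for player names.
features <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Raw scores of each observation on the first two PCs.
scores <- pca$x[, 1:2]

# Pairwise Euclidean distances between all observations in PC space.
dist_mat <- as.matrix(dist(scores))

# Hypothetical helper: the n observations closest to a given one.
nearest_neighbors <- function(name, n = 3) {
  d <- dist_mat[name, ]
  d <- d[names(d) != name] # drop the observation itself
  head(sort(d), n)
}

nearest_neighbors("Mazda RX4")
```

Using only two PCs keeps the ranking consistent with what the biplot shows; including more columns of pca$x would base the ranking on more of the variance.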
For attribution, please cite this work as
Khanjian (2021, Jan. 25). Roupen Khanjian: NBA PCA Analysis. Retrieved from https://khanjian.github.io/roupen-website/texts/NBA_PCA/
BibTeX citation
@misc{khanjian2021nba,
  author = {Khanjian, Roupen},
  title = {Roupen Khanjian: NBA PCA Analysis},
  url = {https://khanjian.github.io/roupen-website/texts/NBA_PCA/},
  year = {2021}
}