NBA PCA Analysis

Principal Component Analysis R NBA

Looking at similarities between NBA players from the 2015-2016 season

Roupen Khanjian
library(tidyverse) # Easily Install and Load the 'Tidyverse', CRAN v1.3.0
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data, CRAN v2.1.0
library(here) # A Simpler Way to Find Your Files, CRAN v1.0.1
library(scales) # Scale Functions for Visualization, CRAN v1.1.1
library(ggfortify) # Data Visualization Tools for Statistical Analysis Results, CRAN v0.4.11
library(gghighlight) # Highlight Lines and Points in 'ggplot2', CRAN v0.3.1
library(plotly) # Create Interactive Web Graphics via 'plotly.js', CRAN v4.9.3
library(gt) # Easily Create Presentation-Ready Display Tables, CRAN v0.2.2 

Brief Introduction to Data

The data used for this task was obtained from the following link: data. I decided to analyze National Basketball Association (NBA) player statistics from the 2015-2016 season. Each observation in this dataset is one player's per-game statistics. I chose PCA to see how the players differed across 11 features that are deemed important for a basketball player's success.

Data Wrangling and PCA

nba_players <- read_csv(here("_texts", 
                             "data", "nba_players.csv")) %>% 
  clean_names() %>% 
  separate(player, into = c("player", "html"), sep = "\\\\") %>% # clean the player name column
  dplyr::filter(mp > 18) %>% # filter for players who played over 18 minutes a game (out of a possible 48)
  dplyr::filter(g > 30) %>% # filter for players who played over 30 games (out of a possible 82)
  drop_na(age, fga, e_fg_percent, ft_percent, trb:pts)  # drop observations with missing values 

nba_players_pca <-  nba_players %>%  
  dplyr::select(age, fga, e_fg_percent, ft_percent, trb:pts) %>% # select the features for pca
  scale() %>% # scale the features
  prcomp() # run pca

# Quick look at the data
nba_players %>%
  dplyr::select(player, pos, age, fga, e_fg_percent, ft_percent, trb:pts) %>% 
  filter(player %in% sample(player, size = 5)) %>% 
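  gt() %>% 
  tab_header( # table title and subtitle
      title = "Statistics from a Random Sample of Five Players",
      subtitle = "From the 2015-2016 NBA regular season"
    ) %>% 
  fmt_percent( # show the shooting percentages as percents
      columns = vars(e_fg_percent, ft_percent),
      decimals = 1
    ) %>% 
  tab_style( # italicize the player column and add a border to its right
    style = list(
      cell_text(style = "italic"),
      cell_borders(
        sides = c("right"), 
        color = "black",
        weight = px(2)
      )
    ),
    locations = cells_body(
      columns = 1
    ))  %>% 
  cols_label( # relabel the pos column
    pos = "position"
  )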
Statistics from a Random Sample of Five Players
From the 2015-2016 NBA regular season
player            position  age  fga   e_fg_percent  ft_percent  trb  ast   stl  blk  tov  pf   pts
Norris Cole       PG        27   10.8  43.9%         80.0%       3.4  3.7   0.8  0.1  1.7  2.3  10.6
Danilo Gallinari  SF        27   13.2  47.2%         86.8%       5.3  2.5   0.8  0.4  1.5  1.6  19.5
Paul Millsap      PF        30   13.2  50.5%         75.7%       9.0  3.3   1.8  1.7  2.4  2.9  17.1
Chris Paul        PG        30   15.1  51.7%         89.6%       4.2  10.0  2.1  0.2  2.6  2.5  19.5
J.R. Smith        SG        30   11.0  53.5%         63.4%       2.8  1.7   1.1  0.3  0.8  2.6  12.4
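
Before reading a biplot, it can help to check how much variance the first two components capture and which features drive them. The sketch below is only an illustration: it uses the built-in mtcars data as a stand-in (an assumption, since the NBA CSV is not bundled here), and the object names demo_pca and var_explained are hypothetical.

```r
# Stand-in data: mtcars ships with base R; it is NOT the NBA dataset.
demo_pca <- prcomp(scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]))

# Proportion of total variance captured by each principal component
var_explained <- demo_pca$sdev^2 / sum(demo_pca$sdev^2)
round(var_explained, 3)

# Loadings: how strongly each original feature contributes to PC1 and PC2
round(demo_pca$rotation[, 1:2], 2)
```

The same two expressions should work on nba_players_pca, since it is also a prcomp object; the loadings matrix holds the directions that the biplot draws as arrows.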


autoplot(nba_players_pca, # biplot of the first two principal components
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "pos" # organize colors based off position
         ) +
  labs(title = "Biplot for PCA",
       caption = "Biplot of NBA players' basic statistics from the 2015-2016 NBA season.\nColors are organized by position.",
       colour = "Position") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13))

A few observations from the above biplot:

Biplot Highlighting a Few Players

Below is the same biplot, this time highlighting the five best players of that season according to the MVP voting (which can be found here: MVP voting).

autoplot(nba_players_pca, # same biplot, colored by player so gghighlight can pick out names
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player"
         ) +
  labs(title = "Biplot for PCA",
       subtitle = "Top 5 players in MVP Voting are Highlighted",
       caption = "Biplot highlighting some of the best players for the 2015-2016 NBA season") +
  gghighlight(player %in% c("Kawhi Leonard", "Stephen Curry", "LeBron James",
                            "Russell Westbrook", "Kevin Durant")) + # top 5 players in MVP voting
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13),
        plot.subtitle = element_text(size = 11))

Biplot Using plotly to see Similarities Between Players

Lastly, to see which players are similar to one another, I made an interactive plot where you can hover over each data point to reveal the name of the player.

nba_pca_plot <- autoplot(nba_players_pca,
         data = nba_players,
         loadings = TRUE,
         loadings.label = TRUE,
         loadings.colour = "khaki2",
         loadings.label.colour = "black",
         loadings.label.fontface = "bold",
         colour = "player" # color by player so each point carries the player's name for the tooltip
         ) +
  labs(title = "Interactive Biplot") +
  theme_minimal() +
  theme(axis.title = element_text(face = "bold", size = 12),
        panel.grid.minor = element_blank(),
        plot.title = element_text(face = "bold", size = 13))

ggplotly(nba_pca_plot, tooltip = "player") # interactive plot
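
Hovering shows which points sit close together; one possible non-interactive complement is to rank observations by their Euclidean distance in the PC1-PC2 plane. The sketch below is an illustration rather than part of the original analysis: it again uses the built-in mtcars data as a stand-in, and nearest_in_pc_space is a hypothetical helper name.

```r
# Stand-in data: mtcars ships with base R; it is NOT the NBA dataset.
demo_pca <- prcomp(scale(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]))

# Coordinates of every observation on the first two principal components
scores <- demo_pca$x[, 1:2]

# Hypothetical helper: the k observations closest to `name` in PC1-PC2 space
nearest_in_pc_space <- function(scores, name, k = 3) {
  diffs <- sweep(scores, 2, scores[name, ])   # subtract the target's coordinates
  dists <- sqrt(rowSums(diffs^2))             # Euclidean distance to the target
  dists <- dists[names(dists) != name]        # drop the target itself
  head(sort(dists), k)
}

neighbours <- nearest_in_pc_space(scores, "Mazda RX4", k = 3)
neighbours
```

To try this on the NBA data, the rows of nba_players_pca$x would first need to be labeled with nba_players$player, which assumes each player name appears only once after filtering.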


For attribution, please cite this work as

Khanjian (2021, Jan. 25). Roupen Khanjian: NBA PCA Analysis. Retrieved from

BibTeX citation

  author = {Khanjian, Roupen},
  title = {Roupen Khanjian: NBA PCA Analysis},
  url = {},
  year = {2021}