When LeBron James re-signed with Cleveland in the summer of 2015, a year after returning to take his talents to the depths of Lake Erie, NBA salary cap rules restricted the amount that the Cavaliers could pay him. LeBron wisely signed a one-year deal worth $23.0 million, making him the second-highest-paid player in the league behind Kobe Bryant at the time. Unlike LeBron, Kobe was in the twilight of his career and benefitted from the Lakers wanting to make sure he had even more money to waste before he followed the trend and inevitably fell into bankruptcy.

By signing a one-year deal, LeBron positioned himself to take advantage of the rising salary cap (which increased by 34.5%, from $70.0 million to $94.1 million). The following year he signed a three-year, $99.9 million deal that made him the highest-paid player in the NBA in terms of annual salary.

Given LeBron’s contributions on the basketball court, one can intelligently argue to his/her friends that he deserves to be the highest-paid player in the NBA. But, assuming a hard salary cap, is $31.0 million (LeBron’s salary in the first year of his deal) comparable to his market value? Are any NBA players paid the market rate for their services? More interestingly, what factors explain how salaries are set in the NBA?

This is the first, and hopefully longest, of a four-part series that will explore these topics.

- Part 1 (this post) will discuss the factors that explain salaries in the NBA.
- Part 2 will analyze the results of that model and present the expected salaries of NBA players using 2015-16 season statistics to explain their 2016-17 salaries.
- Part 3 will detail the implications of this model, including which players are the most overpaid and underpaid and what certain teams’ rosters would look like if teams paid players their expected (or deserved) salaries.
- Part 4 will present the results of this analysis and predict the expected salaries of players for the 2017-18 season.

**Model**

To estimate the NBA wage equation, I used 2015-16 season player data to explain 2016-17 salaries. Since the data spans two seasons, I restricted the set to players who played in both the 2015-16 and 2016-17 seasons and who earned at least $100k during the 2016-17 season.

Regulations (e.g., the salary cap, max contracts, the super max, minimum salaries, the rookie wage scale, the mid-level exception, etc.) dominate the wage setting structure in the NBA. Fortunately, we can reasonably estimate what salaries would be if we strip away these regulations (except the salary cap) by adjusting the salary data.

Players signed their contracts in different years, and each contract reflects the economics at the time it was signed. To put every contract on the same footing, I adjusted each one as if it had been signed this past offseason, scaling it by the ratio of the current salary cap to the salary cap in the first year the contract took effect. For example, Kemba Walker and Kyle Lowry both played on $12 million contracts this past season. But Lowry’s $12 million (signed in 2014) is actually worth $17.9 million in today’s dollars, compared to $16.1 million for Walker (signed in 2015). After incorporating this adjustment, Chris Paul is the de facto highest-paid player in the league. Sorry, LeBron!
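The cap adjustment can be sketched in a few lines of Python. This is only an illustration: the function name and the cap table are assumptions (the 2014-15 and 2016-17 cap values are the published figures of roughly $63.1 million and $94.1 million, not numbers taken from the dataset used in this post).

```python
# Salary caps in $ millions, keyed by the year a contract first took effect.
# Assumed values for this sketch, not pulled from the post's dataset.
SALARY_CAP = {2014: 63.065, 2015: 70.0, 2016: 94.143}
CURRENT_YEAR = 2016


def cap_adjusted_salary(salary, first_year, current_year=CURRENT_YEAR):
    """Scale a salary by the ratio of the current cap to the cap in the contract's first year."""
    return salary * SALARY_CAP[current_year] / SALARY_CAP[first_year]


# Both players earned $12 million, but signed under different caps.
lowry = cap_adjusted_salary(12.0, 2014)
walker = cap_adjusted_salary(12.0, 2015)
print(f"Lowry: ${lowry:.1f}M, Walker: ${walker:.1f}M")
```

With these assumed caps, the sketch gives the same $17.9 million and $16.1 million figures quoted above.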

This raises an interesting question that these posts will discuss: what would the NBA market look like if every player were a free agent?

**Regression Analysis**

To answer this, I ran an ordinary least squares linear regression using the 2016-17 salary data for each player in the NBA that is included in my set. Actually, I ran about 100 regressions until I settled on this one, which I think best models the unrestricted NBA labor market. Below I present each variable and discuss why each variable was included and why others were excluded.

First, you might find it helpful if I explain what a regression is. I promise you it’s not voodoo magic. An easy way to see the relationship between two variables is to look at the correlation coefficient. A correlation coefficient close to 1 (or -1) indicates a strong positive (or negative) relationship between the two variables. The closer the correlation coefficient is to 0, the weaker the relationship between the two variables. You know how people say correlation doesn’t always mean causation? Well, that’s partly because the correlation coefficient doesn’t control for other variables that could influence the relationship. A linear regression does. Together, the independent variables explain the variation in the dependent variable; ideally, they cover all of the factors that influence it. In its simplest form, this can be expressed as an equation:

y = a + b*X + e

Where:

y = the dependent variable

a = the intercept term

b = the coefficient on X, the independent variable

X = the independent variable

e = the error term
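To make the equation concrete, here is a minimal sketch of fitting a and b by ordinary least squares, along with the correlation coefficient. The helper name `fit_line` and the data points are made up for illustration; they are not NBA data.

```python
def fit_line(x, y):
    """Return (a, b, r): intercept, slope, and correlation for y = a + b*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx                   # slope that minimizes the squared errors
    a = mean_y - b * mean_x         # the line passes through the means
    r = sxy / (sxx * syy) ** 0.5    # Pearson correlation coefficient
    return a, b, r


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, r = fit_line(x, y)
print(f"intercept a = {a:.2f}, slope b = {b:.2f}, correlation r = {r:.3f}")
```

The error term e is whatever is left over for each point after subtracting a + b*x from y.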

In this equation, there is only one independent variable. However, it is rare that only one variable will explain the variation in the dependent variable. In the real world, there are always multiple factors that explain a certain result. You wouldn’t tell someone that Donald Trump won the presidency solely because of James Comey’s actions. You would also mention his electoral strategy, the bias/error in the polls, Hillary’s mistakes, and a few other factors that people smarter than me would know.

What is great is that we can add multiple independent variables to an equation. The coefficient on an independent variable displays the effect of that independent variable on the dependent variable, holding all other factors constant.
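A small sketch of what "holding all other factors constant" means in practice. It uses a standard result (the Frisch-Waugh-Lovell theorem): the coefficient on one variable in a multiple regression equals the slope you get after first stripping out the part of that variable explained by the others. The data and function names below are invented for illustration.

```python
def slope_and_intercept(x, y):
    """Simple least squares fit of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    return slope, mean_y - slope * mean_x


def partial_slope(x1, x2, y):
    """Coefficient on x1 in a regression of y on both x1 and x2."""
    slope, intercept = slope_and_intercept(x2, x1)
    # Keep only the part of x1 that x2 cannot explain.
    x1_resid = [v1 - (intercept + slope * v2) for v1, v2 in zip(x1, x2)]
    return slope_and_intercept(x1_resid, y)[0]


# Invented data: y depends on both variables, and x1 is correlated with x2.
x1 = [1, 0, 2, 1, 3, 2]
x2 = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * u + 3 * v for u, v in zip(x1, x2)]

naive = slope_and_intercept(x1, y)[0]   # ignores x2, so it overstates x1's effect
partial = partial_slope(x1, x2, y)      # controls for x2
print(f"naive slope = {naive:.2f}, slope holding x2 constant = {partial:.2f}")
```

The naive slope absorbs some of x2's effect because the two variables move together; the partial slope recovers the true coefficient of 2 used to build y.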

The NBA salary equation is modeled below.

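ln(salary) = a + b1*years in league + b2*years in league^2 + b3*minutes per game + b4*points per 36 min + b5*career All-NBA + b6*offensive win shares + b7*defensive win shares + e

Here a is the intercept, b1 through b7 are the coefficients on the independent variables, and e is the error term.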

If you just glossed over that and didn’t read it, that’s understandable. Honestly, I usually gloss over equations when I see them in the middle of posts too. Don’t worry, you can keep reading and not be confused. Below I describe the salary variable (the variable we want to explain) and the variables I included in the model that explain why some players make more than others.

**Variables**

**Salary**

Salary is the variable that the other variables in the model explain. As I stated before, I adjusted the salary data to account for the increasing salary cap. I also transformed the data to use the natural logarithm of the salary data. Natural logarithm? We’re probably stepping into the realm of too nerdy here so bear with me. As shown in the chart below, salary data is distributed non-linearly.

Since the data is not linear (it follows a roughly exponential curve), we can’t use it in a linear regression without making an adjustment. From a least squares perspective, the squared distances between the data points and a straight line would be pretty large. Taking the natural logarithm straightens out the data, making it suitable for a linear regression. The transformed data falls much closer to a straight line (with much smaller squared distances), as shown below.

Something tells me a linear regression line will better fit this data than the pre-transformation data.
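A side effect of the log transformation is that coefficients are read as percentage changes rather than dollar amounts. The sketch below illustrates this with a made-up coefficient of 0.05; it is not an estimate from the model in this post.

```python
import math


def pct_change_from_log_coef(coef, delta=1.0):
    """Percent change in salary implied by a log-salary coefficient for a `delta`-unit change."""
    return (math.exp(coef * delta) - 1.0) * 100.0


# On the log scale, equal ratios become equal distances:
# going from $1M to $2M is the same step as going from $10M to $20M.
step_small = math.log(2_000_000) - math.log(1_000_000)
step_large = math.log(20_000_000) - math.log(10_000_000)
print(math.isclose(step_small, step_large))

# Hypothetical coefficient of 0.05: one extra unit raises salary by about 5.1%.
print(f"{pct_change_from_log_coef(0.05):.1f}%")
```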

**Independent Variables**

**Years in the League**

The number of years a player has played in the NBA accounts for how salaries change as players progress in their careers. It makes sense that the longer a player sticks in the NBA, the higher a player’s salary would be, right? This is true, but only to an extent. I would expect that a player’s salary would be highly correlated with his performance, which is subject to an aging curve. Players improve until they hit their prime. Once they start finding gray hairs in the shower, their performance declines until the league kicks them to the street or they lose their dignity and play for the Knicks or Kings.

To account for the aging curve, not only do I include the number of years a player has played in the NBA as a variable but I also include the squared term. The squared term allows us to model the aging curve. The squared term bends the curve to create a negative slope after players reach their peak salary. As shown in the curve below, on average, players reach their salary peak in their tenth year in the league, after holding all other factors constant.

This makes sense. Players play on their rookie contract for the first four years of their career. Assuming they can actually play basketball, they then receive a rookie extension that covers the next four years. The next contract is when players typically cash in on their biggest contract (assuming they’re still in the league), which would cover years 9 through either 12 or 13. After that, contract values typically start to fall as teams realize too late that players are washed up and not worth nearly as much as they had been.

The curve is indexed to year 10 (the peak) which has a value of 1. This means, for example, that rookies earn 20 percent of what players in their tenth year earn. There are a lot of interesting insights from this curve, which I will discuss in more detail in Part 2.
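For readers who want to see how a linear plus a squared term produces a peak, here is a rough sketch. The coefficients are hypothetical, picked only so that the curve peaks at year 10 and starts near 20 percent of the peak; they are not the model's estimates, and treating a rookie as year 0 is also an assumption.

```python
import math


def peak_year(b, c):
    """Year where b*t + c*t^2 is maximized (requires c < 0)."""
    return -b / (2.0 * c)


def indexed_curve(b, c, years):
    """Salary multiplier by year, indexed so the peak year equals 1."""
    t_peak = peak_year(b, c)
    peak_value = b * t_peak + c * t_peak ** 2
    # The model is in logs, so exponentiate to get back to a salary ratio.
    return {t: math.exp(b * t + c * t * t - peak_value) for t in years}


# Hypothetical coefficients on years in league and its square.
b, c = 0.32, -0.016
curve = indexed_curve(b, c, range(0, 16))
print(f"peak year: {peak_year(b, c):.0f}")
print(f"year 0 relative to peak: {curve[0]:.2f}")
```

The negative coefficient on the squared term is what bends the curve back down after the peak.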

Why use years of experience rather than age? Years of experience has a higher correlation with salary than age. Since salaries are structured in the NBA, players receive raises when they hit certain milestones (e.g., expiration of their rookie contract) regardless of whether that player is old enough to have teenage children or is a teenager himself. Also, players’ performance tends to decline with years in the league rather than with age. Players with 10 years of experience who entered the NBA at 19 will probably have as much wear and tear on their bodies as 31-year-olds with 10 years in the NBA. Unless you’re LeBron and adapt your game like a chameleon.

**Minutes per Game**

Players who play more earn more. One could argue that causation actually works the opposite way here: teams feel forced to play their highest earners. But I trust NBA coaches enough to think that they would play their best players regardless of salary. That said, the Lakers went on a win streak after they benched Timofey Mozgov and Luol Deng (ever heard of failed tanking?), which scared some fans into thinking the team would lose its lottery pick. Luke Walton presumably never worried, trusting that the league would conspire to make sure Lonzo stayed in LA.

**Points per 36 Minutes**

I actually didn’t intend to use points per 36 minutes or any points metric in this model; the economic forces pushed me into it. Points per 36 minutes serves as a proxy for usage rate, a statistic that measures the share of team possessions a player uses. There is a high correlation between points scored and usage rate (0.91) since players need to “use” possessions in order to score, which means that (a) including both in the model would make their individual coefficients unreliable and (b) one can effectively proxy the other. Points per 36 minutes actually has a higher correlation with salary than usage rate, suggesting that it would better explain the variation in salary data.
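One common way to put a number on this overlap is the variance inflation factor (VIF), which measures how much the uncertainty around a coefficient grows when a predictor is correlated with another one. The VIF is brought in here purely as an illustration; it is not a diagnostic reported by the model in this post. With only two predictors, it depends solely on their correlation.

```python
def vif_two_predictors(r):
    """Variance inflation factor when two predictors have correlation r."""
    return 1.0 / (1.0 - r * r)


# Correlation between points per 36 minutes and usage rate, as quoted above.
r_points_usage = 0.91
vif = vif_two_predictors(r_points_usage)
print(f"VIF at r = {r_points_usage}: {vif:.1f}")
```

A VIF near 1 means the predictors barely overlap; at a correlation of 0.91 the coefficient variance is inflated several times over, which is one reason to keep only one of the two variables.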

**Career All-NBA**

Players who make more All-NBA teams should make more money than other players. Makes sense. Teams pay players for their former ability, whether or not they admit to it. LeBron has made enough All-NBA teams for the rest of the NBA and I think he gets paid well. If you’re not sure if that’s true, reread this post or watch basketball some time.

This variable has the weakest statistical connection to salary of any of the variables included, mainly because it doesn’t separate a player like Paul Millsap (perennial almost All-Star) from a guy like Mozgov, who is only known for a couple of key blocks and getting way too much money from the Lakers.

**Offensive Win Shares & Defensive Win Shares**

Rather than include each box score statistic separately, I included one overall metric. Blocks, steals, assists, rebounds, turnovers, fouls, and favorite animal (is that not included?) may not accurately reflect the variation in salary when analyzed separately. Different types of players rack up those statistics differently. Chris Paul or any point guard will have more assists than Rudy Gobert, but Gobert or any other center will have more blocks. It makes more sense to include one overall metric that compiles all box score statistics to neutralize any distortions resulting from positional differences.

So why win shares and not some other metric? Win shares has a higher correlation with salary than any of the other overall metrics (RPM, BPM, PER), which means it should better explain why some players earn more than others.

Finally, why separate win shares into its offensive and defensive components? Since win shares is simply offensive win shares added to defensive win shares, splitting them loses no information; it just lets the model estimate a separate coefficient for each component instead of forcing both to share one. There are more insights to glean from separating win shares into its components, something that I will discuss in the next post.

This post may not be the juiciest, but it lays the groundwork for the next three posts. Stay tuned for my next post, where I will delve into the details behind the results of the model and some other interesting insights that we can take from it.