The Z Files: Adjusting 2021 Projections for the Regional Schedule

Updated on October 8, 2020 9:59AM EST

Last time out, I reviewed several things I'm thinking about as I get set to embark on 2021 MLB projections. One of those things really has the wheels spinning, so let's dig a little deeper. Specifically, should 2020 numbers be adjusted for quality of competition?

In a normal 162-game season, teams play 76 games (47 percent) against divisional opponents, so the stats are somewhat influenced by the strength of their divisional foes, but there's ample play against other teams to dilute the bias and not have to worry about it. However, in 2020, 40 games (67 percent) were played within the division with the other 20 (33 percent) contested in the cross-league geographical zone. Each club faced only nine others. In a standard campaign, everyone plays 20 other squads.

Intuitively, quality of opposition must have played a part in the past season's performances. It may be that the quality is the same across the three regions, so it washes out. What if it is not, though? A batter's 110 wRC+ may not equate to the same level of hitter from one of the other divisional pairings. Pitchers with a 24 percent strikeout rate may not exhibit the same level of dominance if they pitched on a team in one of the other groupings.

Even if there is a difference, any adjustment will rest more on judgment than on hard data. Let's dig into some numbers to see what we're dealing with and perhaps add a little objectivity to a mostly subjective dilemma.

Surface

Surface stats like ERA and WHIP can be misleading on an individual basis, but when applied en masse, their relatability comes in handy. Here's a breakdown of those ratios from the 2020 campaign, beginning with the three geographical zones.

Region	ERA	WHIP
East	4.72	1.41
Central	4.12	1.26
West	4.52	1.31

Sure enough, the three divisional pairings suggest the quality of players within each is different. That said, to say the Central has the best pitching while the worst is in the East isn't necessarily accurate. Think of it this way. Is a 4.00 ERA from a Triple-A pitcher akin to that of an MLB hurler? Of course not. This is not to say the difference in the MLB combined divisions is like that between Triple-A and the majors, but the example holds true.

The above ratios reflect the difference in prowess between the hitting and pitching of the respective zones. It may be the West or East pitching is collectively better than the Central, or that the hitting in the Central is significantly worse than the bookending coasts, or a mix of both.

Here's a look at the data per division:

Division	ERA	WHIP
AL East	4.53	1.37
NL East	4.92	1.45
AL Central	4.10	1.27
NL Central	4.14	1.25
AL West	4.65	1.33
NL West	4.39	1.30

Again, there isn't any cogent analysis applicable to projections from this data. The comparisons are all relative.

Some are drawing conclusions from the wild card playoff results, specifically how many teams from each division advanced to the next round.

Division	Wild Card Round	Divisional Round
AL East	3	2
NL East	2	2
AL Central	3	0
NL Central	4	0
AL West	2	2
NL West	2	2

The obvious deduction is that despite owning the two lowest ERAs of the six divisions, the AL Central and NL Central teams are worse than those from the East and West.

If each series were a coin flip, there would be a 1 in 128, or 0.78 percent, chance all seven Central teams would lose. I don't know what the odds were, but for the sake of math, if each Central team was a 3:1 underdog, there's a 13 percent chance all would fail to advance. There seems to be some teeth to the narrative.
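For anyone who wants to check that arithmetic, here's a minimal Python sketch. It assumes the seven series (three AL Central clubs plus four NL Central clubs, per the table above) are independent events with the same win probability, which is a simplification. The function name `prob_all_lose` is made up for illustration.

```python
def prob_all_lose(win_prob: float, n_series: int = 7) -> float:
    """Chance that every one of n_series teams loses its series.

    Assumes each series is independent and every team has the same
    probability of winning (an assumption, not a measured value).
    """
    return (1.0 - win_prob) ** n_series


# Coin flip: each Central team wins its series half the time.
coin_flip = prob_all_lose(0.5)

# 3:1 underdog: each Central team wins one time in four.
underdog = prob_all_lose(0.25)

print(f"coin flip:    {coin_flip:.2%}")   # 0.78%
print(f"3:1 underdog: {underdog:.2%}")    # 13.35%
```

Swapping in a different `win_prob` shows how quickly the "all seven lose" outcome goes from shocking to merely unlucky.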

Looking at one year in a vacuum can be misleading since there isn't anything against which to compare. However, an issue with investigating stats prior to 2020 is there wasn't a universal designated hitter, so the comparisons have to be kept at the league level (American to American, National to National). Here's the divisional ERA data since 2017, with the standard deviations between divisions included.

Division	2020	2019	2018	2017
AL East	4.53	4.99	4.59	4.17
AL Central	4.10	5.05	4.88	4.53
AL West	4.65	4.92	4.34	4.43
NL East	4.92	4.73	4.45	4.60
NL Central	4.14	4.69	4.34	4.27
NL West	4.39	4.80	4.27	4.15
AL St. Dev.	0.29	0.07	0.27	0.19
NL St. Dev.	0.40	0.06	0.09	0.23
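The standard deviation rows can be reproduced from the ERA figures in the table. The sketch below uses Python's `statistics.stdev`, which is the sample (n-1) standard deviation. That choice is inferred from the fact it matches the published numbers; the article doesn't state which formula was used. The names `era` and `division_spread` are made up for illustration.

```python
from statistics import stdev

# Divisional ERA by league and season, in East/Central/West order,
# copied from the table above.
era = {
    "AL": {
        2020: [4.53, 4.10, 4.65],
        2019: [4.99, 5.05, 4.92],
        2018: [4.59, 4.88, 4.34],
        2017: [4.17, 4.53, 4.43],
    },
    "NL": {
        2020: [4.92, 4.14, 4.39],
        2019: [4.73, 4.69, 4.80],
        2018: [4.45, 4.34, 4.27],
        2017: [4.60, 4.27, 4.15],
    },
}


def division_spread(league_era: dict) -> dict:
    """Sample standard deviation of the three divisional ERAs, per season."""
    return {year: round(stdev(values), 2) for year, values in league_era.items()}


for league, seasons in era.items():
    print(league, division_spread(seasons))
```

With only three divisions per league, each standard deviation rests on three data points, so these spreads are noisy by nature.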

The larger standard deviations from 2020 support the notion there's a bigger difference this season in divisional quality, but it's far from unequivocal proof. Not to mention, it could be driven by sample size, so let's repeat the breakdown using an equivalent number of games from each season.

Data from 2017-2019 will be used. Each season will be parsed into two-month intervals (Apr/May, Jun/Jul and Aug/Sep). Instead of presenting the individual data, here is the average of the nine standard deviations for each league, compared to the 2020 mark.

League	2020	2017-19 Average	2017-19 High
AL	0.29	0.24	0.39
NL	0.40	0.17	0.35

The average of the standard deviations is smaller than 2020's level, suggesting the 2020 spike isn't simply due to variance. However, there was one period from the AL higher than the 2020 standard deviation, so there is still a chance the results are a sample size effect.

Admittedly, this is a simplistic approach fraught with flaws, since it assumes the divisions were equal in quality from 2017-2019, and they almost assuredly weren't. The question is whether the difference this season is large enough to warrant neutralizing the 2020 numbers, since the three regions were essentially separate leagues.

A big factor yet to be investigated is park effects. First, here's a synopsis of the data for runs and homers by handedness. An estimate is necessary for Globe Life Field (new venue), Marlins Park/Oracle Park (right field fences moved in at both venues) and Fenway Park/Citi Field/T-Mobile Park (each installed a humidor).

Division/Region	RUNS	HR LHB	HR RHB
AL East	100	102	106
NL East	98	100	100
AL Central	102	100	97
NL Central	100	101	98
AL West	96	102	100
NL West	101	93	96
East	99	101	103
Central	101	101	98
West	98	98	98

The key is the aggregate ERA in the Central was the lowest, despite the games being played in the most run-friendly venues when viewed as a group. As will be discussed in a moment, ballpark neutralization is part of the projection process. The purpose of looking at this data is to determine if the parks are responsible for the discrepancies in regional ERA. They are not. In fact, the park data helps support the theory there was a difference in the quality of teams between the three geographical zones.

Let's work with the assumption the aggregate quality of the Central teams is weaker than the East and West, so an adjustment to their numbers is needed if the schedule returns to normal. Keep in mind, there's no guarantee this will be the case, but fantasy drafts will almost assuredly be approached in that manner, hence projections need to follow suit.

My method begins with generating a neutral projection. I take the actual numbers and flush out all the outside influences like park factors, luck and age. A weighted average of the neutralized stats is computed, which is the neutral projection. The actual projection then adds the context back – specifically age and parks. This offseason, neutralizing for regional quality will be part of the process. But how big a factor should it be?

One possibility is to regress each skill towards its corresponding league average. For example, here's the breakdown of strikeout rates from this past season.

Division/Region/League	K%
AL East	23.24%
NL East	22.91%
AL Central	24.11%
NL Central	25.77%
AL West	22.48%
NL West	22.23%
East	23.07%
Central	24.93%
West	22.35%
American	23.27%
National	23.60%

This is not necessarily indicative of what I'll do; it's an example of a regression calculation. Say I choose to regress 25 percent. If an AL pitcher posted a 22% strikeout rate, his neutralized mark would be (.75 x 22%) + (.25 x 23.27%) or 22.32%.
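Here's the same regression calculation as a minimal Python sketch. The 25 percent weight is only the example figure from above, not a recommended setting, and the function name `regress_to_mean` is made up for illustration.

```python
def regress_to_mean(observed: float, league_avg: float, weight: float = 0.25) -> float:
    """Pull an observed rate toward the league average.

    weight is the share given to the league average; the remaining
    (1 - weight) stays with the player's observed mark.
    """
    return (1.0 - weight) * observed + weight * league_avg


AL_K_PCT = 23.27  # American League strikeout rate from the table above

# An AL pitcher with a 22 percent strikeout rate, regressed 25 percent.
neutral_k = regress_to_mean(22.0, AL_K_PCT, weight=0.25)
print(f"{neutral_k:.2f}%")  # 22.32%
```

A weight of 0 leaves the observed mark untouched, while a weight of 1 replaces it with the league average, so the choice of weight is where the subjectivity lives.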

Over 1200 words and a bunch of tables and I'm still not sure what I'll do. At this point, it's fair to wonder, "Is it worth it?"

Let's use a very simplistic example: a pitcher who posted an identical 3.65 ERA over 170 innings in both 2018 and 2019. If he repeated it this season in 60 frames, his 2021 projection would be 3.65 (work with me, this isn't how ERA is projected, but I want to keep it relatable).

What if this pitcher toiled for one of the Central teams in 2020, though? His adjusted ERA would be higher. For the sake of this example, let's say the neutralized mark is 4.10. Here's how I'd project the 2021 ERA:

((11 x 4.10 x 60) + (7 x 3.65 x 170) + (4 x 3.65 x 170)) / ((11 x 60) + (7 x 170) + (4 x 170)) = 3.77
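The weighted average above can be checked with a short Python sketch. The 11/7/4 weights and the innings totals come straight from the formula; reading them as 2020, 2019 and 2018 respectively is inferred from the order of the terms. The function name `weighted_era` is made up for illustration, and as noted earlier, this is a simplified stand-in for how ERA is really projected.

```python
def weighted_era(seasons: list[tuple[float, float, float]]) -> float:
    """Innings- and recency-weighted ERA.

    Each entry is (weight, era, innings). Every season counts in
    proportion to weight * innings.
    """
    numerator = sum(weight * era * innings for weight, era, innings in seasons)
    denominator = sum(weight * innings for weight, _, innings in seasons)
    return numerator / denominator


# (weight, ERA, innings): 2020 neutralized to 4.10, then 2019 and 2018.
adjusted = [(11, 4.10, 60), (7, 3.65, 170), (4, 3.65, 170)]

# Same pitcher with no regional adjustment to the 2020 ERA.
unadjusted = [(11, 3.65, 60), (7, 3.65, 170), (4, 3.65, 170)]

print(f"adjusted:   {weighted_era(adjusted):.2f}")    # 3.77
print(f"unadjusted: {weighted_era(unadjusted):.2f}")  # 3.65
```

Even with the heaviest weight, the 60-inning 2020 sample makes up only about a quarter of the denominator (660 of 2,530), which is why a 0.45-run adjustment moves the projection by just 0.12.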

In terms of projected earnings, the difference in ERA is about $1. If WHIP and strikeouts were also adjusted, we're looking at a $2-$3 difference. In the first round, this is just a spot or two. For the next couple of rounds, it's about one round difference. As you serpentine down the snake, we're talking about several rounds.

So yeah, adjusting for region could make a difference.

Landing on the level is far from straightforward. In a way, I feel sorry I just went through all sorts of mental gymnastics for somewhat minimal payoff. That said, it serves as a great example of the issues involved with 2021 draft plans, whether you're a spreadsheet person or more of a feel-type drafter.

Welcome to my world.