Building a Rating System

Building a Rating System

Most games today have a poor ranking system. Take tennis. The ATP ranking system is based on “points”, which are awarded by participating in select tournaments in the previous 52 weeks. The system is needlessly complicated and somewhat arbitrary, and does not directly take strength of opposition into account. The upper echelons of Halo and Call of Duty online rankings are, according to a friend, often populated with mediocre players who just play a lot of games, a result of a system that rewards sheer quantity of play over quality of results1. I played high school quizbowl, and the system for ranking the teams was subjective and fairly arbitrary. The top 20 or so teams were fairly accurate because people knew who the best teams were, but beyond that it was guesswork based on unreliable metrics2.

Chess has a different rating system, one that nearly every game and sport should use. The underlying premise of the Elo rating system, named after its Hungarian inventor, is that the purpose of a rating is to be able to predict the outcome of future games. Philosophically, it was based on a big data approach before big data existed–the only thing that should determine the rating formula is the corpus of games played. With enough prior games, Elo contends, you should be able to calculate the probability of any player beating any other player. The Elo rating system is better than many other ranking systems for several reasons:

1. It rewards the quality and consistency of results over sheer frequency of playing. Three good results will gain more points than ten mediocre results.
1. It provides a way to accurately predict the probability of one player beating another. In a two player Elo system, a player rated 100 points higher than his or her opponent has a 64% probability of winning3.
1. The ratings directly account for the opposition strength. Beating strong opponents gains more points. Losing to weak opponents loses more points.

Developing the Algorithm

My friends and I started getting into the board game Settlers of Catan towards the end of my senior year of high school. We played a few games every week and would get pretty competitive about it. We’d argue over who was the best but had no metric to determine that other than, “well, he seems to win all the time.” After our first semester of college, we all came back home for the break and resumed playing. Inspired by chess ratings, I sought to create a rating system to put an end to the debate once and for all. I spent a day thinking about the various algorithms to use and a night coding an app that I later called the “Game Rating Calculator.” The rest of this post explains the process I went through in creating a rating algorithm, and the results after nearly a year of using it.

When I was developing a rating algorithm for the Game Rating Calculator, I was inspired to create a system modeled after the Elo system. I couldn’t port the Elo formula exactly, since it was designed for only two player games. I looked at a few approaches before settling (pun intended) on the final formula.

There are many variations of the Elo formula4. For the sake of clarity, all references in this post will be to the following formula:

Rnew= Rpre+ K(S-E)

Where Rnew is the new rating, Rpre is the pregame rating, K is the “k factor”, an arbitrary multiplier (a higher k factor means higher rating volatility), S is the total score in a rated event, and E is the total expected score in a rated event, which is calculated by:

where Ropp  is the opponent’s rating.

Assumptions:

I took my base scenario to be a four player game of Settlers of Catan, where all players were of equal strength. Intuitively, each player’s probability of winning the game is 25%. Assume that A came in first, and, if it is possible to have a second, third, and fourth place, then B came in second, C came in third, and D came in fourth. To determine a player’s opposition strength, I take the average rating of all of the opponents. All calculations using an Elo formula are done with a K factor of 16.

Approach #1 – The Pairwise Approach

The pairwise approach tries to get around Elo’s inability to work for more than two players by treating multiplayer games with one winner as a series of pairwise matches. In a four player game with players A, B, C, and D, if A won the game, you would rate the game as the result of three individual matches: A vs B, A vs C, and A vs D. But that leaves us with a problem: A has “played” three games, while B, C, and D have played one. The official online Settlers of Catan website and ranking system attempts to rectify this by taking into account relative positions in the game. That is, the second place player (determined by the number of points at the end of the game) will have “played” three games, losing against the first place player and winning against the third and fourth place players, and so on.

There are several problems with this approach. In our hypothetical case of four equally rated players playing a game, each player should achieve each possible position (first, second, third, and fourth) 25% of the time. In this case, the winning player’s rating was calculated as if he won three games in a row. Since each of his opponents is equally rated, the probability of him winning any individual game is 50% and the probability of winning three games in a row is .5*.5*.5=.125=12.5%. This means the formula acts as if the winner has an accomplished a task that is twice as difficult as it really was, and therefore rewards too many points. Similarly, the last place player will lose too many points, because even though his odds of coming in last were 25%, his rating will be calculated as if he lost three games in a row, which only has a 12.5% chance of occurring.

We can use the binomial theorem to calculate the implied odds for the second and third place players. The probability of winning two out of three games against an equally rated opponent (second place) is 3C2(.5)2(.5)1 = 3*.125= .375 =37.5%. Similarly, for winning one out of three games against an equally rated opponent: 3C1(.5)1(.5)2= 3*.125= .375 =37.5%5.

We can summarize the results in the following table:

 Position Actual Probability of Achieving Position Implied Probability of Achieving Position Through Pairwise Ratings Result First 0.25 0.125 Overrated Second 0.25 0.375 Underrated Third 0.25 0.375 Overrated Fourth 0.25 0.125 Underrated

We can determine the results using a standard elo calculator with a k factor of 32. Let all four players be rated 1000. Their new ratings would be as follows:

First place 1000 → 1048

Second place 1000 → 1016

Third place 1000 → 984

Fourth Place 1000 → 952

One could argue that even though the implied expected scores are off, since the players are all equal in strength, things should “balance out”. That is, in this example, the last place player is expected to come in first as many times as he comes in last, so things should even out. And if he doesn’t then he really deserves the lower rating.

The problem is that this really messes with the predictability with elo ratings. In that one example, there is now a 96 point difference between the first place player and last place player, which is so big that, after just one game, A’s predicted probability of beating D head-to-head had jumped from 50% to 63.5%. Even with a lowered K factor dampening the rating chance, the probability would still be overly optimistic for A. And since Elo uses the predicted probability (expected score) for calculating rating changes, this can mess with subsequent rating changes. And even if all the kinks could be worked out, it seems sloppy to use incorrect probabilities.

From a practical standpoint, this system also has two major problems.

1. In a game where there is only a winner and multiple losers, this system wouldn’t work. In Settlers of Catan, the player with the second most points at the end of the game is not necessarily the second most likely to win. Games where a player in third or even fourth “leapfrogs” to first and wins the game are not uncommon.
1. Having ratings dependent on positioning is ripe for manipulation. For instance, if you were were in dead last in a game of Settlers, you could offer to throw the game in favor of another player in exchange for help in moving up from last to place to third. This kind of manipulation is incentivized under this system.

Approach #2 – The Proportional Rating Approach

Elo operates on differences between actual performance and expected performance. As I studied potential solutions to a multiplayer Elo algorithm, I realized that if I could develop a way to determine expected performance in a multiplayer game, I could just plug it into the two player Elo formula. One way of doing this would be to give each player an expected performance equal to the proportion of their rating to the sum of the total ratings. If there were four players with ratings R1, R2, R3, R4, each player’s individual expected performance would be their rating divided by (R1+R2+R3+R4). In the case of our hypothetical example, each player’s expected score would be 1000/4000=.25, which is what our intuition expects.

The one issue with this approach it implies fairly slow improvement. Winning five games in a row is a pretty big deal (after all, in 4 person games the odds of that happening are .255= .09%), but the rating gain would only be 65 points, which means that in a sixth game against 1000 rated opponents the expected probability of winning would only be 1065/4065= 26.2%,when it clearly should be higher.

As for evaluating its accuracy, there was really no way to test if the expected score was reasonable other than to eyeball it for completely inaccurate values. In a four player game with players rated 2000, 1000, 1000, and 1000, the first player has a 40% chance of winning, which seems reasonable for a player who is twice as good as each of his individual opponents. The only problem is that it would have taken so many wins to get to 2000 that the player is almost certainly underrated to begin with.

Approach #3 – The Modified Expected Score Approach

There was one final approach to examine, and this was the simplest of them all. Take our base scenario of four players, each rated 1000. The expected score of each player is .25, while the expected score in a one game, pairwise comparison is .5. In a five player game, the expected score is .20, while the pairwise comparison score is .5. In a six player game, the expected score is .17, while the pairwise comparison score is .5. The expected score for a game of N equally rated players can be generalized as:

2Ep/N

where Ep is the expected score in a pairwise comparison. The key question was, could I expand this formula for the expected score beyond the case of all players being equally rated? I added this multiplier to the Elo formula and tried it out with a few hypothetical games, and the rating adjustments seemed to make sense. But the only true way to establish this was to actually use the system in production. After all, Elo was revised over 40 years of real world usage (though others have found better algorithms). There were two things I really liked about this formula:

1. It’s point neutral among established players (not counting rounding errors)6 .
1. It gives a reasonable point adjustment for different size games.

This seemed to be the most promising approach and the algorithm I ended up using. The only issue that might arise is how the ratings would operate in the case of a player rated far above the rest. In a four player game, the theoretical maximum expected score is .5, which means that even if a player were rated one million points higher than each of his opponents, he’d still gain a sizable 16 points from winning. This issue only came up in such unrealistic scenarios, so I wasn’t too worried about it.

Provisional Ratings

The United States Chess Federation has a concept of provisional ratings. For the first 25 rated games, players’ ratings are more volatile7. The idea behind this is that it will more quickly get players to their “true rating”. It also takes into account that moving from casual chess to tournament chess is a disruptive process that has a learning curve, and that a player’s strength will be more volatile as he moves up that curve.

I didn’t include such a system in the rating calculator for a few reasons. First, it is easy to play such a large number of chess games, which makes reaching 25 fairly simple. In a game like Settlers of Catan, where you need to gather at least three (usually more) people for 90+ minutes, this isn’t practical. In the rating pool I set up, only five people had played 25 or more games after 11 months.

There is also a real risk of rating deflation, depending on the number of provisional games. For example, let’s say a provisional rating is twice as volatile as a non-provisional rating. That is, you will gain or lose twice as many points if you are provisionally rated. If there was just one provisional rated game and an average game of Settlers had four players, 75% of the players would have their ratings deflated, since 75% would lose their first game. That 75% would get only half the points they would have otherwise gotten when they win later.

The idea is to tailor the number of provisional games to the point where players are equally likely to get their wins and losses in that set of provisional games as they are to win and lose in general.

The provisional period is five games in my rating pool. At the time, the average game seemed to be around five players. We’ve played many more games since then, and the average has dropped to four players per game. In the first four games, an average player is expected to win one game. But in the fifth game, 75% of the players will lose. In this case, 75% of the incoming players have a deflated rating. Unfortunately, due to a desire for consistency in the rating system, I can’t dynamically adjust the number of provisional players each time. I may retroactively recalculate the ratings based on an average game size of four, since that has held stable for a while.

Results

It was time to put this rating algorithm into production, and see how it held up. As of this writing, 80 rated games have been played, and a total of 28 players have played at least one rated game. Below are all the results for the players who’ve played enough games to not be provisionally rated (6+).

 Player Rating Games Played Average Game Size Winning Percentage* Expected Winning Percentage Net Difference ** K.S. 1073 44 3.82 40.9 26.2 14.7 W.A. 1052 13 4.85 38.5 20.6 17.8 G.N. 1052 38 4.37 31.6 22.9 8.7 A.S. 1028 60 3.72 25 26.9 -1.9 K.S. 1012 41 3.88 31.7 25.8 5.9 I.M. 984 16 4.5 25 22.2 2.8 M.R. 977 7 5 14.3 20 -5.7 A.S. 965 10 4.9 10 20.4 -10.4 G.M. 937 9 4.44 11.1 22.5 -11.4 C.W. 934 6 4.5 0 22.2 -22.2 S.S. 889 29 3.66 13.8 27.4 -13.6 B.S. 859 12 4.25 0 23.5 -23.5

Average: 980.2

Median: 980.5

Average game size of all games played: 3.93

* The Expected Winning Percentage is based on the average game size, assuming all other opponents were equally rated

** The Net Difference may not match Winning Percentage- Expected Winning Percentage due to rounding

At first, I arbitrarily started off all players at 1000. Once there was a decent sized pool, I started off all new players at the median rating of non-provisional players. The median rating has dropped over time, probably because of the deflationary effects of the provisional ratings I mentioned earlier.

Overall, the results look pretty good. The ratings intuitively correspond with how I feel most people should be rated. Importantly, the ratings fit one of the main criteria I wanted in a rating algorithm: it awards the quality and consistency of results, and not just the sheer quantity of games. For instance, W.A. and I are rated exactly the same. He has a much better net winning percentage than I do, but that is compensated by the fact that I’ve maintained a decent percentage for far more games. His strong results are balanced by my consistency. There is one obvious outlier here, and that is A.S. A.S. used to be much lower rated (below 900) but has recently surged with a lot of strong results. One possible explanation for his high ratting and low winning percentage is that he played a lot of games with high rated players. I haven’t calculated each player’s average opponent’s rating, so I can’t say for sure.

I’ll eventually put this all in a SQL database to automate some of the data gathering. But the best way to improve the rating system would be to run data analysis algorithms on the results. How accurately do the ratings predict results? Unfortunately there is still not enough data. Heuristically, I’d think there would have to be a rating pool on the order of at least 25 people who have played at least 25 games to accurately assess the system.

But for now, this algorithm will do. If you’re interested in learning more about the Elo rating system in order to develop your own, check out the book by Elo himself, The Rating of Chess Players: Past and Present.

1 But then again, that’s what the game makers want. These days games usually have some sort of in-game store or revenue steam, so more play usually means more money.

2 There have been some attempts to institute a chess style elo rating for quizbowl, but the ratings aren’t very accurate since data is sporadic.

3 Assuming you are using the Elo formula mentioned in this post.

4 If you’re interested in learning more about the Elo formula, here is a good treatise about it by Mark Glicko, the chief statistician in charge of the United States Chess Federation implementation of the Elo formula.

5 You could have also intuitively determined this result by noting that the probability distribution for the four possible outcomes were symmetrical, and since winning three in a row and losing three in a row are each .125, the other two possibilities must add to .75, hence .375.

6 It’s interesting to note that Elo system used by the United States Chess Federation and FIDE (the world chess federation) are not point neutral, because they have a K factor that declines as rating goes up. This was apparently used a way to stop rating inflation at the top of the rating pool. Not all chess organizations agree with this—the Internet Chess Club, the premier online chess server, does not use variable K factors outside of provisional games.

7 The volatility decreases within the set of 25 games.

35 thoughts to “Building a Rating System”

1. Kevin Forwin says:

For some reason, the revisions that I made to my previous comment refuse to appear. I don’t know why. I miscounted babies for no reason, or rather, I made a typo. Epm (modified expected score) can only maximally be 2/N, which does not make intuitive sense to me. Wanted your thoughts on this.

1. gautam says:

Hey Kevin,

The formula indeed breaks down at extremes, which I noted in the post when I said, “The only issue that might arise is how the ratings would operate in the case of a player rated far above the rest. In a four player game, the theoretical maximum expected score is .5, which means that even if a player were rated one million points higher than each of his opponents, he’d still gain a sizable 16 points from winning. This issue only came up in such unrealistic scenarios, so I wasn’t too worried about it.”

As you noted, the maximum possible expected score is 2/N, rather than asymptotically approaching 1, the way Elo does. This seems like a big weakness, but as I noted in the post in practical usage this really is never an issue, because such highly lopsided matches (you vs 3 babies) rarely occur–in fact, they occur less frequently in multiplayer games than they do in two player games like chess, because it’s to assembled a larger group such that none of the players are even remotely close to the top ranked player.

I know that’s not a satisfying answer. I haven’t really given this algorithm too much more thought since I first developed it because it’s worked really well for my use cases, but if I were to revisit it this would be the first improvement I’d try to make.

2. Kevin Forwin says:

To be clear, does this imply that the maximum expected score is 1/3? That can’t be right. If I’m so good I win 90% of the time (perhaps I’m playing with 3 infants every time), there’s no way to represent that with 2Ep/N, right?

1. Kevin Forwin says:

Sorry, I didn’t clarify N. At 6, maximum Epm (modified expected score), we get 1/3. I miscounted babies.

So, Epm is always 2/N. That doesn’t represent things very well, right?

3. Jesse Inghelram says:

Hi Gautam, Thank you for the article. I operate a tennis academy for youth in SF and am trying to custom build an ELO rating system that can execute a pure strength ladder. Do you have any advice or do you know of any Elo rating widgets that would be good to use? Thanks for any help!

Jesse

1. gautam says:

Hi Jesse,

In absence of a product that handles it for you (and unfortunately, I don’t know of any), sadly the best you can do is to google “elo rating calculator” and input results there and keep track in a spreadsheet, as I did prior to building the rating app. I’ve been hired to work a custom rating system for another sports domain (wrestling), and multiple people have suggested I make a generalized Elo-based ranking service so anyone can build an accurate, custom Elo ladder without little effort and without needing to worry about all the technical details of how Elo works. I may just have to build that!

4. Oren says:

than it is 10^-(1000-1000)/400=10^0=1
and 1/(1+1)= 0.5 so
if S=1 we get
16(1-0.5)=8
and player 1 win 3 games gives me that his new ELO is 1024. etc.

Since you are a math geek and I’m not I have to assume I misunderstood something.

1. gautam says:

Hey Oren, sorry for my slow reply. It appears you’re using a K factor of 16, while my calculations used a K factor of 32, which is a common default value. I should make that clearer in my post. Apologies for any confusion!

5. Oren says:

Hi,

I tried to redo your pairwise (approach #1) with the data you gave and I had to use s=1.5 (I thought S should be 1)
Why is that?
also the 4th player should be 952 and not 956.

1. gautam says:

Hey Oren,

I looked through the calculations and S=1 worked for me. Could you elaborate on your calculations?

And hanks for catching that mistake, I’ve corrected it.

6. Helon Esteves Jr. says:

Hello Gautam,

First of all, congratulations on the article!! This was a very interesting read =). So, here’s my situation:

My friends and I are very avid players of the Game of Thrones board game, and we wanted to start ranking our matches for some time but there’s never a consensus on how we should keep score. That is, until I read this. I pretty much think this is the best way for us to rank ourselves, but there are 2 things I’ll kindly ask your help with:

1) Because of the particularities of this game, I wouldn’t like to reward just the winning player. Instead, I’d like to have a 1st, 2nd, 3rd etc… system. As much as I thought about it, I couldn’t figure out how to make it work on a elo system. Maybe the 1st, 2nd and 3rd are rewarded points while the 4th, 5th and 6th give up points, all based on the position you finished of course!! Or only the winning player take points, but the amount the others give up is decided by the position you finished. I think that this would motivate us to always strive for the best performance possible, even in a situation where victory is all but impossible.

2) The other thing is a smaller issue. In this game there are 3 houses that are slightly more likely to win than the other 3. Looking at internet matches stats we see that houses Stark, Greyjoy and Tyrell win ~56% of times when compared to houses Lannister, Martell and Baratheon, a stat that matches our own observations. This is slightly more than white’s advantage in chess, and while I don’t think chess rankings reward points differently based on whether you won with black or not, I think we should reward winning with the more difficult houses. When playing chess you usually play a set of X number of games agains someone where you change between black and white, whereas in this game we usually only have to time to play 1 or 2 games a month. Is there a way to add a small multiplier for when someone wins with the more difficult houses? Or do you think the discrepancy is too small to matter?

Greeting from Brazil =)

1. gautam says:

Hey Helon,

1)

Because of the particularities of this game, I wouldn’t like to reward just the winning player. Instead, I’d like to have a 1st, 2nd, 3rd etc… system.

I don’t know exactly how the Game of Thrones board game works (although I have friends who rave about it), but before implementing anything that gives ratings based on position I recommend looking over my reasoning for not doing this when creating the Settlers of Catan rating system. I’ll repost it here:

1. In a game where there is only a winner and multiple losers, this system wouldn’t work. In Settlers of Catan, the player with the second most points at the end of the game is not necessarily the second most likely to win. Games where a player in third or even fourth “leapfrogs” to first and wins the game are not uncommon.

2. Having ratings dependent on positioning is ripe for manipulation. For instance, if you were were in dead last in a game of Settlers, you could offer to throw the game in favor of another player in exchange for help in moving up from last to place to third. This kind of manipulation is incentivized under this system.

If those considerations don’t apply, then you’ve got a few options.

You could do approach #1 that’s outlined above. The first place player is awarded points as if he beat #2, #3, and #4 in individual matches, the second place player is awarded points as if lost to #1 and beat #3 and #4, etc. I describe the downsides of that approach above (namely, half the players end up overrated and half the players end up underrated) so I’m not a fan of it.

Another, better option (in my opinion) is to adjust the “actual” scores based on position. For instance, in a game of four players rated 1000, each player’s expected score is 0.25. In the system I describe in the article, the winning player gets an actual score of 1.0 while the losing players all get 0.0, meaning he outperformed his expected score by 0.75 while the rest underperformed their expected scores by 0.25.

How you assign actual scores is a bit arbitrary. My initial instinct was to assign the reciprocal of the position as the score–meaning first place gets a score of 1/1 (1.0), second place gets a score of 1/2 (0.5), third place gets a score of 1/3 (.33), fourth place gets a score of 1/4 (0.25) etc. The problem is since there is only one winner the score should sum to 1.0 (otherwise this violates the principle that the rating system should be zero-sum,e.g. the total number of points gained and lost in a single game should be 0), but the only way to do that is to give the first place winner an actual score that is less than 1.0, which could theoretically result in a situation where winning the game results in losing rating points!

So it’s clear that the actual scores of all the non-first place players have to sum to 0, and again it’s kind of arbitrary how you do this. You could have all the scores be negative, meaning everyone loses points, but you lose fewer points if you perform better. Or you could have some scores be positive, like second and third, and the rest be negative, meaning it’s still possible to gain points even if you don’t win the game, you just won’t gain as many as first place, and conversely the people in the last few places will lose even more points. You’ll have to figure this out based on your own gut feeling for the game–should second place gain rating points, or should they merely lose fewer rating points?

I generally think only the first place player should gain points, so here’s one possible scenario of the “actual score” assigned in a six player game

First place: 1.0

Second place: -.032

Third place: -.065

Fourth place: -.129

Fifth place: -.258

Sixth place: -.516

Here’s an example for a four player game:

First place: 1.0

Second place: -.143

Third place: -.286

Fourth place: -.571

2)

In this game there are 3 houses that are slightly more likely to win than the other 3. Looking at internet matches stats we see that houses Stark, Greyjoy and Tyrell win ~56% of times when compared to houses Lannister, Martell and Baratheon, a stat that matches our own observations. This is slightly more than white’s advantage in chess, and while I don’t think chess rankings reward points differently based on whether you won with black or not, I think we should reward winning with the more difficult houses.

In chess, white’s advantage is between 52 and 56 percent (at the highest levels), but you are right that elo ratings don’t differentiate between white and black because in tournaments and matches players alternate colors and end up playing each color the same number of times, so it balances out. To account for this, you’d have to adjust the expected scores accordingly, but once you do the rating system should work just fine. You’ll have to figure out how to adjust it based on the specifics of the game. Here’s an example of how you could do that:

Assume all six players are rated 1000, and half the houses have a combined 56% chance of winning and the other three have a combined likelihood of 44% chance of winning. In a one-on-one match, that’s the equivalent of a 42 point Elo boost. So you treat three of the houses as if they’re rated 1042 instead of 1000 for the purpose of calculating the rating adjustment (or alternatively, you treat the other three houses as if they’re rated 968), and perform that adjustment on the original 1000 ratings.

Of course, the simple solution is just have everyone play each house an equal number of times, but as you said with 1-2 games a month it may take years before rating imbalances sort themselves out!

***

Overall, it looks like it is doable (though not trivial) to make the necessary adjustments, although it does involve a fair bit of educated guesses. You’ll have to modify the code to account for all these different expected scores. Here’s the code in the Android app that actually does those calculations: https://github.com/gnarizzy/GameRatingCalculator/blob/master/src/com/Centaurii/app/RatingCalculator/util/Calculator.java

If you modify that correctly, you should be able to make it work for your game.

Sorry for this long, rambling response. This was basically me thinking aloud and typing as I came up with ideas. Nonetheless, hope you found it useful!

7. Hey Gautam! I was very pleasantly surprised to find your article on ELO for Catan. A bit of background on me, I played in the Catan National Championship last year and am working on a tournament simulator for said championship.

I think using an ELO-style system for Catan is pretty necessary if we’re trying to define skill. As in chess where you would surely beat me 100% of the time (my ELO is probably 1000-1200) I will win a very large chunk of 4-player Catan games against less skilled opponents. The biggest questions I think need to be answered are, what percentage of games will the world’s best player win against 3 average players? 3 inexperienced players? 3 very-good-but-not-great players?

Quick note on game size- in my personal experience, the 3-player game is quite different than the 4-player game, and both of those are extremely different from the 5-6 player expansion game. So my recommendation would be to keep those ladders separate.

1. gautam says:

Hey Lumin,

It’s cool to hear from a competitive Catan player! The big difference between chess and Catan elo ratings are of course, Catan has a moderate luck component which can only be worked around by playing multiple games. In the Catan National Championship, are players eliminated after one loss? Or is it a series of games? To be honest, I’m trying to think of a totally fair way of running such a tournament and can’t see one unless each player already has pre-established elo ratings and multiple games are played with the same group of people before deciding who advances to the next round. I’m curious how it actually works in practice.

And I definitely agree on different game sizes having different play characteristics (in particular, I find three player games to be the most different from the rest, and of course 5-6 player games have the Special Build Phase which changes things).

I don’t keep those ladders separate for a few reasons:

1) There’d be so few games per each individual ladder that there wouldn’t be enough data (among my play group, at least) to get a meaningful rating

2) I think it’s a safe assumption that someone who is good at three player games is also good at six player games, at vice-versa, even if the specific game dynamics aren’t exactly the same.

3) There’s enough variety in game sizes that any particular advantage a player had with one game size would be balanced out by the many other sized games that player would participate in.

4) I graphed ratings vs. average game size some time ago, and found no correlation (r^2 value of .01): http://i.imgur.com/MSz7LRP.png

8. janeshvar mishra says:

HELLO GAUTAM.. THANKS FOR UR EXPLANATION….

i need to ELO rating system for My Game in which there are maximum 6 player and when one of them lost all his life than it will eliminate and its loose points and all remain player behave as he win game than deducted point from looser player will divide to the all player or each of them get full point which loose by Looser player??

i have already done lots of work on it. but cant get any luck with it. so please suggest me formula for this and please guide me how to solve this.

1. gautam says:

Hey Janeshvar,

Just to make sure I understand how you want your system to work, the parameters you described are:

1) Maximum of six players

2) When a player loses, all remaining players should be treated as if they ‘won’ the game and the rating points gained should be evenly distributed among the remaining players.

3) The process of 2) repeats until there is only one player left.

Is all of that correct?

9. Joao Augusto says:

Hi Gautam,

I’m studying and researching the ratings system to implement in my scenario, I ended up in your blog, and really appreciate your explanations and adaptations.

The case is a virtual racing league, with multiples categories. Each category has about 20 players and they compete in 7 races every season. We have 3 seasons in a year. In order to refine the method, i have historical data since 2006.

My goal is to apply a method to measure the player’s winning chances (ELO system) and measure the strength of each category.

You think that the approach #3 can be my solution?

Any help will be highly appreciated.

Joao Augusto

1. gautam says:

Hey Joao,

1) Will players only be competing with other players within their categories?

2) How many players will typically be racing against each other?

3) Does it make sense for someone who gets second place in a race to be treated the same (rating wise) as someone who came in last? I make an argument in the article why it’s okay to do that in Settlers of Catan, but this may not make sense in the context of racing.

1. Joao Augusto says:

Hi Gautam,

Thank you for your response. Here come my awnsers!

1) Yes. Only players in the same category compete with other. But a player can compete in multiples categories, it is common to play 2 ou 3 categories in the same season.

2) There is categories with 10 players and another with 34. An average of 20 is accurate.

3) I don´t think so. A second and a third place, is always great result. in a 34 player’s race, be in the top 10 – is a great achivement even more if there is realy good players in, that’s the case of Pro Categories.

Be in 5th in a category with 10 players, can be a bad or a good thing depending of the average rating of the pilots.

I think in make something like a race average strength. Then, estimate a range of the positions that each player probably will finish, based on the other’s players rating. So, after the race, i’ll compare the final position with the ‘predicted’ result and add or remove points proportionaly.

Sounds like a good solution?

PS. There will be players who never will gonna win a race. Maybe be in the last part of grid in every race. So there will be a minimum rating to not punish that kind a player sooo bad.

1. gautam says:

Hey Joao,

If each category is a different style of play, it seems to make sense to maintain a separate category rank for each player. For instance, the United States Chess Federation maintains three ratings for each player–a regular rating (more than 30 minutes of clock time per side), a quick rating (between 10 and 30 minutes per side), and a blitz rating (up to 10 minutes per side). Players can have ratings in each category.

If each category is the same kind of play, then separate category ratings don’t really make sense, but may be necessary if there isn’t a whole lot of crossover between categories. Basically, each category becomes its own rating pool and incomparable to other rating pools unless there are many players who play in multiple categories, thus “mixing” the two pools together.

Based on your other comments, it seems like Elo ratings are probably not a good solution for this. Elo is great for 1 v 1, and this article was how I was able to reduce a six player game into a 1 v 1 for the purposes of Elo calculations, but this only works if you assume that it’s okay to treat a third place finish and a sixth place finish the same (and I argue why I think that’s okay in the post).

So your system presents a bit of a challenge–how do you appropriately rank players when you have 34 players playing each other simultaneously? I suppose one possible solution is to get the average (or median, which I think is more appropriate) rating and then treat each race as N 1 v 1 matchups, where N is the number of players and the player wins N-P-1 matchups, where P is there final position. So it there are 30 players with a median rating of 1000, you rate the player who came in third as if they won 28 1 v 1s against a 1000 player and lost 2 1 v 1s against a thousand player. I think that’d work, but I’m not entirely sure.

Unfortunately the software I wrote doesn’t support that scenario, so you’d figure out some way to automate it because otherwise it gets really tedious.

10. Kevin Forwin says:

I think there’s a problem with lumping all the different sized Catan games together. Different sized games have somewhat different mechanics, so 2 equally skilled players who primarily play different sized games would get different scores despite being good at the same sizes.

For example, player A is good at 3 sized games and so is player B. Player A focus on 3 sized games while player B focuses on 4 and 2 sized games. Player A will get a higher score even if player A and player B are the same person.

This almost ranks people more on how good they are at recognizing a game they’re good at rather than how good they are at playing the game.

1. gautam says:

I don’t really agree that the mechanics in three person games vs. six person games are so different that someone could specialize in Catan games with a certain number of players–especially when there still is a non-insignificant element of luck in the game, regardless of the size.

Nonetheless, this does present an interesting train of thought. I’m going to look through the game data I have to see if there is any evidence that people do specialize in a certain game size. I’m not sure I have enough data to draw any conclusions, but it’ll be cool to take a look.

In any case, if you do believe that is the case, you can create separate rating ladders for different game sizes.

11. yami says:

Hi Gautam,
I am building a simple quiz game for my class in which everybody starts with say five lives. The one who answers it fastest and accurately retains his lives while the others lose theirs. So In the end there is only one winner or no winners at all if nobody gets a single question right i.e. all their lives reduce to 0. Of course there could be 2nd spot, 3rd spot, etc. and there could be ties for those spots but there can never be a tie for the first place unless nobody gets a single question right. Can I apply your algorithm (Approach#3 ) to such a scenario?
Thanks

1. gautam says:

I don’t think this algorithm is necessary, since basically the “algorithm” behind your game is essentially: first player to lose five rounds loses, last player standing wins.

If you were writing a program to automate this, all you would have to do is keep track of each player’s wrong guesses. Once it hits five, that player is eliminated, and you can note somewhere what position that player came in.

After completing the elimination portion of each round, check how many players are left. If there is more than one player left, continue to the next question; if there is one player remaining, that player wins; if there are no players left, nobody wins (or alternatively, you check for how many rounds each player remained in the game and the player who lasted the most rounds wins).

1. yami says:

Thanks gautam for your prompt response. I still want to track an individual’s performance over a period of time. Percentage of games won is what I have been using so far to display the leaderboard. But with this ranking system somebody who has won 4/5 games would be rated at the same level as somebody who has won 80/100. I’d rather use your algorithm to account for provisional ratings.
Also I am calculating the elo rating for every question/round of the game. This gives a fair chance to evaluate the player w.r.t other players for every question displayed in the game. If I wait till the end of the game then I would have to complicate the algorithm by accounting for those who quit the game in the middle of it or those who got disconnected, etc. So do you think tracking the elo of every player for every question is bad logic? Do you see any flaws in this? I know that with this approach the player’s final outcome in the game has no bearing on his elo because his elo is calculated every round.

1. yami says:

Oh btw, the game algorithm has already been done and implemented. It is just the elo I am adding to the game for every question

1. gautam says:

Hey Yami, sorry for my delayed response! As long as the questions are of approximately equal difficulty (or at the very least, the questions are randomly distributed), then I think your system should work. I also agree that tracking elo after each question is better than waiting until the end of the game, and I think your reasoning is spot on.

12. Hi gautam, interesting article – thank you.

Where you say this “The key question was, could I expand this formula for the expected score beyond the case of all players being equally rated?”… did you actually come up with a different formula, or just plug in 2Ep/N in place of Ep in each pair-wise calculation?

I’m summing expected and actual scores over each pair of players already, then applying the K(Se – Sa), so I’m thinking I’ll need to apply the 2/N multiplier to both the expected and actual scores in each pairing.

1. Although thinking about it, if I do apply 2/N to both expected and actual scores, it refactors out into simply applying a 2/N multiplier on the final rating change value. So it would be having the effect of reducing the rating change as the number of players increases. Does this sound right?

1. gautam says:

Hey Aranda,

The 2/N is not a multiplier on the final rating, because it shouldn’t be applied to the actual score, just the expected score. What that means is that as the number of players increases, your expected score decreases. This makes sense, because all else being equal it is easier to win a two player game than a six player game. The magnitude of rating change depends on the K-factor used (which is a constant, and not really that important to this discussion) and the magnitude of the difference between the actual score (Sa) and expected score (Se).

What this works out to is that if you win a game with a large amount of players, your rating will increase a large amount, because Se will be small and Sa will be (relatively) large, so Sa – Se = large positive value. Similarly, losing a game with a large amount of players won’t result in losing many points, because Se is small, Sa is 0, so Sa – Se = small negative value.

As for moving beyond the case where all players are equally rated, I used a rather crude (but surprisingly effective, given my data set of over 100 games now) method of simply average the ratings of all opponents and treating it like a two player elo calculation.

Feel free to download either the Android app or desktop app I wrote to play around with different ratings and game sizes and see it for yourself. One note is that the desktop version only supports 3-6 player games, while the Android version supports 2-6 player games. Hope this clears things up!

Edit: if you’d like to see the actual algorithm implementation itself, the code is open source on GitHub. Here’s the specific file that contains the implementation: https://github.com/gnarizzy/GameRatingCalculator/blob/master/src/com/Centaurii/app/RatingCalculator/util/Calculator.java

Note that for “provisional players” I simply made K twice as a big–a fairly arbitrary and naive approach. I believe there is an algorithm the USCF uses to slide K down to a stable value after 25 games have elapsed.

13. How do I create a survey that uses the Elo rating system?

1. gautam says:

Could you explain this a bit more? What do you mean by a survey? Elo is typically used for one-to-one matchups.

14. Can’t wait to use this for a Civ 4 league!