Convergence
First, there is the question of how data points are used. Independent of what data is actually collected, the two systems make very different use of it.
Trueskill 1/2 only ever considers the most recent result and the most recent ratings of the players involved. This makes it very easy to calculate and work with, as every player has just one current rating. On the downside, past ratings are fixed and can't be revised in light of future information.
Whole history rating uses data points more efficiently by storing all match results and associated ratings for a player in a time series. After each match, a new rating time series is chosen so as to maximize the likelihood of all stored results. This means future information can be used to update past ratings. For example, suppose players A and B have fought many games with an even number of victories. Now a new player C enters and A beats C. In Trueskill, A's rating would now be higher than B's, while with WHR, A's and B's ratings will both be raised by the same amount (as the past has clearly shown A = B).
This means that, given only win/loss information, WHR will converge slightly faster than TS for a new player; with enough games, both arrive at the same result. The downside of WHR is that the rating update after each match is effectively an optimization problem spanning every player and every match they've ever played. This can be worked around by batching updates and applying them incrementally.
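To make that concrete, here is a minimal sketch (in Python, with made-up names) of the objective WHR maximizes for a single player's rating history: a Bradley-Terry term for the observed results plus a Wiener-process prior tying adjacent ratings together. Real WHR optimizes all players jointly with Newton's method; this only illustrates why every stored game has to be revisited.
Code: Select all
import math

# Illustrative objective for one player's rating history in WHR (natural
# rating scale, 1v1 only). Names and structure are mine; the real system
# maximizes this jointly over all players with Newton's method.
def log_likelihood(history, results, drift_variance_per_day):
    """history: list of (day, rating); results: list of (history_index, opponent_rating, won)."""
    ll = 0.0
    # Bradley-Terry term: how likely are the observed results under these ratings?
    for history_index, opponent_rating, won in results:
        rating = history[history_index][1]
        p_win = 1.0 / (1.0 + math.exp(opponent_rating - rating))
        ll += math.log(p_win if won else 1.0 - p_win)
    # Wiener-process prior: ratings on nearby days should not differ too much.
    for (day0, r0), (day1, r1) in zip(history, history[1:]):
        var = drift_variance_per_day * (day1 - day0)
        ll += -((r1 - r0) ** 2) / (2.0 * var) - 0.5 * math.log(2.0 * math.pi * var)
    return ll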
Zero-K updates each participating player's rating history directly after a match (runtime 0.01-0.1 sec) and only calculates the knock-on effects on everyone else in an hourly full update (runtime 1-10 sec). Only the most recent ratings are cached in the database, which means the full rating history has to be recalculated on every server restart; this is a very expensive operation, usually taking around 5 minutes. The battle dataset currently holds around 200k battles, but the runtime complexity of full updates/initialization is worse than linear. The update of a single player is linear in the number of battles they've played.
Trueskill can directly calculate new ratings from the ratings of the players involved in a match. It is many orders of magnitude faster than WHR, even when WHR only updates the involved players, because it doesn't have to iterate over past games.
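For comparison, a direct update with the open-source Python trueskill package (TrueSkill 1; version 2 has no public implementation) needs nothing but the two current ratings:
Code: Select all
import trueskill

# Only the current (mu, sigma) of the participants is needed; no game
# history is ever touched.
env = trueskill.TrueSkill(draw_probability=0.0)

alice = env.create_rating()                # new player: mu=25, sigma=25/3
bob = env.create_rating(mu=30, sigma=2)    # established player

alice, bob = env.rate_1vs1(alice, bob)     # alice beat bob
print(alice, bob)                          # alice moves a lot, bob barely moves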
Skill development
Both systems model skill development as a Wiener process (a continuous random walk): all players are expected to undergo random changes in skill at the same rate as time progresses. This is independent of uncertainty, which is what causes new or inactive players to see large changes in their estimated ratings after each game.
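In code, this prior boils down to something like the following sketch (parameter name and value are illustrative, not what either system uses):
Code: Select all
import math

# Wiener-process skill prior: between two games the player's true skill is
# assumed to drift randomly, so the variance of our estimate grows linearly
# with elapsed time, at the same rate for everyone.
SKILL_DRIFT_VARIANCE_PER_DAY = 0.01   # placeholder value

def widen_uncertainty(sigma, days_since_last_game):
    return math.sqrt(sigma ** 2 + SKILL_DRIFT_VARIANCE_PER_DAY * days_since_last_game)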
To compensate for newbies being generally worse at the game, both systems bias their ratings. Trueskill 2 suggests subtracting a penalty based on an experience score; since experience only accumulates, this negative bias can only shrink and never grows back. ZK's WHR instead adds a bias depending on uncertainty, meaning that both inactive and new players will get easier matches. This can easily be changed in either system, so it's really up to the implementation to decide which approach to use.
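As a rough sketch of the two approaches (constants are placeholders, not the values either system actually uses):
Code: Select all
def uncertainty_biased_rating(mu, sigma, k=1.0):
    # ZK-WHR-style: subtract a multiple of the uncertainty, so both new and
    # long-inactive players get matched below their mean estimate.
    return mu - k * sigma

def experience_biased_rating(mu, games_played, penalty=5.0, ramp=20):
    # Trueskill-2-style experience bias: a penalty that only shrinks as games
    # accumulate and never grows back.
    return mu - penalty * max(0.0, 1.0 - games_played / ramp)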
A problem I've encountered in practice is that the expected random change in skill is fixed for every player. When I was optimizing this parameter on datasets from before Zero-K's Steam release, it was best to set it near zero, meaning players are modeled as having nearly constant skill with only minimal random variation. This seems wrong, but the playerbase is so old that many players predate the battle dataset, and new accounts are often smurfs rather than actual new players. After the Steam release, I evaluated the new matches and found it best to set the change rate excessively high: players suddenly play like a whole new person every day. Of course, this is easily explained by the many players coming into the game who often only play for a few days, improving rapidly each day.
To resolve this inconsistency, I tried modeling the expected random rating deviation as a function of time, something along the lines of a 1/x curve, starting high and tending towards zero. I hoped this would let me model both the old and the new dataset effectively, but it turned out to still only work well for one of them, and no better than a well-chosen constant at that. Unfortunately, I still haven't found a solution to this, and Trueskill 2 doesn't seem to do anything differently.
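The kind of schedule I mean looks roughly like this sketch (names and constants are placeholders for whatever the parameter search would pick):
Code: Select all
# Hypothetical account-age-dependent drift, roughly 1/x shaped: new accounts
# are allowed to change skill quickly, old accounts are treated as stable.
def drift_variance_per_day(account_age_days,
                           initial_variance=1.0,
                           halving_age_days=30.0,
                           floor=0.001):
    return max(floor, initial_variance / (1.0 + account_age_days / halving_age_days))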
Team games
Both Trueskill 2 and WHR support team games easily. One decision you have to make is how to weight the games. In WHR, it worked out best to simply weight games and individual skill by 1/#teammates; giving either high-uncertainty (new) or low-uncertainty (veteran) players increased weight made the predictions worse. Both systems also support a penalty to compensate for good or bad teammates, which makes sure the rating change depends on both teams and not just the enemies.
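A minimal sketch of that weighting, assuming a logistic (Bradley-Terry) win-chance curve on the rating difference:
Code: Select all
import math

# 1/#teammates weighting: each team's effective rating is the average of its
# members' ratings; the scale of the logistic curve here is illustrative.
def team_rating(member_ratings):
    return sum(member_ratings) / len(member_ratings)

def team_win_probability(team_a_ratings, team_b_ratings):
    diff = team_rating(team_a_ratings) - team_rating(team_b_ratings)
    return 1.0 / (1.0 + math.exp(-diff))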
Outcome evaluation
This is where Trueskill 2 really differs from its predecessor and Elo-based rating systems. While Elo-based systems usually support a penalty for things such as who got to make the first move, Trueskill 2 comes with a whole list of parameters that are incorporated into win chances: in-game performance, whether the player quit early, how long they played, and even whether they played with friends or strangers. By adding so much more data to every game instead of treating it as a raw binary result, the rating system can become a lot better at predicting outcomes and converges faster.
Even though this sounds great, it has been intentionally avoided in strategy games. If in-game parameters are chosen that define a player's chance to win, they can be directly used to describe how a game has to be played to maximize the expected win chance. Unless the game has been solved for every possible position, such a system will make errors that can be exploited.
An example where such a system might be useful is big team games (8v8-16v16). These are very common in ZK, and from my statistics, neither Elo nor WHR can make any good prediction of the outcome of these games. Feeding a rating system solely with 1v1 games actually led to better team game predictions than feeding it the team games themselves, because the high number of players adds too much noise for a simple win/loss outcome to be useful. If one wants to base ratings on such games, there seems to be no way around including additional parameters, as sketched below.
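One hypothetical way to fold such parameters in is to treat them as extra features shifting a logistic win-chance estimate; the feature names and weights below are made up and would have to be fitted against real outcomes:
Code: Select all
import math

# Sketch only: adjust a team's predicted win chance with additional per-game
# signals on top of the plain rating difference.
def adjusted_win_probability(rating_diff, features, weights):
    # features/weights: e.g. {"avg_metal_income": ..., "early_quits": ...}
    score = rating_diff + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))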
On the other hand, when it comes to 1v1 games and possibly predicting the outcomes of competitive/tournament games, I think it's important not to have any prejudice about which strategies are good or bad, and thus no in-game data should be used. If somebody wins in 30 seconds using some simple rush, that's a problem with the game, not something the rating system should value less.
Choosing your system
When implementing your own version of these rating systems, all but the core rating update algorithm (see Convergence) can be easily adjusted; all of the other points can be added on top of either Trueskill or WHR. Thus I'd suggest deciding whether you're willing to handle the computational and space overhead of WHR for slightly improved convergence speed. Especially if players play very actively (many games per account) and against random opponents, both systems will perform nearly identically.
I've chosen WHR for Zero-K as there were many complaints about smurfs not being rated quickly enough and because it can support nearly every feature found in other rating systems. So far it has worked out well, especially for small teams and 1v1.
Addendum: Comparing Rating Systems
Correct prediction percentages have been mentioned multiple times, and I'd like to add that there is a more effective way to compare rating systems. Prediction percentages suffer from a high amount of noise and treat every prediction as binary correct/incorrect, regardless of how confident it was. They are also highly sensitive to the data set: if most games are very balanced, it becomes hard to predict the winner and all systems will score near 50%, while the opposite is true for highly imbalanced games. What I've found to correlate well with prediction percentages while having lower noise, and thus being more reliable on smaller testing batches, is a logarithmic score.
For every game and every team in those games, each rating system makes a win chance prediction p. It then gets a cumulative score calculated as follows:
Code: Select all
Team won: (1+log2(p))
Team lost: (1+log2(1-p))
A score of 0 signifies random guessing; scores below are worse and scores above are better than random. A completely wrong prediction (a team given a 0% win chance wins) sets the score to negative infinity.
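A small helper to compute this score over a batch of predictions, with a few example values:
Code: Select all
import math

# Cumulative logarithmic score as defined above. Each entry is
# (predicted win chance for a team, whether that team actually won).
def log_score(predictions):
    return sum(1.0 + math.log2(p if won else 1.0 - p) for p, won in predictions)

print(log_score([(0.5, True), (0.5, False)]))   # 0.0 -- no better than random
print(log_score([(0.9, True)]))                 # ~0.85 -- confident and right
print(log_score([(0.9, False)]))                # ~-2.32 -- confident and wrong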