TrueSkill 2 vs WHR

SpringRTS Perl Autohost for Dedicated Server


MaDDoX
Posts: 53
Joined: 08 Jan 2006, 17:45

TrueSkill 2 vs WHR

Post by MaDDoX » 28 Jan 2019, 21:18

Hey guys, I'm considering the development of a new plugin for SPADS replacing the (now obsolete) TrueSkill by the new TrueSkill 2 calculation. Given it's an official MS thing, I just thought that maybe it should be an update (or option?) to SPADS itself, instead of a plugin. What do you think?

My personal take is that TS is a poor system. It disregards individual player effectiveness in defeated teams and - at least in the current implementation - has bad starting numbers: 25 instead of 0 for a starting rank. That's utterly unintuitive to new players; no matter how fast the system auto-tunes itself, it still makes smurf account spammers look better than they are to the uninitiated. WHR, used by Zero-K, has a proper starting rank (0, allowing negatives) and predicts outcomes a bit better than TS, but it also ignores individual player effectiveness in team matches. TS2 has a noticeably higher predictive accuracy (68% vs 52%), and with a -25 rank offset it could be as meaningful to newbies as WHR is.

What do you guys think? Bibim, advice?

Links:
https://www.remi-coulom.fr/WHR/WHR.pdf
https://www.microsoft.com/en-us/researc ... skill2.pdf

PS: In TS2, instead of the kill ratio we'd probably use player effectiveness - killed/lost units ratio or destroyed metal / lost metal ratio.

dansan
Server Owner & Developer
Posts: 1190
Joined: 29 May 2010, 23:40

Re: TrueSkill 2 vs WHR

Post by dansan » 28 Jan 2019, 23:16

Sounds great.
Have you thought of a way to collect the required data (killed/lost units ratio or destroyed metal / lost metal ratio) from matches?

Silentwings
Moderator
Posts: 3582
Joined: 25 Oct 2008, 00:23

Re: TrueSkill 2 vs WHR

Post by Silentwings » 29 Jan 2019, 10:14

advice?
I have some interest in this stuff outside of Spring so I read the papers - thoughts:

-- WHR

WHR is a variant of Elo (which is likely why ZK came across it, I think ZK used to use Elo), and the paper assumes that only two player games occur. This is different to Trueskill, which handles multiple teams with possibly different numbers of players.

The "selling point" of WHR is that it computes skills by maximising the likelihood of a skill function across all time (i.e. the state space is functions of the form f(time)=current estimated skill), as opposed to the usual approach of updating a "current" skill estimate with incremental Bayesian updates. This makes WHR less sensitive than e.g. TS to the order in which games occurred; that could be helpful if only a small number of games have occurred, but (considered per player) after several games I'd guess it won't make much difference.

There is a heavy computational cost to WHR's approach, especially when many games have occurred, because it has to solve an optimization problem in a high-dimensional space. In fact, they had to do some "hacks" and resort to an approximation scheme, based on seeding the optimizer with results from previous runs and then not letting it run for long. The only trial given in the paper lists WHR as taking about 600 times longer to compute than TS and Elo did - 0.4s vs 252s on about 700,000 games used to train the algos - and the gain WHR got in predictive accuracy is pretty small: 55.54% (for TS) vs 55.79% (for WHR).
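To make the trade-off concrete, here's a toy 1-D sketch of the two update styles (this is not either paper's actual algorithm; the Bradley-Terry win model and all constants are illustrative): the incremental pass is cheap but order-dependent, while the whole-history refit is order-independent but must revisit every stored game on each update.

```python
import math

# Toy contrast between incremental updating (Elo/TS style) and
# whole-history refitting (WHR style). One player of unknown skill,
# opponents of known skill, logistic win model; k/lr/iters are illustrative.

def p_win(skill, opp):
    """Chance that `skill` beats `opp` under a logistic (Bradley-Terry) model."""
    return 1.0 / (1.0 + math.exp(opp - skill))

def incremental(results, k=0.5):
    """Elo/TS style: one pass, each game nudges the current estimate.
    The final estimate depends on the ORDER the games arrive in."""
    s = 0.0
    for opp, won in results:
        s += k * (won - p_win(s, opp))
    return s

def whole_history(results, lr=0.1, iters=2000):
    """WHR style: re-fit the estimate against ALL stored games at once
    (gradient ascent on the log-likelihood). Order-independent, but every
    update has to revisit the whole history."""
    s = 0.0
    for _ in range(iters):
        grad = sum(won - p_win(s, opp) for opp, won in results)
        s += lr * grad / len(results)
    return s

games = [(0.0, 1), (1.0, 1), (2.0, 0), (0.5, 1)]
# incremental(games) changes if the games are reordered;
# whole_history(games) does not.
```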

The trial example is for Go, which is a 2 player game. In that situation, WHR gets to use its "I did much more computational work" advantage (at the cost of taking much more time), but TS doesn't get any chance to use its "I'm designed for team games" advantage. So it is not a comparison that means much for Spring.

[Of course there is the hacky approach of modelling a XvY game as X*Y individual 1v1 games, where the player from the winning team is regarded as winning the 1v1 game. This doesn't look very sensible to me; because teams contain wide variation in skill it would frequently result in e.g. low skill players "winning" a fake 1v1 against high skill players, which under normal conditions would be extremely unlikely to happen, and consequently the likely result would be skill estimates with an artificially high variance. I can't see anything good about that, compared to the TS approach. It would be possible to make the main idea of WHR work in team games, but then the computational trouble will get even worse, and since it didn't happen in a 2 player game there's no real reason to expect much of a gain in predictive accuracy from it.]

Conclusion: It's an interesting idea, but for Spring it won't give any real advantage over Elo; so for us it won't perform better than TS either, and it might well be worse.


-- TS2

This is TS with lots of added bells and whistles. It can use more input data, if someone picks extra metrics to act as proxies for player performance. It adds the ability to have game modes with correlated skills, support for players who quit mid-game in team games, and support for clans (all of which would be very useful for Spring). It also adds what looks like a variant of WHR's idea, but without the computational cost, in that they started modelling skills as random walks with an upwards bias in the early stages.

The algorithm itself is the original TS one with extra steps bolted on. It relies on the same basic idea of expectation propagation, but it's not statistically elegant in the same way as TS1 was, so I guess some trial and error was involved in designing it. This makes it much harder to guess how effective it would be without actually testing it. I didn't have time to read all the details.

They get some serious improvements in predictive accuracy over TS1, in non-RTS online multiplayer games, but it also looks like they did some work tuning parts of the algorithm (& choice of metrics) to individual games.

Conclusion: It might have significant improvements for Spring over TS1, but you won't find out without some hard work and testing.

Different Spring games would almost certainly need different metrics. I guess the biggest issue for RTS games is that it may be impossible to choose good metrics. The examples you suggested won't work imo - ratios of sums are "artificially" very high/low when the sums contain only a few elements. (I experimented with tonnes of things of this type when writing awards gadgets...). I don't know how to choose good metrics for BA. I spent a long time trying to get non-stupid results for "efficiency", eventually https://github.com/Balanced-Annihilatio ... s.lua#L220, which is OK-ish and was only added ~6 months ago.
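The small-sample instability is easy to demonstrate with made-up numbers: one extra kill doubles a short game's killed/lost-metal ratio, while the same event barely registers in a long game.

```python
# Illustration of why ratios of small sums are unreliable skill proxies.
# All metal amounts below are invented for the example.
def metal_ratio(killed_metal, lost_metal):
    return killed_metal / lost_metal

short_a = metal_ratio(300.0, 100.0)      # short game: a handful of kills
short_b = metal_ratio(600.0, 100.0)      # one extra lucky kill -> ratio doubles
long_a = metal_ratio(30000.0, 10000.0)   # long game: same underlying 3.0 ratio
long_b = metal_ratio(30300.0, 10000.0)   # same extra kill -> ratio moves ~1%
```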

-- 0 vs 25

TS (mu values) can take negative values, it just doesn't happen when you start at 25 unless a player is **really** record-breakingly terrible. In fact TS is translation invariant in mu, so the difference between using 25 versus 0 as the initial mu value is 100% cosmetic (presumably, in both systems). This looks like a gimmick to me, and I can't help wondering about the stigma inevitably getting attached to negative skill values - "let's kick that dude, he has -1, it makes us worse". They might well be intending to hide the skill data from players, which I think is not a bad idea.
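The invariance is easy to check on a stripped-down version of the 1v1 mean update (a sketch of the standard TrueSkill step with draws and the sigma update omitted; not SPADS/SLDB code): only the difference of the two means enters the formula, so shifting both priors by 25 shifts both outputs by exactly 25.

```python
import math

def _pdf(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ts_mean_update(mu_winner, mu_loser, sigma=25.0 / 3.0, beta=25.0 / 6.0):
    """Simplified TrueSkill 1v1 mean update: winner's mean goes up, loser's
    goes down, by an amount depending only on the mean DIFFERENCE."""
    c = math.sqrt(2.0 * beta ** 2 + 2.0 * sigma ** 2)
    t = (mu_winner - mu_loser) / c
    step = (sigma ** 2 / c) * (_pdf(t) / _cdf(t))
    return mu_winner + step, mu_loser - step

# Starting everyone at 25 vs starting everyone at 0: same updates, shifted.
w25, l25 = ts_mean_update(25.0, 25.0)
w0, l0 = ts_mean_update(0.0, 0.0)
```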

MaDDoX
Posts: 53
Joined: 08 Jan 2006, 17:45

Re: TrueSkill 2 vs WHR

Post by MaDDoX » 31 Jan 2019, 11:07

dansan wrote: Have you thought of a way to collect the required data (killed/lost units ratio or destroyed metal / lost metal ratio) from matches?
This would be relatively trivial to do in a gadget, like the Awards gadget, then submit to SPADS - either natively or through a plugin; there's a callback to override for that. If the collection logic is too "picky", like measuring the % of metal relative to each shot that inflicted damage on an enemy unit, it would probably be heavier than, say, simply measuring the metal cost of the unit killed - and probably not provide any more accuracy than the latter, given "distortions" like continuous healing of units and structures being damaged. As for SW's statement that "ratios of sums are artificially very high/low when the sums contain only a few elements", I don't believe it would apply to your average-length match. Nevertheless, it would probably be advisable to add a factor for short games and/or low global kill ratio (i.e. double com-bombing, games with minimal engagement, com-sniping) to prevent this kind of bias.
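The real collection code would be a Lua gadget, but the bookkeeping it needs is tiny; sketched here in Python with hypothetical hook and parameter names (the `floor` argument is one possible way to implement the short-game damping factor suggested above):

```python
from collections import defaultdict

killed_metal = defaultdict(float)  # metal value of enemy units a player destroyed
lost_metal = defaultdict(float)    # metal value of units a player lost

def on_unit_destroyed(owner, attacker, metal_cost):
    """Called once per destroyed unit; hook and argument names are hypothetical."""
    lost_metal[owner] += metal_cost
    if attacker is not None and attacker != owner:
        killed_metal[attacker] += metal_cost

def efficiency(player, floor=1000.0):
    """Killed/lost metal ratio, damped so that short or low-engagement games
    can't produce extreme values (denominators below `floor` are clamped)."""
    return killed_metal[player] / max(lost_metal[player], floor)
```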

Thanks for the writeup btw SilentWings, really well thought out. The fact that WHR by definition doesn't account for team games really limits its effectiveness, not to mention the processing overhead and marginal prediction accuracy gain.
SilentWings wrote:Conclusion: It might have significant improvements for Spring over TS1, but you won't find out without some hard work and testing.
Supposing I understood what you meant here: it's not realistic to expect a set of artificial data to reflect the actual numbers which will be obtained when the system's running on real games, especially given that a different set of parameters will be provided to TS2, relative to TS1. Likewise, there'll never be a uniform distribution which works for all games, or even for a given game's current community. Of course we have some domain knowledge relative to RTSes in general, but it's imperative to craft the `Prior` dynamically, as the stats from the matches kick in. The ultimate goal IMO is not to build a "statistical AI", but to provide increasingly better skill matching accuracy and, ultimately, game balance to the players. This could be done by adding specific tune-able parameters in the server of each game, within SPADS itself.

As for 0 vs 25 as a starting rank, it's numerically cosmetic indeed, but as you mentioned there are some significant perceptual implications. People with ranks 10 and below are already disregarded by other players in the existing system; I can't imagine how negatives would make it any worse. OTOH, high starting positives are unintuitive and confusing for new players, like I said. A possible alternative would be having 0 set as the lowest possible skill, either by hiding negative numbers or by numerically capping them, to provide for quicker bump-ups. This way a bad veteran player could only "look as bad" as a starting (or new smurf account) player.

Or just hide TS altogether, like @Ivand and others would prefer, and only use it for autobalance computations. The constant concern over ranking tends to demotivate players from playing when they're, for instance, on a bad day. Demotivating play is never good, it only leads to diminishing player numbers. But that's probably better left for another conversation.

Bottom line: Let's introduce TS2 and make sure it's flexible enough both to receive individual player "efficiency" stats from a game gadget and to have a couple of parameters (e.g. low-kills bias, win/loss weight) adjustable in the SPADS implementation.

Silentwings
Moderator
Posts: 3582
Joined: 25 Oct 2008, 00:23

Re: TrueSkill 2 vs WHR

Post by Silentwings » 31 Jan 2019, 12:01

it's imperative to craft the `Prior` dynamically, as the stats from the matches kick in
As long as something reasonably sane was chosen, I doubt it's worth agonizing over precisely what prior is best. What matters is if the proxy metrics for player skill can actually reflect the long term reality of who wins & loses more often.
Bottom line: Let's introduce TS2 and make sure it's
You should train + test it on at least a few hundred games and be able to show a better predictive accuracy than the existing system. Badly chosen metrics would almost certainly make things worse, and afaics it's not known if good (non-insanely-complex) metrics even exist for Spring's usual style of RTS games. That said, I think there is a good chance of getting a positive result.

very_bad_soldier
Posts: 1371
Joined: 20 Feb 2007, 01:10

Re: TrueSkill 2 vs WHR

Post by very_bad_soldier » 31 Jan 2019, 15:47

I think trying to measure a player's performance by anything other than win/loss is bad. The system should not dictate to players *how* they win; otherwise it will influence the actual gameplay.
The guy building that one nuke in the right moment might be game deciding.

triton
Lobby Moderator
Posts: 329
Joined: 18 Nov 2009, 14:27

Re: TrueSkill 2 vs WHR

Post by triton » 31 Jan 2019, 18:51

eheh, the day has come, I totally agree with vbs :)

DeinFreund
Posts: 12
Joined: 14 Aug 2014, 00:12

Precision vs Accuracy

Post by DeinFreund » 31 Jan 2019, 22:42

Convergence
Firstly, there is the question of how data points are used: independent of what data is actually collected, the two systems make different use of it.

Trueskill 1/2 only ever considers the most recent result and the most recent ratings associated with each player. This makes it very easy to both calculate and work with, as every player always has just one rating. On the downside, this means that past ratings are fixed and can't be updated with future information.

Whole history rating uses data points more efficiently by storing all match results and associated ratings for a player in a time series. After each match, a new rating time series is chosen so as to maximize the likelihood of all stored results. This means future information can be used to update past ratings. For example, suppose players A and B have fought many games with an even number of victories. Now a new player C enters and A beats C. In Trueskill, A's rating would now be higher than B's, while with WHR, A's and B's ratings will both be raised by the same amount (as the past has clearly shown A=B).

This means that, given just victories, WHR will converge slightly faster than TS for a new player. With enough time, they will both come to the same result. The downside with WHR is that the rating update after each match is effectively an optimization problem over every pair of players and the matches they've played. This can be worked around by batching updates and doing them incrementally.

Zero-K updates each participating player's rating history directly after a match (runtime 0.01-0.1 sec) and only calculates the causal effects hourly in a full update (runtime 1-10 sec). Only the most recent ratings are cached in the db, so the full rating history has to be recalculated on every server restart. This is a very expensive operation, usually taking around 5 minutes. The current size of the battle dataset is around 200k battles, and the runtime complexity of full updates/initialization is worse than linear. The update of a single player is linear in the number of battles they've played.

Trueskill can directly calculate new ratings from the ratings of the players involved in a match. It is many orders of magnitude faster than WHR, even if WHR only updates the involved players, because it doesn't have to iterate over past games.

Skill development
Both systems model skill development as a Wiener process (continuous random walk), with all players being expected to make random changes in skill at the same rate as time progresses. This is independent of uncertainty, which is what causes new or inactive players to have large changes in estimated ratings after each game.
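In one dimension this model reduces to a Kalman-style filter: the skill variance inflates at a fixed rate between games (the Wiener-process part), and each observed result shrinks it again - which is why a long-inactive or new player's estimate swings hard after one game. The constants below are illustrative, not either system's tuned values.

```python
def drift(sigma2, dt, tau2=0.01):
    """Variance inflation after dt units without games (Wiener dynamics)."""
    return sigma2 + tau2 * dt

def observe(mu, sigma2, performance, noise2=1.0):
    """Kalman-style update toward an observed game performance: the larger
    the current variance, the bigger the rating change."""
    gain = sigma2 / (sigma2 + noise2)
    return mu + gain * (performance - mu), (1.0 - gain) * sigma2

# Same observed result, different activity levels: the long-inactive
# player's estimate moves much further than the active player's.
mu_inactive, _ = observe(0.0, drift(0.1, 100.0), 1.0)
mu_active, _ = observe(0.0, drift(0.1, 1.0), 1.0)
```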

To compensate for newbies being generally worse at the game, both systems bias their ratings. Trueskill 2 suggests a system where an experience score is subtracted from the rating, which suggests that the negative bias may never increase again. ZK WHR adds a bias depending on uncertainty, meaning that both inactive and new players will get easier matches. This can easily be changed in both systems, so it's really up to the implementation to decide which scheme to use.

A problem I've encountered in practice is that the expected random change in skill is fixed for every player. When I was optimizing this parameter on datasets from before Zero-K's Steam release, it was best to set it near zero, meaning players are modeled as having nearly constant skill with only random variations. This seems wrong, but the playerbase is so old that many players are older than the battle dataset, and new accounts are often smurfs instead of actual new players. After the Steam release, I evaluated new matches and found that it's best to set the change rate excessively high - the players are suddenly playing like a whole new person every day. Of course this can easily be explained by many players coming into the game and often only playing for a few days, improving rapidly each day.

To solve this inconsistency, I tried to model the expected random rating deviation as a function of time - something along the lines of a 1/x function, starting high and tending towards zero. I hoped this would allow me to model both the old and the new dataset effectively, but it turned out to still only work well for one of them, and no better than a well-chosen constant at that. Unfortunately I still haven't found a solution to this, and Trueskill 2 doesn't seem to do anything differently.

Team games
Both Trueskill 2 and WHR easily support team games. One decision you have to make is how to weight the games. In WHR this worked out best by simply weighting games and individual skill with 1/#teammates. Either giving high uncertainty (new) or low uncertainty (veteran) players increased weight made the predictions worse. Both systems also support a penalty which is used to compensate for good or bad teammates. This makes sure rating change depends on both teams and not just the enemies.
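For reference, the standard TrueSkill-style prediction for a team match treats each team as the sum of its members' skill distributions; a minimal sketch under Gaussian assumptions (beta is the usual per-player performance noise; values illustrative):

```python
import math

def _cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def team_win_prob(team_a, team_b, beta=25.0 / 6.0):
    """team_a/team_b: lists of (mu, sigma) per player. Probability that
    team A wins, from the Gaussian of the summed-skill difference."""
    delta = sum(mu for mu, _ in team_a) - sum(mu for mu, _ in team_b)
    var = ((len(team_a) + len(team_b)) * beta ** 2
           + sum(s ** 2 for _, s in team_a + team_b))
    return _cdf(delta / math.sqrt(var))
```

Note how the per-player noise terms accumulate with team size, dragging predictions toward 50% in big team games - one face of the noise problem discussed below.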

Outcome evaluation
This is where Trueskill 2 really differs from its predecessor and from Elo-based rating systems. While Elo-based systems usually support a penalty for things such as who got to make the first move, Trueskill 2 comes with a whole list of parameters that are incorporated into win chances: in-game performance, whether the player quit early, for how long they played, and even whether they played with friends or strangers. By adding so much more data to every game instead of just treating it as a raw binary result, the rating system can become a lot better at predicting outcomes, and it converges faster.

Even though this sounds great, it has been intentionally avoided in strategy games. If in-game parameters are chosen that define a player's chance to win, they can be directly used to describe how a game has to be played to maximize the expected win chance. Unless the game has been solved for every possible position, such a system will make errors that can be exploited.

An example where such a system might be useful is big team games (8v8-16v16). These are very common in ZK, and from my statistics, neither Elo nor WHR can make any good predictions of the outcome of these games. Feeding a rating system solely with 1v1 games actually led to better team game predictions than feeding it the team games. This is because the high number of players adds too much noise for a simple win/loss outcome to be useful. If one wants to base ratings on this, there seems to be no way around including additional parameters.

On the other hand, when it comes to 1v1 games and possibly predicting the outcomes of competitive/tournament games, I think it's important not to have any prejudice on what are good or bad strategies and thus no in-game data should be used. If somebody wins in 30 seconds using some simple rush, that's a problem of the game, not anything the rating system should value less.

Choosing your system
When implementing your own version of these rating systems, all but the core rating update algorithm (see Convergence) can be easily adjusted. All of the other points can be added on top of both Trueskill and WHR. Thus I'd suggest considering whether you'd be willing to handle the computational and space overhead of WHR for slightly improved convergence speeds. Especially if players are playing very actively (many games per account) and against random opponents, both systems will perform nearly identically.

I've chosen WHR for Zero-K as there were many complaints about smurfs not being rated quickly enough and because it can support nearly every feature found in other rating systems. So far it has worked out well, especially for small teams and 1v1.

Addendum: Comparing Rating Systems
Correct-prediction percentages have been mentioned multiple times, and I'd like to add that there is a more effective way to compare rating systems. Prediction percentages suffer from a high amount of noise and treat every prediction as binary correct/incorrect. They are also highly susceptible to the dataset: if most games are very balanced, it becomes hard to predict the winner and all systems will score near 50%, while the opposite is true for highly imbalanced games. What I've found to correlate well with prediction percentages while having lower noise, thus being more reliable in smaller testing batches, is to use a logarithmic scoring.

For every game and every team in those games, each rating system makes a win chance prediction p. It then gets a cumulative score calculated as follows:

Code: Select all

Team won:  (1+log2(p)) 
Team lost: (1+log2(1-p))
A score of 0 signifies random choice, scores below are worse and above are better than random. Giving a completely wrong prediction (0% win chance team wins) will set the score to negative infinity.
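A minimal implementation of this scoring, averaged per prediction so that test sets of different sizes stay comparable (the averaging is an assumption on my part; the per-team formulas are the ones above):

```python
import math

def log_score(predictions):
    """predictions: iterable of (p, won) pairs - the predicted win chance p
    for a team and whether that team actually won. Returns the mean score:
    0 = no better than coin flips, 1 = perfectly confident and correct,
    -inf if a team ever wins after being given exactly a 0% chance."""
    scores = [1.0 + math.log2(p if won else 1.0 - p) for p, won in predictions]
    return sum(scores) / len(scores)
```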

Silentwings
Moderator
Posts: 3582
Joined: 25 Oct 2008, 00:23

Re: TrueSkill 2 vs WHR

Post by Silentwings » 01 Feb 2019, 12:30

Prediction percentages suffer from a high amount of noise and treat every prediction percentage as binary correct/incorrect.
This isn't (always) correct - for example the TS2 paper tests their model by comparing "expected win rates" to actual win rates. Since the model makes non-deterministic predictions of winners, the expected win rate is the average across the test data of predicted winning probabilities per game, without binary rounding.

Code: Select all

Team won:  (1+log2(p)) 
Team lost: (1+log2(1-p))
I didn't look hard but this results in a number on a scale that isn't intuitive (at least, not to me) - very close proximity to 0/infty obviously means something but what does the actual value signify? It looks like it could be heavily dominated by a single game where the prediction was badly wrong, even if predictions were good in all other games. Is your choice of logs to do with entropy, perhaps? It also seems a bit odd not to normalize for the number of games in the test data.
big team games(8v8-16v16) ... from my statistics, neither Elo nor WHR can make any good predictions of the outcome of these games
The original TS paper does have data showing a big improvement in predictive accuracy over Elo-style systems in cases where the teams have similar skill.

bibim
Lobby Developer
Posts: 901
Joined: 06 Dec 2007, 11:12

Re: TrueSkill 2 vs WHR

Post by bibim » 01 Feb 2019, 16:53

MaDDoX wrote:
28 Jan 2019, 21:18
Hey guys, I'm considering the development of a new plugin for SPADS replacing the (now obsolete) TrueSkill by the new TrueSkill 2 calculation. Given it's an official MS thing, I just thought that maybe it should be an update (or option?) to SPADS itself, instead of a plugin. What do you think?
I think it would be much easier for you to use the plugin interface, which has been designed just for that, instead of hacking SPADS internal code. All the callbacks and API functions required to implement alternative ranking/balancing methods should already be available in the plugin interface (and if you need new ones I can implement them on request).
If it makes sense to include a plugin code directly in SPADS core later on, it's very easy to do afterward anyway.
MaDDoX wrote:
28 Jan 2019, 21:18
What do you guys think? Bibim, advice?
I'm not really convinced by the benefit of using 0 instead of 25 as the starting rank. Anyway, this shouldn't be seen as a reason to change the algorithm. With the current system we can already use 0 as the starting rank if we want; it's just one line of code to change...

Also, I agree with vbs and DeinFreund concerning the drawbacks of using some arbitrary rules to evaluate players' effectiveness (I guess such measurements might be useful to improve smurfs detection though...).

Tbh I'm not sure implementing a new ranking algorithm will improve the situation notably, because I think the main reason we have some unbalanced games is players trying to defeat the ranking system on purpose (using smurf accounts through VPNs, open wifi networks, mobile networks etc., and also swapping vet accounts with friends, for example).

However, it can't be a bad thing to add support for new skill estimation algorithms in SPADS, so feel free to experiment with that, and don't hesitate to ask if you need help :)

Just keep in mind that in the current architecture the skill estimation algorithm isn't implemented in SPADS but centralized in SLDB. SPADS just provides the match results to SLDB, and uses the skill estimation values returned by SLDB to balance teams. Also, ranking computations need to be reasonably fast so that it's possible to take into account smurf detections and manual account splits/joins retroactively, which can lead to recomputing all matches since 2012 on the fly.

DeinFreund
Posts: 12
Joined: 14 Aug 2014, 00:12

Re: TrueSkill 2 vs WHR

Post by DeinFreund » 01 Feb 2019, 22:21

Silentwings wrote:
01 Feb 2019, 12:30
This isn't (always) correct - for example the TS2 paper tests their model by comparing "expected win rates" to actual win rates. Since the model makes non-deterministic predictions of winners, the expected win rate is the average across the test data of predicted winning probabilities per game without binary rounding. (They don't give a formula to make their choice absolutely clear, unfortunately.)
That's used to adapt their parameters. Knowing that a specific action happened (for example a player quitting early), they look at how it correlates with win rate and then add a factor to Trueskill so it gives the same result. This isn't helpful for comparing the overall performance of the rating system to other systems (unless they also handle these specific events and you want to compare just that).
Silentwings wrote:
01 Feb 2019, 12:30

Code: Select all

Team won:  (1+log2(p)) 
Team lost: (1+log2(1-p))
I didn't look hard but this results in a number on a scale that isn't intuitive (at least, not to me) - very close proximity to 0/infty obviously means something but what does the actual value signify? It looks like it could be heavily dominated by a single game where the prediction was badly wrong, even if predictions were good in all other games. Is your choice of logs to do with entropy, perhaps? It also seems a bit odd not to normalize for the number of games in the test data.
I do normalize for the number of games and for the number of teams within each game. When directly comparing two rating systems on the same dataset you don't have to do that if you simply wish to rank them, though.

Maximizing the logarithmic score minimizes the information gained from every match. This means there are as few upsets as possible. I'm not too sure if this is the perfect scoring rule to use, but for nearly all my tests it identified the better rating system with less data than was required to do so decisively by counting correct/incorrect predictions. Still, I'm always looking at both metrics, as that helps to identify unreliable results.

See: https://en.wikipedia.org/wiki/Scoring_r ... oring_rule
Silentwings wrote:
01 Feb 2019, 12:30
The original TS paper has does have data with a big improvement in predictive accuracy over Elo style systems in cases where the teams have similar skill.
I've looked at the original paper and all I could find was this:
https://i.imgur.com/YRamnxk.png
The percentages are incorrect predictions, where "full" indicates the full dataset and "challenged" indicates only using the 20% most closely matched games, as determined by the other rating system.

For large teams, Elo still manages to predict over 60% of the matches correctly, meaning that these were really imbalanced matches. For example in ZK, Elo usually gets at most 53% of matches right, with WHR upping that to some 54-55%. Still, even with this "easier" dataset, Trueskill is only 1-2% better. Their challenge datasets show a bigger difference, but just like the logarithmic scoring, this is no longer a direct representation of the match prediction accuracy.

DeinFreund
Posts: 12
Joined: 14 Aug 2014, 00:12

Re: TrueSkill 2 vs WHR

Post by DeinFreund » 01 Feb 2019, 22:43

bibim wrote:
01 Feb 2019, 16:53
Also, ranking computations need to be reasonably fast so that it's possible to take into account smurfs detections and manual accounts split/join retro-actively, which can lead to recomputing all matches since 2012 on the fly.
While WHR can't detect smurfs, it can be fun to confirm your predictions with the plots.
https://i.imgur.com/thKDPQN.png
(Direct link because this website's CSP disallows embedding images from other websites)

You can clearly see who's a new player learning a game and who's that same player making a new account. ZK WHR in the way it's currently implemented also allows live relinking of battles to different users. This is because the entire dataset is kept in memory, so it can be manipulated easily.

You can also try the history plotter.

Silentwings
Moderator
Posts: 3582
Joined: 25 Oct 2008, 00:23

Re: TrueSkill 2 vs WHR

Post by Silentwings » 02 Feb 2019, 07:31

the TS2 paper tests their model by comparing "expected win rates" to actual win rates

That's used to adapt their parameters.
It's used in every experiment they do - those comparing the effectiveness of TS1 to TS2, and those comparing TS2 with some feature on / off.
(1+log2(p)) , (1+log2(1-p))
Maximizing the logarithmic scoring minimizes information gained from every match.
This does mean that, when the score is low, it may be due to just a small number (perhaps even just one) of particularly poor predictions, even if the vast majority of other predictions were accurate.

I can now see that your score is an estimator of information content times -1 (which = expectation of entropy), but rescaled to set 0 as the benchmark for "fully" random predictions, thanks.
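Spelled out (a short derivation from the thread's own scoring formula, using only standard information theory): if the model assigns probability $q(x)$ to outcome $x$ and the true outcome distribution is $p$, the expected per-game score is

```latex
\mathbb{E}[\text{score}] \;=\; 1 + \sum_{x} p(x)\,\log_2 q(x) \;=\; 1 - H(p, q)
```

where $H(p,q)$ is the cross-entropy in bits; a uniform predictor over two outcomes has $H = 1$, which recovers the score-0 benchmark for fully random predictions.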
Their challenge datasets show a bigger difference, but just like the logarithmic scoring this is no longer a direct representation of the match prediction accuracy.
Agreed; in this case it's also not clear how to interpret the size of the number!

MaDDoX
Posts: 53
Joined: 08 Jan 2006, 17:45

Re: TrueSkill 2 vs WHR

Post by MaDDoX » 02 Feb 2019, 16:27

Silentwings wrote:As long as something reasonably sane was chosen, I doubt it's worth agonizing over precisely what prior is best. What matters is if the proxy metrics for player skill can actually reflect the long term reality of who wins & loses more often.
I didn't mean the former, rather the latter :) What I meant is that it's all about the system - and its tuning parameters - being dynamic and reflecting the current "skill status" of the game strategies and player base. Those do evolve over time and will require re-tuning; spending too much time trying to make it perfect (?) from the get-go is a waste.
You should train + test it on at least a few hundred games and be able to show a better predictive accuracy than the existing system.
That is something Microsoft supposedly already did (from what they claim in their article), with a very large predictive accuracy gain.
very_bad_soldier wrote:The guy building that one nuke in the right moment might be game deciding.
Indeed, and I also believe putting heavy weight on victories is critical. Not rewarding good players in a poor team is too shallow an evaluation though, and can only lead to disconnects and other nonsense from good players when they realize they're not going to win because of their team's bad performance. The fact that TS2 punishes disconnects is already a great improvement in that respect.
Also, performance evaluation criteria will always be arbitrary, just as much as the mod/game rules themselves. That's the very root of game design and balance: re-engineering and guiding human behavior as a simulated experience unfolds. If you obsess over measuring minimalistic "success" criteria, you would indeed be "dictating" better/worse behaviors while playing. But if you focus on measuring the achievement of primary and secondary goals (e.g. (P) killing commanders, (S) destroying units, etc.), as defined in the game rules / modoptions themselves, you're only enforcing the game rules in a clear and sensible way.
On the downside, this means that past ratings are fixed and can't be updated with future information.
True. I believe TS2 would be ideal for new mods/games, or after a "reset" in existing mods, which may or may not be desired by each team.
DeinFreund wrote:One decision you have to make is how to weight the games. In WHR this worked out best by simply weighting games and individual skill with 1/#teammates. Either giving high uncertainty (new) or low uncertainty (veteran) players increased weight made the predictions worse. Both systems also support a penalty which is used to compensate for good or bad teammates. This makes sure rating change depends on both teams and not just the enemies.
Excellent and thorough analysis DeinFreund, also very useful feedback from actual experience with uncertainty bias in a running game. Thank you very much!
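To make DeinFreund's 1/#teammates weighting concrete, here is a toy sketch of the idea. This is deliberately an Elo-style update rather than WHR or TS2 (which are far more involved), and the function names and K-factor are illustrative, not from any of the systems discussed; it only shows how the per-game rating delta gets split across teammates instead of being applied in full to each player:

```python
def update_ratings(ratings, winners, losers, k=32.0):
    """Toy Elo-style team update illustrating 1/#teammates weighting:
    the whole-game delta is computed from average team ratings, then
    each player absorbs delta / team_size rather than the full delta."""
    avg = lambda team: sum(ratings[p] for p in team) / len(team)
    # Expected win probability for the winning team (logistic curve).
    expected = 1.0 / (1.0 + 10 ** ((avg(losers) - avg(winners)) / 400.0))
    delta = k * (1.0 - expected)
    for p in winners:
        ratings[p] += delta / len(winners)
    for p in losers:
        ratings[p] -= delta / len(losers)
    return ratings

# Evenly matched 2v2: each winner gains 8 points, each loser drops 8,
# instead of the full 16-point single-player swing.
ratings = {"a": 1500.0, "b": 1500.0, "c": 1500.0, "d": 1500.0}
update_ratings(ratings, ["a", "b"], ["c", "d"])
```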
bibim wrote:I think it would be much easier for you to use the plugin interface, which has been designed just for that, instead of hacking SPADS internal code. All the callbacks and API functions required to implement alternative ranking/balancing methods should already be available in the plugin interface (and if you need new ones I can implement them on request).
Thanks bibim. I never intended to replace SPADS' innards directly; I was just wondering whether it would be an obvious enough update for you to include in SPADS itself. In any case, the possibility of moving it from a plugin to the core later if needed/wanted is a good idea; I'll probably follow that path.
bibim wrote:in current architecture the skill estimation algorithm isn't implemented in SPADS but centralized in SLDB. SPADS just provides the match results to SLDB, and uses the skill estimation values returned by SLDB to balance teams.
Hmm... that got me a bit confused, since you mentioned "it can't be a bad thing to add support for new skill estimations algorithms in SPADS". Now, if I got that right, a SPADS plugin wouldn't be enough; a patch to SLDB to optionally (set per mod) use a TrueSkill2 Python module would be required? :?:
0 x

User avatar
Silentwings
Moderator
Posts: 3582
Joined: 25 Oct 2008, 00:23

Re: TrueSkill 2 vs WHR

Post by Silentwings » 02 Feb 2019, 17:55

You should train + test it on at least a few hundred games and be able to show a better predictive accuracy than the existing system.
That is something Microsoft supposedly (from what they claim in their article) already did, with a very large predictive accuracy gain.
The point is that the paper shows this for some easy cases, but whether it works depends heavily on having good metrics, and those depend on the game. So the comparison needs to be done on Spring games before we can have any certainty that TS2 is better (and not worse) than TS1 for us.
0 x

DeinFreund
Posts: 12
Joined: 14 Aug 2014, 00:12

Re: TrueSkill 2 vs WHR

Post by DeinFreund » 02 Feb 2019, 21:42

You should train + test it on at least a few hundred games and be able to show a better predictive accuracy than the existing system.
Also make sure to check the significance level. With 1000 games there's still a std dev of ±1.6% around 50%, and differences between rating systems are often smaller than that. So just be aware that while you're talking about hundreds of games, it often takes tens of thousands to be able to reject the null hypothesis.
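The arithmetic behind those numbers is easy to check. Here is a small sketch (a one-sample binomial approximation; a proper two-sample power analysis would demand somewhat more games still):

```python
import math

def std_dev_pct(n, p=0.5):
    """Standard deviation of an observed accuracy over n games,
    treating each game as a Bernoulli trial with true accuracy p."""
    return math.sqrt(p * (1 - p) / n)

def games_for_gap(gap, z=1.96, p=0.5):
    """Rough number of games so that a true accuracy gap of `gap`
    sits about z standard errors away from chance."""
    return math.ceil((z / gap) ** 2 * p * (1 - p))

print(std_dev_pct(1000))   # ~0.0158, i.e. the +-1.6% quoted above
print(games_for_gap(0.01)) # ~9600 games to resolve a 1-point gap
```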
0 x

User avatar
bibim
Lobby Developer
Posts: 901
Joined: 06 Dec 2007, 11:12

Re: TrueSkill 2 vs WHR

Post by bibim » 04 Feb 2019, 09:20

MaDDoX wrote:
02 Feb 2019, 16:27
Hmm... that got me a bit confused, since you mentioned "it can't be a bad thing to add support for new skill estimations algorithms in SPADS". Now, if I got that right, a SPADS plugin wouldn't be enough; a patch to SLDB to optionally (set per mod) use a TrueSkill2 Python module would be required? :?:
Well, first a plugin in SPADS is required because you need to extract in-game data/stats to be able to implement TS2.
For that I would recommend using the onGameEnd callback which is called by SPADS core each time a game ends and provides a lot of in-game data/stats.
If you need more data, you can also implement a game-specific gadget which would send additional in-game data through LUA messages to the autohost during game. These data can then be received by SPADS plugins with a handler declared using the addSpringCommandHandler like this for example:

Code: Select all

addSpringCommandHandler({GAME_LUAMSG => \&hSpringGameLuaMsg});

[...]

sub hSpringGameLuaMsg {
  # your LUA message handling code here
}
Then the plugin would be in charge of sending all the game data required for the TS2 computation to a centralized database, which would store and process it to update the TS2 values. This can indeed be a SLDB instance patched to support TS2, but it could also be your own database with your own interface. Of course, you could also choose to implement the TS2 computation entirely in the SPADS plugin, which would remove the need for an interface with a centralized database, but then it would be much harder to share this TS2 ranking between autohosts... It can be useful as a proof of concept in the early stages of development though.

Finally, the plugin should also implement the updatePlayerSkill callback to provide SPADS with the TS2 values.
1 x

Post Reply

Return to “SPADS AutoHost”