
Predictiveness of the new ELO Split in 1v1-4v4

21 posts, 1227 views
Page of 2 (21 records)
We have the tools, so there's no point guessing about whether it's better to have different ELO values for 1v1/2v2-4v4 or to merge them into a single value. I have evaluated multiple possible systems using this scoring.

ZK ELO


Elo as it is currently implemented in Zero-K, except that it leaves out the newbie malus (which matters little in the long run). The "Old" prefix refers to using K=32 instead of K=64.
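To illustrate what K changes, here is a minimal sketch of the textbook two-sided Elo update. It is not Zero-K's actual implementation (which also has to handle teams and the newbie malus); the function names `expected_score` and `elo_update` are made up for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Textbook Elo win probability of side A against side B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 64.0):
    """Return the new (rating_a, rating_b) after one game.

    k is the K-factor: K=64 moves ratings twice as far per game as K=32.
    """
    actual = 1.0 if a_won else 0.0
    delta = k * (actual - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

With two equally rated sides, a win moves the winner up by K/2, so a larger K reacts faster to recent games at the cost of more noise.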

Split ELO


Using two different Elo values, one for games with more than S players and one for the others. Otherwise the same as ZK ELO.
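A minimal sketch of the selection rule described above, assuming `players` is the total number of players in the game across all teams. The function name and arguments are hypothetical, not taken from the evaluation code.

```python
def pick_split_rating(small_rating: float, big_rating: float,
                      players: int, s: int = 4) -> float:
    """Use big_rating for games with more than s players, small_rating otherwise."""
    return big_rating if players > s else small_rating
```

Only the selected rating is used for the prediction and updated afterwards, so each of the two ratings sees only part of a player's games.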

Interpolated ELO


Using two different Elo values and weighting them with S/Players (Max. weight = 1). Suggested by DErankBrackman.
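A rough sketch of how such a weighting could look. It assumes the weight S/Players (capped at 1) goes to the small-game rating and the remainder to the big-game rating, with `players` being the total player count; that assignment and all names here are assumptions for illustration.

```python
def interpolated_rating(small_rating: float, big_rating: float,
                        players: int, s: float = 5.0) -> float:
    """Blend two ratings with weight min(1, s / players) on the small-game one."""
    w = min(1.0, s / players)  # capped at 1 so the other weight never goes negative
    return w * small_rating + (1.0 - w) * big_rating
```

With S=5, a 1v1 (2 players) uses only the small-game rating, while a 4v4 (8 players) uses a 0.625 / 0.375 blend.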

Linear Mixed ELO


Using N different Elo values, each one for a specific player count, linearly interpolating them in between.
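A small sketch of linear interpolation between per-player-count ratings. It assumes the ratings are stored in a dict keyed by total player count (e.g. anchors at 2 and 8 players) and that counts outside the anchor range are clamped to the nearest anchor; both choices are assumptions, not confirmed details of the evaluated system.

```python
def linear_mixed_rating(anchors: dict, players: int) -> float:
    """Linearly interpolate a rating between the two nearest anchor player counts."""
    counts = sorted(anchors)
    if players <= counts[0]:
        return float(anchors[counts[0]])
    if players >= counts[-1]:
        return float(anchors[counts[-1]])
    for lo, hi in zip(counts, counts[1:]):
        if lo <= players <= hi:
            t = (players - lo) / (hi - lo)  # 0 at the lower anchor, 1 at the upper
            return (1.0 - t) * anchors[lo] + t * anchors[hi]
    raise ValueError("unreachable for a non-empty anchor dict")
```

For anchors at 2 and 8 players, a 2v2 (4 players) sits one third of the way from the 2-player rating to the 8-player rating.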

Always 0.5


Dummy rating system that will always return equal chances. This would lead to random balance.

Results


These systems were evaluated over all even 1v1-4v4 battles not played on a funny map or against bots.

Scores:
Interp ELO (S=5, K=64): 0.1789421646708225
ZK Elo: 0.17810675231268006
Lin Mix ELO (2, 8): 0.17601081142739458
Split ELO (S=4, K=64): 0.1711465984015903
Old ZK Elo: 0.16988054006151806
Always 0.5: 0.0


As you can see, splitting the games is actually a bad idea: it reduces the data available to calculate each rating from. The single ELO value is beaten only by the interpolated one, and only by a small margin.

I have also tried different constants for the systems, but I list only the best result for each system.
+8 / -0
Great work!
It looks like the ZK Elo is preferable, since it's effectively equal in accuracy to the Interpolated Elo, and it's easier to explain.
+1 / -0

8 years ago
Not to say we can't draw conclusions from this, but we should keep in mind that the elo system scores likely also depend on playerbase behavior, which might change with more/new people.

But I guess for now we can stick with a single elo value.
+0 / -0
i have no idea how this scoring system works but common sense tells me that just mashing all values together (in one way or another) cant actually be better than tracking them individually.

what does "too small dataset" mean in the context of split elo? from my experience a relatively small number of games is enough to give a reasonable estimate of someones elo.

a somewhat active player should have a reasonably large number of games of one or more type to give an accurate estimate for these types. in case there are gross imbalances in the number of games some method of interpolation would probably be useful, otherwise not so much.

most peoples elo values hover 50-100, max 200 points around a fixed value, but whether you have 1000 or 2000 games the changes per game will be the same. if you use one value for all types you still have the hovering, but probably more swingy. can you explain in regular people language how this is not a thing?
+0 / -0

8 years ago
quote:
DErankKlon
what does "too small dataset" mean in the context of split elo? from my experience a relatively small number of games is enough to give a reasonable estimate of someones elo. i cannot imagine that on a realistic number of games for a somewhat active player the split system would be inaccurate.


Skill is not constant, so there's never enough data about somebody's "current skill". Splitting up the games means less recent games for each of the ELOs to make their estimates with.

I made this thread to be independent from beliefs and imagination.
+2 / -0
for most active players skill is not moving a lot outside of fluctuation (which is pretty much noise). it is also fluctuating independently for different game types. with just one rating the value will more quickly adapt to fluctuation but it will also be always off regardless of that fluctuation, right?


anyhow. i dont understand the theory behind it but from a practical perspective it just seems wrong. if i happen to play a number of 1v1s and lose them all (maybe because im bad at it), and my rating drops a lot because i gained a lot of elo in teams before (maybe im good at it), does my rating then properly reflect my skill because it was formed by the maximum amount of available data?
+0 / -0
quote:
for most active players skill is not moving a lot outside of fluctuation

I call bullshit. We're talking about a multi-year time frame with many new players joining and climbing the ranks. Maybe your idea of a (currently) "active player" is a pro that is just poking around in big teamgames, but even in the 1v1 scene there is substantial movement whenever players decide to actively work on their game. I distinctly remember EErank[ISP]Lauri experimenting and improving a lot, and there's a whole bunch of other players both at the visible top of the ladder as well as all through the ranks whose skill is constantly changing.

quote:
if i happen to play a number of 1v1s and lose them all (maybe because im bad at it), and my rating drops a lot because i gained a lot of elo in teams before (maybe im good at it), does my rating then properly reflect my skill because it was formed by the maximum amount of available data?

That example was cited in the other thread, too, and while it's clearly disfavoring the combined elo, apparently this situation is a lot less common than people having played too few games in either mode to make the respective predictions accurate.

I'd actually go so far as saying that an influx of new players favors the 1 measure system a lot more, because it's twice as fast at establishing decent elo values, whereas split elo is good for long term players that have played a ton in both modes.

Maybe it would be possible that 1v1 and team elo stay identical (i.e. calculated based on the same game set) until you have played enough games in either mode that they can be assumed to be reasonably accurate when (from that point on) only being calculated with games of the respective type?
+0 / -0
I think you've misrepresented these results a bit, perhaps only by implication.

I can well believe that splitting 1v1 and teams elo entirely is a bad idea for the reason you've outlined, but interpolated elo performing better (if only by a little) indicates that there is definitely some value to considering different game types separately.

Interpolating over more points in the same way could improve its score further, I don't know if you tried that. (Unless that's what "linear mixed elo" means, but I would be very surprised to find that the best possible choice of points is that much worse than only choosing two.)

Further things to consider are that:
(1) if two players in a game have an elo which misrepresents their team skill by (say) 200, then depending on what team they're on there's about a 50/50 chance that the problems cancel out, resulting in no change to the outcome of the game... but that doesn't mean that those two players are going to have a good play experience
(2) it is unclear to me how well this result will generalise to matchmade games where skill differences in each game will (hopefully) be much lower in general
(3) How much is this result due to the large number of players who play mostly teams, then applying their team skills to the handful of 1v1 games they play, and their 1v1 rating not keeping up?
+0 / -0
quote:
quote:
for most active players skill is not moving a lot outside of fluctuation
I call bullshit. We're talking about a multi-year time frame


i think we are not. Freund suggested that we are talking about very small time frames of a few games, and the impact a split data set has here on the quality of predictions.

my argument was that in these small time frames, elo movements are largely due to natural fluctuation and not so much skill improvement.

quote:
That example was cited in the other thread, too, and while it's clearly disfavoring the combined elo, apparently this situation is a lot less common than people having played too few games in either mode to make the respective predictions accurate.


maybe this is because the "all games in history" data set contains just too much garbage (ie. newbies playing 5 games and then quitting, smurfs, etc). i suggested that after someone reaches a base level of experience, the example case is more likely to happen than lack of data. (a test case could be constructed with data only from players lvl > 20/50/100)

so i agree with your conclusion. similar to the newbie malus the different ratings would at first develop together and then eventually diverge. you could go further and weigh the number of games in either mode.

so for example if you never play 1v1s, your team rating would just carry over. if you play exactly the same amount in either type, there would be no carry-over at all, and interpolation would happen in between these extremes.
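one possible reading of that carry-over idea as a sketch, here for the 1v1 rating. the weighting by the ratio of game counts and all names are assumptions made up for illustration, nothing that was evaluated in this thread.

```python
def effective_1v1_rating(rating_1v1: float, rating_team: float,
                         games_1v1: int, games_team: int) -> float:
    """Blend the 1v1 rating with the team rating based on relative game counts.

    No 1v1 games: the team rating carries over fully.
    As many 1v1 games as team games (or more): no carry-over at all.
    """
    if games_team == 0:
        return rating_1v1  # nothing to carry over from
    own_weight = min(1.0, games_1v1 / games_team)
    return own_weight * rating_1v1 + (1.0 - own_weight) * rating_team
```

the same function with the arguments swapped would give the effective team rating.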

+0 / -0

8 years ago
quote:
newbies playing 5 games and then quitting

Do you think this will happen less or more with steam launch?
+1 / -0

8 years ago
DErankKlon
what you call "natural fluctuation" is exactly where ELO has its limits. Some people can easily play 300 ELO above/below "their skill" on a good/bad day. In those cases maximum convergence speed to the current skill is required. With skill represented by just one number this means this number has to adapt, as quickly as possible. 1v1 is the only game mode where ELO "really works" and makes good predictions, so using these games to adapt Team ELO is favorable.
+1 / -0

8 years ago
Why not select the easy solution, one ELO for all types of games?

There are good team players who are bad at 1v1 and vice versa, but most good team players will not play 1v1 or the opposite, so the impact of the combined ELO will be moderate.
I presume that players playing both types of games will in most cases have two ELO values that are not far apart. DeinFreund, do you have data about this?

So the one-ELO solution will be easy to calculate and easy to explain to new players, while giving a reasonable balance.
+0 / -0
quote:
In those cases maximum convergence speed to the current skill is required.


why? does it matter if my elo drops by 50 or 100 points during a few games played on a bad day, if the next day will be different anyway? (and probably more accurate if the change was small). the rating system cant know when i decide to call it a day.

also, wouldnt using 1 number mean it actually adapts slower to the "current skill" because for example my team elo keeps my 1v1 elo afloat even if i strike a series of 1v1 losses?

quote:
1v1 is the only game mode where ELO "really works" and makes good predictions, so using these games to adapt Team ELO is favorable.


you are probably right on the math part but the conclusion is not legit. experience from the past is that after winning several 1v1s (which used to have high impact, as opposed to clusterfuck games with elo changes in the lower one-digits) my elo would be higher than my ability to carry my team at that time, ie. my current skill at team games.

quote:
Do you think this will happen less or more with steam launch?


i dont see how this has anything to do with split or not split. we already agreed that for newbies split should not be used or not used as much.
+0 / -0

8 years ago
Thank you for this work CHrankAdminDeinFreund

Could you please explain what this number

0.17810675231268006 and similar

represent?
+0 / -0
See here for the formula. It represents predictiveness.
+0 / -0
8 years ago
It is a score for a system's prediction quality. For always predicting 50%, your score is 0. For always predicting 100% right, your score is 1. I call it "translog score" because I have transformed the logarithmic scoring rule so that this is true.
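A sketch of one transform that has exactly these two properties: the mean of 1 + log2(p), where p is the probability the system gave to the side that actually won. Whether this matches the linked formula term for term is an assumption; the function name is made up for this example.

```python
import math


def translog_score(winner_probabilities) -> float:
    """Mean of 1 + log2(p) over games, p = predicted chance of the actual winner."""
    probs = list(winner_probabilities)
    return sum(1.0 + math.log2(p) for p in probs) / len(probs)
```

Always predicting 0.5 gives 1 + log2(0.5) = 0 per game, and always giving the winner probability 1 gives 1 + log2(1) = 1. Confident wrong predictions are punished heavily (p = 0 is undefined here, as with the plain logarithmic scoring rule), so the score can become negative.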
+0 / -0
8 years ago
I assume S refers to the number of players in the whole game, not per team? And ZK Elo/Old ZK Elo means just a single elo value instead of Split with S=2? Maybe separation is more advantageous if bigger games than 4v4 are considered, too?
+0 / -0
S is divided by the sum of players on all teams. ZK Elo is just a single number with the default 1/sqrt(players) weighting decay. (I've also tried different decays for multi ELO systems like yours but that wasn't very successful)

If we extend this to 1v1-16v16 we can see Interpolated and Mixed ELO pulling ahead, but note that split Big Teams/Small Teams (casual/competitive) ELO is still worse than using a single ELO value!

Results:
[Spoiler]
+1 / -0
8 years ago
In my versions, the highest value for S was 2. I like that you modified it to improve predictions. But is it true that, by increasing S to a value > 2 in the interpolated system, your big team weightings for very small games become negative? That would be bad because then, players could throw 1v1s to increase their team elo. (It wouldn't work the other way round.) Of course, there would be a way to fix that mathematically.

[Spoiler]
+0 / -0

8 years ago
quote:
(Max. weight = 1)
+0 / -0