I think I know what is causing the sync problems...

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

Hi!

I'm a professional programmer from Sweden, and I have been playing Spring for about two months. Great game!

I had a brief look at the source code today and I think I know why you have the sync problems.

The problem seems to be that floating point operations (multiplication, division...) sometimes yield slightly different results depending on the model/make of processor.

This is also why more than one person usually becomes unsynced at the same time. Suppose that four players are using CPU model X and six are using CPU model Y. Everyone with CPU model X will get unsynced.

Of course I may be altogether wrong, but if some member of the programming team is reading this, please reply and I will explain further.

Peet
Malcontent
Posts: 4381
Joined: 27 Feb 2006, 22:04

Post by Peet »

In other news...

AF
AI Developer
Posts: 20671
Joined: 14 Sep 2004, 11:32

Post by AF »

Actually we already know that, and Tobi did a lot of work on the floating point calculations in order to combat it.

We have gotten as far as gcc 4.1+<->mingw32 synced. However, there are still issues here and there. Sometimes it's as simple as uninitialized or improperly initialized values, or places where synced and unsynced code accidentally affect each other.

Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

It would be really interesting if you could support it statistically, i.e. by looking at a number of desyncs and figuring out that e.g. it's always AMD people that desync on an Intel host, or Intel people desyncing on an AMD host...

Floating point arithmetic ought to give exactly equal results under certain conditions tho (according to IEEE754 if I recall correctly), and Nicolas Brodu and I have basically fixed spring to provide these conditions. (or at least we thought so :-))

EDIT: btw the same symptoms would be visible if e.g. the dynamic sky option is bugged or something, and everyone with it on desyncs...

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

OK, it is good to hear that you are aware of the problem!

I can confirm that results differing between CPU makes/models is a hard fact, because I have come across the very same problem in other situations.

In my case (this was some years ago) a 1.6GHz P4 gave a different round-off (the least significant bit) compared to a 1.4GHz of the same model. I do not remember if it was multiplication or division, but I can say it happened VERY RARELY. I had calculations running for hours and suddenly this error popped up from nowhere, or so I thought.

A possible solution might be to use higher precision and/or always discard (or rather round-off) some of the least significant bits after performing a multiplication or division. Or even better, make a re-syncing feature to transfer all object states to those clients who are desynced.

Edit: By the way, this has nothing to do with what compiler is in use. It is because of tiny differences in the behaviour of the FPU. That IEEE shit is theory; this is real life :?
Last edited by zerver on 17 Dec 2006, 01:53, edited 1 time in total.

Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

I've thought about extra rounding / chopping off too, but I don't think that'll work because you basically just shift the problem around a bit. I.e. now it desyncs on e.g. 0.5 being 0.499 on some clients; with chopping it would desync on e.g. 0.5 being 0.49 (the last 9 chopped off). In the end you always have a hard, exact boundary between two numbers, because floating point numbers are represented using a finite number of bits and hence have limited accuracy.

Resync is still somewhere on some todo lists :-)

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

Correct, but if you round off more than one digit, I think it will work!

Suppose that we decide that the two least significant bits should always be 00.

If the bits are 11, increase the result one step (by the smallest unit possible).

If the bits are 01, decrease the result one step.

If the bits are 10, simply truncate it to 00.

I suggest moving to double precision, because losing 2 bits of precision on a float may be a disaster.

Edit: On second thought this won't work either, but there has to be some way of doing it...
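
To see why the scheme breaks down, here is a minimal sketch. It assumes positive, finite 32-bit IEEE 754 floats and works on their raw bit patterns; the helper names (`to_bits`, `snap_low_two_bits`, `gap_after_snapping`) are made up for illustration. Two results that differ by a single step can carry the low bits 10 and 11, and the rule sends them in opposite directions:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of a float (memcpy avoids undefined type punning).
std::uint32_t to_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical helper: force the two least significant bits to 00 using the
// rule from the post. Only meant for positive, finite floats.
std::uint32_t snap_low_two_bits(std::uint32_t bits) {
    switch (bits & 3u) {
        case 1u: return bits - 1u;  // 01: decrease one step
        case 2u: return bits - 2u;  // 10: truncate to 00
        case 3u: return bits + 1u;  // 11: increase one step
        default: return bits;       // 00: nothing to do
    }
}

// Gap, in float steps, between two snapped results that started one step apart.
std::uint32_t gap_after_snapping() {
    const std::uint32_t a = to_bits(1.0f) + 2u;  // low bits 10, as one CPU might compute
    const std::uint32_t b = a + 1u;              // low bits 11, one step higher elsewhere
    assert(b - a == 1u);
    return snap_low_two_bits(b) - snap_low_two_bits(a);  // 4, not 0
}
```

The one-step disagreement grows to four steps instead of disappearing, which is the hard boundary Tobi describes.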

Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

Double precision will probably be very expensive performance-wise. Not because of CPU speed, but because a lot of the gamestate in spring is stored in floats now. Changing to double would increase gamestate memory usage and - more importantly - cache misses by a factor of two.
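
A rough illustration of the memory side of that argument. `UnitState` is a hypothetical stand-in, not Spring's actual layout, and the sketch only shows the size doubling; it says nothing measured about cache behaviour:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a piece of synced game state.
template <typename Real>
struct UnitState {
    Real pos[3];
    Real speed[3];
    Real health;
    Real buildProgress;
};

// On mainstream platforms a double is twice the size of a float,
// so the same state takes twice the bytes.
static_assert(sizeof(UnitState<double>) == 2 * sizeof(UnitState<float>),
              "expected double state to be twice the size of float state");

// Bytes needed to hold unit_count units at the given precision.
template <typename Real>
std::size_t state_bytes(std::size_t unit_count) {
    assert(unit_count > 0);
    return unit_count * sizeof(UnitState<Real>);
}
```

With the same cache size, half as many units fit in cache at once, which is where the extra misses would come from.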

hunterw
Posts: 1838
Joined: 14 May 2006, 12:22

Post by hunterw »

i've an idea.

If the desyncs are impossible to fix, how about implementing this... giving the host the ability to pause the game and force everyone to download the host's version of the game. Sure, it's not optimal, but desyncs will turn from a game-ruining bug into just an annoyance.

Acidd_UK
Posts: 963
Joined: 23 Apr 2006, 02:15

Post by Acidd_UK »

Tobi wrote: Resync is still somewhere on some todo lists :-)
@hunterw - that's what Tobi is talking about.

imbaczek
Posts: 3629
Joined: 22 Aug 2006, 16:19

Post by imbaczek »

I think that truncating bits that can carry errors should work; every compiler/fpu combination should be giving 5 digits of precision correctly if it has 2 digits of error buffer.

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

Tobi wrote: Double precision will probably be very expensive performance-wise. Not because of CPU speed, but because a lot of the gamestate in spring is stored in floats now. Changing to double would increase gamestate memory usage and - more importantly - cache misses by a factor of two.
It seems difficult to come up with a correction scheme to fix these round-off errors on the fly.

Switching to double may actually help because the FPU arithmetic for double precision may be more consistent among different processors. I think the unwanted conversions between float and double will have more of a negative impact on speed than anything else.

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

imbaczek wrote: I think that truncating bits that can carry errors should work; every compiler/fpu combination should be giving 5 digits of precision correctly if it has 2 digits of error buffer.
The problem is to know when to do the truncation and when not to do it.

If you do it all the time, you will run into problems when one CPU type yields a result that will be rounded off upwards and the other CPU type yields a result that is just one notch lower and thus will be rounded downwards.

What we need is a round-off scheme that guarantees two results differing in value by just one notch are rounded off to the same result. Such a round-off method may not exist...

Kloot
Spring Developer
Posts: 1865
Joined: 08 Oct 2006, 16:58

Post by Kloot »

Not unless you switch to fixed-point math.

imbaczek
Posts: 3629
Joined: 22 Aug 2006, 16:19

Post by imbaczek »

zerver wrote: What we need is a round-off scheme that guarantees two results differing in value by just one notch are rounded off to the same result. Such a round-off method may not exist...
True. There are FPU flags on x86 that specify the round-off method (up/down/nearest), but I don't know whether they've been tried or whether they would help with anything...
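
For reference, a small sketch of what switching the rounding mode looks like. It uses the standard `<cfenv>` interface rather than touching the x86 control word directly, and the function name `divide_with_rounding` is made up for illustration. Compilers differ in how strictly they honour runtime rounding-mode changes, so the operands are `volatile` to keep the division from being folded at compile time. Whether any of this helps with the desyncs is not established here.

```cpp
#include <cassert>
#include <cfenv>

// Divide a by b under the given rounding mode (FE_TONEAREST, FE_UPWARD,
// FE_DOWNWARD or FE_TOWARDZERO), then restore the previous mode.
float divide_with_rounding(float a, float b, int mode) {
    const int saved = std::fegetround();
    const int rc = std::fesetround(mode);
    assert(rc == 0);  // the requested mode must be supported
    (void)rc;

    // volatile forces the division to happen at run time, under `mode`.
    volatile float x = a;
    volatile float y = b;
    volatile float q = x / y;

    std::fesetround(saved);
    return q;
}
```

For a quotient that is not exactly representable, such as 1/3, the upward and downward results differ by one step, and round-to-nearest picks one of the two.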

AF
AI Developer
Posts: 20671
Joined: 14 Sep 2004, 11:32

Post by AF »

I thought the problem with resync was that as soon as it completed the original cause of the desync would pop up again and force another resync over and over again leading to continuous pausing?

btw:
http://www.parashift.com/c++-faq-lite/n ... #faq-29.18

Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

@ zerver: There are no double<->float conversions in spring; it's entirely float, with a few exceptions for the glClipPlane function, of which only a variant taking doubles exists. Actually, removing all doubles was one of the things that made all versions of gcc >= 4 sync with each other (AFAIK float<->double conversions aren't defined in IEEE754, so they may even differ on different processor models).

@ Kloot: Implementing fixed point math takes an extreme amount of time, because for every variable one has to determine the range in order to choose the appropriate position for the fixed point. Or one could use something like 128-bit fixed point, but that would make spring much slower (cache misses and multiple instructions to perform one operation).

@ imbaczek: Yeah, I know they exist. Haven't tested them tho, and looking at the nature of the current sync errors I doubt I'll ever be able to reproduce any difference between the rounding modes (and I'm not going to change things like the rounding mode without proper testing & reasoning).

@ AF: That's true if the sync error is caused by a certain type of bug in spring, e.g. an uninitialized memory read in synced code. If, however, it is caused, as zerver suggests, by an operation that returns a different result once in a few billion times, then resync would work fine. The same applies if network packets get corrupted for some reason or if unsynced code occasionally overwrites synced memory...

The link AF posted is a reasonably good explanation of the reasoning behind changing every double in spring to float, and changing to our own libm.
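
To make the fixed-point trade-off mentioned above concrete, here is a minimal sketch of a 16.16 format. The type and function names (`Fixed16`, `from_int`, `fx_add`, `fx_mul`, `fx_div`) are made up for illustration and nothing like this is claimed to exist in spring. All operations are plain integer arithmetic, so every CPU produces the same bits, but the position of the point is baked in: this choice only covers roughly -32768 to 32767 with a resolution of 1/65536.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 16.16 fixed-point value: 16 integer bits, 16 fractional bits.
struct Fixed16 {
    std::int32_t raw;
};

const std::int32_t kOne = 65536;  // 1.0 in 16.16 representation

// Only valid for values that fit in the 16 integer bits.
inline Fixed16 from_int(std::int32_t v) { return Fixed16{v * kOne}; }

inline Fixed16 fx_add(Fixed16 a, Fixed16 b) { return Fixed16{a.raw + b.raw}; }

// Multiply in 64 bits so the intermediate product cannot overflow,
// then scale back down. Integer division truncates toward zero.
inline Fixed16 fx_mul(Fixed16 a, Fixed16 b) {
    const std::int64_t wide = static_cast<std::int64_t>(a.raw) * b.raw;
    return Fixed16{static_cast<std::int32_t>(wide / kOne)};
}

// Scale the numerator up first so the quotient keeps its fractional bits.
inline Fixed16 fx_div(Fixed16 a, Fixed16 b) {
    assert(b.raw != 0);
    const std::int64_t wide = static_cast<std::int64_t>(a.raw) * kOne;
    return Fixed16{static_cast<std::int32_t>(wide / b.raw)};
}
```

A map coordinate and a tiny per-frame resource delta would want different point positions, which is the per-variable range analysis Tobi refers to.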

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

Well, once the resyncing is implemented, it will be easy to make each client print the corrected values.

If we see things like

CLIENT 3: Object 142 -> MidPos -> x (1.2224e3) corrected to (1.2225e3)
CLIENT 5: Object 142 -> MidPos -> x (1.2224e3) corrected to (1.2225e3)

then we know for sure it is a roundoff issue.

I have not studied the source in detail, but it is important that (at least during debugging) the checksumming is performed every simulation step so that the first incorrect value is caught. If checksumming is only done a couple of times per second, the unsynced data may spread like an avalanche and too many errors will be reported.
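
A sketch of the per-step idea, under stated assumptions: the synced state is modelled as a flat `std::vector<float>`, the hash is FNV-1a picked only because it is short, and the function names are made up. None of this is spring's actual checksum code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// FNV-1a over the raw bytes of each float, so a difference in the very
// last bit of any value changes the checksum.
std::uint32_t checksum_state(const std::vector<float>& state) {
    std::uint32_t hash = 2166136261u;  // FNV offset basis
    for (float value : state) {
        unsigned char bytes[sizeof(float)];
        std::memcpy(bytes, &value, sizeof bytes);
        for (unsigned char byte : bytes) {
            hash ^= byte;
            hash *= 16777619u;  // FNV prime
        }
    }
    return hash;
}

// Given one checksum per simulation step from the host and from a client,
// return the first step where they disagree, or -1 if they never do.
long first_desynced_step(const std::vector<std::uint32_t>& host,
                         const std::vector<std::uint32_t>& client) {
    assert(host.size() == client.size());  // both logs cover the same steps
    for (std::size_t step = 0; step < host.size(); ++step) {
        if (host[step] != client[step]) return static_cast<long>(step);
    }
    return -1;
}
```

Recording one checksum per step on each client and comparing the logs pins the divergence to a single step, before the error has had time to spread to other objects.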

Tobi, I don't have that much time available, but if you want some help with the resyncing code I may be able to assist.

zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

A possible workaround is to compute in double precision and fall back to a software floating point library whenever there is a risk of roundoff errors.

Unfortunately this is still about 10 times slower than using pure float, so implementing resyncing is probably a better solution.

Code:

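#include <cstdint>
#include <cstring>
#include "softfloat.h"

// Reinterpret the bits with memcpy; casting between unrelated pointer
// types is undefined behaviour.
inline float tofloat(float32 f) {
  float r;
  std::memcpy(&r, &f, sizeof r);
  return r;
}

inline float32 tofloat32(float f) {
  float32 r;
  std::memcpy(&r, &f, sizeof r);
  return r;
}

// True if the bits that a double->float conversion throws away sit right
// next to the rounding boundary (1000... or 0111...), where the final
// rounding could go either way.
inline bool risky(double d) {
  std::uint64_t bits;
  std::memcpy(&bits, &d, sizeof bits);
  // The conversion drops the low 29 mantissa bits; test the top four of them.
  std::uint64_t m = bits & 0x1E000000u;
  return m == 0x10000000u || m == 0x0E000000u;
}

inline float float_div(float f1, float f2) {
  double f3 = (double)f1 / (double)f2;
  // Fall back to the software implementation only for the risky cases.
  if (risky(f3))
    return tofloat(float32_div(tofloat32(f1), tofloat32(f2)));
  return (float)f3;
}

inline float float_mul(float f1, float f2) {
  double f3 = (double)f1 * (double)f2;
  if (risky(f3))
    return tofloat(float32_mul(tofloat32(f1), tofloat32(f2)));
  return (float)f3;
}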

mongus
Posts: 1463
Joined: 15 Apr 2005, 18:52

Post by mongus »

Didn't read or understand it all well, but feel like posting.

When in a game, you don't care much about your opponent gaining 0.01 metal or energy on you.

Despite not being perfect, it can be overlooked and the game can continue without caring much about these things. Unless it happens frequently...

I'm sure this is pointless, because the "checksum" is what desyncs the games, right? Well then, how about dividing the problem and using partial checksums for different parts of the "state"? Is the game able to tell from this "sync data" whether the delta is minimal, or whether it's related to a specific resource or a .give unit cheat?

If so... again, small differences are less important than losing a whole game. YMMV.
