I think I know what is causing the sync problems...


LordMatt
Posts: 3393
Joined: 15 May 2005, 04:26

Post by LordMatt »

Small differences could be exploited by cheaters. Better to just implement a resync, as is being done already.
hunterw
Posts: 1838
Joined: 14 May 2006, 12:22

Post by hunterw »

the problem with resync is that it will open the door for cheating :?

i love that cheating is basically impossible in this game. at least with resync only the hosts will be able to cheat. then again, if the resync does lead to assholes writing cheats, we will all just use the autoservers, so no problem anyway 8)
Acidd_UK
Posts: 963
Joined: 23 Apr 2006, 02:15

Post by Acidd_UK »

The idea of checksumming partial states is a good one though, as it would reduce the amount of data that needed to be resynced...
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

In the current version that data is already checksummed in chunks, hence the MEXYZ flags (metal, energy, unit midPos.x, unit midPos.y, unit midPos.z).

Obviously resync will use some smart algorithm too; I won't just stream all data over the net without some sort of rsync-like compression...
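In outline, a chunked checksum scheme looks something like this (a hypothetical C++ sketch with made-up names, not Spring's actual code):

#include <cstdint>
#include <cstring>

// One checksum per state category, so a mismatch immediately tells you
// *what kind* of data diverged (metal, energy, unit positions, ...).
enum SyncChunk { CHUNK_METAL, CHUNK_ENERGY, CHUNK_MID_X, CHUNK_MID_Y, CHUNK_MID_Z, CHUNK_COUNT };

struct SyncChecksums {
    uint32_t sum[CHUNK_COUNT];

    SyncChecksums() { std::memset(sum, 0, sizeof sum); }

    // Fold the exact bit pattern of a float into one category's checksum.
    void Mix(SyncChunk chunk, float value) {
        uint32_t bits;
        std::memcpy(&bits, &value, sizeof bits);
        sum[chunk] = sum[chunk] * 33 + bits;  // simple order-dependent hash
    }
};

Clients would only exchange the small checksum array regularly; a full data transfer could then be limited to whichever chunk mismatched.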
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Post by Licho »

Well, I consistently desync between my Athlon XP and my notebook's Pentium M on the crossing4final map...
Every single game I try. The map is identical. So I believe that in this case it's really down to HW differences and not a bug in the code.
jcnossen
Former Engine Dev
Posts: 2440
Joined: 05 Jun 2005, 19:13

Post by jcnossen »

Weirdness... do you also desync on other maps?

I used to play Spring with an Athlon XP without problems, and I now play Spring with a Core Duo (= Pentium M) without problems...
llama
Posts: 5
Joined: 31 Oct 2006, 04:18

Post by llama »

hunterw wrote: i love that cheating is basically impossible in this game.
There's nothing to stop information cheats, like .spectator view. You'd know which radar dots to attack with bombers or LRPC. And stealth/cloak for base infiltration would be useless. Of course units couldn't fire at anything they wouldn't normally, at least not without attack-ground orders... It would be more trouble than it's worth (for most people), but you could probably write a layer that sits between the player and Spring, and issues orders that take advantage of what it can "see"...

Back on topic: I'd be surprised if you can run the exact same stream of instructions on different CPUs and get different results, but I can believe it. I suppose someone's tried to google for info about floating point differences, but if that doesn't turn up anything then someone could cook up a little test program that does floating point operations on a sequence of input values and checksums the results. Then run the program on various CPUs. I might be able to help with that, since I have access to a P2 and some even more ancient stuff, a P3 (Katmai), P4 (various), Pentium M, Core 2 Duo, K7 (original, tbird, palomino and palomino SMP, and tbred), K8 (Newcastle), and Opteron. All of these run Linux (or at least can boot a live CD), so I could run a test binary.
The first iteration of such a program could output a checksum for each of fmul, fadd, fdiv, sqrt, cos, sin, and whatever other FPU instructions gcc 4.1 emits when compiling Spring. The checksum would be computed e.g. over an array created by doing
2.71 * 1.000
2.71 * 1.001
2.71 * 1.002
...
2.71 * 2.000
(or cos(1.000) ...)
where the increment is one LSB of the mantissa, i.e. FLT_EPSILON for x = 1.0. Manipulating the input float as an integer would be useful to guarantee you're giving the same input on all CPUs. Maybe some denormals, NaNs, and other weird stuff should be thrown in for good measure, as a separate checksum item for each op.
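A sketch of what that test program could look like (C++; the hash, range, and ops here are just illustrative placeholders):

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Walk every representable float in [lo, hi): for positive floats the bit
// pattern, viewed as an integer, increases by exactly one per ULP.
static uint32_t checksum_op(float lo, float hi, float (*op)(float)) {
    uint32_t bits, end, sum = 0;
    std::memcpy(&bits, &lo, sizeof bits);
    std::memcpy(&end, &hi, sizeof end);
    for (; bits != end; ++bits) {
        float x, r;
        std::memcpy(&x, &bits, sizeof x);
        r = op(x);
        uint32_t rbits;
        std::memcpy(&rbits, &r, sizeof rbits);
        sum = sum * 33 + rbits;  // order-dependent hash of the result bits
    }
    return sum;
}

static float mul_e(float x) { return 2.71f * x; }    // the fmul case above
static float cos_w(float x) { return std::cos(x); }  // fcos via libm

int main() {
    std::printf("fmul %08x\n", (unsigned)checksum_op(1.0f, 2.0f, mul_e));
    std::printf("fcos %08x\n", (unsigned)checksum_op(1.0f, 2.0f, cos_w));
    return 0;
}

Sweeping 1.0 to 2.0 touches all 2^23 mantissa values for one exponent, so each line is a fraction of a second of work; two machines printing different hashes for the same line would finger that op.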

Then you'd compare the checksums for all the different processors (for a binary compiled with the same flags as Spring, or with different flags as an experiment), and see which instructions were always consistent, and which weren't. I don't think anyone is claiming that you might get different results for multiple runs on the same CPU, are they? I hope not.

I'd guess that sqrt, cos, and sin (and not algebraic ops like add or div) would be the guilty ones, because AFAIK they're implemented as an iterative process that converges to the desired precision, probably with something like Newton-Raphson. There's a lot of scope for different approaches to getting a result, depending on what fits best in the silicon. IIRC, SSE(2?) even has an approximate sqrt instruction for when normal sqrt isn't fast enough.
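For illustration, that kind of iteration might look like this in software (a sketch of the general technique only; real FPU microcode is unpublished and certainly smarter):

// Newton-Raphson for sqrt(a), a > 0: each step roughly doubles the number
// of correct bits. The initial guess and the iteration count are exactly
// the kind of implementation choices that could differ between vendors.
static float nr_sqrt(float a) {
    float x = a;                  // crude initial guess
    for (int i = 0; i < 8; ++i)
        x = 0.5f * (x + a / x);   // root of x^2 - a = 0
    return x;
}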

I'd be less surprised about binaries where one used SSE2 floating point, and the other used x87, because of the 80-bit internal precision of x87. I'm guessing that Spring is compiled with -mfpmath=387 even on AMD64 where that's not the default, so it can run on CPUs without SSE math support... I actually haven't tried on my wintendo slot A K7, which only has 3DNow!, since before v0.73, well before the compiler switch for the Windows builds.

If we do find that e.g. cos is CPU-dependent, then we could try to find out which input values are problematic. If it's only some special values to certain ops, then maybe it could be worked around. If it's worse than that, we could see if always rounding the result of cos could give consistent checksums. There's no guarantee that that's possible, but if we're lucky the problem values might not be near rounding thresholds. I'm not sure if there's a math lib function to round floats other than to an integer. And truncating would require even more luck than rounding. We could look into rounding modes in case that makes any difference (see fenv(3), although that might not help). If all else fails, we could try emulating the problematic ops in software, although that would definitely be slow if they're used frequently. Maybe a range check on the input could decide whether to use the FPU op or emulate, if there is a big enough range that most of the time the input falls into. The Linux kernel has a GPLed soft-float implementation that you could get code from.
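One crude way to do that rounding would be to mask off the low mantissa bits of the result (a hypothetical workaround; sync_cos is a made-up name):

#include <cmath>
#include <cstdint>
#include <cstring>

// Zero the 3 lowest mantissa bits, so CPUs whose cos results disagree only
// in the last couple of ULPs all land on the same value.
static float sync_cos(float x) {
    float r = std::cos(x);
    uint32_t bits;
    std::memcpy(&bits, &r, sizeof bits);
    bits &= ~uint32_t(7);
    std::memcpy(&r, &bits, sizeof r);
    return r;
}

As said above, there's no guarantee this works: it only helps if the cross-CPU differences are smaller than the masked precision and the true results don't sit right on a masking boundary.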

BTW, what AF was saying about resync leading to another desync could be a problem if the states are synced to before the problem op. If all machines get the host's result for the op, then it would only desync again immediately if more operations with the problem value were ongoing, e.g. a specific angle at which some unit is moving or aiming.

--
Peter <peter at cordes.ca>
LordMatt
Posts: 3393
Joined: 15 May 2005, 04:26

Post by LordMatt »

1. Resync would help with troubleshooting desync.
2. Devs don't have access to all those random computers in all likelihood.
manored
Posts: 3179
Joined: 15 Nov 2006, 00:37

Post by manored »

hunterw wrote: I've an idea.

If the desyncs are impossible to fix, how about implementing this: giving the host the ability to pause the game and force everyone to download the host's version of the game. Sure, it's not optimal, but desyncs will turn from a game-ruining bug into just an annoyance.
If I understood correctly how the problem works (probably not :lol: ), I think this is a good idea. :-)
Dragon45
Posts: 2883
Joined: 16 Aug 2004, 04:36

Post by Dragon45 »

It should be noted that simply saying "If my XYZ.11 model CPU works in this way, then all XYZ.11 model CPUs should work in this way" is not a valid argument or even an intelligent problem-solving method. Multiple chips made in the same batch can have varying performance characteristics; the number on the box is only a very good approximation. Although the variation tends to be extremely minute, it should be kept in mind when you're running up against the sort of microscopic errors that lead to drift in high-precision applications.

People should also keep in mind that resyncing isn't going to permanently solve the problem over the course of a given game. Even if you share states from the host and essentially keep everyone using the same set of numbers, the "poison" hasn't necessarily gone away; it may just come back later as errors accumulate. Chaos theory/the butterfly effect is a bitch in this regard. Until the underlying problems with how Spring deals with numbers are resolved, resyncing will only forestall problems, not solve them. You may even see the unintended effect of forced resyncs occurring more and more often as long games drag on, because more errors accumulate in the background, so to speak.

Relevant question: There's no way in hell Spring is the only program to run into this problem. How do commercial games and games programmers handle it?
jcnossen
Former Engine Dev
Posts: 2440
Joined: 05 Jun 2005, 19:13

Post by jcnossen »

Dragon45 wrote: Multiple chips made in the same batch can have varying performance characteristics; the number on the box is only a very good approximation. Although the variation tends to be extremely minute, it should be kept in mind when you're running up against the sort of microscopic errors that lead to drift in high-precision applications.
It's really quite fair to assume that the actual transistor schematic will be the same for the same model. There are no analog processes involved in FPU computations, so it's completely deterministic.

It's quite OK to assume there are no actual errors in the chip, other than design errors. If there were manufacturing errors, the chances are very small that they would only change the least significant bits of floating point computations.

Different performance characteristics != nondeterministic behavior; these things are designed with small variations in semiconductor properties in mind.
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

llama wrote: Back on topic: I'd be surprised if you can run the exact same stream of instructions on different CPUs and get different results, but I can believe it. I suppose someone's tried to google for info about floating point differences, but if that doesn't turn up anything then someone could cook up a little test program that does floating point operations on a sequence of input values and checksums the results. Then run the program on various CPUs. I might be able to help with that, since I have access to a P2 and some even more ancient stuff, a P3 (Katmai), P4 (various), Pentium M, Core 2 Duo, K7 (original, tbird, palomino and palomino SMP, and tbred), K8 (Newcastle), and Opteron. All of these run Linux (or at least can boot a live CD), so I could run a test binary.
The first iteration of such a program could output a checksum for each of fmul, fadd, fdiv, sqrt, cos, sin, and whatever other FPU instructions gcc 4.1 emits when compiling Spring. The checksum would be computed e.g. over an array created by doing
2.71 * 1.000
2.71 * 1.001
2.71 * 1.002
...
2.71 * 2.000
(or cos(1.000) ...)
where the increment is one LSB of the mantissa, i.e. FLT_EPSILON for x = 1.0. Manipulating the input float as an integer would be useful to guarantee you're giving the same input on all CPUs. Maybe some denormals, NaNs, and other weird stuff should be thrown in for good measure, as a separate checksum item for each op.

Then you'd compare the checksums for all the different processors (for a binary compiled with the same flags as Spring, or with different flags as an experiment), and see which instructions were always consistent, and which weren't. I don't think anyone is claiming that you might get different results for multiple runs on the same CPU, are they? I hope not.
Have you estimated how long a single run of this program will take?

A single check of all 2^32 input values for e.g. sin/cos/tan/sqrt takes a few seconds on my AMD 3200+. Hence, a check of a binary operator (+ - * / etc.) will take 2^32 times that. Well, assume, optimistically, that a sweep takes exactly 1 second and you can keep the program running continuously at 100% CPU; then it'd take 2^32 seconds, i.e. about 136 years, for the program to complete checksumming just one operation.

Hence it's unrealistic to test this :-)
Note though that I have tested some small subranges, i.e. just some random math on an array of a few thousand random numbers. This never gave any differences with the current setup.
I'd guess that sqrt, cos, and sin (and not algebraic ops like add or div) would be the guilty ones, because AFAIK they're implemented as an iterative process that converges to the desired precision, probably with something like Newton-Raphson. There's a lot of scope for different approaches to getting a result, depending on what fits best in the silicon. IIRC, SSE(2?) even has an approximate sqrt instruction for when normal sqrt isn't fast enough.
AFAIK this doesn't apply to the x87 instructions (ignoring the internal implementation). SSE does indeed have some instructions that have to be called in a loop for increased accuracy.
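For example, SSE's rsqrtss returns only a ~12-bit approximation of 1/sqrt(a); getting near full single precision takes an explicit Newton-Raphson step on top (a sketch using the standard intrinsics):

#include <xmmintrin.h>

// Refine the hardware estimate with one Newton-Raphson step:
// x1 = x0 * (1.5 - 0.5 * a * x0 * x0)
static float rsqrt_refined(float a) {
    float x = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));  // ~12 bits correct
    return x * (1.5f - 0.5f * a * x * x);                  // ~23 bits correct
}

And since the initial estimate is implementation-defined, even the refined result isn't guaranteed to match across CPUs, which is one more reason synced code has to avoid it.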
I'd be less surprised about binaries where one used SSE2 floating point, and the other used x87, because of the 80bit internal precision of x87. I'm guessing that spring is compiled with -mfpmath=387 even on AMD64 where that's not the default, so it can run on CPUs without SSE math support... I actually haven't tried on my wintendo slot A K7, which only has 3dnow, since before v0.73; well before the compiler switch for the windows builds.
Indeed, SSE vs. x87 math desyncs; according to Nicolas Brodu because of different handling of denormals. And you're right about AMD64 builds: they're compiled with -mfpmath=387.
If we do find that e.g. cos is CPU dependent, then we could try to find out what input values are problematic. If it's only some special values to certain ops, then maybe it could be worked around. If it's worse than that, we could see if always rounding the result of cos could give consistent checksums. There's no guarantee that that's possible, but if we're lucky the problem values might not be near rounding thresholds. I'm not sure if there's a math lib function to round floats other than to an integer. And truncating would require even more luck than rounding. We could look into rounding modes in case that makes any difference. (see fenv(3), although that might not help). If all else fails, we could try emulating the problematic ops in software, although that would definitely be slow if they're used frequently. Maybe a range check on the input could decide whether to use the FPU op or emulate, if there is a big enough range that most of the time the input falls into. The Linux kernel has a GPLed soft float implementation that you could get code from.
The problem was mostly that different math libraries had different handling/implementations of sine/cosine/tangent etc. Probably some optimized them to fsin/fcos/ftan and others calculated them with some approximation method. Hence we built a libm into Spring (streflop).
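A libm like that can build the transcendentals out of nothing but the basic ops that IEEE 754 does pin down, e.g. range reduction plus a fixed polynomial. A toy sketch of the idea (not streflop's actual code, and far less accurate):

#include <cmath>

// Toy deterministic sine: reduce the argument to [-pi, pi), then evaluate a
// fixed 7th-order Taylor polynomial. Only exactly-specified basic operations
// are used, so any compliant FPU at the same precision gives identical bits.
static float soft_sin(float x) {
    const float TWO_PI = 6.2831853f;
    x -= TWO_PI * std::floor((x + 3.1415927f) / TWO_PI);  // crude reduction
    float x2 = x * x;
    return x * (1.0f + x2 * (-1.0f / 6.0f + x2 * (1.0f / 120.0f + x2 * (-1.0f / 5040.0f))));
}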
BTW, what AF was saying about resync leading to another desync could be a problem if the states are synced to before the problem op. If all machines get the hosts result for the op, then it would only desync again immediately if more operations with the problem value were ongoing. e.g. maybe a specific angle that some unit is moving or aiming.
The states will be synced to after the desync, i.e. the calculation has already been performed, yielding slightly different results on different clients. Then the results from one of the clients are transferred to the others to resync.

Assuming only certain infrequently used combinations of input to an FPU instruction cause desyncs, this should work fine, I think.

It's another matter if the desync is caused by a Spring bug. Then chances are indeed high that it reoccurs within seconds after the resync.
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

Dragon45 wrote: Relevant question: There's no way in hell Spring is the only program to run into this problem. How do commercial games and games programmers handle it?
For Age of Empires 1 and sequels, see:
http://www.gamasutra.com/features/gdcar ... 00arch.doc
(a synced simulation just like Spring; later versions used peer-to-peer synced simulation instead of one server and multiple clients)

I haven't read anything about other games, just some hearsay, and I really have no clue whether it's correct.

As for open source RTSes, there are:
  • Glest - no multiplayer though
  • ORTS - dumb client model
  • Machinations - project seems dead, was going to use synced simulation
llama
Posts: 5
Joined: 31 Oct 2006, 04:18

Post by llama »

Tobi wrote: Have you estimated how long a single run of this program will take?

A single check of all 2^32 input values for e.g. sin/cos/tan/sqrt takes a few seconds on my AMD 3200+. Hence, a check of a binary operator (+ - * / etc.) will take 2^32 times that.
I was picturing holding one arg constant for binary operators; I know how big 2^64 is! I was also thinking of just running through the range of the mantissa for just one exponent (e.g. 1.0 to 2.0), but a couple seconds per test isn't bad for full coverage.
Tobi wrote: Note though that I have tested some small subranges, i.e. just some random math on an array of a few thousand random numbers. This never gave any differences with the current setup.
Did you try running it on two CPUs that people have reported desync between?

If you post some code, I can compile it and then try the binary on all the machines I have access to. And/or post a binary for ia32 Linux.
llama wrote: I'd guess that sqrt, cos, and sin (and not algebraic ops like add or div) would be the guilty ones, because AFAIK they're implemented as an iterative process that converges to the desired precision, probably with something like Newton-Raphson. There's a lot of scope for different approaches to getting a result, depending on what fits best in the silicon.
Tobi wrote: AFAIK this doesn't apply to the x87 instructions (ignoring the internal implementation). SSE does indeed have some instructions that have to be called in a loop for increased accuracy.
I'm talking about what the FPU does internally when running e.g. an fcos instruction. That's why they take > 10 clock cycles, unlike add or mul. AMD and Intel might use a different algorithm in their microcode to calculate cosines, which could sometimes give different results. But it's less likely that there's more than one way to do an algebraic op like fmul. Maybe denormal handling and stuff like that can vary. I thought there was a way to set the FPU to flush denormals to zero, but I can't find anything about it in the libc info page (for GNU libc). Maybe it's just that the FPU hardware on some non-x86 architectures flushes denormals.

Ah, here's an interesting doc:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard.
x87 math is IEEE-compliant, so we can trust the arithmetic ops and sqrt. Definitely good to know. Nothing is said about transcendental ops, like exp or trig. I googled around a bunch and AFAICT IEEE 754 doesn't specify how any transcendental ops have to work. It's totally possible that different CPU designs can give different results for those ops. They're all unary operations, so it's possible to give them a thorough check.
Tobi wrote: The problem was mostly that different math libraries had different handling/implementations of sine/cosine/tangent etc. Probably some optimized them to fsin/fcos/ftan and others calculated them with some approximation method. Hence we built a libm into Spring (streflop).
Interesting... That should be a good framework for doing extra fudging of results if we find a problematic operation.
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

llama wrote: I was picturing holding one arg constant for binary operators; I know how big 2^64 is! I was also thinking of just running through the range of the mantissa for just one exponent (e.g. 1.0 to 2.0), but a couple seconds per test isn't bad for full coverage.
I wouldn't even mind if it's a few hours in total, but a century-plus seemed a bit much :-)
Did you try running it on two CPUs that people have reported desync between?
No, no one had any clue desyncs had to do with different CPUs back then. I used this test code to test compilers and optimization flags.
If you post some code, I can compile it and then try the binary on all the machines I have access to. And/or post a binary for ia32 Linux.
I'm not sure I still have it, but if I find it or consider it important enough to hack together some new code then I'll let you know.
I'm talking about what the FPU does internally when running e.g. an fcos instruction. That's why they take > 10 clock cycles, unlike add or mul. AMD and Intel might use a different algorithm in their microcode to calculate cosines, which could sometimes give different results. But it's less likely that there's more than one way to do an algebraic op like fmul. Maybe denormal handling and stuff like that can vary. I thought there was a way to set the FPU to flush denormals to zero, but I can't find anything about it in the libc info page (for GNU libc). Maybe it's just that the FPU hardware on some non-x86 architectures flushes denormals.
IIRC, I have tested at least one of the transcendental ops. But again, not on that many machines, probably just AMD ones (for the same reason as above: this "different CPU" clue wasn't known yet).

Denormals can be flushed to zero on SSE only; x87 math doesn't have such an option.
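For reference, FTZ is a bit in the SSE control register MXCSR; with the standard intrinsics it looks like this (a minimal sketch; it only affects SSE math, so e.g. -mfpmath=sse builds):

#include <xmmintrin.h>

int main() {
    // Set the FTZ bit in MXCSR: denormal results are flushed to +/-0.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    // (Later CPUs also have a DAZ bit, set via _MM_SET_DENORMALS_ZERO_MODE
    // from pmmintrin.h, which additionally treats denormal inputs as zero.)
    volatile float a = 1e-20f, b = 1e-20f;
    volatile float r = a * b;  // 1e-40 is denormal; with FTZ it becomes 0
    return r == 0.0f ? 0 : 1;
}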
Ah, here's an interesting doc:
http://docs.sun.com/source/806-3568/ncg_goldberg.html
It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard.
x87 math is IEEE-compliant, so we can trust the arithmetic ops and sqrt. Definitely good to know. Nothing is said about transcendental ops, like exp or trig. I googled around a bunch and AFAICT IEEE 754 doesn't specify how any transcendental ops have to work. It's totally possible that different CPU designs can give different results for those ops. They're all unary operations, so it's possible to give them a thorough check.
Well, I basically already knew that (even attempting to sync Linux/Windows would have been totally useless if even the normal ops weren't guaranteed to give the same results under controlled conditions). Thanks anyway.

It might even be that streflop calculates them using normal operations; I'll try to remember to check that.
zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Post by zerver »

jcnossen wrote: It's really quite fair to assume that the actual transistor schematic will be the same for the same model. There are no analog processes involved in FPU computations, so it's completely deterministic.
There are different "steppings" of the same processor model. The incompatibility that I came across was P4 1.4 vs. P4 1.6, with a different stepping most likely being the only relevant difference. So I would not assume processor schematics are identical, not even for the same model.

I suggest adding a feature to Spring that uploads the CPU type/ID/stepping of host and clients whenever a desync occurs.
Do some statistical analysis on this data and divide CPUs into "classes" that are compatible with each other. Disallow incompatible variants from joining the same game, or at least warn the user before joining!
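Reading those fields is cheap; on GCC it could look like this (a sketch of the CPUID side only; the upload/statistics part is left out):

#include <cpuid.h>
#include <cstdio>

// Family/model/stepping come from CPUID leaf 1, register EAX.
int main() {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 1;
    unsigned stepping = eax & 0xF;
    unsigned model = (eax >> 4) & 0xF;
    unsigned family = (eax >> 8) & 0xF;
    if (family == 0xF) {  // extended fields (simplified; vendors differ a bit)
        family += (eax >> 20) & 0xFF;
        model |= ((eax >> 16) & 0xF) << 4;
    }
    std::printf("family %u model %u stepping %u\n", family, model, stepping);
    return 0;
}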

I understand there is quite a lot of data to resync (not only units but also terrain and such), so it will likely take some seconds to complete.