Posted: 10 Jan 2007, 05:43
Small differences could be exploited by cheaters. Better to just implement a resync as is being done already.
There's nothing to stop information cheats, like .spectator view. You'd know which radar dots to attack with bombers or LRPC. And stealth/cloak for base infiltration would be useless. Of course units couldn't fire at anything they wouldn't normally, at least not without attack-ground orders... It would be more trouble than it's worth (for most people) but you could probably write a layer that sits between the player and Spring, and issues orders that take advantage of what it can "see"...

hunterw wrote: i love that cheating is basically impossible in this game.
If I understood well how the problem works (probably no

hunterw wrote: i've an idea. if the desyncs are impossible to fix, how about implementing this... giving the host the ability to pause the game and force everyone to download the host's version of the game. Sure, it's not optimal, but desyncs will turn from a game-ruining bug into just an annoyance.
It's really quite fair to assume that the actual transistor schematic will be the same for the same model. There are no analog processes involved in FPU computations, so it's completely deterministic.

The multiple chips made in the same batch can have varying performance characteristics. The number on the box is only a very good approximation. Although it tends to be extremely minute, it should be kept in mind if you're running up against the sort of microcosmic errors that lead to drifting in high-precision applications.
Have you estimated how long a single run of this program will be?

llama wrote: Back on topic: I'd be surprised if you can run the exact same stream of instructions on different CPUs and get different results, but I can believe it. I suppose someone's tried to google for info about floating point differences, but if that doesn't turn up anything then someone could cook up a little test program that applies a floating point operation to a sequence of input values and checksums the results. Then run the program on various CPUs. I might be able to help with that, since I have access to a P2 and some even more ancient stuff, a P3 (Katmai), P4 (various), Pentium M, Core 2 Duo, K7 (original, tbird, palomino and palomino SMP, and tbred), K8 (Newcastle), and Opteron. All of these run Linux (or at least can boot a live CD), so I could run a test binary.
The first iteration of such a program could output a checksum for each of fmul, fadd, fdiv, sqrt, cos, sin, and whatever other FPU instructions gcc 4.1 emits when compiling Spring. The checksum would be computed, e.g., over an array created by doing
2.71 * 1.000
2.71 * 1.001
2.71 * 1.002
...
2.71 * 2.000
(or cos(1.000) ...)
Where the increment is one LSB of the mantissa, i.e. FLT_EPSILON for x = 1.0. Manipulating the input float as an integer would be useful to guarantee you're giving the same input on all CPUs. Maybe some denormals, NaNs, and other weird stuff should be thrown in for good measure, as a separate checksum item for each op.
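For illustration, here is a minimal sketch of the kind of test program being described (not anything Spring ships; the 2.71 constant, the 1.0 to 2.0 range and the simple additive checksum are just the examples from this post):

[code]
// Minimal sketch of the proposed checksum test (illustrative only).
// Steps x one mantissa LSB at a time by manipulating the float's bit
// pattern as an integer, applies an FP operation, and folds the result
// bits into a simple checksum.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

static uint32_t float_bits(float f)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);   // reinterpret the bits safely
    return u;
}

static float bits_float(uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

int main()
{
    uint32_t checksum_mul = 0, checksum_cos = 0;

    // 1.0f .. 2.0f covers every mantissa value for a single exponent.
    const uint32_t start = float_bits(1.0f);
    const uint32_t end   = float_bits(2.0f);

    for (uint32_t u = start; u <= end; ++u) {
        float x = bits_float(u);              // exact same input on every CPU
        checksum_mul += float_bits(2.71f * x);
        checksum_cos += float_bits(std::cos(x));
    }

    std::printf("fmul checksum: %08x\n", checksum_mul);
    std::printf("fcos checksum: %08x\n", checksum_cos);
    return 0;
}
[/code]

Note that std::cos here goes through the C library rather than a raw fcos, so differences could also come from libm rather than the CPU, which is exactly the distinction discussed further down the thread.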
Then you'd compare the checksums for all the different processors (for a binary compiled with the same flags as spring, or not as an experiment), and see which instructions were always consistent, and which weren't. I don't think anyone is claiming that you might get different results for multiple runs on the same CPU, are they? I hope not.
AFAIK this doesn't apply to the x87 instructions (ignore internal implementation). SSE does indeed have some instructions that have to be called in a loop for increased accuracy.

I'd guess that sqrt, cos, and sin (and not algebraic ops like add or div) would be the guilty ones, because AFAIK they're implemented as an iterative process that converges to the desired precision, probably with something like Newton-Raphson. There's a lot of scope for different approaches to getting a result, depending on what fits best in the silicon. IIRC, SSE(2?) even has an approximate sqrt instruction for when normal sqrt isn't fast enough.
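To make the "approximate then refine" idea concrete, here is a rough sketch (not Spring code) using SSE's rsqrtss, which gives roughly a 12-bit estimate of the reciprocal square root that a Newton-Raphson step then sharpens; how many steps and which constants are used is an implementation choice, which is why this kind of operation leaves more room for variation than the exactly specified add/mul/div:

[code]
// Illustrative only: approximate reciprocal sqrt plus one refinement step.
#include <xmmintrin.h>
#include <cstdio>

static float approx_rsqrt_refined(float x)
{
    float y;
    _mm_store_ss(&y, _mm_rsqrt_ss(_mm_set_ss(x)));  // ~12-bit estimate
    y = y * (1.5f - 0.5f * x * y * y);              // one Newton-Raphson step
    return y;
}

int main()
{
    std::printf("1/sqrt(2) ~ %.9f\n", approx_rsqrt_refined(2.0f));
    return 0;
}
[/code]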
Indeed, SSE vs. x87 math desyncs; according to Nicolas Brodu that's because of different handling of denormals. You're right about AMD64 compiles: indeed, they're compiled with -mfpmath=387.

I'd be less surprised about binaries where one used SSE2 floating point, and the other used x87, because of the 80-bit internal precision of x87. I'm guessing that Spring is compiled with -mfpmath=387 even on AMD64 where that's not the default, so it can run on CPUs without SSE math support... I actually haven't tried on my wintendo slot A K7, which only has 3dnow, since before v0.73; well before the compiler switch for the windows builds.
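As an aside on the 80-bit point: on glibc/x86 the x87 precision-control field can be lowered so that intermediates are rounded like plain doubles, which is the sort of knob a determinism-minded build can turn. A sketch under that assumption, not Spring's actual setup:

[code]
// Sketch (glibc on x86 only): by default the x87 FPU keeps 80-bit
// intermediates, one reason x87 and SSE builds can disagree. The control
// word's precision field can be forced down so intermediate results are
// rounded to a 53-bit (double) or 24-bit (float) significand instead.
#include <fpu_control.h>   // glibc-specific
#include <cstdio>

static void set_x87_precision_double()
{
    fpu_control_t cw;
    _FPU_GETCW(cw);                             // read the x87 control word
    cw = (cw & ~_FPU_EXTENDED) | _FPU_DOUBLE;   // precision control: 53 bits
    _FPU_SETCW(cw);
}

int main()
{
    set_x87_precision_double();
    std::printf("x87 precision control set to 53-bit mode\n");
    // ... simulation math that should round consistently would run here ...
    return 0;
}
[/code]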
The problem was mostly that different math libraries had different handling / implementation of sine/cosine/tan etc. Probably some optimized it to fsin fcos ftan and others just calculated it with some approximation method. Hence we built a libm into Spring (streflop).

If we do find that e.g. cos is CPU dependent, then we could try to find out what input values are problematic. If it's only some special values to certain ops, then maybe it could be worked around. If it's worse than that, we could see if always rounding the result of cos could give consistent checksums. There's no guarantee that that's possible, but if we're lucky the problem values might not be near rounding thresholds. I'm not sure if there's a math lib function to round floats other than to an integer. And truncating would require even more luck than rounding. We could look into rounding modes in case that makes any difference (see fenv(3), although that might not help). If all else fails, we could try emulating the problematic ops in software, although that would definitely be slow if they're used frequently. Maybe a range check on the input could decide whether to use the FPU op or emulate, if there is a big enough range that most of the time the input falls into. The Linux kernel has a GPLed soft float implementation that you could get code from.
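Two of the ideas in that last paragraph can be sketched directly; this is illustrative only (not Spring or streflop code), and the helper name and the "4 bits" below are arbitrary choices. It shows (a) switching the IEEE rounding mode via fenv(3) and (b) crudely "rounding" a result by clearing its low mantissa bits so that values differing only in the last couple of ULPs checksum the same:

[code]
#include <cfenv>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <cstdio>

// Clear the lowest `bits` bits of the float's mantissa.
static float drop_low_mantissa_bits(float f, int bits)
{
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    u &= ~((1u << bits) - 1u);
    std::memcpy(&f, &u, sizeof f);
    return f;
}

int main()
{
    std::fesetround(FE_TONEAREST);   // or FE_TOWARDZERO, FE_UPWARD, FE_DOWNWARD

    float c = std::cos(0.5f);
    std::printf("cos(0.5)             = %.9f\n", c);
    std::printf("with 4 low bits zero = %.9f\n", drop_low_mantissa_bits(c, 4));
    return 0;
}
[/code]

Note that masking bits like this is a form of truncation, which, as said above, needs even more luck than rounding: two results that straddle a masking boundary still end up different.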
The states will be synced to after the desync, i.e. the calculation has already been performed, yielding slightly different results on different clients. Then the results from one of the clients are transferred to the other to resync.

BTW, what AF was saying about resync leading to another desync could be a problem if the states are synced to before the problem op. If all machines get the host's result for the op, then it would only desync again immediately if more operations with the problem value were ongoing, e.g. maybe a specific angle that some unit is moving or aiming.
For Age of Empires 1 and sequels, see:

Dragon45 wrote: Relevant question: There's no way in hell Spring is the only program to run into this problem. How do commercial games and games programmers handle it?
I was picturing holding one arg constant for binary operators; I know how big 2^64 is! I was also thinking of just running through the range of the mantissa for just one exponent (e.g. 1.0 - 2.0), but a couple seconds per test isn't bad for full coverage.

Tobi wrote: Have you estimated how long a single run of this program will be?
A single check of all 2^32 input values for e.g. sin/cos/tan/sqrt takes a few seconds on my AMD 3200+. Hence, a check of a binary operator (+ - * / etc.) will take 2^32 times a few seconds.
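For reference, the exhaustive unary-op check being estimated there might look something like this (an illustrative sketch; sinf and the multiply-by-31 checksum are arbitrary choices, and every 32-bit pattern gets visited, so NaNs, infinities and denormals are fed in automatically):

[code]
// Walk every 32-bit pattern, treat it as a float, apply one unary operation
// and checksum the result bits.
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    uint32_t checksum = 0;
    uint32_t u = 0;
    do {
        float x, r;
        std::memcpy(&x, &u, sizeof x);
        r = std::sin(x);
        uint32_t rbits;
        std::memcpy(&rbits, &r, sizeof rbits);
        checksum = checksum * 31u + rbits;   // order-sensitive checksum
    } while (u++ != 0xFFFFFFFFu);            // stops after processing all 2^32 inputs

    std::printf("sinf checksum over all 2^32 inputs: %08x\n", checksum);
    return 0;
}
[/code]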
Did you try running it on two CPUs that people have reported desync between?

Tobi wrote: Note though that I have tested some small subranges, i.e. just some random math on an array of a few thousand random numbers. This never gave any differences with the current setup.
I'm talking about what the FPU does internally when running e.g. an fcos instruction. That's why they take > 10 clock cycles, unlike add or mul. AMD and Intel might use a different algorithm in their microcode to calculate cosines, which could sometimes give different results. But it's less likely that there's more than one way to do an algebraic op like fmul. Maybe denormal handling and stuff like that can vary. I thought there was a way to set the FPU to flush denormals to zero, but I can't find anything about it in the libc info page (for GNU libc). Maybe it's just that the FPU hardware on some non-x86 architectures flushes denormals.

AFAIK this doesn't apply to the x87 instructions (ignore internal implementation). SSE does indeed have some instructions that have to be called in a loop for increased accuracy.

llama wrote: I'd guess that sqrt, cos, and sin (and not algebraic ops like add or div) would be the guilty ones, because AFAIK they're implemented as an iterative process that converges to the desired precision, probably with something like Newton-Raphson. There's a lot of scope for different approaches to getting a result, depending on what fits best in the silicon.
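On the flush-to-zero question above: x87 itself has no such mode, but the SSE control register (MXCSR) does, via its FTZ bit (plus a related DAZ bit for denormal inputs on later CPUs), so this is one concrete way denormal handling can differ between SSE and x87 code paths. A small sketch, assuming SSE math is used for the multiply (e.g. -mfpmath=sse or any x86-64 build):

[code]
#include <xmmintrin.h>
#include <cfloat>
#include <cstdio>

int main()
{
    // Set the FTZ bit in MXCSR: denormal SSE results are flushed to zero.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    // FLT_MIN * 0.5 is a denormal; with FTZ on it comes out as exactly 0.
    volatile float x = FLT_MIN;
    volatile float y = 0.5f;
    std::printf("FLT_MIN * 0.5 = %g\n", x * y);
    return 0;
}
[/code]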
x87 math is IEEE compliant so we can trust the arithmetic ops and sqrt. Definitely good to know. Nothing is said about transcendental ops, like exp or trig. I googled around a bunch and AFAICT IEEE 754 doesn't specify how any transcendental ops have to work. It's totally possible that different CPU designs can give different results for those ops. They're all unary operations, so it's possible to give them a thorough check.

It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus, when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard.
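A tiny illustration of what that guarantee means in practice: any machine doing the basic ops in IEEE double precision should produce the same bit pattern for the same inputs, so a checksum over +, -, *, / and sqrt results should agree everywhere. For example (the expected constant is the well-known result of 0.1 + 0.2 in binary64):

[code]
#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    double sum = 0.1 + 0.2;   // exact result is 0.30000000000000004 in binary64
    uint64_t bits;
    std::memcpy(&bits, &sum, sizeof bits);
    std::printf("0.1 + 0.2 -> %016llx (expect 3fd3333333333334)\n",
                (unsigned long long)bits);
    return 0;
}
[/code]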
Interesting... That should be a good framework for doing extra fudging of results if we find a problematic operation.

Tobi wrote: The problem was mostly that different math libraries had different handling / implementation of sine/cosine/tan etc. Probably some optimized it to fsin fcos ftan and others just calculated it with some approximation method. Hence we built a libm into Spring (streflop).
I wouldn't even mind if it's a few hours in total, but trillions of years seemed a bit much.

llama wrote: I was picturing holding one arg constant for binary operators; I know how big 2^64 is! I was also thinking of just running through the range of the mantissa for just one exponent (e.g. 1.0 - 2.0), but a couple seconds per test isn't bad for full coverage.
No, no one had any clue desyncs had to do with different CPUs back then. I used this test code to test compilers and optimization flags.

Did you try running it on two CPUs that people have reported desync between?
I'm not sure I still have it, but if I find it, or consider it important enough to hack together some new code, then I'll let you know.

If you post some code, I can compile it and then try the binary on all the machines I have access to. And/or post a binary for ia32 Linux.
IIRC, I have tested at least one of the transcendental ops. But again, not on that many machines, probably just AMD ones (for the same reason as above, this different CPU "clue" wasn't known yet).

llama wrote: I'm talking about what the FPU does internally when running e.g. an fcos instruction. That's why they take > 10 clock cycles, unlike add or mul. AMD and Intel might use a different algorithm in their microcode to calculate cosines, which could sometimes give different results. [...]
Well, I basically already knew that (even attempting to sync Linux/Windows would have been totally useless if even the normal ops weren't guaranteed to give the same results under controlled conditions). Thanks anyway.

Ah, here's an interesting doc: http://docs.sun.com/source/806-3568/ncg_goldberg.html

x87 math is IEEE compliant so we can trust the arithmetic ops and sqrt. [...]
There are different "steppings" of the same processor model. The incompatibility that I came across was P4 1.4 vs P4 1.6, most likely a stepping being the only difference. So, I would not make any assumptions about processor schematics being identical, not even for the same model.

jcnossen wrote: It's really quite fair to assume that the actual transistor schematic will be the same for the same model. There are no analog processes involved in FPU computations, so it's completely deterministic.