Sync errors - Page 2

Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

The pathfinder isn't the problem, and it doesn't need replacing either imho. It works flawlessly. As I have stated numerous times already, units getting stuck in each other is a collision detection problem, not a pathfinding problem. At best the PF could be modified to work around the CD problems (as someone already did a while ago by tuning some constants, but that had the side effect of increasing CPU usage by 20-30%).

About the compiler, I doubt that would be the problem. It would probably have shown up in at least one of my 200 test runs. (Let alone the fact that the GCC people aren't stupid enough to release an untested compiler, plus many Linux distros are compiled using GCC 4.1 and run perfectly stable & correct :-))

Generally sync errors are reproducible. I.e. clicking on some thing overwrites some synced memory -> sync error, or some variable (remember the commonSonarJammerMap) is uninitialized, which makes it trivial to reproduce too (once you know the cause, that is; figuring out the bug from the sync error is much harder).
malric
Posts: 521
Joined: 30 Dec 2005, 22:22

Post by malric »

I meant if we have a new reproducible sync error with the new version...

About compiler bugs, I have seen quite a few (and they were exposed in very strange cases), so maybe I am biased ;). And GCC is quite well tested (with an automated suite, etc.). But then again, I have seen even bigger test suites that couldn't catch all the bugs.

Ok, so next time when somebody tries to blame the pathfinding I will know what to say ;).
LordMatt
Posts: 3393
Joined: 15 May 2005, 04:26

Post by LordMatt »

Ok, so it isn't pathfinding. :shock: Is it possible, though, that units are getting stuck on one machine and not the other, and that is causing the desync?
Argh
Posts: 10920
Joined: 21 Feb 2005, 03:38

Post by Argh »

What's much more likely is that one machine's CPU is getting overwhelmed and falling out of lockstep with the rest for long enough to desync.

And yeah, better steering and collision code would certainly help a lot. Steering code is fairly hard, however.
malric
Posts: 521
Joined: 30 Dec 2005, 22:22

Post by malric »

Argh wrote: What's much more likely is that one machine's CPU is getting overwhelmed and falling out of lockstep with the rest for long enough to desync.

And yeah, better steering and collision code would certainly help a lot. Steering code is fairly hard, however.

I do not get exactly what you are saying in the first paragraph, but better steering code could at most make the problem appear less often. The problem that would need to be fixed first is one processor getting overwhelmed.

But then again, if the devs can't get a reproducible desync, they can't solve it... (and we will not know where it comes from ;) )
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Post by Licho »

Yeah, but I have seen several desyncs where, say, 50% of people desynced. Or 3 at once.
So that was certainly not just 1 overheated CPU.
Felix the Cat
Posts: 2383
Joined: 15 Jun 2005, 17:30

Post by Felix the Cat »

My CPU and GPU overheat fairly often; I don't fall out of sync because of it.

When I do fall out of sync, I usually have two or three others for company...
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Post by AF »

It seems to make sense that 1 person falling out of step for long enough to desync could pull others along with them under the right conditions, and it's usually the pattern seen.

Perhaps the host jumps too far ahead of the desynced clients..?
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Post by Licho »

How could one desynced player desync others??? It doesn't make sense, imo it's impossible. Because the data received by the other clients is still the same, they cannot desync as a result of another desynced player.

The only way to desync is a HW problem (calculation errors due to CPU overheating - and yes, it happens!) or a software bug..

I got lucky .. the very first game had a desync ..
1 player and host vs. 3 players .. so it's certainly not the HW type..

(Desync near the end when I'm about to pwn remaining player)

http://licho.eu/desync3.zip
Argh
Posts: 10920
Joined: 21 Feb 2005, 03:38

Post by Argh »

Look, people, instead of wasting time on theory, try this test. Let us know the results, or better yet, post a replay, so that we can repeat the experiment on our machines and actually have a scientifically-repeatable experiment with various hardware mixes.

1. Start up a server. Set up a game with NanoBlobs.

2. .give yourselves 500 of any land unit in NanoBlobs.

3. Tell them to all go somewhere.

4. Watch what happens to your CPU usage. You will see CPU usage hit over 50% (which is when problems start occurring) in waves. It is not always the first instant after the move command is issued that is the worst, either.

Try a second test:

1. Start up a server. Set up a game of AA.

2. .give yourself 100 Peewees.

3. Move 'em around a little bit, to randomly distribute them.

4. Kill 'em with something light enough to leave a corpse.

5. .give yourself 50 more Peewees.

6. Give them a move order through the field of dead.

I wish there was a cleaner way to set up the second test, but nobody has bothered to construct a simple mod for stress-testing purposes that is 3DO and includes corpses, and I'm not about to build one for you guys, when you could do it yourselves in about 20 minutes.

My points here are simple, testable, and repeatable:

A. The amount of CPU used during events is not always maxed at start (when the pathfinder is doing initial work). It often actually peaks later.

B. Corpses add a lot to the CPU use, because they aren't handled well by the steering / collision code.

Basically, there are dozens of ways to make Spring lag out to the point of desync with the older mods, and I'm frankly surprised that desync hasn't been a bigger problem, except that everybody's been playing mods with slower gameplay than NB. For example, the FALL | SMOKE | FIRE | EXPLODE-ON-HIT code is veeeeeeeeeery inefficient and slow. It breaks up the model and makes new collision spheres on the fly, so that it will bounce on the ground. If you blow up enough stuff with fancy explosions in multiplayer, it is quite likely someone will desync. That's one of the reasons I took most of that code out've NB... it was waaaaay too slow for gameplay I wanted, and would've caused problems everywhere...

Oh yeah, and did I mention that because somebody thought it'd be great if the broken parts could cause damage, every part freed up by FALL events not only becomes something with a hitsphere, but also gets treated as a WeaponProjectile, and is thus being used by sync code, instead of just being an unsynced special effect like it probably should be? ;)

So, what we need is for someone to test my theory, that the primary cause of desync is people with crappy systems or lousy 'net connections. As Lazorwolf's post indicated, it eventually was shown that somewhere along the way, Spring was desyncing due to packet loss or fragmentation.

Perhaps Spring's bandwidth usage needs to be studied more. Maybe it has a very low maximum throughput level that is reached too easily on high-latency connections, or maybe it's flooding lower-bandwidth connections, because a lot've things that shouldn't be synced still are being synced, or maybe my theory is correct, and CPU usage spikes are the true culprit. In that case, the only real remedy on the software end is better-optimized code for the primary culprits (not an easy or pleasant task, I wager, and I'll be the first to say that I am not skilled enough to even try). However, there is ALWAYS GOING TO BE a ceiling, people. The urge to keep Spring at such a low level that it is always playable on 800MHz machines is just stupid, when it costs so little these days to build a machine that will run it with ease.

I played NanoBlobs on a co-worker's Core Duo machine the other day at work, and was amazed how silky-smooth it ran, even with insane unit counts. His GPU sucks compared to the one in my Athlon XP 2800+, so I have to put that down to sheer CPU speed. It was quite instructive about what really makes Spring "lag": CPUs are simply getting overwhelmed.
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Post by AF »

That stuff about FALL | EXPLODE makes a lot of sense. I don't remember ever desyncing in a NanoBlobs game though, maybe because I haven't played NanoBlobs online that much......
PauloMorfeo
Posts: 2004
Joined: 15 Dec 2004, 20:53

Post by PauloMorfeo »

Licho wrote: How could one desynced player desync others??? It doesn't make sense, imo it's impossible. ...

I would guess that is not what happens; instead, the server must be controlling who is regarded as synced or not. If "wrong" data happens to be on one PC and it tells other PCs that that data exists, the server would complain about all of those PCs being in sync error. This explains why often not just one person goes out of sync but several, or all but the host. In the case of all but the host going into sync error, I would bet that it was some data error that happened on the host (making the host regard all others as out of sync).

Note, however, that I don't know well how the sync system works.

Now that I mention it, does anyone remember a host ever being out of sync? I would guess that is not possible, even though I seem to remember such a thing.
Felix the Cat
Posts: 2383
Joined: 15 Jun 2005, 17:30

Post by Felix the Cat »

Hosts don't fall out of sync, as far as I know. If the host's game is different from everyone else's, everyone else falls out of sync.
malric
Posts: 521
Joined: 30 Dec 2005, 22:22

Post by malric »

Would it be feasible for the server to have the sync debugger compiled in (there is a document about it in svn: /Documentation/HowToSyncDebug.odt)? I mean, would there be a big performance hit ... ?
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

If you're talking about the syncdebug code (which is mostly useful to figure out which line of code generates a slightly different result on some CPUs), it uses ~200M extra RAM on the host (~20M on client) plus it requires the mingw32 addr2line tool installed and spring to be compiled with max debugging info (over 100M IIRC). There's also a hit on the CPU, but that is minor I think.

Also it hasn't been integrated into all parts of spring, and because of certain designs it's hard to do.

However, maybe it would be possible to make a lightweight version of it, that can send the hex addresses to a server with the debugging symbols, and that doesn't use the full 200M extra RAM. I'll think about it.

There's also the syncify code (which is mostly useful to debug synced memory corruption by unsynced code), which is broken, and which also has a pretty big performance hit because it changes permissions on ram pages many times per second.

Also, all these pathfinding/units getting stuck things are totally unrelated to the sync problem, I'll split the thread. (as long as the logic has deterministic results, it doesn't matter how bugged the logic itself is.)
LordMatt
Posts: 3393
Joined: 15 May 2005, 04:26

Post by LordMatt »

I have the computer power to run a debug build and would be happy to send in results if there is a problem. But the question is, would the debug build sync with the non-debug build?
Licho
Zero-K Developer
Posts: 3803
Joined: 19 May 2006, 19:13

Post by Licho »

Can you please explain how it works? Can delays in calculation (like the mentioned explosions or stuff) really desync it? I thought it won't; instead the "synced" game will slow down a bit to wait for the proper sync result, and the host will get "delayed sync" if there are problems with slow calculations or responses from clients.

Packet loss problems are certainly dealt with by lower layers in Spring, otherwise we would get desyncs every game (UDP has no built-in transport layer controls, so you get all the problems of the underlying layers replicated here = high real packet loss of UDP packets).

Data modified by the network transfer would certainly cause desyncs. And it certainly happens too from time to time.


But still, if 3 people desync and 3 don't, it clearly points to a SW bug, am I right?


Btw. a friend of mine was getting sync errors (due to an overheated CPU). An easy way to check your own CPU is to run something like a primes stress tool. If the CPU is overstressed, some floating point operations get less exact. Generally it doesn't matter, but it desyncs Spring..
malric
Posts: 521
Joined: 30 Dec 2005, 22:22

Post by malric »

About what Tobi said, I think a lot of people could use the debug version (for the sake of debugging and greater good :wink: ). Of course, maybe it would be better not to depend on addr2line.

I have another question: is the syncdebug dump useful if some of the players use the debug version and the others don't?

(@Matt : I would hope they would sync, but let's wait for the official answer)
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

@Licho:

The syncdebugger works by providing a SyncedFloat (and SyncedInt..) class which is to be used in place of float (and int..). Basically this class is one big wrapper to the normal operators, each operator downcasting the result to the type given by upcast<type1, type2> (ie. the normal C++ casting is mimicked using template magic). All this magic is basically there to prevent detecting desyncs on constructs like unsyncedvar = syncedvar1 + syncedvar2;.

However, all assignment operators have a special task: after performing the assignment, they call CSyncDebugger::Sync(). This commits the result of the assignment to a big history (the ~200M), on the host combined with a short stacktrace and the operator name (+= -= *= etc.). On clients, the history is maintained too, but without stacktrace/operator info, hence the smaller size (~20M).
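As a rough illustration of the wrapper idea described above (not the actual Spring code, which is C++ and uses templates), here is a minimal Python sketch. The names SyncHistory and sync() are hypothetical stand-ins for CSyncDebugger and CSyncDebugger::Sync(), and only += and -= are shown. Python cannot intercept plain assignment, so the sketch assumes that recording in the constructor and in the in-place operators is close enough to show the principle.

```python
import traceback


class SyncHistory:
    """Hypothetical stand-in for the history that CSyncDebugger::Sync() fills."""

    def __init__(self, is_host):
        self.is_host = is_host
        self.entries = []

    def sync(self, value, op):
        if self.is_host:
            # Host: keep the value plus the operator name and a short stack trace.
            # The last two frames are sync() itself and the calling operator.
            frames = traceback.extract_stack(limit=5)[:-2]
            self.entries.append((value, op, tuple(f.name for f in frames)))
        else:
            # Client: keep only the value, hence the much smaller history.
            self.entries.append((value,))


class SyncedFloat:
    """Float wrapper whose assignment-style operators record their result."""

    def __init__(self, value, history):
        self.value = float(value)
        self.history = history
        self.history.sync(self.value, "=")

    def __float__(self):
        return self.value

    # Plain arithmetic returns an ordinary float, so assigning the result to
    # an unsynced variable records nothing in the history.
    def __add__(self, other):
        return self.value + float(other)

    def __sub__(self, other):
        return self.value - float(other)

    # In-place operators change synced state, so each one commits its result.
    def __iadd__(self, other):
        self.value += float(other)
        self.history.sync(self.value, "+=")
        return self

    def __isub__(self, other):
        self.value -= float(other)
        self.history.sync(self.value, "-=")
        return self
```

With this sketch, `health -= 12.5` on a SyncedFloat adds one history entry, while `shown = health + 1.0` yields a plain float and adds none, which mirrors the unsyncedvar = syncedvar1 + syncedvar2 case.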

As long as the game doesn't desync, this is all that happens. But if the game desyncs, the syncdebugger kicks in. It auto-pauses the game, and the host syncdebugger starts "talking" with the client syncdebuggers. They compare each other's histories (using some smart checksumming algo so only ~10-100 kilobytes are transferred, not 20M). When that is finished, the host syncdebugger outputs a file syncdebug-server.log, in which it first puts a (rather boring and long) dump of all differing assignments, each with an index into a list of stacktraces. After that it dumps the pool of stacktraces. (The actual dump and the stacktraces are split up because the time between the real desync and the sync error is often rather long (that is, on the scale of individual assignments), hence the same place in the code can easily generate over 60 differing assignments in the dump.)
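The post does not say which checksumming algorithm is used, so the following Python sketch only illustrates one plausible approach: checksum the history in fixed-size blocks, exchange only the checksums, and look at individual entries only inside blocks that disagree. The block size, the use of CRC32 and the function names are all assumptions made for the example.

```python
import struct
import zlib

BLOCK = 4  # entries per block; hypothetical, kept small for the demo


def block_checksums(history, block=BLOCK):
    """One CRC32 per block of history values (values packed as doubles)."""
    sums = []
    for start in range(0, len(history), block):
        raw = b"".join(struct.pack("<d", v) for v in history[start:start + block])
        sums.append(zlib.crc32(raw))
    return sums


def differing_entries(host_history, client_history, block=BLOCK):
    """Indices at which the two histories differ.

    Only the per-block checksums would need to cross the network at first;
    full entries are compared only for blocks whose checksums disagree.
    """
    host_sums = block_checksums(host_history, block)
    client_sums = block_checksums(client_history, block)
    diffs = []
    for b, (hs, cs) in enumerate(zip(host_sums, client_sums)):
        if hs == cs:
            continue  # block matches, nothing to transfer
        start = b * block
        end = min(start + block, len(host_history), len(client_history))
        for i in range(start, end):
            if host_history[i] != client_history[i]:
                diffs.append(i)
    return diffs
```

For two histories of ten values that differ only at index 6, only the second block's entries need to be fetched and the function returns [6].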

Client-side, the syncdebugger writes some progress in syncdebug-client.log, but this doesn't contain really useful info.

Analyzing syncdebug-server.log can help debugging sync errors. (emphasize the "help", it is in no way the ultimate solution that always works...)

EDIT:
about the normal code: the server sends out newframe messages, and clients only run CGame::SimFrame() if they receive a newframe message. Further, packet loss is handled by the CNet class (using acks prepended to actual messages, and buffers of unacked packets etc.). I've never extensively tested it, but it must have been there since the first Spring release with multiplayer support, so I've been assuming it works fine (for those that got involved later than I did: I haven't been involved since the beginning...)
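A minimal Python sketch of the two mechanisms just described, under loud assumptions: Client.sim_frame() stands in for CGame::SimFrame(), the "NEWFRAME" string stands in for the real message, and ReliableSender is a hypothetical simplification of CNet that assumes cumulative acks (the real packet format is not given in the post).

```python
from collections import deque


class Client:
    """Advances the simulation only when a newframe message arrives."""

    def __init__(self):
        self.frame = 0
        self.inbox = deque()

    def receive(self, msg):
        self.inbox.append(msg)

    def sim_frame(self):
        # Stand-in for CGame::SimFrame(): one deterministic simulation step.
        self.frame += 1

    def update(self):
        # A slow machine calls update() less often, but it still runs exactly
        # one sim frame per newframe message, so it lags instead of diverging.
        while self.inbox:
            if self.inbox.popleft() == "NEWFRAME":
                self.sim_frame()


class ReliableSender:
    """Keeps every sent packet until it is acknowledged (assumed cumulative acks)."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = {}

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def on_ack(self, acked_seq):
        # Everything up to and including acked_seq has arrived.
        for seq in [s for s in self.unacked if s <= acked_seq]:
            del self.unacked[seq]

    def to_resend(self):
        return sorted(self.unacked.items())
```

In this model a client that processes its inbox late ends up on the same frame as a fast one once it catches up, which is why slow calculation alone shows up as lag rather than as a different game state.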

EDIT 2:
Delayed sync response messages (renamed to delayed response since 0.73 IIRC) just mean the server has already sent out the sync request for frame N+1 (the messages used to determine whether clients are still synced) at the moment it receives the sync response for frame N.

As you may have noticed delayed sync response messages always go hand in hand with "no sync response" messages. The server sends those at the moment it is going to send a new sync request and it hasn't received the response from that player yet.
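The bookkeeping behind these two messages can be sketched in Python as follows. SyncMonitor, its method names and the exact log wording are hypothetical; the sketch only assumes what is stated above: "no sync response" is reported when a new request goes out while a player's answer is still missing, and "delayed sync response" is reported when an answer for an older request arrives afterwards.

```python
class SyncMonitor:
    """Server-side tracking of sync requests and responses (illustrative only)."""

    def __init__(self, players):
        self.players = list(players)
        self.current_request = None  # frame number of the outstanding request
        self.answered = set()
        self.log = []

    def send_request(self, frame):
        # Before sending a new request, report everyone who has not yet
        # answered the previous one.
        if self.current_request is not None:
            for player in self.players:
                if player not in self.answered:
                    self.log.append(
                        f"no sync response from {player} for frame {self.current_request}"
                    )
        self.current_request = frame
        self.answered = set()

    def on_response(self, player, frame):
        if frame != self.current_request:
            # The answer belongs to an older request: it arrived too late.
            self.log.append(f"delayed sync response from {player} for frame {frame}")
            return
        self.answered.add(player)
```

A player who answers the request for frame 60 only after the request for frame 120 has gone out therefore produces both messages, in that order, which matches the observation that the two always appear together.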

In short, sync request/response messages can be seen as some sort of ping/pong sequence that happens every 2 seconds (wallclock time, not gametime) to check whether clients are still synced. The message itself includes five checksums: one each for the X, Y and Z components of the position of all units, one for metal (M) and one for energy (E) of all players. The checksums are calculated by XOR'ing together the individual elements (i.e. x checksum = unit1.pos.x ^ unit2.pos.x ^ unit3.pos.x ^ ...). Hence the XYZME modifier behind "Sync error" messages.
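The five checksums can be sketched in Python like this. The dictionary layout of units and players is made up for the example, and the sketch assumes the values are XOR'ed as the raw bit patterns of 32-bit floats, which the post does not state explicitly.

```python
import struct
from functools import reduce


def float_bits(x):
    """Bit pattern of x stored as a 32-bit float (assumed storage type)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def xor_checksum(values):
    """XOR together the bit patterns of all values."""
    return reduce(lambda acc, v: acc ^ float_bits(v), values, 0)


def sync_checksums(units, players):
    """The five checksums: unit positions X, Y, Z, then metal and energy."""
    return {
        "X": xor_checksum(u["pos"][0] for u in units),
        "Y": xor_checksum(u["pos"][1] for u in units),
        "Z": xor_checksum(u["pos"][2] for u in units),
        "M": xor_checksum(p["metal"] for p in players),
        "E": xor_checksum(p["energy"] for p in players),
    }


def sync_error_modifier(host, client):
    """Letters of the checksums that differ, in XYZME order."""
    return "".join(k for k in "XYZME" if host[k] != client[k])
```

If a client's copy of one unit has drifted only along x, the modifier comes out as "X". Note that XOR is order-independent and that identical bit differences in two elements cancel each other out, so a matching checksum is strong evidence of being in sync but not a proof.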

@LordMatt:

The sync debug build should sync with the non-syncdebug build. However...

@malric:

Since the syncdebugger has a client side and a server side (to compare the history), it doesn't work at all if one of the clients doesn't run a binary with the syncdebugger enabled. (Plus it probably errors out with "Unknown net msg" errors on the clients, because the syncdebugger specific net messages aren't coded in if syncdebug isn't..)
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

You can upload desynced replays here:
http://www.osrts.info/~tvo/spring/upload_replay.php

I'll run them through valgrind soonish.
Only replays made with the latest version of spring please.