Thoughts about sync/resync

ivand · Post by **ivand** » 19 Mar 2015, 23:10

Hi!

I've spent a few hours today trying to understand how synchronization between server & clients is checked and whether anything could be done to implement resync relatively easily. I'm not a C++/programming expert, so feel free to correct me if I'm wrong. Here are my thoughts anyway.

If SYNCCHECK is defined, it looks like the sync check is based on evaluating every change that happens to variables of Synced* types, plus a few ASSERT_SYNCED() calls. The hash of those changes is then accumulated into an unsigned int variable (g_checksum). This checksum is sent to the server every frame, and if it doesn't match the majority consensus checksum, the client that sent the erroneous checksum is declared desynced. Game over.
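To make the server side of that concrete, here is a minimal sketch of a majority-consensus check. It is not Spring's code: `ChecksumReport`, `findDesyncedClients` and the tie-breaking rule are names and assumptions made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical per-frame report: one checksum from one client.
struct ChecksumReport {
    int clientId;
    std::uint32_t checksum;
};

// Returns the ids of all clients whose checksum differs from the most
// common one. On a tie the numerically smallest checksum wins, which is an
// arbitrary rule chosen only for this sketch.
std::vector<int> findDesyncedClients(const std::vector<ChecksumReport>& reports) {
    std::map<std::uint32_t, int> votes;
    for (const ChecksumReport& r : reports) {
        ++votes[r.checksum];
    }

    std::uint32_t consensus = 0;
    int bestCount = 0;
    for (const auto& entry : votes) {
        if (entry.second > bestCount) {
            bestCount = entry.second;
            consensus = entry.first;
        }
    }

    std::vector<int> desynced;
    for (const ChecksumReport& r : reports) {
        if (r.checksum != consensus) {
            desynced.push_back(r.clientId);
        }
    }
    return desynced;
}

// Usage example: clients 1 and 2 agree, client 3 is the odd one out.
void exampleMajorityCheck() {
    const std::vector<ChecksumReport> reports = {
        {1, 0xABCDu}, {2, 0xABCDu}, {3, 0x1234u}};
    const std::vector<int> desynced = findDesyncedClients(reports);
    assert(desynced.size() == 1 && desynced[0] == 3);
}
```

Note that a single number per frame only says who disagrees, not which piece of state disagrees.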

In my view the current scheme (if I got it right) is fine as far as detecting a desync goes, but it may have a few flaws:
1. It's impossible to tell exactly which synchronized component caused a desync. (Yes, it looks like there is a SYNCDEBUG tracer, which will likely outline where the desync happened, but it requires a special spring build.)
2. It's impossible to send a partial state update, because the root cause of the desync remains unknown. Resync is doable, but requires a full synchronized state transfer.
3. Every change to a synced variable causes hashing overhead. As a synced variable may be modified several times during a simulation frame, the overhead stacks up.

What I'd like to discuss is an alternative approach to sync detection & resync. What if every class holding synced primitives had a checksum function as part of it? In other words, the hashing/checksum mechanism is applied to a synchronized class entity rather than to individual variables. Hashes could then be "summed up" to represent a homogeneous group of entities and finally combined into the "frame state hash", being the "sum" of the aforementioned groups.

UnitX.crc=hash of synced variables
activeUnits.crc=sum of UnitX.crc
....
same for map, projectiles, features and other entities that must be in sync.
....
frame.crc=sum of map.crc, activeUnits.crc, projectiles.crc, features.crc, etc...

An approach like this is likely to solve all three issues listed above:
1. It's possible to identify the sync entity that caused the desync once resync has completed. It's precise down to an individual entity, and it's also possible to dump both the correct and the erroneous state of such an entity to the log file in order to narrow it down to an individual sync variable.
2. It's possible to resynchronize a desynced client using only a subset of the global sync state. Upon receiving a wrong frame.crc from a client, the resynchronization protocol will first request the CRCs of the type groups (map.crc, activeUnits.crc, projectiles.crc, features.crc), next discover exactly which entity caused the issue, and finally request a correct copy of it.
3. CRC calculation could be done on demand, rather than on every change of an individual sync variable.
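
A rough sketch of the per-entity idea and the drill-down step, under stated assumptions: `UnitState`, `unitCrc`, `groupCrc` and `mismatchedIds` are hypothetical names, FNV-1a merely stands in for whatever hash would really be used, and the real synced state of a unit is far larger than four fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical synced state of one unit.
struct UnitState {
    float x, y, z;
    std::int32_t health;
};

// FNV-1a over raw bytes, continuing from `hash`.
std::uint32_t hashBytes(const void* data, std::size_t size, std::uint32_t hash) {
    assert(data != nullptr || size == 0);
    const unsigned char* bytes = static_cast<const unsigned char*>(data);
    for (std::size_t i = 0; i < size; ++i) {
        hash ^= bytes[i];
        hash *= 16777619u;
    }
    return hash;
}

// "UnitX.crc": hash the id and each field separately, so that struct padding
// never leaks into the checksum.
std::uint32_t unitCrc(int id, const UnitState& u) {
    std::uint32_t h = 2166136261u;
    h = hashBytes(&id, sizeof id, h);
    h = hashBytes(&u.x, sizeof u.x, h);
    h = hashBytes(&u.y, sizeof u.y, h);
    h = hashBytes(&u.z, sizeof u.z, h);
    h = hashBytes(&u.health, sizeof u.health, h);
    return h;
}

typedef std::map<int, UnitState> UnitTable;     // unit id -> state
typedef std::map<int, std::uint32_t> CrcTable;  // unit id -> crc

CrcTable unitCrcs(const UnitTable& units) {
    CrcTable crcs;
    for (const auto& entry : units) {
        crcs[entry.first] = unitCrc(entry.first, entry.second);
    }
    return crcs;
}

// "activeUnits.crc": wrapping sum of the per-unit CRCs.
std::uint32_t groupCrc(const CrcTable& crcs) {
    std::uint32_t sum = 0;
    for (const auto& entry : crcs) {
        sum += entry.second;
    }
    return sum;
}

// Drill-down: ids whose CRC differs between the reference copy and the
// client's copy, or that exist on only one side.
std::vector<int> mismatchedIds(const CrcTable& reference, const CrcTable& client) {
    std::vector<int> ids;
    for (const auto& entry : reference) {
        const CrcTable::const_iterator it = client.find(entry.first);
        if (it == client.end() || it->second != entry.second) {
            ids.push_back(entry.first);
        }
    }
    for (const auto& entry : client) {
        if (reference.find(entry.first) == reference.end()) {
            ids.push_back(entry.first);
        }
    }
    return ids;
}
```

frame.crc would combine the group CRCs the same way. A plain sum is order-independent, which is convenient, but it is also weak: two unrelated errors can cancel each other out, so a real implementation would probably want a stronger combiner.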

However, I'm not sure about a few things:
1. The best way to introduce a crc() function to sync entities.
1a. How to define the list of such entities?
1b. How to write crc() in a D.R.Y. way?

2. The synced part of Lua seems to have its own state, not necessarily reflected in regular C++ synced entities.
2a. Can the Lua stack be CRC'ed and uploaded/downloaded?
2b. Should mods/maps explicitly declare whether they support resync (if 2a is not possible)?

A bit incoherent and likely flawed, still I hope this could ignite a healthy discussion.

PicassoCT · Post by **PicassoCT** » 19 Mar 2015, 23:22

Sorry, I did not really understand your post.

Is this a different kind of sync from the one I know? Because if Spring desyncs, you are in for a debug hug down the funny float error line.

Every float in floating point determinism is a possible source of a desync. But even if you detect a class having desyncs, you do not know why it has them.
Some maps even cause desyncs.

Everything else is just re-requesting the same packet from the server. So sure, you can freeze a game and try to send all the commands a player has lost, so that he can catch up. Or send him the whole replay so far, so he can speed through the sim.

You cannot repair desyncs. By their very nature, they start out very subtle and often propagate along unexpected lanes before aggregating and showing up in the checksum.

Sorry, I'm tired. Maybe I should not post.

Silentwings · Post by **Silentwings** » 19 Mar 2015, 23:54

Desyncs are extremely rare with current versions of Spring (although this was not always the case) so tbh I'd say there isn't a need for anything more elaborate than what syncdebug builds offer.

There is often no visible dividing line between what is synced and what is not in the engine's code. The only way to know is experience & hard work; for this reason, what you're suggesting is not easy to implement properly. The same is true of your "resync", which is essentially the "why doesn't spring have save/load" issue.

ivand · Post by **ivand** » 20 Mar 2015, 10:05

Are you saying that resync is not needed? If I'm not missing something, this should also give the ability to instajoin a running game.

Silentwings wrote: There is often no visible dividing line between what is synced and what is not in the engine's code.

If the current spring code is correct, then synchronized variables are mostly defined with Synced* primitives (like SyncedFloat, SyncedFloat3 or SyncedInt). From what I saw in the code, there is nothing special about Synced* primitives: if SYNCCHECK is not defined, such types become regular float, float3 or int. The only difference I've observed in the usual case is that changes to Synced variables are hashed into a cumulative checksum variable. Maybe I've missed something.
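
To illustrate what I mean, here is a simplified sketch of such a wrapper. It is my own approximation and not the engine's actual code: `DemoSynced`, `g_demoChecksum`, `mixIntoChecksum` and the typedef names are invented, and only the SYNCCHECK macro name comes from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the engine's running checksum.
static std::uint32_t g_demoChecksum = 0;

// Folds the bytes of a value into the running checksum (FNV-1a style step).
template <typename T>
void mixIntoChecksum(const T& value) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        g_demoChecksum = (g_demoChecksum ^ bytes[i]) * 16777619u;
    }
}

// Wrapper that reads like a plain T but checksums every assignment.
template <typename T>
class DemoSynced {
public:
    DemoSynced(T initial = T()) : value_(initial) {}

    DemoSynced& operator=(T newValue) {
        value_ = newValue;
        mixIntoChecksum(value_);
        return *this;
    }

    operator T() const { return value_; }

private:
    T value_;
};

// With the check compiled out, the "synced" type is just the plain type.
#ifdef SYNCCHECK
typedef DemoSynced<float> DemoSyncedFloat;
typedef DemoSynced<int> DemoSyncedInt;
#else
typedef float DemoSyncedFloat;
typedef int DemoSyncedInt;
#endif

// Usage example: a write moves the checksum, a read behaves like an int.
void exampleSyncedWrite() {
    g_demoChecksum = 0;
    DemoSynced<int> health(100);
    health = 95;                  // this write is folded into the checksum
    assert(health == 95);         // reads go through the conversion operator
    assert(g_demoChecksum != 0);  // the checksum has moved away from zero
}
```

The point of the sketch is that only writes going through the wrapper ever reach the checksum; anything stored in a plain variable is invisible to it.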

Silentwings · Post by **Silentwings** » 20 Mar 2015, 10:10

No, that's not what I said. I said that better ways of debugging sync errors are (afaik) not needed.

Imo resyncing would be a very useful feature, although its primary usage would be save/load/insta-join rather than repairing sync errors. Whether it's "needed" is ofc a personal opinion. As said above, it's not easy to implement, for the same reason that makes your sync debug approach hard to fully implement.

Sometimes the code makes it obvious that some variable is synced and sometimes it doesn't. E.g. https://github.com/spring/spring/commit ... a9db324453.

ivand · Post by **ivand** » 20 Mar 2015, 11:15

Silentwings wrote:Sometimes the code makes it obvious that some variable is synced and sometimes it doesn't. E.g. https://github.com/spring/spring/commit ... a9db324453.

Perhaps it's "replay-sync", but I'm not sure that makes it part of the global sync state. At least in the commit excerpt I don't see ping values put into any Synced* primitive. If it's not put into a Synced* primitive, then it doesn't affect the checksum that is inspected to determine a desync.

Anarchid · Post by **Anarchid** » 20 Mar 2015, 11:59

Instant join sounds like something I'd donate towards.

Silentwings · Post by **Silentwings** » 20 Mar 2015, 13:40

ivand wrote: If it's not put into a Synced* primitive, then it doesn't affect the checksum that is inspected to determine a desync.

The sync check does not check all synced data.

There is no division, from the engine's point of view, between "synced because of the current game" and "synced because of the replay", because e.g. synced lua in principle has access to all synced data (such as ping) and can use it to alter the game state in real time. For example, it could alter the effect of orders given by players with too high a ping; if that caused a desync, it would be picked up easily by the sync check, even though the sync check didn't explicitly include ping. The synced lua states don't even know whether they are part of a replay or not.

Working out a foolproof way of determining what has to be resynced in a save-load/insta-join needs someone with more code reading time than me. I don't think I can be much more help here.

Super Mario · Post by **Super Mario** » 20 Mar 2015, 14:04

Edit: Replaced quote of the entire previous post.

Silentwings wrote: Working out a foolproof way of determining what has to be resynced in a save-load/insta-join needs someone with more code reading time than me. I don't think I can be much more help here.

Cast a ritual to summon kloot.

PicassoCT · Post by **PicassoCT** » 20 Mar 2015, 14:55

@instant join:

It is technically not feasible. You can create snapshots of the world and, starting from one of them, race towards the current simulation in a sped-up headless simulation.
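
As a toy illustration of that snapshot-plus-catch-up idea (not engine code: `SimState`, `Command`, `stepFrame` and `catchUp` are invented names, and the "simulation" is just deterministic arithmetic):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy world state; a real snapshot would hold every synced entity.
struct SimState {
    int frame;
    std::uint64_t value;
};

// A player command scheduled for a specific frame (hypothetical format).
struct Command {
    int frame;
    std::uint64_t payload;
};

// One deterministic simulation step: apply this frame's commands, then
// advance. Unsigned arithmetic wraps, so every client gets the same result.
void stepFrame(SimState& state, const std::vector<Command>& commandLog) {
    for (const Command& c : commandLog) {
        if (c.frame == state.frame) {
            state.value += c.payload;
        }
    }
    state.value = state.value * 6364136223846793005ULL + 1442695040888963407ULL;
    ++state.frame;
}

// Start from a snapshot and re-simulate, without rendering, until the
// joining client has reached the frame the running game is at.
SimState catchUp(SimState snapshot, const std::vector<Command>& commandLog, int targetFrame) {
    assert(snapshot.frame <= targetFrame);
    while (snapshot.frame < targetFrame) {
        stepFrame(snapshot, commandLog);
    }
    return snapshot;
}
```

This only works if the snapshot really contains all state the simulation depends on and every step is fully deterministic, which is exactly the hard part.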

Real instant join would imply constant drops of the whole sim state (which is an enormous look & save operation, and transferring the whole world state would demand a lot of bandwidth).

Instant join and the save/load problem are closely related.

ivand · Post by **ivand** » 20 Mar 2015, 15:46

@PicassoCT.

I don't think instajoin is technically impossible because of the size of the state. Even if the amount of sync state per unit/projectile/feature/whatever is 1 kilobyte (which is a lot: 256 floats) and the total number of them is in the range of 5000, then it's only about 5 MB of data.
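
Spelled out as a compile-time check, using only the rough guesses from this post (they are not measurements, and the constant names are made up):

```cpp
#include <cstddef>

// Figures taken from the estimate above; rough guesses, not measurements.
constexpr std::size_t kBytesPerFloat = 4;
constexpr std::size_t kFloatsPerEntity = 256;
constexpr std::size_t kEntityCount = 5000;

constexpr std::size_t kBytesPerEntity = kFloatsPerEntity * kBytesPerFloat;
constexpr std::size_t kTotalBytes = kEntityCount * kBytesPerEntity;

static_assert(sizeof(float) == kBytesPerFloat, "this sketch assumes 4-byte floats");
static_assert(kBytesPerEntity == 1024, "256 floats are exactly 1 KiB");
static_assert(kTotalBytes == 5120000, "5000 entities of 1 KiB each: about 5 MB");
```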

What I'm really concerned about is the Lua part. Many gadgets have a lot of internal state and were written with no reentrancy in mind. Yesterday I took a quick look at how unit/team params are stored, and it looked like they were stored in C++ maps. As for gadgets' local variables, these might be impossible to replicate.

abma · Post by **abma** » 23 Mar 2015, 01:58

imo resync isn't needed as the cause for a desync should be fixed. what if some error is triggered every game frame? this would slow down the game way too much.

for instant join imo load/save has to be fixed first.

some pointers:

https://github.com/spring/spring/tree/d ... ystem/creg

also maybe see the output of spring --test-creg

sorry, idk details about sync checking, not sure if creg is used for that.

search for ENTER_SYNCED_CODE / LEAVE_SYNCED_CODE in source code?!

I'm not sure if all synced variables are used to calculate checksum, imo only a few have to be used as if a desync happens A LOT of variables will be different. it would be very weird if only one variable is different when game desyncs.

edit: https://github.com/spring/spring/blob/d ... s.cpp#L517

->

https://github.com/spring/spring/blob/e ... ect.h#L277

-> only some variables are sync-checked.

if you want to figure out possible optimizations just compile with sync-checking disabled and check if it's significantly faster than with sync checking enabled.

ivand · Post by **ivand** » 23 Mar 2015, 10:15

abma wrote: imo resync isn't needed as the cause for a desync should be fixed. what if some error is triggered every game frame? this would slow down the game way too much.

Surely that's the proper approach in theory, although spring is something like a 10-year-old project and desyncs still happen.

abma wrote: for instant join imo load/save has to be fixed first.

Yeah, all three features (instajoin, load/save and resync) are based on the same foundation: having an API to load/save the synchronized state completely or partially. Instajoin would require additional changes to the network protocol and probably to the replay file structure.

abma wrote: some pointers:

https://github.com/spring/spring/tree/d ... ystem/creg

also maybe see the output of spring --test-creg

sorry, idk details about sync checking, not sure if creg is used for that.

If anyone could explain the magic behind creg, that would be nice to hear. Searching the forums didn't help much. Also, to me it looked like classes/structures in the source code were creg'ed regardless of whether or not they were supposed to be 'serializable'. It looks like at some point creg'ing an entity became a bit of a cargo cult. I was expecting that only synced/serializable stuff would be creg'ed, but I don't think that's the case.

abma wrote: search for ENTER_SYNCED_CODE / LEAVE_SYNCED_CODE in source code?!

I did that. There are not many occurrences of such pairs. One of them is applied at the very top of the game hierarchy, which basically says that whatever happens in the Update() call ought to be synchronous. That doesn't help much from a practical perspective.

abma wrote: I'm not sure if all synced variables are used to calculate checksum, imo only a few have to be used as if a desync happens A LOT of variables will be different. it would be very weird if only one variable is different when game desyncs.

I came to the same conclusion eventually. While only Synced* variables and variables dragged into ASSERT_SYNCED() cause checksum changes, the amount of implicit sync state is way larger. Moreover, I no longer think there is a precise way to tell which variables are sync-relevant and which are not.

abma wrote:
edit: https://github.com/spring/spring/blob/d ... s.cpp#L517

->

https://github.com/spring/spring/blob/e ... ect.h#L277

-> only some variables are sync-checked.

if you want to figure out possible optimizations just compile with sync-checking disabled and check if it's significantly faster than with sync checking enabled.

I stumbled upon the same code blocks you mentioned. I even added a checksum function to the Synced* vars from https://github.com/spring/spring/blob/e ... ect.h#L277, but afterwards I realized that not all synced vars are marked as Synced*.
