Gadget Failure (long)
Originally, I was going to type this one up in Feature Requests, but since this is a non-trivial change in the Spring Engine's behavior, I thought it would be more appropriate to put it here.
I'd like to see a major change in Gadget behaviors, allowing for easier maintenance and error handling.
What:
1. Gadgets that fail during Initialize() would be removed from the Lua virtual machine, as they are currently. If they failed in a way that would prevent the virtual machine from operating, then the virtual machine needs to be restarted, sans the offending Gadget.
2. Gadgets would have an additional tag, "failureTolerance", where if more than X errors occur, the Gadget is removed from the Lua virtual machine. Or some other failure approach needs to be constructed (see further discussion below about specific instances).
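A minimal sketch of the "failureTolerance" idea in plain Lua, assuming the handler wraps each callin in pcall and keeps a per-Gadget error count. The names newGuard, guard:call, and the removed flag are hypothetical and are not Spring API; a real handler would need to do this inside its own dispatch loop.

```lua
-- Hypothetical handler-side sketch; none of these names are Spring API.
local function newGuard(gadget, failureTolerance)
  local guard = { gadget = gadget, errors = 0, removed = false }

  -- Call one callin by name; returns true only if it ran without error.
  function guard:call(callinName, ...)
    if self.removed then return false end
    local fn = self.gadget[callinName]
    if type(fn) ~= "function" then return false end
    local ok, err = pcall(fn, self.gadget, ...)
    if ok then return true end
    self.errors = self.errors + 1
    self.lastError = err
    -- "more than X errors" marks the gadget as removed
    if self.errors > failureTolerance then
      self.removed = true
    end
    return false
  end

  return guard
end

-- Usage: a gadget whose GameFrame always fails, with a tolerance of 2 errors.
local flaky = { GameFrame = function(self, n) error("boom at frame " .. n) end }
local guard = newGuard(flaky, 2)
for frame = 1, 5 do guard:call("GameFrame", frame) end
print(guard.errors, guard.removed)
```

The third failure exceeds the tolerance, so the remaining frames are skipped. This only isolates the failing Gadget; it says nothing about whether removing it keeps clients in sync.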
Why:
This would prevent two very big problems with Gadgets:
1. Gadget failure can 'cascade', causing the entire LuaRules state to crash. One bug in a minor Gadget can pull down tens of thousands of lines of perfectly-good code. This doesn't happen with Widgets; it shouldn't happen with Gadgets, either.
2. Game designers can't possibly anticipate, test, or eradicate such Gadget failures, because so much of it's inherent to hardware oddities. The differences in hardware or software (if we're talking about Linux builds) mean that there are just too many things that can and do go wrong, outside our test platforms.
I've seen this problem, time and again, where code that works perfectly on the hardware I can test for not only fails, but drags down the entire LuaRules state, on other hardware. This is often related to OpenGL problems, and can be mitigated somewhat with safety code, culling the Gadgets if they fail.
But it's certainly not limited to just OpenGL- a function that requires an input variable from something else to operate may crash if that value isn't correct, bringing down all the other code in the project.
Lastly, but probably most critically... the engine needs to be aware of whether a Gadget is game-critical to the state of sync, in terms of gameplay (i.e., if we've lost it, we might as well tell the players they've desynced due to Player X having a critical failure), or not.
If, say, a Gadget that is graphics-oriented fails, and is removed on one hardware platform due to gadgetHandler:RemoveGadget(), wouldn't that cause an instant desync with all other clients, where the Gadget is operating normally?
Whether or not that's true, there are still a lot of issues with the current model.
There are quite a few graphical operations where it's terrifically impractical to put all of the OpenGL operations into Widget code where it can fail gracefully, due, in part, to there being no equivalent to SendToUnsynced() for Gadgets to communicate with Widget counterparts. I've gotten around this problem by doing parsing operations with LuaUI Messages, but that's a second-best operation, and largely impractical because of the speed issues involved with parsing.
One of the easier end-runs around the OpenGL issues in general may be to add a variant of RecvFromSynced() to LuaUI, so that Widgets are allowed to receive messages sent to them (in particular).
Obviously, this has some major security implications- there would have to be a way to ensure that only Widget X can get message Y. This can be done, but it's important that if this is added, that it's a primary consideration in terms of design- we don't want Widgets that can "spy" on synced Lua data, obviously.
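As a rough illustration of "only Widget X can get message Y", here is a plain-Lua sketch of addressed delivery. Everything in it (router, register, send, the widget names) is hypothetical; it shows the addressing only, and does not cover how the engine would verify a Widget's identity, which is the hard part.

```lua
-- Hypothetical router: every name here is illustrative, not Spring API.
local router = { handlers = {} }

-- A widget registers for one message name under its own name.
function router:register(widgetName, messageName, handler)
  self.handlers[messageName] = { owner = widgetName, fn = handler }
end

-- The sender addresses a message to one widget; anything else is dropped.
function router:send(targetWidget, messageName, ...)
  local entry = self.handlers[messageName]
  if not entry or entry.owner ~= targetWidget then
    return false
  end
  -- pcall keeps a failing widget handler from propagating to the sender
  return pcall(entry.fn, ...)
end

local received = {}
router:register("fx_widget", "explosion", function(x, z)
  received[#received + 1] = { x, z }
end)

local delivered = router:send("fx_widget", "explosion", 100, 200)  -- reaches fx_widget
local leaked = router:send("spy_widget", "explosion", 100, 200)    -- dropped
```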
The other way that issue could be repaired is to stop the operation of the unsynced sections of a Gadget, whilst allowing it to continue to SendToUnsynced(). I have no idea how that could be made practical, given the way that the Lua state loops are designed, however.
Last edited by Argh on 16 Mar 2010, 01:03, edited 1 time in total.
Re: Gadget Failure (long)
As far as I am aware, isn't all of this possible using a modified gadgetHandler?
Re: Gadget Failure (long)
IDK whether a GadgetHandler is able to edit the Lua or the virtual machine's state, and it certainly has no say in terms of whether player A continues to run Gadget Y, while player B has had a failure, but is still allowed to keep the same sync state.
This non-differentiation between critical processes and non-critical things is very important. All Gadgets are presumed to be critical processes, whereas Widgets aren't. The problem is that many Gadgets aren't actually critical, yet can't be made Widgets, because they need access to certain game data. That's why I was thinking that a variant of SendToUnsynced() might be the best end-run... graphics operations could then be moved to the Widget space, where they can fail without causing issues in the game state. That's probably the easiest way to change the engine, too.
Re: Gadget Failure (long)
Quote: For many applications, you do not need to do any error handling in Lua. Usually, the application program does this handling. All Lua activities start from a call by the application, usually asking Lua to run a chunk. If there is any error, this call returns an error code and the application can take appropriate actions.
That's the important thing here. Only Spring can be aware that a Gadget has crashed the entire LuaRules state.
While I suppose it's theoretically possible to encapsulate things in pcall... in practical terms, not really. Where's "critical", when even one failure during Initialize(), for example, will crash the entire LuaRules state?
Moreover, let's say that I open up every Initialize() function and encapsulate it, and use gadgetHandler:RemoveGadget() to kill it if it fails.
That still doesn't prevent two distinct game states on two different machines. Which is why I'm leaning towards some sort of "firewalled" unsynced Lua (whether Widgets, or something new) that is allowed to crash without altering the sync state.
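For reference, the encapsulation described above might look roughly like this in plain Lua. The gadgetHandler table here is a stand-in written for the example, and SafeAdd is a hypothetical name; only the pcall-then-remove pattern is the point.

```lua
-- Stand-in for the real gadgetHandler; only the shape matters here.
local gadgetHandler = { gadgets = {} }

function gadgetHandler:RemoveGadget(gadget)
  for i, g in ipairs(self.gadgets) do
    if g == gadget then
      table.remove(self.gadgets, i)
      return
    end
  end
end

-- Add a gadget, but keep it only if Initialize() runs without error.
function gadgetHandler:SafeAdd(gadget)  -- SafeAdd is a hypothetical name
  self.gadgets[#self.gadgets + 1] = gadget
  if gadget.Initialize then
    local ok, err = pcall(gadget.Initialize, gadget)
    if not ok then
      self:RemoveGadget(gadget)
      return false, err
    end
  end
  return true
end

local good = { name = "good", Initialize = function() end }
local bad  = { name = "bad",  Initialize = function() error("no GLSL support") end }

local okGood = gadgetHandler:SafeAdd(good)
local okBad, why = gadgetHandler:SafeAdd(bad)
```

After this runs, only the "good" gadget is left in the list. As noted above, this contains the failure locally but does not by itself make two machines agree on which Gadgets are running.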
Re: Gadget Failure (long)
Argh wrote: Only Spring can be aware that a Gadget has crashed the entire LuaRules state.
Incorrect, as are all your current assumptions/requests.
Everything is doable atm (via pcall and custom gadgetHandler).
Re: Gadget Failure (long)
What about the sync state, should the GadgetHandler be modified to reload with a new set of Gadgets, should one cause a major failure?
I'll take a look at the unsynced side, see how safety can be improved without invoking gadgetHandler:RemoveGadget... that, at least, should be possible.
Re: Gadget Failure (long)
AFAIK only LuaUI gets unloaded when there are too many errors in it.
Also, even for LuaUI, it's handled via a pcall in the C++ interface, so if you use pcall in your Lua code, the pcall in C++ would never see the error (only the pcall on the top of the stack gets called on errors).
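The nesting behaviour can be checked in plain Lua. In this sketch the outer pcall merely stands in for the engine-side protected call (an assumption for illustration); the inner one is what a Gadget or Widget author would write.

```lua
local outerSawError = false

local function callin()
  -- inner pcall, as a gadget author might write it
  local ok, err = pcall(function() error("shader compile failed") end)
  return ok, err
end

-- outer pcall stands in for the engine-side protected call
local outerOk, innerOk, innerErr = pcall(callin)
if not outerOk then outerSawError = true end
```

The inner pcall consumes the error, so the outer call reports success and any error counter attached to it never moves.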
Re: Gadget Failure (long)
A gadget crashing on one machine but not the others assumes that a desync has already occurred to cause this, otherwise the gadget would crash on each and every synced machine.
Or at least basic logic would assume so.
Should a gadget crash there's nothing stopping you telling the other machines in the game that it has blown itself up.
Graceful handling as always. If you can handle errors on the Lua side of the divide, you can recover and continue. Application error handling is primarily useful in that you can then handle a failure in your system and either initiate a failsafe, reset the entire Lua VM, or use your own application error reporting system.
Either way it is all totally unnecessary, and what you want can and should be done on your end of the code, since for other people with other requirement specifications it could cause problems, as they might want different handling behaviour.
Re: Gadget Failure (long)
Quote: A gadget crashing on one machine but not the others assumes that a desync has already occurred to cause this, otherwise the gadget would crash on each and every synced machine.
So, yes, it causes a desync state for that client.
So all of our attempts to write safety code via gadgetHandler:RemoveGadget() are, in fact, a primary threat to sync, and should be deprecated in favor of safer methods.
Quote: Eitherway it is all totally unnecessary, and what you want can and should be done on your end of the code, since to other people and other requirement specifications it could cause problems as they might want different handling behaviour
Name one reason why it would not be desirable to see more robust handling of LuaRules errors and fewer causes of desync. That makes no sense to me at all, frankly. There are no good reasons why we should prefer that code failures not be handled as safely as is practical, especially for the many, many things we want to do that are using Gadgets for graphical behaviors.
Anyhow... I'm working on this now, should have some results soon.
Re: Gadget Failure (long)
Because for some it may be intolerable that a gadget fail even once.
You also fail at basic comprehension. I said that if one machine crashes and the other machines do not, then there must be a difference between the two machines, and this would be a desync.
As such, if both machines were in sync before the crash, then both machines will crash with the same error, and both machines will have the same LuaRules problem, and so remain in sync: identical erroneous states, but still in sync.
If however you are correct, then this must mean that a desync happened prior to the LuaRules crash, and that the desync is actually the cause of the LuaRules crash, not the gadget in question.
Basic cause and effect, go read up on it.
To demonstrate, what you are saying is:
The desync caused the desync
Re: Gadget Failure (long)
If there are three clients, and only one of them crashed, two of them remain in sync. Ideally, short of game-critical code (which should crash on all hardware simultaneously), this is not the desired behavior.
And lastly, the "no errors are acceptable" argument is a very silly straw man. We're all aware that complex software is going to have issues, and Spring itself is chock-full of error-handling tricks to keep them from ruining the experience. I merely want Gadgets to have a similar level of safety, if possible. I can't imagine a real project coder here not wanting that as well.
I'm not saying that we don't need reports of errors, that mission-critical gamecode failures shouldn't result in desync- they can and should do so. There's a manifest difference, though, when we're talking about primarily graphical operations developed for UI aid or to enhance the experience. These are things where it's important to develop fail-soft methods, and I'm totally amazed that you want to have us believe that 100% perfection's even attainable, because even the giant software houses don't attain that, even when only operating under Windows, and they can afford to perform far more testing than is practical for any Spring game.
With this sort of code, due to hardware and driver issues, it's a mess, and there's no way to be absolutely sure that your code will be error-free on all platforms, regardless of how much testing you do before release.
I've seen countless problems where I can't replicate it on my available hardware, AF. We need things to fail gracefully and without ruining people's overall experience- it's one thing if some graphics don't show, it's another thing entirely when they lose the game they've been playing for half an hour with other people, or simply can't play multiplayer games because they insta-desync with others.
Re: Gadget Failure (long)
Argh wrote: I can't imagine a real project coder here not wanting that as well.
I don't want that. When the gadget error has escalated to a flood of errors in the gadget handler, it's usually too late to recover properly: the state of my Lua tables is a mess, and attempts to recover would cause even more issues.
I found it works better to write solid code in the gadget itself, for instance always checking that unit IDs are valid and units are not dead, and if there's ever a piece of unsure code that you know could crash, to use pcall to handle the error in the gadget itself, at an appropriate level, where you know how to cut the bad code without ruining the rest.
A generic solution hard coded in the engine is phail because it cannot guess how I want it to handle the error, which is specific to each case.
But then you mention "graphics-oriented", "OpenGL", and "hardware platform", so I guess your problem is not in the synced part of the gadget, but in the unsynced part. And that changes the problem completely!
The unsynced part of a gadget is not game-critical, can be closed or restarted without affecting the syncedness with other clients, and should not cause the entire LuaRules to crash.
So please first decide whether your problem is a crash in the synced part of a gadget or a crash in the unsynced part!
Or maybe you are saying that errors in the unsynced, graphical operation of gadgets can cause errors in the synced part of gadgets?
Also, would you stop adding (long) to your thread titles? They're not that long, tbh, and like just about all of our forum readers, I have eyes that can see at a glance how many lines a post is, without the need for you to pre-categorize them as "long" or "very long". I guess it irks me because of the insinuation that we are all monkeys that can't read or write more than 2 lines and that you feel the need to humbly apologize for having a superior mind.
Re: Gadget Failure (long)
zwzsg wrote: Or maybe you are saying errors in the unsynced, graphical operation of gadgets, can mess and cause errors in the synced of gadgets?
That is exactly the issue.
That said, I don't know why having parameters for failure for synced code is such a horrible thing. You make it sound like I want something mysterious and complex- I just want a counter that counts errors, and if it goes over X, removes the offending Gadget automatically.
And if I don't write (long) then certain people whine about tl;dr. I guess I can't win on that.

Oh, and please don't use the "bug-free" argument, since I have yet to see one of your projects run bug-free, and you aren't releasing regularly merely to add new features... don't waste our time pretending otherwise. We've all released things with bugs, mmkay?
Re: Gadget Failure (long)
This guy made several posts longer than yours, has no reputation, and was never answered tl;dr
tl;dr can mean two things:
- I'm a spamming idiot.
- You spew a lot of B.S. for something that simple.
In tiny text Argh wrote: Oh, and please don't use the "bug-free" argument, since I have yet to see one of your projects run bug-free, and you aren't releasing regularly merely to add new features... don't waste our time pretending otherwise. We've all released things with bugs, mmkay?
Would love to see you finding a bug in Kernel Panic 4.0 escalating to the gadget handler level (or my "hotfix" gadget, which replaces part of the handler). So far the only one I know of is that if you /luarules reload while ONS is on, there's a single burst of error messages, that fixes itself after a frame or two. I'm not saying there are none though. I've fixed so many already after the last one. But then you have your particle system and are doing lots of fancy graphical stuff, while I avoid GLSL like plague exactly because I had bad experiences with it crashing gadgets.
Anyway, there's the "bug free" argument, but there's also the "handle the error yourself". When I was coding my map with tunnels and bridges, I used "pcall" a lot because I'd rather have some features not drawn than a flood of error. When I was coding my pseudo load game thing, I used pcall for every line I decode, because missing a unit in a savegame isn't as bad as failing the reload the entire save.
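A small plain-Lua sketch of that per-line approach. The record format ("name x z" on each line) and the function names are made up for the example and are not the actual savegame format.

```lua
-- Hypothetical save format: one "unitName x z" record per line.
local function decodeLine(line)
  local name, x, z = line:match("^(%S+)%s+(%-?%d+)%s+(%-?%d+)$")
  if not name then error("malformed record: " .. line) end
  return { name = name, x = tonumber(x), z = tonumber(z) }
end

local function loadUnits(lines)
  local units, skipped = {}, 0
  for _, line in ipairs(lines) do
    -- protect each record separately, so one bad line costs one unit
    local ok, result = pcall(decodeLine, line)
    if ok then
      units[#units + 1] = result
    else
      skipped = skipped + 1
    end
  end
  return units, skipped
end

local units, skipped = loadUnits({ "bit 10 20", "garbage", "byte -5 40" })
```

Putting the pcall around each record rather than around the whole load is the "appropriate level" choice: the bad record is dropped and the other two units still load.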
zwzsg wrote: Or maybe you are saying errors in the unsynced, graphical operation of gadgets, can mess and cause errors in the synced of gadgets?
Argh wrote: That is exactly the issue.
It wasn't that clear from your first posts.
AFAIK that would indeed be a bug in the engine. Build a minimal gadget showing a single error in a graphical, unsynced gadget cascading into a crash of the entire synced LuaRules state. Then the devs can examine it.
Re: Gadget Failure (long)
Argh, gadgets should not allow themselves to crash thanks to stuff like GLSL and unsynced code.
Unsynced code by its very nature is 'dangerous' and you should take every precaution.
Fail gracefully, but this is no excuse for not doing what zwzsg said. Even with all the safety nets in the world, you should implement your code's components as wholly as possible, contain internal errors in internal code, and control errors propagating to the API level of the code so that they're well formed.
Put quite simply, you shouldn't be in this kind of mess anyway.
Re: Gadget Failure (long)
So many weird assumptions in this thread.
1. Spring doesn't know what gadgets are, it only sees main.lua and draw.lua, which create a single global function for each callin they want.
2. You can't 'crash' the luarules state, you can only set up callins that don't work.
3. Each gadget is really two gadgets sharing the same file, one in main.lua and one in draw.lua, so
4. Synced is very isolated from unsynced. If you manage to do ANYTHING that affects local synced, not across the network, in ANY way, file a bug report immediately.
5. If you EVER have a synced gadget crash on one machine but not another, you have hit a very strong spring bug and need to report it. (unless you were using objects as table keys, that particular aspect of lua doesn't sync)
6. Unsynced code isn't dangerous. Unsynced code is kept from writing to synced data.
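On the table-key caveat in point 5: one likely reading (an assumption on my part) is that tables keyed by objects are hashed by memory address, so pairs() can walk them in a different order on each machine. A plain-Lua sketch of a deterministic alternative, using made-up unit objects that each carry a numeric id:

```lua
-- Units as objects; each carries a stable numeric id (illustrative data).
local unitA = { id = 7 }
local unitB = { id = 3 }
local unitC = { id = 5 }

-- Object-keyed table: pairs() order here depends on memory addresses,
-- which are not guaranteed to match between machines.
local health = { [unitA] = 100, [unitB] = 80, [unitC] = 60 }

-- Deterministic alternative: collect the keys, sort by id, then iterate.
local function sortedById(tbl)
  local keys = {}
  for obj in pairs(tbl) do keys[#keys + 1] = obj end
  table.sort(keys, function(a, b) return a.id < b.id end)
  return keys
end

local order = {}
for _, unit in ipairs(sortedById(health)) do
  order[#order + 1] = unit.id
end
```

The sorted walk visits ids 3, 5, 7 however pairs() happened to enumerate the table, so anything computed in that loop is the same everywhere.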