Framerate, after glGet changes

AF · Post by AF » 02 Dec 2007, 16:02

I think it'd be more productive to put that effort into removing the glGet calls themselves rather than profiling them.

Tobi · Post by **Tobi** » 02 Dec 2007, 17:26

Premature optimization is the root of all evil, and I have not yet seen any good measurement of glGet* execution time.

Argh · Post by **Argh** » 02 Dec 2007, 17:49

And for all we know, I might be completely wrong. Thus far, my early optimism hasn't been borne out by experiment.

It's just a hunch, based on the opinions I read- that bit with Gabba was the thing that really convinced me that this might be an issue.

But I don't really know enough about this stuff to have an informed opinion.

It just looks wrong- I always avoid running anything that requires a wait-state, and I run everything in my code as slowly as possible, and spread loads when I can. Calling for a big halt if it's not necessary seems wrong to me. But sometimes one has to wait for data to return from an operation before proceeding, and there isn't any way around it, period... the real question is, "where is this occurring within the loop?", I think- is it independent of the simulation, or is it synchronized? If it's not synchronized, then the real question is whether it's the fastest way to go about the stuff I see there, which looks like matrix transforms (but it might as well be magic to me- I don't get that stuff at all)...

AF · Post by AF » 02 Dec 2007, 18:11

Common consensus and theory side with the glGet slowdown hypothesis. In my opinion, this alone should warrant the removal of unnecessary glGet calls.

The reasoning also suggests that if you have 3 glGets, removing one will have little effect, and the real speedup would only occur once the third and final glGet was removed.
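One way to read that reasoning is as a toy cost model, sketched below. It assumes (and this is only the hypothesis under discussion, not something measured in this thread) that the first glGet in a frame forces one large CPU/GPU stall, while each further glGet only adds a small per-call cost. The function name `FrameTimeMs` and all numbers passed to it are illustrative placeholders.

```cpp
#include <cassert>

// Toy model of the hypothesis only; the millisecond figures passed in are
// illustrative placeholders, not measurements from Spring.
// Assumption: the first glGet in a frame forces the CPU to wait for the GPU
// (one large stall); every glGet additionally has a small per-call cost.
double FrameTimeMs(int glGetCalls, double workMs, double stallMs, double perCallMs) {
    assert(glGetCalls >= 0);
    const double stall = (glGetCalls > 0) ? stallMs : 0.0;
    return workMs + stall + glGetCalls * perCallMs;
}
```

Under these assumptions, going from 3 calls to 1 only saves two per-call costs, while removing the last call also removes the stall. Whether real drivers behave this way is exactly what profiling would have to show.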

Argh · Post by **Argh** » 02 Dec 2007, 18:45

It appears that both of the remaining glGets have been reworked by Kloot. Lemme take a look, with the LUA GUI stuff off (since that appears to have glGets in it), and see what performance looks like now.

Argh · Post by **Argh** » 02 Dec 2007, 19:08

<comes back from testing>

After comparing with 0.75b2, I can report that Kloot's changes resulted in approximately 10% speed increase for me, with as high as 20% under certain conditions (particle effects ran faster- with cegtags factored in, it appears they ran much faster).

I'll repeat the test with LUA GUI enabled, and see what we get...

<tests>

With LUA GUI re-enabled, tests came out with similar differences, but lower overall FPS- about a 3-5% drop, depending on the situation.

The main thing that I saw that was very interesting to me is that cegtag is incredibly dependent on what CEG is called, in terms of CPU usage- methinks I'd better remove all instances of CSimpleParticleSpawner wherever practical...

Needless to say, my results need to be verified. I should also say that I'm seeing most FPS slowdown due to CPU-side constraints- while the 7800GT I'm running is still not "slow", it's usually more than fast enough for Spring.

AF · Post by AF » 02 Dec 2007, 19:36

In the long run, moving the simulation and the rendering into two separate threads would allow higher frame rates, but it would likely be a hassle to actually do.

Argh · Post by **Argh** » 02 Dec 2007, 20:01

Yup. Such a move is probably entirely necessary to take full advantage of multiple CPUs, though.

Tobi · Post by **Tobi** » 02 Dec 2007, 21:27

AF wrote: Common consensus and theory side with the glGet slowdown hypothesis. This alone in my opinion should warrant the removal of unnecessary glGet calls.

The reasoning also suggests that if you have 3 glGets, removing one will have little effect, and the real speed up would only occur once the third and final glGet was removed.

Consensus and theory are nice, but it's stupid to optimize without comparing hard numbers before and after. Also, the consensus here is nothing more than everyone repeating to each other that some NVIDIA guy once said to SJ that glGet calls may be the cause. Before SJ asked, no one ever had that idea, or this thread wouldn't need to exist now.

IMHO it's at the same level as the consensus that the pathfinder is slow for aircraft, while aircraft don't even use the pathfinder...

In any case, I wouldn't mind if someone violates the first rule of software engineering and it happens to give good results.

However, to prevent someone from wasting time refactoring glGet calls away and being disappointed with the result afterwards, I'd suggest measuring first (after all, it's only 2 lines of code to add a time profiler around the single remaining non-LUA glGet call) and setting expectations accordingly, in particular if it's non-trivial to factor the glGet call away.
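A minimal sketch of such a measurement, assuming nothing about Spring's own profiler: `TimeAccumulator` and `ScopedTimer` are hypothetical helper names built on `std::chrono`, and the `glGetDoublev` call in the trailing comment is only an example of what one might wrap.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Hypothetical helper (not Spring's profiler): sums up wall-clock time
// spent inside all scopes that were timed against it.
struct TimeAccumulator {
    explicit TimeAccumulator(std::string n) : name(std::move(n)) {}

    double TotalMs() const {
        return std::chrono::duration<double, std::milli>(total).count();
    }
    double AverageMs() const {
        assert(calls > 0);  // averaging over zero calls is meaningless
        return TotalMs() / calls;
    }

    std::string name;
    std::chrono::steady_clock::duration total{};
    long calls = 0;
};

// Starts timing on construction and adds the elapsed time on destruction.
class ScopedTimer {
public:
    explicit ScopedTimer(TimeAccumulator& acc)
        : acc_(acc), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        acc_.total += std::chrono::steady_clock::now() - start_;
        ++acc_.calls;
    }
    ScopedTimer(const ScopedTimer&) = delete;
    ScopedTimer& operator=(const ScopedTimer&) = delete;

private:
    TimeAccumulator& acc_;
    std::chrono::steady_clock::time_point start_;
};

// Intended use around the call being measured (the two added lines):
//   static TimeAccumulator glGetTime("glGet");
//   { ScopedTimer t(glGetTime); glGetDoublev(GL_MODELVIEW_MATRIX, modelview); }
```

This only measures how long the CPU is blocked inside the call, which is the quantity the stall hypothesis is about; comparing `AverageMs()` against the total frame time would show whether removing the call can matter at all.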

As for splitting rendering and simulation into two threads, any help now with factoring rendering code out of simulation code (and preferably making rts/Sim/* compile without any reference to OpenGL) is appreciated, and will help in the long term to make 2-core support possible.

AF · Post by AF » 02 Dec 2007, 21:33

For the latter, I would point towards the Swing GUI toolkit.

It uses a design pattern where a GUI control keeps hold of a data model object that uses an abstract/interface class, which contains all the information and data. So a table will have a table model, and there'd be a selection model and so on. The table class itself wouldn't hold anything specific to that instance's data unless it was specific to a graphical attribute of the table, e.g. its size. How many columns there are, what they hold and their ordering are the model's responsibility.

Thus if we had say a CUnit class with a CUnitDataModel and a CUnitRenderer......
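A rough sketch of how that split could look, assuming the class names AF suggests. None of these are existing Spring classes, `CSimUnit` is an extra hypothetical name for the concrete simulation object, and `Draw()` returns a string as a stand-in for real OpenGL calls.

```cpp
#include <cassert>
#include <string>

// Simulation-side interface: everything a renderer may read about a unit.
class CUnitDataModel {
public:
    virtual ~CUnitDataModel() = default;
    virtual float GetX() const = 0;
    virtual float GetHealth() const = 0;
};

// Concrete simulation state (hypothetical); no rendering dependency at all.
class CSimUnit : public CUnitDataModel {
public:
    CSimUnit(float x, float health) : x_(x), health_(health) {
        assert(health >= 0.0f);
    }
    void Update(float dx) { x_ += dx; }
    float GetX() const override { return x_; }
    float GetHealth() const override { return health_; }

private:
    float x_;
    float health_;
};

// Rendering side: holds only a read-only reference to the model.
class CUnitRenderer {
public:
    explicit CUnitRenderer(const CUnitDataModel& model) : model_(model) {}

    // Stand-in for draw calls: describes what would be drawn.
    std::string Draw() const {
        return "unit at x=" + std::to_string(static_cast<int>(model_.GetX())) +
               " hp=" + std::to_string(static_cast<int>(model_.GetHealth()));
    }

private:
    const CUnitDataModel& model_;
};
```

The point of the shape is that the simulation classes compile without any renderer header, while the renderer can only observe state through the model interface.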

Kloot · Post by **Kloot** » 02 Dec 2007, 21:55

There are in fact 3 remaining non-Lua glGets (of the original 5; I don't know where Argh got the idea that I removed any today), but they are simple enough to refactor that I'll probably do so if I am convinced by profiling data that it's worth the bother. So far, I haven't noticed any improvement on my own 8800GTS setup.

AF · Post by AF » 02 Dec 2007, 22:00

Then profiling glGet seems the obvious choice.

Either way, if glGet is the cause then the vast majority of any gains won't be seen until the last glGet is removed.

LordMatt · Post by **LordMatt** » 02 Dec 2007, 23:24

Kloot, what kind of performance numbers do you get on your 8800GTS, and what OS are you using?

lurker · Post by **lurker** » 03 Dec 2007, 00:17

I'm having a little trouble figuring out exactly which three glGet calls are run when Lua is off. I'm testing right now, but I would like to know with more certainty.
Edit: I have only found a single call by testing with breakpoints, and it's not showing up as taking lots of time. Where are the other two?

Argh · Post by **Argh** » 03 Dec 2007, 07:34

Sorry, Kloot, I was referring to the changes in Camera.cpp... wasn't that new?

At any rate, if a profile is what's needed to sort out my hypothesis from fact, ok. I really should shut up about this topic- again, I don't really know what I'm talking about, and Tobi's right, I'm going off of SJ's statement and words of advice directed at Gabba 2 years ago, so there might be absolutely nothing worth doing.

Now, as far as separating the rendering from sim... well, if Lurker cuts out the hardcoded FX stuff, that's one pretty major chunk that'd now be run through CEGs, which IIRC aren't tied directly into sim, except that they share the main loop of Spring. I don't see anything wrong with that- can't we just have a for / else that causes certain elements to run, say, 1/3rd as fast as everything else, so that the graphical elements (including COB) update at a max real framerate of 90FPS, but the main "sim" elements run at 30FPS? I'm fairly certain that COB tops out at 30 FPS right now. Untying COB, CEGs and Projectiles from the rest of sim should be fairly "simple"... except for the wide variety of things that are cross-tied with one another, like GUI elements.
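The "sim at 30FPS, graphics faster" idea above is commonly written as a fixed-timestep loop with a time accumulator. The sketch below is a generic version of that technique, not Spring's main loop; `FixedStepLoop` and its members are hypothetical names, and the simulation and draw calls are only marked by comments. Which systems could safely live on the draw side is a separate question.

```cpp
#include <cassert>

// Hypothetical sketch: the simulation advances in fixed steps, while
// drawing happens once per pass through the loop, however fast that is.
struct FixedStepLoop {
    explicit FixedStepLoop(double simHz) : stepSec(1.0 / simHz) {
        assert(simHz > 0.0);
    }

    // Call once per rendered frame with the elapsed wall-clock seconds.
    // Returns how many simulation steps were run to catch up.
    int Advance(double frameSec) {
        accumulator += frameSec;
        int steps = 0;
        while (accumulator >= stepSec) {
            accumulator -= stepSec;  // one simulation frame would run here
            ++simFrames;
            ++steps;
        }
        ++drawFrames;  // one draw would run here, e.g. interpolating by accumulator / stepSec
        return steps;
    }

    double stepSec;
    double accumulator = 0.0;
    long simFrames = 0;
    long drawFrames = 0;
};
```

With a 30 Hz step and a 90 FPS display, most passes through `Advance` draw without simulating, and roughly every third pass runs one simulation step.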

How can I help? I'm not an awesome coder, but I can see that this might be more than worthwhile, and since Treeform hasn't delivered his magical 3DO-translating tool, and I'm still in early development of the second faction for PURE, it's a good time to look at this- the only major dev work I need to do with PURE this week is the whole public / private soundcode testing, now that I have all of the main FX stuff done until Resistance is a lot closer to feature-complete...

Tobi · Post by **Tobi** » 03 Dec 2007, 09:44

COB, CEG and many projectiles are tied unconditionally into simulation.

COB is needed because it can also change simulation states (cloaked, etc.) and because it provides ways to emit synced projectiles (i.e. ones that inflict damage), and its state determines where weapon projectiles are "emitted".

CEG is needed for about the same reason: because one can spawn projectiles with it that inflict damage or influence the simulation in other ways.

So I would like to change "fairly simple" in your post to "impossible".

Only unsynced projectiles and all rendering can ever be moved to another thread. (Which could be worth doing, since especially unsynced projectile updates should be fairly easy to run in a different thread.)

One optimization that'd also help a lot on TA-based mods, and in particular on SpeedMetal-esque maps, is making nano particles unsynced. They're still synced, and I think I know why: looking at nano particles is a gameplay element to see whether someone is stalling or not, so if they get culled for other reasons (e.g. your PC being slow), you can stumble upon a nice surprise when you attack.

I think that ideally nano particles would be converted into one nano particle per builder, which would be a huge billboard quad with a scrolling texture of nano particles on it. I'm pretty sure that'll speed up things quite some %, but of course someone would have to put time profiler code in the nano particle first.

LordMatt · Post by **LordMatt** » 03 Dec 2007, 14:27

Tobi wrote: One optimization that'd also help a lot on TA-based mods, and in particular on SpeedMetal-esque maps, is making nano particles unsynced. They're still synced, and I think I know why: looking at nano particles is a gameplay element to see whether someone is stalling or not, so if they get culled for other reasons (e.g. your PC being slow), you can stumble upon a nice surprise when you attack.

I think this is an important gameplay feature that should not be removed. The decisions you make in game often are changed by whether your enemy is stalling to build that HLT or not (e.g. do you need to attack it now, or can you wait till yours is done, because you are making it faster than he is).

KDR_11k · Post by **KDR_11k** » 03 Dec 2007, 15:16

Nanoparticles were unsynced in OTA; they disappeared if too many were in play. The oldest ones disappeared first, so you got a shorter spray. That didn't break the info and still had a max particle count.
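The behaviour described above (a hard cap, with the oldest particles dropped first) can be sketched with a simple queue. This is an illustration of the idea only, not OTA's or Spring's actual code; `Particle` and `CappedParticlePool` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

struct Particle {
    int id;
};

// Hypothetical unsynced particle pool with a hard cap: when the pool is
// full, spawning a new particle evicts the oldest one first.
class CappedParticlePool {
public:
    explicit CappedParticlePool(std::size_t maxParticles) : max_(maxParticles) {
        assert(maxParticles > 0);
    }

    void Spawn(const Particle& p) {
        if (particles_.size() == max_) {
            particles_.pop_front();  // drop the oldest particle
        }
        particles_.push_back(p);
    }

    std::size_t Size() const { return particles_.size(); }

    const Particle& Oldest() const {
        assert(!particles_.empty());
        return particles_.front();
    }
    const Particle& Newest() const {
        assert(!particles_.empty());
        return particles_.back();
    }

private:
    std::size_t max_;
    std::deque<Particle> particles_;
};
```

Because only the oldest particles vanish, every active builder keeps emitting a visible (if shorter) spray, which is why the stalling information survives the cap.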

AF · Post by AF » 03 Dec 2007, 16:37

The option in settings.exe to move particles from textured quads to untextured triangles would be a nice speed boost; however, I think that sort of speed boost is likely to have the greatest impact on the lower end of the hardware spectrum.

Tobi · Post by **Tobi** » 03 Dec 2007, 17:18

That's not what I meant, however. I meant combining several particles into one bigger particle with a texture that shows multiple particles. So you'd basically get 4 particles for the price of one, for example (assuming GPU fillrate isn't the bottleneck).

Optionally even just one particle with a scrolling texture with many particles drawn on it.

Though I realised it may make more sense to move nano particle simulation to the GPU and just make it unsynced for lower end.
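The arithmetic behind "4 particles for the price of one" is just a rounded-up division, sketched here with a hypothetical helper name. It only counts quads to simulate and draw, and, as noted above, says nothing about fillrate.

```cpp
#include <cassert>

// Number of quads needed when each quad's texture depicts several particles.
int QuadsNeeded(int particles, int particlesPerQuad) {
    assert(particles >= 0 && particlesPerQuad > 0);
    return (particles + particlesPerQuad - 1) / particlesPerQuad;  // round up
}
```

For example, 1000 visible nano particles at 4 per quad would need 250 quads instead of 1000.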
