Deferred rendering of shadow pass.

Post by **Argh** » 30 Oct 2008, 20:13

I took a look at the relevant code, and on a first read-through it appears that we're running lots and lots of CPU-heavy shadow passes, one per unit, and then deleting all of the places where the result is duplicated on the final shadowmap.

It would be far more efficient to store all of the unit meshes being rendered this pass and the terrain in one large display list and then run the shadowmap pass.

See drawUnit.cpp, sections CUnitDrawer::DoDrawUnitShadow and CUnitDrawer::DrawShadowPass, to see what I mean.

Storing all visible Units, Features, and the mesh result for the terrain in one final display list before doing any shadow calculations would probably result in a giant speedup. This is, of course... assuming that I read the code right, but it really does look like it's rendering the world shadow to make the initial shadowmap, then adding to it via each Unit individually, then finally doing self-shadow in the unit pixel shader... which looks massively inefficient.

Post by **[Krogoth86]** » 30 Oct 2008, 20:37

Correct me if I'm wrong but that's the "Let's do it like UT3 and fuck up all Anti-Aliasing!" method isn't it?

I prefer having a maybe not so efficient approach for shadows but keep my jag-free image...

Post by **Kloot** » 30 Oct 2008, 21:01

What it does:

Code: Select all

CUnitDrawer::Update()
	for each newly added unit
		add unit to renderUnits

CShadowHandler::DrawShadowPasses()
	CUnitDrawer::DrawShadowPass()
		for each unit in renderUnits
			CUnitDrawer::DoDrawUnitShadow
				if unit visible
					draw unit

What you want it to do:

Code: Select all

CUnitDrawer::Update()
	for each newly added unit
		add unit to renderUnits

	START-MAGIC-SUPERFAST-SHADOW-DLIST
	for each unit in renderUnits
		CUnitDrawer::DoDrawUnitShadow
	END-MAGIC-SUPERFAST-SHADOW-DLIST

CUnitDrawer::DrawShadowPass()
	DRAW-MAGIC-SUPERFAST-SHADOW-DLIST

Now guess why this approach will fail.

Post by **Argh** » 30 Oct 2008, 21:10

[Krogoth86] wrote:Correct me if I'm wrong but that's the "Let's do it like UT3 and fuck up all Anti-Aliasing!" method isn't it?

Er, no. I'm not talking about changing the method used for the shadowmap. It'd just run in a different order that would be a lot more efficient.

Kloot wrote:What you want it to do:

No, that's not what I want to do.

Code: Select all

CUnitDrawer::Update()
   for each newly added unit
      add unit to renderUnits

   START-MAGIC-SUPERFAST-SHADOW-DLIST
   for each unit in renderUnits
      if unit or Feature visible
         add unit triangles to a display list "AwesomeShadowList"
   END-MAGIC-SUPERFAST-SHADOW-DLIST

CUnitDrawer::DrawShadowPass()

Draw World Shadows (with standard shadow code) just like always.

Draw World Shadows onto the AwesomeShadowList, to get our initial shading value per vertex, either white (no shadow) or 128,128,128 (world shadow).

Draw contents of AwesomeShadowList on the world, do additive on top of world shadows, use texture size but do not do for entire world, just a square big enough for the FOV (iow, greatly increase the resolution, since you aren't blatantly wasting it on mountains far out of the FOV). No editing of the original bitmap values is required, it's a second shadow.

Draw contents of AwesomeShadowList using the pixelshader used for self-shadowing, modified a bit to start with the texture made in the world-shadow step. Now a given triangle will either have a value of 255, 128 (world shadow or self-shadow only) or 0 (world shadow and self-shadow). Final value is either no shadow (don't draw), full "normal" shadow, or that value divided by two... or half the ambient value, whichever is higher.

IOW, instead of wasting lots of CPU time editing a final, incredibly-blurry bitmap... do exactly four sets of operations per pass, because you're deferring it, and do four different things. One rough pass, for terrain, which we then use to shade the units initially, one fine pass, for units shading the ground, one pixel-shader pass, for the units to self-shade (oh, and they would finally shade each other, too). Is that clearer? I am pretty darn sure that it would work, and run considerably faster than what we're currently doing.

[EDIT]Forgot using the world shadowmap to perform the initial shade value of the AwesomeShadowList. My bad. You guys get the idea though, I hope: we're looking for a final darkness value, basically, from white (no shadow at all) to the lowest level of ambient minus any self-shadows, down to a floor of, say, half the ambient value (iow, allow areas in shadow twice to go lower than ambient, which would be accurate, given that "ambient" is the lowest level for the world shadows).[/EDIT]
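One possible reading of the value rule described above, as a minimal sketch. The function name, the parameter names, and the default numbers are assumptions made for illustration; none of this is engine code, and the post leaves room for other interpretations of "that value divided by two".

```python
def combine_shadow(world_shadowed: bool, self_shadowed: bool,
                   shadow_level: float = 128 / 255,
                   ambient: float = 0.4) -> float:
    """Return a brightness multiplier in [0, 1] for one surface sample.

    Assumed reading of the post:
      - not shadowed at all       -> 1.0 (white, nothing drawn)
      - shadowed once             -> the "normal" shadow level
      - shadowed twice            -> half the normal level, but never
                                     below half the ambient value
    """
    hits = int(world_shadowed) + int(self_shadowed)
    if hits == 0:
        return 1.0
    if hits == 1:
        return shadow_level
    return max(shadow_level / 2.0, ambient / 2.0)


if __name__ == "__main__":
    for world in (False, True):
        for own in (False, True):
            print(world, own, combine_shadow(world, own, 0.5, 0.4))
```

With a hypothetical normal shadow level of 0.5 and ambient of 0.4, a sample in both shadows ends up at 0.25; raising ambient to 0.8 lifts that floor to 0.4.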

Post by **hoijui** » 31 Oct 2008, 10:14

I am sorry... I do not have anything useful to say about this, I have no idea what you are talking about...
just wanted to say:
WOW Argh, you do lots of stuff these days :D

You seem to run against walls though. Maybe you should just try it yourself again. Maybe do a branch for it (OK, you said it is a small change, so you could also do it locally and, if it works, commit it directly to trunk). I mean, you said it is a small and easy change.
Maybe you will be done with it before convincing others, or before you get them to understand what exactly you are trying to do.

anyway... nice!

Post by **jK** » 31 Oct 2008, 11:21

It doesn't work like that ...

First, the creation of DisplayLists is very slow (yeah on ATis it is even extremely slow) because of the optimizations (the GL commands/vertices get _compiled_!).
Second, the ShadowMap is in a different space (normal rendering is in camera space and shadowmap is in sunlight space) -> stuff can be hidden in camera space, but visible in sunlight space!

So if you want to speed up shadowmapping, you have to speed up unit/terrain rendering itself. And yes, both rendering passes (units & terrain) already have LOD factors for shadow rendering (that's how commercial games speed up their shadowmap generation, too).
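The camera space versus sunlight space point can be sketched in a 2D side view. This is only an illustration under simplifying assumptions: the ground is the line y = 0, the camera's visible ground is a single interval, and all names and numbers are hypothetical.

```python
from typing import Tuple


def shadow_ground_x(unit_x: float, unit_height: float,
                    sun_dir: Tuple[float, float]) -> float:
    """X position where the shadow of the unit's top hits the ground (y = 0).

    sun_dir is the direction the light travels, so its y component
    must be negative (pointing down).
    """
    dx, dy = sun_dir
    if dy >= 0:
        raise ValueError("sun must shine downwards")
    t = unit_height / -dy          # ray parameter at which y reaches 0
    return unit_x + t * dx


def in_view(x: float, view: Tuple[float, float]) -> bool:
    """True if ground position x lies inside the camera's visible interval."""
    return view[0] <= x <= view[1]


if __name__ == "__main__":
    view = (0.0, 100.0)            # ground interval the camera can see
    sun = (1.0, -1.0)              # light travelling right and down at 45 degrees
    unit_x, unit_height = -5.0, 20.0
    print("unit visible to camera:", in_view(unit_x, view))
    print("shadow lands in view:",
          in_view(shadow_ground_x(unit_x, unit_height, sun), view))
```

Here the unit itself is outside the camera's interval, yet its shadow lands at x = 15, inside it. Culling shadow casters with the camera's visibility test would therefore drop a shadow that should be on screen.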

PS: Shadows look ugly atm, but that's not an issue of the shadowmap generation itself (unless you really want 100% hard edges); it is how the shadowmap is taken into account in unit/terrain rendering. That will change once I have translated all shaders into GLSL.

..
Code: Select all
..
Draw World Shadows onto the AwesomeShadowList, to get our initial shading value per vertex, either white (no shadow) or 128,128,128 (world shadow).

Draw contents of AwesomeShadowList on the world, do additive on top of world shadows, use texture size but do not do for entire world, just a square big enough for the FOV (iow, greatly increase the resolution, since you aren't blatantly wasting it on mountains far out of the FOV). No editing of the original bitmap values is required, it's a second shadow.

Draw contents of AwesomeShadowList using the pixelshader used for self-shadowing, modified a bit to start with the texture made in the world-shadow step. Now a given triangle will either have a value of 255, 128 (world shadow or self-shadow only) or 0 (world shadow and self-shadow). Final value is either no shadow (don't draw), full "normal" shadow, or that value divided by two... or half the ambient value, whichever is higher.
IOW, instead of wasting lots of CPU time editing a final, incredibly-blurry bitmap... do exactly four sets of operations per pass, because you're deferring it, and do four different things. One rough pass, for terrain, which we then use to shade the units initially, one fine pass, for units shading the ground, one pixel-shader pass, for the units to self-shade (oh, and they would finally shade each other, too). Is that clearer? I am pretty darn sure that it would work, and run considerably faster than what we're currently doing.

???
(you should really read a tutorial on how shadowmaps (and fragment shaders) work ...)

Post by **AF** » 31 Oct 2008, 16:06

I think you're under the common misconception that display lists move stuff onto the GPU to minimize GPU<->CPU communication, making them faster.

What it actually does is compile the commands into a list and then store it in RAM, so that it can all be sent to the GPU at once. This is faster than normal glBegin/glEnd etc. of course, but it's nowhere near as fast as you're expecting it to be, and for something that's only going to be used once on the next frame and then destroyed, it defeats the whole point of display lists in the first place.

For an example of how fast display lists actually are, I have a program I have been coding that runs at 4fps with normal glBegin/glEnd, ~40fps with display lists, and ~500fps using interleaved indexed vertex buffer objects. Then again, VBOs don't exactly fit in well with your idea either, since they too take time to construct and send to the GPU.
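To put those frame rates in terms of time per frame, and to show why a list that is built once and drawn once gains little, here is a small sketch. The fps figures are the approximate ones quoted above; the build and draw costs passed to `amortized_ms` are made-up numbers for illustration only, not measurements.

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second into milliseconds per frame."""
    return 1000.0 / fps


def amortized_ms(build_ms: float, draw_ms: float, uses: int) -> float:
    """Average cost per use of something that is built once and drawn `uses` times."""
    if uses < 1:
        raise ValueError("uses must be at least 1")
    return build_ms / uses + draw_ms


if __name__ == "__main__":
    for label, fps in (("glBegin/glEnd", 4), ("display list", 40), ("VBO", 500)):
        print(f"{label}: {frame_time_ms(fps):.1f} ms per frame")

    # Hypothetical costs: 100 ms to compile a list, 25 ms to draw it.
    print("used once:     ", amortized_ms(100.0, 25.0, 1), "ms")
    print("used 100 times:", amortized_ms(100.0, 25.0, 100), "ms")
```

Under these assumed costs, a list that is thrown away after one frame pays the whole compile cost every time, while one reused a hundred times pays almost none of it.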

With regards to your intended implementation, I would suggest that it might be better to render into an FBO than to store the commands in a display list, and then render the FBO texture. Then again, I've not implemented shadows before, so that's just random speculation from me lol

Post by **Argh** » 31 Oct 2008, 23:59

Look, I'm aware that making the display list is slow. However, you're just making it once, then transferring the contents to a table of verts (another display list), so we're not talking about re-doing it, merely copying the number to another location. Not so slow.

And because you're deferring it you're avoiding the constant CPU drag of world-objectworld-pixelshader loops, and all of the transfers from memory that are an inevitable part of that.

Look at it another way, guys.

Turning shadows on has about the same hit on FPS and CPU use for me all the way from a shadowmap size of 256 to 2048, a 64-fold increase in pixel area. It isn't until I go to 4096 that a slowdown on the GPU starts causing slower performance. Therefore, the main hit is on the CPU side of things, and in order to fix this, we need to continue to look there for ways to speed up the way that we're getting data from the CPU side into a shader, etc.
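The 64-fold figure follows from the shadowmap being square, so the texel count grows with the square of the edge length. A quick check (the function name is just for illustration):

```python
def texel_count(size: int) -> int:
    """Number of texels in a square shadowmap with the given edge length."""
    return size * size


if __name__ == "__main__":
    base = texel_count(256)
    for size in (256, 512, 1024, 2048, 4096):
        print(f"{size}x{size}: {texel_count(size) // base}x the texels of 256x256")
```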

What I've proposed may not be a viable solution, I'd be the first one to say that. But the idea here is to think outside the box, and try to see the entire problem. I think there's a huge amount of waste involved with the loops for this, and I think that we need to figure out a way to cut down to the minimum number of repetitions.

I don't think that speeding up terrain rendering will significantly impact the performance of this. In fact, I'm tempted to make a test map and do a proof of that. I think that the larger the map, the longer it takes, period.

So, while I'm happy that you want to optimize that area, jK, for shadows maps... it'll make it a small amount faster, but not a lot. I can see LOD making things a bit faster, and I'm aware that games use a shadow LOD. But I really don't think it's just about that, either, otherwise I'd be stomping up and down about getting the LOD stuff working right. I think it's mainly about waste- memory transfers and too many iterations when it should be a lot cleaner.

Post by **Argh** » 01 Nov 2008, 04:59

Did some experiments, here were the results:

1. The cost on shadowmaps is definitely irrespective of the size of the shadowmap- it made 1FPS difference, at most. If you're going to turn it on at all, you might as well turn it all the way up, basically.

2. The cost on shadowmaps is almost purely CPU-side. I looked at huge numbers of Resistance troopers (my favorite stress-test) and actual rendered FPS was stable whether they were moving or not, but game FPS dropped by a huge amount when the shadow pass was on and they were moving, vs. a minor drop when they weren't.

Post by **jK** » 01 Nov 2008, 05:54

Argh wrote:Look, I'm aware that making the display list is slow. However, you're just making it once, then transferring the contents to a table of verts (another display list), so we're not talking about re-doing it, merely copying the number to another location. Not so slow.

And because you're deferring it you're avoiding the constant CPU drag of world-objectworld-pixelshader loops, and all of the transfers from memory that are an inevitable part of that.

erm what

Argh wrote:I think it's mainly about waste- memory transfers and too many iterations when it should be a lot cleaner.

the `waste` is in the unneeded LOD ...
(and in the rendering in general)

Argh wrote:Did some experiments, here were the results:

1. The cost on shadowmaps is definitely irrespective of the size of the shadowmap- it made 1FPS difference, at most. If you're going to turn it on at all, you might as well turn it all the way up, basically.

2. The cost on shadowmaps is almost purely CPU-side. I looked at huge numbers of Resistance troopers (my favorite stress-test) and actual rendered FPS was stable whether they were moving or not, but game FPS dropped by a huge amount when the shadow pass was on and they were moving, vs. a minor drop when they weren't.

Did anyone think it would be fragment limited?
(note: Shadowmap generation is never fragment limited (no blending, glColorMask(4xfalse), etc.).)
So how could those `experiments` help with improving the shadows? In particular, moving vs. standing just shows that the moving/pathing code is slow (which has nothing to do with rendering at all) ..
If any rendering is CPU bound then it is the terrain rendering (terrain LOD is done on the CPU), that's why I mentioned the LOD factor ... (unit rendering isn't CPU bound at all, it is just bandwidth (uploading all the unit space matrices) and vertex shader stress).

Post by **jcnossen** » 02 Nov 2008, 00:08

Argh wrote:And because you're deferring it you're avoiding the constant CPU drag of world-objectworld-pixelshader loops, and all of the transfers from memory that are an inevitable part of that.

Things like this really make it clear that you have no idea what is happening. jk already debunked your ideas a few posts back, why continue?

On a side note: Deferred shading might be a very good way for spring to support real dynamic lighting in a scalable way. The only problem i can see is integrating it with current shaders and effects. Any thoughts jk?

Post by **jK** » 02 Nov 2008, 05:17

First, I would prefer to clean up the current render code before integrating Deferred Shading. That means rewriting all shaders in GLSL, an abstract unit render interface, removal of any FFP (including combiners!) if shaders are available, and an abstract terrain render interface.

Engine effects should be easy to integrate with Deferred Shading; only water rendering is a problem, but it should be doable.
Then Lua is a problem: it has direct OpenGL access, which isn't a problem as long as you render to the screenbuffer, but if you have to render to an MRT you need an experienced programmer.
So rendering in DrawWorld (DrawScreen would still render to the screenbuffer) would become very complicated. Spring needs to supply a standard shader that writes everything into the render targets etc., because the barrier to entry is very high for the standard OpenGL noob (you need to know how to write GLSL shaders, how MRTs work, a lot of maths, ...).
So it would limit the abilities of 95% of all Lua programmers. But as long as there is a default shader which allows basic rendering, it shouldn't affect a lot of ppl.

A different problem are the Intel users ... Intel still doesn't support GLSL. So they couldn't play spring anymore, and making Deferred Shading optional is too complicated.

A short list of stuff to do for Deferred Shading:
* write the MRT class
** performance tweaking (texture targets vs. render targets, float precision, ati/nvidia differences, ...)
* rewrite the full OpenGL world render code (no FFP!)
* remove terrain lightmap and replace with a plain normalmap
* write the basic deferred sun lighting shaders (lighting & shadows)

* fix transparency rendering queue
* rewrite all shaders in GLSL and split some of them up if needed
* fix water renderers & reflective units (they should use MRTs too etc.)
* write Lua standard DrawWorld shader and a safe interface
* add some new Lua callins

* write extended deferred shaders (dyn lighting, dyn shadows? (for explosions etc.), ssao?, hdr?, bloom?, fog?)
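For readers unfamiliar with the technique behind this list, here is a CPU-side toy model of deferred shading: a geometry pass that only records surface attributes per pixel, followed by a separate lighting pass over that buffer. It is a sketch under heavy simplification (plain Python lists instead of GPU render targets, one directional light, Lambert plus a constant ambient term), and every name in it is hypothetical rather than taken from Spring.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
# One rasterized fragment: (x, y, depth, albedo, unit-length normal).
Fragment = Tuple[int, int, float, Vec3, Vec3]


@dataclass
class GBufferTexel:
    albedo: Vec3   # surface colour, written by the geometry pass
    normal: Vec3   # unit-length surface normal


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def geometry_pass(fragments: List[Fragment], width: int,
                  height: int) -> List[Optional[GBufferTexel]]:
    """Store attributes of the nearest fragment per pixel; no lighting yet."""
    gbuffer: List[Optional[GBufferTexel]] = [None] * (width * height)
    depth = [float("inf")] * (width * height)
    for x, y, z, albedo, normal in fragments:
        i = y * width + x
        if z < depth[i]:               # simple depth test: nearer fragment wins
            depth[i] = z
            gbuffer[i] = GBufferTexel(albedo, normal)
    return gbuffer


def lighting_pass(gbuffer: List[Optional[GBufferTexel]], to_light: Vec3,
                  ambient: float) -> List[Vec3]:
    """Shade every pixel once from the stored attributes."""
    out: List[Vec3] = []
    for texel in gbuffer:
        if texel is None:              # nothing was drawn at this pixel
            out.append((0.0, 0.0, 0.0))
            continue
        n_dot_l = max(0.0, dot(texel.normal, to_light))
        intensity = min(1.0, ambient + n_dot_l)
        out.append(tuple(c * intensity for c in texel.albedo))
    return out
```

The property that matters for scalable dynamic lighting is that lighting cost depends on the number of pixels, not on how many objects were drawn into the buffer. On the GPU, the `gbuffer` list would correspond to the multiple render targets mentioned in the list.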

Post by **Warlord Zsinj** » 02 Nov 2008, 08:58

I'm sure nobody would mind, or even notice if you ruled out all people using intel from playing spring...

Post by **imbaczek** » 02 Nov 2008, 12:33

Warlord Zsinj wrote:I'm sure nobody would mind, or even notice if you ruled out all people using intel from playing spring...

Wrong, unfortunately. I've got a laptop that I sometimes hack spring on and not being able to run it wouldn't exactly help; also, I've seen quite a few help requests with infologs that had Intel gfx in them.

Post by **tizbac** » 02 Nov 2008, 12:51

I have a friend with a laptop with Intel gfx; in the best case you can play at 10 fps with a gray map, in the worst case it doesn't start.

Post by **Warlord Zsinj** » 02 Nov 2008, 12:53

imbaczek, I was just teasing

Post by **Auswaschbar** » 02 Nov 2008, 13:08

imbaczek wrote:
Warlord Zsinj wrote:I'm sure nobody would mind, or even notice if you ruled out all people using intel from playing spring...

Wrong, unfortunately. I've got a laptop that I sometimes hack spring on and not being able to run it wouldn't exactly help; also, I've seen quite a few help requests with infologs that had Intel gfx in them.

Help requests like this one: http://spring.clan-sy.com/phpbb/viewtop ... 11&t=16730

Post by **lurker** » 02 Nov 2008, 13:20

Well that was an interesting hour. Still only a fraction of the way through the second one.
http://www.gamedev.net/reference/articl ... le2333.asp
http://www.ziggyware.com/readarticle.php?article_id=155

Edit: I had an intel chip and I got a very reasonable fps out of it.

Post by **imbaczek** » 02 Nov 2008, 13:36

I'd eagerly help, unfortunately everything works for me, and I have colors and relatively high fps in super low details, too ;p

jk: how hard would it be to have 2 separate exes with proper #ifdefs? we could provide a legacy binary and a present-tech binary if it's not hard to do.

Post by **hoijui** » 02 Nov 2008, 14:27

There was another request, about 2 weeks or more ago, from someone who wanted spring in wireframe mode.
that would solve this issue as well, no?
