Optimizing Math Functions - Page 5

Optimizing Math Functions

Discuss the source code and development of Spring Engine in general from a technical point of view. Patches go here too.


aegis
Posts: 2456
Joined: 11 Jul 2007, 17:47

Re: Optimizing Math Functions

Post by aegis »

Could Lua scripting on units instead of cob be much faster? I think I remember Lurker complaining Lua counted faster, anyway.
lurker
Posts: 3842
Joined: 08 Jan 2007, 06:13

Re: Optimizing Math Functions

Post by lurker »

Complaining? But yes, a for loop was an astounding 20 times faster in lua than in cob.
SJ
Posts: 618
Joined: 13 Aug 2004, 17:13

Re: Optimizing Math Functions

Post by SJ »

It's true that the rendering optimization won't make that much difference if simulation is what takes all the CPU time, although in my memory rendering used to eat at least 50% CPU even during heavy battles (but then I might have had a faster CPU than my opponents, so they reduced simulation speed).

You don't need to rebuild the display lists for moving units to send them as one display list. Just send in the matrices for the pieces as vertex constants and then have the vertices store which piece they belong to (requires use of vertex programs, which is why Spring didn't use this from the beginning).
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

Lua is far faster than cob, but the reason using Lua in animations is slow is that you have to make cob->lua->cob calls, and crossing between the two languages is very expensive.

A pure Lua implementation of the animation format would be far, far faster than the cob implementation.
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

nice discussion! :)

I want to share my experience because I'm working on low-level optimizations at my job too... I tried several approaches but ended up writing the very core in low-level SSE assembler code...

E.g. on the square root for the vector scalar product I was able to get a speedup of ~2 by using the SSE packed version SQRTPS for 4 float square roots instead of using the FPU version FSQRT 4 times.

But the biggest impact is possible with the approximate reciprocal function RSQRTPS, which is a hardware-based approximation of the reciprocal square root!
http://softwarecommunity.intel.com/arti ... g/1818.htm
So the square-root timing for a float shrinks from ~32 clock cycles down to ~4 cycles, a speedup of 8!
(timings from the table http://softwarecommunity.intel.com/arti ... g/3089.htm)

So on current hardware there is no need anymore to do this approximation by hand; just use the SSE version for even more speedup... ;)

Here are some code examples for inline-assembler square roots with timings (not by me):
http://www.musicdsp.org/showone.php?id=206

Downsides of SSE assembler code are, besides the CPU hardware dependency, that these enormous speedups can only be gained on blocked (packed) calculations on multiple data, which have to be aligned to 128-bit memory boundaries... which is sometimes complicated (or even impossible) to fulfill.
...but if the memory is aligned, significant speedups on memory transfers are possible with SSE2 too by using MOVNTQ, but that's another topic. (http://cdrom.amd.com/devconn/events/AMD ... _paper.pdf)

On the problem I optimized at my job (geometric distance mapping with scalar vector products for billions of voxels), I got an overall speedup of 10 versus the likewise-optimized C version.
Jonanin
Posts: 107
Joined: 13 Jan 2008, 21:34

Re: Optimizing Math Functions

Post by Jonanin »

Indeed inline assembler could be much faster, but I think there are portability issues with that...
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

Jonanin wrote:Indeed inline assembler could be much faster, but I think there are portability issues with that...
Hi Jonanin,
yes, you are right; using inline assembly seems to be different in every C/C++ compiler, and support for SSE1/SSE2/SSSE3 etc. might be missing.
Another problem might be OS portability and different calling conventions...
Therefore, my solution was to use the NASM assembler and produce object files, which should be easy to link in with every build system; OS portability is also no real problem because NASM supports all major object formats.

I took a look at Jonanin's fastmath.cpp and made an SSE1 version of the fastsqrt and the inverse function:

float fmsqrt(float)
float fmisqrt(float)

Additionally I tried a 4x float version (should be 2 times faster), which writes the result back to the source values' memory:
void fmsqrt4(float[4])
void fmsqrt4a(float[4]) (needs the 4 floats at a 128-bit-aligned memory address)

Left to optimize: this stuff is implemented as functions and not inlined, so the function-call overhead adds up; a check for CPU capabilities is also needed; a bigger block mode; etc.

Appended files: source plus ready-to-link WIN32 object file and Linux ELF object file. If someone would be interested in testing it with Spring, that would be great. (Hope it works... not tested! ;) )
Attachments
fastmath.zip
(1.08 KiB) Downloaded 36 times
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

What about 64bit?
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

AF wrote:What about 64bit?
Possible... but with 64 bit the calling conventions changed and, worse, are now different between Linux & Windows (http://en.wikipedia.org/wiki/Stdcall)... inlining would solve this problem.

Here is a 64-bit source & compile for Windows (only), with the same functionality as the 32-bit version (I would be really interested whether this works!... no system to test it myself).

If there is interest I would also spend time providing more complex functions, like distance or geometric/vector calculations etc., especially if these calculations can be done on a big amount of data at once (on a complete picture, texture, all coordinates of all objects... I don't know how the data is organized).

PS: the fmsqrt4a version should be 2 times faster on 64-bit systems than on 32-bit systems (according to the technical documents)... if someone can verify that...
Attachments
fastmath64.ZIP
(827 Bytes) Downloaded 33 times
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

What about sync issues and compatibility? Remember rattle being the odd one out in the entire community, not being able to run with SSE because the chipset he used didn't support it?

And what about differences in calculations between the SSE approximate answer and the fastmath answer, which might cause desync if one client supports the instruction but the other doesn't?
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

AF wrote:What about sync issues and compatibility? Remember rattle being the odd one out in the entire community, not being able to run with SSE because the chipset he used didn't support it?

And what about differences in calculations between the SSE approximate answer and the fastmath answer, which might cause desync if one client supports the instruction but the other doesn't?
Interesting question about the syncing... I don't have a clue!
About support: the capability can easily be read out from the CPU and, if not available, switched to an FPU-based approximation... but you are right, even then the accuracy might be a little bit different...
Might that lead to serious sync issues? ...Maybe, but if some code part is that sensitive, this case might also happen if one build of Spring uses the GCC -ffast-math switch (also available in the MS compiler, don't know the name...) and the other builds don't.

From a practical point of view, for the stuff I use only SSE1 is needed, which has been available since the earliest Pentium 3 (introduced in spring 1999, so SSE has been on the market for almost 10 years... 3 CPU generations!).

Overall I agree with you: such low-level optimizations should only be introduced at special, crucial bottleneck places where there is a significant gain from them.
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Re: Optimizing Math Functions

Post by Tobi »

It does desync if one client calculates stuff using SSE and the other using X87 math or an integer approximation.

(If used in synced context, of course.)

SSE has different denormal handling than X87 IIRC, meaning everyone has to run Spring with SSE, or no one does (applies to the synced parts only, ofc).

Anyway, IMO no assembler in Spring unless absolutely required (AKA the operator/instruction isn't available otherwise). Besides the hassle of fixing the build system to integrate NASM and the extra build dependency, it also needs to be checked regularly whether it's still faster than the compiler we use (I've seen hand-tuned assembler code, which used to be faster than similar C code, become slower than that C code because of smarter compilers / smarter CPUs / cache/alignment effects / etc.).

All in all, the amount of work to maintain it by far exceeds the time you win because the instruction executes a few % faster.
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

Tobi wrote:It does desync if one client calculates stuff using SSE and the other using X87 math or an integer approximation.

(If used in synced context, of course.)

SSE has different denormal handling than X87 IIRC, meaning everyone has to run Spring with SSE, or no one does (applies to the synced parts only, ofc).

Anyway, IMO no assembler in Spring unless absolutely required (AKA the operator/instruction isn't available otherwise). Besides the hassle of fixing the build system to integrate NASM and the extra build dependency, it also needs to be checked regularly whether it's still faster than the compiler we use (I've seen hand-tuned assembler code, which used to be faster than similar C code, become slower than that C code because of smarter compilers / smarter CPUs / cache/alignment effects / etc.).

All in all, the amount of work to maintain it by far exceeds the time you win because the instruction executes a few % faster.
Hi Tobi,
good point... an external assembler (like NASM) would complicate the build process, and even inline assembler might or might not be easier to maintain.
About 'a good compiler + newer hardware will instantly beat optimized asm code'... I have to disagree, sorry ;)
That's an often-repeated myth, that nowadays compilers are better than assembler. It's true that they are pretty good and produce well-optimized code (if fed with well-prepared code), but even then a significant speedup (several factors, not only some percent) is possible when using all the possibilities of assembler.

One good article on this topic:
http://www.azillionmonkeys.com/qed/opti ... #asmdebate