Optimizing Math Functions - Page 5

Optimizing Math Functions

Discuss the source code and development of Spring Engine in general from a technical point of view. Patches go here too.


aegis
Posts: 2456
Joined: 11 Jul 2007, 17:47

Re: Optimizing Math Functions

Post by aegis »

Could Lua scripting on units instead of cob be much faster? I think I remember Lurker complaining Lua counted faster, anyway.
lurker
Posts: 3842
Joined: 08 Jan 2007, 06:13

Re: Optimizing Math Functions

Post by lurker »

Complaining? But yes, a for loop was an astounding 20 times faster in lua than in cob.
SJ
Posts: 618
Joined: 13 Aug 2004, 17:13

Re: Optimizing Math Functions

Post by SJ »

It's true that the rendering optimization won't make that much difference if simulation is what takes all the CPU time, although in my memory rendering used to eat at least 50% CPU even during heavy battles (but then I might have had a faster CPU than my opponents, so they reduced simulation speed).

You don't need to rebuild the display lists for moving units to send them as one display list. Just send in the matrices for the pieces as vertex constants and then have the vertices store which piece they belong to (requires use of vertex programs, which is why Spring didn't use this from the beginning).
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

Lua is far faster than cob, but the reason using Lua in animations is slow is that you have to make cob->lua->cob calls, and crossing between the two languages is very expensive.

A pure Lua implementation of the animation format would be far, far faster than the cob implementation.
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

nice discussion! :)

I want to share my experience because I'm working on low-level optimizations at my job too... I tried several approaches but ended up writing the very core in low-level SSE assembler code...

E.g. on the square root for the vector scalar product I was able to get a speedup of ~2 by using the SSE packed version SQRTPS for 4 float square roots instead of using the FPU version FSQRT 4 times.

But the biggest impact is possible with the approximate reciprocal function RSQRTPS, which is a hardware-based approximation of the reciprocal square root!
http://softwarecommunity.intel.com/arti ... g/1818.htm
So the square-root timing for a float shrinks from ~32 clock cycles down to ~4 cycles, a speedup of 8!
(timings from the table http://softwarecommunity.intel.com/arti ... g/3089.htm)

So on current hardware there is no need anymore to do this approximation by hand; just use the SSE version for even more speedup... ;)

Here are some code examples for inline-assembler square roots with timings (not by me):
http://www.musicdsp.org/showone.php?id=206

Downsides of SSE assembler code are, besides the CPU hardware dependency, that these enormous speedups can only be gained on blocked (packed) calculations on multiple data, which have to be aligned to 128-bit memory boundaries... which is sometimes complicated (or even impossible) to fulfill.
...but if the memory is aligned, significant speedups on memory transfers are possible with SSE2 too by using MOVNTQ, but that's another topic. (http://cdrom.amd.com/devconn/events/AMD ... _paper.pdf)

On the problem I optimized at my job (geometric distance mapping with scalar vector products for billions of voxels), I got an overall speedup of 10 versus the likewise-optimized C version.
Jonanin
Posts: 107
Joined: 13 Jan 2008, 21:34

Re: Optimizing Math Functions

Post by Jonanin »

Indeed inline assembler could be much faster, but I think there are portability issues with that...
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

Jonanin wrote:Indeed inline assembler could be much faster, but I think there are portability issues with that...
Hi Jonanin,
yes, you are right; using inline assembly seems to be different in every C/C++ compiler, and support for SSE1/SSE2/SSSE3 etc. might be missing.
Another problem might be OS portability and different calling conventions...
Therefore, my solution was to use the NASM assembler and produce object files, which should be easy to link in with every build system; OS portability is also no real problem because NASM supports all major object formats.

I took a look at Jonanin's fastmath.cpp and made an SSE1 version of the fastsqrt and the inverse function:

float fmsqrt(float)
float fmisqrt(float)

Additionally I tried a 4x float version (should be 2 times faster), which writes the result back to the source values' memory:
void fmsqrt4(float[4])
void fmsqrt4a(float[4]) (needs the 4 floats at a 128-bit-aligned memory address)

Left to optimize: this stuff is implemented as functions and not inlined, so the function-call overhead adds up; a check for CPU capabilities is also needed; a bigger block mode; etc.

Appended files: source plus ready-to-link WIN32 object file and Linux ELF object file. If someone would be interested in testing it with Spring, that would be great. (Hope it works... not tested! ;) )
Attachments
fastmath.zip
(1.08 KiB) Downloaded 36 times
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

What about 64bit?
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

AF wrote:What about 64bit?
Possible... but with 64 bit the calling conventions changed and, worse, are now different between Linux & Windows (http://en.wikipedia.org/wiki/Stdcall)... inlining would solve this problem.

Here is a 64-bit source & compile for Windows (only), with the same functionality as the 32-bit version (I would be really interested whether this works!... no system to test it myself).

If there is interest I would also spend time providing more complex functions, like distance or geometric/vector calculations etc., especially if these calculations can be done on a big amount of data at once (on a complete picture, texture, all coordinates of all objects... I don't know how the data is organized).

PS: the fmsqrt4a version should be 2 times faster on 64-bit systems than on 32-bit systems (according to the technical documents)... if someone can verify that...
Attachments
fastmath64.ZIP
(827 Bytes) Downloaded 33 times
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Re: Optimizing Math Functions

Post by AF »

What about sync issues and compatibility? Remember rattle being the odd one out in the entire community, not being able to run with SSE because the chipset he used didn't support it?

And what about differences in calculations between the SSE approximate answer and the fastmath answer, which might cause desync if one client supports the instruction but the other doesn't?
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

AF wrote:What about sync issues and compatibility? Remember rattle being the odd one out in the entire community, not being able to run with SSE because the chipset he used didn't support it?

And what about differences in calculations between the SSE approximate answer and the fastmath answer, which might cause desync if one client supports the instruction but the other doesn't?
Interesting question about the syncing... I don't have a clue!
About support: the capability can easily be read out from the CPU and, if not available, switched to an FPU-based approximation... but you are right, even then the accuracy might be a little bit different...
Might that lead to serious sync issues? ...Maybe, but if some code part is that sensitive, this case might also happen if one build of Spring uses the GCC -ffast-math switch (also available in the MS compiler, don't know the name...) and the other builds don't.

From a practical point of view, for the stuff I use only SSE1 is needed, which has been available since the earliest Pentium 3 (introduced in spring 1999, so SSE has been on the market for almost 10 years... 3 CPU generations!).

Overall I agree with you: such low-level optimizations should only be introduced at special, crucial bottleneck places where there is a significant gain from them.
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Re: Optimizing Math Functions

Post by Tobi »

It does desync if one client calculates stuff using SSE and the other using X87 math or an integer approximation.

(If used in synced context, of course.)

SSE has different denormal handling than X87 IIRC, meaning everyone has to run Spring with SSE, or no one does (applies to the synced parts only, ofc).

Anyway, IMO no assembler in Spring unless absolutely required (AKA the operator/instruction isn't available otherwise). Besides the hassle of fixing the build system to integrate NASM and the extra build dependency, it also needs to be checked regularly whether it's still faster than the compiler we use (I've seen hand-tuned assembler code, which used to be faster than similar C code, become slower than that C code because of smarter compilers / smarter CPUs / cache/alignment effects / etc.).

All in all, the amount of work to maintain it by far exceeds the time you win because the instruction executes a few % faster.
shaddam
Posts: 14
Joined: 23 Aug 2005, 15:49

Re: Optimizing Math Functions

Post by shaddam »

Tobi wrote:It does desync if one client calculates stuff using SSE and the other using X87 math or an integer approximation.

(If used in synced context, of course.)

SSE has different denormal handling than X87 IIRC, meaning everyone has to run Spring with SSE, or no one does (applies to the synced parts only, ofc).

Anyway, IMO no assembler in Spring unless absolutely required (AKA the operator/instruction isn't available otherwise). Besides the hassle of fixing the build system to integrate NASM and the extra build dependency, it also needs to be checked regularly whether it's still faster than the compiler we use (I've seen hand-tuned assembler code, which used to be faster than similar C code, become slower than that C code because of smarter compilers / smarter CPUs / cache/alignment effects / etc.).

All in all, the amount of work to maintain it by far exceeds the time you win because the instruction executes a few % faster.
Hi Tobi,
good point... an external assembler (like NASM) would complicate the build process, and even inline assembler might or might not be easier to maintain.
About 'a good compiler + newer hardware will instantly beat optimized asm code'... I have to disagree, sorry ;)
That's an often-repeated myth, that nowadays compilers are better than assembler. It's true that they are pretty good and produce well-optimized code (if fed with well-prepared code), but even then a significant speedup (several factors, not only some percent) is possible when using all the possibilities of assembler.

One good article on this topic:
http://www.azillionmonkeys.com/qed/opti ... #asmdebate