Optimizing Math Functions

Jonanin · Post by **Jonanin** » 16 Mar 2008, 19:54

gmon.out isn't created when I exit spring.

Jonanin · Post by **Jonanin** » 16 Mar 2008, 20:39

Alright, I got it working.

I played a replay with 45 krog vs 45 krog on DeltaSiegeDryRevX_v3.

With Square root approximation: here
With normal square root: here

Not quite sure how to interpret this correctly, other than noticing significantly less time spent on streflop sqrt.

Beware, 5.3mb files.

imbaczek · Post by **imbaczek** » 17 Mar 2008, 18:35

not bad. IMHO worth including.

Jonanin · Post by **Jonanin** » 17 Mar 2008, 20:13

Alright, I suppose those replays aren't too good because LzmaDecode is at the top ><

I made some new ones with 90 krog vs 90 krog (but more dispersed), so I think this gives better results.

Normal Square Root
Approximated Square Root

I think what really shows a difference is the cumulative times for CBFGroundDrawer::Draw. (I believe this is the biggest one because the krog explosions really mushed up the ground so that it dropped to 20 fps from 60 fps when idle).

Normal square root cumulative time: 13.82
Approximate square root cumulative time: 5.51

I played replays pretty much exactly the same amount of time.

Jonanin · Post by **Jonanin** » 19 Mar 2008, 08:16

Hi, I have made a FastMath.cpp file which contains fast approximations for sqrt, inverse sqrt, sin, and cos.

It contains sqrt and invsqrt at two different levels of accuracy.

Here:
http://jonanin.com/spring/FastMath.cpp

See the file for the accuracy of the sine and cosine functions.

How does it look? Please let me know if there is anything that should be changed.
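
For readers who cannot open the linked file, here is a minimal sketch of the kind of approximation under discussion. It is not the contents of FastMath.cpp: the names `approx_invsqrt` and `approx_sqrt`, the `newton_steps` parameter, and the constant `0x5f3759df` are assumptions taken from the widely known bit-trick inverse square root for 32-bit IEEE-754 floats. Varying the number of Newton steps is one plausible way to offer two levels of accuracy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, NOT the actual FastMath.cpp code.
// Approximates 1/sqrt(x) for positive, 32-bit IEEE-754 floats.
inline float approx_invsqrt(float x, int newton_steps = 1) {
    assert(x > 0.0f);
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // read the float's bit pattern safely
    bits = 0x5f3759dfu - (bits >> 1);      // rough initial guess via the exponent
    float y;
    std::memcpy(&y, &bits, sizeof y);

    const float half_x = 0.5f * x;
    for (int i = 0; i < newton_steps; ++i) {
        y = y * (1.5f - half_x * y * y);   // Newton-Raphson step for 1/sqrt(x)
    }
    return y;
}

// sqrt(x) = x * (1/sqrt(x)), so the same approximation serves both functions.
inline float approx_sqrt(float x, int newton_steps = 1) {
    return x * approx_invsqrt(x, newton_steps);
}
```

Calling `approx_invsqrt(4.0f)` gives a value close to 0.5, and `approx_sqrt(9.0f, 2)` spends a second Newton step to get closer to 3. Whether this matches the trade-off chosen in the real file is not something the sketch can confirm.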

Argh · Post by **Argh** » 19 Mar 2008, 08:43

Wow, that's a pretty giant improvement. 250%+?? How much of Spring's CPU load will this really impact, though? Does this impact pathfinding costs as well as stuff like explosions?

Jonanin · Post by **Jonanin** » 19 Mar 2008, 09:11

It will help wherever it can be (and is) used (so, wherever sqrts and inverse sqrts are used.) Yes, explosions, possibly pathfinding, not so sure on that one, but you can take a look at the source (do a search for sqrt, maybe) to find where it can be used.

The cool part is that these stats are testing it in only one function, the normalizing of floats. I have no idea how much it can really improve the performance if it's used everywhere. I will try to test that later, right now it's bed time.

dizekat · Post by **dizekat** » 19 Mar 2008, 22:54

I think fast sqrt should be implemented...

btw what exactly is going wrong with normal sqrt() from math.h ? Is it different between amd and intel, or it is different between different compiler versions? In latter case, that must be coming from optimization, and so it shouldnt be a problem to write tiny assembler function that does square root using fsqrt instruction.

By the way, I also have some clever code for re-normalization of vectors or quaternions whose length is already close to 1 , such as results of quaternion multiplication .

float f=1.5-0.5*(v.x*v.x+v.y*v.y+v.z*v.z);
v.x*=f;
v.y*=f;
v.z*=f;

(add w value for quaternions)

This thing cannot be used to normalize arbitrary vectors. *Only* those whose length is already close to 1. For vectors whose length is far from 1, it doesn't work at all, i.e. it diverges.

Each iteration of this doubles number of correct digits of |v| , e.g. 1.000blabla or 0.999blabla become 1.000000blabla or 0.999999blabla
Normally after operations on unit-length quaternions you have very few wrong digits on the end, and the results of this renormalization are as good as true 1/sqrt (and much better than those with fast sqrt approximation).

I use it myself when re-normalizing orientation quaternions after performing rotation of object. It can also be used as part of matrix re-orthonormalization, if spring uses matrices to store and process orientations (sorry, i dont have time right now to check sources myself)
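
A self-contained sketch of the renormalization step described in this post may help. The `Vec3` type, the function names, and the guard band in the assertion are assumptions made for illustration; only the scale factor `1.5 - 0.5*|v|^2` comes from the post.

```cpp
#include <cassert>

// Hypothetical stand-in for an engine vector type.
struct Vec3 { float x, y, z; };

inline float length_squared(const Vec3& v) {
    return v.x * v.x + v.y * v.y + v.z * v.z;
}

// One renormalization step for a vector whose length is already close to 1.
// Scales v by 1.5 - 0.5*|v|^2, which approximates 1/|v| near |v| = 1.
inline void renormalize_near_unit(Vec3& v) {
    const float len2 = length_squared(v);
    // Guard band chosen arbitrarily for this sketch; far from 1 the step diverges.
    assert(len2 > 0.5f && len2 < 1.5f);
    const float f = 1.5f - 0.5f * len2;
    v.x *= f;
    v.y *= f;
    v.z *= f;
}
```

If the length before the step is 1+e, the length afterwards is 1 - 1.5e^2 - 0.5e^3, so a vector of length 1.001 ends up within roughly 1.5e-6 of unit length after a single step.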

How it works: let
x = |v|^2 = 1+epsilon (the squared length, which is what the code above computes)
where epsilon is quite a small value.
then
sqrt(1+epsilon) ~= 1+epsilon/2 (easy to derive, consider derivative of square near 1. Also it's a well known rule)
1/sqrt(1+epsilon) ~= 1-epsilon/2 (ditto, for derivative of inverse, also well known rule)
hence
1/|v| = 1/sqrt(x) ~= 1-(x-1)/2 = 1.5-0.5*x

where ~= means approximately equal.

If someone's interested, i can give more detailed proof that it works.

Jonanin · Post by **Jonanin** » 19 Mar 2008, 23:11

dizekat wrote:I think fast sqrt should be implemented...

btw what exactly is going wrong with normal sqrt() from math.h ? Is it different between amd and intel, or it is different between different compiler versions? In latter case, that must be coming from optimization, and so it shouldnt be a problem to write tiny assembler function that does square root using fsqrt instruction.

There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy. Hopefully yes it will be implemented, because tests show the speed to be gained is quite a bit.

I would love to use assembly code but it's really a portability issue.

As for the second part of your post, maybe I can put that into FastMath if people think it looks good... Do you know exactly how much faster it is?

dizekat · Post by **dizekat** » 20 Mar 2008, 07:46

There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy.

Then why does spring use that software math library's implementation of sqrt? (Streflop's sqrt). That would make sense if desyncs were coming from differences between intel and amd processors, but if those come from compiler optimizations, then a software math library shouldn't be required.

As for the second part of your post, maybe I can put that into FastMath if people think it looks good... Do you know exactly how much faster it is?

I had some old benchmark code somewhere, will look for it later.
it replaces 1/sqrt(a) with 1.5-0.5*(a) , which is about as much faster as ever possible, i think it cant get any faster than multiply and subtract. If you insert it, make sure you comment that it is only useful for getting rid of inaccuracies, eg after quaternion multiplication and things like that.

btw, this thing can also be obtained as 2 terms of Taylor series of x^-0.5 around 1. The taylor series are 1 - 0.5*(x-1) + 1.5/4*(x-1)^2 - ....
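
Written out, the expansion mentioned above and its two-term truncation look like this. Here x stands for the squared length of the vector, which is an assumption carried over from the code earlier in the thread.

```latex
\[
x^{-1/2} = 1 - \tfrac{1}{2}(x-1) + \tfrac{3}{8}(x-1)^2 - \tfrac{5}{16}(x-1)^3 + \cdots
\]
\[
x^{-1/2} \approx 1 - \tfrac{1}{2}(x-1) = \tfrac{3}{2} - \tfrac{1}{2}\,x,
\qquad x = |v|^2 .
\]
```

The first dropped term, 3/8 (x-1)^2, is what limits the accuracy when x is not close to 1.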

Jonanin · Post by **Jonanin** » 20 Mar 2008, 08:04

dizekat wrote:
There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy.
Then why does spring use that software math library's implementation of sqrt? (Streflop's sqrt). That would make sense if desyncs were coming from differences between intel and amd processors, but if those come from compiler optimizations, then a software math library shouldn't be required.

Where did someone say it was because of compiler optimizations? streflop IS used because of desyncs.

Maybe I read wrong in my first reply.

dizekat · Post by **dizekat** » 20 Mar 2008, 08:22

my earlier post, which you replied to:

dizekat wrote: btw what exactly is going wrong with normal sqrt() from math.h ? Is it different between amd and intel, or it is different between different compiler versions?

Jonanin wrote:
dizekat wrote:
There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy.
Then why does spring use that software math library's implementation of sqrt? (Streflop's sqrt). That would make sense if desyncs were coming from differences between intel and amd processors, but if those come from compiler optimizations, then a software math library shouldn't be required.

Where did someone say it was because of compiler optimizations? streflop IS used because of desyncs.

Maybe I read wrong in my first reply.

I asked what was wrong with math.h sqrt, you said that there was no problem with math.h sqrt except that it's slow [but it's still a lot faster than streflop], hence I asked why spring uses streflop sqrt (if there's no problem with math.h sqrt).
The sync issues can come from 2 sources: compiler optimization, and cpu differences. A software math library is really necessary and useful only if it's a cpu differences issue, because if it's just the compiler, it's not so hard to wrap floats in a wrapper that will not let the compiler do any optimizations.

imbaczek · Post by **imbaczek** » 20 Mar 2008, 09:36

I'm not aware of the details, but my guess is that usual math.h sqrt causes (or used to cause) sync issues.

Tobi · Post by **Tobi** » 20 Mar 2008, 20:43

Yes, what math.h sqrt actually does may depend on platform / compiler / compiler options etc.

Though I don't think we (Nicolas and I) ever found anomalies with sqrt, it was just safer to replace the entire libm than to exhaustively test all of it on a number of major platforms.

Realize that while exhaustively testing single precision sin/cos/sqrt is reasonable; exhaustively testing binary operations like pow is pretty much impossible on today's hardware, because there are 2^64 different possibilities for the input. If anyone can put together a decent test (even if not quite exhaustive) which does actually break the same ways spring breaks, that would be much appreciated!

(IOW, GCC 3.X vs 4.X should desync, GCC 3.X any optimization vs 3.X any optimization should sync, 4.X any optimization vs 4.X any optimization should sync, and any GCC with any optimization vs MSVC 8 with any optimization should desync.)

dizekat · Post by **dizekat** » 20 Mar 2008, 23:16

I'll write an exhaustive test tomorrow, for floats and sqrt().
I'm gonna compute checksums of all results for all float numbers: one checksum for normal floats, another for denormalized, and a third for invalid (NaN and the like).
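
A rough sketch of what such a test could look like follows. Everything in it is an assumption for illustration: the struct and function names, the seed and multiplier used for mixing, and the choice to bucket results by the class of the input rather than the output. It checksums `std::sqrt` rather than any particular library's sqrt.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Arbitrary seed and multiplier (FNV-style constants) for a simple xor-multiply mix.
const std::uint64_t kSeed = 14695981039346656037ULL;
const std::uint64_t kPrime = 1099511628211ULL;

struct SqrtChecksums {
    std::uint64_t normal = kSeed;    // zero and normal inputs (including negatives)
    std::uint64_t denormal = kSeed;  // subnormal inputs
    std::uint64_t invalid = kSeed;   // NaN and infinity inputs
};

inline void mix(std::uint64_t& h, std::uint32_t v) {
    h = (h ^ v) * kPrime;
}

// Checksums sqrt over every float whose bit pattern lies in [first, last].
inline SqrtChecksums checksum_sqrt(std::uint32_t first, std::uint32_t last) {
    assert(first <= last);
    SqrtChecksums sums;
    // 64-bit counter so that last == 0xffffffff does not wrap around.
    for (std::uint64_t i = first; i <= last; ++i) {
        const std::uint32_t in_bits = static_cast<std::uint32_t>(i);
        float x;
        std::memcpy(&x, &in_bits, sizeof x);
        const float r = std::sqrt(x);
        std::uint32_t out_bits;
        std::memcpy(&out_bits, &r, sizeof out_bits);
        switch (std::fpclassify(x)) {
            case FP_SUBNORMAL: mix(sums.denormal, out_bits); break;
            case FP_NAN:
            case FP_INFINITE:  mix(sums.invalid, out_bits); break;
            default:           mix(sums.normal, out_bits); break;
        }
    }
    return sums;
}
```

A full run would be `checksum_sqrt(0u, 0xffffffffu)`, which covers all 2^32 bit patterns; comparing the three checksums between two builds or machines then shows in which class of inputs, if any, the results differ.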

BTW, did spring set floating point precision of cpu on linux to match that of windows?
Linux apparently uses the FPU in 80 bit mode by default, which means that intermediate results stored in registers are 80 bits on linux, and 64 bits on windows or freebsd.
http://www.wrcad.com/linux_numerics.txt

LordMatt · Post by **LordMatt** » 21 Mar 2008, 00:40

Tobi wrote: Though I don't think we (Nicolas and I)

Who is Nicolas?

Jonanin · Post by **Jonanin** » 21 Mar 2008, 03:01

LordMatt wrote:
Tobi wrote: Though I don't think we (Nicolas and I)
Who is Nicolas?

Author of streflop, IIRC.

Jonanin · Post by **Jonanin** » 22 Mar 2008, 23:20

Hi,

Does anyone think this will be included? Or is there something else I need to do... maybe write a patch that included FastMath.cpp and uses those sqrts?

Also, another question: in VertexArray.cpp, why aren't these functions inlined? It could have much better performance... considering that there are 400 billion calls to AddVertex0 and 70 billion calls to AddVertexTC in a 2 minute game. Wouldn't it be very beneficial to inline these?

Kloot · Post by **Kloot** » 23 Mar 2008, 00:31

It would be slightly neater if your FastMath functions were
in their own namespace (as opposed to carrying that "fm"
prefix) and if it had its own header, since now you have to
declare each function you want to use as extern. I'll add it
if you take care of those two points.

Jonanin · Post by **Jonanin** » 23 Mar 2008, 02:12

Alright, I have made it in the fastmath namespace.

I just renamed the file FastMath.h, because otherwise you still have to include the cpp file to get the inline functions to work.

Here it is
