Optimizing Math Functions
Re: Optimizing Math Functions
gmon.out isn't created when I exit spring.
Re: Optimizing Math Functions
not bad. IMHO worth including.
Re: Optimizing Math Functions
Alright, I suppose those replays aren't too good because LzmaDecode is at the top ><
I made some new ones with 90 krog vs 90 krog (but more dispersed), so I think this gives better results.
Normal Square Root
Approximated Square Root
I think what really shows a difference is the cumulative times for CBFGroundDrawer::Draw. (I believe this is the biggest one because the krog explosions really mushed up the ground so that it dropped to 20 fps from 60 fps when idle).
Normal square root cumulative time: 13.82
Approximate square root cumulative time: 5.51
I played both replays for pretty much exactly the same amount of time.
Re: Optimizing Math Functions
Hi, I have made a FastMath.cpp file which contains fast approximations for sqrt, inverse sqrt, sin, and cos.
It contains sqrt and invsqrt at two different levels of accuracy.
Here:
http://jonanin.com/spring/FastMath.cpp
See the file for the accuracy of the sine and cosine functions.
How does it look? Please let me know if there is anything that should be changed.
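For readers who have not seen this kind of approximation, here is a minimal sketch of one well-known approach: a bit-level initial guess for 1/sqrt(x) refined by Newton-Raphson steps. This is not the contents of FastMath.cpp; the names approx_invsqrt and approx_sqrt and the newton_steps parameter are made up for illustration, and the step count stands in for the "two levels of accuracy" idea only as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch, NOT the code from FastMath.cpp.
// Approximates 1/sqrt(x) for x > 0 (normal floats).
inline float approx_invsqrt(float x, int newton_steps = 1) {
    assert(x > 0.0f);
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // read the float's bit pattern
    bits = 0x5f3759dfu - (bits >> 1);      // magic-constant initial guess
    float y;
    std::memcpy(&y, &bits, sizeof y);
    const float half_x = 0.5f * x;
    for (int i = 0; i < newton_steps; ++i)
        y = y * (1.5f - half_x * y * y);   // one Newton-Raphson refinement
    return y;
}

// sqrt(x) = x * (1/sqrt(x)), so the same approximation gives a fast sqrt.
inline float approx_sqrt(float x, int newton_steps = 1) {
    return x * approx_invsqrt(x, newton_steps);
}
```

More Newton steps trade speed for accuracy, so something like approx_invsqrt(x, 1) and approx_invsqrt(x, 2) would be the cheap and the precise variant in this sketch.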
Re: Optimizing Math Functions
Wow, that's a pretty giant improvement. 250%+?? How much of Spring's CPU load will this really impact, though? Does this impact pathfinding costs as well as stuff like explosions?
Re: Optimizing Math Functions
It will help wherever it can be (and is) used (so, wherever sqrts and inverse sqrts are used.) Yes, explosions, possibly pathfinding, not so sure on that one, but you can take a look at the source (do a search for sqrt, maybe) to find where it can be used.
The cool part is that these stats are testing it in only one function, the normalizing of floats. I have no idea how much it can really improve the performance if it's used everywhere. I will try to test that later, right now it's bed time.
Re: Optimizing Math Functions
I think fast sqrt should be implemented...
btw, what exactly is going wrong with the normal sqrt() from math.h? Does it differ between AMD and Intel, or between compiler versions? In the latter case it must be coming from optimization, so it shouldn't be a problem to write a tiny assembler function that does the square root using the fsqrt instruction.
By the way, I also have some clever code for re-normalizing vectors or quaternions whose length is already close to 1, such as the results of quaternion multiplication.
(For quaternions, add the w component as well.)
Code:
// f approximates 1/|v| when |v| is already close to 1
float f = 1.5f - 0.5f*(v.x*v.x + v.y*v.y + v.z*v.z);
v.x *= f;
v.y *= f;
v.z *= f;
This cannot be used to normalize arbitrary vectors, *only* those whose length is already close to 1. For vectors whose length is far from 1 it doesn't work at all, i.e. it diverges.
Each iteration doubles the number of correct digits of |v|, e.g. 1.000blabla or 0.999blabla becomes 1.000000blabla or 0.999999blabla.
Normally after operations on unit-length quaternions only the last few digits are wrong, and the results of this renormalization are as good as a true 1/sqrt (and much better than those of a fast sqrt approximation).
I use it myself when re-normalizing orientation quaternions after rotating an object. It can also be used as part of matrix re-orthonormalization, if Spring uses matrices to store and process orientations (sorry, I don't have time right now to check the sources myself).
How it works: let
|v|^2 = x = 1+epsilon
where epsilon is a quite small value.
Then
sqrt(1+epsilon) ~= 1+epsilon/2 (easy to derive, consider the derivative of the square near 1; it's also a well-known rule)
1/sqrt(1+epsilon) ~= 1-epsilon/2 (ditto, for the derivative of the inverse, also a well-known rule)
hence
1/|v| = 1/sqrt(x) ~= 1-(x-1)/2 = 1.5-0.5*x
where ~= means approximately equal.
If someone's interested, I can give a more detailed proof that it works.
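To make the "doubles the number of correct digits" claim easy to try out, here is a small sketch wrapping the snippet from this post in a function. Vec3, length and renormalize_near_unit are hypothetical names, not Spring's actual types, and the 0.5 to 1.5 guard on the squared length is an arbitrary assumption about what counts as "close to 1".

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the engine's vector type.
struct Vec3 { float x, y, z; };

// Exact length, used here only to inspect the result.
inline float length(const Vec3& v) {
    return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
}

// One renormalization step: scale by 1.5 - 0.5*|v|^2,
// the first-order approximation of 1/|v| near |v| = 1.
inline void renormalize_near_unit(Vec3& v) {
    const float len_sq = v.x * v.x + v.y * v.y + v.z * v.z;
    assert(len_sq > 0.5f && len_sq < 1.5f);  // only meant for nearly unit vectors
    const float f = 1.5f - 0.5f * len_sq;
    v.x *= f;
    v.y *= f;
    v.z *= f;
}
```

Starting from a vector of length 1.01, one call should bring the length to within roughly 0.0002 of 1, and a second call to within float rounding error. For a quaternion the same idea applies with w*w added to len_sq and w scaled by f.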
Re: Optimizing Math Functions
dizekat wrote: I think fast sqrt should be implemented... btw what exactly is going wrong with normal sqrt() from math.h ? Is it different between amd and intel, or it is different between different compiler versions? In latter case, that must be coming from optimization, and so it shouldnt be a problem to write tiny assembler function that does square root using fsqrt instruction.
There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy. Hopefully, yes, it will be implemented, because tests show the speed gain is substantial.
I would love to use assembly code, but it's really a portability issue.
As for the second part of your post, maybe I can put that into FastMath if people think it looks good... Do you know exactly how much faster it is?
Re: Optimizing Math Functions
Jonanin wrote: There is no 'problem', only that it is slow and the speed can be drastically improved while still maintaining acceptable accuracy.
Then why does Spring use that software math library's implementation of sqrt (streflop's sqrt)? That would make sense if the desyncs came from differences between Intel and AMD processors, but if they come from compiler optimizations, then a software math library shouldn't be required.
Jonanin wrote: As for the second part of your post, maybe I can put that into FastMath if people think it looks good... Do you know exactly how much faster it is?
I had some old benchmark code somewhere, will look for it later.
It replaces 1/sqrt(a) with 1.5-0.5*(a), which is about as fast as it can possibly get; I don't think it can get any faster than a multiply and a subtract. If you add it, make sure you comment that it is only useful for getting rid of inaccuracies, e.g. after quaternion multiplication and things like that.
btw, this can also be obtained as the first 2 terms of the Taylor series of x^-0.5 around 1. The Taylor series is 1 - 0.5*(x-1) + 1.5/4*(x-1)^2 - ....
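The Taylor view also gives a quick way to see how fast the iteration converges. A short worked derivation, where x is the squared length, delta is its deviation from 1, and f is the scale factor (symbols chosen here for illustration):

```latex
\begin{aligned}
x &= |v|^2 = 1 + \delta, \qquad f = \tfrac{3}{2} - \tfrac{1}{2}x = 1 - \tfrac{\delta}{2} \\
|f\,v|^2 &= x f^2 = (1+\delta)\left(1 - \tfrac{\delta}{2}\right)^2
          = 1 - \tfrac{3}{4}\delta^2 + \tfrac{1}{4}\delta^3
\end{aligned}
```

So an error of delta in the squared length becomes an error of about 0.75*delta^2 after one step, which is the digit doubling described earlier. It also shows why this only works near 1: once delta is of order 1 the error no longer shrinks.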
Re: Optimizing Math Functions
dizekat wrote: Then why spring uses that software math library's implementation of sqrt? (Streflop's sqrt). That would make sense if desyncs would be coming from differencies between intel and amd processors, but if those come from compiler optimizations, then software math library shouldn't be required.
Where did someone say it was because of compiler optimizations? streflop IS used because of desyncs.
Maybe I read wrong in my first reply.
Re: Optimizing Math Functions
Jonanin wrote: Where did someone say it was because of compiler optimizations? streflop IS used becuase of desyncs. Maybe I read wrong in my first reply.
My earlier post, which you replied to:
dizekat wrote: btw what exactly is going wrong with normal sqrt() from math.h ? Is it different between amd and intel, or it is different between different compiler versions?
I asked what was wrong with math.h sqrt; you said there was no problem with it except that it's slow [but it's still a lot faster than streflop], hence I asked why Spring uses streflop's sqrt (if there's no problem with math.h sqrt).
The sync issues can come from 2 sources: compiler optimization and CPU differences. A software math library is really necessary and useful only if it's a CPU-difference issue, because if it's just the compiler, it's not so hard to wrap floats in a wrapper that will not let the compiler do any optimizations.
Re: Optimizing Math Functions
I'm not aware of the details, but my guess is that usual math.h sqrt causes (or used to cause) sync issues.
Re: Optimizing Math Functions
Yes, what math.h sqrt actually does may depend on platform / compiler / compiler options etc.
Though I don't think we (Nicolas and I) ever found anomalies with sqrt, it was just safer to replace the entire libm than to exhaustively test all of it on a number of major platforms.
Realize that while exhaustively testing single precision sin/cos/sqrt is reasonable, exhaustively testing binary operations like pow is pretty much impossible on today's hardware, because there are 2^64 different possibilities for the input. If anyone can put together a decent test (even if not quite exhaustive) which actually breaks in the same ways Spring breaks, that would be much appreciated!
(IOW, GCC 3.X vs 4.X should desync, GCC 3.X any optimization vs 3.X any optimization should sync, 4.X any optimization vs 4.X any optimization should sync, and any GCC with any optimization vs MSVC 8 with any optimization should desync.)
Re: Optimizing Math Functions
I'll write an exhaustive test tomorrow, for floats and sqrt().
I'm gonna compute checksums of the results for all float numbers: one checksum for normal floats, another for denormalized ones, and a third for invalid ones (NaN and the like).
BTW, does Spring set the CPU's floating point precision on Linux to match that of Windows?
Linux apparently uses the FPU in 80 bit mode by default, which means that intermediate results stored in registers are 80 bits on Linux, and 64 bits on Windows or FreeBSD.
http://www.wrcad.com/linux_numerics.txt
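A rough sketch of what such an exhaustive checksum test could look like. Everything here is an assumption for illustration: the names SqrtChecksums, mix and checksum_sqrt are made up, std::sqrt stands in for whichever sqrt is under test, and the multiply-by-a-prime mixing step is just one simple choice of checksum. Zero and negative inputs are lumped in with the normal bucket for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

struct SqrtChecksums {
    std::uint32_t normal = 0;    // normal inputs (zero is counted here too)
    std::uint32_t denormal = 0;  // denormalized inputs
    std::uint32_t invalid = 0;   // NaN and infinity inputs
};

// Fold the bit pattern of one result into a running checksum.
inline std::uint32_t mix(std::uint32_t sum, float result) {
    std::uint32_t bits;
    std::memcpy(&bits, &result, sizeof bits);
    return (sum ^ bits) * 16777619u;
}

// Runs sqrt over every float bit pattern in [first, last].
// Passing 0 and 0xFFFFFFFF covers all single-precision inputs.
inline SqrtChecksums checksum_sqrt(std::uint32_t first, std::uint32_t last) {
    assert(first <= last);
    SqrtChecksums sums;
    // 64-bit counter so the loop terminates when last == 0xFFFFFFFF.
    for (std::uint64_t i = first; i <= last; ++i) {
        const std::uint32_t bits = static_cast<std::uint32_t>(i);
        float x;
        std::memcpy(&x, &bits, sizeof x);
        const float r = std::sqrt(x);
        switch (std::fpclassify(x)) {
            case FP_NAN:
            case FP_INFINITE:  sums.invalid  = mix(sums.invalid, r);  break;
            case FP_SUBNORMAL: sums.denormal = mix(sums.denormal, r); break;
            default:           sums.normal   = mix(sums.normal, r);   break;
        }
    }
    return sums;
}
```

The idea would be to run checksum_sqrt(0, 0xFFFFFFFF) on each compiler and platform combination and compare the three numbers. Keeping the invalid bucket separate matters because NaN results can legitimately differ in sign or payload bits between platforms even when everything else matches.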
Re: Optimizing Math Functions
Tobi wrote: Though I don't think we (Nicolas and I)
Who is Nicolas?
Re: Optimizing Math Functions
LordMatt wrote: Who is Nicolas?
Author of streflop, IIRC.
Re: Optimizing Math Functions
Hi,
Does anyone think this will be included? Or is there something else I need to do... maybe write a patch that includes FastMath.cpp and uses those sqrts?
Also, another question: in VertexArray.cpp, why aren't these functions inlined? It could have much better performance... considering that there are 400 billion calls to AddVertex0 and 70 billion calls to AddVertexTC in a 2 minute game. Wouldn't it be very beneficial to inline these?
Re: Optimizing Math Functions
It would be slightly neater if your FastMath functions were
in their own namespace (as opposed to carrying that "fm"
prefix) and if it had its own header, since now you have to
declare each function you want to use as extern. I'll add it
if you take care of those two points.
Re: Optimizing Math Functions
Alright, I have put the functions in the fastmath namespace.
I just renamed the file FastMath.h, because otherwise you still have to include the cpp file to get the inline functions to work.
Here it is
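For anyone following along, this is roughly what a header-only layout with a namespace looks like. It is a sketch under assumptions, not the actual FastMath.h: the function names sqrt and isqrt are hypothetical, and the bodies simply forward to the standard library where the real approximations would go.

```cpp
// FastMath.h (hypothetical layout, not the actual file)
#ifndef FASTMATH_H
#define FASTMATH_H

#include <cassert>
#include <cmath>

namespace fastmath {

// Placeholder body: the real approximation would replace std::sqrt here.
inline float sqrt(float x) {
    assert(x >= 0.0f);
    return std::sqrt(x);
}

// Placeholder body for the inverse square root.
inline float isqrt(float x) {
    assert(x > 0.0f);
    return 1.0f / std::sqrt(x);
}

} // namespace fastmath

#endif // FASTMATH_H
```

Because the functions are defined inline in the header, a caller only needs to include FastMath.h and write fastmath::sqrt(x). No extern declarations and no "fm" prefix are needed, and the compiler can still inline the calls.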