SSE sqrt() performance tests

hoijui
Former Engine Dev
Posts: 4344
Joined: 22 Sep 2007, 09:51

SSE sqrt() performance tests

Post by hoijui »

Background info:
Spring uses the SSE1 sqrt function. SSE requires 16-byte alignment in memory. The JVM somehow seems to mess this up, so when a Java AI invokes a callback method or issues a command that uses SSE1 functions, spring crashes.
GCC allows forcing alignment for a single function (thanks jK), but before putting that into the code, some performance testing had to be done, because the documentation implies that it could cost noticeable performance.

so i tested 4 different ways to call the SSE1 sqrt() function, which i labeled like this:
  • spring
  • spring forcedAligment
  • load
  • load forcedAligment
spring is the one currently used in spring; it uses a union to prepare the argument for the sqrt() function. load uses the SSE1 intrinsics load_ss and store_ss (_mm_load_ss and _mm_store_ss) to prepare the argument and read back the result. forcedAligment means that the function preparing the call is decorated with:

Code:

__attribute__ ((force_align_arg_pointer))
-> with this attribute, Java AIs no longer crash
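
To make the difference between the spring and load variants concrete, here is a minimal sketch of the two ways to prepare the argument. The function names sqrtUnion and sqrtLoadStore are made up for illustration, and the exact shape of the union is an assumption, not a copy of the code in spring or in sqrtTests.zip. Only the intrinsics (_mm_setzero_ps, _mm_load_ss, _mm_sqrt_ss, _mm_store_ss) are real SSE1 calls from xmmintrin.h.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// "spring" variant (assumed shape): a union gives float access to the
// lanes of an __m128 before and after the square root.
// Note: reading a union member other than the one last written is
// supported by GCC, but not guaranteed by strict ISO C++.
inline float sqrtUnion(float x) {
    union { __m128 vec; float lanes[4]; } u;
    u.vec = _mm_setzero_ps();    // clear all four lanes
    u.lanes[0] = x;              // put the argument into the low lane
    u.vec = _mm_sqrt_ss(u.vec);  // square root of the low lane only
    return u.lanes[0];
}

// "load" variant: move the float into the low lane with _mm_load_ss,
// take the root, and write the low lane back with _mm_store_ss.
inline float sqrtLoadStore(float x) {
    __m128 vec = _mm_load_ss(&x);
    vec = _mm_sqrt_ss(vec);
    float result;
    _mm_store_ss(&result, vec);
    return result;
}
```

Both functions should return the same value for the same input; the tests in this thread are only about how fast each one gets there.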

When compiling, i tried to use the same flags as when compiling spring (e.g. -O2). The flags are all listed in compile.sh.

each test loops 1111 times over the sqrt() function, and i used 3 different inputs (i is the integer loop variable, and x is initialized as: float x=0.0f):
  • x
  • i
  • i+x
as it converges to 0.0 quite fast with x, the other two tests are possibly more meaningful.
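
For readers without the attachment, a rough sketch of what such a test loop could look like. How the result is folded back into x (here: added to it) is a guess based on the description, and runTest, Input and sqrtSse are hypothetical names, not taken from sqrtTests.zip.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// The three inputs described above: x, i, and i+x.
enum class Input { X, I, IPlusX };

// SSE1 square root of a single float (load/store style).
inline float sqrtSse(float v) {
    __m128 vec = _mm_sqrt_ss(_mm_load_ss(&v));
    float result;
    _mm_store_ss(&result, vec);
    return result;
}

// Runs the chosen input variant `iterations` times and returns x, so the
// compiler cannot optimize the loop away.
// Assumption: each result is accumulated into x.
float runTest(Input input, long iterations) {
    float x = 0.0f;
    for (long i = 0; i < iterations; ++i) {
        switch (input) {
            case Input::X:      x += sqrtSse(x); break;
            case Input::I:      x += sqrtSse(static_cast<float>(i)); break;
            case Input::IPlusX: x += sqrtSse(static_cast<float>(i) + x); break;
        }
    }
    return x;
}
```

With x starting at 0.0f, the X variant never leaves zero in this sketch, which fits the remark that the i and i+x inputs are the more meaningful ones.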

The results on my two machines:
coreDuo1.8GHzSSE3_Linux32bit_GCC4.3.3

Code:

################################################################################
sqrt(x)
=======
spring          (iterations: 1000000000) test result: 0.000000
real 13.63
user 13.51
sys 0.00

spring (forced) (iterations: 1000000000) test result: 0.000000
real 13.54
user 13.54
sys 0.00

load            (iterations: 1000000000) test result: 0.000000
real 6.76
user 6.74
sys 0.00

load   (forced) (iterations: 1000000000) test result: 0.000000
real 6.17
user 6.17
sys 0.00
################################################################################
sqrt(i)
=======
spring          (iterations: 1000000000) test result: 549755813888.000000
real 27.04
user 27.04
sys 0.00

spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real 27.05
user 27.04
sys 0.00

load            (iterations: 1000000000) test result: 549755813888.000000
real 17.53
user 17.52
sys 0.00

load   (forced) (iterations: 1000000000) test result: 549755813888.000000
real 17.99
user 17.52
sys 0.00
################################################################################
sqrt(i+x)
=======
spring          (iterations: 1000000000) test result: 562949953421312.000000
real 28.52
user 28.16
sys 0.00

spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 28.25
user 28.14
sys 0.00

load            (iterations: 1000000000) test result: 562949953421312.000000
real 20.77
user 20.64
sys 0.00

load   (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 20.67
user 20.65
sys 0.00
################################################################################
AMDAthlonXP2500+SSE1_WinXP32bit_MinGW4.4.0

Code:

################################################################################
sqrt(x)
=======
spring          (iterations: 1000000000) test result: 0.000000
real	0m24.625s
user	0m0.031s
sys	0m0.000s

spring (forced) (iterations: 1000000000) test result: 0.000000
real	2m48.453s
user	0m0.015s
sys	0m0.015s

load            (iterations: 1000000000) test result: 0.000000
real	0m14.031s
user	0m0.015s
sys	0m0.031s

load   (forced) (iterations: 1000000000) test result: 0.000000
real	0m13.969s
user	0m0.015s
sys	0m0.000s
################################################################################
sqrt(i)
=======
spring          (iterations: 1000000000) test result: 549755813888.000000
real	0m25.844s
user	0m0.015s
sys	0m0.016s

spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real	2m55.703s
user	0m0.015s
sys	0m0.000s

load            (iterations: 1000000000) test result: 549755813888.000000
real	0m18.969s
user	0m0.015s
sys	0m0.015s

load   (forced) (iterations: 1000000000) test result: 549755813888.000000
real	0m19.015s
user	0m0.015s
sys	0m0.015s
################################################################################
sqrt(i+x)
=======
spring          (iterations: 1000000000) test result: 562949953421312.000000
real	0m28.421s
user	0m0.015s
sys	0m0.015s

spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real	2m58.532s
user	0m0.015s
sys	0m0.015s

load            (iterations: 1000000000) test result: 562949953421312.000000
real	0m21.265s
user	0m0.015s
sys	0m0.015s

load   (forced) (iterations: 1000000000) test result: 562949953421312.000000
real	0m21.203s
user	0m0.015s
sys	0m0.000s
################################################################################
side note: i did the same tests without optimization (no -O2), and there the forced alignment caused a performance loss of 1% to 4%.

With these results, it looks to me as if we should move to using load_ss & store_ss & forced alignment, as the forced alignment seems to have virtually no impact on performance there, and load & store seem to be considerably faster than the union we use now.
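
To put a number on "considerably faster", here is a small worked calculation using the Core Duo "real" times of the unforced spring and load runs listed above. timeSaved is a hypothetical helper written for this illustration only.

```cpp
// Fraction of wall-clock time saved when going from `before` to `after`.
constexpr double timeSaved(double before, double after) {
    return (before - after) / before;
}

// Core Duo "real" times in seconds (spring vs. load, both unforced):
//   sqrt(x):   timeSaved(13.63,  6.76)  is roughly 0.50
//   sqrt(i):   timeSaved(27.04, 17.53)  is roughly 0.35
//   sqrt(i+x): timeSaved(28.52, 20.77)  is roughly 0.27
```

So on that machine the load variant needs roughly 27% to 50% less time than the union variant, depending on the input.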
Attachment: sqrtTests.zip (source & scripts & results)
Auswaschbar
Spring Developer
Posts: 1254
Joined: 24 Jun 2007, 08:34

Re: SSE sqrt() performance tests

Post by Auswaschbar »

Tried it here:
  • didn't compile at first, because of SSE2
  • »force_align_arg_pointer« is ignored on x64 anyway, so no difference there
  • load being considerably faster than spring
zerver
Spring Developer
Posts: 1358
Joined: 16 Dec 2006, 20:59

Re: SSE sqrt() performance tests

Post by zerver »

Is this fix of mine somehow related?

http://github.com/spring/spring/commit/ ... c75e33fb12
hoijui
Former Engine Dev
Posts: 4344
Joined: 22 Sep 2007, 09:51

Re: SSE sqrt() performance tests

Post by hoijui »

Aus:
cool! :D
thanks for testing (i assume you did it on your 64bit Gentoo, as you also mentioned x64 ignoring the forced alignment).
yeah the SSE2 part was just for initial testing, it could be removed of course, i just forgot.

zerver:
not that i am an expert in this field, but i can not see a closer relation than that it fixes an alignment issue in the same code.

so if there are no good reasons against it, i propose the following changes:
  • add the following to rts/System/maindefines.h:

    Code:

    #if defined(__GNUC__) && (__GNUC__ == 4) && !defined(__arch64__)
    #define __ALIGN_ARG__ __attribute__ ((force_align_arg_pointer))
    #else
    #define __ALIGN_ARG__
    #endif
    
  • use __ALIGN_ARG__ for the sse sqrt() function in fastmath
  • use load & store in all the applicable places
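
Put together, the proposed changes could look roughly like the sketch below. SKETCH_ALIGN_ARG stands in for the proposed __ALIGN_ARG__ macro and uses __i386__ as its 32-bit check, which is an assumption of this sketch and not the condition from the proposal. sqrtAligned is a hypothetical name, not the actual function in fastmath.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical stand-in for __ALIGN_ARG__: only request stack realignment
// on 32-bit x86 with a GCC-compatible compiler, expand to nothing elsewhere.
#if defined(__GNUC__) && defined(__i386__)
#define SKETCH_ALIGN_ARG __attribute__((force_align_arg_pointer))
#else
#define SKETCH_ALIGN_ARG
#endif

// With the attribute, GCC realigns the stack on function entry, so SSE
// code inside still sees aligned memory when the caller (for example
// code called from a JVM) left the stack misaligned.
SKETCH_ALIGN_ARG inline float sqrtAligned(float x) {
    __m128 vec = _mm_load_ss(&x);  // argument into the low lane
    vec = _mm_sqrt_ss(vec);        // square root of the low lane
    float result;
    _mm_store_ss(&result, vec);    // low lane back to a float
    return result;
}
```

Because the macro expands to nothing outside 32-bit x86, the same source should still build on x64, where the attribute was reported as ignored anyway.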
jK
Spring Developer
Posts: 2299
Joined: 28 Jun 2007, 07:30

Re: SSE sqrt() performance tests

Post by jK »

:shock:
I always thought the mm_load/mm_set variant would be slower here. In theory it should need more CPU ops (an additional write, it needs to copy the function argument, etc.).

But yeah, same results here (athlonXP 2.5Ghz). I also found out that the massive performance decrease is caused by -march=i686; with -march=athlon-xp everything is fine. Still, I couldn't find the reason why the union is so much slower than load/set. I haven't disassembled the final binary yet; it would be interesting to do so.
Also, I don't see a problem with the compiler flag then (when using load/set): because it's limited to that one function, the reduced number of usable registers doesn't matter.
And a 25%-50% faster sqrt is always fine :mrgreen:
"good work"
YokoZar
Posts: 883
Joined: 15 Jul 2007, 22:02

Re: SSE sqrt() performance tests

Post by YokoZar »

Very cool indeed. Good news :)
imbaczek
Posts: 3629
Joined: 22 Aug 2006, 16:19

Re: SSE sqrt() performance tests

Post by imbaczek »

jk: sync testing -march=i686 vs -march=athlon-xp is a good idea IMHO. we could provide an athlon-optimized exe if it works, there's plenty of those pretty old processors in spring's playerbase if I'm not mistaken.
hoijui
Former Engine Dev
Posts: 4344
Joined: 22 Sep 2007, 09:51

Re: SSE sqrt() performance tests

Post by hoijui »

+1 :D
tizbac
Posts: 136
Joined: 19 Jun 2008, 14:05

Re: SSE sqrt() performance tests

Post by tizbac »

I can do test games with an amd athlon 64 bit if it's needed :-)
el_matarife
Posts: 933
Joined: 27 Feb 2006, 02:04

Re: SSE sqrt() performance tests

Post by el_matarife »

imbaczek wrote: jk: sync testing -march=i686 vs -march=athlon-xp is a good idea IMHO. we could provide an athlon-optimized exe if it works, there's plenty of those pretty old processors in spring's playerbase if I'm not mistaken.
If the recompile just changes the Spring.exe binary file, why not have SpringDownloader (and other OS equivalents) automatically pull down a version compiled specifically for your processor architecture? (Assuming optimizations won't break sync)