SSE sqrt() performance tests

Post by **hoijui** » 18 Sep 2009, 11:45

Background info:
Spring uses the SSE1 sqrt function. SSE requires 16-byte alignment in memory. The JVM somehow seems to mess this up, so when a Java AI invokes a callback method or issues a command that uses SSE1 functions, Spring crashes.
GCC allows forcing alignment in a method (thanks jk), but before putting it into the code, some performance testing had to be done, since the documentation implies that this could cost noticeable performance.

So I tested 4 different ways to call the SSE1 sqrt() function, which I labeled like this:

spring
spring forcedAlignment
load
load forcedAlignment

spring is the one currently used in Spring; it uses a union to prepare the argument for the sqrt() function. load uses the SSE1 instructions load_ss and store_ss to prepare the argument and read back the result. forcedAlignment has the method preparing the call decorated with:

__attribute__ ((force_align_arg_pointer))

-> no crash for Java AIs
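
To make the difference between the two variants concrete, here is a rough sketch of what they might look like. This is not the actual Spring or test code: the function names sqrt_union and sqrt_load are made up for illustration, and the exact way the union is filled is an assumption.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// NOTE: hypothetical names and shapes, not the actual Spring source.

// "spring" variant (assumed): a union prepares the argument
// and is also used to read the result back.
static inline float sqrt_union(float x) {
    union {
        __m128 vec;
        float f[4];
    } u;
    u.vec = _mm_setzero_ps();    // give all four lanes a defined value
    u.f[0] = x;                  // put the argument into the lowest lane
    u.vec = _mm_sqrt_ss(u.vec);  // square root of the lowest lane only
    return u.f[0];
}

// "load" variant: load_ss / store_ss move the scalar in and out.
static inline float sqrt_load(float x) {
    __m128 vec = _mm_load_ss(&x);  // lowest lane = x, upper lanes = 0
    vec = _mm_sqrt_ss(vec);
    float result;
    _mm_store_ss(&result, vec);    // write the lowest lane back to memory
    return result;
}
```

Both should return the same value for the same input, e.g. sqrt_union(4.0f) and sqrt_load(4.0f) both give 2.0f; only the way the scalar gets into and out of the SSE register differs.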

When compiling, I tried to use the same flags as when compiling Spring (e.g. -O2). The flags are all listed in compile.sh.

Each test loops 1111 times over the sqrt() function, and I used 3 different inputs (i is the integer loop variable, and x is initialized as: float x=0.0f):

x
i
i+x

As it converges to 0.0 quite fast with x, the other two tests are possibly more meaningful.
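
For anyone who wants to reproduce something similar, the test loop could be sketched roughly as follows. This is an assumption about the structure, not the real test code: the names Input, run_test and time_test are made up, the result is assumed to be accumulated into x, and std::chrono is used here for timing, whereas the numbers below were taken with the external time command.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics
#include <chrono>

// Hypothetical helper: the "load" variant of the SSE1 sqrt.
static inline float sqrt_load(float x) {
    __m128 vec = _mm_sqrt_ss(_mm_load_ss(&x));
    float result;
    _mm_store_ss(&result, vec);
    return result;
}

// The three inputs that were tested: x, i and i+x.
enum Input { INPUT_X, INPUT_I, INPUT_I_PLUS_X };

// Assumed loop shape: call sqrt() once per iteration and accumulate
// into x, so the compiler cannot drop the call as unused.
static float run_test(Input input, unsigned long iterations) {
    float x = 0.0f;
    for (unsigned long i = 0; i < iterations; ++i) {
        float arg;
        switch (input) {
            case INPUT_X: arg = x; break;
            case INPUT_I: arg = static_cast<float>(i); break;
            default:      arg = static_cast<float>(i) + x; break;
        }
        x += sqrt_load(arg);
    }
    return x;
}

// Returns the wall-clock time in seconds; the test result is written
// to *result so it can be printed next to the timing.
static double time_test(Input input, unsigned long iterations, float* result) {
    const auto start = std::chrono::steady_clock::now();
    *result = run_test(input, iterations);
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

With x starting at 0.0f, the INPUT_X case never leaves 0.0, which is why the other two inputs say more about the real cost.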

The results on my two machines:
coreDuo1.8GHzSSE3_Linux32bit_GCC4.3.3

################################################################################
sqrt(x)
=======
spring          (iterations: 1000000000) test result: 0.000000
real 13.63
user 13.51
sys 0.00

spring (forced) (iterations: 1000000000) test result: 0.000000
real 13.54
user 13.54
sys 0.00

load            (iterations: 1000000000) test result: 0.000000
real 6.76
user 6.74
sys 0.00

load   (forced) (iterations: 1000000000) test result: 0.000000
real 6.17
user 6.17
sys 0.00
################################################################################
sqrt(i)
=======
spring          (iterations: 1000000000) test result: 549755813888.000000
real 27.04
user 27.04
sys 0.00

spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real 27.05
user 27.04
sys 0.00

load            (iterations: 1000000000) test result: 549755813888.000000
real 17.53
user 17.52
sys 0.00

load   (forced) (iterations: 1000000000) test result: 549755813888.000000
real 17.99
user 17.52
sys 0.00
################################################################################
sqrt(i+x)
=======
spring          (iterations: 1000000000) test result: 562949953421312.000000
real 28.52
user 28.16
sys 0.00

spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 28.25
user 28.14
sys 0.00

load            (iterations: 1000000000) test result: 562949953421312.000000
real 20.77
user 20.64
sys 0.00

load   (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 20.67
user 20.65
sys 0.00
################################################################################

AMDAthlonXP2500+SSE1_WinXP32bit_MinGW4.4.0

################################################################################
sqrt(x)
=======
spring          (iterations: 1000000000) test result: 0.000000
real	0m24.625s
user	0m0.031s
sys	0m0.000s

spring (forced) (iterations: 1000000000) test result: 0.000000
real	2m48.453s
user	0m0.015s
sys	0m0.015s

load            (iterations: 1000000000) test result: 0.000000
real	0m14.031s
user	0m0.015s
sys	0m0.031s

load   (forced) (iterations: 1000000000) test result: 0.000000
real	0m13.969s
user	0m0.015s
sys	0m0.000s
################################################################################
sqrt(i)
=======
spring          (iterations: 1000000000) test result: 549755813888.000000
real	0m25.844s
user	0m0.015s
sys	0m0.016s

spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real	2m55.703s
user	0m0.015s
sys	0m0.000s

load            (iterations: 1000000000) test result: 549755813888.000000
real	0m18.969s
user	0m0.015s
sys	0m0.015s

load   (forced) (iterations: 1000000000) test result: 549755813888.000000
real	0m19.015s
user	0m0.015s
sys	0m0.015s
################################################################################
sqrt(i+x)
=======
spring          (iterations: 1000000000) test result: 562949953421312.000000
real	0m28.421s
user	0m0.015s
sys	0m0.015s

spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real	2m58.532s
user	0m0.015s
sys	0m0.015s

load            (iterations: 1000000000) test result: 562949953421312.000000
real	0m21.265s
user	0m0.015s
sys	0m0.015s

load   (forced) (iterations: 1000000000) test result: 562949953421312.000000
real	0m21.203s
user	0m0.015s
sys	0m0.000s
################################################################################

Side note: I did the same tests without optimization (no -O2), and there the forced alignment caused a performance loss of between 1% and 4%.

With these results, it looks to me as if we should move to using load_ss & store_ss & forced alignment, as the forced alignment seems to have virtually no impact on performance there, and load & store seem to be considerably faster than the union we use now.

Post by **Auswaschbar** » 18 Sep 2009, 12:43

Tried it here:

didn't compile at first because of SSE2
»force_align_arg_pointer« is ignored on x64 anyway, so no difference there
load being considerably faster than spring

Post by **zerver** » 18 Sep 2009, 13:03

Is this fix of mine somehow related?

http://github.com/spring/spring/commit/ ... c75e33fb12

Post by **hoijui** » 18 Sep 2009, 14:22

Aus:
cool! :D
thanks for testing (I assume you did it on your 64bit Gentoo, as you also mentioned x64 ignoring the forced alignment).
yeah, the SSE2 part was just for initial testing; it could be removed of course, I just forgot.

zerver:
Not that I am an expert in this field, but I cannot see a closer relation than fixing an alignment issue with the same code.

So if there are no good reasons against it, I propose the following changes:

add the following to rts/System/maindefines.h:

#if defined(__GNUC__) && (__GNUC__ == 4) && !defined(__arch64__)
#define __ALIGN_ARG__ __attribute__ ((force_align_arg_pointer))
#else
#define __ALIGN_ARG__
#endif

use __ALIGN_ARG__ for the SSE sqrt() function in fastmath
use load & store in all the applicable places
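
Put together, the proposed changes could look roughly like the sketch below. The macro is the one proposed for maindefines.h; the function name sqrt_sse is only a placeholder and not necessarily what fastmath uses.

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Macro as proposed for rts/System/maindefines.h. On compilers that
// do not match the condition it expands to nothing.
#if defined(__GNUC__) && (__GNUC__ == 4) && !defined(__arch64__)
#define __ALIGN_ARG__ __attribute__ ((force_align_arg_pointer))
#else
#define __ALIGN_ARG__
#endif

// Hypothetical name; sketch of an SSE1 sqrt using load & store.
// Where the macro is active, the stack is realigned on entry, so the
// function is safe to reach from a caller with a misaligned stack
// (such as a call coming in from the JVM).
__ALIGN_ARG__ static inline float sqrt_sse(float x) {
    __m128 vec = _mm_load_ss(&x);  // lowest lane = x, upper lanes = 0
    vec = _mm_sqrt_ss(vec);        // square root of the lowest lane
    float result;
    _mm_store_ss(&result, vec);    // read the result back
    return result;
}
```

Since the attribute sits on this one function only, the realignment cost is limited to calls of sqrt_sse and does not affect the rest of the code.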

Post by **jK** » 18 Sep 2009, 17:39

I always thought the mm_load mm_set would be slower here. In theory it should need more cpu ops (additional write, needs to copy the function argument etc.).

But yeah, same results here (AthlonXP 2.5GHz). I also found out that the massive performance decrease is caused by -march=i686; with -march=athlon-xp everything is fine. Still, I couldn't find the reason why the union is so much slower than load/set. I haven't disassembled the final binary yet; it would be interesting to do so.
Also, I don't see a problem with the compiler flag then (when using load/set), because it's limited to that function, so the reduced number of usable registers doesn't matter.
And a 25%-50% faster sqrt is always fine

"good work"

Post by **YokoZar** » 20 Sep 2009, 15:24

Very cool indeed. Good news :)

Post by **imbaczek** » 22 Sep 2009, 14:20

jk: sync testing -march=i686 vs -march=athlon-xp is a good idea IMHO. we could provide an athlon-optimized exe if it works, there's plenty of those pretty old processors in spring's playerbase if I'm not mistaken.

Post by **hoijui** » 22 Sep 2009, 16:32

+1 :D

Post by **tizbac** » 22 Sep 2009, 16:47

I can do test games with an amd athlon 64 bit if it's needed

Post by **el_matarife** » 30 Sep 2009, 01:02

imbaczek wrote: jk: sync testing -march=i686 vs -march=athlon-xp is a good idea IMHO. we could provide an athlon-optimized exe if it works, there's plenty of those pretty old processors in spring's playerbase if I'm not mistaken.

If the recompile just changes the Spring.exe binary file, why not have SpringDownloader (and other OS equivalents) automatically pull down a version compiled specifically for your processor architecture? (Assuming optimizations won't break sync)
