Spring uses the SSE1 sqrt function. SSE requires 16-byte alignment in memory. The JVM somehow seems to mess this up, so when a Java AI invokes a callback method or issues a command that uses SSE1 functions, spring crashes.
GCC allows forcing alignment in a method (thanks jk), but before putting it into the code, some performance testing had to be done, because the documentation implies that this could cost noticeable performance.
So I tested 4 different ways to call the SSE1 sqrt() function, which I labeled like this:
- spring
- spring forcedAlignment
- load
- load forcedAlignment
The forcedAlignment variants add this GCC attribute to the function:
__attribute__ ((force_align_arg_pointer))
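To make the four labels concrete, here is a minimal sketch of what such variants could look like. The function names are made up for illustration, and the bodies are an assumption based on the descriptions in this post (a union for "spring", load_ss and store_ss for "load"); the actual Spring code may differ.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics; needs GCC on an x86 target with SSE enabled */

/* "spring": read the result back through a union (assumed shape). */
static float sqrt_union(float x)
{
    union { __m128 v; float f[4]; } u;
    u.v = _mm_sqrt_ss(_mm_set_ss(x));
    return u.f[0];
}

/* "spring forcedAlignment": same body, stack realigned on entry. */
__attribute__ ((force_align_arg_pointer))
static float sqrt_union_forced(float x)
{
    union { __m128 v; float f[4]; } u;
    u.v = _mm_sqrt_ss(_mm_set_ss(x));
    return u.f[0];
}

/* "load": move the scalar in and out with load_ss / store_ss. */
static float sqrt_load(float x)
{
    float result;
    __m128 v = _mm_load_ss(&x);
    v = _mm_sqrt_ss(v);
    _mm_store_ss(&result, v);
    return result;
}

/* "load forcedAlignment": same body, stack realigned on entry. */
__attribute__ ((force_align_arg_pointer))
static float sqrt_load_forced(float x)
{
    float result;
    __m128 v = _mm_load_ss(&x);
    v = _mm_sqrt_ss(v);
    _mm_store_ss(&result, v);
    return result;
}
```

All four should return the same value for the same input. Only the way the float gets in and out of the SSE register, and the function prologue, differ.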
When compiling, I tried to use the same flags as when compiling spring (e.g. -O2). The flags are all listed in compile.sh.
Each test loops 1111 times over the sqrt() function, and I used 3 different inputs (i is the integer loop variable, and x is initialized as: float x=0.0f):
- x
- i
- i+x
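A rough sketch of how such a test loop could be written, for anyone who wants to reproduce this. The names run_test, sse_sqrt and input_kind are hypothetical, and accumulating the result into x is an assumption (it keeps the compiler from dropping the calls); the real test code may be structured differently.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics; needs an x86 target with SSE enabled */

/* Which expression is fed to sqrt() on each iteration. */
enum input_kind { INPUT_X, INPUT_I, INPUT_I_PLUS_X };

/* Stand-in for the function under test (here: the load_ss / store_ss way). */
static float sse_sqrt(float value)
{
    float result;
    _mm_store_ss(&result, _mm_sqrt_ss(_mm_load_ss(&value)));
    return result;
}

/* Calls sse_sqrt() once per iteration and returns the accumulated sum. */
static float run_test(enum input_kind kind, long iterations)
{
    float x = 0.0f;
    for (long i = 0; i < iterations; ++i) {
        float input;
        switch (kind) {
        case INPUT_X: input = x;            break;
        case INPUT_I: input = (float)i;     break;
        default:      input = (float)i + x; break;
        }
        x += sse_sqrt(input);  /* assumed: result feeds back into x */
    }
    return x;
}
```

Wrapping a call like run_test(INPUT_I, iterations) in a small main() and running the binary under time gives real/user/sys numbers of the kind listed below.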
The results on my two machines:
coreDuo1.8GHzSSE3_Linux32bit_GCC4.3.3
################################################################################
sqrt(x)
=======
spring (iterations: 1000000000) test result: 0.000000
real 13.63
user 13.51
sys 0.00
spring (forced) (iterations: 1000000000) test result: 0.000000
real 13.54
user 13.54
sys 0.00
load (iterations: 1000000000) test result: 0.000000
real 6.76
user 6.74
sys 0.00
load (forced) (iterations: 1000000000) test result: 0.000000
real 6.17
user 6.17
sys 0.00
################################################################################
sqrt(i)
=======
spring (iterations: 1000000000) test result: 549755813888.000000
real 27.04
user 27.04
sys 0.00
spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real 27.05
user 27.04
sys 0.00
load (iterations: 1000000000) test result: 549755813888.000000
real 17.53
user 17.52
sys 0.00
load (forced) (iterations: 1000000000) test result: 549755813888.000000
real 17.99
user 17.52
sys 0.00
################################################################################
sqrt(i+x)
=======
spring (iterations: 1000000000) test result: 562949953421312.000000
real 28.52
user 28.16
sys 0.00
spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 28.25
user 28.14
sys 0.00
load (iterations: 1000000000) test result: 562949953421312.000000
real 20.77
user 20.64
sys 0.00
load (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 20.67
user 20.65
sys 0.00
################################################################################
Second machine:
################################################################################
sqrt(x)
=======
spring (iterations: 1000000000) test result: 0.000000
real 0m24.625s
user 0m0.031s
sys 0m0.000s
spring (forced) (iterations: 1000000000) test result: 0.000000
real 2m48.453s
user 0m0.015s
sys 0m0.015s
load (iterations: 1000000000) test result: 0.000000
real 0m14.031s
user 0m0.015s
sys 0m0.031s
load (forced) (iterations: 1000000000) test result: 0.000000
real 0m13.969s
user 0m0.015s
sys 0m0.000s
################################################################################
sqrt(i)
=======
spring (iterations: 1000000000) test result: 549755813888.000000
real 0m25.844s
user 0m0.015s
sys 0m0.016s
spring (forced) (iterations: 1000000000) test result: 549755813888.000000
real 2m55.703s
user 0m0.015s
sys 0m0.000s
load (iterations: 1000000000) test result: 549755813888.000000
real 0m18.969s
user 0m0.015s
sys 0m0.015s
load (forced) (iterations: 1000000000) test result: 549755813888.000000
real 0m19.015s
user 0m0.015s
sys 0m0.015s
################################################################################
sqrt(i+x)
=======
spring (iterations: 1000000000) test result: 562949953421312.000000
real 0m28.421s
user 0m0.015s
sys 0m0.015s
spring (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 2m58.532s
user 0m0.015s
sys 0m0.015s
load (iterations: 1000000000) test result: 562949953421312.000000
real 0m21.265s
user 0m0.015s
sys 0m0.015s
load (forced) (iterations: 1000000000) test result: 562949953421312.000000
real 0m21.203s
user 0m0.015s
sys 0m0.000s
################################################################################
With these results, it looks to me as if we should move to using load_ss & store_ss & forced alignment, as the forced alignment seems to have virtually no impact on performance there, and load & store seem to be considerably faster than the union we use now.