[engine] Comparing two float3s

Post by jK » 15 Dec 2012, 21:05

I thought about the fastest way to compare 2 vectors.

Spring currently uses:

return std::fabs(x - f.x) <= CMP_EPS
 && std::fabs(y - f.y) <= CMP_EPS
 && std::fabs(z - f.z) <= CMP_EPS;

Spring currently compares each component individually, and you always hear that branching is slow. On the other hand, CPUs predict branches; GPUs don't.

So I thought: instead of comparing all 3 components individually, why not do `Length(vecA-vecB)^2 <= CMP_EPS`? I assumed the additional multiplies and additions of the dot product would be "for free".

So here is the code:
http://ideone.com/B9KS0f (3 compares)
http://ideone.com/zrAY3X (DOT FPU)
http://ideone.com/9eLaHD (SSE)
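
In case the links above rot, here is a rough sketch of the two scalar variants being compared. It is not the linked benchmark code: `Vec3`, `equalsPerComponent`, `equalsDot` and the value of `CMP_EPS` are made-up stand-ins, not Spring's actual `float3` or epsilon. One further assumption: the sketch compares the squared length against `CMP_EPS * CMP_EPS` so the units match, whereas the formula in the post compares it against `CMP_EPS` directly.

```cpp
#include <cmath>

// Hypothetical stand-ins for illustration; not Spring's float3 or its epsilon.
constexpr float CMP_EPS = 1e-4f;

struct Vec3 { float x, y, z; };

// Variant 1: three compares. && short-circuits, so a mismatch in x
// means y and z are never evaluated (the "early-out").
inline bool equalsPerComponent(const Vec3& a, const Vec3& b) {
    return std::fabs(a.x - b.x) <= CMP_EPS
        && std::fabs(a.y - b.y) <= CMP_EPS
        && std::fabs(a.z - b.z) <= CMP_EPS;
}

// Variant 2: squared distance. Only one compare, but the three subtractions,
// three multiplies and two adds are always executed.
// Assumption: threshold is squared here so it is comparable to variant 1.
inline bool equalsDot(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return (dx * dx + dy * dy + dz * dz) <= CMP_EPS * CMP_EPS;
}
```

Note that the two are not strictly equivalent: variant 1 accepts a cube of half-width `CMP_EPS` around the vector, variant 2 a sphere of radius `CMP_EPS`, so a point near a corner of the cube passes the first test but not the second.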

with the results:
3 compares: 0.98s
dot FPU: 1.3s
SSE: >5sec (Time limit exceeded)

-> another assumption turned out wrong

Possible reason: it seems the early-out advantage of the individual compares (the CPU can stop evaluating as soon as the first comparison fails) wins against Length()^2.

PS: Dunno why the SSE version is so damn slow (even slower than FPU). Anyone got an idea?

PPS: to compile the code I used `g++ -o foo.bin -O2 -mfpmath=sse -msse -msse2 foo_float3.c -DUSE_...`.

Post by **Peet** » 15 Dec 2012, 22:22

peet@starscream ~/floats> g++ -o sse.bin -O2 -mfpmath=sse -msse -msse2 floats.cpp -DUSE_SSE
peet@starscream ~/floats> g++ -o dot.bin -O2 -mfpmath=sse -msse -msse2 floats.cpp -DUSE_DOT
peet@starscream ~/floats> g++ -o branch.bin -O2 -mfpmath=sse -msse -msse2 floats.cpp
peet@starscream ~/floats> time ./sse.bin 
16777216.000000 8388608.000000 16777216.000000
0.35user 0.00system 0:00.36elapsed 99%CPU (0avgtext+0avgdata 1008maxresident)k
0inputs+0outputs (0major+315minor)pagefaults 0swaps
peet@starscream ~/floats> time ./dot.bin 
16777216.000000 8388608.000000 16777216.000000
0.53user 0.00system 0:00.53elapsed 99%CPU (0avgtext+0avgdata 1004maxresident)k
0inputs+0outputs (0major+313minor)pagefaults 0swaps
peet@starscream ~/floats> time ./branch.bin 
16777216.000000 8388608.000000 16777216.000000
0.26user 0.00system 0:00.27elapsed 99%CPU (0avgtext+0avgdata 1004maxresident)k
0inputs+0outputs (0major+314minor)pagefaults 0swaps

For me USE_SSE is notably faster than USE_DOT. Perhaps your USE_SSE compiled one is not actually using simd instructions - I believe it's sometimes a bit more effort than one would assume to convince the compiler to utilize them. At least with MSVC I've found that explicitly specifying the type's alignment is definitely a factor.

Also, I imagine it might also be pertinent to do this test with arrays of float3s (where the computation is repeated without each calculation depending on the one immediately preceding it) so that pipelining can be more of a factor.

Post by jK » 15 Dec 2012, 22:41

Peet wrote: For me USE_SSE is notably faster than USE_DOT.

You've got an Intel, right? I've got an AMD, and it seems ideone does too. Single SSE instructions seem to perform very poorly on AMDs (it differs a bit when running on arrays, afaik).

Peet wrote: Perhaps your USE_SSE compiled one is not actually using simd instructions - I believe it's sometimes a bit more effort than one would assume to convince the compiler to utilize them. At least with MSVC I've found that explicitly specifying the type's alignment is definitely a factor.

I used gcc's native vector extension, which sets alignment etc. automatically.
It can also generate fallback FPU code for all ops, but that doesn't seem to be what is happening here. Setting a proper `-march=amdfam10` doesn't change anything either.

Peet wrote: Also, I imagine it might also be pertinent to do this test with arrays of float3s (where the computation is repeated without each calculation depending on the one immediately preceding it) so that pipelining can be more of a factor.

That would be a different test; implementing array-driven computations in Spring code would be a heavy modification.

Post by **Peet** » 15 Dec 2012, 23:12

Yeah I am running on an i7 2630QM. FWIW i did a naive arrayified version and sse/branching performed almost identically. Pretty disappointing that SSE works so poorly on AMD...sounds like this has implications for more than just comparison operations. I suppose simd vs not-simd has sync implications as well so we can't just ditch it for AMD users ...

Post by jK » 15 Dec 2012, 23:49

Peet wrote: FWIW i did a naive arrayified version and sse/branching performed almost identically.

http://ideone.com/joQ7f8 (SSE 2.02s)
http://ideone.com/3kBhPe (FPU 0.13s)

Post by **gajop** » 16 Dec 2012, 00:59

not sure your original test is "correct", as the if will always be false and it'll be short circuited at the first vector component
example with different numbers (different CMP_EPS & starting a value):
http://ideone.com/TFarBl
http://ideone.com/psA6zO

Post by **Beherith** » 16 Dec 2012, 01:58

I thought you were going to use an eps variable, or are you planning to inline to an immediate epsilon value?

Post by jK » 16 Dec 2012, 04:16

gajop wrote: not sure your original test is "correct", as the if will always be false and it'll be short circuited at the first vector component
example with different numbers (different CMP_EPS & starting a value):
http://ideone.com/TFarBl
http://ideone.com/psA6zO

It seems gcc optimized something away in the if-clause because of `a = b;`. Replacing it with `a += b;` again gives 3comps the advantage:
http://ideone.com/GS2b4H (dot 1.30s)
http://ideone.com/QnPH9S (3comps 0.89s)

I also tried a different form of the for-loop in the hope that gcc cannot optimize it away:
http://ideone.com/Strg3k (dot 3.48s)
http://ideone.com/pfol5Z (3comps 2.69s)

edit: much better version:
http://ideone.com/cq3pjw (3comps 1.03s)
http://ideone.com/DK0HWb (dot 1.26s)

Post by jK » 16 Dec 2012, 04:17

Beherith wrote: I thought you were going to use an eps variable, or are you planning to inline to an immediate epsilon value?

Everything in the code is auto-inlined.

Post by **Beherith** » 16 Dec 2012, 12:47

Scream if you want access to an idle Ubuntu server with a Sandy Bridge G620 CPU.

Post by **a1983** » 17 Dec 2012, 06:53

Maybe try speeding up the float equality check itself rather than the vector comparison.
For example, as described here:
http://www.cygnus-software.com/papers/c ... floats.htm
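
The link above is truncated, so the following is only a guess at the kind of technique meant: one commonly discussed alternative to an absolute epsilon is to compare the integer representations of the floats and accept a difference of a few ULPs (units in the last place). The names `orderedBits` and `almostEqualUlps` are invented for this sketch, and it assumes 32-bit IEEE 754 floats.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a float's bit pattern to a signed integer that is ordered the same way
// as the float values themselves (negative floats map to negative integers).
inline std::int32_t orderedBits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "sketch assumes 32-bit IEEE 754 float");
    std::int32_t i;
    std::memcpy(&i, &f, sizeof i);  // well-defined way to read the bits
    // Sign-magnitude -> two's complement style ordering; +0.0 and -0.0 both map to 0.
    return (i < 0) ? std::numeric_limits<std::int32_t>::min() - i : i;
}

// True if a and b are at most maxUlps representable floats apart.
inline bool almostEqualUlps(float a, float b, std::int64_t maxUlps) {
    if (std::isnan(a) || std::isnan(b)) return false;  // NaN equals nothing
    // 64-bit difference avoids overflow when a and b have opposite signs.
    const std::int64_t diff =
        static_cast<std::int64_t>(orderedBits(a)) - static_cast<std::int64_t>(orderedBits(b));
    return (diff < 0 ? -diff : diff) <= maxUlps;
}
```

Whether this is actually faster than `std::fabs(a - b) <= CMP_EPS` would need measuring; it also behaves differently, since the tolerance scales with the magnitude of the values instead of being a fixed absolute distance.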

Post by **PicassoCT** » 17 Dec 2012, 10:49

The timings in seconds matched at home; here on the laptop they differ. My guess is OS scheduling.

Post by **zerver** » 17 Dec 2012, 16:01

Interesting blog.

Indeed when doing AND

if (A && B && C())

you should put the statement that is most likely to be false first, and when doing OR

if (A || B || C())

you should put the statement that is most likely to be true first.

Actually I have had friends who call themselves C programmers who were totally unaware of the short-circuit evaluation that is in effect and its implications. I.e. why the f-k does C() not get called? LoL
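
To make the "C() does not get called" case concrete, here is a small sketch. The counter and helper names (`callCount`, `callsForAnd`, `callsForOr`) are invented for the illustration; only `C()` comes from the snippets above.

```cpp
// Hypothetical counter, only there to make a skipped call observable.
static int callCount = 0;

bool C() {
    ++callCount;  // side effect: records that C() really ran
    return true;
}

// Returns how often C() ran while evaluating `a && b && C()`.
int callsForAnd(bool a, bool b) {
    callCount = 0;
    const bool result = a && b && C();  // C() runs only if a and b are both true
    (void)result;
    return callCount;
}

// Returns how often C() ran while evaluating `a || b || C()`.
int callsForOr(bool a, bool b) {
    callCount = 0;
    const bool result = a || b || C();  // C() runs only if a and b are both false
    (void)result;
    return callCount;
}
```

With these definitions, `callsForAnd(false, true)` and `callsForOr(true, false)` both return 0 because the first operand already decides the result, while `callsForAnd(true, true)` and `callsForOr(false, false)` return 1.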

Post by **PicassoCT** » 17 Dec 2012, 18:20

But it's glaringly obvious once you've done assembler for a semester... jmpIfEquals, I'm looking at you.

Spring RTS Engine
