Speed comparisons

Here is where ideas can be collected for the skirmish AI in development

Moderators: hoijui, Moderators

User avatar
krogothe
AI Developer
Posts: 1050
Joined: 14 Nov 2005, 17:07

Speed comparisons

Post by krogothe »

I very much doubt anyone here knows C++ well enough to answer this precisely, but it's worth a try. Maybe someone knows where I can look this stuff up!

How much time would the following operations take (in clock cycles, units of time, whatever; I just need the ratios between them):

(comparing two unsigned ints or booleans with ==)
if(someunsignedintorboolean==1)

(multiplying two standard ints)
int*int

(comparing two standard ints)
if(int1==int2)

(same as above but with < or >, I heard they are slower)
if(int1>int2)

(addition and subtraction of ints)
int1+int2
int1-int2


Yes, it's a bit much to ask, but it's important (and secret for now :P). If you don't know the exact values, ordering them by speed and giving a rough estimate of the differences would already help a lot!
User avatar
jcnossen
Former Engine Dev
Posts: 2440
Joined: 05 Jun 2005, 19:13

Post by jcnossen »

All those operations are fast. Their cost depends on compiler-specific optimizations and on the other locally used variables and structures... i.e. it's impossible to predict. Some variables will be stored in a register, some on the stack, and others require writing register-stored variables back to the stack...

In general all these integer operations are constrained by the time it takes to acquire their operands, not to perform the actual operation.

If you want to know more, knock yourself out with the instruction set (which is absolutely horrible for i386 ;)
http://www.intel.com/design/pentium4/ma ... ex_new.htm
User avatar
krogothe
AI Developer
Posts: 1050
Joined: 14 Nov 2005, 17:07

Post by krogothe »

Ill have a go at finding info there, but it's painful to even look at those docs!
Hmm, now that you mention it, I could probably get a huge boost from sticking stuff into registers and maybe making the functions inline (not sure how that works yet, so I might be talking out of my ass here)
User avatar
jcnossen
Former Engine Dev
Posts: 2440
Joined: 05 Jun 2005, 19:13

Post by jcnossen »

Basically I was saying that it's impossible to know, because the compiler does complex optimization in release mode. However, if you want to be sure of speed, you can restrict yourself to very simple arrays so you know what really happens. std::set, std::map and std::list operations are not suited for the inner loops, because they call the memory manager.

VS automatically inlines appropriate functions if that's enabled in the optimization settings, and the register keyword can best be left unused because the compiler knows better what's fast than you do.
User avatar
krogothe
AI Developer
Posts: 1050
Joined: 14 Nov 2005, 17:07

Post by krogothe »

Guess ill only know once I code it in two different ways, which i cba doing right now...
thx zaph
SoftNum
Posts: 22
Joined: 28 Nov 2005, 22:07

Post by SoftNum »

Why are you so worried about speed, anyway? I'm writing a threaded AI, and it can do some pretty complex operations, and the thread will be done before the next frame.

But, as to your question: multiplication is slower than everything else, though not by an appreciable amount. If you're multiplying by 2, it's faster to do a bitwise shift than a *2.

Code: Select all

int x,y;

x = y*2;  
x = y<<1;  // Faster
Also, integer math is significantly faster than float math.

But honestly, you shouldn't worry about the speed of arithmetic. Looping, search functions, and sort functions are all going to be much more of a drain than any arithmetic optimizations you choose to do.

Oh, and new and delete (or malloc() and free()) are rather computationally expensive, as well.
renrutal
Posts: 84
Joined: 28 Apr 2005, 16:45

Post by renrutal »

Zaphod wrote:Basically I was saying that it's impossible to know, because the compiler does complex optimization in release mode. However to be sure of speed you can restrict yourself to very simple arrays to just know what really happens. std::set, std::map and std::list operations are not suited for the inner loops, because they call the memory manager.

VS automatically inlines appropriate functions if that's enabled in the optimization settings, and the register keyword can best be left unused because the compiler knows better what's fast than you do.
Do you know which data containers in STL and Boost are the fastest and rather safe to use?
User avatar
mr sharpoblunto
Posts: 24
Joined: 12 Dec 2005, 03:47

Post by mr sharpoblunto »

Do you know which data containers in STL and Boost are the fastest and rather safe to use?
std::vector is basically a dynamic array, so it has O(1) access to any element, but it can take up to O(n) time to insert or remove an element (it's still O(1) to add an element at the end). This is probably the best choice where you read the contents often but rarely change them (insertions and deletions, anyway).

boost::array is pretty much a standard array, so it can't be resized, but as such it has O(1) access time, and it includes some handy operators etc. which make it a bit safer than using standard arrays.

For fast searching the best option is a hashtable. The STL does not include a hashtable, but most compilers ship an STL-like implementation (GCC has one called hash_map, and I know VS has one as well).
Theoretically a hashtable has O(1) search time, but that depends on how effective the hashing function is. There are a lot of good hashing functions around, though, and if you're using standard data types as keys you don't need to implement your own.

std::map looks similar to a hashtable in that it stores key/value pairs, but it uses a balanced binary tree (typically a red-black tree) to store them, so it has O(log n) search time. That can still be fine for small data sets.

std::list has O(n) access time but O(1) insertion and deletion, so it can be faster than a vector for a data set that is constantly changing, but it's not so good if you're using it to search for items all the time.

Hope that helps (hopefully I didn't make any dumb mistakes on the big-O speeds :lol: )
User avatar
jcnossen
Former Engine Dev
Posts: 2440
Joined: 05 Jun 2005, 19:13

Post by jcnossen »

std::list has O(n) access time but O(1) insertion and deletion, so it can be faster than a vector for a data set that is constantly changing, but it's not so good if you're using it to search for items all the time.
Except when you don't destruct the vector but just clear it.
vector::clear doesn't free the memory, it just sets the size to zero

If you're just pushing and popping at the end, the vector is faster than the list in this case (I often use this).

Code: Select all

vector<int> tempdata;

void func() {
    tempdata.clear();
    for (....) {
        tempdata.push_back(...);
    }
    // use the tempdata
    ...
}
Tobi
Spring Developer
Posts: 4598
Joined: 01 Jun 2005, 11:36

Post by Tobi »

First, figure out which parts of your program (AI) are worth optimizing. Use the profiling capabilities of your compiler for that. gcc/g++ has it with the -pg option and the gprof program; MSVC only has it in the Professional edition iirc.

There's a common rule that 90% of the execution time is spent in only 10% of the code. Probably it's even more like 99% vs. 1%, so optimizing anything but the most-executed 1% is entirely and utterly useless.

Also, you shouldn't optimize before the code is finished, and it's way more efficient to optimize at the algorithm level than at the instruction level. For example, replacing a std::list with a std::vector for a container whose Nth element you query in an inner loop has much more effect than replacing the std::list with a home-brewed linked list class, because vector has O(1) (constant time) random access complexity, while std::list has O(N) (linear time).

Keep in mind too that one L2 cache miss can be as expensive as 100 integer multiplications (iirc)... The "cachegrind" tool of the "valgrind" program can be used to simulate cache misses. Don't know if it's available for Windows tho, but google is your friend.

Summary:
  • Never optimize before profiling
  • Never optimize anything but the slowest 1% of the code, which takes 99% of the execution time
  • First do algorithmic optimization, then instruction level optimizations (or rather, just don't do instruction level optimizations :wink: )
User avatar
krogothe
AI Developer
Posts: 1050
Joined: 14 Nov 2005, 17:07

Post by krogothe »

Well, speed is a bit of a concern when you are planning on doing roughly 188,743,680,000 of those operations (total, including ifs etc.) in your average game...
And yes, that's the 5% of code which will take up 99.9% of the resources, so ANY speed improvement will help :)

Ill have a look at all that stuff as well as the constant googling ive been doing...
Based on the information about a simpler chip, I could get at least a good 2-10x speed improvement over my original planned design by adopting some changes, provided my 64-bit Athlon works similarly...
User avatar
AF
AI Developer
Posts: 20687
Joined: 14 Sep 2004, 11:32

Post by AF »

If you need idle processing time, keep a counter and increment it every time update() gets called.
If the gameframe returned by the engine hasn't changed but the manual counter has, then the game is paused. Or you could just look for a paused message.

You might also want to note that algorithms that run during Update() continue to run 30 times every second even though the game is paused. For example, the threat matrix updating algorithm in AAI and NTAI continues when the game is paused, so if an enemy unit is in sight of the AI then it reinforces that sector continually during the pause, resulting in unsightly mix-ups with attack targeting.
SJ
Posts: 618
Joined: 13 Aug 2004, 17:13

Post by SJ »

Update doesn't get called when the game is paused (if no one has changed it since I last looked); it's called at the same time as the rest of the game logic gets its update call (so if you run at speed 2 it will be called 60 times a second, etc.).
SoftNum
Posts: 22
Joined: 28 Nov 2005, 22:07

Post by SoftNum »

Just to be perverse and play the large-numbers game:

188 billion operations only takes 188 seconds @ 1 GHz. That's about 3 minutes. Which means in a 1.5 hour game, your AI can take up just 1/30th of the CPU and still run with no load.
cain
AI Developer
Posts: 124
Joined: 09 Aug 2005, 10:04

Post by cain »

you could never say for sure, but as a theoretical guide follow these principles:

architectural path to memory:
the x86 set can do these operand combinations:
memory, register (native)
register, register (native)
pointer_to_memory, register (+1 ck)
memory, memory (+1 ck)
pointer_to_memory, memory (+2 ck)

native means implemented in hardware. the additional clock comes from loading the value from memory to translate the operation into a native one.


arithmetic operations:
sum: 1 ck
sub: 1 ck
mul: 2 ck (one is used for the signaling with the ALU)
div: variable
shifts: 1 ck if shifting by a multiple of 2, 2 ck otherwise (VARIABLE)

note that a memory access could mean a cache miss, and a cache miss means a halt of the processor for a LOT of ck.
this depends on FSB, RAM speed, RAM latency, bus latency, cache latency, and cache speed.

Also add a move to retrieve the results!
so:
move register, register: 1 ck
register, memory: 1 ck
memory, memory: 1 ck
pointer, pointer: 3 ck

note the memory-memory move that uses 1 ck: this inconsistency comes from the new memory controller, which intercepts the move calls and processes them without the CPU's help.

on the other hand, MMX and SSE2 are constant time and can operate in parallel on multiple registers, BUT are limited to register-register mode.
They are a sort of standalone load/store unit.
1 ck to store operands.
1 ck to do the operations.
1 ck to load the results.
useful if you're doing a lot of operations on the same elements.



and never in life read an Intel manual:
it will exhaust your clock.

[edit]

note that from what I've said,
a *= b is faster than a = a * b
this rewrites to:
mov register, valueofa
mul register, valueofb <- register overwritten with result
mov valueofa, register

still, a clever compiler (gcc for sure) will optimize your code, so instead of searching for the best code path, stay focused on getting the fastest algorithm.


note on the hashmap:
yes, they're O(1).
note that the operation required is a hash function, and that can be somewhat slow.
and they have a slow iterator, so they're suited for cases where you have a big data set and no searches, only direct queries.
Also keep in mind that if there are too many collisions, the hashmap will be rehashed! that is O(n), and every n of the O is a slow operation.

and don't forget the deque! fantastic for pipes.

NOTE FOR THE DEVELOPERS:
what about a fast unsigned-integer-key-only hashmap?
the hash could just be integer % value. (fast!)
the container could be grown by shifting the old value by one. Everything could be stored in two arrays, the index done by long, with -1 as tombstone.
still, iterator.next() is a little slow as it has to skip all the tombstones.
cain
AI Developer
Posts: 124
Joined: 09 Aug 2005, 10:04

Post by cain »

for the shifts:

on P4 and above, shifts are no longer faster than multiplications!

rule for inline:

if the function is declared and defined inside the class {} scope, the compiler will always try to make it inline. if outside, you should specify the inline keyword. inlining saves 3 ck for the call, plus one clock for the push onto the stack for every parameter, plus 1 ck for reading the parameters from the stack inside the callee context, plus all the copy constructors involved, and maybe I've forgotten something.

note that inlining all the code has the same side effect as unrolling loops: it will trash the cache like snow in hell! and memory is pretty slow today.

consider using this loop if short on speed:
int count = iterationCount + 1;
while (--count) {
....
}

and those tips:

http://www.devx.com/amd/Article/21545




don't forget the extreme unroll technique (Duff's device) if the index is not important and only the iteration count matters, as with iterators:

int n = (count + 7) / 8; /* count > 0 assumed */

switch (count % 8)
{
case 0: do { operation;
case 7:      operation;
case 6:      operation;
case 5:      operation;
case 4:      operation;
case 3:      operation;
case 2:      operation;
case 1:      operation;
        } while (--n > 0);
}



also: always code a slow algorithm first, then switch to the fastest, and DON'T throw away the old one: use it for debugging (execute both in real usage, not tests, and if the results differ throw an exception)
User avatar
krogothe
AI Developer
Posts: 1050
Joined: 14 Nov 2005, 17:07

Post by krogothe »

SoftNum wrote:Just to be perverse and play the large numbers game.

188 billion operations only takes 188 seconds @ 1 GHz. That's about 3 minutes. Which means in a 1.5 hour game, your AI can take up just 1/30th of the CPU and still run with no load.
Funny, but I very much doubt that, say, a multiplication will only take a single operation, or that every clock cycle will be available to the AI. Ultimately though, it's wishful thinking that you'd last 1.5 hours against KAI when it's done :wink: (JUST KIDDING, DONT START FLAMING ME OK?).

Cain, that was pure awesome! Thanks a bucketload! (cool to have you back too!)
So multiplication is actually quite fast (only 2x a sum)! New things you learn every day...

Ill look up cache misses (im sure it will be easier to find than this info you just gave me), draw up a couple of plans and choose how to code it efficiently!
This, coupled with my growing knowledge of the STL (cppreference.com is great) and general coding, is making me rewrite a lot of code, so dont expect any new features for KAI until next year!

Ill read this again tomorrow to soak it all up :)
renrutal
Posts: 84
Joined: 28 Apr 2005, 16:45

Post by renrutal »

SoftNum wrote:Just to be perverse and play the large numbers game.

188 billion operations only takes 188 seconds @ 1 GHz. That's about 3 minutes. Which means in a 1.5 hour game, your AI can take up just 1/30th of the CPU and still run with no load.
I also have to remind everyone that even a simple assignment can take over 3 CPU operations, and the simplest function call (not counting the operations inside the function itself) some good 20-30, all of course depending on the CPU architecture.

There will also be over 60 kernel- and user-space programs sharing your computer's resources, each one with multiple threads, all running or idling at the same time.

Even with that said, I'll ask one thing: please don't be too worried about optimizing everything. Yeah, it's always good to have stuff running faster and better, but your design is even more important.

A well-done program design will help the compiler optimize your code efficiently, but much more importantly, a well-done program will help the developers understand what you're doing.

IMHO, even a small comment that doesn't get compiled with the code will be more helpful than the actual running code.
IMSabbel
Posts: 747
Joined: 30 Jul 2005, 13:29

Post by IMSabbel »

Generally, with modern CPUs, clock-cycle counting is pointless.

That single additional add or mul may very well be completely hidden because it's executed in parallel with another non-blocking instruction, while a single L2 cache miss can cost between 100 and 200 clock cycles.

That's the whole reason why many old optimization techniques aren't that hot nowadays.
Take lookup tables, for example... it's cheaper to calculate trigonometric functions directly in the FPU than to have a 50% cache miss rate when accessing the lookup table... (of course this depends on the accuracy of the table... a 128-value lookup will of course always fit in cache...)
User avatar
Triaxx2
Posts: 422
Joined: 29 Aug 2004, 22:24

Post by Triaxx2 »

If you're that worried about it, make the AI able to detect if the computer has multiple cores, and use the inoperative one. Sheesh.