jK wrote:@compilers #2
Yes, currently compilers do little optimization for L2. It's mainly limited to some loops. Nor will there be much improvement in the near future, at least for the data cache; for the code cache, see LTO.
Regarding loops and cache, have a dose of reality:
http://web.cse.ohio-state.edu/~saday/TC ... ommMin.pdf
40x compile/profile cycles for an ~8% speedup on modern hardware.
jK wrote:@LTO & code cache misses
First, it's really, really slow atm and needs a LOT of memory, so it's not advised for development; it's only worth considering for releases.
Second, it might reduce the code cache miss rate. And yes, it's the only sane way to do so (maybe in the future LTO could even use PGO). The ways mentioned in the video - namely reordering the .o linking, grouping via pragmas, and reordering the source code - ARE NOT A SOLUTION; they make things worse and are the first step into the asylum.
Most of that stuff just simplifies the design/flow of the compiler, and potentially makes it more efficient. LLVM/clang is already known for compiling faster than gcc. LLVM is known for producing slower code, but that's only because it doesn't support OpenMP; other tests show no clear winner.
jK wrote:So when not using LTO, don't bother about code cache misses (nor are there easy ways to even profile them!).
Nevertheless, gcc's LTO doesn't work with Spring yet (the project is too big & there are errors in LTO).
LTO is not likely to ever work properly in gcc. The only thing LTO does is make whole-program optimization easier, i.e. it can make function inlining and a few other things work better. It could also make compilation (under LLVM) faster by avoiding some redundancy. None of that has any direct impact on cache.
jK wrote:@non-std alloc
First, custom allocators don't optimize L2 cache misses. They optimize thread synchronization and OS/kernel sync spots.
Second, Spring already supports a custom allocator: google's tcmalloc. It's used under linux when available and gives a ~10-15% speedup. Porting it to windows might be worth considering.
Edit: better links:
http://www.cs.berkeley.edu/~kubitron/co ... k_slab.pdf
http://www.parrot.org/sites/www.parrot. ... s/vmem.pdf
I read over the docs on tcmalloc. It actually has some similarities to the slab allocator, but the slab allocator is still superior in several ways:
-CPU-local memory caches in the slab allocator are stored in stacks instead of just arrays, and all allocations/deallocations are O(1) push/pops. Another consequence of this is that memory is allocated in LIFO order, so whenever an object is allocated it always uses the most recently cached memory locations first. That's why the slab allocator can improve cache performance while other allocators do not.
-The slab allocator avoids transfers between shared memory and CPU-local caches more effectively (also using a stack-like convention), which additionally helps improve caching performance, and also improves thread scaling by reducing contention. It's also cheaper than tcmalloc's "garbage collection" due to the stack-like organization.
-The slab allocator has an additional feature, although it requires configuration rather than working out of the box: it can cache objects in an initialized state, so that constructor/destructor calls are eliminated on allocation and deallocation. With object caching the cost of allocation and deallocation approaches zero (a sketch follows below).
Also, to be factually correct, the slab allocator was implemented incorrectly in linux. Twice. It was originally designed for use in kernel memory management, but is in no way limited to that.
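To make the object-caching point concrete, here's a toy C++ sketch of the idea (my own illustration, not Bonwick's actual implementation; the type and method names are made up). Freed objects go onto a LIFO stack without being destroyed, so the next allocation pops a still-constructed object whose memory was touched most recently:

[code]
#include <vector>

// Toy per-type object cache in the spirit of the slab allocator's
// object caching. NOT the real slab allocator -- just the shape of it.
template <typename T>
class ObjectCache {
    std::vector<T*> freeStack;  // LIFO: the hottest memory is reused first

public:
    T* get() {
        if (!freeStack.empty()) {      // O(1) pop
            T* obj = freeStack.back();
            freeStack.pop_back();
            return obj;                // still in its initialized state
        }
        return new T();                // slow path: construct a fresh object
    }

    void put(T* obj) {
        // No destructor call here; the real slab allocator resets the
        // object back to its initialized state before caching it.
        freeStack.push_back(obj);      // O(1) push
    }

    ~ObjectCache() {
        for (T* obj : freeStack)
            delete obj;                // real teardown happens only once
    }
};
[/code]

In the real thing the stacks are per-CPU, which is where the contention/scaling win in the second point above comes from.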
jK wrote:Still likely not advised, because it's a balloon allocator, meaning it never frees memory.
Google does that in order to solve a very real problem, although that's more relevant on large servers than it is on desktops. The real solution to the problem they were trying to solve is called "RadixVM", but that's beside the point.
jK wrote:@gentoo
What you said makes no sense. It seems you have little experience with gentoo/gcc.
What did I say that made no sense again..?
jK wrote:@L2 optimizations
In contrast to what aeonios said, there are ways to improve the L2 cache hit rate. But it's a matter of the source code (-> programmer) and not of compile flags!
Some options are: struct compression (alignment optimization, pack pragma, getting rid of bools, ...), getting rid of STL containers with many L2 cache misses (std::map, std::list, ...), hot/cold structures, ...
The list of options is long; you just need to know where they're worth applying.
The slab allocator uses cache coloring, so struct alignment would come for free.
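For reference, coloring just means each new slab offsets its first object by a different multiple of the alignment, so the same field of objects from different slabs doesn't pile up in the same cache sets. A minimal sketch of the bookkeeping (my own simplification, not the kernel's code):

[code]
#include <cstddef>

// Toy sketch of slab cache-coloring bookkeeping (simplified).
struct SlabCache {
    std::size_t maxColor;       // slack space left at the end of a slab
    std::size_t nextColor = 0;  // color to hand to the next slab

    // Called once per new slab: returns the offset of its first object.
    std::size_t takeColor(std::size_t alignment) {
        std::size_t color = nextColor;
        nextColor += alignment;
        if (nextColor > maxColor)
            nextColor = 0;      // cycle through the available colors
        return color;
    }
};
[/code]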
Hot/cold structures increase binary size, require profiling, and are another feature whose stability in gcc I would question. They're also not reliable for improving performance, since complicated code may not have clear "hot" paths, and because of the counterproductive increase in binary size.
That other stuff is language-dependent, and I don't know enough about it to say whether it would have any impact or not.
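For what it's worth, here's what jK's "struct compression" looks like in practice. This is a hypothetical example with made-up fields; the sizes assume a typical 64-bit ABI:

[code]
#include <cstdint>

struct UnitBad {            // 24 bytes: padding everywhere
    bool alive;             // 1 byte + 7 bytes padding (double wants 8)
    double health;          // 8 bytes
    bool cloaked;           // 1 byte
    bool stunned;           // 1 byte + 2 bytes padding
    std::int32_t teamID;    // 4 bytes
};

struct UnitGood {           // 16 bytes: same data, reordered and packed
    double health;          // largest member first
    std::int32_t teamID;
    std::uint8_t alive   : 1;  // "getting rid of bools": 1 bit each,
    std::uint8_t cloaked : 1;  // but note you can't take a pointer
    std::uint8_t stunned : 1;  // to a bitfield member
};
[/code]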
jK wrote:@compiler's L2 possibilities
First, they are limited w/o PGO.
PGO is much more limited than you might think. It can enable things like this though:
https://mercurylang.org/documentation/p ... thesis.pdf
but that only works for referentially transparent (functional/logic) languages, which C and all of its cousins are not. I point you again to the 40x compile/profile cycles above, and also to this paper:
http://liberty.princeton.edu/Publications/epic2_pbe.pdf
which demonstrates 10x longer compile times per compile with PGO function inlining (and/or procedure boundary elimination). They don't even manage to guarantee a speedup, which would require at least several compile/profile cycles, each one 10x longer than normal.
jK wrote:Still I see some:
1. There could be a new pragma, similar to pack, which enables reordering of member vars to reduce alignment overhead. Still, this is already part of static code analyzers, so programmers can solve it themselves; the win would be little. Such a new pragma would, however, open the door for hot/cold optimization (see below).
2. There could be a new boolean type that allows the compiler to pack multiple booleans of a struct into a single bitarray, so that each bool only uses 1 bit instead of 8. Note that this would break backward compatibility, because you cannot take a pointer to such booleans!
3. Loop optimization: e.g. when you've got "for (auto& d: map) {}", the compiler knows that the whole map is iterated, so it can/should prefetch the upcoming items to reduce the cache miss rate. Note that when the loop body is huge this might be counterproductive, because the body might thrash the whole L2 cache, making the prefetching useless.
I already addressed the first point. As for the second and third points, that's the kind of thing I consider "stepping into the asylum". Your boolean trick in reality trades data size for code size, and you're adding a bunch of extra instructions when your idle rate was only 40% to begin with. I've already beaten the loop-optimization thing into the ground, but precaching goes against the way caches were intended to be used, and as you've said it's a risky "optimization" that's also tedious to code. You get real wins by consistently making conservative decisions that you can ensure will be wins and not losses, not by trying to be a hero with low-level hacks.
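To illustrate the code-size half of that tradeoff (a compiler-dependent sketch; exact codegen varies):

[code]
#include <cstdint>

struct Plain  { bool stunned; };
struct Packed { std::uint8_t stunned : 1; };

// Writing a plain bool is a single byte store.
void setPlain(Plain& u)  { u.stunned = true; }

// Writing a 1-bit field is a read-modify-write: load the byte,
// OR in the mask, store it back -- more instructions per access.
void setPacked(Packed& u) { u.stunned = 1; }
[/code]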
jK wrote:4. see hot/cold
@hot/cold splitting/grouping
@pica-question: Is there a way to automate generating algocentric struct duplicates, to reduce L1 and L2 cache misses?
The first problem is to identify what's cold and what's hot. With standard profiling (oprofile, perf, ...) this isn't possible, because you cannot identify which member vars are read.
To do so you need valgrind, and I wonder if there are already tools/plugins with that feature. Such a tool would be a GREAT win because it gives very important data. I.e. when you've got a huge struct, you could simply move all the `hot` vars to the front, so that when you access one, all the others come in on the same cache line too, preventing multiple cache misses when you access vars at the front & end of the struct.
Obviously, such generated data could also be used by the compiler itself when available (-> new feature for PGO?).
There's nothing new about that. See the above paper on procedure boundary elimination, and also the papers on superblocks (which gcc supports) and the like. I don't recommend bothering with that stuff, though, since it blows up your file size/compile times and the performance is inconsistent at best.
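For completeness, the manual version of what jK describes looks something like this (hypothetical fields; the hot members share a cache line up front, the cold ones live behind a pointer):

[code]
#include <memory>
#include <string>

struct UnitCold {                    // rarely touched: UI, save/load, ...
    std::string name;
    std::string tooltip;
    int buildFacing;
};

struct Unit {                        // hot members grouped at the front
    float posX, posY, posZ;          // read every frame
    float health;
    std::unique_ptr<UnitCold> cold;  // one extra indirection on cold paths
};
[/code]

But again, knowing which members are actually hot requires exactly the kind of profiling data jK mentions, and the payoff has to beat the extra indirection.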