jK wrote:@compilers #2
Yes, currently compilers do little optimization for L2. It's mainly limited to some loops. Nor will there be much improvement in the near future, at least for the data cache; for the code cache, see LTO.
Regarding loops and cache, have a dose of reality:
http://web.cse.ohio-state.edu/~saday/TC ... ommMin.pdf
40x compile/profile cycles for an ~8% speedup on modern hardware.
jK wrote:@LTO & code cache misses
First, it's really, really slow atm and needs a LOT of memory, so it's not advised for development; it's only worth considering for releases.
Second, it might reduce the code cache miss rate. And yes, it's the only sane way to do so (maybe in the future LTO could even use PGO). The ways mentioned in the video - namely reordering the .o linking, grouping via pragmas, and reordering the source code - ARE NOT A SOLUTION; they make things worse and are the first step into the asylum.
Most of that stuff just simplifies the design/flow of the compiler, and potentially makes it more efficient. LLVM/clang is already known for compiling faster than gcc. LLVM is known for producing slower code, but that's only because it doesn't support OpenMP; other tests show no clear winner.
jK wrote:So when not using LTO, don't bother about code cache misses (nor are there easy ways to even profile them!).
Nevertheless, gcc's LTO doesn't work with Spring yet (the project is too big & there are errors in LTO).
LTO is not likely to ever work properly in gcc. The only thing LTO does is make whole-program optimization easier, i.e. it can make function inlining and a few other things work better. It could also make compilation (under LLVM) faster by avoiding some redundancy. None of that has any direct impact on cache.
jK wrote:@non-std alloc
First, custom allocators don't optimize L2 cache misses. They optimize thread synchronization and OS/kernel sync spots.
Second, Spring already supports a custom allocator: google's tcmalloc. It's used under linux when available and gives a ~10-15% speedup. Porting it to windows might be worth considering.
Edit: better links:
http://www.cs.berkeley.edu/~kubitron/co ... k_slab.pdf
http://www.parrot.org/sites/www.parrot. ... s/vmem.pdf
I read over the docs on tcmalloc. It actually has some similarities to the slab allocator, but the slab allocator is still superior in several ways:
-CPU-local memory caches in the slab allocator are stored in stacks instead of just arrays, and all allocations/deallocations are O(1) push/pops. Another consequence of this is that memory is allocated in LIFO order, so whenever an object is allocated it always uses the most recently cached memory locations first. That's why the slab allocator can improve cache performance while other allocators do not.
-The slab allocator avoids transfers between shared memory and CPU-local caches more effectively (also using a stack-like convention), which additionally helps improve caching performance, and also improves thread scaling by reducing contention. It's also cheaper than tcmalloc's "garbage collection" due to the stack-like organization.
-The slab allocator has an additional feature, although it requires configuration rather than working out of the box: it can cache objects in an initialized state, so that constructor/destructor calls are eliminated on allocation and deallocation. With object caching the cost of allocation and deallocation approaches zero (a sketch follows below).
Also, to be factually correct, the slab allocator was implemented incorrectly in linux. Twice. It was originally designed for use in kernel memory management, but is in no way limited to that.
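To make the object-caching point concrete, here's a toy C++ sketch of the idea (my own illustration, not Bonwick's actual implementation; the type and method names are made up). Freed objects go onto a LIFO stack without being destroyed, so the next allocation pops a still-constructed object whose memory was touched most recently:

[code]
#include <vector>

// Toy per-type object cache in the spirit of the slab allocator's
// object caching. NOT the real slab allocator -- just the shape of it.
template <typename T>
class ObjectCache {
    std::vector<T*> freeStack;  // LIFO: the hottest memory is reused first

public:
    T* get() {
        if (!freeStack.empty()) {      // O(1) pop
            T* obj = freeStack.back();
            freeStack.pop_back();
            return obj;                // still in its initialized state
        }
        return new T();                // slow path: construct a fresh object
    }

    void put(T* obj) {
        // No destructor call here; the real slab allocator resets the
        // object back to its initialized state before caching it.
        freeStack.push_back(obj);      // O(1) push
    }

    ~ObjectCache() {
        for (T* obj : freeStack)
            delete obj;                // real teardown happens only once
    }
};
[/code]

In the real thing the stacks are per-CPU, which is where the contention/scaling win in the second point above comes from.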
jK wrote:Still likely not advised, because it's a balloon allocator, meaning it never frees memory.
Google does that in order to solve a very real problem, although that's more relevant on large servers than it is on desktops. The real solution to the problem they were trying to solve is called "RadixVM", but that's beside the point.
jK wrote:@gentoo
What you said makes no sense. It seems you have little experience with gentoo/gcc.
What did I say that made no sense again..?
jK wrote:@L2 optimizations
In contrast to what aeonios said, there are ways to improve the L2 cache hit rate. But it's a matter of the source code (-> programmer) and not of compile flags!
Some options are: struct compression (alignment optimization, pack pragma, getting rid of bools, ...), getting rid of STL containers with many L2 cache misses (std::map, std::list, ...), hot/cold structures, ...
The list of options is long; you just need to know where they're worth applying.
The slab allocator uses cache coloring, so struct alignment would come for free.
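For reference, coloring just means each new slab offsets its first object by a different multiple of the alignment, so the same field of objects from different slabs doesn't pile up in the same cache sets. A minimal sketch of the bookkeeping (my own simplification, not the kernel's code):

[code]
#include <cstddef>

// Toy sketch of slab cache-coloring bookkeeping (simplified).
struct SlabCache {
    std::size_t maxColor;       // slack space left at the end of a slab
    std::size_t nextColor = 0;  // color to hand to the next slab

    // Called once per new slab: returns the offset of its first object.
    std::size_t takeColor(std::size_t alignment) {
        std::size_t color = nextColor;
        nextColor += alignment;
        if (nextColor > maxColor)
            nextColor = 0;      // cycle through the available colors
        return color;
    }
};
[/code]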
Hot/cold structures increase binary size, require profiling, and are another feature whose stability in gcc I would question. They're also not reliable for improving performance, since complicated code may not have clear "hot" paths, and because of the counterproductive increase in binary size.
That other stuff is language-dependent, and I don't know enough about it to say whether it would have any impact or not.
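For what it's worth, here's what jK's "struct compression" looks like in practice. This is a hypothetical example with made-up fields; the sizes assume a typical 64-bit ABI:

[code]
#include <cstdint>

struct UnitBad {            // 24 bytes: padding everywhere
    bool alive;             // 1 byte + 7 bytes padding (double wants 8)
    double health;          // 8 bytes
    bool cloaked;           // 1 byte
    bool stunned;           // 1 byte + 2 bytes padding
    std::int32_t teamID;    // 4 bytes
};

struct UnitGood {           // 16 bytes: same data, reordered and packed
    double health;          // largest member first
    std::int32_t teamID;
    std::uint8_t alive   : 1;  // "getting rid of bools": 1 bit each,
    std::uint8_t cloaked : 1;  // but note you can't take a pointer
    std::uint8_t stunned : 1;  // to a bitfield member
};
[/code]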
jK wrote:@compiler's L2 possibilities
First, they are limited w/o PGO.
PGO is much more limited than you might think. It can enable things like this though:
https://mercurylang.org/documentation/p ... thesis.pdf
but that only works for referentially transparent (functional/logic) languages, which C and all of its cousins are not. I point you again to the 40x compile/profile cycles above, and also to this paper:
http://liberty.princeton.edu/Publications/epic2_pbe.pdf
which demonstrates 10x longer compile times per compile with PGO function inlining (and/or procedure boundary elimination). They don't even manage to guarantee a speedup, which would require at least several compile/profile cycles, each one 10x longer than normal.
jK wrote:Still I see some:
1. There could be a new pragma, similar to pack, which enables reordering of member vars to reduce alignment overhead. Still, this is already part of static code analyzers, so programmers can solve it themselves; the win would be little. Such a new pragma would, however, open the door for hot/cold optimization (see below).
2. There could be a new boolean type that allows the compiler to pack multiple booleans of a struct into a single bitarray, so that each bool only uses 1 bit instead of 8. Note that this would break backward compatibility, because you cannot take a pointer to such booleans!
3. Loop optimization: e.g. when you've got "for (auto& d: map) {}", the compiler knows that the whole map is iterated, so it can/should prefetch the upcoming items to reduce the cache miss rate. Note that when the loop body is huge this might be counterproductive, because the body might thrash the whole L2 cache, making the prefetching useless.
I already addressed the first point. As for the second and third points, that's the kind of thing I consider "stepping into the asylum". Your boolean trick in reality trades data size for code size, and you're adding a bunch of extra instructions when your idle rate was only 40% to begin with. I've already beaten the loop-optimization thing into the ground, but precaching goes against the way caches were intended to be used, and as you've said it's a risky "optimization" that's also tedious to code. You get real wins by consistently making conservative decisions that you can ensure will be wins and not losses, not by trying to be a hero with low-level hacks.
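To illustrate the code-size half of that tradeoff (a compiler-dependent sketch; exact codegen varies):

[code]
#include <cstdint>

struct Plain  { bool stunned; };
struct Packed { std::uint8_t stunned : 1; };

// Writing a plain bool is a single byte store.
void setPlain(Plain& u)  { u.stunned = true; }

// Writing a 1-bit field is a read-modify-write: load the byte,
// OR in the mask, store it back -- more instructions per access.
void setPacked(Packed& u) { u.stunned = 1; }
[/code]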
jK wrote:4. see hot/cold
@hot/cold splitting/grouping
@pica-question: Is there a way to automate generating algocentric struct duplicates, to reduce L1 and L2 cache misses?
The first problem is to identify what's cold and what's hot. With standard profiling (oprofile, perf, ...) this isn't possible, because you cannot identify which member vars are read.
To do so you need valgrind, and I wonder if there are already tools/plugins with that feature. Such a tool would be a GREAT win because it gives very important data. I.e. when you've got a huge struct, you could simply move all the `hot` vars to the front, so that when you access one, all the others come in on the same cache line too, preventing multiple cache misses when you access vars at the front & end of the struct.
Obviously, such generated data could also be used by the compiler itself when available (-> new feature for PGO?).
There's nothing new about that. See the above paper on procedure boundary elimination, and also the papers on superblocks (which gcc supports) and the like. I don't recommend bothering with that stuff, though, since it blows up your file size/compile times and the performance is inconsistent at best.
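For completeness, the manual version of what jK describes looks something like this (hypothetical fields; the hot members share a cache line up front, the cold ones live behind a pointer):

[code]
#include <memory>
#include <string>

struct UnitCold {                    // rarely touched: UI, save/load, ...
    std::string name;
    std::string tooltip;
    int buildFacing;
};

struct Unit {                        // hot members grouped at the front
    float posX, posY, posZ;          // read every frame
    float health;
    std::unique_ptr<UnitCold> cold;  // one extra indirection on cold paths
};
[/code]

But again, knowing which members are actually hot requires exactly the kind of profiling data jK mentions, and the payoff has to beat the extra indirection.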