Spring has a problem with L2 cache misses - Page 2

Spring has a problem with L2 cache misses

Discuss the source code and development of Spring Engine in general from a technical point of view. Patches go here too.

Moderator: Moderators

User avatar
jK
Spring Developer
Posts: 2299
Joined: 28 Jun 2007, 07:30

Re: Spring has a problem with L2 cache misses

Post by jK »

User avatar
jK
Spring Developer
Posts: 2299
Joined: 28 Jun 2007, 07:30

Re: Spring has a problem with L2 cache misses

Post by jK »

aeonios wrote:There's nothing new about that. See the above paper on procedure boundary elimination, also papers on superblocks (which gcc supports) and the like. I don't recommend bothering with that stuff though since it blows up your file size/compile times and the performance is very inconsistent at best.
All those are I-Cache, I am speaking of D-Cache. And it really seems my idea of a tool that stats the hotness of member vars is something new :/
Atm ppl if at all use L2 cache miss visualization per codeline and then optimize the structures from that, but that's the opposite direction of how it should be imo.
User avatar
PicassoCT
Journeywar Developer & Mapper
Posts: 10450
Joined: 24 Jan 2006, 21:12

Re: Spring has a problem with L2 cache misses

Post by PicassoCT »

the problem is that you basically have to few datastructur from a function spanning algo-perspective, and you have to guarantee that no other thread/algo messes with your data - its not trivial to profile.

Regarding the multiple duplicates in ram to save l2 cache misses? Good idea, not good idea?
aeonios
Posts: 202
Joined: 03 Feb 2015, 14:27

Re: Spring has a problem with L2 cache misses

Post by aeonios »

Well, first off I reread that paper I linked on loop optimizations and realized that it didn't say exactly what I thought it did, and turns out (luckily) not to be relevant to our problem here.

The problem isn't that spring doesn't fit into cache (which is impossible), but rather that it accesses memory in a way which doesn't use cached items efficiently (ie more than once before being clobbered). A bit of research reveals that a 40% miss rate is the average for unoptimized loops/nested loops that access arrays.

http://surface.syr.edu/cgi/viewcontent. ... ntext=eecs

and a pretty good overview of the subject and strategies here.

More modern research has focused on using polyhedral representations of loops to improve the quality of the optimizations that are possible, which culminated in PLuTo. Long story short, PLuTo happens to be implemented in gcc under graphite, and should be able to solve our cache problem with a few flags. :P

In order to use it gcc has to be compiled with support for graphite (which it isn't by default), and you need the ppl and cloog-ppl libraries for it to work. AFAIK all that should be required is:

Code: Select all

-O2 -fgraphite-identity -floop-nest-optimize
I don't know if that does data layout optimization or not, I haven't been able to dig that far yet and it may take extra options. If it breaks spring, I'm sure they'd be interested to know upstream too, as the optimization is supposed to produce identical behavior.

I haven't gotten quite that far yet, since setting up mingw to do anything nonstandard (or useful, for that matter) requires doing complicated things with cygwin. The whole git bash/mingw/cygwin mess in general is making me regret not bothering to figure out how to set up dual boot. :|
Super Mario
Posts: 823
Joined: 21 Oct 2008, 02:54

Re: Spring has a problem with L2 cache misses

Post by Super Mario »

aeonios wrote:
I haven't gotten quite that far yet, since setting up mingw to do anything nonstandard (or useful, for that matter) requires doing complicated things with cygwin. The whole git bash/mingw/cygwin mess in general is making me regret not bothering to figure out how to set up dual boot. :|
Are you running windows? If so, what's preventing you from using cmake?
aeonios
Posts: 202
Joined: 03 Feb 2015, 14:27

Re: Spring has a problem with L2 cache misses

Post by aeonios »

Yes, windows, but cmake isn't the problem. The problem is I need automake and autoconf and libtool for working on libumem, and PPL and CLooG-ppl for compiling mingw/gcc with graphite (and so far PPL does not compile under cygwin at all, which sucks), and to compile mingw in the same environment with all of that other stuff I need, then set up my env for compiling with mingw, etc.
dansan
Server Owner & Developer
Posts: 1203
Joined: 29 May 2010, 23:40

Re: Spring has a problem with L2 cache misses

Post by dansan »

Just install VirtualBox, download a Ubuntu/Whatever ISO, boot it in VBox, install GuestAdditions (important!) in virtual Ubuntu. Make sure to give >=2 "CPU"s to the VM.

All done in 60min and no need to modify your disks etc.

For access to your devel environment on the host, create a "shared folder" in VBox and mount it in vUbu or simply install a dev env in vUbu - performance loss of working in a VM is <5% for CPU bound tasks.
User avatar
Silentwings
Posts: 3720
Joined: 25 Oct 2008, 00:23

Re: Spring has a problem with L2 cache misses

Post by Silentwings »

The notes here http://nicolas.brodu.net/en/programmation/streflop/ (this is the library Spring uses to do synced math), say that at least the default options of O2 and O3 won't sync.
User avatar
PicassoCT
Journeywar Developer & Mapper
Posts: 10450
Joined: 24 Jan 2006, 21:12

Re: Spring has a problem with L2 cache misses

Post by PicassoCT »

aeionius - thats all nice and fun, but are you really sure , that a compiler can extract the access pattern of loops, even when the loopconditions are all highly dynamic? Cause that would be awesome- thats near the holy grall of comp sience, a programm getting what a program is trying to do..

And yes, its highly difficult to get what a programm is trying to do... even as human.
Pathfinding does a funny walk over PathNodes array. Even i m not sure my optimization is going to cut it- im just following some heuristics. Please explain to me how he does it..
aeonios
Posts: 202
Joined: 03 Feb 2015, 14:27

Re: Spring has a problem with L2 cache misses

Post by aeonios »

dansan wrote:Just install VirtualBox, download a Ubuntu/Whatever ISO, boot it in VBox, install GuestAdditions (important!) in virtual Ubuntu. Make sure to give >=2 "CPU"s to the VM.

All done in 60min and no need to modify your disks etc.

For access to your devel environment on the host, create a "shared folder" in VBox and mount it in vUbu or simply install a dev env in vUbu - performance loss of working in a VM is <5% for CPU bound tasks.
Right. Except when I tried to do it the other way around with windows under qemu it was more like a 90% overhead for all tasks, even with kernel-passthrough, which there is basically no chance of under windows. Setting up a distro is a pain even under real hardware, and if I'm going to that much trouble I'd rather be able to use it natively. :|
Silentwings wrote:The notes here http://nicolas.brodu.net/en/programmation/streflop/ (this is the library Spring uses to do synced math), say that at least the default options of O2 and O3 won't sync.
That's curious. I recall there being a floating point bug in the first intel processors (which has compiler workarounds), but nowadays FP is standardized well enough that you should get the same results even between PPC and x86 (and certainly between AMD and intel). If I do "1/10" in windows calculator it says "0.1" every time, as you would expect, and not "0.99999" or some other nonsense. That sort of thing should be obvious.

The C standard is perhaps the dumbest thing I have ever had the misfortune of bearing witness to, but even -that- should only be relevant if you're doing type conversions (which is basically fixable if you know what the C standard is going to do anyway).
PicassoCT wrote:aeionius - thats all nice and fun, but are you really sure , that a compiler can extract the access pattern of loops, even when the loopconditions are all highly dynamic? Cause that would be awesome- thats near the holy grall of comp sience, a programm getting what a program is trying to do..

And yes, its highly difficult to get what a programm is trying to do... even as human.
Pathfinding does a funny walk over PathNodes array. Even i m not sure my optimization is going to cut it- im just following some heuristics. Please explain to me how he does it..
It depends. Basically, it's much easier to optimize for if the arrays are being accessed in a linear or mostly linear order. It can optimize loops even if the number of iterations of the loops are different, and even for complex nests, but not nearly as well for totally random access. Polyhedral loop optimization is very flexible, but it's not really magic. As far as pathfinding is concerned, it depends on what kind of "funny walk" we're talking about. That might require restructuring, or may not be tractable at all.

The easiest thing to do is try it and see if the miss rate improves. The next easiest thing to do is use instrumentation based profiling to see where our runtime is being spent, and work on it on a case-by-case basis starting with whatever takes the most time. That's just standard procedure for optimizing anything.
User avatar
Silentwings
Posts: 3720
Joined: 25 Oct 2008, 00:23

Re: Spring has a problem with L2 cache misses

Post by Silentwings »

nowadays FP is standardized well enough that you should get the same results even between...
However often it may "just work" and is better than it used to be (which is not much help for Spring, since Spring requires absolute reproducibility) some variation still exists in how it can be implemented. Strictly speaking, all IEEE gives is a recommendation that the language standards provide a means to write fully reproducible programs, with a description of how to do so.
User avatar
PicassoCT
Journeywar Developer & Mapper
Posts: 10450
Joined: 24 Jan 2006, 21:12

Re: Spring has a problem with L2 cache misses

Post by PicassoCT »

Question:

Code: Select all

struct PathNode {
PathNode()
: fCost(0.0f)
, gCost(0.0f)
, nodeNum(0)
, nodePos(0, 0)
{}
float fCost; ///< f
float gCost; ///< g
int nodeNum;
int2 nodePos;
inline bool operator < (const PathNode& pn) const { return (fCost < pn.fCost); }
inline bool operator > (const PathNode& pn) const { return (fCost > pn.fCost); }
inline bool operator == (const PathNode& pn) const { return (nodeNum == pn.nodeNum); }
};
Could we compress this a little down? for example, could we instead of a int2 use a unsigned int split into? How many Nodes are there ?
By definition there can only be NodeNum which is IntMax many!
So 2 unsigned int16 should be sufficient 65536 - which would allow us to cram nodePos into one int..

Code: Select all

unsigned int nodePos;
#define SetNodePos[1]=(a)   nodePos= ((nodePos & (!0xFFh)) | a)
#define SetNodePos[2]=(a)   nodePos= ((nodePos & (0xFFh)) | a)
#define GetNodePos[1]          (nodePos & 0xFFh)
#define GetNodePos[2]          (nodePos & 0xFF00h
User avatar
jK
Spring Developer
Posts: 2299
Joined: 28 Jun 2007, 07:30

Re: Spring has a problem with L2 cache misses

Post by jK »

Good find, just replace int2 with type2<boost::uint16_t>, go through all compile errors - there are some cause "int2 foo = type2<boost::uint16_t>(0,0) + int2(1,1);" isn't defined.

There are 64K PathNodes per PathNodeBuffer so saving 4bits frees 0.25MB something that should help L2. Also there 3 PathNodeBuffers (PathFinder and two PathEstimators).

So it's worth it.
User avatar
PicassoCT
Journeywar Developer & Mapper
Posts: 10450
Joined: 24 Jan 2006, 21:12

Re: Spring has a problem with L2 cache misses

Post by PicassoCT »

another goood "find"

Code: Select all

-->encapsulates a function, stores arguments given, chooses upon returned nil, 
--	the most often chosen argument
function heuristicDefault(fooNction,fname, teamID, ...)

if not  GG[fname] then  GG[fname]={} end
if not GG[fname][teamID] then GG[fname][teamID] ={} end

local heuraTable= GG[fname][teamID] 
ArgumentCounter=1
	for k,v in pairs(arg) do
	if not heuraTable[ArgumentCounter]then heuraTable[ArgumentCounter]={}end
	if not heuraTable[ArgumentCounter][v] then heuraTable[ArgumentCounter][v]=1 else heuraTable[v]=heuraTable[ArgumentCounter][v]+1  end
	ArgumentCounter=ArgumentCounter+1
	end

results=fooNction(args)

	if not results  then
	--devalue current Arguments
		ArgumentCounter=1
		for k,v in pairs(arg) do
		heuraTable[ArgumentCounter][v]=heuraTable[ArgumentCounter][v]-1  
		ArgumentCounter=ArgumentCounter+1
		end

	--call the function with the most likely arguments
	newWorkingSet={}
		ArgumentCounter=1
		for k,v in pairs (arg) do
		highestVal,highestCount=0,0
			for i,j in pairs ( heuraTable[ArgumentCounter]) do
				if heuraTable[ArgumentCounter][v] > highestCount then
				highestCount= heuraTable[ArgumentCounter][v] 
				highestVal= v
				end 
			end
		table.insert(newWorkingSet,highestVal)
		ArgumentCounter=ArgumentCounter+1
		end
	results=fooNction(newWorkingSet)
	Spring.Echo("FallBack::Heuristic Default")
	assert(results, "Heuristic Default has inssuficient working samples.Returns Nil")
	GG[fname][teamID]=heuraTable
	return results
		else
		GG[fname][teamID]=heuraTable
		return results
		end
end 
With this code, you can waste all the performance we gained gracefully for the comfort of never seeing a lua error..
User avatar
jK
Spring Developer
Posts: 2299
Joined: 28 Jun 2007, 07:30

Re: Spring has a problem with L2 cache misses

Post by jK »

Update:
here a graph of latest dev build, after some changes (also thank you pica for your patches)
Image
what changed:
* moved all OpenAL calls to the sound thread
* massive tweaks in ::TestNeighbors
* tweaks in CGroundMoveType::HandleUnitCollisions
* some minor L2 opt in rendering (ROAM & Unit)

What left is LocalModelPiece::UpdateMatricesRec().

Stats of L2 missrate are:
jk spring.git # perf stat -p $(pidof spring) -v -d
Performance counter stats for process id '16195':

181429.848239 task-clock (msec) # 1.251 CPUs utilized [100.00%]
65,839 context-switches # 0.363 K/sec [100.00%]
4,074 cpu-migrations # 0.022 K/sec [100.00%]
23,705 page-faults # 0.131 K/sec
540,908,636,724 cycles # 2.981 GHz [40.07%]
55,108,186,839 stalled-cycles-frontend # 10.19% frontend cycles idle [39.98%]
266,143,284,681 stalled-cycles-backend # 49.20% backend cycles idle [40.06%]
499,312,280,417 instructions # 0.92 insns per cycle
# 0.53 stalled cycles per insn [39.92%]
65,259,476,435 branches # 359.695 M/sec [40.01%]
3,195,035,520 branch-misses # 4.90% of all branches [39.99%]
241,564,357,073 L1-dcache-loads # 1331.448 M/sec [39.90%]
3,557,294,843 L1-dcache-load-misses # 1.47% of all L1-dcache hits [40.09%]
6,809,810,347 LLC-loads # 37.534 M/sec [40.04%]
1,899,545,216 LLC-load-misses # 27.89% of all LL-cache hits [40.11%]

145.051595277 seconds time elapsed
-> stalled cycles per insn: 0.63 -> 0.53
-> LLC-load-misses: 33.31% -> 27.89%

profile.png
(619.24 KiB) Not downloaded yet
gajop
Moderator
Posts: 3051
Joined: 05 Aug 2009, 20:42

Re: Spring has a problem with L2 cache misses

Post by gajop »

Image
Post Reply

Return to “Engine”