matrix multiplication code optimization
At first I was going to ask if you guys were interested in a matrix multiplication function in inline assembly, but...
I just tried tweaking the formula in C++, and I got it to run almost as fast as the assembler version by exposing the m variable of the matrix and using m2.m[] instead of m2[], thus avoiding a call to the [] operator every time.
The times for 1 million multiplications:
my assembly version: 337 ms
existing TASpring C++ version: 1420 ms
TASpring version after the change: 350 ms
Just changing that made the matrix multiplication code in TASpring about 4 times faster.
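A minimal sketch of the two access styles being compared, assuming a hypothetical Matrix44 class with a public m array and a row-major layout (illustrative names, not the actual TASpring class):

```cpp
#include <cstddef>

// Hypothetical stand-in for the engine's 4x4 matrix class (assumption).
class Matrix44 {
public:
    float m[16];  // exposed storage, treated as row-major here

    Matrix44() : m{} {}  // zero-initialize all 16 elements

    // Element access through the overloaded operator.
    float& operator[](std::size_t i) { return m[i]; }
    const float& operator[](std::size_t i) const { return m[i]; }
};

// Multiply using operator[] for every element access.
Matrix44 mulViaOperator(const Matrix44& a, const Matrix44& b) {
    Matrix44 r;
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a[row * 4 + k] * b[k * 4 + col];
            r[row * 4 + col] = sum;
        }
    }
    return r;
}

// Same arithmetic, but indexing the member array directly (the m2.m[] style).
Matrix44 mulViaMember(const Matrix44& a, const Matrix44& b) {
    Matrix44 r;
    for (int row = 0; row < 4; ++row) {
        for (int col = 0; col < 4; ++col) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += a.m[row * 4 + k] * b.m[k * 4 + col];
            r.m[row * 4 + col] = sum;
        }
    }
    return r;
}
```

Both functions compute the same result; any speed difference comes only from whether the compiler inlines operator[], which depends on the build settings.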
The way the COB/3do/s3o code calculates the matrices is suboptimal anyway; you could get a much bigger speedup by caching the matrices per COB piece every frame. IIRC, every time a piece matrix is requested (which happens a lot), all the matrices of its parents are calculated again as well. BOS scripts changing the positions/rotations of pieces would have to be considered too, though.
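A rough sketch of that caching idea, under stated assumptions: Piece, Mat4 and every member name here are hypothetical and not Spring's actual classes. Instead of a frame number it uses version counters, so that a script changing a piece mid-frame also invalidates the cached matrices of its children:

```cpp
#include <array>

using Mat4 = std::array<float, 16>;  // row-major 4x4 (assumption)

Mat4 identity() {
    Mat4 r{};
    for (int i = 0; i < 4; ++i) r[i * 5] = 1.0f;
    return r;
}

Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                r[row * 4 + col] += a[row * 4 + k] * b[k * 4 + col];
    return r;
}

// Hypothetical piece node with a cached world matrix.
struct Piece {
    Piece* parent = nullptr;
    Mat4 local = identity();    // set by the script (position/rotation)
    Mat4 world = identity();    // cached parent-chain product
    unsigned localVersion = 1;  // bumped whenever the script changes `local`
    unsigned worldVersion = 0;  // bumped whenever `world` is recomputed
    unsigned seenLocal = 0;     // localVersion the cache was built from
    unsigned seenParent = 0;    // parent's worldVersion the cache was built from
    int recomputes = 0;         // counts actual matrix multiplications

    void setLocal(const Mat4& m) { local = m; ++localVersion; }

    const Mat4& getWorld() {
        unsigned parentVersion = 0;
        if (parent) {
            parent->getWorld();  // cheap version check when nothing changed
            parentVersion = parent->worldVersion;
        }
        if (seenLocal != localVersion || seenParent != parentVersion) {
            world = parent ? multiply(parent->world, local) : local;
            seenLocal = localVersion;
            seenParent = parentVersion;
            ++worldVersion;
            ++recomputes;
        }
        return world;
    }
};
```

Repeated requests still walk up the parent chain, but they only compare counters; the multiplication runs again only after a setLocal() call somewhere in the chain.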
SSE can probably not be used due to floating point sync problems.
I did, because I couldn't get it to compile in release with the libraries I was using. Now I've got it running in release, and I turned on the optimizations in VC++.
Here are my results this time, in release with optimizations enabled:
4'273'504 per second in assembler
4'694'835 per second in C++ without operator overload
1'358'695 per second in C++ with operator overload
3'952'569 per second in C++, TASpring style, without operator overload
2'314'814 per second in C++, TASpring style, with operator overload
The C++ formula I wrote out turns out to be the fastest of all of them when not using the overloaded operator.
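The first post reported milliseconds per million operations, while the figures above are operations per second. A small sketch of how such numbers can be measured and converted, assuming std::chrono is available; the function names are made up for illustration and this is not the harness that produced the results above:

```cpp
#include <chrono>
#include <cstddef>

// Convert "elapsedMs milliseconds for `ops` operations" to operations per second.
double opsPerSecond(std::size_t ops, double elapsedMs) {
    return static_cast<double>(ops) * 1000.0 / elapsedMs;
}

// Call fn(i) `ops` times and report the rate in operations per second.
// fn should have a visible side effect (e.g. write to a volatile sink),
// otherwise an optimizing compiler may remove the whole loop.
template <typename Fn>
double measureOpsPerSecond(std::size_t ops, Fn fn) {
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < ops; ++i) fn(i);
    const auto end = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(end - start).count();
    return opsPerSecond(ops, ms);
}
```

By this conversion, 337 ms per million corresponds to roughly 2.97 million operations per second.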
TBH we really shouldn't use any assembler in Spring; the advantages (speed) just don't outweigh the disadvantages (bad maintainability & readability).
As for the compiler not inlining the operator[], that sounds strange. Usually stuff like std::vector::size() and std::vector::operator[] does get inlined (at least I usually can't call them in the debugger because they don't exist if I compiled with optimization and debugging enabled).
SSE will indeed most probably desync, plus we'd need a 387 version anyway for PCs without SSE. (Plus, just enabling SSE math in the compiler would be a lot easier, though you don't get good vectorization that way...)
SSE has existed since the Pentium III (and SSE2 since the Pentium 4). Now that Macs use Intel chips, I don't think it is unreasonable to require it. You would be locking out the PowerPC users, but Apple doesn't sell PPC Macs anymore, and hasn't for almost a year. Considering I expect it will be a while before the MacOS version syncs anyway, I think SSE is totally reasonable to demand.
Not that I think we should be using assembly in the code anyway, for the reasons Tobi mentioned.
BTW, anyone who manages to speed up the sim loop or the unithandler loop will get lots of gratitude from me. I miss huge games. OTA used to support thousands of units in one game, and Spring can't handle 500 without slowing down.
I've been wanting to try tweaking other parts, but frankly I can't get TASpring to compile. I spent a couple of hours today trying to get all the libraries it wants, and when I added the WTL library for the crash handler, it started spitting out hundreds of errors, saying stuff like float3 and frustum don't exist.
Getting the libraries should be as simple as downloading the library package... did you follow the compiling thread?
Well, as I said in my first post, I gave up on the assembly thing, even in my own program, because unless I use SSE or 3DNow!, optimized C++ is faster if you don't use the overloaded operator. That is what I've been trying to say: the method used in TASpring right now is as fast as my assembly version if you access the m variable directly instead of using the overloaded operator.
And no, I didn't know about any compiling guide; I just loaded the VC7 project and started trying to fix the errors.
Edit: I just put in the library pack. I was surprised that it had the newer stuff you guys added, like the crash handler, but its Boost library was too old to compile 74b3, so I had to update it.
Edit again: ah, it doesn't have the crash handler, and that's the killer. When I give it those libraries the whole thing explodes; or rather, CrashRpt wants the WTL libraries, and once I include them, every class in the whole game loses its definition.