Over the years i've had to write, re-write and optimize math libraries far too often. Sometimes 2-3 times within 1 project. This might sound strange given that time is money however, counter intuitively, that fact is the driving force behind so many re-writes.
I believe that its possible to write an "ok" math library that will work for "all" use cases but nigh on almost impossible to write an "optimal" math library to suite all cases. Large and still growing programming teams, changing platforms and multiple simultaneous platforms and differing requirements across target executables compound the issue to the point where its simply too complex.
Gustavo Oliveira discusses some of the considerations here which can be roughly summed up as
- use intrinsics
- return by value
- declare data purely
- provide operators and functions
- inline everything
- replicate results in SIMD
1. Use Intrinsics
Intrinsics are functions whose implementation is handled entirely by the compiler. See Wiki for more. Most compilers provide the ability to write inline assembly routines and this can be very useful for pieces of code that require significant hand optimization however the for short functions such as those seen in vector math libraries, the bulk of the programming work is in the handling of the calling convention and how the return values are provided and interpreted. Due to the myriad entrypoints to support for math libraries it quickly gets out of control and unmanageable.
Intrinsics hide these issues, the compiler has intimate knowledge of the intrinsic function, how it should be called, how it returns, how it can be combined with other instructions and every possible combination you might otherwise have to handle with inline assembly. It does this optimally every time (IME).
2. "XXX by Value"
Return by value, on this one i both agree and disagree. It isn't hard and fast in all cases, especially so in cases where a function has more than 1 resultant value. Generally speaking however you should, where possible, write code that returns register sized values as "return by value". Under optimization the compiler will attempt to keep the value in a register and not commit to memory thus making the code faster. Certain platforms are however limited to a small set of registers for SIMD math and thus storage back to memory is more prevalent. I recommend investigating this for your platform(s) of choice.
The XNAMath library handles this by specifying function parameters that can be altered per platform. For example
XMFINLINE XMVECTOR XMVector2IntersectLine
CXMVECTOR Line2Point2 <== different type
FXMVector is defined to be a constant, pass by reference type, for platforms that don't support passing in registers, under x86 we are limited to passing the 1st to 3rd parameters as registers so only the first three params to this function use this type. The typedef is thus defined as
typedef const XMVECTOR FXMVECTOR;
for x86 / xbox 360.
On xbox 360 the 4th parameter is defined as
typedef const XMVECTOR CXMVECTOR;
whereas on x86 its defined as
typedef const XMVECTOR& CXMVECTOR;
This allows the actual code to be completely independent of the underlying hardware limitations. Under x64 the PC has more registers to use thus the x86 limited types can be defined to pass by value without changing ought but the typedef.
This platform agnostic setup becomes very important as more and more platforms need support.
Note: On some platforms its possible to define this as a directive to the compiler. On xbox-360 for instance the compiler provides a pragma (passinreg) applying to a single float, double or __vector4. This informs the compiler that it should "prefer" using registers over temporary storage and helps avoid LHS (Load Hits Store).
3. Declare data purely
This is a good general rule however brings with it issues that are hard to avoid. The theory is that the compiler can better optimize a native type than a wrapper type, which is generally the case apart from types supported by extended instructions such as VMX. On 360 the passinreg pragma does not require a pure type to be used provided the data is the only member of the class being created. This allows you to define your type as
// my foo vector interface
and the compiler will treat it as if it were the __vector4 type. This is similarly the case for float/double types and is enormously useful.
What this means is that we can wrap the datatypes and better define their syntax and semantics without negatively impacting the resultant code generation. Typical types defined by a math library might be
- point4d - 3d position in space, w = 1
- direction4d - 3d direction vector, w = 0
- quaternion - simple quaternion container
- sphere - 3d position, w= radius
- plane - 3d normal, w= distance
- euler4d - euler angles, w not used.
- float4d - x=y=z=w
direction4d generate_direction(point4d_const_in a, point4d_const_in b);
point4d offset_point(point4d_const_in a, direction4d dir, float4d length);
Once a high level math library is defined in terms of behavior rather than type it is much more usable by general programmers who then don't need to worry about the implementation itself, only that the behavior they require is well defined by the type they choose.
So, i would recommend that you investigate your platform and make the call, sometimes platforms don't provide any type of "passinreg" functionality and then it becomes a judgement call. Personally i prefer a wrapped type than a pure type simply because it allows me to limit the functionality i provide to the clients of my system.
4. Operators vs Functions
we can all see that the code
position = position_a + direction_b * scalar_c;
is easy to read and its precedence well defined. Just like general math terms, the more complex things get the more difficult it is to understand and thus we expand things out to more meta terms (temporaries). The compiler will do the same when presented with a long sequence of operators.
IME however... it doesn't help to re-factor that long sequence of operators into a long(er?) sequence of function calls. I am willing to believe there are circumstances where this occurs however the general case that i examine when writing code doesn't show it.
For now i recommend writing all sensible operators, of course those defined as "sensible" are often up for discussion. I tend to do +-*/ and all variants leaving it at that. I highly recommend AGAINST ever overloading operators to supply cross product / dot product.
5. Inline everything.
Another general rule that i agree with (but for those pesky debug builds). Personally i choose to force the inlining of all base level math functions.
Most development occurs in a debug form, we've written the code and it didn't work first time, or we've come back to fix an edge case 6months after writing the code. For most game logic code the math library will be a significant support system and in many cases it is >50% OF said code.
In a release build it is highly likely that most of our math functions will boil down to a small set of instructions, in many cases a single instruction acting on registers alone. This is perfect for our needs in a release environment. Due to the nature of math code we tend to use a lot of functions for very little actual work per function. In a full debug build the compiler will turn off all optimisations, this includes the general use of pass-by-register and return by register, it will also NOT inline functions defined as "inline" while (most of the time) maintaining the inlining of code defined as "forceinline". The net result is we get a LOT more actual code for a client function using the math library. The code is horribly inefficient, dumping to memory after every operation which results in significant LHS issues.
Vectorized inline reliant math in debug is slow, so don't inline in debug.
What i choose to do is define the math library as inline for all builds other than debug, in debug we instance the functions into a release optimized lib that the client executable links against. The math code itself isn't inlined so it doesn't bloat AND the compiler will honor the calling conventions of the code. It won't be anywhere near as fast as the full release inline versions but you can expect ~40% of the speed of said build completed to ~5-10% for full debug with inlining.
The dot product mathematically provides a scalar value but in code this value most often used AS an operand to a vector operation. Consider reflection
i - [in] A floating-point reflection vector.
n - [in] A floating-point ray direction vector.
v = i - 2 * dot(i, n) * n
a naive implementation might do
// Result = Incident - (2 * dot(Incident, Normal)) * Normal
float f_dot = dot_product(i , n); // vector -> memory
vector4 result= i - (2.0f * f_dot * n);
this will roughly boil down to
// simd register -> memory
float f_dot = dot_product(i , n);
// LHS (~25cycles)
// memory -> float register + mul -> memory
float tmp= 2.0f * f_dot
// LHS (~25cycles)
// memory -> simd register + replicate
vector4 promote= replicate(tmp);
// simd mul, simd subtract
vector4 result= i - (promote * n);
for which we get 2 LHS hits. (360 PPC chip experience, not sure on PC side)
Now if we consider the instructions above its not a huge jump to see how to remove the LHS problems.
// simd register return from dot_product
float4 f_dot4 = dot_product(i , n);
float4 f_dot4_2 = f_dot4 + f_dot4;
// simd mul, simd subtract
vector4 result= i - (f_dot4_2 * n);
this is a relatively simple change that removes some very expensive operations & delays... it can be taken further with the use of vnmsubfp which will perform result= - ((b * c) - a).
the general principle is that vector data should remain registers/vector storage as long as possible.
Part 2 will discuss implementation details.