Saturday, August 7, 2010

Static assert with a "real" message.

static asserts are important for many reasons and are heavily used in most large codebases, especially when the codebase has many target environments. They inform the programmer at compile time when an assumption fails, allowing the runtime code to avoid expensive verification.
#define COMPILE_ASSERT(x) extern int __dummy[(int)x]

Which is typically used as
COMPILE_ASSERT(sizeof(int) == sizeof(unsigned));
COMPILE_ASSERT(sizeof(int) == sizeof(char));

The second example is obviously wrong and produces the output.
error C2466: cannot allocate an array of constant size 0

Which is not very useful other than as a signal that something bad occurred.

Now consider the form
#pragma warning (disable: 4483)

typedef char __identifier("compile assert:");

#define COMPILE_STRING_ASSERT(condition, name) \
typedef __identifier("compile assert:") __identifier(name)[1]; \
typedef __identifier("compile assert:") __identifier(name)[(condition) ? 1 : 0];

#define COMPILE_ASSERT(condition) COMPILE_STRING_ASSERT(condition, #condition)

The above uses the __identifier keyword, which is technically an error here; the pragma warning disables that error (yes, disabling an error via a warning pragma doesn't make sense, but it works). This keyword allows the user to define ANY identifier, valid C++ or not.
COMPILE_ASSERT(sizeof(int) == sizeof(char));

Now provides the output
'compile assert: sizeof(int) == sizeof(char)[1]'

which is a much more palatable message, especially when considering automatic build systems.

  • __identifier is specific to Visual Studio and may not be supported by other compilers.
  • VC++ 2010 introduces static_assert for all targets, though it is less powerful than using __identifier.
  • SNC may support both; I will try to confirm.
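Where VC++ 2010 (or any C++11 compiler) is available, static_assert carries its message directly with no pragma tricks:

```cpp
// static_assert is a language keyword in VC++ 2010 / C++11; the string is
// echoed in the compiler error when the condition fails.
static_assert(sizeof(int) == sizeof(unsigned), "int and unsigned must be the same size");
```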

Becoming a console programmer : Math Libraries Part 2


Generally speaking, the implementation of a math library is determined not by the programmer DOING the implementation but by the clients of the system and the platforms that require support.

Given the myriad targets currently available AND the similarly complex client base, I always recommend a relatively complex abstraction built on simple rules:
  • separate the client facing api into semantic types
  • define those types using platform specific typedefs or defines
  • implement using a cross platform api with platform specific implementations.
For example we might want the client to use matrix, point and vector types, plus a proxy type for float constants.

The semantic types define their functionality in terms of the low level api, which in turn defines a set of common operations that all the semantic types can use at will. This means that to support a new platform for ALL semantic types (and there are usually far more than 4 of them) the programmer simply has to satisfy the lowest level api.

That low level api is typically relatively small, providing wrappers for basic operations such as add, mul, madd, permute, shuffle/vsel etc, whilst also implementing some of the more complex but common operations such as normalize etc.

To flesh things out for the client side the programmer should define operators and functions that provide mathematically sensible and desirable operations such as

  • vector = point - point,
  • point = point + vector * scale.
The proxy type mentioned above would be a simd representation of a single float, allowing the math itself to remain in registers.
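A minimal sketch of those rules, using SSE as the stand-in platform (all names here are hypothetical, not from any shipping library): the low level api wraps the intrinsics once, and the semantic types and operators are built purely on top of it.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: the "platform specific" layer here

// ---- lowest level api: satisfy this per platform and everything above works
typedef __m128 simd4;
inline simd4 simd_add(simd4 a, simd4 b) { return _mm_add_ps(a, b); }
inline simd4 simd_sub(simd4 a, simd4 b) { return _mm_sub_ps(a, b); }
inline simd4 simd_mul(simd4 a, simd4 b) { return _mm_mul_ps(a, b); }

// ---- semantic types: defined only in terms of the low level api
struct point4d  { simd4 v; };
struct vector4d { simd4 v; };
struct float4d  { simd4 v; };  // proxy type: a float replicated across lanes

// vector = point - point
inline vector4d operator-(point4d a, point4d b)
{ vector4d r = { simd_sub(a.v, b.v) }; return r; }

// vector = vector * scale (scale stays in a simd register via the proxy type)
inline vector4d operator*(vector4d d, float4d s)
{ vector4d r = { simd_mul(d.v, s.v) }; return r; }

// point = point + vector
inline point4d operator+(point4d p, vector4d d)
{ point4d r = { simd_add(p.v, d.v) }; return r; }
```

Note that point + point has no operator, so mathematically nonsensical expressions simply fail to compile; porting to a new platform means re-implementing only the typedef and the three wrappers.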

The key to maintaining long mathematically complex code in registers is in ensuring the semantic types are well defined within themselves and as a whole.

There is a lot more to be said on the subject of math libraries, however without infringing on company policies both new and old I cannot expand further. The above should provide a good start; from there a good programmer will run with it and discover the rest for themselves.


Monday, July 5, 2010

Becoming a console programmer : Math Libraries Part 1

Writing a Math Library

Over the years I've had to write, re-write and optimize math libraries far too often, sometimes 2-3 times within one project. This might sound strange given that time is money; counter-intuitively, that fact is the driving force behind so many re-writes.

I believe that it's possible to write an "ok" math library that will work for "all" use cases, but nigh on impossible to write an "optimal" math library to suit all cases. Large and still growing programming teams, changing platforms, multiple simultaneous platforms and differing requirements across target executables compound the issue to the point where it's simply too complex.

Gustavo Oliveira discusses some of the considerations here which can be roughly summed up as

  1. use intrinsics
  2. return by value
  3. declare data purely
  4. provide operators and functions
  5. inline everything
  6. replicate results in SIMD
I don't agree with all of the above as hard rules, preferring to judge based on each specific platform and target build.

1. Use Intrinsics

Intrinsics are functions whose implementation is handled entirely by the compiler. See Wiki for more. Most compilers provide the ability to write inline assembly routines, which can be very useful for pieces of code that require significant hand optimization. However, for short functions such as those seen in vector math libraries, the bulk of the programming work is in handling the calling convention and how the return values are provided and interpreted. With the myriad entry points a math library must support, this quickly gets out of control and unmanageable.

Intrinsics hide these issues. The compiler has intimate knowledge of each intrinsic function: how it should be called, how it returns, how it can be combined with other instructions, and every combination you might otherwise have to handle with inline assembly. It does this optimally every time (IME).

Use them.
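As a concrete illustration (SSE, purely as an example): the intrinsic below leaves calling convention, register allocation and scheduling entirely to the compiler, where the inline-assembly equivalent would have to manage all three by hand.

```cpp
#include <xmmintrin.h>

// Four-wide float add; _mm_add_ps typically compiles to a single addps
// instruction, with inputs and result living in XMM registers.
inline __m128 add4(__m128 a, __m128 b)
{
    return _mm_add_ps(a, b);
}
```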

2. "XXX by Value"

Return by value: on this one I both agree and disagree. It isn't hard and fast in all cases, especially where a function has more than one resultant value. Generally speaking, however, you should, where possible, write code that returns register sized values by value. Under optimization the compiler will attempt to keep the value in a register and not commit it to memory, making the code faster. Certain platforms are however limited to a small set of registers for SIMD math, so storage back to memory is more prevalent. I recommend investigating this for your platform(s) of choice.

The XNAMath library handles this by specifying function parameter types that can be altered per platform. For example
FXMVECTOR Line1Point1,
FXMVECTOR Line1Point2,
FXMVECTOR Line2Point1,
CXMVECTOR Line2Point2 <== different type

FXMVECTOR is defined to be a constant, pass-by-value type on platforms that support passing vectors in registers, and a constant reference type on those that don't. Under x86 we are limited to passing the first three vector parameters in registers, so only the first three parameters of this function use that type. The typedef is thus defined as

typedef const XMVECTOR  FXMVECTOR;

for x86 / Xbox 360.
On Xbox 360 the 4th parameter can also be passed in a register, so there it is defined as

typedef const XMVECTOR  CXMVECTOR;

whereas on x86 it's defined as
typedef const XMVECTOR&  CXMVECTOR;

This allows the actual code to be completely independent of the underlying hardware limitations. Under x64 the PC has more registers to use, so the x86-limited types can be defined to pass by value without changing aught but the typedef.

This platform agnostic setup becomes very important as more and more platforms need support.

Note: On some platforms it's possible to express this as a directive to the compiler. On Xbox 360 for instance the compiler provides a pragma (passinreg) that applies to a single float, double or __vector4. This informs the compiler that it should "prefer" using registers over temporary storage and helps avoid LHS (Load Hits Store).

3. Declare data purely

This is a good general rule, however it brings with it issues that are hard to avoid. The theory is that the compiler can better optimize a native type than a wrapper type, which is generally the case, apart from types supported by extended instruction sets such as VMX. On 360 the passinreg pragma does not require a pure type provided the data is the only member of the class being created. This allows you to define your type as

class my_foo_vector
{
    __vector4 m_vector_data;

public:
    // my foo vector interface
};

and the compiler will treat it as if it were the __vector4 type. This is similarly the case for float/double types and is enormously useful.

What this means is that we can wrap the datatypes and better define their syntax and semantics without negatively impacting the resultant code generation. Typical types defined by a math library might be

  • point4d - 3d position in space, w = 1
  • direction4d - 3d direction vector, w = 0
  • quaternion - simple quaternion container
  • sphere - 3d position, w= radius
  • plane - 3d normal, w= distance
  • euler4d - euler angles, w not used.
  • float4d - x=y=z=w
Each of the types represents a specific behavior and has specific interfaces, both local to the type and for generating instances of it. For instance
direction4d generate_direction(point4d_const_in a, point4d_const_in b);
point4d offset_point(point4d_const_in a, direction4d dir, float4d length);

Once a high level math library is defined in terms of behavior rather than type it is much more usable by general programmers who then don't need to worry about the implementation itself, only that the behavior they require is well defined by the type they choose.

So, I would recommend that you investigate your platform and make the call; sometimes platforms don't provide any kind of "passinreg" functionality and then it becomes a judgement call. Personally I prefer a wrapped type to a pure type, simply because it allows me to limit the functionality I provide to the clients of my system.

4. Operators vs Functions
We can all see that the code
position = position_a + direction_b * scalar_c;

is easy to read and its precedence well defined. Just as in general math, the more complex things get the harder they are to understand, so we expand expressions out into intermediate terms (temporaries). The compiler will do the same when presented with a long sequence of operators.

IME however... it doesn't help to re-factor that long sequence of operators into a long(er?) sequence of function calls. I am willing to believe there are circumstances where it does, however the general case that I examine when writing code doesn't show it.

For now I recommend writing all sensible operators; of course which operators count as "sensible" is often up for discussion. I tend to do +-*/ and all variants and leave it at that. I highly recommend AGAINST ever overloading operators to supply cross product / dot product.
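To make that last point concrete, here is a scalar sketch (type and function names hypothetical) of keeping dot and cross as named functions; dot(a, b) reads unambiguously, whereas an overloaded a ^ b drags C++'s surprising operator precedence into math expressions.

```cpp
struct vec3 { float x, y, z; };

// Named functions keep intent and precedence explicit.
inline float dot(vec3 a, vec3 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

inline vec3 cross(vec3 a, vec3 b)
{
    vec3 r = { a.y * b.z - a.z * b.y,
               a.z * b.x - a.x * b.z,
               a.x * b.y - a.y * b.x };
    return r;
}
```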

5. Inline everything.

Another general rule that I agree with (but for those pesky debug builds). Personally I choose to force the inlining of all base level math functions.

Most development occurs in a debug build; we've written the code and it didn't work first time, or we've come back to fix an edge case 6 months after writing it. For most game logic code the math library is a significant support system, and in many cases it is >50% of said code.

In a release build it is highly likely that most of our math functions will boil down to a small set of instructions, in many cases a single instruction acting on registers alone. This is perfect for our needs in a release environment. Due to the nature of math code we tend to use a lot of functions for very little actual work per function. In a full debug build the compiler will turn off all optimisations; this includes the general use of pass-by-register and return-by-register. It will also NOT inline functions defined as "inline", while (most of the time) maintaining the inlining of code defined as "forceinline". The net result is a LOT more actual code for a client function using the math library. The code is horribly inefficient, dumping to memory after every operation, which results in significant LHS issues.

Vectorized inline reliant math in debug is slow, so don't inline in debug.

What I choose to do is define the math library as inline for all builds other than debug; in debug we instance the functions into a release-optimized lib that the client executable links against. The math code itself isn't inlined so it doesn't bloat, AND the compiler will honor the calling conventions of the code. It won't be anywhere near as fast as the fully inlined release version, but you can expect roughly 40% of release speed, compared to roughly 5-10% for a full debug build with inlining.

6. Replicate

The dot product mathematically provides a scalar value, but in code this value is most often used as an operand to a further vector operation. Consider reflection

i - [in] the incident vector.
n - [in] the surface normal vector.
v = i - 2 * dot(i, n) * n

a naive implementation might do
// Result = Incident - (2 * dot(Incident, Normal)) * Normal
float f_dot = dot_product(i , n); // vector -> memory
vector4 result= i - (2.0f * f_dot * n);

this will roughly boil down to
// simd register -> memory
float f_dot = dot_product(i , n);
// LHS (~25cycles)
// memory -> float register + mul -> memory
float tmp= 2.0f * f_dot;
// LHS (~25cycles)
// memory -> simd register + replicate
vector4 promote= replicate(tmp);
// simd mul, simd subtract
vector4 result= i - (promote * n);

for which we get 2 LHS hits (360 PPC chip experience; not sure on the PC side).
Now if we consider the instructions above, it's not a huge jump to see how to remove the LHS problems.
// simd register return from dot_product
float4 f_dot4 = dot_product(i , n);
float4 f_dot4_2 = f_dot4 + f_dot4;
// simd mul, simd subtract
vector4 result= i - (f_dot4_2 * n);

This is a relatively simple change that removes some very expensive operations & delays... it can be taken further with the use of vnmsubfp, which performs result = -((b * c) - a).

The general principle is that vector data should remain in registers/vector storage for as long as possible.
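The same idea in SSE form (a sketch, not any particular library's implementation): the dot product is computed and left replicated across all four lanes of a simd register, so the reflection never touches memory. Note this is a 4-lane dot including w; for positions you'd zero w or use a 3-lane variant.

```cpp
#include <xmmintrin.h>

// Dot product returning the result replicated across all four lanes
// (SSE1-only horizontal add via shuffles).
inline __m128 dot4_splat(__m128 a, __m128 b)
{
    __m128 m = _mm_mul_ps(a, b);
    __m128 t = _mm_add_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_add_ps(t, _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)));
}

// v = i - 2 * dot(i, n) * n, computed entirely in simd registers.
inline __m128 reflect(__m128 i, __m128 n)
{
    __m128 d  = dot4_splat(i, n);
    __m128 d2 = _mm_add_ps(d, d);  // 2 * dot without ever leaving the registers
    return _mm_sub_ps(i, _mm_mul_ps(d2, n));
}
```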

Part 2 will discuss implementation details.

Thursday, June 17, 2010

Becoming a console programmer : Structure Alignment

This is an area that all programmers should be familiar with, however I've met and worked with a fair few who never had to care... until they had to care.
struct foo1
{
    char m_char;
};

struct foo2
{
    int m_int;
    char m_char;
};

struct foo3
{
    long long m_longlong;
    int m_int;
    char m_char1;
    char m_char2;
};
Using default settings under x86, these three structs are 1, 8 and 16 bytes respectively. What is actually allocated is

struct foo2
{
    int m_int;     // offset 0 : size 4 bytes
    char m_char;   // offset 4 : size 1 byte
    char m_pad[3]; // 3 bytes padding
};                 // 8 bytes total

struct foo3
{
    long long m_longlong; // offset 0 : size 8 bytes
    int m_int;            // offset 8 : size 4 bytes
    char m_char1;         // offset 12 : size 1 byte
    char m_char2;         // offset 13 : size 1 byte
    char m_pad[2];        // 2 bytes padding
};                        // 16 bytes total

Now let's take foo3 and jumble those members a little.
struct foo4
{
    char m_char1;
    long long m_longlong;
    char m_char2;
    int m_int;
};
One might expect foo4 to be the same size as foo3, however it is not; it comes out at 24 bytes. The padding is as follows.
struct foo4
{
    char m_char1;         // offset 0 : size 1 byte
    char m_pad7[7];       // 7 bytes padding
    long long m_longlong; // offset 8 : 8 bytes
    char m_char2;         // offset 16 : 1 byte
    char m_pad[3];        // 3 bytes padding
    int m_int;            // offset 20 : 4 bytes
};                        // 24 bytes total

Each type contained within a struct must begin on a natively aligned boundary, so the 8 byte "long long" must start at an 8 byte offset within the struct. Since it comes after m_char1, the compiler inserts 7 bytes of "padding" (which the user cannot directly access) to ensure this.

To inspect this for yourself use the code

#include <cstdio>   // printf
#include <cstddef>  // offsetof

printf("\n(%d,", (int)offsetof(foo4, m_char1));
printf("%d,", (int)offsetof(foo4, m_longlong));
printf("%d,", (int)offsetof(foo4, m_char2));
printf("%d)", (int)offsetof(foo4, m_int));
printf(": %d", (int)sizeof(foo4));

output : "(0,8,16,20): 24"

Manual Alignment
It is often beneficial to align the base of a structure beyond its native requirements. I won't go into the reasons for this here (that's a later subject), but doing so does change the internals of a struct.
struct _CRT_ALIGN(16) foo_aligned4
{
    char m_char1;
    long long m_longlong;
    char m_char2;
    int m_int;
};

This informs the compiler that the structure should be 16 byte aligned (and therefore a multiple of 16 bytes in size). _CRT_ALIGN is actually a wrapper around a declspec (see below for further reading).

In the case of foo4 this doesn't change the offsets of any of the internals but it does change the size of the struct

struct _CRT_ALIGN(16) foo_aligned4
{
    char m_char1;         // offset 0 : size 1 byte
    // 7 bytes padding
    long long m_longlong; // offset 8 : 8 bytes
    char m_char2;         // offset 16 : 1 byte
    // 3 bytes padding
    int m_int;            // offset 20 : 4 bytes
    // 8 bytes padding
};                        // 32 bytes total

If we were to then include an instance of foo_aligned4 inside another struct, that struct will inherit the same alignment. In general a structure will always be aligned to the largest alignment of its components.

struct foo_composite
{
    char m_char1;                // offset 0 : size 1 byte
    // 15 bytes padding
    foo_aligned4 m_foo_aligned4; // offset 16 : size 32 bytes
    long long m_longlong;        // offset 48 : size 8 bytes
    char m_char2;                // offset 56 : size 1 byte
    // 3 bytes padding
    int m_int;                   // offset 60 : size 4 bytes
};                               // 64 bytes total

Now, given our composite structure of 64 bytes, let's see how small we can make it without altering the client facing elements.
struct _CRT_ALIGN(16) foo_aligned4_opt
{
long long m_longlong; // offset 0: size 8 bytes
int m_int; // offset 8: size 4 bytes
char m_char1; // offset 12: size 1 byte
char m_char2; // offset 13: size 1 byte
char m_pad[2]; // offset 14: size 2 bytes
}; // 16 bytes total

struct foo_composite_opt
{
    foo_aligned4_opt m_foo_aligned4; // offset 0 : size 16 bytes
    long long m_longlong;            // offset 16 : size 8 bytes
    int m_int;                       // offset 24 : size 4 bytes
    char m_char1;                    // offset 28 : size 1 byte
    char m_char2;                    // offset 29 : size 1 byte
    char m_pad[2];                   // offset 30 : size 2 bytes
};                                   // 32 bytes total

Our new composite struct is 32 bytes rather than 64; every member still exists, just in a different configuration.
You may notice that I manually inserted the padding into the struct. This isn't required, but personally I prefer it in cases where I require specific alignment or optimal structure size.
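One way to lock a hand-packed layout down (using C++11 static_assert here; the COMPILE_ASSERT macro from the static assert post works equally well) is to assert the offsets and size at compile time, so a careless member addition fails loudly. The struct name below is hypothetical; it mirrors foo_aligned4_opt's member order.

```cpp
#include <cstddef>  // offsetof

struct foo_packed
{
    long long m_longlong; // offset 0  : size 8 bytes
    int       m_int;      // offset 8  : size 4 bytes
    char      m_char1;    // offset 12 : size 1 byte
    char      m_char2;    // offset 13 : size 1 byte
    char      m_pad[2];   // offset 14 : size 2 bytes
};

static_assert(offsetof(foo_packed, m_int) == 8, "unexpected padding before m_int");
static_assert(sizeof(foo_packed) == 16, "foo_packed grew; check member ordering");
```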

So key points are

  • structures inherit the largest alignment of their elements
  • elements will always begin at an offset that is a multiple of their alignment
  • the compiler will insert padding into your structure to achieve alignment requirements
  • structure size will always be a multiple of its alignment
  • to reduce alignment padding, arrange structure members from largest to smallest alignment requirement.
Note that normal allocators, such as malloc, C++ operator new, and the Win32 allocators, return memory that will most likely not be sufficiently aligned for __declspec(align(#)) structures or arrays of structures. I recommend writing your own allocator, either as a wrapper around these or as a region allocator, to achieve alignment above the default (usually 4 or 8 bytes).
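A minimal sketch of such a wrapper (function names hypothetical; production code would prefer the platform's _aligned_malloc or posix_memalign, and should check for size overflow): over-allocate, round the pointer up, and stash the original block just below the aligned address so it can be freed later.

```cpp
#include <cstdlib>
#include <cstdint>

// Returns a block of at least 'size' bytes aligned to 'alignment'
// (alignment must be a power of two).
void* aligned_malloc(std::size_t size, std::size_t alignment)
{
    void* raw = std::malloc(size + alignment + sizeof(void*));
    if (!raw)
        return 0;
    std::uintptr_t base    = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    std::uintptr_t aligned = (base + alignment - 1) & ~(std::uintptr_t)(alignment - 1);
    reinterpret_cast<void**>(aligned)[-1] = raw;  // remember malloc's pointer
    return reinterpret_cast<void*>(aligned);
}

void aligned_free(void* p)
{
    if (p)
        std::free(reinterpret_cast<void**>(p)[-1]);
}
```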

Further reading: _CRT_ALIGN, __declspec(align(#)), __alignof.

Becoming a console programmer : LHS

Zak Whaley pointed me at this article on "Some Assembly Required" discussing float -> int conversion and what occurs when it's done, following on to the LHS article on the same site, which describes relatively well what a "Load Hits Store" (LHS) is and how to avoid it using the restrict keyword.

Later, when I discuss math libraries, I will come back to this with respect to vector <=> float interchange and how to effectively remove it from transient use.

Wednesday, June 16, 2010

Becoming a console programmer

Recently on a games industry forum, a friend who teaches over in the Netherlands (not Denmark :p) asked us (programmers) what we would ensure a PC programmer moving to console development should know before considering themselves a "console programmer".

None of these apply exclusively to console programming, however I've found over the years that programmers who come from a pure PC background are often lacking in these areas, so I personally consider them important.

Programmers new to console should strive to achieve:
  • an awareness of memory consumption - PC programmers tend to be somewhat liberal with memory allocation and stack usage. On Consoles we're often faced with very limited memory resources that shrink with time (due to the game getting larger) rather than the opposite on PC where virtual memory allows incredibly large memory footprints.
  • knowledge of alignment - internal to structures and how it pertains to their size AND how not paying attention can massively increase cache misses
  • appreciation of the cost of type conversions, both in terms of float <=> integer, float <=> simd, and the assumed-simpler integer to integer conversions that can cost significantly more than is assumed.
  • full understanding of what a Load Hits Store (LHS) is, how it occurs, how to avoid it and how to spot it in everyday code.
  • full understanding of the implications of a cache miss, what it is, how to avoid them and how to fill the time around the unavoidable ones.
  • full understanding of what a branch is, how it affects the cpu, how to avoid using them and what patterns should be avoided to minimize them in general purpose code
  • understand what select functions are (fsel, vsel family) and how/when to use them.
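In the spirit of the vsel family (an SSE sketch here, since SSE1 has no native select instruction): build a per-lane mask with a compare, then blend the two sources with bitwise ops; no branch is ever taken.

```cpp
#include <xmmintrin.h>

// Branch-free per-lane select: lanes of 'a' where mask is all-ones, else 'b'.
inline __m128 select4(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Example use: per-lane maximum without a branch in sight.
inline __m128 max4(__m128 a, __m128 b)
{
    __m128 mask = _mm_cmpgt_ps(a, b);  // all-ones lanes where a > b
    return select4(mask, a, b);
}
```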
SIMD is important, however there are MANY companies that write a math lib once and never look back. I've been guilty of this myself; now I know I was wrong back then, and no doubt in many ways what I've written most recently is still wrong.

Writing it once implies that when you wrote the lib you knew everything there was to know and nothing changed in the years afterwards... yes, many games will be fine with an "ok" math lib, however a "good" math lib can change the performance of a game and an entire team.

Re-write it as often as you can afford, learning from the mistakes each time... so much is based upon it that this is one of the few systems where I'd recommend this behavior.

so a final point
  • learn simd math in all its forms; learn the extra instructions that most don't use and what they provide; learn how to use #.INF to your advantage; and most of all, use instructions such as vsel / shuffle / permute to their fullest advantage.
I will be posting expansions on the above points over the next few weeks depending on demand.