Thursday, June 17, 2010

Becoming a console programmer : Structure Alignment

This is an area that all programmers should be familiar with however i've met and worked with a fair few who never had to care... until they had to care.
consider:
struct foo1
{
char m_char;
};

struct foo2
{
int m_int;
char m_char;
};
struct foo3
{
long long m_longlong;
int m_int;
char m_char1;
char m_char2;
};

using default settings under x86 these three structs are 1, 8 and 16 bytes respectively. What is actually allocated is

struct foo2
{
int m_int; // offset 0 : size 4 bytes
char m_char; // offset 4 : size 1 byte
char m_pad[3]; // 3 bytes padding
// 8 bytes
};
struct foo3
{
long long m_longlong; // offset 0 : size 8 bytes
int m_int; // offset 8 : size 4 bytes
char m_char1; // offset 12 : size 1 byte
char m_char2; // offset 13 : size 1 byte
char m_pad[2]; // 2 bytes padding
// 16 bytes total
};

now lets take foo3 and jumble those members a little.
struct foo4
{
char m_char1;
long long m_longlong;
char m_char2;
int m_int;
};

One might expect foo4 to be the same size as foo3 however it is not, it comes out to 24 bytes, the padding is as follows.
struct foo4
{
char m_char1; // offset 0 : size 1 byte
char m_pad7[7]; // 7 bytes padding
long long m_longlong; // offset 8 : 8 bytes
char m_char2; // offset 16 : 1 byte
char m_pad[3]; // 3 bytes padding
int m_int; // offset 20 : 4 bytes
}; // 24 bytes total

each type contained within a struct must begin on a natively aligned boundary, so the 8 byte "long long" must start on an 8 byte offset within the struct but be after m_char1 so the compiler inserts 7 bytes of "padding" that the user cannot directly access to ensure this.

to inspect this for yourself use the code

printf("\n(%d,",offsetof(foo4, m_char1));
printf("%d,",offsetof(foo4, m_longlong));
printf("%d,",offsetof(foo4, m_char2));
printf("%d)",offsetof(foo4, m_int));
printf(": %d",sizeof(foo4));


output : "(0, 8, 16, 20): 24"

Manual Alignment
It is often beneficial to align the base of a structure beyond its native requirements. I won't go into the reasons for this here (thats a later subject) but it does change the internals of a struct when you do this.
Consider:
struct _CRT_ALIGN(16) foo_aligned4
{
char m_char1;
long long m_longlong;
char m_char2;
int m_int;
};

the
    _CRT_ALIGN(16)

informs the compiler that this structure should be 16 byte aligned (and therefore a multiple of 16 bytes in size), this is actually a wrapper around a declspec (see below for further reading)

In the case of foo4 this doesn't change the offsets of any of the internals but it does change the size of the struct

struct _CRT_ALIGN(16) foo_aligned4
{
char m_char1; // offset 0 : size 1 byte
// 7 bytes padding
long long m_longlong; // offset 8 : 8 bytes
char m_char2; // offset 16 : 1 byte
// 3 bytes padding
int m_int; // offset 20 : 4 bytes
8 bytes adding
}; // 32 bytes total


if we were to then include an instance of foo_aligned4 inside another struct, said struct will inherit the same aligment. In general a structure will always be aligned to the largest alignment of its components.

struct foo_composite
{
char m_char1; // offset 0: size 1 byte
// 15 bytes padding
foo_aligned4 m_foo_aligned4; // offset 16: size 32 byte
long long m_longlong; // offset 48: size 8 byte
char m_char2; // offset 56: size 1 byte
// 3 bytes padding
int m_int; // offset 60: size 4 byte
// total 64 bytes
};


Now given our composite structure of 64 bytes lets see how small we can make it without altering the client facing elements.
struct _CRT_ALIGN(16) foo_aligned4_opt
{
long long m_longlong; // offset 0: size 8 bytes
int m_int; // offset 8: size 4 bytes
char m_char1; // offset 12: size 1 byte
char m_char2; // offset 13: size 1 byte
char m_pad[2]; // offset 14: size 2 bytes
}; // 16 bytes total


struct foo_composite_opt
{
foo_aligned4_opt m_foo_aligned4;// offset 0: size 16 bytes
long long m_longlong; // offset 16: size 8 byte
int m_int; // offset 24: size 4 byte
char m_char1; // offset 28: size 1 byte
char m_char2; // offset 29: size 1 byte
char m_pad[2]; // offset 30: size 2 bytes
// 32 bytes total
};


Our new composite struct is 32 bytes rather than 64, every member still exists but in a different configuration.
You may notice that I manually inserted the padding into the struct, this isn't required but personally i prefer it in cases where i require specific alignment or optimal structure size.

So key points are

  • structures inherit the largest alignment of their elements
  • elements will always begin at an offset that is a multiple of their alignment
  • the compiler will insert padding into your structure to achieve alignment requirements
  • structure size will always be a multiple of its alignment
  • to reduce alignment padding, arrange structures largest to smallest in alignement requirements.
Note that normal allocators, such as malloc, C++ operator new, and the Win32 allocators return memory that will most likely not be sufficiently aligned for __declspec(align(#)) structures or arrays of structures. I recommend writing your own allocator either as a wrapper around these or as a region allocator to achieve alignment above default (4 or 8 bytes usually).


further reading -_CRL_ALIGN, __alignof,

10 comments:

Tom Forsyth said...

Some compilers allow you to enable a warning if they had to add padding to adjust to the alignment you set. I highly advise you to enable this warning. It means you need to add manual padding, but it will ensure you're never surprised.

Andy Firth said...

yup - forgot about that one... will add it later

Zak Whaley said...

So your post is primarily concerned with the size of a structure, but on a related side-note: what about the speed of access to its members?

For example, unless I am mistaken, accessing m_char2 from an object of type foo3 is somewhat costly compared m_longlong since it is neither a) on a register/bus sized boundary nor b) register sized.

Am I mistaken about the magnitude of the cost of the associated with accessing such a member? And am I mistaken about the internals of accessing said member (I'm presuming something akin to shifting and masking).

I'm asking because your initial post on the subject was concerned about performance sinks like LHSes due to pointer dereferencing.

Obviously LHSes are *FAR* more important to eradicate since they require accessing (lots of!) main memory, but is the cost of accessing unaligned/poorly sized data insignificant compared to the amount of memory saved by using a char instead of an int?



Hmm. Upon thinking about it, I guess I just answered my own question, but I'll post anyway:

Non-register/bus-aligned data is probably fine to access since the structure members will likely fit within close memory (L1 or an extended register), local arithmetic ops such as shifting and masking to access the small datum are negligible (*especially* when compared to fetches to main memory).

Now, obviously hot-cold data separation should be adhered to to avoid cache misses when iterating, but that's another issue altogether.



The TL;DR question is: When does sane optimization end and premature optimization begin?

Andy Firth said...

>>When does sane optimization end and premature optimization begin?

ouch - you asked the question that has no solid answer.

it comes with experience but the general rules i use for structures are

* keep structures as small as possible at all times
* comment each member with your expected range
* avoid misaligned access 100% until compression requires it
* manually pad wherever sensible (32bit/64bit complicates this)
* understand what your platform(s) do with "small" data, a signed extend is generally negligible however don't rely upon that. In some circumstances eating 3 bytes of waste is more optimal.. in most however the single byte is fine. For consoles L2 rules the roost.

Zak Whaley said...

Thanks for your guidelines!

If you manually put in padding to achieve alignment, isn't that the same as putting the data members in smallest to biggest order and getting automatic padding? (Size-wise, I mean)

Also, for the 32/64 bit issue, could you use a macro such as:

struct foo
{
char m_char; //offset 0 : size 1
PAD(char); //take up the next 3 bytes for 32 and 7 for 64
};

Where PAD is something like:

#define PAD(type) \
char m_padding_##__COUNTER__ [::Tools::NextPowerOfTwo::m_value]

Zak Whaley said...
This comment has been removed by the author.
Zak Whaley said...
This comment has been removed by the author.
Zak Whaley said...

Gah, Blogger removed my template parameter.

The last part was:
::Tools::NextPowerOfTwo < sizeof(type) > ::m_value

This doesn't necessarily have to be power of 2, but maybe NextMultipleOfRegister which could then be redefined per _M_IA64, _M_AMD64, etc.

Edit: I'm bad at remembering HTML :D

Andy Firth said...

thats exactly how you do it... it still gets complex pretty quickly tho.

Zak Whaley said...

I forgot to mention that a week ago I made a post related to this. Upon further pondering, I realize it's incomplete (sometimes you want to pack data and put a manual number of bytes afterward), but it was an interesting exercise.

I did it with meta programming, first, then someone pointed out a simple bit-trick to get the appropriate size. Both methods are shown:

http://sfinae.blogspot.com/2010/06/padding-problem.html