zenmumbler

A Struct of Arrays Type in C++14

August 3, 2015

In data-oriented design the efficient layout of data in memory is the primary driver for designing software. Main memory accesses are now so slow compared to L1/2/3 caches that cache misses can become an overriding factor in the speed of any system that deals with large arrays of items.

I was reading the excellent articles by BitSquid and was motivated to change my own component design in my stardazed engine. Now, unlike some people in game dev, I'm not averse to C++ and/or templates. I do use them sparingly and not to the heights found in some Boost libraries. I've got another post brewing about this, but I digress.

When I did some prototype test conversions of my existing types, I noticed a usage pattern returning every time. It is related to one of the main tenets of data-oriented design, namely that instead of an Array of Structs, you use a Struct of Arrays. For example, given an example struct like:

struct Transform {
    Vec3 position;
    Quat rotation;
    Vec3 scale;
    Mat4 modelMatrix;
};

If you look at the data layout of an array of Transforms in memory it would be this:

[position] [rotation] [scale] [modelMatrix] [position] [rotation] [scale] [modelMatrix] ...

and when you iterate over the contents of the array and are just looking at the matrices the CPU will load the entire struct into the cache but the other fields might as well not be there. This causes a cache miss and thus delay for likely every iteration because you're wasting cache space on data that is not needed at the moment.

Conversely, in a Struct of Arrays setup the layout is as follows:

struct Transform {
    Vec3 positions[];
    Quat rotations[];
    Vec3 scales[];
    Mat4 modelMatrices[];
};

Each field is separated out into its own contiguous array. Where things get a bit tricky is in the requirement to have all the arrays allocated together in a single allocation. The principle is not hard, you allocate a single buffer that can hold N items of each array, then calculate the base pointers for each array and from there on just use those. E.g.:

size_t numItems = 16384;
size_t combinedSize = sizeof(Vec3) + sizeof(Quat) + sizeof(Vec3) + sizeof(Mat4);
void* instanceData = allocator.alloc(combinedSize * numItems);

auto positionBase = static_cast<Vec3*>(instanceData);
auto rotationBase = reinterpret_cast<Quat*>(positionBase + numItems);
auto scaleBase = reinterpret_cast<Vec3*>(rotationBase + numItems);
auto modelMatrixBase = reinterpret_cast<Mat4*>(scaleBase + numItems);

// ... read/write data to offsets from base pointers
// somewhere down the line implement resizing the block with multiple memcpys, etc.

Additionally, with dynamic data blocks you will run out of array space and will need to expand the array space of all arrays together. The setup and resizing of this data is grunt work, but when done manually each time you implement this in a different component it can be error prone.

Enter the inventively named MultiArrayBuffer, a variadic template type that will allocate and resize the buffer and give you properly typed base pointers on demand. Like I said, going overboard with template stuff is not what I usually do, but I value the service MAB gives me in code that uses it over the chicanery I had to do to make it work, which is all contained in the implementation of a very low-level helper type. This is how the above setup code is done using MAB:

MultiArrayBuffer<Vec3, Quat, Vec3, Mat4> instanceData{ allocator, 16384 };

auto positionBase = instanceData.elementsBasePtr<0>();
auto rotationBase = instanceData.elementsBasePtr<1>();
auto scaleBase = instanceData.elementsBasePtr<2>();
auto modelMatrixBase = instanceData.elementsBasePtr<3>();

// ... read/write data to offsets from base pointers
// somewhere down the line:
instanceData_.resize(32768);

The elementsBasePtr<N>() method returns a typed pointer to the Nth array in the buffer. This method is also trivial enough that you don't need to cache the start pointers per se, potentially simplifying usage code.

Implementation

MAB being a variadic template means having to deal with the somewhat awkward recursion of types using templated helpers1. I will highlight a few things in this article but read the full implementation if you're interested.

The first thing I had to calculate was the total byte size of each parameterised type to mirror the sequence of sizeof(T)s in the manual version. Here's how I did2:

template <typename T>
constexpr size_t elementSumSize(T* = nullptr) {
    return sizeof(T);
};

template <typename T, typename... Ts>
constexpr size_t elementSumSize(T* = nullptr, Ts*... ts = nullptr) {
    return sizeof(T) + elementSumSize<Ts...>(std::forward<Ts*>(ts)...);
};

// ... in the class definition:

template <typename... Ts>
class MAB {
    static constexpr size_t sumSize = elementSumSize<Ts...>();
};

Quite straightforward with the notable exception of the annoying (defaulted) parameters I had to pass and also forward to make things compile. Omitting the parameters is not possible using this method as the two function calls become ambigious, I also tried this with structs but hit the same limitation. I use functions as they are less verbose than using structs.

The main challenge of this type came when I needed to implement resizing the buffer. To do so, I have to iterate over each base pointer and copy data to a new buffer, this combines the compile-time templated functions with run-time actions in a not entirely intuitive way. Here is the code I used to handle it:

// -- Call a callback with base pointer and element size info for each T

using ItFn = std::function<void(void*, uint32)>;

template <typename T>
void eachArrayBasePtr(void* basePtr, size_t /*capacity*/, const ItFn& fn, T* = nullptr) {
    fn(basePtr, sizeof(T));
}

template <typename T, typename... Ts>
void eachArrayBasePtr(void* basePtr, size_t capacity, const ItFn& fn, T* = nullptr, Ts*... ts = nullptr) {
    fn(basePtr, sizeof(T));

    auto bytePtr = static_cast<uint8_t*>(basePtr);
    eachArrayBasePtr<Ts...>(bytePtr + (capacity * sizeof(T)), capacity, fn, std::forward<Ts*>(ts)...);
}

// ... at the usage site:

newData = /* newly allocated buffer */;
newCapacity = /* number of items in the new arrays */;
data_ = /* current buffer */;
count_ = /* count of used elements in the arrays */;

eachArrayBasePtr<Ts...>(data_, capacity_,
    [newDataPtr = (uint8_t*)newData, newCapacity, usedCount = count_]
    (void* basePtr, size_t elementSizeBytes) mutable {
        memcpy(newDataPtr, basePtr, usedCount * elementSizeBytes);
        newDataPtr += elementSizeBytes * newCapacity;
    });

The eachArrayBasePtr<Ts...>() helper essentially recurses over the Ts and bumps a base pointer to the next array by using the sizeof(T) and the capacity that the caller has to supply. Before each iteration, a callback function is called with the current array base pointer and the size of the currently pointed-at element. Types are eschewed in this setup as they were not necessary and it would complicate things quite a bit. What that does mean is that MultiArrayBuffer can only be safely used for TriviallyCopyConstructible types. In fact, ideally you would use (almost) completely trivial types.

Other containers in my library properly accomodate non-trivial types but given the intended usage of this container, it is reasonable not to support them. If you have a large array of tens or hundreds of thousands of elements, it is simply not wise to use types with user specified destructors. POD types are the way to go for this; put the logic in the manager class maintaining the instance data instead. It also follows the philoshopical idea of SoA: these are trivial types that are initialized and processed en masse by other functions. Individual constructors and destructors at scale will annihilate the efficiency gains of this data layout.

If you have comments or see horrible mistakes or misconceptions, throw me a line on Twitter.

  1. Haters, I understand why you don't like this, I really do. I'm just not that bothered by it. Each tool has its use.
  2. In the real type, I use 32-bit sizes and some other helper functions, but the essence is the same.

Reply @zenmumbler to discuss this article.