Tuesday, July 14, 2015

Unexpected performance in copying a plain data structure (e.g. passing it by value)

When is passing by value preferred over passing by constant reference? It seems this question is surprisingly complicated.

Let's assume the structure we're passing does not have a ctor or dtor, and does not lug around any data on the heap that needs to be copied. It's a plain data object, with no frills.

Intuitively, it would surprise me if there was a performance advantage to passing such an object if the object itself is smaller than a pointer (8 bytes on a 64 bit architecture). In that case you're passing the same amount or more data, as well as adding a layer of indirection on top of that.

Hypothesizing will only get you so far, sometimes you just need to run some benchmarks to see what happens. I created a small test program that depending on preprocessor variables passes a configurable-sized structure to a function that is prevented from inlining. This function performs a dummy operation, copying the argument to a static variable.

All said and done, the results are both as expected and somewhat surprising at the same time.

Sample size = 4. The discontinuity happens at structure sizes that exceed 128 bytes.


In the small size limit, passing by value is indeed faster. Not by leaps and bounds, but the effect is at least measurable, but over 32 bytes, passing by reference starts to be the clear winner. I leave drawing conclusions from this to the reader. The benchmark can also be criticized in that the pass-by-reference case is unreasonably favorable, as the value is virtually guaranteed to be local to the called function's stack frame, and thus very likely found in L1.

It's also notable that g++ is significantly slower than clang below the 128 byte limit, although the performance figures are much less stable. I'm really not sure what's going on there. Clang's weird run times will continue to haunt us as the exploration of this topic continues.

However, something very strange happens at 128 bytes, for both clang and g++. The call time increases four fold for clang, and doubles for g++ to but stranger still, they land at the same number!

The fact that both compilers end up doing this at the same size would at first glance seem to be related to hardware limitations, perhaps the cache. But this appears to be a red herring!

To figure out what is actually happening to increase the cost of this operation, we need to look at the assembly code. The assembly sources from passing a 124 byte argument looks very different from passing 132 bytes. It appears that above this size, the compiler generates a compact rep movsq instruction (essentially memcpy) for populating the call stack, rather than moving the contents of the structure qword by qword like it does in the 124 byte case.


Note that there is no reason in hardware why this happens. You can circumvent this behavior by designing a program where you have two stack-sequential variables passed to a function so that they in combination exceed 128 bytes, without rep movsq.

It should be emphasized that this is not as much related to passing variables by value, as copying C-style structures (an operation implicitly performed in the former). To highlight this, I ran a second benchmark that removed the function call from the equation.


As expected, you get the exact same behavior at 128 bytes. Clang is consistently faster, although the performance is much noisier than g++ again. I can't quite work out what's going on with that.

It should however be noted that at clang (but not g++) seems to handle the 128 byte boundary better with architecture specific optimizations. As with most of these benchmarks, clang's run time persists in being faster but all over the map.







It would be too easy to draw the conclusion "well don't copy structures by value!", but let's not forget that big data structures sometimes exist, and this is for a definition of "big" that isn't actually very. Large structures exist in file systems, in OS kernels, in databases, and so forth. And if you break them apart, your performance suffers because of increased indirection and the cache misses you incur because of it. Sometimes, just sometimes you actually need to copy such structures. 

No comments:

Post a Comment