Tuesday, July 14, 2015

Unexpected performance in copying a plain data structure (e.g. passing it by value)

When is passing by value preferred over passing by constant reference? It seems this question is surprisingly complicated.

Let's assume the structure we're passing does not have a ctor or dtor, and does not lug around any data on the heap that needs to be copied. It's a plain data object, with no frills.

Intuitively, it would surprise me if there were a performance advantage to passing such an object by reference if the object itself is no larger than a pointer (8 bytes on a 64-bit architecture). In that case you're passing the same amount of data or more, and adding a layer of indirection on top of it.

Hypothesizing will only get you so far; sometimes you just need to run some benchmarks and see what happens. I created a small test program that, depending on preprocessor variables, passes a structure of configurable size to a function that is prevented from inlining. This function performs a dummy operation, copying the argument to a static variable.
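Something along these lines (a minimal sketch of the idea, not the exact program used; PAYLOAD_SIZE, the Payload struct, and the iteration count are illustrative choices):

```cpp
#include <chrono>
#include <cstdio>

#ifndef PAYLOAD_SIZE
#define PAYLOAD_SIZE 64   // set with -DPAYLOAD_SIZE=... at compile time
#endif

// A plain data object: no ctor/dtor, no heap-allocated members.
struct Payload {
    char data[PAYLOAD_SIZE];
};

static Payload sink;

// noinline keeps the call (and the copy it implies) from being optimized away.
__attribute__((noinline)) void by_value(Payload p)      { sink = p; }
__attribute__((noinline)) void by_ref(const Payload &p) { sink = p; }

int main() {
    Payload p{};
    constexpr long iterations = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) by_value(p);
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) by_ref(p);
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::milliseconds;
    std::printf("by value: %lld ms, by reference: %lld ms\n",
                (long long)std::chrono::duration_cast<ms>(t1 - t0).count(),
                (long long)std::chrono::duration_cast<ms>(t2 - t1).count());
}
```

Compile it with something like g++ -O2 -DPAYLOAD_SIZE=132 bench.cpp and repeat over a range of sizes and compilers.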

All said and done, the results are at once as expected and somewhat surprising.

Sample size = 4. The discontinuity happens at structure sizes that exceed 128 bytes.

In the small-size limit, passing by value is indeed faster. Not by leaps and bounds, but the effect is at least measurable; above 32 bytes, however, passing by reference starts to be the clear winner. I leave drawing conclusions from this to the reader. The benchmark can also be criticized in that the pass-by-reference case is unreasonably favorable: the referenced value is virtually guaranteed to sit right next to the called function's stack frame, and is thus very likely found in L1.

It's also notable that g++ is significantly slower than clang below the 128-byte limit, although clang's performance figures are much less stable. I'm really not sure what's going on there. Clang's erratic run times will continue to haunt us as the exploration of this topic continues.

However, something very strange happens at 128 bytes, for both clang and g++. The call time increases fourfold for clang and doubles for g++, but stranger still, they land at the same number!

The fact that both compilers end up doing this at the same size would at first glance seem to be related to hardware limitations, perhaps the cache. But this appears to be a red herring!

To figure out what is actually happening to increase the cost of this operation, we need to look at the assembly. The assembly generated for passing a 124-byte argument looks very different from that for a 132-byte argument. It appears that above this size, the compiler emits a compact rep movsq instruction (essentially a memcpy) to populate the call stack, rather than moving the contents of the structure qword by qword as it does in the 124-byte case.

Note that there is no hardware reason why this happens. You can circumvent the behavior by designing a program in which two stack-sequential variables are passed to a function so that together they exceed 128 bytes, and no rep movsq is emitted.
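For instance, something like the following sketch (again with illustrative names and sizes): each argument stays under the threshold on its own, so the callee receives well over 128 bytes of copied data without a rep movsq appearing.

```cpp
// Two plain structures that individually stay below the 128-byte threshold,
// but together put well over 128 bytes of copied arguments on the stack.
struct Half { char data[96]; };

static Half sink_a, sink_b;

__attribute__((noinline)) void take_two(Half a, Half b) {
    sink_a = a;   // each individual copy is below the threshold,
    sink_b = b;   // so neither one is turned into a rep movsq
}

int main() {
    Half a{}, b{};
    take_two(a, b);   // 192 bytes of argument data in total
}
```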

It should be emphasized that this is not so much about passing variables by value as about copying C-style structures (an operation the former implicitly performs). To highlight this, I ran a second benchmark that removed the function call from the equation.
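Roughly along these lines (an illustrative sketch; the asm statement is only a barrier so the copy isn't hoisted out of the loop):

```cpp
#include <chrono>
#include <cstdio>

#ifndef PAYLOAD_SIZE
#define PAYLOAD_SIZE 64
#endif

struct Payload { char data[PAYLOAD_SIZE]; };

int main() {
    Payload src{};
    static Payload dst;
    constexpr long iterations = 100000000;

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        dst = src;                      // plain struct copy, no function call involved
        asm volatile("" ::: "memory");  // keep the copy from being optimized away
    }
    auto t1 = std::chrono::steady_clock::now();

    std::printf("copy: %lld ms\n",
                (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
}
```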

As expected, you get exactly the same behavior at 128 bytes. Clang is consistently faster, although its performance is again much noisier than g++'s. I can't quite work out what's going on with that.

It should however be noted that clang (but not g++) seems to handle the 128-byte boundary better with architecture-specific optimizations. As with most of these benchmarks, clang's run times persist in being faster but all over the map.

It would be too easy to draw the conclusion "well, don't copy structures by value!", but let's not forget that big data structures sometimes exist, and here "big" is not actually very big at all. Large structures exist in file systems, in OS kernels, in databases, and so forth. And if you break them apart, your performance suffers because of the increased indirection and the cache misses it incurs. Sometimes, just sometimes, you actually need to copy such structures.

Monday, February 9, 2015

The case for autodidacticism in computer programming

I was recently asked whether being self-educated had affected my programming abilities in any way. I hadn't thought the question through prior to this, so I gave a less-than-satisfactory answer, meandering on about how it was a mixed blessing, or something to that effect. That isn't really a very good answer...

I'm an autodidact computer programmer and a formally educated physicist, so I'd like to think I have seen a decent cross-section of both informal and formal education.

Having given the question some thought, I would like to posit that autodidacts will ultimately prove superior at their craft. I'd like to emphasize that this is a pretty weak causal relationship; survivorship bias is arguably a strong effect at work here. Successful autodidacts would likely also be successful in formal education, whereas those who struggle with formal education will surely not make the cut as autodidacts. Hence, most mediocre programmers can be assumed to be formally educated.

That being said, the successfully self-educated do possess a number of advantages. Learning has always been the autodidact's lifelong personal responsibility, and usually when you learn of your own volition, what you learn tends to ingrain itself fairly firmly in your mind. The act of furthering their learning is also something the autodidact performs habitually. Furthermore, in order to make it very far, the autodidact will need to develop some fundamentally useful skills in problem-solving and lateral thinking.

The formally educated are subject to entirely different dynamics. While they're probably going to be exposed to a broader spectrum of knowledge, following a fixed curriculum that makes few concessions to whether the student is receptive means that central knowledge will be lost. The emphasis is on passing exams, not on a deeper understanding of the subject matter, and passing exams has never been easier. I have also seen more than a hint of CS graduates ceasing their learning process the day they graduate. It's sadly fairly common that people with five years of work experience have merely accumulated the same one year of experience five times over.

The singular drawback of self-education is that it is a lonesome road. It can be very difficult to find peers to bounce ideas off.

Although to repeat my reservation, it's of course always going to be hard to separate the selection bias from the effects imposed by the different means of education.

Sunday, February 8, 2015

On JFokus 2015

Last week I was at JFokus, after I won tickets at a JUG event hosted at my job last year. The conference offered a pretty interesting cross-section of hot topics in Java right now.

Functional programming is absolutely incandescent right now. It's probably a natural reaction to lambda functions being introduced in Java 8, as well as to the ongoing difficulties in designing good metaphors for parallel computing. Although I must say I'm a bit disappointed in the framing. Having dabbled in functional programming for several years, I got what they were trying to say, but the message lacked motivation. As much as it's interesting to explain the how of monads and other FP paradigms, I fear it's going to be lost on the crowd if the solution isn't framed in terms of a problem it solves. Pointing to JavaScript's callback hell is a pretty lackluster motivation. The C++ talk "Plain Threads are the GOTO of today's computing" is in my opinion a far better sales pitch for monadic chaining of parallel code. But at any rate, it's great to see functional programming starting to make it big outside of academia!

There was also a pretty large buzz about IoT, but I remain skeptical. It still seems like a solution looking for a problem. If the industry can actually agree on a standard it could be great, but if DLNA / UPnP is even remotely indicative, it's going to work, but just poorly enough that you're going to be frustrated and disappointed whenever you use it.

I'm reminded of that Douglas Adams quote:
We are stuck with technology when what we really want is just stuff that works.
I should have taken pictures of the conference, but I didn't. Instead, here's a completely unrelated picture of a frozen river I took when I went for a walk last Monday morning.

Not pictured: The Conference.