Over the last week, several incidents taught me again something I already knew, at least in theory: local build results can be deceiving.
In our current project, we have a pretty common setup. Each developer has one, sometimes two development environments for the local build. We have Continuous Integration running on a Jenkins build server with a bunch of different build nodes. Nothing fancy, and it works. Mostly.
Local build: green; CI build: red
Last week, I made some changes to a part of our software that is written in C#. It was nothing too complicated – I simply overloaded an existing function so that it could be called with a different generic type. And I had help from a colleague who really knows C#.
In addition, I changed some other code so that it now calls the new overload. I rebuilt the project, everything worked like a charm, so I checked it in and pushed it to our Git server.
You broke the build!
A few minutes later the Jenkins build of my feature branch was red. That second file I changed would not compile, because “using the best overload, cannot convert A to B”.
I was confused. I had just provided the overload for A. Why would the compiler pick the overload for B and then complain that the arguments don’t match?
A closer look at the error message told me that it was indeed the file I had changed that would not compile, and it looked exactly like I had written it. The overloaded function in the other file was present as well.
To be sure, I pulled the code to my dev machine again and compiled the project. The local build was just fine.
Ah, it’s the old compiler!
Something in the Jenkins output caught my eye: “.NET framework 2.0” was announced when the project build started. Aren’t we at 3.5 or later? I asked our server administrator to have a look at the .NET framework installed on the build node.
They got back to me, telling me that everything was in order. The problem remained: The local build worked, while the Jenkins build was red, using .NET framework 2.0.
It’s not what you think it is
It took some desperation and another colleague looking at the problem to figure it all out:
The .NET framework was fine. In fact, it was the same one that was used in my local build. I just had not checked that. .NET 2.0 for Windows compact is not that old.
The project where I changed those files was just fine as well. The compile error came from the compilation of a different project!
That other project sneakily uses the file that calls the function in question. It does not, however, use the file that defines the function; instead, it provides a different implementation of that function.
For that different implementation, there was no overload. There was only the original function.
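A reuse like this is easy to miss, but it can be searched for. The following is a minimal sketch, assuming the projects are classic MSBuild `.csproj` files that list their sources as `Compile` items with relative Windows-style paths; the function name `external_sources` is made up for illustration, and absolute paths are not handled. It reports every source file that a project pulls in from outside its own directory.

```python
import posixpath
import xml.etree.ElementTree as ET


def external_sources(csproj_text):
    """Return the Include paths of Compile items that leave the project directory.

    Assumption: sources are listed as <Compile Include="..."/> items with
    relative, backslash-separated paths, as in classic MSBuild project files.
    """
    root = ET.fromstring(csproj_text)
    found = []
    for elem in root.iter():
        # Drop the XML namespace prefix, if the project file declares one.
        tag = elem.tag.rsplit("}", 1)[-1]
        include = elem.get("Include")
        if tag != "Compile" or not include:
            continue
        # Normalize separators and collapse segments such as "sub\..\File.cs".
        path = posixpath.normpath(include.replace("\\", "/"))
        if path == ".." or path.startswith("../"):
            found.append(include)
    return found
```

Running this over every project file in the repository would list the files that belong to more than one build, which is exactly the kind of hidden dependency that caused the red build here.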
So, what went wrong?
The first was me: I used a local build different from the one that runs on our CI server. We have a script that builds all those projects on the server. For the local build, I only built the single project I had changed (or so I thought). I should have run the exact same script that runs on the server to reproduce the problem.
Following from that, I just assumed that there was something different on the server. I jumped on the first thing that caught my eye – the .NET framework – without checking whether it actually was different from the local build.
Another problem was that I did not examine the context of the error message carefully enough. I should have seen that the error occurred during the build of the other project.
It’s a few thousand lines of build output, but that’s a lame excuse. Our script has nice “—===ooo Start build project X ooo===—” and “—===ooo End build project X ooo===—” output to add visible breaks to the output.
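With markers like these, finding the project an error belongs to does not have to be done by eye. Here is a minimal sketch, assuming each marker line contains the text "Start build project X ooo" or "End build project X ooo" and that compiler errors contain "error CS" followed by a number; the function name `errors_by_project` and both patterns are assumptions to be adjusted to the real output.

```python
import re

# Assumed marker patterns, modeled on the start/end lines quoted above.
START = re.compile(r"Start build project (?P<name>.+?) ooo")
END = re.compile(r"End build project (?P<name>.+?) ooo")
# Assumed error pattern: C# compiler errors such as "error CS1503".
ERROR = re.compile(r"\berror CS\d+")


def errors_by_project(lines):
    """Pair each compiler error line with the project section it appears in."""
    current = None  # name of the project whose build output we are inside
    result = []
    for line in lines:
        start = START.search(line)
        if start:
            current = start.group("name")
            continue
        if END.search(line):
            current = None
            continue
        if ERROR.search(line):
            result.append((current, line.strip()))
    return result
```

Feeding the Jenkins console log through such a function would have shown right away that the error was reported inside the section of the other project, not the one I had edited.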
The rest was only a bunch of deceiving circumstances: The odd .NET framework number. The fact that someone just reused a single file from another project’s directory. The fact that the compiler babbles about “best overload” when there is only one function.
Don’t rely on assumptions, especially once the conclusions drawn from them have been proven wrong. And more importantly, run your local build exactly the same way as your remote build.
More on that in a future post.
How about encapsulating the whole build system and toolchain in a docker container? That way you could be 100% sure that the environment is identical on all machines.
I have had quite good experiences with that on Linux, building Qt applications and bare-metal Cortex-M firmware, but I wonder what people use on Windows (7) hosts? I’ve been trying Vagrant, but it feels a bit flaky and is slow compared to Docker.
This is your root cause. All such overloads should live in one component, and other projects should depend on that component.