Rabbit-chasing bugs back in 2016
A recent hiring process asked me to write down a debugging story that involved both application code and infrastructure. The prompt reminded me of an old incident from 2016 and how amazingly baffling it was. I didn’t use this event in my job application, but I found it worth writing down and sharing, as it shows the interesting failure modes that big distributed systems can bring.
This post goes into significant detail on High Performance Computing (HPC) systems from back when I was in that industry. I had recently started working in tech, and this incident taught me a good deal about how Linux systems work on the inside.
Centro Nacional de Analisis Genomico (CNAG) 2016 #
Back in 2016 I was doing HPC support at a biomedical research facility. Their production pipeline started with sequencing DNA and then processing the results on a supercomputer, so that analyses and experiments could be run on the data. This process was sensitive and sometimes time-bound, as the DNA could belong to patients looking for a genetic link to their ailments to aid in their diagnosis.
Their processing pipeline required multiple stages and took a long time, on the order of tens of hours of computation, using a sizable amount of resources (about 20% of our yearly computing time went to production pipelines). A different research facility had been analysing the computational problem that was commonly a blocker for these pipelines and had written a new application that could leverage GPUs to speed up the processing, cutting hours off the processing time and reducing the overall amount of computing time used. Our site began a collaboration with that institution that allowed us to use their program in exchange for assistance developing and testing it. For that purpose, our site bought half a dozen new nodes with top-of-the-line NVIDIA GPUs for the new program to leverage, and added them to our existing cluster.
We updated the pipeline to use the new program, and it ran perfectly in all our tests. Our new hardware seemed fit for purpose and well integrated. We put the new pipeline into production, checked its behaviour, and were satisfied. It was significantly faster and effectively accomplished our goals, so we celebrated and moved on. Then the sporadic failures started happening.
It couldn’t be that happy an ending, right? #
An issue started creeping in slowly as adoption of our new pipeline grew. The new program ran close to the very beginning of our pipeline, and it ran perfectly, but the following step would sometimes fail with strange error messages. Rerunning that step would succeed, showing the error was transient, but still, it would sometimes happen. That second step had been stable and in use for years, so there was no reason it would start failing now. The step was pretty fast, a few minutes, but the failures would happen within seconds, showing the error was located somewhere at the beginning of the execution.
We suspected an issue with the exchange format generated by the new tool, but close inspection showed that the format was simple and stable, and we were not able to reproduce any failure with the same files, no matter how we tried. We tried converting to different formats, and the issue would still happen sometimes.
Inspecting the failed jobs and comparing them to the successful ones, we noticed a pattern. Errors would overwhelmingly happen on Mondays. Looking closer, we noticed a second detail: Mondays had the least amount of production jobs run.
The nature of the beast: DNA sequencing at scale #
DNA sequencing took about 7 days in our lab. Sample preparation was an expensive and delicate process, and sequencers accept several samples at once, so we’d hold samples for a few days until we could fill a sequencer with a full batch, triggering a sequencing run. Once a run finished, the output would be automatically loaded into our cluster, triggering our pipeline to process the results. During the weekend the lab did not work, so most sequencing runs would finish over the weekend and not be replaced by new ones until Monday.
Therefore, Monday morning was when our cluster had the lowest load. In other words, it was when we had the most computing resources available. That brought a different dimension to the issue: locality. When the cluster was loaded and full, the pipelines would take priority, and since they required our new GPU nodes, they would be scheduled there. With the rest of the cluster packed and the new nodes underutilized, the entire pipeline would run on the same node. On Mondays, every step of the pipeline would run on a different node, spreading across the cluster, except for the first step, which required the GPUs.
With this, we now knew that the issue was related to data locality, or in other words, our Lustre distributed filesystem.
Why, Lustre. Why? #
Lustre is a high performance distributed filesystem that was commonly used in HPC at the time. It is also used in other environments nowadays, and is even offered as a managed service in AWS.
Our cluster had been using Lustre from the start, and we had a dedicated network layer for the filesystem. I’m not going to describe how it works, but you can knock yourself out here if you are interested. Lustre was also quite delicate for our use case, and we had had several issues over time, for reasons unrelated to the technical quality of Lustre itself. We dug in to test the filesystem, reading its design documentation and running tests to see if there was an issue with it. We also added debug statements to the jobs, which let us see the source of the weird error messages in the second program:
```sh
cannot access 'my_big_file': No such file or directory
```
When the second program started, the files generated by the first program were not there. But the second job could not start until the first job had finished, and the first program had already stopped, so the files had to have been written. Inspecting the filesystem from the first job’s node showed the files present and written. Lustre’s design is supposedly atomic (or as atomic as a distributed system usually is), so it shouldn’t have been reporting inconsistent state across nodes.
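The debug statements amounted to little more than this kind of check; the paths and variable names below are made up for illustration, not the original job scripts:

```sh
# Illustrative sketch of the checks we added (paths and names are hypothetical)
# At the end of the first job, on the GPU node:
ls -l "$OUT_DIR/my_big_file" && echo "job 1 sees the file on $(hostname)"

# At the start of the second job, on whichever node the scheduler picked:
ls -l "$OUT_DIR/my_big_file" || echo "job 2 cannot see the file on $(hostname)" >&2
```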
Knowing the cause, we tried to reproduce it with dummy files, copying large files inside the filesystem and trying to get a reproducer to report a bug to Lustre, to no avail. So… what was happening?
Have you met… the buffer cache? #
You may have heard of Linux’s caches. They are so infamous that an entire website has long been devoted to them: https://www.linuxatemyram.com/.
Basically, the OS caches recently accessed disk pages so the system doesn’t have to go back to the hard drive to read the files again if you close and reopen them. If you are writing to files it is even better, because the kernel can keep your writes in memory in case you keep writing to the file, before flushing them to disk. This is very important for ordering writes, batching disk accesses and overall hiding how much slower than RAM hard drives really are (about 1000 times slower).
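You can watch the read side of this on any Linux box: drop the caches (needs root), read a large file twice and compare the timings. `/some/big/file` here is just any large file you have lying around:

```sh
# Start from a cold cache: flush dirty pages, then drop the page cache (needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

time cat /some/big/file > /dev/null   # first read has to go to the disk
time cat /some/big/file > /dev/null   # second read is served from the page cache: much faster
```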
What does this mean for us? Well, for starters, in a filesystem you don’t want people to read files in an inconsistent state, where only half the file is there. In a distributed filesystem, you can avoid telling other nodes about a new file until it has been completely written out to storage, also known as a flush.
Naive: But our tests worked! The small dummy files appeared immediately, so they were flushed. Why wasn’t the big file (100s of GBs) flushed?
Size matters #
Here we have to account again for the speed of a distributed system. We were writing over our very fast network (10-40 GB/s) to arrays of very fast disks (IIRC we managed to get 6 GB/s at peak thanks to the highly striped writes). We were flushing blocks (parts of the file) continuously as we wrote (we couldn’t hold the whole thing in memory). But the system was not idle; it was being used for reads and writes by the rest of the cluster. And disks, especially magnetic disks (affectionately called spinning rust), can only do one thing at a time: read or write. And you want them to do that thing for as long as possible, over as long a strip of magnetic disk as possible, because that’s fastest. With that logic unchecked, your big writes can block the disk from doing anything else, keeping it for yourself. Not a very good experience for everyone else!
Real systems avoid letting you hog the disk by breaking your operation into transactions (smaller operations in chunks), and these get interleaved “fairly” with other people’s transactions, so everyone gets a chance to use the disks. So our very big write, which would already take tens of seconds going straight to the disk, takes quite a few seconds more to complete when the system is loaded.
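As a back-of-the-envelope illustration with the numbers above (the file size is a rough guess in the hundreds-of-GB range the pipeline produced):

```sh
# Best case: the disks do nothing but our write, at our ~6 GB/s striped peak
file_gb=300
peak_gb_per_s=6
echo "$((file_gb / peak_gb_per_s)) s of pure disk time"   # prints "50 s of pure disk time"
# Interleave that "fairly" with everyone else's transactions and it stretches
# well past the moment our program has already exited.
```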
Naive: Ok, ok. But the first program ended and it can’t end until all writes are done, right? So this would have been happening behind the scenes while the program terminated. Why is the bug there?
Well… not quite. The close(2) man page tells us that the file descriptor and its resources are released, but it also warns us that a successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes through the buffer cache. This doesn’t mean you’ll lose data (though on a crash you might), just that the flush is not guaranteed to have happened by the time close(2) returns.
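This is easy to see on any Linux machine: write a big file with plain buffered I/O, let the command return, and look at how much data is still sitting in memory waiting for writeback (the file name and size here are arbitrary):

```sh
# dd returns (and the file is closed) as soon as the data is in the page cache
dd if=/dev/zero of=bigfile bs=1M count=4096

# Right after dd has exited, much of that data may still only be in RAM;
# how much depends on the kernel's writeback settings
grep -E '^(Dirty|Writeback):' /proc/meminfo
```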
So… the program can finish before its writes are persisted, and if the writes are very big, that persistence can happen tens of seconds after the program has ended, the job has been reaped, and a new job has started elsewhere in the cluster. Yoohoo! There’s our bug!
And the solution is…? #
We were now armed with the knowledge that this was not a bug in the system, but a result of the system’s design. We usually didn’t write files big enough to notice, but when we did, we would hit this much more often. So… options?
So we just need to block the job after the program terminates until the data has been written to disk.
Naive: We can run ls and check if the files are there, right?
We would only see the local node’s view of the filesystem, not the real status from a different node, so we couldn’t trust that it works.
Naive: Reading the files? Like a hash?
That just exercises the local cache. Again, it tells us nothing about the other nodes.
Naive: Can we force the caches to flush somehow?
Actually, we can. Coreutils has a nice tool called sync that flushes file system buffers. If we run it at the end of our job we’ll make sure it is all written to disk.
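The idea was as simple as it sounds; something like this at the end of the first job’s script (the program invocation below is made up, the real one was the external GPU tool):

```sh
# End of the first job script (invocation is illustrative)
run_gpu_step --output my_big_file   # the external GPU program
sync                                # flush *all* file system buffers on this node
```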
Guess what happened? It… hung. The job would not finish for a long time, tens of minutes after the main program had finished. But… why?
You aren’t alone, pretty boy. #
Because the node runs several processes and jobs: processes that compete with our sync and access the same filesystem. sync doesn’t flush just our files; it asks the kernel to flush every dirty buffer on the node. And those other jobs are also working on big files that don’t fit in memory, so you basically force the whole node to start paging memory in and out, trying to flush never-ending writes for everything on the node, not just your job.
A better solution would have been to call fsync(2) (or an equivalent syscall) on the output file descriptors, limiting the flush to just the files we had written. Unfortunately, since this was our problem and the program was someone else’s, we couldn’t implement that ourselves.
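For completeness: newer versions of coreutils (8.24 and later, as far as I can tell) let sync take file arguments, which fsyncs just those files instead of every buffer on the node. Whether it was available on our cluster at the time is another matter, but it would have been a gentler variant of the same idea:

```sh
# Flush only the file we actually wrote (needs coreutils >= 8.24)
sync my_big_file          # fsync(2) on the file: data and metadata
sync --data my_big_file   # fdatasync(2): data only, skipping unneeded metadata
```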
In the end, we put a loop with a timer at the beginning of the second job, checking for the input files’ existence before letting the second job start processing. Bash is the glue that holds the world together.
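It looked more or less like this; the file name and timings are illustrative, not the original script:

```sh
# Beginning of the second job: wait until the input is visible on this node
# (names and timings are illustrative)
input=my_big_file
timeout=600    # give up after 10 minutes and fail loudly
elapsed=0

until [ -e "$input" ]; do
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "input $input never appeared, aborting" >&2
        exit 1
    fi
    sleep 10
    elapsed=$((elapsed + 10))
done

# ...and only then let the second program start processing
```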
Conclusion #
Distributed systems beget weird behaviours, and even weirder ones when we have large amounts of data and short time frames for our operations. We also take for granted how a filesystem works, despite it being much more complicated (and potentially full of gotchas) than we expect. Locality is also a factor, because the world can look different from different parts of your system, and the usual tricks may no longer be valid.
In the end, thorough systematic analysis and double-checking your assumptions will lead you to figure out what’s happening (and teach you quite a bit about how the underlying thing actually works).
Hope you had fun reading this!
Note: It’s been over 9 years since it happened. I’ve done my best to remember and double check with available information, but I may be misremembering some details. Explanations are very brief and superficial.