The mystery of the evil line endings
At Black Parrot Labs we have been brought in to help with a number of tricky situations over the years that, at first, appeared to be related to Apiman, but have almost always turned out to be something else. It's a side effect of API Management being a central component between so many systems — it tends to get the blame for everything!
Backend service sometimes saw HTTP headers joined together
On this occasion, BPL were drafted in to investigate why some of the client's backend services were intermittently unable to parse an inbound payload, resulting in the service's middleware rejecting the request.
Our first port of call was to set up some meetings with the client's senior developer, DevOps, QE, and ops teams. We spent a considerable amount of time digging through logs on the service and Apiman's metrics, but nothing evident turned up.
One of the key difficulties was that the team were unable to reproduce the error in their dev and testing environments; it only seemed to occur under high load on production systems. Even then, it appeared to represent (possibly considerably less than) 0.5% of requests. Still, this was unacceptable, as it potentially left the user's account in a broken state that required manual resolution.
The error itself, at its most basic level, appeared to be that two headers were sometimes joined together with junk characters between them. Depending on which headers were affected, the request was either rejected at the middleware level for having illegal values (e.g. in the Accept header), or rejected by the application itself for failing various schema checks.
In the service's error logs, it looked something like: Accept: application/json0x0d0x0aX-Another-Header: https://example.org
As the parsing error often occurred in middleware, it was difficult to tell what the correlation ID of the request was, and hence the client was unsure which specific request was causing the exceptions. This was an area that needed tackling, as it was causing confusion and fuelling various theories about requests potentially being misdirected.
Initial analysis
I immediately homed in on the weird characters between our hydra headers: 0x0d is hex for \r (carriage return), and 0x0a is hex for \n (newline).
As a quick primer, the HTTP header section is a simple key-value list that looks something like this:
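```
Host: backend.example.org
Accept: application/json
Content-Type: application/json
X-Correlation-Id: abc-123
```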
Key and value are separated by a colon character :, and each key-value pair is terminated by \r\n (some parsers also accept a bare \n).
So, the client's application was, for whatever reason, not seeing the end of the key-value pair, and was instead interpreting the \r\n newline as part of the value and printing it encoded as a hex string. Mysterious.
I had various theories for why this could be:
- You aren't allowed to mix different line endings in the header section, so you must consistently use \r\n or \n. Flip-flopping can cause weird effects in some applications, because they 'lock' onto the first line-ending type encountered; the other type may then be interpreted as literal characters.
- An illegal and/or invisible character in the payload that is causing the parser to get offset somehow and start mis-parsing.
- Someone may be mangling the header before it ever reaches the service and encoding it as a literal string. Culprits could be things like APM software that does bytecode rewriting to 'wrap' HTTP clients.
- Apiman could be doing something weird.
- Various network devices existed on the K8s network between Apiman and the backend service, so in theory the corruption could be caused by some intermediary appliance that was trying to do something smart (logging, security, etc).
- Something in the backend application was mis-parsing under load.
Initial attempts to home in on the problem
There were some significant challenges to identifying the issue. The end-customer was understandably very sensitive about changes to production, and the process of getting things approved was quite time-consuming. The development team had also been unable to reproduce the issue on any of their testing environments.
And based on the limited information I had at that point, I was also unable to reproduce the issue.
We spent a considerable amount of time digging through existing logs as this was the best way to avoid any risk before starting to look at more advanced solutions.
Interestingly, Apiman's metrics logs never contained any problematic characters. However, as invisible characters might have been involved, it was unclear how much Elasticsearch could tell us; in any case, nothing there suggested an issue.
Furthermore, because of the way the error manifested in the backend service, the ops team were not entirely sure which request was the problematic one, and had to use timestamps for rough correlation. This obviously was not ideal.
Apiman
I first tried to eliminate some low-hanging fruit. In consultation with the client, I created a custom version of Apiman with some small changes that could detect (and eliminate) any unusual characters in the headers.
After a testing phase and approval from the client, this was deployed; however, the problem remained and no unusual characters were ever logged in Apiman.
We then tried disabling an APM solution that was on the Apiman image. However, again, this appeared to make no difference.
We spent a while researching the various dependencies and middleware that Apiman uses to see whether anyone had seen such an issue. There was nothing, and nobody else had reported Apiman causing an issue like this.
Going lower level: packet capture
It was clear to me at this point that we needed to see exactly what was on the wire, so that we could verify what was coming into Apiman, going out of Apiman, and arriving at the backend service where the error occurred.
I built a detailed plan of action for the customer to capture packets using tcpdump which I could then analyse.
This would have a few benefits:
- We could capture the entire header section of inbound packets, and thereby correlate requests where needed.
- This captures from the wire, and hence should be exactly what the operating system sees. No interference from middleware or other applications.
- As we get the exact binary, we can use a hex editor to inspect the payloads for unusual or invisible characters (see the short illustration just below).
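As a quick illustration of why the raw bytes matter, a hex view makes \r (0x0d) and \n (0x0a) impossible to miss. The example below uses xxd purely for demonstration, with made-up header values; the real analysis used the captured pcaps.

```bash
# Dump an example header section as hex so the 0x0d 0x0a line endings become visible.
printf 'Accept: application/json\r\nX-Another-Header: https://example.org\r\n\r\n' | xxd
```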
There were a few challenges:
- The client liked the idea, but had concerns that the high-traffic service would quickly fill their disks.
- Some of the services used TLS, so this needed to be handled somehow.
- We needed a privileged container: this was something for the client's ops teams to do, with our support.
Disk space exhaustion
To solve the client's disk space concerns I used a combination of tricks:
- I made use of tcpdump's port filtering, so we only captured traffic on the relevant ports; tcpdump has a simple expression language for this.
- We set a maximum size (snaplen) for each captured packet, large enough to ensure we always captured the full header section, while truncating long bodies. For the purposes of this analysis, the body was not relevant.
- We made use of bzip2 compression and tcpdump's rollover features, configuring tcpdump to roll over once each file reached about 500MB (raw). The compression worked exceptionally well in practice, with a 10-50x size reduction. (These options are combined in the sketch below.)
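Putting those options together, the capture invocation looked roughly like the following sketch; the interface name, ports, and file paths here are illustrative assumptions rather than the client's actual values:

```bash
# -s 2048        snaplen: always enough for the full header section; long bodies are truncated
# -C 500 -W 100  roll each capture file over at ~500MB (raw), keeping at most 100 files
# -z bzip2       compress each file as it is rotated out
# The trailing expression is tcpdump's capture filter, restricting capture to the relevant service ports.
tcpdump -i eth0 -s 2048 -C 500 -W 100 -z bzip2 \
  -w /capture/backend.pcap 'tcp port 8080 or tcp port 8443'
```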
In aggregate, these settings ensured that we could packet capture over the platform's peak periods (which is where the error occurred) without causing the machines to run out of disk. As a precaution, we boosted disk and CPU allocations on the machines taking part.
Unlocking encrypted packets
Some of the services were encrypted, so we needed to get hold of the encryption keys. In modern crypto protocols with perfect forward secrecy, just having the server's base certificates/keys does not help: each session negotiates its own ephemeral keys.
I used jSSLKeyLog, a Java agent that spits out TLS keys from the application. You can then use these to decrypt the pcap files in your favourite analysis software. My preferred platform is Wireshark, but there are many excellent options available.
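In practice this boils down to two steps, sketched below; the agent path, keylog location, and the way the gateway JVM is launched are placeholders rather than the client's actual configuration:

```bash
# Attach jSSLKeyLog as a Java agent so the JVM writes TLS session secrets to a keylog file.
export JAVA_OPTS="$JAVA_OPTS -javaagent:/opt/agents/jSSLKeyLog.jar=/capture/tls-keys.log"

# Point Wireshark/tshark at that keylog file to decrypt the captured traffic, e.g.:
tshark -r backend.pcap -o tls.keylog_file:/capture/tls-keys.log -Y http
```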
What we discovered
Ultimately, the approach proved successful. We were able to correlate the specific requests flowing through the system from Apiman to the service, and identify which ones seemed to be involved in the parsing and validation errors on the backend service. The TLS keys we captured were loaded into Wireshark, enabling those packets to be decrypted.
We used Wireshark for high-level analysis and tracing, and ImHex for detailed binary analysis. I was able to immediately eliminate a wide range of possible causes, including Apiman being involved, as it did not appear that any corrupted data was actually arriving on the wire at the backend service; everything looked normal, and those same payloads replayed against dev and testing instances did not cause errors. But we had now narrowed things down significantly and had some payloads for further investigation.
Using the captured data as a template, I was able to manually create some reproducers for the backend service's developer that triggered the issue — although we did not directly observe any odd characters on the wire. I used ncat to send the reproducers directly onto the wire without validation (cURL does not work here, as it fixes up the payloads).
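A minimal sketch of that kind of raw reproducer is below; the host, port, path, and headers are illustrative, not the client's actual payloads. printf gives exact control over the bytes sent, including deliberately mixed \r\n and \n line endings, which cURL would normalise away:

```bash
# Send a hand-crafted request verbatim; note the bare \n after the Accept header.
printf 'POST /api/orders HTTP/1.1\r\nHost: backend.internal\r\nAccept: application/json\nX-Correlation-Id: abc-123\r\nContent-Length: 2\r\n\r\n{}' \
  | ncat backend.internal 8080
```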
These discoveries led to the backend service developer deploying some workarounds and fixes that mitigated the problem.
The issue only seemed to trigger when schema validation was running, under load, and with certain types of payloads. This suggests to me that the character corruption may be a concurrency issue in some part of the schema validation stack, one that gave an extremely misleading impression that the corrupt data was arriving from 'outside'.
The fact that we could 'replicate' the issue from outside the application by providing a magic combination of weird line endings and invisible characters, yet never observed them directly on the wire, means we were likely indirectly replicating an issue caused within the schema validator (or similar). With the validator partly turned off, the header data never actually became corrupted in the service itself, although some safeguards were put in place.
An extremely interesting and unusual issue, and we were delighted to be able to help the client resolve it — albeit one that turned out to be unrelated to API Management!