James John – Software Engineer

When EWOULDBLOCK Feels Like a Lie: Handling Non-Blocking Socket Latency in C

I never thought I would reach a point in my career where I had to write code specifically to “slow down” a computer. Then I started writing low-level network code in C.

If you have spent time in the trenches of systems programming, you know the feeling. You build a robust system, but the moment it interacts with a high-level language, things get weird. Recently, while training AI agents, I encountered a race condition in socket programming that forced me to rethink how I handle non-blocking I/O.

The Scenario: A C Echo Server

I am building a custom echo server in C. The architecture is straightforward:

  1. Receive a payload.
  2. Echo the payload back.
  3. Perform some post-processing.
  4. Close the connection.

To prevent the server from hanging on slow clients, I used non-blocking I/O. On the client side, I used a standard Python script with sendall to push data to the server.

The Bug: The “Heisenbug” That Vanished Under Valgrind

The issue appeared when testing on the loopback interface. My C server was reading faster than the client could send: when recv found the socket buffer empty, my code interpreted that as EOF and closed the connection prematurely. This caused the Python client to crash with a ConnectionResetError (“Connection reset by peer”).

But here is where it got maddening: The bug was impossible to debug using standard tools.

When I ran the server under Valgrind to check for memory leaks or errors, the bug vanished. Everything worked perfectly.

This was a classic “Heisenbug.” Valgrind naturally slows down program execution significantly. That extra latency was just enough to slow my C server down, allowing the Python client to catch up and fill the buffer. The moment I removed Valgrind, the C server went back to light-speed, and the race condition returned.

The Root Cause: Loopback Speed vs. Python Overhead

The Valgrind behavior confirmed my suspicion: this was purely a speed mismatch.

recv is not designed to wait for “all” of the data; it returns whatever happens to be in the socket buffer. You can request complete reads with MSG_WAITALL, but that flag blocks the thread (and on a non-blocking socket it cannot override O_NONBLOCK anyway), defeating the purpose of my non-blocking architecture.

I needed a way to distinguish between:

  1. True EOF: The client is actually done sending.
  2. Network/Processing Lag: The client (Python) is just slow, or the OS hasn’t flushed the buffer yet.
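At the recv level, the two cases do surface differently: a return of 0 is a true EOF (the peer shut down cleanly), while -1 with errno set to EAGAIN or EWOULDBLOCK only means the buffer is empty right now. A sketch of that distinction (the helper and enum names are mine):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

enum read_status { READ_DATA, READ_EOF, READ_RETRY, READ_ERROR };

/* Classify one non-blocking recv() call. Illustrative helper. */
static enum read_status try_read(int fd, char *buf, size_t len, ssize_t *n)
{
    *n = recv(fd, buf, len, 0);
    if (*n > 0)
        return READ_DATA;                   /* got bytes */
    if (*n == 0)
        return READ_EOF;                    /* orderly shutdown: true EOF */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_RETRY;                  /* no data yet, NOT EOF */
    return READ_ERROR;                      /* real error */
}
```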

The Fix: The “Patience Counter”

Since I couldn’t rely on a simple blocking call, and I didn’t want to immediately complicate the architecture with a full event-polling loop, I implemented a heuristic solution: a threshold counter.

Instead of closing the connection the moment I hit EWOULDBLOCK, I added a counter.

This effectively acts as a “poor man’s timeout.” It gives the slower Python client just enough time to fill the buffer before the C server decides to cut the cord.
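In code, the idea looks roughly like this. It is a sketch of the technique rather than my exact server code: the retry threshold and the 1 ms back-off are tuning knobs I picked for illustration, and the echo/post-processing steps are omitted.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EAGAIN_RETRIES 50   /* tuning knob: how "patient" to be */

/* Drain a non-blocking socket, tolerating transient EWOULDBLOCK.
 * Returns total bytes read, or -1 on a real error. */
static ssize_t patient_read(int fd, char *buf, size_t cap)
{
    size_t total = 0;
    int misses = 0;

    while (total < cap && misses < MAX_EAGAIN_RETRIES) {
        ssize_t n = recv(fd, buf + total, cap - total, 0);
        if (n > 0) {
            total += (size_t)n;
            misses = 0;                 /* data arrived: reset the patience */
        } else if (n == 0) {
            break;                      /* true EOF: peer closed cleanly */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            misses++;                   /* maybe the client is just slow */
            usleep(1000);               /* back off 1 ms before retrying */
        } else {
            return -1;                  /* real error */
        }
    }
    return (ssize_t)total;
}
```

Resetting the counter whenever data arrives matters: patience should only run out after consecutive empty reads, not over the lifetime of the connection.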

The Big Question: Is Python’s sendall Too Slow?

This experience raised an interesting question about high-level abstractions. We often assume Python’s sendall is effectively instantaneous on a local machine, but interpreter overhead and context switching make it slow enough, compared to a raw C recv loop, to break logic that relies on immediate data availability.

Conclusion & Call for Debate

This counter-based approach fixed my Connection Reset issue, but it feels like a patch rather than a cure.

I want to hear from other systems engineers: Is this a valid strategy for simple flow control? Or does this point to a lack of proper message framing (like sending a content-length header first)? How do you distinguish between network lag and transfer completion without blocking?
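For reference, the framing alternative mentioned above can be as small as a 4-byte big-endian length prefix: once the receiver knows exactly how many bytes to expect, there is no EOF-guessing left to do. A sketch (function names are mine):

```c
#include <arpa/inet.h>   /* htonl / ntohl */
#include <stdint.h>
#include <string.h>

/* Length-prefixed framing: 4-byte big-endian length, then the payload.
 * The receiver reads 4 bytes, decodes the length, then reads exactly
 * that many payload bytes -- no ambiguity about when a message ends. */
static size_t frame_encode(char *out, const char *payload, uint32_t len)
{
    uint32_t be = htonl(len);
    memcpy(out, &be, sizeof be);
    memcpy(out + sizeof be, payload, len);
    return sizeof be + len;
}

static uint32_t frame_length(const char *in)
{
    uint32_t be;
    memcpy(&be, in, sizeof be);
    return ntohl(be);
}
```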

James John

Software Engineer