Robust Infrastructural Programs

Martin F. Johansen, 2024-03-31

The robustness of computational programs depends primarily on the robustness of CPUs and memory, which are extremely robust. Given that, we can further improve robustness through code quality: the program either succeeds at the computational task or explains why it could not be done. This is a matter of code correctness.

For infrastructural programs, robustness also depends on the devices the program uses. This places an additional requirement on an infrastructural program: it must continue working as well as possible when devices fail.

Note that a pure computational server is uncomplicated. It waits for a request, and when that request comes, it processes it and sends the response (not waiting for a confirmation). It is now ready to process the next request. This is trivially robust.

Even a pure client, however, is more complicated with respect to robustness. It sends a request to a server. The worst case is that the server never responds in any way. Then the client can only resort to a timeout. If the client blocks while waiting for the server, this is experienced as a hanging client, which is a very low-quality experience. The client can also check whether the sending itself failed, as signaled by a response such as ICMP's destination unreachable or TCP's connection refused.

What the client must do is send the request and then do other work until the response comes back. This can include answering server requests, waiting for a timeout to expire, or checking whether the sending itself failed. This does not fit into a simple imperative flow. Rather, after the request has been sent, the iteration must end, so the state must be stored away in the state structure. This is also required by region-based memory: when the iteration is done, its working memory must be freed. In a later iteration, when the response comes back, the program continues processing using the data from the state.
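The idea of storing the request state between iterations can be sketched as follows. This is a minimal illustration, not code from the progsbase library: the class PendingRequest, its Status values, and the step method are hypothetical names, and time is passed in explicitly so that the sketch does not depend on a real clock or socket.

```java
// Hypothetical sketch: the state of one outstanding request, kept between
// iterations of a main loop so that no iteration ever blocks on the server.
final class PendingRequest {
    enum Status { WAITING, DONE, TIMED_OUT }

    private final long deadlineMillis;   // absolute time after which we give up
    private Status status = Status.WAITING;
    private String response;             // set when a reply arrives

    PendingRequest(long sentAtMillis, long timeoutMillis) {
        this.deadlineMillis = sentAtMillis + timeoutMillis;
    }

    // Called once per loop iteration. 'arrived' is whatever reply the loop
    // found for this request in this iteration, or null if there was none.
    Status step(long nowMillis, String arrived) {
        if (status != Status.WAITING) {
            return status;               // already settled; late replies are ignored
        }
        if (arrived != null) {
            response = arrived;
            status = Status.DONE;
        } else if (nowMillis >= deadlineMillis) {
            status = Status.TIMED_OUT;
        }
        return status;
    }

    String response() {
        return response;
    }
}
```

A main loop would create one PendingRequest when it sends the request, return to its other work, and call step in each later iteration until the status is no longer WAITING.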

Take this as a running example: A server providing disk mirroring. It depends on three virtual disks and provides one virtual disk interface as well as an administrative port. The code for this example is available in full in the progsbase library no.inductive.libraries DiskMirroring 0.1.1.

If one of the disks suffers a failure such that it no longer answers, the disk mirroring program can continue running correctly because it does not sit still waiting for the failed disk. It must continue processing requests. It must remember the state between iterations, and if a timeout expires, it marks the disk as failed. The client continues to get a good experience of a working disk solution, and the administrator can query the status on the disk mirror's administrative port. The administrator can replace the broken disk and ask the disk mirroring program to try again, thereby healing the system in flight.
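The bookkeeping this requires might look roughly like the sketch below. It is not the DiskMirroring library code: MirrorHealth and all of its methods are hypothetical names chosen for illustration, it assumes at most one outstanding request per disk, and it takes the current time as a non-negative parameter instead of reading a clock.

```java
// Hypothetical sketch of per-disk health tracking in a disk mirror.
final class MirrorHealth {
    enum DiskState { OK, FAILED }

    private static final long NONE = -1;   // no request outstanding
    private final DiskState[] state;
    private final long[] deadlineMillis;
    private final long timeoutMillis;

    MirrorHealth(int disks, long timeoutMillis) {
        this.state = new DiskState[disks];
        this.deadlineMillis = new long[disks];
        this.timeoutMillis = timeoutMillis;
        for (int i = 0; i < disks; i++) {
            state[i] = DiskState.OK;
            deadlineMillis[i] = NONE;
        }
    }

    // Record that a request was sent to a disk in this iteration.
    void requestSent(int disk, long nowMillis) {
        if (state[disk] == DiskState.OK) {
            deadlineMillis[disk] = nowMillis + timeoutMillis;
        }
    }

    // Record that the disk answered; nothing is outstanding any more.
    void responseReceived(int disk) {
        deadlineMillis[disk] = NONE;
    }

    // Run once per iteration: mark disks with overdue requests as failed.
    void checkTimeouts(long nowMillis) {
        for (int i = 0; i < state.length; i++) {
            if (deadlineMillis[i] != NONE && nowMillis >= deadlineMillis[i]) {
                state[i] = DiskState.FAILED;
                deadlineMillis[i] = NONE;
            }
        }
    }

    // The mirror can keep serving as long as at least one disk is healthy.
    boolean canServe() {
        for (DiskState s : state) {
            if (s == DiskState.OK) return true;
        }
        return false;
    }

    // What an administrative port could report.
    String status() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < state.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append("disk").append(i).append('=').append(state[i]);
        }
        return sb.toString();
    }

    // Administrator asks the mirror to try a replaced disk again.
    void retry(int disk) {
        state[disk] = DiskState.OK;
        deadlineMillis[disk] = NONE;
    }
}
```

Because checkTimeouts only compares stored deadlines against the current time, it never waits on a disk, so requests from clients keep being served from the remaining healthy disks.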

Let us look at how this works in Linux. When doing a connect as a client, it can fail with ENETUNREACH or ECONNREFUSED. The first means that the network is unreachable, for example because a router responded with an ICMP packet saying that it does not know how to forward the TCP SYN packet. The second means that the server's OS answered with a TCP RST packet because no service is listening on that port. ETIMEDOUT, however, means that the connect function itself timed out waiting for a SYN-ACK from the server. In Linux, the length of the connect timeout can be controlled by making the socket non-blocking and waiting for it to become writable with select, passing the desired timeout.

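Let us look at how this works in Java. java.net.Socket throws a SocketException (in practice a subclass such as ConnectException or NoRouteToHostException) in case of ICMP unreachable packets or a connection-refused TCP RST packet. If the connection attempt exceeds the timeout passed to connect(SocketAddress, int), it throws a SocketTimeoutException. Note that setSoTimeout sets the timeout for blocking reads on an established connection, not the timeout for connecting.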
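As a minimal sketch of how these outcomes can be told apart, the helper below attempts a connection with an explicit connect timeout and maps the exceptions to a result. ConnectProbe, its Result values, and probe are hypothetical names; only the java.net and java.io classes are standard. The mapping is approximate, since the JDK also uses ConnectException for some connect failures other than a refused connection.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper: classify the outcome of one TCP connection attempt.
final class ConnectProbe {
    enum Result { CONNECTED, CONNECT_ERROR, UNREACHABLE, TIMED_OUT, OTHER_ERROR }

    static Result probe(String host, int port, int connectTimeoutMillis) {
        try (Socket socket = new Socket()) {
            // The second argument is the connect timeout; 0 would mean wait indefinitely.
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
            return Result.CONNECTED;
        } catch (SocketTimeoutException e) {
            return Result.TIMED_OUT;        // no answer within connectTimeoutMillis
        } catch (NoRouteToHostException e) {
            return Result.UNREACHABLE;      // typically an ICMP unreachable condition
        } catch (ConnectException e) {
            return Result.CONNECT_ERROR;    // most commonly connection refused (TCP RST)
        } catch (IOException e) {
            return Result.OTHER_ERROR;      // e.g. unknown host
        }
    }
}
```

A call such as ConnectProbe.probe("127.0.0.1", 8080, 2000) still blocks the calling thread for up to two seconds, so a program that must stay responsive would run it off its main loop or use non-blocking channels instead.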
