A long time ago in my "C" programming days, I learned that when you code up anything that depends on any sort of external data, be it a file, database or socket, you should be paranoid and do it defensively. After all, you can't control those things and there's no guarantee that they will always work the way you hope. Sometimes you care about every possible error code; sometimes just success or failure. The point is to check the result of what you tried to do.

Fast forward through several years of C++ and ten years into Java, and our boss calls us into the office.

The Command Controller application is failing and nobody knows why. It runs fine for a while and then it starts throwing what appear to be random exceptions. The problem is happening on almost every single command that it receives, but only in production. We cannot reproduce the issue in any of the other environments (DR, pre-prod, QA or Dev). The team that wrote it is dumbfounded and has asked for help. We have a pretty good reputation for solving tough issues, so you guys need to drop everything and figure this out.

Normally, the first step would be to attach a debugger and see where it's going ker-flooie. Nope: you aren't allowed to do that in production - ever, no exceptions!

Well, if it couldn't be reproduced anywhere else, how were we going to find it? We settled on truss, reasoning that we could capture its output, walk through it against the source code, and try to match up all the I/O calls. Painful, but doable.

As you might imagine, this generated quite a large log file. After fourteen long hours, we finally spotted the problem:

    try {
        byte[] buffer = new byte[8192];
        bufferedInputStream.read(buffer);
        // Use buffer
    } catch (Exception e) {
        // ...
    }

Something clicked from decades in the past: make sure you actually got the number of bytes you intended to read!
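
For anyone who has only ever seen this happen to work: InputStream.read(byte[]) makes no promise to fill the buffer. It blocks until some input is available, reads what it can, and returns the count, or -1 if the stream has ended. The broken code above simply throws that count away. Here's a minimal sketch of the contract, using a hypothetical helper that isn't from the original application (java.io imports omitted, as in the other snippets):

    // Hypothetical helper: mirrors the single-line read above, but at least
    // looks at what read() actually returned.
    static int readOnce(InputStream in, byte[] buffer) throws IOException {
        int bytesRead = in.read(buffer); // anywhere from -1 to buffer.length
        if (bytesRead == -1) {
            throw new EOFException("Stream ended before any data arrived");
        }
        // bytesRead may legitimately be smaller than buffer.length: the rest of
        // the command simply hasn't arrived on the socket yet. The caller has to
        // keep reading until it has a full command, which is exactly what the
        // original code never did.
        return bytesRead;
    }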

But why would this suddenly start to fail? Something had to have changed! This code hadn't been touched in several years, and nothing had been deployed for three months! After we asked around the data center, it turned out that some SA had applied a routine OS patch, which awakened this sleeping gremlin. Presumably the patch changed the network stack's buffering or timing just enough that reads started coming back partially filled; they were never guaranteed to come back full in the first place. To check, we ran the unmodified binary on an unpatched machine and, sure enough, the problem disappeared.

Since rolling back the OS patch in production was never going to happen at 3AM, we wound up grepping through every line of that application's code and replacing all of the single-line socket reads with the proper loop, with appropriate error checking:

    int COMMAND_BUF_SIZE = 8192; // loaded from a property file
    byte[] buffer = new byte[COMMAND_BUF_SIZE];
    int totalBytesRead = 0;
    try {
        while (totalBytesRead < buffer.length) {
            int numBytesReadThisTime = bufferedInputStream.read(buffer, totalBytesRead, buffer.length - totalBytesRead);
            if (numBytesReadThisTime == -1) {
                break; // Unexpected end-of-stream
            }
            totalBytesRead += numBytesReadThisTime;
        }
    } catch (final Exception e) {
        // Handle read-error
    }
    if (totalBytesRead < COMMAND_BUF_SIZE) {
        // throw exception for unexpected EOF
    }
    // Use buffer
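
A footnote for anyone who hits the same bug with more sleep in the bank: the standard library already wraps this loop for you. DataInputStream.readFully(byte[]) keeps reading until the buffer is completely filled and throws EOFException if the stream ends first. A sketch, reusing the same bufferedInputStream and assuming the same fixed-size command protocol as above:

    // readFully() loops internally: either the buffer comes back completely
    // filled, or an exception is thrown.
    DataInputStream commandStream = new DataInputStream(bufferedInputStream);
    byte[] buffer = new byte[8192]; // COMMAND_BUF_SIZE in the real code
    try {
        commandStream.readFully(buffer);
        // Use buffer
    } catch (EOFException e) {
        // Stream ended before a full command arrived
    } catch (IOException e) {
        // Handle read-error
    }

Same semantics as the hand-rolled loop, minus the chance of fumbling the offset arithmetic at 3AM.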

Then all we had to do was the manual build and deployment at 5AM.

At 6AM, we all went out for breakfast. At 7:30, we came back to grab our stuff and go home. That's when the other managers in the department pointed out that there was work to do. That's when we pointed out that while they were in slumberland, we had been at it all night long and it was time for us to go nite-nite. Then we left.
