Have you turned it off and on again?

When I was a child, turning a computer, a modem, a TV, or a VCR off and on again to get it to work was a rite of passage.

Every member of my family should have known to do this, because it was one of the first things my father and I would try.

Many years later, as a seasoned IT professional, it's not a tool I reach for as often, but maybe I should!

Today at work, we had a production outage.

Shortly after I arrived, a lead engineer noticed something wrong. The system remained up for most of the day whilst they and another engineer investigated.

It looked like a connected service, perhaps a third-party provider, was letting us down. I'm rather used to that. It's one of the reasons engineering teams can be so focused on resilience and scale, even when the rest of the world doesn't get it. When you operate beyond a single machine, or a handful of them, and people's lives can be affected, special measures have to be taken to avoid disaster.

They found the affected users, checked whether applications were running, traced requests using patterns such as distributed request IDs, assembled facts, checked APM, and considered recent code changes and thread pools.
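
If you've not met the distributed request ID pattern: attach a unique ID to every request at the edge and log it in each service the request touches, so one failing request can be followed across the whole system. Here is a minimal sketch of what that can look like at an nginx front-end; the header name, log format name, and addresses are illustrative choices of mine, not our actual configuration:

    # (Inside the http {} block.)
    # Include the per-request ID in every access-log entry.
    log_format traced '$remote_addr [$time_local] "$request" '
                      '$status request_id=$request_id';

    server {
        listen 80;
        access_log /var/log/nginx/access.log traced;

        location / {
            # $request_id is a built-in nginx variable: a random ID
            # generated for each request. Forwarding it as a header
            # lets downstream services log the same ID.
            proxy_set_header X-Request-ID $request_id;
            proxy_pass http://127.0.0.1:8080;
        }
    }

With something like that in place, grepping every service's logs for a single request ID reconstructs the path one request took.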

I'd reached out and let people know I was there if they needed me. I don't like not being involved, but too many people shouldn't drop everything. I got some nice work delivered that I'm proud of, and whenever I ran a test suite, I'd think about the last time I'd checked in. The problem sounded like it could make waves, but I'd logged in and things worked. I was confident they would find something interesting.

Eventually a third engineer stepped in, and production went down. This is no longer just some error messages; this is no longer something I can sit back from. Our CTO organises a meeting, and several more engineers step in.

Group think

We had a group call; everyone put forward suggestions and we checked them, looking for evidence, often working in parallel. Investigating like this is a read-heavy operation, so working in parallel was fine and introduced no further risk. Some people were silent, and those people are essential: if everyone is talking, understanding disappears. Their presence and support were welcomed and appreciated.

I pointed out some things I'd found, as did others, and we tried them one at a time. I communicated my understanding of the failure points so that we were all on the same page. It seemed we all knew.

Eventually I restarted some apps. The details are not important. They got turned off and on again.

Production now works

The details if they matter to you

The actual problem seems to have been an nginx load-balancer.

Many people use nginx in this way. It's a pretty good tool and has saved my bacon on more than a few occasions. The only tool I prefer more is the BSD HTTP server.

While looking at the logs, I had noticed a message that an upstream server was unavailable. From years of working with nginx, I knew this was a potential candidate for a fix, because I've seen some of its behaviours. The message appeared often; after the restart / redeploy, it stopped appearing.
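
For context, this is roughly the behaviour I had in mind: nginx's built-in load balancing marks an upstream server as unavailable after repeated failed attempts and routes around it for a while, logging messages as it does so. A minimal sketch of such a setup follows; the servers, ports, and thresholds are made up for illustration and are not our real configuration:

    upstream app_backend {
        # After max_fails failed attempts within fail_timeout, nginx
        # treats the server as unavailable and stops sending it
        # traffic for the rest of the fail_timeout window.
        server 10.0.0.11:8080 max_fails=3 fail_timeout=30s;
        server 10.0.0.12:8080 max_fails=3 fail_timeout=30s;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://app_backend;
            # On a connection error or timeout, try the next server
            # in the group rather than failing the request outright.
            proxy_next_upstream error timeout;
        }
    }

If the upstreams never come back cleanly, or the proxy itself gets into a bad state, a restart or redeploy of the pieces involved is sometimes the quickest way to reset that picture, which seems to be what happened here.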

The important takeaway is that we could have saved a lot of brainpower by simply turning some things off and on again.