Saturday, April 25, 2020

The worst technical debt ever

Over the last few years, I've heard engineering teams rightly talk about technical debt and its consequences. Even non-technical executives are starting to understand its importance and the need to invest to avoid it. The other day as I was setting up a computer, I was reminded of the worst case I've ever seen of technical debt. I thought the story was worth telling here, but with a few details obscured to protect the guilty.

A few years ago, I visited one of company X's data centers. The data center was located in an older building in a slightly run-down part of town. The data center was a little hard to find because it wasn't marked in any way - there was nothing at all that made the building stand out. Outside the building, there was some trash on the sidewalk, including remnants of last night's take-outs that people had dropped on the street as they partied.

Once inside, things were different. Security at the entrance was shabby, but efficient and effective and we got through quickly. The interior was clean, but it was obvious the building hadn't been decorated in several years. Even the coffee machines had seen better days, but they worked.

We were given a detailed tour of the data center and built a good relationship with our guide. The data center had been one of the company's first and had been on the same site for several years. As you might expect, there were racks and racks of computers with technicians walking around fixing things and installing cables to connect new computers to the network. The air conditioning was loud and strong, which meant you had to be close to one another to talk - which also meant it was impossible to overhear conversations.

Late in the tour, I tripped on a loose floor tile that was a centimeter or two raised above the floor. Our guide apologized and told us we needed to be careful as we walked along. We asked why. This is where we discovered the technical debt.

Connecting computers in a data center means installing a physical cable from one computer (or router etc.) to another. You can either route the cable under the floor or on overhead trackways. Most data centers use some form of color-coded cables so you have some indication of what kind of data a cable's carrying (red cables mean one sort of data, blue another, yellow another, and so on). Some even go further and give unique labels or identifiers to cables, so you can identify a cable's pathway from end to end. Routing cables is something of an art form, and in fact, there's a subreddit devoted to it; from time to time I look at the pictures when I need an ordered view of the world. As you might expect, there's also a subreddit that focuses on the reverse.

Our guide told us that right from the start, the management at the data center wanted to save money and do things quickly. From time to time, routers and servers were moved or removed. Instead of removing the old cable, they just left it under the false floor and added the new cable on top of it. New cable was laid on top of old cable in any order and in any fashion; so long as the job was done cheaply and quickly, it was fine. Over time, the layers of cabling built up and up, like the strata in the rock you see at the Grand Canyon. You could even see when the company changed its cable supplier because the cable shade changed a little. Unfortunately, they always chose the same color cable (which happened to be the cheapest).

After a few years, management realized that leaving the old cable in place was a bad idea, so they instructed staff to try to remove the old cables. Unfortunately, there was so much cabling present, and it had been laid so haphazardly, that removing it was physically impossible: the cables were too intertwined. In a few cases, they'd tried to pull up old cables by physical force, but this stripped the insulation off other cables and connections failed. Obviously, leaving old cable connections just hanging around is a bad idea, so the management team told the technicians to cut off the ends of old cables as far along as they could. This meant that the old dead cable was left in place under the floor, but it all looked fine on the surface. Because the cabling ran under the floor and the cut ends were out of sight, a superficial inspection would show that everything was working fine.

Sweeping things under the rug went on for a while longer, but there was only so much false floor. By the time of my tour, there was no more space. In fact, the situation was so bad that the floor tiles wouldn't sit properly in their supports anymore. That's why we were tripping over tiles. When no one was looking, our tour guide removed one of the floor tiles to show us the cabling underneath. I was horrified by what I saw.

(Not the actual cables - but gives you a flavor of what I saw. Image source:,_electric_cables_(4).jpg. License: Creative Commons. Photographer: Jean-Pierre)

Cables were packed together with no room at all between them. They had obviously been laid across each other with no organization. It was as if a demented person had been knitting with cables, leaving no gaps. There was no give in the cables and it was plain it was more or less a solid mass down to the real floor. By my estimate, the cabling went to a depth of 30 cm or more. I could clearly see why it was impossible to pull out old cables: they had no markings, so you couldn't tell them apart; they were so intertwined you couldn't unpick them; and there were so many of them that they were too heavy to lift. In fact, there was no room under the floor to do any kind of maintenance.

There were some gaps in the cables though. Our guide told us that the data center was starting to have a vermin problem. Of course, there was a ready supply of food outside, and rats and mice had found sufficiently large gaps in the cabling to set up home.

I asked what happened when they needed to connect up computers now there wasn't any room under the floor to lay anything. Our guide showed us some men working round the corner. They had stepladders and were installing overhead cable ducting. This time, the cables were properly color-coded and properly installed. It was a thing of beauty to see the ordered way they were working and how they'd laid out the cables. The cables were also individually labeled, making the removal of old cables much easier.
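To illustrate why individual labels make removing old cables so much easier, here's a minimal sketch in Python. It is not anything I saw at the data center: the `Cable` record, the `removable_cables` function, and the device names are all hypothetical, and it assumes someone keeps a registry that maps each cable label to the two devices it connects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cable:
    """Hypothetical registry entry for one labeled cable."""
    label: str   # unique identifier, assumed to be printed on both ends
    color: str   # color code for the kind of data carried
    end_a: str   # device at one end
    end_b: str   # device at the other end


def removable_cables(cables, live_devices):
    """Return cables with at least one end on a device that is no longer in service."""
    live = set(live_devices)
    return [c for c in cables if c.end_a not in live or c.end_b not in live]


# Made-up example data: router-2 and server-07 have been decommissioned.
cables = [
    Cable("C-0001", "red", "router-1", "server-12"),
    Cable("C-0002", "blue", "router-1", "server-07"),
    Cable("C-0003", "red", "router-2", "server-12"),
]
live_devices = ["router-1", "server-12"]

dead = removable_cables(cables, live_devices)
print([c.label for c in dead])  # ['C-0002', 'C-0003']
```

With a registry like this, a technician would know exactly which labels to look for when a router or server is removed. The cables under the floor had no markings, so no such lookup was possible.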

The next obvious question was, what about the old cable under the floor? The plan seemed to be to sweep everything under the rug. Create new overhead connections until all of the old connections were unnecessary and then leave the old cables and forget about it.

To his credit, our guide seemed ashamed of the whole thing. He seemed like a decent man who had been forced into doing bad things by poor management decisions. Notably, we never saw senior management on our tour.

A while later, I heard the data center was temporarily closed for improvements. These improvements went on for many months and I never heard exactly what they were. I suspect the executive team was embarrassed by the whole thing once they found out the extent of the problem and ordered a proper cleanup. At the time of my tour, I wondered about the fire risk, and obviously having a vermin problem is never a good thing for any business, so maybe something bad happened that made the problem impossible to ignore.

I heard a rumor sometime later that the data center had passed an external quality inspection and received some form of quality certification. I can see how this might have happened; their new processes actually seemed decent, and if they could make the floor tiles sit flat, they could hide the horror under the floor. Most quality inspections focus on paperwork trails and the inspectors I've met didn't seem like the kind of people who would want to get their hands dirty by lifting floor tiles.

So what did I learn from all of this?

  • Technical debt is real. You eventually have to pay for short-term time- and money-saving decisions. There's never a good time to pay and the longer you leave it, the bigger and more expensive the fix becomes.
  • Just because something's been done a certain way for a long time doesn't mean it's good. It might just mean the problems haven't surfaced yet.
  • If you're inspecting something, always get your hands dirty and always talk to the people doing the work. Things may look good on the outside, but might be rotten underneath. If we hadn't established a good rapport with our guide and I hadn't tripped on the floor tile, we would never have discovered the cable issue.
  • If something looks bad, look carefully for the cause. It would have been easy to blame the technicians for the cable nightmare, but it wasn't their fault. They were responding to the demands placed on them by their management. Ultimately, management is the cause of most failures.
