Saturday, April 25, 2020

The worst technical debt ever

Over the last few years, I've heard engineering teams rightly talk about technical debt and its consequences. Even non-technical executives are starting to understand its importance and the need to invest to avoid it. The other day as I was setting up a computer, I was reminded of the worst case I've ever seen of technical debt. I thought the story was worth telling here, but with a few details obscured to protect the guilty.

A few years ago, I visited one of company X's data centers. The data center was located in an older building in a slightly run-down part of town. The data center was a little hard to find because it wasn't marked in any way - there was nothing at all that made the building stand out. Outside the building, there was some trash on the sidewalk, including remnants of last night's take-outs that people had dropped on the street as they partied.

Once inside, things were different. Security at the entrance looked shabby, but it was efficient and effective, and we got through quickly. The interior was clean, but it was obvious the building hadn't been decorated in several years. Even the coffee machines had seen better days, but they worked.

We were given a detailed tour of the data center and built a good relationship with our guide. The data center had been one of the company's first and had been on the same site for several years. As you might expect, there were racks and racks of computers with technicians walking around fixing things and installing cables to connect new computers to the network. The air conditioning was loud and strong, which meant you had to be close to one another to talk - which also meant it was impossible to overhear conversations.

Late in the tour, I tripped on a loose floor tile that was raised a centimeter or two above the floor. Our guide apologized and told us we needed to be careful as we walked along. We asked why. This is where we discovered the technical debt.

Connecting computers in a data center means installing a physical cable from one computer (or router etc.) to another. You can either route the cable under the floor or on overhead trackways. Most data centers use some form of color-coded cables so you have some indication of what kind of data a cable's carrying (red cables mean one sort of data, blue another, yellow another, and so on). Some even go further and give unique labels or identifiers to cables, so you can identify a cable's pathway from end to end. Routing cables is something of an art form, and in fact, there's a sub-Reddit devoted to it - from time to time I look at the pictures when I need an ordered view of the world. As you might expect, there's also a sub-Reddit that focuses on the reverse.

Our guide told us that right from the start, the management at the data center wanted to save money and do things quickly. From time to time, routers and servers were moved or removed. Instead of removing the old cable, they just left it under the false floor and added the new cable on top of it. New cable was laid on top of old in any order and any fashion; as long as the job was done cheaply and quickly, it was fine. Over time, the layers of cabling built up and up, like the strata in the rock you see at the Grand Canyon. You could even see when the company changed its cable supplier because the cable shade changed a little. Unfortunately, they always chose the same color cable (which happened to be the cheapest).

After a few years, management realized that leaving the old cable in place was a bad idea, so they instructed staff to try and remove the old cables. Unfortunately, there was so much cabling present, and it had been laid so haphazardly, that removal was physically impossible: the cables were too intertwined. In a few cases, they'd tried to pull up old cables by physical force, but this stripped the insulation off cables and connections failed. Obviously, leaving old cable connections just hanging around is a bad idea, so the management team told the technicians to cut off the ends of old cables as far along as they could. This meant that the old dead cable was left in place under the floor, but everything looked fine on the surface; a superficial inspection would show nothing wrong.

Sweeping things under the rug went on for a while longer, but there was only so much false floor. By the time of my tour, there was no more space; in fact, the situation was so bad that the floor tiles wouldn't sit properly in their supports anymore. That's why we were tripping over tiles. When no one was looking, our tour guide removed one of the floor tiles to show us the cabling underneath. I was horrified by what I saw.

(Not the actual cables, but it gives you a flavor of what I saw. Image source: Wikimedia Commons. License: Creative Commons. Photographer: Jean-Pierre)

Cables were packed together with no room at all between them. They had obviously been laid across each other with no organization; it was as if a demented person had been knitting with cables, leaving no gaps. There was no give in the cables, and it was plainly a more or less solid mass down to the real floor. By my estimate, the cabling went to a depth of 30cm or more. I could clearly see why it was impossible to pull out old cables: they had no markings, so you couldn't tell them apart; they were so intertwined you couldn't unpick them; and there were so many of them, they were too heavy to lift. In fact, there was no room under the floor to do any kind of maintenance.

There were some gaps in the cables though. Our guide told us that the data center was starting to have a vermin problem. Of course, there was a ready supply of food outside, and rats and mice had found sufficiently large gaps in the cabling to set up home.

I asked what happened when they needed to connect up computers now there wasn't any room under the floor to lay anything. Our guide showed us some men working round the corner. They had stepladders and were installing overhead cable ducting. This time, the cables were properly color-coded and properly installed. It was a thing of beauty to see the ordered way they were working and how they'd laid out the cables. The cables were also individually labeled, making the removal of old cables much easier.

The next obvious question was, what about the old cable under the floor? The plan seemed to be to sweep everything under the rug: create new overhead connections until all of the old connections were unnecessary, then leave the old cables in place and forget about them.

To his credit, our guide seemed ashamed of the whole thing. He seemed like a decent man who had been forced into doing bad things by poor management decisions. Notably, we never saw senior management on our tour.

A while later, I heard the data center was temporarily closed for improvements. These improvements went on for many months and I never heard exactly what they were. I suspect the executive team was embarrassed by the whole thing once they found out the extent of the problem and ordered a proper cleanup. At the time of my tour, I wondered about the fire risk, and obviously having a vermin problem is never a good thing for any business, so maybe something bad happened that made the problem impossible to ignore.

I heard a rumor sometime later that the data center had passed an external quality inspection and received some form of quality certification. I can see how this might have happened; their new processes actually seemed decent, and if they could make the floor tiles sit flat, they could hide the horror under the floor. Most quality inspections focus on paperwork trails and the inspectors I've met didn't seem like the kind of people who would want to get their hands dirty by lifting floor tiles.

So what did I learn from all of this?

  • Technical debt is real. You eventually have to pay for short-term time and money-saving decisions. There's never a good time to pay and the longer you leave it, the bigger and more expensive the fix becomes.
  • Just because something's been done a certain way for a long time, doesn't mean it's good. It might just mean the problems haven't surfaced yet.
  • If you're inspecting something, always get your hands dirty and always talk to the people doing the work. Things may look good on the outside, but might be rotten underneath. If we hadn't established a good rapport with our guide and I hadn't tripped on the floor tile, we would never have discovered the cable issue.
  • If something looks bad, look carefully for the cause. It would have been easy to blame the technicians for the cable nightmare, but it wasn't their fault. They were responding to the demands placed on them by their management. Ultimately, management is the cause of most failures.

Saturday, April 11, 2020

How to be more persuasive when you speak: using ‘catchphrases’

One of the most famous speeches in history used ‘catchphrases’ for incredibly powerful effect; you’ll know the speech by its catchphrase alone. I’ve seen modern American politicians use the same rhetorical technique to heighten energy and to unify and drive home their message. You can use it in your speeches too; I’m going to show you how and why.

Like many rhetorical techniques, this one relies on the considered use of repetition. Specifically, it’s the repetition of a phrase or sentence throughout a speech as a kind of catchphrase.

Let me give you an example. Let’s say you’re an engineering leader and you’re trying to convince your team to take data security seriously. Using this technique, your speech might look something like this (catchphrase in bold).

If we lapse in securing our data, our company can be fined large amounts of money, putting our livelihoods at risk. By being secure, we prevent this from happening.

Security is our security.

If we have a data breach, our reputation will be sullied and it’ll be harder for us to win new business, with all that entails.

Security is our security.

Companies have suffered breaches of employee data too, putting social security numbers and other personal information out on the web for the highest bidder.

Security is our security.

Speakers use this approach to draw the audience’s attention to a key theme again and again and again; they use it to unify and focus a speech. It drives the point home in a forceful but elegant way.

My real example is by an influential African-American Christian preacher. He repeats one of the most famous lines in rhetoric as a catchphrase again and again. You’ll know it as soon as you hear it – in fact, you already know the words. Here's the YouTube link to the appropriate section.

(Image credit: Wikimedia Commons, open-source)

Here’s part of his speech, the catchphrase is in bold.

I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character.

I have a dream today.

I have a dream that one day, down in Alabama, with its vicious racists, with its governor having his lips dripping with the words of interposition and nullification; one day right there in Alabama, little black boys and black girls will be able to join hands with little white boys and white girls as sisters and brothers.

I have a dream today.

Martin Luther King repeats ‘I have a dream’ to bring the listener back to his point and to reinforce his message. ‘I have a dream’ – paragraph – ‘I have a dream’ – paragraph – ‘I have a dream’ – paragraph. He unifies his speech and drives home his point. (King’s speech is rhetorically interesting in other ways too; he uses a wide variety of techniques to make his points.)

I’ve done my homework on rhetoric and searched for this method in the books on techniques from antiquity. As far as I can tell, this technique is known as epimone. It's not one of the famous techniques and I think it's very underrated.

It seems to be used a lot in African-American Christian preaching and has spread to American politics from there. (As an aside, I've looked for resources on the analysis of rhetorical techniques used in African-American churches, but I've not been able to find any good ones. If anyone knows of some good analysis, please let me know.) I've heard a well-known American politician use it and I suspect we'll be hearing it more as we head into election season. Bear in mind that politicians use techniques like this deliberately because they know they work.

Here’s my recommendation for using this technique: if you’re trying to persuade or emotionally influence an audience, use it to hammer home your message and provide a simple unifying concept for people to take to heart.

Sunday, April 5, 2020

Sherlock Holmes, business books, Nazi fighters, and survivor bias

In the Sherlock Holmes story, Silver Blaze, Holmes solved the case by a deduction from something that didn’t happen. In the story, the dog didn’t bark, implying the dog knew the horse thief. This is a neat twist on something called survivor bias, the best example of which is Abraham Wald’s analysis of surviving bombers. I’ll talk about how survivor bias rears its ugly head and tell you about Wald and what he did.

Survivor bias occurs when we look at the survivors of some process and try to deduce something from their commonalities while ignoring everyone who didn’t make it through the selection. An obvious example might be collating details on lottery winners’ lives in an attempt to figure out what factors might lead to winning the lottery. For example, I might study what day of the week and time of day winners bought their tickets, I might look at where winning tickets were bought, and I might look at the age and gender of winners. From this, I might conclude that to improve my chances of winning the lottery I need to be a 45-year-old man who buys tickets at 3:40pm on Wednesday afternoon at a gas station. But the lottery is a random process, and all we’ve done is analyze who’s playing, not the causes of winning. Put like this, it seems almost incredible that anyone could fall for survivor bias, but it doesn’t always manifest itself in such obvious ways.

Let’s imagine I want to write a bestselling business book unraveling the secrets of how to win at business. I could select businesses that have been successful over several years and look for factors they have in common. I might call my book “Building excellent businesses that last”. As you surely know, there have been several bestselling books based on this premise. Unfortunately, they age like milk; it turns out that most of the companies these books identify as winners subsequently performed poorly, which is regression to the mean. The problem is, other factors may have contributed to these businesses’ success, for example, the competitive environment, new product innovation, a favorable economy, and so on. Any factors I derive from commonalities between today’s winning companies are just like an analysis of the common factors of lottery winners. By focusing on (current) winners, the door is open to survivor bias [Shermer, Jones].
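Regression to the mean is easy to demonstrate with a simulation. Here’s a minimal sketch (my own toy model, not data from any of these books): each company’s yearly result is a persistent ‘skill’ term plus one-off luck, we pick the year-one top performers the way a business book would, and we watch their year-two average fall back toward the pack.

```python
import random
from statistics import mean

random.seed(0)

N = 1000
skill = [random.gauss(0, 1) for _ in range(N)]       # persistent quality
year1 = [s + random.gauss(0, 2) for s in skill]      # result = skill + luck
year2 = [s + random.gauss(0, 2) for s in skill]      # fresh luck next year

# select the year-one "winners", business-book style
winners = sorted(range(N), key=lambda i: year1[i], reverse=True)[:50]

y1 = mean(year1[i] for i in winners)
y2 = mean(year2[i] for i in winners)
print(f"winners in year 1: {y1:.2f}, same companies in year 2: {y2:.2f}")
# the year-two average sits much closer to the population mean of zero:
# much of what put the winners on top was luck, and luck doesn't persist
```

The winners keep some edge in year two (they do have above-average skill on average), but most of their apparent excellence evaporates.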

The most famous example of survivor bias is Wald and the bombers. It is a little cliched to tell the story, but it’s such a great story, I’m going to tell it again, but my way.

Abraham Wald (1902-1950) was a great mathematician who made contributions to many fields, including econometrics, statistics, and geometry. A Hungarian Jew, he suffered discrimination and persecution while looking for work in Vienna, Austria in 1938, and so emigrated with his family to New York. During World War II, he worked in the Statistical Research Group at Columbia University. This is where he was given the task of improving bomber survivability: where should armor go to best protect bombers, given that armor is heavy and not everywhere can be protected [Wallis]?

Not all bombers came home after bombing runs over Nazi-occupied Europe. Nazi fighter planes attacked bombers on the way out and the way back, and of course, they shot down many planes. To help his analysis, Wald had data on where surviving planes were hit. The image below is a modern simulation of the kind of data he had to work with; to be clear, this is not the exact data Wald had, it’s simulated data. The visualization shows where the bullet holes were on returning planes. If you had this data, where would you put the extra armor to ensure more planes came home?

(Simulated data on bomber aircraft bullet holes. Source: Wikipedia - McGeddon, License: Creative Commons)

Would you say, put the extra armor where the most bullet holes are? Doesn’t that seem the most likely answer?


This is the most famous example of survivor bias - and it’s literally about survival. Wald made the reasonable assumption that bullets would hit the plane randomly; remember, this is 1940s technology and aerial combat was not millimeter precision. This means the distribution of bullet holes should be more or less even across planes. The distribution he saw was not even - there were few bullet holes in the engine and cockpit - but he was looking at surviving planes. His conclusion was that planes hit in key places did not survive. Look at the simulated visualization above - did you notice the absence of bullet holes in the engine areas? If you got hit in an engine, you didn’t come home. This is the equivalent of the dog that didn’t bark in the night. The conclusion was, of course, to armor the places where there were no bullet holes.
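You can reproduce the logic of Wald’s argument with a toy simulation (my own sketch; the sections, area weights, and hit counts are invented, not historical). Hits land uniformly by area, planes hit in a vital section don’t come home, and we only get to count the holes on planes that return:

```python
import random
from collections import Counter

random.seed(1)

SECTIONS = ["fuselage", "wings", "engines", "cockpit"]
AREA = [0.45, 0.35, 0.15, 0.05]      # hits land uniformly by area
FATAL = {"engines", "cockpit"}       # assumption: a hit here downs the plane

holes_on_survivors = Counter()       # all the analyst could observe
holes_on_all_planes = Counter()      # what we'd see with perfect information

for _ in range(10_000):                                  # 10,000 sorties
    hits = random.choices(SECTIONS, weights=AREA, k=5)   # 5 hits per plane
    holes_on_all_planes.update(hits)
    if not FATAL.intersection(hits):                     # only these come home
        holes_on_survivors.update(hits)

# The fatal sections were hit plenty of times -- those planes just never
# returned, so the survivors show no engine or cockpit holes at all.
print(holes_on_survivors)    # engines and cockpit: 0
print(holes_on_all_planes)   # engines and cockpit: thousands of hits
```

The gap in the survivors’ data is exactly where the armor should go, which is the opposite of what a naive reading of the bullet-hole map suggests.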

A full appreciation of survivor bias will mean you're more skeptical of many self-help books. A lot of them proceed on the same lines: let's take some selected group of people, for example, successful business people or sports people, and find common factors or habits. By implication, you too can become a winning athlete or business person or politician just by adopting these habits. But how many people had all these habits or traits and didn't succeed? All Presidents breathe, but if you breathe, does this mean you'll become President? Of course, this is ludicrous, but many self-help books are based on similar assumptions, it's just harder to spot.

Survivor bias also manifests itself on the web, in e-commerce. Users visit websites and make purchases or not. We can view those who make purchases as survivors. One way of increasing the number of buyers (survivors) is to focus on their commonalities, but as we’ve seen, this can give us biased results, or even the wrong result. A better way forward may be to focus on the selection process (the web page) and understand how it’s filtering users; in other words, understanding why people didn’t buy.

One of the things I like about statistics in business is that correctly applying what seems like esoteric ideas can lead to real monetary impact, and survivor bias is a great example.

Wednesday, April 1, 2020

A sorry tale of software quality: what went wrong and some lessons

I’m going to tell you a story about a software quality drive that went horribly wrong and what we can learn from it. To protect the guilty, I’ve disguised some details, but the main points are all true.

(The Antares launch failure. Image credit: Nasa. Source: Wikimedia Commons. Public domain image.)

I worked for an organization that got the quality bug. They decided that the cause of the software failures the company had experienced was poor quality processes, so in response, we were going to become a world leader in software quality. To do this, we (the development team) were all going to be certified to software quality standard ABC (not the standard’s name). The project was going to be completed in a year and we were going to see astounding benefits.

The company hired a number of highly paid contractors to advise us, most of whom were well-qualified. Unfortunately, many of the contractors showed two major personality traits: arrogance and condescension. Instead of creating a cooperative approach, they went for a command and control style that was disastrous. I heard there were multiple complaints about how they treated my colleagues.

Not all the contractors were well qualified. In one case, the person hired to create and manage software processes had never worked in software before and had never written any software. As you might expect, they ended up proposing weird quality metrics, for example, counting the number of version control updates (presumably, more updates meaning better software? It was never defined).

Everyone was trained multiple times, no expense was spared on training, and the sky was the limit for spending time on processes. We went to very, very long training sessions about once a month. These often included long descriptions of the benefits of the process and the fantastic results we might see at the end of it. In one notable case, the presenter (an external contractor) was caught out showing 'expected' results as real results - but our management team were true believers, so the presenter got away with it.

As we started to put together processes, we were encouraged to meet once a week to discuss quality, which everyone did. We faithfully started to implement the processes proposed by the consultants and as customized by our teams. Most of these processes were a bit silly, didn’t really add any quality, and took time, but we implemented them anyway because that's what the leadership team wanted. The weekly meetings started to get longer and ended up taking a whole morning.

A few brave people suggested that we should have metrics that measured project deliverables, deadlines, and outages, but the ABC quality standard people wanted us all to focus on measuring the process, not the outcome, so that’s what we did.

Despite all these efforts, the failures kept on coming. The one thing that did change noticeably was that deliverables were now taking twice as long. Executive leadership was unhappy.

As the quality project started to lose support and fail, the leadership tried to change their approach from stick to carrot. One of the things they did was hold a contest for people to suggest 'fun' alternative words for the project acronym, if the project was ABC, an alternative meaning might be 'A Big Cheese'. Unfortunately, a lot of the entries were obscene or negative about the project.

Eventually, the inevitable happened. Executive leadership saw the project was going nowhere, they did a management reshuffle and effectively killed the quality drive. The contractors were shown the door. Everything went back to normal, except the promoters of ABC were no longer in senior positions. The standard ABC became a dirty word and people stopped talking about quality improvements; sadly, the whole ABC debacle meant that discussions of more sensible process improvements were taboo.

The core problem was, the process became about getting certified in ABC and not about reducing defects. We had the wrong goal.

I’ve had some time to think about how this all played out and what I would have done to make it a success.

  • Start small-scale and learn. I would have started with one team and measured the results very carefully. People talk to each other, so success with one team would have been communicated virally. To oversimplify: If it can’t work with one team, it can’t work with many.
  • Scale up slowly.
  • Bring in contractors, but on a one-off basis and clearly as advisors. Any sign of arrogance and they would be removed. Contractors should be available on a regular basis to ensure change is permanent.
  • Focus on end results metrics, not processes. If the goal was to reduce defects, then that’s the metric that should have been measured and made public. 
  • Stay flexible: change processes in response to increased knowledge.
  • No certification until the target is achieved - if certification is an option, then inevitably certification becomes the goal. Certification is a reasonable goal, but only once the targets have been reached.

I’m not against quality standards, I think they have an important role. But I do think they need to be implemented with great care and a focus on people. If you don’t do it right, you’ll get what you asked for, but not what you need.

Tuesday, March 24, 2020

John Snow, cholera, and the origins of data science

The John Snow story is so well known, it borders on the cliched, but I discovered some twists and turns I hadn't known that shed new light on what happened and on how to interpret Snow's results. Snow's story isn't just a foundational story for epidemiology, it's a foundational story for data science too.

(Image credit: Cholera bacteria, CDC; Broad Street pump, Betsy Weber; John Snow, Wikipedia)

To very briefly summarize: John Snow was a nineteenth-century doctor with an interest in epidemiology and cholera. When cholera hit London in 1854, he played a pivotal role in understanding cholera in two quite different ways, both of which are early examples of data science practices.

The first way was his use of registry data recording the number of cholera deaths by London district. Snow was able to link the prevalence of deaths to the water company that supplied water to each district. The Southwark & Vauxhall water company sourced their water from a relatively polluted part of the river Thames, while the Lambeth water company took their water from a relatively unpolluted part of the Thames. As it turned out, there was a clear relationship between drinking water source and cholera deaths, with polluted water leading to more deaths.

This wasn't a randomized control trial, but was instead an early form of difference-in-difference analysis. Difference-in-difference analysis was popularized by Card and Krueger in the mid-1990's and is now widely used in econometrics and other disciplines. Notably, there are many difference-in-difference tutorials that use Snow's data set to teach the method. 

I've reproduced one of Snow's key tables below, the most important piece is the summary at the bottom comparing deaths from cholera by water supply company. You can see the attraction of this dataset for data scientists, it's calling out for the use of groupby.
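In that spirit, here's a minimal plain-Python sketch of the group-by (with pandas it's a one-line `groupby`). The district rows below are invented for illustration, though the company totals are close to Snow's widely quoted figures of roughly 315 versus 37 deaths per 10,000 houses:

```python
from collections import defaultdict

# one row per district: (water company, houses supplied, cholera deaths)
# the district split is invented; check Snow's original table for real rows
districts = [
    ("Southwark & Vauxhall", 20_000, 640),
    ("Southwark & Vauxhall", 20_046, 623),
    ("Lambeth",              13_000,  50),
    ("Lambeth",              13_107,  48),
]

totals = defaultdict(lambda: [0, 0])     # company -> [houses, deaths]
for company, houses, deaths in districts:
    totals[company][0] += houses
    totals[company][1] += deaths

for company, (houses, deaths) in totals.items():
    rate = 10_000 * deaths / houses      # normalize by houses supplied
    print(f"{company}: {rate:.1f} deaths per 10,000 houses")
```

Normalizing by houses supplied is the crucial step; the raw death counts alone would mostly reflect how big each company's territory was.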

The second way is a more dramatic tale and guaranteed his continuing fame. In 1854, there was an outbreak of cholera in the Golden Square part of Soho in London. Right from the start, Snow suspected the water pump at Broad Street was the source of the infection. Snow conducted door-to-door inquiries, asking what people ate and drank. He was able to establish that people who drank water from the pump died at a much higher rate than those who did not. The authorities were desperate to stop the infection, and despite the controversial nature of Snow's work, they listened and took action; famously, they removed the pump handle and the cholera outbreak stopped.

Snow continued his analysis after the pump handle was removed and wrote up his results (along with the district study I mentioned above) in a book published in 1855. In the second edition of his book, he included his famous map, which became an iconic data visualization for data science. 

Snow knew where the water pumps were and knew where deaths had occurred. He merged this data into a map-bar chart combination; he started with a street map of the Soho area and placed a bar for each death that occurred at an address. His map showed a concentration of deaths near the Broad Street pump.

I've reproduced a section of his map below. The Broad Street pump I've highlighted in red and you can see a high concentration of deaths nearby. There are two properties that suffered few deaths despite being near the pump, the workhouse and the brewery. I've highlighted the workhouse in green. Despite housing a large number of people, few died. The workhouse had its own water supply, entirely separate from the Broad Street pump. The brewery (highlighted in yellow) had no deaths either; they supplied their workers with free beer (made from boiled water).

(Source: adapted from Wikipedia)

I've been fascinated with this story for a while now, and recent events caused me to take a closer look. There's a tremendous amount of this story that I've left out, including:

  • The cholera bacteria and the history of cholera infections.
  • The state of medical knowledge at the time and how the prevailing theory blocked progress on preventing and treating cholera.
  • The intellectual backlash against John Snow.
  • The 21st century controversy surrounding the John Snow pub.

I've written up the full story in a longer article, which you can get from my website.

Wednesday, March 18, 2020

Contributing to open-source software

I’ve been using open-source software packages for several years and always felt like a bit of a freeloader; I took, but I never gave back. My excuse was, I didn’t have time to dig into the codebase and familiarize myself with the project's ways of working. But recently, I found easier ways to contribute, and I have been.

(Image credit: Old Book Illustrations)

The first way I found is raising bugs. I’ve pushed open-source software quite hard and found bugs in Pandas and Bokeh. Both of these projects have GitHub pages and both of them have pages to report bugs. If you’re going to report a bug, here are some rules to follow:

  • Make sure you’re using the most up-to-date version of the software.
  • Make sure your bug hasn’t been raised before.
  • Provide a simple example to duplicate the bug.
  • Follow the rules for reporting bugs - especially with regard to formatting your report, the heading you use, and any tags.

The open-source community has quite rightly been criticized for occasional toxic behavior, some of which has come from software users. I’ve seen people raise bugs and been quite forceful in their criticisms of the software they’re freely using. Ultimately, open-source software is a volunteer effort and people don’t volunteer to face some of the nastiness I’ve seen. The onus is on you to remain courteous and professional, and part of that is taking the short amount of time to follow the rules. A little kindness and consideration goes a long way.

For reference, here are some bug reports I’ve raised:

The second way to contribute is by suggesting new functionality. This is a little harder because it takes more consideration to make sure what you’re suggesting is relevant and hasn’t been suggested before. Once again, I strongly advocate that you find out what the rules are for requesting new functionality. If possible, I suggest you include a mock-up of what you’re suggesting.

For reference, here are some suggestions I’ve raised:

The final way of contributing is to build a project that uses open-source technology, share it via GitHub (or the alternatives), and notify the community of your project. Bokeh has a nice showcase section on its Discourse server where you can see what people have built. Seeing what others have built is a great way to get inspiration for your own projects.

For reference, here’s a showcase project I made available for Bokeh.

On the whole, I’ve been very pleased with the response of the developer communities to my meager contributions. Most of my bug reports and suggestions have been acted on within a few months, which contrasts with my experience with paid-for software, where there often isn’t even a forum to view bugs or make suggestions.

If you’re a user of open-source software, I urge you to contribute in any way you can. We’re all in this together.

Saturday, March 14, 2020

Niche knowledge and power - knowledge hoarding

A couple of times in my career, I’ve come across people using a strategy to gain short-term power: keeping knowledge and skills to themselves, otherwise known as knowledge hoarding. Unfortunately for them, it no longer works in the long term. I’m going to start with some examples, then suggest how you might tell the difference between an area that’s genuinely hard and one where someone’s knowledge hoarding, before finally giving you some suggestions on what to do if you find it on your team.

(Keeping knowledge to yourself is like caging kittens. Image credit: Chameleon,  source: Wikimedia Commons, License: GNU Free Documentation)

I worked with someone who had developed some in-depth knowledge of a particular technology. I needed his help with a project and I needed his in-depth knowledge. He wouldn’t share what he knew, claiming that the technology was highly complex and difficult to understand. He insisted that he had to do the work if it was done at all, and that I needed to tell his manager how valuable he was. I later heard from others in the organization that he’d taken the same approach with them and that they’d caved in to him. Some managers started to believe that the technology really was as complex as he said. Fortunately, I knew enough to get started without him. After some diligent Internet searching, I found what I needed and completed my project without his assistance. Unfortunately for my colleague, not too long after this, a couple of books were published on the technology, and it turned out to be much more straightforward than he’d claimed. His unique knowledge disappeared, and his boast of enhanced value to the company evaporated within a few months. His career subsequently stalled; he was relying on his ring-fenced knowledge to give him an advantage, and his prior behavior came back to haunt him.

Much later in my career, when I was older and wiser, I came across something similar. Someone two years out of college was working on a commercial tool we’ll call X. X had a sort-of programming language that enabled customization. The recent graduate claimed that only he could understand the language and only he could make the changes the company needed. This time, I didn’t even bother pursuing it. I had my team completely bypass him using a different technology. Had the recent graduate been more open, I would have gladly included him on my project and he would have been cross-trained in other technologies; instead, he ended up leaving for a company that used X. The problem is, X has a very small market share (under 5%), and it’s shrinking.

Over the years, I’ve seen the same story play out a number of times. Someone has knowledge of a system (e.g. Salesforce, CRMs, cloud systems, BI, sales analytics, network cards) and claims the area is too complex or difficult for others to understand and that they need to be the point person. But it always turns out not to be true. The situation resolves itself through the person leaving, a reorganization, a technology change, or something else - it always turns out that the person was never as vital as they claimed. Ring-fencing knowledge to protect your position seems to work in the short term, but it fails spectacularly in the long run.

Of course, there are skills that are difficult to acquire and do provide a barrier to entry into an area. Good examples are statistical analysis, machine learning, and real-time system design. What’s noticeable about all of these areas is the large amount of training content freely and easily available. If you want to learn statistics, there are hundreds of online courses and books you can use. The only impediment is your ability to understand and apply theory and practice.

As a manager, how do you know if someone is ring-fencing knowledge to protect their position versus the area actually being hard? Here are the signs they might be ring-fencing:

  • Claims that only they can understand the technology.
  • Knowledge hoarding and a refusal or reluctance to share.
  • A refusal or reluctance to brief or train others - or doing it very badly.
  • No formal qualification in the area (though not all areas have formal qualifications) and no formal background (e.g. a degree in software engineering).
  • Other groups, inside or outside the company, not reporting that the area is hard.

Here are signs the area is actually hard:

  • Other, similar groups in the company or elsewhere reporting that the area is hard.
  • Online commentary that the technology is hard.
  • Lots of online content to help people learn.
  • Definite technical requirements, like the ability to understand number theory.
  • Obvious qualifications, e.g. network engineering certification.

From a management perspective, the best thing to do is stop knowledge-hoarding behavior before it starts. Ideally, there shouldn’t be a single point of failure on your team - in other words, a bus factor of more than 1. This means consciously focusing on cross-training, something the military does very well. If you inherit someone showing this behavior, you need to make cross-training a priority and personally intervene to make sure it’s done properly. Cross-training will mean some loss of status for the person, which you need to be sensitive to and manage well.

For some people, keeping skills and knowledge to themselves makes perfect sense: it’s a great way to enhance their value to their employer. For their colleagues, it’s not good behavior, and for their manager, it can be disastrous. For (almost) everyone’s sake, you should deal with it if it’s happening in your organization.