Building a Soviet Nail Factory: how KPIs killed efficiency
Vincent Bernat
4-minute read
Also available in
In 2008, I landed my second job, in the network team at Orange Portails, the division behind the websites and search engine of the French telecom operator Orange. The place ran like clockwork: a comprehensive technical setup, a dedicated team for every part of the business, and room to focus on what I do best. A few years later, none of that mattered: thanks to an obsession with the numbers, we could no longer deliver new services on time.
Disclaimer
This is a story I like to tell to warn people about Goodhart’s law.1 As these events happened almost 15 years ago, my recollection is a bit fuzzy. I left in 2012.
The first years#
During my first years, the department operated like a startup. Its cradle was the French company Echo. They built a search engine. France Télécom bought it and renamed it Voila. It was the most visited search engine in France in the early 2000s. France Télécom consolidated the portal activities into the Wanadoo Portails division, later renamed Orange Portails.
The technical environment was excellent. We had many internal tools:2 a ticket system, an RRD-based graphing tool, an IPAM, a reporting tool, and an SNMP-based alerting tool.3 We deployed our Linux servers with CFEngine. We installed systems and applications from internal Debian repositories. We documented everything in a private MediaWiki instance. Supervision was performed with an ancestor of Xymon. The network architecture was clean and scalable with little legacy. We onboarded new people in a day.
It was a nurturing environment for me. I developed several tools: lldpd, an 802.1AB implementation, Snimpy, a pythonic binding for Net-SNMP, Wiremaps, a layer-2 discovery tool with a time machine to know which device is connected where, Kitérő, a tool to simulate network conditions, QCSS-3, a controller for load-balancers, and ipoo, a service available through a Jabber chatbot and a Greasemonkey script to expose IP-related information. I added SNMP support for Keepalived and Quagga. I also started this blog, with articles like “Anycast DNS,” TLS-related articles like “TLS computational DoS mitigation,” SNMP-related articles like “Integration of Net-SNMP into an event loop,” Linux-related articles like “Tuning Linux IPv4 route cache,” and an article about VXLAN long before it was cool.
The collapse#
When we needed new servers, the on-site team would take a set from the inventory, install our base Linux distribution on them, put them in the datacenter, and cable them to the top-of-the-rack switches. We opened a ticket describing the servers we needed, and one week later, our servers were available. 💫
Orange wanted to know if this team was performing well, so they asked for KPIs. They decided to use the number of tickets completed in a year. They asked to double this number. So instead of one ticket for a new service, we would open six tickets—one per server. By the end of the year, the KPIs had more than doubled.
Everybody saw it as a success for performance management. So, they asked to do the same for the next year. Now, we needed to open a ticket per server and per step. Again, the KPIs doubled. Behind the scenes, the tickets went to different people and were no longer handled in order. So, for the next year, it was decided to have meta-tickets and meetings to follow the progress of these tickets. Of course, all these extra steps pushed the KPI even higher.
This performance management method spread to the other teams.4 Everything became slower. Instead of a couple of weeks, a new service now took six months. We built a Soviet nail factory. But the KPIs were good, and we stopped caring.
Let me give you another example. We had to estimate the impact of each night operation. We weren’t half bad: we declared most operations “without any expected impact.” Most of the time, there was no impact. One time out of five, there was a 5-second impact. We were told to try harder to meet our expected impact. What did we do? We started declaring a 5-second expected impact. One day, we got a 30-second impact and were told we failed to match the expected impact. In the end, we declared most operations with a 10-minute expected impact, and we stopped caring: instead of carefully shifting traffic around, we allowed ourselves a 5-minute impact. And our KPIs were never better.
KPIs are not bad, but they are easy to break. Use them carefully: let the people doing the work help choose the metrics, and tie those metrics to the quality of the service—for example, with service level objectives. Otherwise, even dedicated people stop caring, game the system, and eventually quit. 📊