Defining, Measuring, and Managing Technical Debt at Google
In 2023, Ciera Jaspan and Collin Green, as part of the Engineering Productivity Research Team at Google, published a paper titled Defining, Measuring, and Managing Technical Debt based on five years of research on tech debt across different teams in their company.
In this article, I will describe the most interesting findings from that paper and how you can apply them at your company to define, measure, and manage technical debt.
Methodology
Before the team designed their survey, they interviewed a number of subject matter experts at the company to understand what those experts perceived as the main components of technical debt:
"We took an empirical approach to understand what engineers mean when they
refer to technical debt. We started by interviewing subject matter experts
at the company, focusing our discussions to generate options for two survey
questions: one asked engineers about the underlying causes of the technical
debt they encountered, and the other asked engineers what mitigation would
be appropriate to fix this debt. We included these questions in the next
round of our quarterly engineering survey and gave engineers the option to
select multiple root causes and multiple mitigations. Most engineers selected
several options in response to each of the items. We then performed a factor
analysis to discover patterns in the responses, and we reran the survey the
next quarter with refined response options, including an “other” response
option to allow engineers to write in descriptions. We did a qualitative
analysis of the descriptions in the “other” bucket, included novel concepts
in our list, and iterated until we hit the point where <2% of the engineers
selected “other.” This provided us with a collectively exhaustive and
mutually exclusive list of 10 categories of technical debt."
As you can read, this was an iterative approach focused on narrowing down the concept of technical debt into a well-defined set of categories.
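To make the factor analysis step more concrete, here is a minimal sketch in Python. The response matrix and the root-cause names are fabricated for illustration (they are not Google’s survey data); the point is only to show how multi-select answers can be reduced to a smaller set of underlying categories.

```python
# Minimal sketch: factor analysis over multi-select survey responses.
# The data and option names below are fabricated; this is NOT Google's survey data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Each row is one engineer; each column is 1 if they selected that root cause.
root_causes = ["unclear_ownership", "missing_tests", "old_dependency", "rushed_prototype"]
responses = np.array([
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
])

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(responses)

# The loadings show which options tend to be selected together, which is what
# lets you collapse many response options into a handful of categories.
for cause, loadings in zip(root_causes, fa.components_.T):
    print(f"{cause:>20}: {np.round(loadings, 2)}")
```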
Technical Debt Categories
The 10 categories of technical debt that they detected were:
Migration is needed or in progress
This might be related to architectural decisions that were made in the past, which worked fine for a while, but then eventually started causing problems.
"This may be motivated by the need to scale, due to mandates, to reduce
dependencies, or to avoid deprecated technology."
You could think of this as an integration with a third-party service that is no longer maintained or improved. The team knows they will need to switch to a different service, but they haven’t had the time yet to execute the migration.
Documentation on project and application programming interfaces (APIs)
This might be related to documentation that is no longer up to date. When documentation is not regularly read, exercised, and improved, it tends to fall out of date quickly.
"Information on how your project works is hard to find, missing or incomplete, or may include documentation on APIs or inherited code."
Every project has some sort of documentation. In the most basic format, it could be a README.md file in the project that tells you how to properly set up the application for development purposes.
Testing
"Poor test quality or coverage, such as missing tests or poor test data,
results in fragility, flaky tests, or lots of rollbacks."
Even at Google, teams complain about the lack of tests, the flakiness of test suites, and test cases that do not cover important edge cases.
This means that having a test suite is not enough. The tests have to be stable, they have to be thorough, and they have to help your team avoid regressions.
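As a hypothetical illustration (not from the paper), here is what a flaky test often looks like in practice. The first version depends on wall-clock time, so it passes or fails depending on when the suite happens to run; the second pins the input down so it always exercises the same behavior.

```python
# Hypothetical example of a flaky test and a stable rewrite (pytest style).
from datetime import datetime, timezone

def is_business_hours(now: datetime) -> bool:
    """Returns True between 09:00 and 17:00 UTC."""
    return 9 <= now.hour < 17

# Flaky: the assertion depends on when the test suite happens to run.
def test_business_hours_flaky():
    assert is_business_hours(datetime.now(timezone.utc))

# Stable: the inputs are pinned, so the test always checks the same behavior.
def test_business_hours_stable():
    assert is_business_hours(datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc))
    assert not is_business_hours(datetime(2024, 1, 15, 20, 0, tzinfo=timezone.utc))
```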
Code quality
"Product architecture or code within a project was not well designed. It may
have been rushed or a prototype/demo."
We have all been in this situation: an initial experiment, prototype, or demo is successful, and we prioritize features and patches before taking a moment to adjust its architecture.
Improving the architecture of the product becomes something that will be done at some point down the line, but that moment never comes. It usually needs non-technical manager buy-in before it can happen.
Dead and/or abandoned code
"Code/features/projects were replaced or superseded but not removed."
Every now and then pieces of code become unreachable, which can create a false sense of complexity. Modules might seem too big and complex, but maybe only half of that code is actually getting used.
There are open source tools out there to help you remove dead code, but doing this takes time. Teams that report these issues often do not have time to stop and remove dead code before they continue shipping features and patching bugs.
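For example, vulture does this for Python code. Conceptually, these tools look for definitions that are never referenced. Here is a heavily simplified sketch of that idea (the module being analyzed is made up); real tools are far more thorough.

```python
# Simplified sketch of a dead-code detector: flag top-level functions that are
# defined in a module but never referenced anywhere else in it.
import ast

SOURCE = """
def used_helper():
    return 42

def forgotten_helper():  # defined but never called
    return 0

def main():
    return used_helper()

main()
"""

tree = ast.parse(SOURCE)
defined = {node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)}
referenced = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}

for name in sorted(defined - referenced):
    print(f"possibly dead: {name}()")  # prints: possibly dead: forgotten_helper()
```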
Code degradation
"The code base has degraded or not kept up with changing standards over time.
The code may be in maintenance mode, in need of refactoring or updates."
This might be related to a change in one of the core dependencies of your application (e.g. React.js), which now expects new code to be written using functions instead of classes.
Open source moves fast. Using one library (e.g. Angular.js) or another (e.g. React.js) will save us time when we are starting a new project. However, the team behind these libraries can decide to change the entire interface and core concepts from one major release to the next.
No matter what library or framework you choose, this will happen. The key to avoiding this problem is to adapt your code, quickly or gradually, to comply with the new way of doing things.
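This dynamic exists in every ecosystem. As an analogous Python example (mine, not the paper’s): pandas deprecated and eventually removed DataFrame.append, so code written against the old interface degrades until it is migrated to the current API.

```python
# Analogous example: pandas removed DataFrame.append (deprecated in 1.x,
# removed in 2.0), so code written the "old way" breaks on newer releases
# until it is adapted to the current API.
import pandas as pd

orders = pd.DataFrame({"id": [1, 2], "total": [10.0, 25.5]})
new_order = pd.DataFrame({"id": [3], "total": [7.25]})

# Old way (no longer works on pandas 2.x):
# orders = orders.append(new_order, ignore_index=True)

# Current way: build the same result with pd.concat.
orders = pd.concat([orders, new_order], ignore_index=True)
print(orders)
```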
Team lacks necessary expertise
"This may be due to staffing gaps and turnover or inherited orphaned
code/projects."
Depending on the job market, key contributors to a codebase might leave for other companies (or other teams within the same company), which creates a vacuum in the existing team.
If teams don’t take the necessary precautions, there may be gaps where a team is waiting for the next senior hire while still being expected to ship features and patches to production.
Dependencies
"Dependencies are unstable, rapidly changing, or trigger rollbacks."
Once again, open source moves fast. Tools like Dependabot or Depfu can help you stay up to date, but they are only good for minor releases. Upgrading a major release of a framework (e.g. Rails) can take days, weeks, or even several developer-months.
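The underlying check these tools automate is simple. Here is a simplified sketch of the idea (an illustration only, not how Dependabot or Depfu actually work internally) that compares locally installed package versions against the latest releases on PyPI.

```python
# Simplified sketch of an "is this dependency outdated?" check.
# This illustrates the idea only; real tools handle lockfiles, semver ranges,
# changelogs, and automated pull requests.
import json
from importlib.metadata import version
from urllib.request import urlopen

def latest_pypi_version(package: str) -> str:
    # PyPI exposes package metadata as JSON at this endpoint.
    with urlopen(f"https://pypi.org/pypi/{package}/json") as response:
        return json.load(response)["info"]["version"]

# Assumes these packages are installed in the current environment.
for package in ["requests", "flask"]:
    installed = version(package)
    latest = latest_pypi_version(package)
    status = "up to date" if installed == latest else f"outdated (latest is {latest})"
    print(f"{package}: {installed} -> {status}")
```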
Non-trivial upgrades usually get postponed for a better time. Oftentimes, that better time never comes. We have seen this firsthand with our productized services:
- UpgradeJS: We help teams upgrade their React Native, React, Vue, or Angular applications.
- FastRuby.io: We help teams upgrade their Ruby & Rails applications. We have invested over 30,000 developer-hours upgrading applications!
We have built a couple of profitable services on top of this particular issue, so we know that even the best teams struggle to keep up. It’s not because they don’t want to upgrade, it’s because other priorities get in the way.
Migration was poorly executed or abandoned
"This may have resulted in maintaining two versions."
This might happen due to a combination of the previous issues. The team started a migration project, but then an emergency forced them to shift focus, and that focus never returned to the migration.
Another potential scenario is when a team expects certain promises to hold after a migration and then realizes that they won’t. Rolling back the migration might sit on the back burner for months before it actually happens.
Release process
"The rollout and monitoring of production needs to be updated, migrated, or
maintained."
This might be related to the way the software development lifecycle is managed. In the past we have encountered teams that deploy to production only once a month (due to environmental factors), which causes unnecessary friction.
As much as we enjoy being an agile software development agency, every now and then we have to work with clients who are not deploying changes to production every week. This is very often the case with our clients in highly regulated industries (e.g. finance, national security, or healthcare).
Measuring Technical Debt
Google’s Engineering Productivity Research Team explored different ways to use metrics to detect problems before they happened:
"We sought to develop metrics based on engineering log data that capture the presence of technical debt of different types, too. Our goal was then to figure out if there are any metrics we can extract from the code or development process that would indicate technical debt was forming *before* it became a significant hindrance to developer productivity."
They decided to focus on three of the 10 types of technical debt: code degradation, teams lacking expertise, and migrations being needed or in progress.
"For these three forms of technical debt, we explored 117 metrics that were proposed as indicators of one of these forms of technical debt. In our initial analysis, we used a linear regression to determine whether each metric could predict an engineer’s perceptions of technical debt."
They also put all of their candidate metrics into a random forest model to see whether the combination of metrics could forecast developers’ perception of tech debt.
Unfortunately, their results were not positive:
"The results were disappointing, to say the least. No single metric predicted reports of technical debt from engineers; our linear regression models predicted less than 1% of the variance in survey responses."
This might be related to the way developers envision the ideal state of a system, process, or architecture, and perhaps also to how difficult it is to estimate how bad the situation is now and how bad it will be at the end of the quarter (when the quarterly surveys are answered).
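To make the kind of analysis they describe more concrete, here is a minimal sketch with fabricated data: hypothetical log-derived metrics on one side, a survey-style rating of technical debt hindrance on the other. Because the data here is random, the models find essentially no signal, which happens to mirror the weak results the paper reports for real metrics.

```python
# Minimal sketch of the analysis described above, with fabricated data:
# can log-based metrics predict survey-reported technical debt hindrance?
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_engineers = 500

# Hypothetical per-engineer metrics extracted from logs (names are made up).
metrics = np.column_stack([
    rng.poisson(5, n_engineers),      # e.g. build breakages last quarter
    rng.uniform(0, 1, n_engineers),   # e.g. share of files without an owner
    rng.normal(40, 10, n_engineers),  # e.g. median dependency age in months
])

# Survey answer: perceived hindrance from technical debt (1-5 Likert scale).
perceived_debt = rng.integers(1, 6, n_engineers)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    r2 = cross_val_score(model, metrics, perceived_debt, cv=5, scoring="r2").mean()
    print(f"{type(model).__name__}: mean R^2 = {r2:.3f}")
```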
Managing Technical Debt
As a way to help teams that struggle with technical debt, Google formed a coalition to “help engineers, managers, and leaders systematically manage and address technical debt within their teams through education, case studies, processes, artifacts, incentives, and tools.”
This coalition started efforts to improve the situation:
- Creating a technical debt management framework to help teams establish good practices.
- Creating a technical debt management maturity model and accompanying technical debt maturity assessment.
- Organizing classroom instruction and self-guided courses to evangelize best practices and community forums to drive continual engagement and sharing of resources.
- Building tooling that supports the identification and management of technical debt (for example, indicators of poor test coverage, stale documentation, and deprecated dependencies).
In my opinion, the most interesting effort of this coalition is defining a maturity model around technical debt. This is similar to CMMI (Capability Maturity Model Integration, a framework developed at Carnegie Mellon University’s Software Engineering Institute), which provides a comprehensive, integrated set of guidelines for developing products and services.
This defines a new way to approach the subject. Instead of relying on developers’ gut feelings and environmental factors, this maturity model has tracking at its core. This means there are measurable metrics that will play a key part in informing an engineering team’s decisions around technical debt.
Technical Debt Management Maturity Model
This model defines four different levels. From most basic to most advanced:
Reactive Level
"Teams with a reactive approach have no real processes for managing technical
debt (even if they do occasionally make a focused effort to eliminate it, for
example, through a “fixit”)."
In my experience, most engineering teams have the best intentions to make the right decisions, to ship good enough code, and to take on a reasonable amount of technical debt.
They understand that technical debt does not mean it is okay to ship bad code to production. They analyze the trade-offs of their decisions and they make their calls with that in mind.
Every now and then they will take some time (maybe a sprint or two) to pay off technical debt. When doing this, they usually address issues that they are familiar with because they’ve been hindered by those issues.
Non-technical leaders usually don’t understand the significance of taking on too much technical debt. They start to care once problems start popping up because of that debt. It might take a production outage, a security vulnerability, or extremely low development velocity to get them to react.
Proactive Level
"Teams with a proactive approach deliberately identify and track technical debt and make decisions about its urgency and importance relative to other work."
These teams understand that “if you can’t measure it, you can’t improve it,” so they have been actively identifying technical debt issues. They might have metrics related to the application, the development workflow, the release process, and/or churn vs. complexity in their codebase.
They understand that some of the metrics they’ve been tracking show potential issues moving forward. They might notice that their code coverage percentage has been steadily declining, which could signal a slippage in their testing best practices.
They care about certain metrics that might help them improve their development workflow, and they know that they need to inventory their tech debt before taking action. They also know that addressing some of these issues might improve their DORA metrics.
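As one hypothetical example of this kind of tracking, a proactive team could compute a couple of DORA metrics (deployment frequency and lead time for changes) straight from its deploy history. The data below is made up.

```python
# Hypothetical sketch: computing two DORA metrics from a team's deploy history.
from datetime import datetime
from statistics import median

# (commit timestamp, deploy timestamp) for each change that reached production.
deploys = [
    (datetime(2024, 3, 1, 9, 0),  datetime(2024, 3, 3, 15, 0)),
    (datetime(2024, 3, 4, 11, 0), datetime(2024, 3, 5, 10, 0)),
    (datetime(2024, 3, 7, 16, 0), datetime(2024, 3, 12, 9, 0)),
]

days_observed = (deploys[-1][1] - deploys[0][1]).days or 1
deploys_per_week = len(deploys) / days_observed * 7

lead_times_hours = [(deployed - committed).total_seconds() / 3600
                    for committed, deployed in deploys]

print(f"Deployment frequency: {deploys_per_week:.1f} deploys/week")
print(f"Median lead time for changes: {median(lead_times_hours):.1f} hours")
```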
Strategic Level
"Teams with a strategic approach have a proactive approach to managing technical debt (as in the preceding level) but go further: designating specific champions to improve planning and decision making around technical debt and to identify and address root causes."
These teams build on top of the previous level: they keep an inventory of technical debt issues and, for example, proactively address flaky tests in their test suite.
They might assign a champion to each of the issues they detect, and they likely know how to prioritize the list of technical debt issues and focus on the most pressing ones.
Structural Level
"Teams with a structural approach are strategic (as in the preceding level) and also take steps to optimize technical debt management locally—embedding technical debt considerations into the developer workflow—and standardize how it is handled across a larger organization."
Improving the situation is a team effort. Non-technical managers treat tech debt remediation like any other task in the sprint. They likely reserve a few hours of every sprint for paying off technical debt, and they standardize how technical debt is handled across the larger organization, not just within a single team.
Conclusion
After reading this paper, I wish the research team had shared more about the different maturity levels. I believe the software engineering community could greatly benefit from a “Technical Debt Management Maturity Model.”
The maturity model is also evidence that, while technical debt metrics may not be perfect indicators, they can allow teams who already believe they have a problem to track their progress toward fixing it.
The goal is not to have zero technical debt. It has never been the goal. The real goal is to understand the trade-offs, to identify what is and what is not debt, and to actively manage it to keep it at levels that allow your team to not be hindered by it.
Need help assessing the technical debt in your application? Need to figure out how mature you are when it comes to managing technical debt? We would love to help! Send us a message and let’s see how we can help!