Introduction

Current status

In progress: publishing a new chapter roughly every week.

What is this book about

This book is a brain dump of tech leadership problems and my proposed solutions.
It has no pretension to be complete or perfect. On the contrary, following Cunningham's law, I expect it to be challenged so that the best solutions can be found.

Who I am

My name is Matteo Di Tucci. I am a software engineer and tech lead at air up.

Teach me back

I really appreciate any feedback about the book and my current understanding of tech leadership. You can share any feedback by opening a GitHub issue or a pull request.

Tech leadership

As I was taught at Thoughtworks, the what of tech leadership is:

  • People cultivation
  • Predictable delivery
  • Stakeholder management
  • Tech vision
    • Cross-functional requirements
    • Target architecture
    • Tech principles
    • Tech radar

The how is:

Ways of working

Ways of working drive success

Objectives matter, but they’re not enough. Every company sets quarterly goals, yet only a few succeed.
What sets them apart? Effective ways of working.

Objectives are the steering wheel, but ways of working are the engine.

The three principles of ways of working

  • Focused work – Minimize distractions to get meaningful work done
  • Definition of done – Clearly define when something is truly complete
  • Short feedback loops – Get frequent feedback to improve quickly

These principles apply at every level, from company strategy to daily tasks.

Analytics dashboard as first user story

Problem

After a go-live, the delivery team finds measuring success harder than expected.
Building the analytics dashboard reveals gaps in tracking capabilities and overlooked external variables.

Context

When planning a new initiative, success criteria seem clear.
Usually, the initiative is part of a well-defined business objective with its quantified key results.
Everything looks straightforward because at this stage the initiative is considered in isolation.

Once live, things get harder.
The initiative is now part of a complex system with many variables affecting the same metrics.
For example, data tracking can be too coarse-grained, or an ongoing marketing campaign can skew the results.

Solution

The first user story of an initiative is to build the analytics dashboard. This dashboard encodes the success criteria of the initiative.
The benefits are:

  • Hard conversations start earlier, for example how to detect revenue cannibalization
  • Tracking gaps surface sooner, so they do not increase scope unexpectedly just before go-live
  • More trustworthy success criteria, as there is no retrofitting to what is measured after go-live

A/B testing

Sometimes the difficulty of defining the analytics dashboard makes it clear that an A/B test is needed.
Realizing this close to go-live, or worse after it, can be very costly.

Notes

Analytics dashboard for tech initiatives

Analytics dashboards are as fundamental for tech initiatives as for product ones.
In terms of tools, product analytics dashboards often take the form of something like Mixpanel, while tech dashboards of something like Datadog.

User behavior as success criteria

For product initiatives, the best success criteria are the ones proving a change in user behavior.
Generating revenue is just a side effect of changing user behavior, hopefully for the better.
On top of this, profitability needs to be accounted for as well.

Bankrupt tech debt

Problem

There is a board which is a graveyard of tech debt tickets.
Nobody knows what is in there or what is important to work on.

Context

  • The tech debt board is shared across multiple teams
  • Whenever tech debt is recognized, a new ticket is added to the board and forgotten
  • Periodically, some engineer tries to organize and prioritize the board, but it soon becomes stale again
  • All engineers are responsible, which means nobody is responsible

Solution

  • Each engineer picks one tech debt ticket and champions it to completion within the next 2 weeks
  • Declare bankruptcy on the tech debt board, as in: delete it and forget about it
  • From now on, when planning a new initiative, any relevant tech debt is identified and worked on as part of the initiative itself

When to work on tech debt

Tech debt can be tackled before, during or after the go-live of an initiative: it does not matter.
What matters is that a new initiative cannot start until all tech debt of the previous one has been covered.

Tech debt vs deadlines

It is fine to create new tech debt to respect a deadline. However, the tech debt must be recouped before starting a new initiative.

How much tech debt

Perfect is the enemy of good: we do not need to completely remove all the tech debt related to an initiative.
However, after an initiative is completed, we need to be in a better state than before starting it.
If the remaining tech debt is relevant, it will soon pop up again as part of a new initiative.

Important tech debt can't be lost

Do not fear losing track of important tech debt because of declaring bankruptcy.
If a piece of tech debt is important, it will become evident as part of the analysis of a new initiative.

Tech debt is not tech work

An initiative can be either product work or tech work.
Tech work is what is needed to keep systems operational, no matter if they are customer-facing or internal.
So tech work is not tech debt, and it should be delivered mixed with product work over any given interval of time (e.g. an 80/20 product/tech work split).

Discarded solutions

Quarter tech debt

Here tech debt is bankrupted at the end of each quarter.
We are back to the original problem. However, by setting an expiration date, we do not accumulate stale tech debt forever.
This might be a good intermediate step before adopting the proposed solution.

Tech debt board kept up to date

There is a recurring meeting where all engineers or at least the tech leads go through the tech debt board and prioritize it.
This approach is very expensive in terms of time and cognitive load as the board keeps growing through time.
This is better than just having a graveyard tech board where nobody is responsible.
However, the lack of an expiration date makes it an ever-increasing burden for the responsible engineers.

Communication Patterns in Remote Teams

Problem

How to communicate in a software delivery team to make sure:

  • Everybody is aligned
  • Feedback loop is short
  • People have focused time

Context

  • Cross-functional team
  • Full remote with overlapping time zones
  • Communication tools available:
    • Slack (chat)
    • Asana (project management tool)
    • Google Meet (synchronous meeting)
    • Gmail (email)

Solution

Slack

  • Urgent questions only
    • Examples: you are stuck and cannot do anything else without significant context switching
  • Avoid direct messages
    • Even for direct questions, use the team channel as this radiates information
    • Examples: clarify a requirement which is blocking a user story
  • No tagging
    • Use "@ person", "@ channel", "@ here" only for urgent topics
    • Examples: production incidents, you need an answer within an hour
    • Team members are expected to skim through the team channel every couple of hours to triage and reply
  • Channels follow a convention
    • [department]-[team-name]
    • Examples: #tech-bunny
  • Messages follow a convention
    • The first line is the summary
    • If asking a direct question, put people names on top
    • Examples: "[Go-live: countries timeline] Hey Nick, are we good to launch in Italy after France?"
  • Single source of truth
    • OK to have dedicated channels for roles, but some noise is better than overspecialized channels
    • Have a dedicated channel for external stakeholders to ask for help

Asana

  • Definition of done
    • Single source of truth for what needs to be done
    • Crystallize any decision-making
    • The above holds for both product and tech initiatives
  • Important, but not urgent
    • Use Slack for urgent communication
    • Team members are expected to skim through the notifications twice a day to triage and reply
    • Bring back decision-making from Slack and meetings to the Project Management tool
  • Every initiative has its own board
    • This board clarifies what has been done and what is left to do
    • Tickets are organized in 3 sections: problem definition, analysis, delivery
    • Delivery section can be broken down into multiple sections, one per go-live
  • Team delivery board
    • It visualizes the work in progress or soon to be picked up (1-2 weeks)
    • 4 columns: in-refinement, next, in-progress, done
    • When ready to be refined, a ticket from an initiative board is added to the team delivery board
    • Every ticket is assigned to a person
    • One person can't have more than 1 ticket in progress

Google Meet

  • Only for synchronous collaboration
    • Examples: initiative kick-off, retrospective
  • Use the right tool
    • Examples: Tuple for pair programming, Miro for drawing
  • Reduce recurring meetings
    • Examples: move user story refinement to async, is standup really needed?
  • Keep meetings short and focused
    • Smallest audience possible
    • Organizer is responsible for time keeping
    • Expected outcomes and agenda in meetings invites
    • Everybody's calendar is up to date, so anyone invited to a meeting can change the date-time if needed

Gmail

  • Only to communicate with people outside the company

Notes

Everybody needs focused time, managers too

  • Decision-making requires focus
  • Bad decisions by an engineering manager are more harmful than those made by an engineer
  • Sorry Paul Graham

Repetition is key

  • Have 1 goal at a time for the team
  • Use repetition in written and verbal communication to remind everyone what the current goal is
  • This simplifies decision-making and prioritization at any level

Credits

Thanks to David Swallow, who insisted on no direct Slack tagging in our team. We were skeptical; now we love it.

Next Quarter Planning

Problem

Leadership wants to know the delivery timeline of the team for the next quarter:

  • what will be delivered
  • when it will be delivered

Context

  • Different stakeholders push for different initiatives
  • The problem behind each initiative is clear, but the solution is not
  • Cross-functional team

Solution

  • Assign a time appetite for each initiative until the whole quarter is covered
  • At the end of an initiative time appetite, decide to either:
    • move to the next initiative, as planned
    • keep working on the current initiative and subtract the extra time appetite from an upcoming initiative

Time appetite

Time appetite is the number of weeks the team desires to work on an initiative.
Time appetite only covers delivery, not planning (see below).
The key idea is to deliver the best within a fixed time frame and then reassess global priorities.

Small go-lives

The key to the time appetite approach is small go-lives.
Small go-lives make it easy to decide whether to move to the next initiative or stay on the current one.
Even if an initiative has a small time appetite, something is always delivered to the stakeholders of interest.

Continuous Deployment helps a lot, but it is not a necessary condition. Small go-lives bring many nice side effects per se:

  • Faster customer feedback
  • Faster operational feedback (i.e. performance, bugs, observability, etc.)
  • Reduced work in progress and cognitive load

Planning while delivering

The planning of an initiative is done before its delivery.
By planning we mean:

  • problem definition
  • solution definition
  • spiking and resolving risks, assumptions, issues and dependencies
  • scope slicing (i.e. scope defined for each small go-live)

The planning is performed by

  • Product manager
  • UX designer
  • Engineer champion (i.e. the engineer who coordinates the others on the initiative)

The planning happens while delivering on another initiative:

  • be mindful of the extra work for the people involved in planning
  • the engineer championing the current initiative should not contribute to planning the next one

Time appetite vs estimates

Time appetite reverses the classic approach with estimates where:

  • by guessing effort, the team defines the scope to be delivered in the quarter for each initiative
  • the team is constantly late to deliver, or worse, quality is cut to respect the timeline
  • stakeholders are frustrated as their initiatives are late or nothing is delivered at all

Notes

My proposed solution is a lightweight version of the Shape Up approach from Basecamp.

No recurring meetings

Problem

Few team members really contribute to team ceremonies.
Most are passive or provide low value input.

Context

This problem applies to both remote and co-located teams.

Following are some examples of how the problem manifests itself:

  • Standup becomes a status update. People are peer-pressured to say something even if they have already aligned asynchronously
  • Demo either misses key stakeholders, or they provide disruptive feedback that leads to significant rework
  • User story refinement has 2 people discussing while all the rest sleep

Solution

Delete all recurring events from the team calendar.
Schedule meetings when needed.

Standup

If team members cannot do their work without standup, the team ways of working are broken.
Improve on the three principles of visualization of work, definition of done, and short feedback loops.

User story refinement

Assign one representative per role (e.g. product owner, UX designer, engineer) to each user story.
Let them refine the user story asynchronously.
They will set up on-demand meetings when they deem pairing necessary.

Demo

Schedule a demo meeting as soon as the team has something to show to get feedback.
Invite only relevant stakeholders.
Do not accumulate work in progress for a generic demo meeting.

Retrospective

This is the only recurring meeting worth having.
It is a safety net against the human arrogance of believing we are always the best version of ourselves.

Notes

Recurring meetings are not evil per se, and they are definitely helpful in chaotic environments. However, they are a crutch.
High-performing teams should have meetings that adapt to their rhythm, not the other way around.

Tech

The simplicity rules we learn for coding apply to systems design too.

Lightweight Business Configuration

Problem

An e-commerce site has some commercial configuration to update every now and then. For example: which products are in pre-sale, which products are under promotion, etc.

How to enable non-technical roles to update such configuration with minimal effort?

Context

  • Update frequency: once a day
  • Data to change: a handful of product SKUs or dates
  • Existing systems:
    • Front-end app
    • Product catalog back-end app
    • Headless content management system (third party)
  • Current path to prod:
    • Continuous Deployment
  • Current solution:
    • Configuration is stored in code as a handful of JSON files inside the front-end app
    • For any configuration change, a support ticket to engineers is created

Solution

Non-technical roles change the configuration autonomously using the GitHub web interface.

Security

Non-technical roles adhere to the same security practices as engineers (e.g. strong passwords, password manager, two-factor authentication, encrypted laptop filesystem, etc.).
Non-technical roles have write privileges on the front-end app repository, but not admin ones.

Testing

Unit tests over the JSON files guard against:

  • malformed JSON
  • content validation (e.g. pre-sale dates in the past)
  • business invariant violations (e.g. avoid duplicates)
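
As a sketch, such checks could live in a plain validation function that the unit tests run over each JSON file. The PresaleEntry shape and the validatePresaleConfig name are hypothetical assumptions, not the actual implementation:

```typescript
// Hypothetical pre-sale configuration entry: a product SKU and the
// date its pre-sale starts
type PresaleEntry = { sku: string; startDate: string }

// Returns a list of human-readable errors; an empty list means the file is valid
const validatePresaleConfig = (rawJson: string, today: Date): string[] => {
    let entries: PresaleEntry[]
    try {
        entries = JSON.parse(rawJson) // guards against malformed JSON
    } catch {
        return ['malformed JSON']
    }

    const errors: string[] = []

    // Content validation: pre-sale dates must not be in the past
    for (const entry of entries) {
        if (new Date(entry.startDate) < today) {
            errors.push(`pre-sale date in the past for ${entry.sku}`)
        }
    }

    // Business invariant: no duplicate SKUs
    const skus = entries.map((entry) => entry.sku)
    if (new Set(skus).size !== skus.length) {
        errors.push('duplicate SKU')
    }

    return errors
}
```

A failing check then breaks the Continuous Deployment pipeline before the faulty configuration reaches production.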

Failure and recovery

If a non-technical role makes a mistake in a JSON file, the Continuous Deployment pipeline breaks.
An engineer reverts the change and then reaches out to the author to fix the issue together.

Pros

  • No need to create a dedicated system to handle the commercial configuration
  • Network fault-tolerant: commercial configuration is embedded at build time
  • Unit tests act as a quality gate before changes affect production
  • Configuration changes versioned in git

Cons

  • No custom validation possible in the GitHub web interface (e.g. invalid date format)
  • Basic JSON file manipulation needed by non-technical roles
  • Changes need to wait a handful of minutes to be deployed in production
  • Non-technical roles need to create a GitHub account, per person

Note

Next iterations

If the proposed JSON approach suffers from poor UX, the content management system is likely the next step.
If the business configuration grows in complexity, we have identified a standalone business subdomain. This requires its own back-end system with a UI to configure the changes.

Credits

  • Erik Simon: for proposing custom extensions to the content management system as an interesting alternative
  • Lukasz Plotnicki: for helping me to simplify the proposed solution without falling into security traps like locally forking the GitHub repo

Long lasting tech initiative

Problem

Several weeks are set aside for a tech initiative.
The changes are invisible outside the codebase, so it is hard to define success and communicate progress to stakeholders.

Context

Examples of long-lasting tech initiatives are:

  • refactoring an e-commerce cart logic as it is too complex and slow
  • migrating the integration layer with a content management system
  • improving the continuous deployment pipeline as it is too slow

Solution

Treat tech initiatives exactly like product ones. Artifacts differ, but the lifecycle is the same.
Stakeholders can be inside or outside the team; sometimes they are simply the engineers maintaining a system.

Problem definition

Get the stakeholders together to define:

  • what is the problem
  • who is impacted
  • what are the consequences
  • initial risks, assumptions, issues and dependencies (RAIDs)

Analysis

Analyze the problem and compare different solutions.
Leverage technical spikes to update the RAIDs.
If the problem is complex rather than complicated, start from anywhere and course correct as more information is gained.

Definition of success

The definition of success must be codified in artifacts accessible to everybody and that can evolve with time. For example:

  • Current state and target state of a system in the form of diagrams
  • Observability dashboard with key metrics (e.g. performance, number of errors, number of dependencies, number of invocations, etc.)

This must be done as the very first user story of the initiative.

Communicate progress

The artifacts that define success reflect the progress of the initiative so far.
Ideally this is shown automatically, as in the case of observability dashboards.
Other times, they need to be updated manually, like with architecture diagrams.

Notes

Codifying the definition of success in shared and evolving artifacts has two benefits.
The most important one is exposing a shared mental model across the engineers working on the initiative.
Secondly, it is an effective way to communicate with stakeholders.

Page data

Problem

In a single page application, a page fetches data from different external systems.
Adding new functionalities to the page gets harder because the unit tests are increasingly complex around faking external systems.
This happens because unit tests are coupled to the internals of how data are fetched and transformed.

Context

  • The page is a Next.js server side component
  • External systems are for instance: content management system (CMS) and product service
  • The external systems are queried multiple times, for instance: fetch different components from the CMS
  • Page unit tests are written in react-testing-library, test framework is Vitest
  • Mocking capabilities are provided by Vitest
  • When page unit tests fail, it is very difficult to understand what Vitest mock needs to be fixed

Solution

Extract all the page data fetching and transformation into a single getPageData() function. Now we can:

  • Unit test getPageData() without the complexity of the UI (just Vitest, no react-testing-library)
  • Unit test the page by only faking getPageData()

GetPageData function

An example of such a function can be:

export const getHomePageData = async (locale: Locale) => {
    // The client calls are asynchronous; Promise.all runs them in parallel
    const [homePageContent, footer, productRecommendations] = await Promise.all([
        cmsClient().getHomePage(locale),
        cmsClient().getFooter(locale),
        productService().getRecommendations(locale)
    ])

    return {
        footer,
        heroImage: homePageContent.heroImage,
        productRecommendations,
        title: homePageContent.title
    }
}

Unit tests for getPageData() will fake cmsClient() and productService(). Page unit tests will only fake getPageData().
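
To make the page-test seam concrete, here is a minimal sketch. To stay self-contained it fakes the page data by passing it as an argument rather than stubbing the module with Vitest; renderHomePage() and the data shape are hypothetical stand-ins for the real page component:

```typescript
// Hypothetical page data shape, a subset of what getHomePageData() returns
type HomePageData = { title: string; heroImage: string }

// Hypothetical page: presentation only, composing markup from prefetched data
// (a string here for brevity, JSX in the real component)
const renderHomePage = (data: HomePageData): string =>
    `<h1>${data.title}</h1><img src="${data.heroImage}" />`

// Page unit test seam: one small canned object replaces getHomePageData(),
// with no knowledge of the CMS or the product service
const fakePageData: HomePageData = { title: 'Home', heroImage: '/hero.png' }
const html = renderHomePage(fakePageData)
```

With Vitest, the same effect is obtained by mocking the module that exports getHomePageData(), keeping all cmsClient() and productService() fakes confined to the getPageData() tests.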

Single responsibility

The responsibility of fetching and transforming data is extracted into getPageData().
This means the page can go back to its main responsibility: presentation through components composition.

Notes

Back-end for front-end

The getPageData() function can be thought of as a tiny back-end-for-front-end.
Each page has its own getPageData(), for instance: getHomePageData(), getProductListPageData(), etc.

Practice the interesting, automate the boring

Problem

How to use Large Language Models (LLMs) to increase throughput without sacrificing learning.

Context

Engineers use LLMs to code. Code throughput increases, but knowledge retention decreases among engineers.

Effects on DORA metrics:

  • increased deployment frequency (good)
  • smaller lead time (good)
  • higher mean time to recovery (bad)

Qualitative effects:

  • shallower technical conversations
  • engineers do not remember how they implemented the internals of a feature

Solution

Do not use LLMs when practice is needed. Instead, use LLMs to automate routine tasks.
The rule of thumb is: if something is boring, delegate it to an LLM. If something is still interesting, do it yourself.

Learning requires practice

Learning requires repetition. Mental models strengthen and gain higher resolution with deliberate practice.
Delegating a new skill to LLMs after doing it just a few times hinders learning.

When to use LLM

Prototyping

Quickly building different prototypes to compare them, thus increasing the chance of making the right choice.
Examples are: software architectures, UI mock-ups, hackathons, etc.

New instance of an existing feature

Implement a new instance of a very well established feature.
For instance: yet another product list page for a new product category.

Refactoring at scale

Once the same refactoring has been done several times and engineers have moved from being comfortable doing it to being bored by it.

Simplifying information gathering

Explore documentation efficiently and reduce the initial cognitive load with new technologies (e.g. boilerplate code for external system integrations).

One time occurrences

When doing something we do not plan or care to do again in the future.
For example, a low-impact internal tool (i.e. boring, not interesting).

When not to use LLM

Practicing a new skill

If you are learning test driven development (TDD), don't let the LLM code.
Ask the LLM for feedback, especially if not pairing. Be mindful to weigh and discuss the feedback with others more experienced.

Spaced repetitions

Without practice, it is normal to become rusty even after having mastered something.
We are back to a situation where doing something ourselves is interesting.
Stop delegating to LLMs until it is boring again.

Notes

Not black and white

There is a difference between using LLMs to prompt your way through an entire feature and using them to make a function more idiomatic in TypeScript.
LLMs can act as a good sparring partner, or bring us to places where we would rather not be.
They are not good or bad per se, we just need to learn when it is better to slow down or speed up.

Engineering instead of vibe coding

The context size of each agent is key.
Have one agent acting as a planner and multiple agents acting as builders.
The builders must have very specific roles and perform very small tasks. For instance, if doing TDD: one agent writes the tests, one agent makes the tests pass, one agent refactors.
Have a common initial prompt to specify coding conventions, testing strategy, tech principles, and cross-functional requirements.
Each repository has a folder containing the above prompts. For a mono-repo, this is per service.

Credits

Thanks to Sebastian Roidl for the "Engineering instead of vibe coding" section.