Make Certificate Expiry Boring


On 18 November 2025, GitHub had an hour-long outage that affected the heart of their product: Git operations. The post-incident summary was brief and honest - the outage was triggered by an internal TLS certificate that had quietly expired, blocking service-to-service communication inside their platform. It’s the kind of issue every engineering team knows can happen, yet it still slips through because certificates live in odd corners of a system, often far from where we normally look.

What struck me about this incident wasn’t that GitHub “missed something.” If anything, it reminded me how easy it is, even for well-run, highly mature engineering orgs, to overlook certificate expiry in their observability and alerting posture. We monitor CPU, memory, latency, error rates, queue depth, request volume - but a certificate that’s about to expire rarely shows up as a first-class signal. It doesn’t scream. It doesn’t gradually degrade. It just keeps working… until it doesn’t.

And that’s why these failures feel unfair. They’re fully preventable, but only if you treat certificates as operational assets, not just security artefacts. This article is about building that mindset: how to surface certificate expiry as a real reliability concern, how to detect issues early, and how to ensure a single date on a single file never brings down an entire system.

Why certificate-expiry outages happen

Most outages have a shape: a graph that starts bending the wrong way, an error budget that begins to evaporate, a queue that grows faster than it drains. Teams get early signals. They get a chance to react.

Certificate expiry is different. It behaves more like a trapdoor. Everything works perfectly… until the moment it doesn’t.

And because certificates sit at the intersection of security and infrastructure, ownership is often ambiguous. One team issues them, another deploys them, a third operates the service that depends on them. Over time, as systems evolve, certificates accumulate in places no one remembers - a legacy load balancer here, a forgotten internal endpoint there, an old mutual-TLS handshake powering a background job that hasn’t been touched in years. Each one quietly counts down to a date that may not exist anywhere in your dashboards.

It’s not that engineering teams are careless. It’s that distributed systems create distributed responsibilities. And unless expiry is treated as an operational metric - something you can alert on, page on, and practice recovering from - it becomes a blind spot.

The GitHub incident is just a recent reminder of a pattern most of us have seen in some form: the system isn’t failing, but our visibility into its prerequisites is.

That’s what we’ll fix next.

Where certificates actually live in a modern system

Before we talk about detection and automation, it helps to map the terrain. Certificates don’t sit in one place; they’re spread across a system the same way responsibilities do. And when teams are busy shipping features, it’s easy to forget how many places depend on a valid chain of trust.

A few common patterns:

1. Public entry points
These are the obvious ones - the certificates on your API gateway, load balancers, reverse proxies, or CDN. They’re usually tracked because they’re customer-facing. But even here, expiry can slip through if ownership rotates or if the renewal mechanism silently fails.

2. Internal service-to-service communication
Modern systems often use mTLS internally. That means each service, sidecar, or pod may hold its own certificate, usually short-lived and automatically rotated. The catch: these automation pipelines need monitoring too. When they fail, the failure is often invisible until the cert expires.

3. Databases, message brokers, and internal control planes
Many teams enable TLS for PostgreSQL, MongoDB, Kafka, Redis, or internal admin endpoints - and then forget about those certs entirely. These can be some of the hardest outages to debug because the components are not exposed externally and failures manifest as connection resets or handshake errors deep inside a dependency chain.

4. Cloud-managed infrastructure
AWS ALBs, GCP Certificate Manager, Azure Key Vault, CloudFront, IoT gateways - each keeps its own certificate store. These systems usually help with automation, but they don’t always alert when renewal fails, and they certainly don’t alert when your usage patterns change.

5. Legacy or security-adjacent components
Some of the most outage-prone certificates sit in places we rarely revisit:

If even one of these expires, the blast radius can be surprisingly wide.

What all of this shows is that certificate expiry isn’t a single problem - it’s an inventory problem. You can’t secure or monitor what you don’t know exists. And you can’t rely on tribal memory to keep track of everything.

The next step, naturally, is stitching visibility back into the system: turning this scattered landscape into something observable, alertable, and resilient.

Detecting certificate expiry across different environments

Once you understand where certificates tend to hide, the next question becomes: how do we surface their expiry in a way that fits naturally into our observability stack?
The good news is that we don’t need anything exotic. We just need a reliable way to extract expiry information and feed it into whatever monitoring and alerting system we already trust.

The exact approach varies by environment, but the principle stays the same: expiry should show up as a first-class metric - just like latency, errors, or disk space.

Let’s break this down across the most common setups.

1. Kubernetes (with cert-manager)

If you’re using cert-manager, you already have expiry information available - it’s just a matter of surfacing it.

Cert-manager stores certificate metadata in the Kubernetes API, including status.notAfter. Expose that through:

Once the metric is flowing into Prometheus or Datadog, you can build straightforward alerts:

This handles most cluster-level certificates, especially ingress TLS and ACME-issued certs.
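
To make that concrete, here is a minimal sketch (assuming cert-manager v1 CRDs and the official Python Kubernetes client) that reads status.notAfter straight from the Certificate resources and flags anything close to expiry; in a real setup you would push these numbers into Prometheus or Datadog rather than printing them:

```python
# List cert-manager Certificates and flag anything close to expiry,
# using status.notAfter. Assumes cert-manager v1 CRDs and kubeconfig access.
from datetime import datetime, timezone

from kubernetes import client, config

WARN_DAYS = 14  # example threshold

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

certs = api.list_cluster_custom_object(
    group="cert-manager.io", version="v1", plural="certificates"
)

now = datetime.now(timezone.utc)
for item in certs["items"]:
    not_after = item.get("status", {}).get("notAfter")
    if not_after is None:
        continue  # certificate not issued yet
    expires = datetime.fromisoformat(not_after.replace("Z", "+00:00"))
    days_left = (expires - now).days
    if days_left < WARN_DAYS:
        meta = item["metadata"]
        print(f"{meta['namespace']}/{meta['name']} expires in {days_left} days")
```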

2. Kubernetes (without cert-manager)

Many clusters use:

In these cases, you can extract expiry from:

The pattern stays the same: gather expiry → convert to a metric → alert early.
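
A rough sketch of that pattern, assuming kubeconfig access, TLS material stored in standard kubernetes.io/tls Secrets, and the cryptography package available:

```python
# Walk every kubernetes.io/tls Secret and read the expiry of the embedded cert.
import base64
from datetime import datetime, timezone

from cryptography import x509
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

now = datetime.now(timezone.utc)
secrets = v1.list_secret_for_all_namespaces(field_selector="type=kubernetes.io/tls")
for secret in secrets.items:
    pem = base64.b64decode(secret.data["tls.crt"])
    cert = x509.load_pem_x509_certificate(pem)
    # Older cryptography releases only expose not_valid_after (naive UTC).
    not_after = getattr(cert, "not_valid_after_utc", None) or cert.not_valid_after.replace(tzinfo=timezone.utc)
    days_left = (not_after - now).days
    print(f"{secret.metadata.namespace}/{secret.metadata.name}: {days_left} days left")
```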

3. Virtual machines, bare metal, or traditional workloads

This is where certificate expiry issues happen the most, often because the monitoring setup predates the current system complexity.

Your options here are simple and effective:

Many of the worst certificate-expiry outages outside Kubernetes happen in these environments - mostly because there’s no single place where certificates live, so no single tool naturally monitors them. A tiny probe script, run on a schedule, can save hours of downtime.
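
That script really can be tiny. A cron-friendly sketch (the endpoints and threshold below are placeholders) that probes a few TLS endpoints and exits non-zero when anything is inside the warning window:

```python
# Probe a handful of TLS endpoints; exit non-zero if any cert is close to expiry.
import socket
import ssl
import sys
import time

ENDPOINTS = ["www.example.com:443", "vpn.example.com:443"]  # placeholder targets
WARN_DAYS = 21                                              # example threshold

failing = []
for endpoint in ENDPOINTS:
    host, port = endpoint.rsplit(":", 1)
    try:
        ctx = ssl.create_default_context()
        with socket.create_connection((host, int(port)), timeout=5) as sock:
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                # 'notAfter' looks like 'Jun  1 12:00:00 2026 GMT'.
                not_after = ssl.cert_time_to_seconds(tls.getpeercert()["notAfter"])
        days_left = int((not_after - time.time()) // 86400)
        print(f"{endpoint}: {days_left} days until expiry")
        if days_left < WARN_DAYS:
            failing.append(endpoint)
    except (OSError, ssl.SSLError) as exc:
        # An already-expired or untrusted certificate fails the handshake itself.
        print(f"{endpoint}: TLS probe failed ({exc})")
        failing.append(endpoint)

sys.exit(1 if failing else 0)
```

Wire it into cron, a systemd timer, or your existing check framework and you already have more coverage than most teams.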

4. Cloud-managed ecosystems

AWS, GCP, and Azure all provide mature certificate stores:

They usually renew automatically, but renewals can fail silently for reasons like:

The fix: poll these APIs on a schedule and compare expiry timestamps with your policy thresholds. Treat those just like metrics from a node or pod.
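
As one example of that polling, here is a sketch against AWS ACM using boto3 (the region, threshold, and the decision to print rather than emit a metric are all assumptions; GCP and Azure expose the same information through their own SDKs):

```python
# Poll ACM for certificate expiry and renewal status.
from datetime import datetime, timezone

import boto3

THRESHOLD_DAYS = 30  # example policy threshold

acm = boto3.client("acm", region_name="us-east-1")

for page in acm.get_paginator("list_certificates").paginate():
    for summary in page["CertificateSummaryList"]:
        cert = acm.describe_certificate(
            CertificateArn=summary["CertificateArn"]
        )["Certificate"]
        not_after = cert.get("NotAfter")
        if not_after is None:
            continue  # e.g. still pending validation
        days_left = (not_after - datetime.now(timezone.utc)).days
        renewal = cert.get("RenewalSummary", {}).get("RenewalStatus", "n/a")
        if days_left < THRESHOLD_DAYS or renewal == "FAILED":
            print(f"{cert['DomainName']}: {days_left} days left, renewal={renewal}")
```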

5. The hard-to-see corners

No matter how modern your architecture is, you’ll find certificates in:

These deserve monitoring too, and the process is no different: probe, parse, publish.

Focus on expiry as a metric

When certificate expiry becomes just another number that your dashboards understand - a timestamp that can be plotted, queried, alerted on - the problem changes shape. It stops being a last-minute surprise and becomes part of your normal operational rhythm.

The next question, then, is how to automate renewals and rotations so that even when alerts happen, they’re nothing more than a nudge.

Automating certificate renewal and rotation

Detecting certificates before they expire is necessary, but it’s not the end goal. The real win is when expiry becomes uninteresting - when certificates rotate quietly in the background, without paging anyone, and without becoming a stress point before every major release.

Most organisations get stuck on renewals for one of two reasons:

  1. They assume automation is risky.
  2. Their infrastructure is too fragmented for a single renewal flow.

But automation doesn’t have to be fragile. It just has to be explicit.

Here are the most reliable patterns that work across environments.

1. ACME-based automation (Let’s Encrypt and internal ACME servers)

If your certificates can be issued via ACME, life becomes dramatically simpler. ACME clients - whether cert-manager inside Kubernetes or acme.sh / lego on a traditional VM - handle the full cycle:

And because ACME certificates are intentionally short-lived, your system gets frequent practice, making renewal failures visible long before a real expiry.

For internal systems, tools like Smallstep’s step-ca or HashiCorp Vault (which speaks ACME) can act as internal ACME CAs, giving you automatic rotation without public DNS hoops; Pebble covers the same flow in test environments.
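
On a traditional VM, the whole flow can be a scheduled wrapper around the ACME client. A minimal sketch, assuming certbot as the client and nginx as the consumer (both are assumptions - acme.sh or lego follow the same shape):

```python
# Scheduled renewal wrapper: renew via ACME, reload the consumer, surface failures.
import subprocess
import sys

result = subprocess.run(
    [
        "certbot", "renew",
        "--non-interactive",
        # Reload (not restart) only when a certificate was actually renewed.
        "--deploy-hook", "systemctl reload nginx",
    ],
    capture_output=True,
    text=True,
)

print(result.stdout)
if result.returncode != 0:
    # A non-zero exit here is exactly the signal your monitoring should catch.
    print(result.stderr, file=sys.stderr)
    sys.exit(1)
```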

2. Renewal via internal CA (Vault PKI, Venafi, Active Directory CA)

Some environments need tighter control than ACME allows. In those cases:

The trick is to treat issuance and renewal as API-driven processes - not as manual handoffs.

The pipeline should be able to:

Once this flow exists, adding observability around it is straightforward.
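
As a sketch of what “API-driven” means in practice, here is an issuance call against Vault’s PKI secrets engine over its HTTP API (the mount path pki, the role name internal-service, the common name, and the output paths are all placeholders):

```python
# Issue a certificate from Vault's PKI engine and drop it where the service expects it.
import os
import pathlib

import requests

VAULT_ADDR = os.environ["VAULT_ADDR"]    # e.g. https://vault.internal:8200
VAULT_TOKEN = os.environ["VAULT_TOKEN"]

resp = requests.post(
    f"{VAULT_ADDR}/v1/pki/issue/internal-service",
    headers={"X-Vault-Token": VAULT_TOKEN},
    json={"common_name": "payments.internal.example.com", "ttl": "720h"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()["data"]

# Write cert, key, and issuing CA where the service (or config management) expects them.
pathlib.Path("/etc/pki/payments.crt").write_text(data["certificate"])
pathlib.Path("/etc/pki/payments.key").write_text(data["private_key"])
pathlib.Path("/etc/pki/payments-ca.crt").write_text(data["issuing_ca"])
```

Run the same script from a pipeline or a timer, record its outcome, and renewal stops depending on anyone’s calendar.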

3. Automating the distribution step

Most certificate outages happen after renewal succeeds - when the new certificate exists but hasn’t been rolled out cleanly.

To make rotation safe and predictable:

This overlap pattern avoids the “everything broke because we reloaded too aggressively” class of outages, which is surprisingly common.

4. Cloud-managed rotation

Cloud providers do a decent job of renewing certificates automatically, but they won’t validate your whole deployment chain. That’s on you.

The safe pattern:

This closes the gap between “cert renewed” and “cert in use.”

5. Rotation in service meshes and sidecar-based systems

Istio, Linkerd, Consul Connect, and similar meshes issue short-lived certificates to workloads and rotate them frequently. This is excellent for security - but only if rotation stays healthy.

You want to monitor:

If rotation falls behind, it should be alerted on long before expiry.
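
One practical way in is the sidecar’s own admin interface. A rough sketch that asks a local Envoy-based sidecar how long its workload certificates have left (the admin port 15000 and the days_until_expiration field reflect recent Envoy versions and are assumptions to verify against your mesh):

```python
# Ask a local Envoy sidecar about its certificate expiry via the admin endpoint.
import requests

ADMIN_URL = "http://localhost:15000/certs"  # assumed Envoy admin address

payload = requests.get(ADMIN_URL, timeout=5).json()
for cert in payload.get("certificates", []):
    for chain in cert.get("cert_chain", []):
        days = chain.get("days_until_expiration")
        print(f"{chain.get('path', 'in-memory cert')}: {days} days until expiry")
```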

The goal is predictability, not cleverness

A good renewal system doesn’t try to be “smart.”
It tries to be boring - predictable, transparent, observable, and easy to test.

The next step is tying this predictability into your alerting strategy: you want enough signal to catch problems early, but not so much noise that expiry becomes background static.

Alerting strategies that actually prevent downtime

Once certificates are visible in your monitoring system, the next challenge is deciding when to alert and how loudly. Expiry isn’t like latency or saturation - it doesn’t fluctuate minute-to-minute. It moves slowly, predictably, and without drama. That means your alerts should feel the same: calm, early, and useful.

A good alert for certificate expiry does two things:

  1. It tells you early enough that the fix is routine.
  2. It doesn’t page the team unless the system is genuinely at risk.

At the risk of being prescriptive, here’s how to design that balance.

1. Use long, staggered alert windows

A 90-day certificate doesn’t need a red alert while 89 days still remain - but alerting also shouldn’t wait until only three days are left.

A common, reliable pattern is:

This staggered approach ensures:

The goal is to turn expiry into a background piece of operational hygiene - not an adrenaline spike.
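
The mechanics are almost trivially simple; what matters is agreeing on the tiers. A sketch with assumed thresholds (30, 14, and 7 days - tune these to your renewal cadence):

```python
# Map days-until-expiry to an alerting tier. Threshold values are examples.
def expiry_tier(days_left: int) -> str:
    if days_left <= 7:
        return "page"    # someone should act today
    if days_left <= 14:
        return "warn"    # visible to the on-call channel, no page
    if days_left <= 30:
        return "ticket"  # routine work item, picked up in normal planning
    return "ok"

if __name__ == "__main__":
    for days in (45, 21, 10, 2):
        print(days, "->", expiry_tier(days))
```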

2. Alert on renewal failures, not just expiry

A certificate expiring is usually a symptom.
The real problem is that the renewal automation stopped working.

So your monitoring should include:

These alerts often matter more than the expiry date itself.
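
A simple way to watch the automation itself is a heartbeat: every renewal run records a “last success” timestamp, and an alert fires when that timestamp goes stale. A sketch using the Prometheus Pushgateway (the gateway address, job name, metric name, and renewal command are placeholders):

```python
# Wrap the renewal job and push a "last success" heartbeat to the Pushgateway.
# Alert when the heartbeat goes stale - that catches broken automation long
# before any certificate actually expires.
import subprocess

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "cert_renewal_last_success_timestamp_seconds",
    "Unix timestamp of the last successful certificate renewal run",
    registry=registry,
)

# Placeholder for whatever actually performs renewal in your environment.
result = subprocess.run(["/usr/local/bin/renew-certs"])

if result.returncode == 0:
    last_success.set_to_current_time()
    push_to_gateway("pushgateway.internal:9091", job="cert-renewal", registry=registry)
```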

3. Detect chain issues and intermediate expiries

Sometimes the leaf certificate is fine - but an intermediate in the chain is not. Many teams miss this because they only check the leaf and never the rest of the chain.

Your probes should validate the full chain:

Broken chains can create outages that look like TLS handshake mysteries, even when the leaf cert is fresh.
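
Validating the chain takes slightly more effort than a plain handshake, because most client libraries only hand back the leaf. A rough sketch that shells out to openssl s_client -showcerts and checks every certificate the server presents (openssl on the PATH, the cryptography package, and the endpoint are assumptions):

```python
# Check expiry of every certificate in the chain the server presents.
import re
import subprocess
from datetime import datetime, timezone

from cryptography import x509

HOST, PORT = "api.example.com", 443  # placeholder endpoint

proc = subprocess.run(
    ["openssl", "s_client", "-connect", f"{HOST}:{PORT}",
     "-servername", HOST, "-showcerts"],
    input="", capture_output=True, text=True, timeout=15,
)

# Pull every PEM block (leaf and intermediates) out of the s_client output.
pem_blocks = re.findall(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    proc.stdout, re.DOTALL,
)

now = datetime.now(timezone.utc)
for pem in pem_blocks:
    cert = x509.load_pem_x509_certificate(pem.encode())
    # Older cryptography releases only expose not_valid_after (naive UTC).
    not_after = getattr(cert, "not_valid_after_utc", None) or cert.not_valid_after.replace(tzinfo=timezone.utc)
    print(f"{cert.subject.rfc4514_string()}: {(not_after - now).days} days left")
```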

4. Surface expiry as a metric your dashboards understand

A certificate’s expiry date is just a timestamp. Expose it like any other metric:

Once it’s a metric:

It becomes part of your observability ecosystem, not an afterthought.
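
A minimal exporter sketch showing the idea: probe each endpoint, publish the raw expiry timestamp as a gauge, and let alert rules do the arithmetic. The metric name, port, probe interval, and target list are all assumptions:

```python
# Tiny expiry exporter: probe targets, expose expiry as a Prometheus gauge.
import socket
import ssl
import time

from prometheus_client import Gauge, start_http_server

TARGETS = ["api.example.com:443", "admin.internal.example.com:8443"]  # placeholders

expiry_ts = Gauge(
    "cert_expiry_timestamp_seconds",
    "Unix timestamp at which the presented leaf certificate expires",
    ["target"],
)

def leaf_expiry(host: str, port: int) -> float:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return ssl.cert_time_to_seconds(tls.getpeercert()["notAfter"])

if __name__ == "__main__":
    start_http_server(9117)  # assumed port for the /metrics endpoint
    while True:
        for target in TARGETS:
            host, port = target.rsplit(":", 1)
            try:
                expiry_ts.labels(target=target).set(leaf_expiry(host, int(port)))
            except (OSError, ssl.SSLError) as exc:
                print(f"probe failed for {target}: {exc}")
        time.sleep(300)
```

From there, “fewer than 14 days until that timestamp” is just another alert rule, living next to your latency and error-rate alerts.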

5. Don’t rely on humans to remember edge cases

If your alerts depend on tribal knowledge - someone remembering that “there’s an old VPN gateway in staging with a cert that expires in March” - then you don’t have an alerting strategy, you have a memory test that your team will fail.

Every certificate, in every environment, should be:

The moment monitoring depends on someone remembering “that one place we keep certs,” you’re back to hoping instead of observing.

Alerting should create confidence, not anxiety

Good alerts help teams sleep better. They remove uncertainty and allow engineers to trust that the system will tell them when something important is off. Certificate expiry should fall squarely into this camp - predictable, early, and boring.

With detection and alerting covered, the next piece is ensuring the system behaves safely when certificates actually rotate: how to design zero-downtime deployment patterns so rotation never becomes an outage event.

Zero-downtime rotation patterns

Even with good monitoring and robust automation, certificate renewals can still cause trouble if the rotation process itself is fragile. A surprising number of certificate-related outages happen after a new certificate has already been issued - during the switch-over phase where services, load balancers, or sidecars pick up the new credentials.

Zero-downtime rotation isn’t complicated, but it does require deliberate patterns. Most of these boil down to one principle:

Never replace a certificate in a way that surprises the system.

Here are the patterns that make rotation predictable and safe.

1. Overlap the old and new certificates

A simple but powerful rule:
Always have a window where both the old and new certificates are valid and deployed.

This overlap ensures:

In practice, this can mean:

Overlap is your safety net.

2. Use atomic attachment for load balancers and gateways

Cloud load balancers usually support:

This is vastly safer than:

Atomic attachment ensures that the traffic shift is instantaneous and consistent.
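
For example, on an AWS Application Load Balancer the swap is a single API call that points the listener at the new ACM certificate - no config file edits in between. The ARNs below are placeholders:

```python
# Atomically point an ALB listener at a newly issued ACM certificate.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

LISTENER_ARN = "arn:aws:elasticloadbalancing:REGION:ACCOUNT:listener/app/example/PLACEHOLDER"
NEW_CERT_ARN = "arn:aws:acm:REGION:ACCOUNT:certificate/PLACEHOLDER"

# Replaces the listener's default certificate in one step: established connections
# keep their sessions, new handshakes are served the new certificate.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    Certificates=[{"CertificateArn": NEW_CERT_ARN}],
)
```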

3. Prefer graceful reloads over restarts

Some services pick up new certificates on reload, others need restarts. Where you can, choose the reload path.

Graceful reloads:

If a service truly cannot reload (rare today), wrap the rotation in a controlled restart pattern:

The idea is the same: no hard cuts.

4. Validate after rotation - not just before

Many teams validate certificates before they rotate:

All good - but not enough.

You also need post-rotation validation:

Treat rotation as a deployment, not a file update.
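
Post-rotation validation can be as small as re-handshaking with the endpoint and confirming it now serves the certificate you just deployed, for example by comparing fingerprints (the host and the expected fingerprint are placeholders):

```python
# Verify the endpoint is serving the certificate we just rotated in.
import hashlib
import socket
import ssl

HOST, PORT = "api.example.com", 443                        # placeholder endpoint
EXPECTED_SHA256 = "replace-with-sha256-of-the-new-cert"    # placeholder value

ctx = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        served = tls.getpeercert(binary_form=True)  # DER bytes of the leaf

fingerprint = hashlib.sha256(served).hexdigest()
if fingerprint != EXPECTED_SHA256:
    raise SystemExit(f"rotation not picked up: endpoint serves {fingerprint}")
print("rotation verified: endpoint serves the new certificate")
```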

5. Treat service meshes as first-class rotation systems

Sidecar-based meshes like Istio or Linkerd already rotate certificates frequently. But the control-plane CA certificates still need careful handling.

When rotating a CA certificate in a mesh:

Skipping these steps can break mTLS cluster-wide.

6. Keep rotation logs - they’re your only breadcrumb trail

Certificate rotation has a habit of failing silently.
Most debugging sessions start with, “Did the certificate get picked up?” and end in grepping logs or diffing secrets.

A good rotation system records:

This is invaluable during an incident, and equally helpful for audits or compliance. Drop it into the #release or #deployment Slack channel so others can debug faster when things go wrong.

Rotation should feel like any other deploy

The most reliable teams treat certificate rotation exactly like they treat code deployment:

When a certificate rotation feels as uninteresting as a config push or a canary rollout, you’ve reached operational maturity in this area.

Building organisation-wide guardrails around certificate management

Everything we’ve covered so far - inventory, monitoring, renewal, rotation - solves the technical side of certificate expiry. But outages rarely happen because of a missing script or exporter. They happen because systems grow, responsibilities shift, and operational assumptions slowly drift out of sync with reality.

Preventing certificate-expiry outages at scale requires more than good automation. It needs guardrails: lightweight, durable structures that support engineers without slowing them down. This isn’t governance, and it isn’t process for process’ sake. It’s giving teams the clarity and safety they need so certificates don’t become an invisible failure mode.

Many of these guardrails aren’t needed if you have a single, well-known, automated way of dealing with certificates. Often that isn’t the case, though, and the guardrails below have helped me manage the complexity of a more manual certificate lifecycle.

1. Make ownership explicit - for every certificate

Every certificate in your system should have:

This sounds formal, but it can be as simple as three fields in an internal inventory:

When ownership is clear, expiry becomes a maintenance task, not a detective story.
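
The shape of such an entry is deliberately small. A sketch with illustrative field names:

```python
# A deliberately small inventory record; field names are illustrative.
from dataclasses import dataclass

@dataclass
class CertificateRecord:
    location: str  # where it is deployed, e.g. "alb:payments-prod" or "vm:vpn-gw-01"
    owner: str     # team or on-call rotation responsible for it
    renewal: str   # how it renews, e.g. "cert-manager", "acm-auto", "manual"

inventory = [
    CertificateRecord("alb:payments-prod", "payments-oncall", "acm-auto"),
    CertificateRecord("vm:legacy-vpn-gw", "infra-team", "manual"),
]
```

Whether this lives in a YAML file, a spreadsheet, or a CMDB matters far less than the fact that it exists and is kept current.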

2. Set policy, but keep it lightweight

Certificate policies often fail because they become too rigid or too verbose. A practical policy should answer only the essentials:

3. Use the same observability channels you use for everything else

A certificate expiring should appear in:

If you need a separate tool or a second inbox to monitor certificates, you’ve already created friction and added to the confusion. The best guardrail is simply: “This is part of our normal operational metrics.”

4. Run periodic “expiry audits” without blame

Once or twice a year, do a small audit:

The best option is to automate this audit.

5. Practice a certificate-rotation drill

Just like fire drills, rotation drills can build confidence by exposing vulnerabilities and gaps. Pick a non-critical service once a quarter:

This helps teams become comfortable with rotations, and uncovers issues that only show up during real renewals - mismatched trust stores, pinned clients, stale intermediates, or forgotten nodes. Better still, run the drill in production on a low-risk service.

6. Encourage teams to prefer automation over manual fixes

When a certificate is close to expiring, the fastest fix is often manual: generate a cert, upload it, restart a service - job done.

It works in the moment, but creates a hidden cost: the automation is bypassed, and the system drifts.

Guardrails help by making the automated path the default:

Guardrails keep engineering energy focused where it matters

Good guardrails don’t feel heavy. They feel like support structures - the kind that keep important details visible even when everyone is moving fast. They reduce cognitive load, eliminate invisible traps, and give teams a shared mental model for how certificates behave in their environment.

When these guardrails are in place, certificate expiry stops being a background anxiety. It becomes just another part of the system that’s well understood, continuously monitored, and quietly maintained.

Bringing it all together - from trapdoor failures to predictable operations

Certificate-expiry outages feel disproportionate. They don’t arise from a complex scaling limit or an unexpected dependency interaction. They come from a single date embedded in a file - a detail that quietly counts down while everything else appears healthy. And when that date finally arrives, the failure is abrupt. No slow burn, no early symptoms. Just a trapdoor.

But it doesn’t need to be this way.
Expiry is one of the few reliability risks that is both entirely predictable and entirely preventable.

When we treat certificates as operational assets - things we can inventory, observe, rotate, and practice with - the problem changes shape. Instead of scrambling during an incident, teams build a steady rhythm around expiry:

And the result is a system that behaves the way resilient systems should: not because people remembered every corner, but because the structure makes forgetting impossible.

The GitHub outage was a reminder, not a criticism. It showed that even the most sophisticated engineering organisations can be caught off-guard by something small and silent. But it also demonstrated why it’s worth building a culture - and a set of practices - where small and silent things are surfaced early.

If your team can get certificate expiry out of the class of “we hope this doesn’t bite us” and into the class of “this is a well-managed part of our infrastructure,” you’ve eliminated an entire category of avoidable outages.

That’s the goal. Not perfect governance. Just clear guardrails, steady habits, and a system you can trust - even on the days when nothing looks wrong.