certmagic/maintain.go

// Copyright 2015 Matthew Holt
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.
package certmagic
import (
"context"
"crypto/x509"
"encoding/json"
"encoding/pem"
"errors"
"fmt"
"io/fs"
"path"
"runtime"
"strings"
"time"
"github.com/mholt/acmez/v2/acme"
"go.uber.org/zap"
"golang.org/x/crypto/ocsp"
)
// maintainAssets is a permanently-blocking function
// that loops indefinitely and, on a regular schedule, checks
// certificates for expiration and initiates a renewal of certs
// that are expiring soon. It also updates OCSP stapling. It
// should only be called once per cache. Panics are recovered,
// and if panicCount < 10, the function is called recursively,
// incrementing panicCount each time. Initial invocation should
// start panicCount at 0.
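//
// A minimal launch sketch (an assumption; the cache constructor that actually
// starts this goroutine lives elsewhere in the package):
//
//	go certCache.maintainAssets(0)
//
// The loop exits when certCache.stopChan is closed (see the select below).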
func (certCache *Cache) maintainAssets(panicCount int) {
log := certCache.logger.Named("maintenance")
log = log.With(zap.String("cache", fmt.Sprintf("%p", certCache)))
defer func() {
if err := recover(); err != nil {
buf := make([]byte, stackTraceBufferSize)
buf = buf[:runtime.Stack(buf, false)]
log.Error("panic", zap.Any("error", err), zap.ByteString("stack", buf))
if panicCount < 10 {
certCache.maintainAssets(panicCount + 1)
}
}
}()
certCache.optionsMu.RLock()
renewalTicker := time.NewTicker(certCache.options.RenewCheckInterval)
ocspTicker := time.NewTicker(certCache.options.OCSPCheckInterval)
certCache.optionsMu.RUnlock()
log.Info("started background certificate maintenance")
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
for {
select {
case <-renewalTicker.C:
err := certCache.RenewManagedCertificates(ctx)
if err != nil {
log.Error("renewing managed certificates", zap.Error(err))
}
case <-ocspTicker.C:
certCache.updateOCSPStaples(ctx)
case <-certCache.stopChan:
renewalTicker.Stop()
ocspTicker.Stop()
log.Info("stopped background certificate maintenance")
close(certCache.doneChan)
return
}
}
}
// RenewManagedCertificates renews managed certificates,
// including ones loaded on-demand. Note that this is done
// automatically on a regular basis; normally you will not
// need to call this. This method assumes non-interactive
// mode (i.e. operating in the background).
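//
// A minimal sketch of a manual invocation (hypothetical; certCache is a *Cache
// and logger is whatever *zap.Logger the caller already has; normally the
// maintenance goroutine above calls this on its own ticker):
//
//	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
//	defer cancel()
//	if err := certCache.RenewManagedCertificates(ctx); err != nil {
//	    logger.Error("renewal sweep failed", zap.Error(err))
//	}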
func (certCache *Cache) RenewManagedCertificates(ctx context.Context) error {
log := certCache.logger.Named("maintenance")
// configs will hold a map of certificate hash to the config
// to use when managing that certificate
configs := make(map[string]*Config)
// we use the queues for a very important reason: to do any and all
// operations that could require an exclusive write lock outside
// of the read lock! otherwise we get a deadlock, yikes. in other
// words, our first iteration through the certificate cache does NOT
// perform any operations--only queues them--so that more fine-grained
// write locks may be obtained during the actual operations.
var renewQueue, reloadQueue, deleteQueue, ariQueue certList
certCache.mu.RLock()
for certKey, cert := range certCache.cache {
if !cert.managed {
continue
}
// the list of names on this cert should never be empty... programmer error?
if len(cert.Names) == 0 {
log.Warn("certificate has no names; removing from cache", zap.String("cert_key", certKey))
deleteQueue = append(deleteQueue, cert)
continue
}
// get the config associated with this certificate
cfg, err := certCache.getConfig(cert)
if err != nil {
log.Error("unable to get configuration to manage certificate; unable to renew",
zap.Strings("identifiers", cert.Names),
zap.Error(err))
continue
}
if cfg == nil {
// this is bad if this happens, probably a programmer error (oops)
log.Error("no configuration associated with certificate; unable to manage",
zap.Strings("identifiers", cert.Names))
continue
}
if cfg.OnDemand != nil {
continue
}
// ACME-specific: see if the ACME Renewal Info (ARI) window needs refreshing
if !cfg.DisableARI && cert.ari.NeedsRefresh() {
configs[cert.hash] = cfg
ariQueue = append(ariQueue, cert)
}
// if time is up or expires soon, we need to try to renew it
if cert.NeedsRenewal(cfg) {
configs[cert.hash] = cfg
// see if the certificate in storage has already been renewed, possibly by another
// instance that didn't coordinate with this one; if so, just load it from storage
// (checking storage first is a simple way to drastically reduce rate limit problems)
storedCertNeedsRenew, err := cfg.managedCertInStorageNeedsRenewal(ctx, cert)
if err != nil {
// hmm, weird, but not a big deal, maybe it was deleted or something
log.Warn("error while checking if stored certificate is also expiring soon",
zap.Strings("identifiers", cert.Names),
zap.Error(err))
} else if !storedCertNeedsRenew {
// if the certificate does NOT need renewal and there was no error, then we
// are good to just reload the certificate from storage instead of repeating
// a likely-unnecessary renewal procedure
reloadQueue = append(reloadQueue, cert)
continue
}
// the certificate in storage has not been renewed yet, so we will do it
// NOTE: It is super-important to note that the TLS-ALPN challenge requires
// a write lock on the cache in order to complete its challenge, so it is extra
// vital that this renew operation does not happen inside our read lock!
renewQueue.insert(cert)
}
}
certCache.mu.RUnlock()
// Update ARI, and then for any certs where the ARI window changed,
// be sure to queue them for renewal if necessary
for _, cert := range ariQueue {
cfg := configs[cert.hash]
cert, changed, err := cfg.updateARI(ctx, cert, log)
if err != nil {
log.Error("updating ARI", zap.Error(err))
}
if changed && cert.NeedsRenewal(cfg) {
// it's theoretically possible that another instance already got the memo
// on the changed ARI and even renewed the cert already, and thus doing it
// here is wasteful, but I have never heard of this happening in reality,
// so to save some cycles for now I think we'll just queue it for renewal
// (notice how we use 'insert' to avoid duplicates, in case it was already
// scheduled for renewal anyway)
renewQueue.insert(cert)
}
}
// Reload certificates that merely need to be updated in memory
for _, oldCert := range reloadQueue {
timeLeft := expiresAt(oldCert.Leaf).Sub(time.Now().UTC())
log.Info("certificate expires soon, but is already renewed in storage; reloading stored certificate",
zap.Strings("identifiers", oldCert.Names),
zap.Duration("remaining", timeLeft))
cfg := configs[oldCert.hash]
// crucially, this happens OUTSIDE a lock on the certCache
_, err := cfg.reloadManagedCertificate(ctx, oldCert)
if err != nil {
log.Error("loading renewed certificate",
zap.Strings("identifiers", oldCert.Names),
zap.Error(err))
continue
}
}
// Renewal queue
for _, oldCert := range renewQueue {
cfg := configs[oldCert.hash]
err := certCache.queueRenewalTask(ctx, oldCert, cfg)
if err != nil {
log.Error("queueing renewal task",
zap.Strings("identifiers", oldCert.Names),
zap.Error(err))
continue
}
}
// Deletion queue
certCache.mu.Lock()
for _, cert := range deleteQueue {
certCache.removeCertificate(cert)
}
certCache.mu.Unlock()
return nil
}
func (certCache *Cache) queueRenewalTask(ctx context.Context, oldCert Certificate, cfg *Config) error {
log := certCache.logger.Named("maintenance")
timeLeft := expiresAt(oldCert.Leaf).Sub(time.Now().UTC())
log.Info("certificate expires soon; queuing for renewal",
zap.Strings("identifiers", oldCert.Names),
zap.Duration("remaining", timeLeft))
// Get the name which we should use to renew this certificate;
// we only support managing certificates with one name per cert,
// so this should be easy.
renewName := oldCert.Names[0]
// queue up this renewal job (is a no-op if already active or queued)
jm.Submit(cfg.Logger, "renew_"+renewName, func() error {
timeLeft := expiresAt(oldCert.Leaf).Sub(time.Now().UTC())
log.Info("attempting certificate renewal",
zap.Strings("identifiers", oldCert.Names),
zap.Duration("remaining", timeLeft))
// perform renewal - crucially, this happens OUTSIDE a lock on certCache
err := cfg.RenewCertAsync(ctx, renewName, false)
if err != nil {
if cfg.OnDemand != nil {
// loaded dynamically, remove dynamically
certCache.mu.Lock()
certCache.removeCertificate(oldCert)
certCache.mu.Unlock()
}
return fmt.Errorf("%v %v", oldCert.Names, err)
}
// successful renewal, so update in-memory cache by loading
// renewed certificate so it will be used with handshakes
_, err = cfg.reloadManagedCertificate(ctx, oldCert)
if err != nil {
return ErrNoRetry{fmt.Errorf("%v %v", oldCert.Names, err)}
}
return nil
})
return nil
}
// updateOCSPStaples updates the OCSP stapling in all
// eligible, cached certificates.
//
// OCSP maintenance strives to abide by the relevant points of
// Ryan Sleevi's recommendations for good OCSP support:
// https://gist.github.com/sleevi/5efe9ef98961ecfb4da8
func (certCache *Cache) updateOCSPStaples(ctx context.Context) {
logger := certCache.logger.Named("maintenance")
// temporary structures to store updates or tasks
// so that we can keep our locks short-lived
type ocspUpdate struct {
rawBytes []byte
parsed *ocsp.Response
}
type updateQueueEntry struct {
cert Certificate
certHash string
lastNextUpdate time.Time
cfg *Config
}
type renewQueueEntry struct {
oldCert Certificate
cfg *Config
}
updated := make(map[string]ocspUpdate)
var updateQueue []updateQueueEntry // certs that need a refreshed staple
var renewQueue []renewQueueEntry // certs that need to be renewed (due to revocation)
// obtain brief read lock during our scan to see which staples need updating
certCache.mu.RLock()
for certHash, cert := range certCache.cache {
// no point in updating OCSP for expired or "synthetic" certificates
if cert.Leaf == nil || cert.Expired() {
continue
}
cfg, err := certCache.getConfig(cert)
if err != nil {
logger.Error("unable to get automation config for certificate; maintenance for this certificate will likely fail",
zap.Strings("identifiers", cert.Names),
zap.Error(err))
continue
}
// always try to replace revoked certificates, even if OCSP response is still fresh
if certShouldBeForceRenewed(cert) {
renewQueue = append(renewQueue, renewQueueEntry{
oldCert: cert,
cfg: cfg,
})
continue
}
// if the status is not fresh, get a new one
var lastNextUpdate time.Time
if cert.ocsp != nil {
lastNextUpdate = cert.ocsp.NextUpdate
if cert.ocsp.Status != ocsp.Unknown && freshOCSP(cert.ocsp) {
// no need to update our staple if still fresh and not Unknown
continue
}
}
updateQueue = append(updateQueue, updateQueueEntry{cert, certHash, lastNextUpdate, cfg})
}
certCache.mu.RUnlock()
// perform updates outside of any lock on certCache
for _, qe := range updateQueue {
cert := qe.cert
certHash := qe.certHash
lastNextUpdate := qe.lastNextUpdate
if qe.cfg == nil {
// this is bad if this happens, probably a programmer error (oops)
logger.Error("no configuration associated with certificate; unable to manage OCSP staples",
zap.Strings("identifiers", cert.Names))
continue
}
err := stapleOCSP(ctx, qe.cfg.OCSP, qe.cfg.Storage, &cert, nil)
if err != nil {
if cert.ocsp != nil {
// if there was no staple before, that's fine; otherwise we should log the error
logger.Error("stapling OCSP",
zap.Strings("identifiers", cert.Names),
zap.Error(err))
}
continue
}
// By this point, we've obtained the latest OCSP response.
// If there was no staple before, or if the response is updated, make
// sure we apply the update to all names on the certificate if
// the status is still Good.
if cert.ocsp != nil && cert.ocsp.Status == ocsp.Good && (lastNextUpdate.IsZero() || lastNextUpdate != cert.ocsp.NextUpdate) {
logger.Info("advancing OCSP staple",
zap.Strings("identifiers", cert.Names),
zap.Time("from", lastNextUpdate),
zap.Time("to", cert.ocsp.NextUpdate))
updated[certHash] = ocspUpdate{rawBytes: cert.Certificate.OCSPStaple, parsed: cert.ocsp}
}
// If the updated staple shows that the certificate was revoked, we should immediately renew it
if certShouldBeForceRenewed(cert) {
qe.cfg.emit(ctx, "cert_ocsp_revoked", map[string]any{
"subjects": cert.Names,
"certificate": cert,
"reason": cert.ocsp.RevocationReason,
"revoked_at": cert.ocsp.RevokedAt,
})
renewQueue = append(renewQueue, renewQueueEntry{
oldCert: cert,
cfg: qe.cfg,
})
}
}
// These write locks should be brief since we have all the info we need now.
for certKey, update := range updated {
certCache.mu.Lock()
if cert, ok := certCache.cache[certKey]; ok {
cert.ocsp = update.parsed
cert.Certificate.OCSPStaple = update.rawBytes
certCache.cache[certKey] = cert
}
certCache.mu.Unlock()
}
// We attempt to replace any certificates that were revoked.
// Crucially, this happens OUTSIDE a lock on the certCache.
for _, renew := range renewQueue {
_, err := renew.cfg.forceRenew(ctx, logger, renew.oldCert)
if err != nil {
logger.Info("forcefully renewing certificate due to REVOKED status",
zap.Strings("identifiers", renew.oldCert.Names),
zap.Error(err))
}
}
}
// storageHasNewerARI returns true if the configured storage has ARI that is newer
// than that of a certificate that is already loaded, along with the value from
// storage.
func (cfg *Config) storageHasNewerARI(ctx context.Context, cert Certificate) (bool, acme.RenewalInfo, error) {
storedCertData, err := cfg.loadStoredACMECertificateMetadata(ctx, cert)
if err != nil || storedCertData.RenewalInfo == nil {
return false, acme.RenewalInfo{}, err
}
// prefer stored info if it has a window and the loaded one doesn't,
// or if the one in storage has a later RetryAfter (though I suppose
// it's not guaranteed, typically those will move forward in time)
if (!cert.ari.HasWindow() && storedCertData.RenewalInfo.HasWindow()) ||
(cert.ari.RetryAfter == nil || storedCertData.RenewalInfo.RetryAfter.After(*cert.ari.RetryAfter)) {
return true, *storedCertData.RenewalInfo, nil
}
return false, acme.RenewalInfo{}, nil
}
// loadStoredACMECertificateMetadata loads the stored ACME certificate data
// from the cert's sidecar JSON file.
func (cfg *Config) loadStoredACMECertificateMetadata(ctx context.Context, cert Certificate) (acme.Certificate, error) {
metaBytes, err := cfg.Storage.Load(ctx, StorageKeys.SiteMeta(cert.issuerKey, cert.Names[0]))
if err != nil {
return acme.Certificate{}, fmt.Errorf("loading cert metadata: %w", err)
}
var certRes CertificateResource
if err = json.Unmarshal(metaBytes, &certRes); err != nil {
return acme.Certificate{}, fmt.Errorf("unmarshaling cert metadata: %w", err)
}
var acmeCert acme.Certificate
if err = json.Unmarshal(certRes.IssuerData, &acmeCert); err != nil {
return acme.Certificate{}, fmt.Errorf("unmarshaling potential ACME issuer metadata: %v", err)
}
return acmeCert, nil
}
// updateARI updates the cert's ACME renewal info, first by checking storage for a newer
// one, or getting it from the CA if needed. The updated info is stored in storage and
// updated in the cache. The certificate with the updated ARI is returned. If true is
// returned, the ARI window or selected time has changed, and the caller should check if
// the cert needs to be renewed now, even if there is an error.
//
// This will always try to update ARI without first checking whether it needs to be
// refreshed. Call NeedsRefresh() on the RenewalInfo first, and only call this if that returns true.
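//
// The expected call pattern, mirroring RenewManagedCertificates above
// (a sketch only):
//
//	if !cfg.DisableARI && cert.ari.NeedsRefresh() {
//	    cert, changed, err := cfg.updateARI(ctx, cert, logger)
//	    if err != nil {
//	        logger.Error("updating ARI", zap.Error(err))
//	    }
//	    if changed && cert.NeedsRenewal(cfg) {
//	        // queue the certificate for renewal
//	    }
//	}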
func (cfg *Config) updateARI(ctx context.Context, cert Certificate, logger *zap.Logger) (updatedCert Certificate, changed bool, err error) {
logger = logger.With(
zap.Strings("identifiers", cert.Names),
zap.String("cert_hash", cert.hash),
zap.String("ari_unique_id", cert.ari.UniqueIdentifier),
zap.Time("cert_expiry", cert.Leaf.NotAfter))
updatedCert = cert
oldARI := cert.ari
// synchronize ARI fetching; see #297
lockName := "ari_" + cert.ari.UniqueIdentifier
if err := acquireLock(ctx, cfg.Storage, lockName); err != nil {
return cert, false, fmt.Errorf("unable to obtain ARI lock: %v", err)
}
defer func() {
if err := releaseLock(ctx, cfg.Storage, lockName); err != nil {
logger.Error("unable to release ARI lock", zap.Error(err))
}
}()
// see if the stored value has been refreshed already by another instance
gotNewARI, newARI, err := cfg.storageHasNewerARI(ctx, cert)
// when we're all done, log if something about the schedule is different
// ("WARN" level because ARI window changing may be a sign of external trouble
// and we want to draw the operator's attention to a potential explanation URL)
defer func() {
changed = !newARI.SameWindow(oldARI)
if changed {
logger.Warn("ARI window or selected renewal time changed",
zap.Time("prev_start", oldARI.SuggestedWindow.Start),
zap.Time("next_start", newARI.SuggestedWindow.Start),
zap.Time("prev_end", oldARI.SuggestedWindow.End),
zap.Time("next_end", newARI.SuggestedWindow.End),
zap.Time("prev_selected_time", oldARI.SelectedTime),
zap.Time("next_selected_time", newARI.SelectedTime),
zap.String("explanation_url", newARI.ExplanationURL))
}
}()
if err == nil && gotNewARI {
// great, storage has a newer one we can use
cfg.certCache.mu.Lock()
updatedCert = cfg.certCache.cache[cert.hash]
updatedCert.ari = newARI
cfg.certCache.cache[cert.hash] = updatedCert
cfg.certCache.mu.Unlock()
logger.Info("reloaded ARI with newer one in storage",
zap.Timep("next_refresh", newARI.RetryAfter),
zap.Time("renewal_time", newARI.SelectedTime))
return
}
if err != nil {
logger.Error("error while checking storage for updated ARI; updating ARI now", zap.Error(err))
}
// of the issuers configured, hopefully one of them is the ACME CA we got the cert from
for _, iss := range cfg.Issuers {
if ariGetter, ok := iss.(RenewalInfoGetter); ok {
newARI, err = ariGetter.GetRenewalInfo(ctx, cert) // be sure to use existing newARI variable so we can compare against old value in the defer
if err != nil {
// could be anything, but a common error might simply be the "wrong" ACME CA
// (meaning, different from the one that issued the cert, thus the only one
// that would have any ARI for it) if multiple ACME CAs are configured
logger.Error("failed updating renewal info from ACME CA",
zap.String("issuer", iss.IssuerKey()),
zap.Error(err))
continue
}
// when we get the latest ARI, the acme package will select a time within the window
// for us; of course, since it's random, it's likely different from the previously-
// selected time; but if the window doesn't change, there's no need to change the
// selected time (the acme package doesn't know the previous window to know better)
// ... so if the window hasn't changed we'll just put back the selected time
if newARI.SameWindow(oldARI) && !oldARI.SelectedTime.IsZero() {
newARI.SelectedTime = oldARI.SelectedTime
}
// then store the updated ARI (even if the window didn't change, the Retry-After
// likely did) in cache and storage
// be sure we get the cert from the cache while inside a lock to avoid logical races
cfg.certCache.mu.Lock()
updatedCert = cfg.certCache.cache[cert.hash]
updatedCert.ari = newARI
cfg.certCache.cache[cert.hash] = updatedCert
cfg.certCache.mu.Unlock()
// update the ARI value in storage
var certData acme.Certificate
certData, err = cfg.loadStoredACMECertificateMetadata(ctx, cert)
if err != nil {
err = fmt.Errorf("got new ARI from %s, but failed loading stored certificate metadata: %v", iss.IssuerKey(), err)
return
}
certData.RenewalInfo = &newARI
var certDataBytes, certResBytes []byte
certDataBytes, err = json.Marshal(certData)
if err != nil {
err = fmt.Errorf("got new ARI from %s, but failed marshaling certificate ACME metadata: %v", iss.IssuerKey(), err)
return
}
certResBytes, err = json.MarshalIndent(CertificateResource{
SANs: cert.Names,
IssuerData: certDataBytes,
}, "", "\t")
if err != nil {
err = fmt.Errorf("got new ARI from %s, but could not re-encode certificate metadata: %v", iss.IssuerKey(), err)
return
}
if err = cfg.Storage.Store(ctx, StorageKeys.SiteMeta(cert.issuerKey, cert.Names[0]), certResBytes); err != nil {
err = fmt.Errorf("got new ARI from %s, but could not store it with certificate metadata: %v", iss.IssuerKey(), err)
return
}
logger.Info("updated ACME renewal information",
zap.Time("selected_time", newARI.SelectedTime),
zap.Timep("next_update", newARI.RetryAfter),
zap.String("explanation_url", newARI.ExplanationURL))
return
}
}
err = fmt.Errorf("could not fully update ACME renewal info: either no issuer supporting ARI is configured for certificate, or all such failed (make sure the ACME CA that issued the certificate is configured)")
return
}
// CleanStorageOptions specifies how to clean up a storage unit.
type CleanStorageOptions struct {
// Optional custom logger.
Logger *zap.Logger
// Optional ID of the instance initiating the cleaning.
InstanceID string
// If set, cleaning will be skipped if it was performed
// more recently than this interval.
Interval time.Duration
// Whether to clean cached OCSP staples.
OCSPStaples bool
// Whether to clean up expired certificates, and if so,
// how long to let them stay after they've expired.
ExpiredCerts bool
ExpiredCertGracePeriod time.Duration
}
// CleanStorage removes assets which are no longer useful,
// according to opts.
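//
// Example usage (a sketch; the interval and grace period values are
// illustrative, not recommendations):
//
//	err := CleanStorage(ctx, storage, CleanStorageOptions{
//	    Interval:               24 * time.Hour,
//	    OCSPStaples:            true,
//	    ExpiredCerts:           true,
//	    ExpiredCertGracePeriod: 30 * 24 * time.Hour,
//	})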
func CleanStorage(ctx context.Context, storage Storage, opts CleanStorageOptions) error {
const (
lockName = "storage_clean"
storageKey = "last_clean.json"
)
if opts.Logger == nil {
opts.Logger = defaultLogger.Named("clean_storage")
}
opts.Logger = opts.Logger.With(zap.Any("storage", storage))
// storage cleaning should be globally exclusive
if err := acquireLock(ctx, storage, lockName); err != nil {
return fmt.Errorf("unable to acquire %s lock: %v", lockName, err)
}
defer func() {
if err := releaseLock(ctx, storage, lockName); err != nil {
opts.Logger.Error("unable to release lock", zap.Error(err))
return
}
}()
// cleaning should not happen more often than the interval
if opts.Interval > 0 {
lastCleanBytes, err := storage.Load(ctx, storageKey)
if !errors.Is(err, fs.ErrNotExist) {
if err != nil {
return fmt.Errorf("loading last clean timestamp: %v", err)
}
var lastClean lastCleanPayload
err = json.Unmarshal(lastCleanBytes, &lastClean)
if err != nil {
return fmt.Errorf("decoding last clean data: %v", err)
}
lastTLSClean := lastClean["tls"]
if time.Since(lastTLSClean.Timestamp) < opts.Interval {
nextTime := time.Now().Add(opts.Interval)
opts.Logger.Info("storage cleaning happened too recently; skipping for now",
zap.String("instance", lastTLSClean.InstanceID),
zap.Time("try_again", nextTime),
zap.Duration("try_again_in", time.Until(nextTime)),
)
return nil
}
}
}
opts.Logger.Info("cleaning storage unit")
if opts.OCSPStaples {
err := deleteOldOCSPStaples(ctx, storage, opts.Logger)
if err != nil {
opts.Logger.Error("deleting old OCSP staples", zap.Error(err))
}
}
if opts.ExpiredCerts {
err := deleteExpiredCerts(ctx, storage, opts.Logger, opts.ExpiredCertGracePeriod)
if err != nil {
opts.Logger.Error("deleting expired certificates staples", zap.Error(err))
}
}
// TODO: delete stale locks?
// update the last-clean time
lastCleanBytes, err := json.Marshal(lastCleanPayload{
"tls": lastCleaned{
Timestamp: time.Now(),
InstanceID: opts.InstanceID,
},
})
if err != nil {
return fmt.Errorf("encoding last cleaned info: %v", err)
}
if err := storage.Store(ctx, storageKey, lastCleanBytes); err != nil {
return fmt.Errorf("storing last clean info: %v", err)
}
return nil
}
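// lastCleanPayload maps a subsystem name (currently just "tls") to a record of
// its most recent cleaning. For reference, the JSON persisted at the
// "last_clean.json" storage key looks roughly like this (values illustrative):
//
//	{
//	    "tls": {
//	        "timestamp": "2024-01-02T03:04:05Z",
//	        "instance_id": "some-instance-id"
//	    }
//	}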
type lastCleanPayload map[string]lastCleaned
type lastCleaned struct {
Timestamp time.Time `json:"timestamp"`
InstanceID string `json:"instance_id,omitempty"`
}
func deleteOldOCSPStaples(ctx context.Context, storage Storage, logger *zap.Logger) error {
ocspKeys, err := storage.List(ctx, prefixOCSP, false)
if err != nil {
// maybe just hasn't been created yet; no big deal
return nil
}
for _, key := range ocspKeys {
// if context was cancelled, quit early; otherwise proceed
select {
case <-ctx.Done():
return ctx.Err()
default:
}
ocspBytes, err := storage.Load(ctx, key)
if err != nil {
logger.Error("while deleting old OCSP staples, unable to load staple file", zap.Error(err))
continue
}
resp, err := ocsp.ParseResponse(ocspBytes, nil)
if err != nil {
// contents are invalid; delete it
err = storage.Delete(ctx, key)
if err != nil {
logger.Error("purging corrupt staple file", zap.String("storage_key", key), zap.Error(err))
}
continue
}
if time.Now().After(resp.NextUpdate) {
// response has expired; delete it
err = storage.Delete(ctx, key)
if err != nil {
logger.Error("purging expired staple file", zap.String("storage_key", key), zap.Error(err))
}
}
}
return nil
}
func deleteExpiredCerts(ctx context.Context, storage Storage, logger *zap.Logger, gracePeriod time.Duration) error {
issuerKeys, err := storage.List(ctx, prefixCerts, false)
if err != nil {
// maybe just hasn't been created yet; no big deal
return nil
}
for _, issuerKey := range issuerKeys {
siteKeys, err := storage.List(ctx, issuerKey, false)
if err != nil {
logger.Error("listing contents", zap.String("issuer_key", issuerKey), zap.Error(err))
continue
}
for _, siteKey := range siteKeys {
// if context was cancelled, quit early; otherwise proceed
select {
case <-ctx.Done():
return ctx.Err()
default:
}
siteAssets, err := storage.List(ctx, siteKey, false)
if err != nil {
logger.Error("listing site contents", zap.String("site_key", siteKey), zap.Error(err))
continue
}
for _, assetKey := range siteAssets {
if path.Ext(assetKey) != ".crt" {
continue
}
certFile, err := storage.Load(ctx, assetKey)
if err != nil {
return fmt.Errorf("loading certificate file %s: %v", assetKey, err)
}
block, _ := pem.Decode(certFile)
if block == nil || block.Type != "CERTIFICATE" {
return fmt.Errorf("certificate file %s does not contain PEM-encoded certificate", assetKey)
}
cert, err := x509.ParseCertificate(block.Bytes)
if err != nil {
return fmt.Errorf("certificate file %s is malformed; error parsing PEM: %v", assetKey, err)
}
if expiredTime := time.Since(expiresAt(cert)); expiredTime >= gracePeriod {
logger.Info("certificate expired beyond grace period; cleaning up",
zap.String("asset_key", assetKey),
zap.Duration("expired_for", expiredTime),
zap.Duration("grace_period", gracePeriod))
baseName := strings.TrimSuffix(assetKey, ".crt")
for _, relatedAsset := range []string{
assetKey,
baseName + ".key",
baseName + ".json",
} {
logger.Info("deleting asset because resource expired", zap.String("asset_key", relatedAsset))
err := storage.Delete(ctx, relatedAsset)
if err != nil {
logger.Error("could not clean up asset related to expired certificate",
zap.String("base_name", baseName),
zap.String("related_asset", relatedAsset),
zap.Error(err))
}
}
}
}
// update listing; if folder is empty, delete it
siteAssets, err = storage.List(ctx, siteKey, false)
if err != nil {
continue
}
if len(siteAssets) == 0 {
logger.Info("deleting site folder because key is empty", zap.String("site_key", siteKey))
err := storage.Delete(ctx, siteKey)
if err != nil {
return fmt.Errorf("deleting empty site folder %s: %v", siteKey, err)
}
}
}
}
return nil
}

// forceRenew forcefully renews cert and replaces it in the cache, and returns the new certificate. It is intended
// for use primarily in the case of cert revocation. This MUST NOT be called within a lock on cfg.certCacheMu.
func (cfg *Config) forceRenew(ctx context.Context, logger *zap.Logger, cert Certificate) (Certificate, error) {
if cert.ocsp != nil && cert.ocsp.Status == ocsp.Revoked {
logger.Warn("OCSP status for managed certificate is REVOKED; attempting to replace with new certificate",
zap.Strings("identifiers", cert.Names),
zap.Time("expiration", expiresAt(cert.Leaf)))
} else {
logger.Warn("forcefully renewing certificate",
zap.Strings("identifiers", cert.Names),
zap.Time("expiration", expiresAt(cert.Leaf)))
}
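// the obtain/renew operations below are performed using the certificate's first name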
renewName := cert.Names[0]
// if revoked for key compromise, we can't be sure whether the storage of
// the key is still safe; however, we KNOW the old key is not safe, and we
// can only hope by the time of revocation that storage has been secured;
// key management is not something we want to get into, but in this case
// it seems prudent to replace the key - and since renewal requires reuse
// of a prior key, we can't do a "renew" to replace the cert if we need a
// new key, so we'll have to do an obtain instead
var obtainInsteadOfRenew bool
if cert.ocsp != nil && cert.ocsp.RevocationReason == acme.ReasonKeyCompromise {
err := cfg.moveCompromisedPrivateKey(ctx, cert, logger)
if err != nil {
logger.Error("could not remove compromised private key from use",
zap.Strings("identifiers", cert.Names),
zap.String("issuer", cert.issuerKey),
zap.Error(err))
}
obtainInsteadOfRenew = true
}
var err error
if obtainInsteadOfRenew {
err = cfg.ObtainCertAsync(ctx, renewName)
} else {
// notice that we force renewal; otherwise, it might see that the
// certificate isn't close to expiring and return, but we really
// need a replacement certificate! see issue #4191
err = cfg.RenewCertAsync(ctx, renewName, true)
}
if err != nil {
if cert.ocsp != nil && cert.ocsp.Status == ocsp.Revoked {
// probably better to not serve a revoked certificate at all
logger.Error("unable to obtain new to certificate after OCSP status of REVOKED; removing from cache",
zap.Strings("identifiers", cert.Names),
zap.Error(err))
cfg.certCache.mu.Lock()
cfg.certCache.removeCertificate(cert)
cfg.certCache.mu.Unlock()
}
return cert, fmt.Errorf("unable to forcefully get new certificate for %v: %w", cert.Names, err)
}
return cfg.reloadManagedCertificate(ctx, cert)
}

// moveCompromisedPrivateKey moves the private key for cert to a ".compromised" file
// by copying the data to the new file, then deleting the old one.
func (cfg *Config) moveCompromisedPrivateKey(ctx context.Context, cert Certificate, logger *zap.Logger) error {
privKeyStorageKey := StorageKeys.SitePrivateKey(cert.issuerKey, cert.Names[0])
privKeyPEM, err := cfg.Storage.Load(ctx, privKeyStorageKey)
if err != nil {
return err
}
compromisedPrivKeyStorageKey := privKeyStorageKey + ".compromised"
err = cfg.Storage.Store(ctx, compromisedPrivKeyStorageKey, privKeyPEM)
if err != nil {
// better safe than sorry: as a last resort, try deleting the key so it won't be reused
cfg.Storage.Delete(ctx, privKeyStorageKey)
return err
}
err = cfg.Storage.Delete(ctx, privKeyStorageKey)
if err != nil {
return err
}
logger.Info("removed certificate's compromised private key from use",
zap.String("storage_path", compromisedPrivKeyStorageKey),
zap.Strings("identifiers", cert.Names),
zap.String("issuer", cert.issuerKey))
return nil
}

// certShouldBeForceRenewed returns true if cert should be forcefully renewed
// (like if it is revoked according to its OCSP response).
func certShouldBeForceRenewed(cert Certificate) bool {
return cert.managed &&
len(cert.Names) > 0 &&
cert.ocsp != nil &&
cert.ocsp.Status == ocsp.Revoked
}
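
// certList is a list of certificates with set-like insertion,
// deduplicated by certificate hash.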
type certList []Certificate

// insert appends cert to the list if it is not already in the list.
// Efficiency: O(n)
func (certs *certList) insert(cert Certificate) {
for _, c := range *certs {
if c.hash == cert.hash {
return
}
}
*certs = append(*certs, cert)
}
const (
// DefaultRenewCheckInterval is how often to check certificates for expiration.
// Scans are very lightweight, so this can be semi-frequent. This default should
// be smaller than <Minimum Cert Lifetime>*DefaultRenewalWindowRatio/3, which
// gives certificates plenty of chance to be renewed on time.
DefaultRenewCheckInterval = 10 * time.Minute
// DefaultRenewalWindowRatio is how much of a certificate's lifetime becomes the
// renewal window. The renewal window is the span of time at the end of the
// certificate's validity period in which it should be renewed. A default value
// of ~1/3 is pretty safe and recommended for most certificates.
DefaultRenewalWindowRatio = 1.0 / 3.0
// DefaultOCSPCheckInterval is how often to check if OCSP stapling needs updating.
DefaultOCSPCheckInterval = 1 * time.Hour
)