We can't assume the ARI-supporting issuer types are exactly *ACMEIssuer; they may be implemented by third-party packages (such as caddytls.ACMEIssuer).
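A minimal sketch of the idea, not CertMagic's actual API: instead of asserting a concrete type, check for the capability through an interface so third-party issuers qualify too. The names `Issuer`, `renewalInfoGetter`, and the three issuer types below are hypothetical stand-ins.

```go
package main

import "fmt"

// Issuer is a hypothetical stand-in for a generic certificate issuer.
type Issuer interface {
	IssuerKey() string
}

// renewalInfoGetter is a hypothetical capability interface for issuers
// that can fetch ACME Renewal Information (ARI).
type renewalInfoGetter interface {
	RenewalInfo(certID string) (string, error)
}

// builtinIssuer plays the role of the library's own ACME issuer.
type builtinIssuer struct{}

func (builtinIssuer) IssuerKey() string { return "builtin" }
func (builtinIssuer) RenewalInfo(certID string) (string, error) {
	return "window for " + certID, nil
}

// thirdPartyIssuer is defined "elsewhere" but also supports ARI.
type thirdPartyIssuer struct{}

func (thirdPartyIssuer) IssuerKey() string { return "third-party" }
func (thirdPartyIssuer) RenewalInfo(certID string) (string, error) {
	return "window for " + certID, nil
}

// plainIssuer does not support ARI at all.
type plainIssuer struct{}

func (plainIssuer) IssuerKey() string { return "plain" }

// supportsARI checks for the capability rather than a concrete type, so it
// works for any issuer that implements the interface.
func supportsARI(iss Issuer) bool {
	_, ok := iss.(renewalInfoGetter)
	return ok
}

func main() {
	for _, iss := range []Issuer{builtinIssuer{}, thirdPartyIssuer{}, plainIssuer{}} {
		fmt.Printf("%s supports ARI: %v\n", iss.IssuerKey(), supportsARI(iss))
	}
}
```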
Cleaning storage now obtains a lock, and it can optionally be configured
to only happen once per interval.
This should help lower costs for expensive storage backends
that are used by clusters of CertMagic/Caddy instances.
These are useful for advanced applications (like Caddy) which would
like to remove certificates from the
cache in a controlled way, and operate the
cache with new settings while running.
Eliminates a bajillion nil checks and footguns
(except in tests, which bypass exported APIs, but that is expected)
Most recent #207
Logging can still be disabled via zap.NewNop(), if necessary.
(But disabling logging in CertMagic is a really bad idea.)
* Fix crash because of a zero value cert in cache
Check that a cert is still in the cache before updating its
ocsp & OCSPStaple fields.
Why: while updateOCSPStaples() loops, any cert can concurrently be
removed from a full cache to make room.
* Update maintain.go
Co-authored-by: Matt Holt <mholt@users.noreply.github.com>
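The zero-value-cert crash fix above can be illustrated roughly as follows. This is a sketch under assumptions: the `cache` and `cachedCert` types and the `updateStaple` method are hypothetical and much simpler than the real cache. The point is only that the update re-checks membership under the lock instead of blindly writing back.

```go
package main

import (
	"fmt"
	"sync"
)

// cachedCert is a hypothetical, simplified cached certificate.
type cachedCert struct {
	hash       string
	ocspStaple []byte
}

// cache is a hypothetical in-memory certificate cache keyed by hash.
type cache struct {
	mu    sync.Mutex
	certs map[string]cachedCert
}

func newCache() *cache { return &cache{certs: make(map[string]cachedCert)} }

func (c *cache) add(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.certs[hash] = cachedCert{hash: hash}
}

// remove simulates eviction from a full cache.
func (c *cache) remove(hash string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.certs, hash)
}

// updateStaple stores a fresh staple only if the cert is still cached.
// Without the ok check, a concurrent eviction would cause a zero-value
// cert to be written back into the cache.
func (c *cache) updateStaple(hash string, staple []byte) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	cert, ok := c.certs[hash]
	if !ok {
		return false // evicted in the meantime; nothing to update
	}
	cert.ocspStaple = staple
	c.certs[hash] = cert
	return true
}

func main() {
	c := newCache()
	c.add("abc123")
	fmt.Println("updated while cached:", c.updateStaple("abc123", []byte("staple")))
	c.remove("abc123")
	fmt.Println("updated after eviction:", c.updateStaple("abc123", []byte("staple")))
	fmt.Println("certs in cache:", len(c.certs))
}
```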
* Add context propagation to the Storage interface
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
* Bump to Go 1.17
* Minor cleanup
* filestorage: Honor context cancellation in List()
Co-authored-by: Matthew Holt <mholt@users.noreply.github.com>
* Fix force-renewing revoked on-demand certs
Follow-up to 9245be5a2f
* One more fix for on-demand logic of revoked certs
* OCSP revocation checks at startup, too
Required significant refactoring, hope it works.
Yet again way too late at night for this...
When I initially wrote the auto-replace feature, it was for the standard mode of operation,
which I presumed the vast majority of CertMagic deployments use. At the time, On-Demand
mode of operation was fairly niche. And at the time, it looked tricky to properly enable this feature for on-demand certificates, so I shelved it considering it would be low-impact anyway.
So on-demand certificates didn't benefit from auto-replace in the case of revocation (oh well,
no other servers / ACME clients do that at all anyway).
I guess since that time, the use of CertMagic's exclusive on-demand feature has risen in
popularity. But there is no way to tell: Caddy has no telemetry, so I had no real way of
knowing whether the feature sees any significant use. (We used to
have telemetry -- benign, anonymous technical stats to help us understand usage -- but
unfortunately public backlash forced us to end the program.) Based on public feedback
forced by external events, it seems that on-demand TLS deployments are probably rare,
but each of those few deployments actually serves thousands of sites/domains. (The
true importance of this feature would have been clear months ago if Caddy had telemetry,
as Caddy is the primary importer of CertMagic.)
This commit should enable auto-replace for on-demand certificates. It required some
refactoring and some decisions that I'm not *entirely* sure are right, but that's how it
goes.
I haven't tested this. (Last time I worked on this feature it took me about 2 days to test properly.)
* Begin refactor of ObtainCert and RenewCert to allow force renews
* Don't reuse private key in case of revocation due to key compromise
* Improve logging in renew
* Run OCSP check at start of cache maintenance
Otherwise we wait until first tick (currently 1 hour) which might be too long
* Fix obtain; move some things around
Obtain now tries to reuse the private key if it exists, but if it doesn't exist, that shouldn't be an error (so we clear the error in that case).
Moved the removal of compromised private keys so that the logging makes more sense.
On-demand certs are managed at handshake-time. Doing so in the background was
a temporary holdover until on-demand maintenance improved, which it since has.
Since background maintenance did not consult the "ask" endpoint or decision func,
it would sometimes renew certificates that were not desirable to renew.
See https://caddy.community/t/clean-up-caddy-certificates/11429/11?u=matt
If the machine goes to sleep or the process gets suspended, background
maintenance won't happen, so we need to check for expiration of all
managed, on-demand certificates at every handshake. Fortunately, this is
pretty cheap because it's simple date math.
https://caddy.community/t/local-certificates-not-renewing-on-demand/9482
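The "simple date math" done at each handshake might look something like the sketch below. The function name `needsRenewal`, the helper `mustParse`, and the choice of renewing in the last third of the lifetime are assumptions for illustration, not necessarily what the library does.

```go
package main

import (
	"fmt"
	"time"
)

// needsRenewal reports whether now falls inside the renewal window, defined
// here (as an assumption) as the final windowRatio fraction of the
// certificate's lifetime. An expired certificate always needs renewal.
func needsRenewal(notBefore, notAfter, now time.Time, windowRatio float64) bool {
	lifetime := notAfter.Sub(notBefore)
	window := time.Duration(float64(lifetime) * windowRatio)
	return now.After(notAfter.Add(-window))
}

// mustParse is a small helper for the example; it panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	// A 90-day certificate; with a 1/3 ratio the window opens 30 days before expiry.
	notBefore := mustParse("2024-01-01T00:00:00Z")
	notAfter := mustParse("2024-03-31T00:00:00Z")

	for _, now := range []string{
		"2024-02-01T00:00:00Z", // early in the lifetime
		"2024-03-15T00:00:00Z", // inside the window
		"2024-04-10T00:00:00Z", // already expired (e.g. machine was asleep)
	} {
		fmt.Printf("%s -> renew: %v\n", now, needsRenewal(notBefore, notAfter, mustParse(now), 1.0/3.0))
	}
}
```

Because the check only compares timestamps already held in memory, running it on every handshake costs very little, which is what makes it a workable safety net when background maintenance was suspended.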
Logging is now configurable through setting the Logging field on the
various relevant struct types. This is a more useful, consistent, and
higher-performing experience with logs than the std lib logger we used
before.
This isn't a 100% complete transition because there are some parts of
the code base that don't have obvious or easy access to a logger.
They are mostly fringe/edge cases though, and most are error logs, so
you shouldn't see them under normal circumstances. They still emit to
the std lib logger, so it's not like any errors get hidden: they are
just unstructured until we find a way to give them access to a logger.
This allows two certs (say, RSA and ECDSA) for the same names to be
loaded, and CertMagic will consider which one the client supports and
use that.
We used to extract just select fields from the leaf certificate so that
we didn't need to fill memory with more data than necessary, but in
order to use the stdlib's SupportsCertificate() method, we have to keep
the full tls.Certificate.Leaf field set for speed during handshakes.
This allows CertMagic to accommodate certificates with extremely short
lifetimes (new defaults work with cert lifetimes < 24h, but I wouldn't
want to push it < 30m with these defaults).
Breaking changes; thank goodness we're not 1.0 yet 😅 - read on!
This change completely separates ACME-specific code from the rest of the
certificate management process, allowing pluggable sources for certs
that aren't ACME.
Notably, most of Config was spliced into ACMEManager. Similarly, there's
now Default and DefaultACME.
Storage structure had to be reconfigured. Certificates are no longer in
the acme/ subfolder since they can be obtained by ways other than ACME!
Certificates moved to a new certificates/ subfolder. The subfolders in
that folder use the path of the ACME endpoint instead of just the host,
so that also changed. Be aware that unless you move your certs over,
CertMagic will not find them and will attempt to get new ones. That is
usually fine for most users, but for extremely large deployments, you
will want to move them over first.
Old certs path:
acme/acme-staging-v02.api.letsencrypt.org/...
New certs path:
certificates/acme-staging-v02.api.letsencrypt.org-directory/...
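For anyone scripting the move, here is a rough sketch of how the two prefixes relate. The mapping is inferred only from the two example paths above (host plus URL path, with slashes flattened to dashes); the function names are hypothetical and the real key derivation may handle more cases, such as ports.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// issuerKey flattens an ACME directory URL into one path-safe segment.
// Assumption: host and URL path are joined with "-", as the example suggests.
func issuerKey(directoryURL string) (string, error) {
	u, err := url.Parse(directoryURL)
	if err != nil {
		return "", err
	}
	key := strings.Trim(u.Host+u.Path, "/")
	return strings.ReplaceAll(key, "/", "-"), nil
}

// oldCertsPrefix mirrors the previous layout: acme/<host>
func oldCertsPrefix(directoryURL string) (string, error) {
	u, err := url.Parse(directoryURL)
	if err != nil {
		return "", err
	}
	return path.Join("acme", u.Host), nil
}

// newCertsPrefix mirrors the new layout: certificates/<host>-<path>
func newCertsPrefix(directoryURL string) (string, error) {
	key, err := issuerKey(directoryURL)
	if err != nil {
		return "", err
	}
	return path.Join("certificates", key), nil
}

func main() {
	const dir = "https://acme-staging-v02.api.letsencrypt.org/directory"
	oldP, err := oldCertsPrefix(dir)
	if err != nil {
		panic(err)
	}
	newP, err := newCertsPrefix(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println("old:", oldP)
	fmt.Println("new:", newP)
}
```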
That's all for significant storage changes!
But this refactor also vastly improves performance, especially at scale,
and makes CertMagic way more resilient to errors. Retries are done on
the staging endpoint by default, so they won't count against your rate
limit. If your hardware can handle it, I'm now pretty confident that you
can give CertMagic a million domain names and it will gracefully manage
them, as fast as it can within internal and external rate limits, even
in the presence of errors. Errors will of course slow some things down,
but you should be good to go if you're monitoring logs and can fix any
misconfigurations or other external errors!
Several other mostly-minor enhancements fix bugs, especially at scale.
For example, duplicated renewal tasks (that continuously fail) will not
pile up on each other: only one will operate, under exponential backoff.
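To illustrate the de-duplication idea, here is a small sketch, assuming a hypothetical `jobManager` keyed by job name. It is not the library's actual job queue: a second submission under the same name is dropped while the first is still running, and the running job retries with a capped, doubling delay.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// jobManager is a hypothetical de-duplicating job runner.
type jobManager struct {
	mu        sync.Mutex
	active    map[string]bool
	base, max time.Duration
}

func newJobManager(base, max time.Duration) *jobManager {
	return &jobManager{active: make(map[string]bool), base: base, max: max}
}

// backoffDelay doubles base once per prior attempt, capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// submit starts job in the background unless a job with the same name is
// already running. The job is retried until it succeeds (a real
// implementation would also bound the retries or honor a context).
func (jm *jobManager) submit(name string, job func() error) bool {
	jm.mu.Lock()
	if jm.active[name] {
		jm.mu.Unlock()
		return false // duplicate: the running job already covers this work
	}
	jm.active[name] = true
	jm.mu.Unlock()

	go func() {
		defer func() {
			jm.mu.Lock()
			delete(jm.active, name)
			jm.mu.Unlock()
		}()
		for attempt := 0; ; attempt++ {
			if err := job(); err == nil {
				return
			}
			time.Sleep(backoffDelay(attempt, jm.base, jm.max))
		}
	}()
	return true
}

func main() {
	jm := newJobManager(10*time.Millisecond, 80*time.Millisecond)
	done := make(chan struct{})
	attempts := 0
	job := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("temporary failure")
		}
		close(done)
		return nil
	}
	fmt.Println("first submit accepted:", jm.submit("renew_example.com", job))
	fmt.Println("duplicate accepted:", jm.submit("renew_example.com", job))
	<-done
	fmt.Println("attempts:", attempts)
}
```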
Closes #50 and fixes #55
The previous rate limiter design did not allow reservation cancellation.
This became problematic with lots of config reloads in Caddy for large
numbers of domain names. While the rate limiter had a backlog, a new
config would come in and add even more to the rate limiter, and even
more over time as background maintenance (renewals) kicked in. This
leaked goroutines and memory as a side-effect, and blocked the issuance
of certificates nigh indefinitely.
The new rate limiter does not make future reservations like the previous
one did. However, this requires us to run a single scheduler goroutine
when a rate limiter is created, and that goroutine must be cleaned up when the
rate limiter is no longer needed. As rate limits are global and should
live for the life of the process, there is currently no actual cleanup
that takes place, but if it did happen, one would simply call Stop() on
the rate limiter to stop that goroutine.
With this new design, reservations are made only as the event actually
happens; implementing cancellation with the old design would have been
almost impossible to do correctly in a practical, elegant way. Although
the trade-off is an extra goroutine that needs cleaning up, this is
seldom (if ever?) needed in practice, and the benefit is that waiting
goroutines can be unblocked when their context is canceled. This allows
Caddy, for example, to reload configs often and cancel any goroutines
that were merely waiting on the rate limiter.
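The general shape of such a design can be sketched as below. This is a deliberately simplified illustration, not the actual rate limiter: it spaces events by a fixed interval instead of tracking a window, and the names `limiter`, `Wait`, `WaitTimeout`, and `errStopped` are assumptions. What it does show is the single scheduler goroutine, permission granted only when an event actually happens, and waiters that unblock when their context is canceled.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errStopped = errors.New("rate limiter stopped")

// limiter allows one event per interval. A single scheduler goroutine hands
// out permits; nothing is reserved ahead of time, so a canceled waiter
// leaves no state behind.
type limiter struct {
	permits  chan struct{}
	stop     chan struct{}
	stopOnce sync.Once
}

func newLimiter(interval time.Duration) *limiter {
	l := &limiter{permits: make(chan struct{}), stop: make(chan struct{})}
	go l.schedule(interval)
	return l
}

// schedule is the scheduler goroutine: it offers a permit, then, once a
// waiter takes it, pauses for the interval before offering the next one.
func (l *limiter) schedule(interval time.Duration) {
	for {
		select {
		case l.permits <- struct{}{}: // blocks until some waiter receives
		case <-l.stop:
			return
		}
		t := time.NewTimer(interval)
		select {
		case <-t.C:
		case <-l.stop:
			t.Stop()
			return
		}
	}
}

// Wait blocks until the event is allowed, ctx is canceled, or the limiter stops.
func (l *limiter) Wait(ctx context.Context) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	select {
	case <-l.permits:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	case <-l.stop:
		return errStopped
	}
}

// WaitTimeout is a convenience wrapper around Wait with a deadline.
func (l *limiter) WaitTimeout(d time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	return l.Wait(ctx)
}

// Stop terminates the scheduler goroutine; it is safe to call more than once.
func (l *limiter) Stop() { l.stopOnce.Do(func() { close(l.stop) }) }

func main() {
	l := newLimiter(time.Hour)
	defer l.Stop()
	fmt.Println("first:", l.WaitTimeout(time.Second))
	fmt.Println("second:", l.WaitTimeout(50*time.Millisecond))
}
```

In this sketch the second call gives up when its deadline passes, which is the behavior a config reload relies on: goroutines that were merely waiting can simply go away.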
Now, all Obtain, Renew, and Revoke calls accept a context that can be
canceled.
We also eliminate the acmeMu, a mutex that permitted only a single ACME
operation at a time by the process. It was our early, naive form of
rate limiting and should no longer be necessary.
On-demand obtain and renew do not yet use cancelable contexts, because
what defines the context of a TLS handshake is still unclear. We might
end up using a simple context with a timeout that is the maximum length
of a TLS handshake in practice, say, 1 minute.
This is a breaking change, but critical for larger deployments with very
dynamic configurations.
Split Manage() into ManageSync() and ManageAsync().
In accordance with developing best practices, ACME operations should be
allowed to happen in the background and not block server startup in
non-interactive environments.
We also no longer return an error during batch cert renewals, because
we always treat it as a background operation. (The ManageSync() method
can perform foreground renewal if that is desired.)
This allows for user-loaded certificates to be associated with arbitrary
values such as user-provided IDs or categories. This can be useful if
multiple certificates satisfy a ClientHello but a specific one still
needs to be chosen. See for example:
https://github.com/mholt/caddy/issues/2588
This is a breaking API change since we need to expose a tags parameter
to the caching functions, but we're not 1.0 yet so we will try this
API change and see how it goes.
* Significant refactor
This refactoring expands the capabilities of the library for advanced
use cases, as well as improving the overall architecture, including
possible memory leak fixes if used over a long period with many certs
loaded into memory. This refactor enables using different configs
depending on the certificate.
The public API has changed slightly, however, and arguably it is
slightly less convenient/elegant. I have never quite found the perfect
design for this package, and this certainly isn't it, but I think it's
better than what we had before.
There is still work to be done, but this is a good step forward. I've
decoupled Storage from Cache, and made it easier and more correct for
Configs (and Storage values) to be short-lived. Cache is the only value
that should be long-lived.
Note that CertMagic no longer automatically takes care of storage (i.e.
it used to delete old OCSP staples, but now it doesn't). The functions
to do this are still there and even exported, and now we expect the
application to call the cleanup functions when it wants to.
* Fix little oopsies
* Create Manager abstraction so obtain/renew isn't limited to ACME
Adding a recursive option to List(), which, if true, causes List to
act like a walk function.
Also differentiating between "terminal" keys and "non-terminal" in
KeyInfo, since sometimes directories are useful, like listing user
accounts.
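A rough sketch of what recursive versus non-recursive listing means, using an in-memory slice of keys in place of a real storage backend. The `keyInfo` type and `list` function here are hypothetical simplifications of the interface, and the sketch assumes a key is never both terminal and a prefix of another key.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// keyInfo is a simplified, hypothetical version of the storage key metadata.
type keyInfo struct {
	Key        string
	IsTerminal bool // false for "directory"-like keys that only hold other keys
}

// list returns the keys under prefix. Without recursive, only the immediate
// children are returned; with recursive, it behaves like a walk and returns
// every key beneath the prefix, including intermediate non-terminal keys.
func list(stored []string, prefix string, recursive bool) []keyInfo {
	base := ""
	if prefix != "" {
		base = strings.TrimSuffix(prefix, "/") + "/"
	}
	seen := make(map[string]bool) // key -> is terminal
	for _, k := range stored {
		if !strings.HasPrefix(k, base) {
			continue
		}
		parts := strings.Split(strings.TrimPrefix(k, base), "/")
		depth := 1
		if recursive {
			depth = len(parts)
		}
		for i := 0; i < depth; i++ {
			seen[base+strings.Join(parts[:i+1], "/")] = i == len(parts)-1
		}
	}
	out := make([]keyInfo, 0, len(seen))
	for k, terminal := range seen {
		out = append(out, keyInfo{Key: k, IsTerminal: terminal})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Key < out[j].Key })
	return out
}

func main() {
	stored := []string{
		"acme/users/a@example.com/a.json",
		"acme/users/b@example.com/b.json",
		"certificates/example.com/example.com.crt",
	}
	fmt.Println("non-recursive:")
	for _, ki := range list(stored, "acme/users", false) {
		fmt.Printf("  %s (terminal: %v)\n", ki.Key, ki.IsTerminal)
	}
	fmt.Println("recursive:")
	for _, ki := range list(stored, "acme/users", true) {
		fmt.Printf("  %s (terminal: %v)\n", ki.Key, ki.IsTerminal)
	}
}
```

The non-recursive call is the "listing user accounts" case: it returns the two account directories as non-terminal keys without descending into them.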