Cleaning storage now obtains a lock, and it can optionally be configured
to only happen once per interval.
This should help lower costs for expensive storage backends
that are used by clusters of CertMagic/Caddy instances.
Eliminates a bajillion nil checks and footguns
(except in tests, which bypass exported APIs, but that is expected)
Most recent #207
Logging can still be disabled via zap.NewNop(), if necessary.
(But disabling logging in CertMagic is a really bad idea.)
* Add context propagation to the Storage interface
Signed-off-by: Dave Henderson <dhenderson@gmail.com>
* Bump to Go 1.17
* Minor cleanup
* filestorage: Honor context cancellation in List()
Co-authored-by: Matthew Holt <mholt@users.noreply.github.com>
* Lock now takes a context and should honor cancellation
This allows callers to give up if they can't obtain a lock in a certain
timeframe and for resources to be cleaned up, avoiding potential
resource leaks.
Breaking change for any Storage implementations, sorry about that. (It's
why we're not 1.0 yet.) I'll reach out to known implementations; it's a
simple change.
* Rename obtainLock to acquireLock to be less ambiguous
In our package, "obtain" has a more common meaning related to certs
Breaking changes; thank goodness we're not 1.0 yet 😅 - read on!
This change completely separates ACME-specific code from the rest of the
certificate management process, allowing pluggable sources for certs
that aren't ACME.
Notably, most of Config was spliced into ACMEManager. Similarly, there's
now Default and DefaultACME.
Storage structure had to be reconfigured. Certificates are no longer in
the acme/ subfolder since they can be obtained by ways other than ACME!
Certificates moved to a new certificates/ subfolder. The subfolders in
that folder use the path of the ACME endpoint instead of just the host,
so that also changed. Be aware that unless you move your certs over,
CertMagic will not find them and will attempt to get new ones. That is
usually fine for most users, but for extremely large deployments, you
will want to move them over first.
Old certs path:
acme/acme-staging-v02.api.letsencrypt.org/...
New certs path:
certificates/acme-staging-v02.api.letsencrypt.org-directory/...
That's all for significant storage changes!
But this refactor also vastly improves performance, especially at scale,
and makes CertMagic way more resilient to errors. Retries are done on
the staging endpoint by default, so they won't count against your rate
limit. If your hardware can handle it, I'm now pretty confident that you
can give CertMagic a million domain names and it will gracefully manage
them, as fast as it can within internal and external rate limits, even
in the presence of errors. Errors will of course slow some things down,
but you should be good to go if you're monitoring logs and can fix any
misconfigurations or other external errors!
Several other mostly-minor enhancements fix bugs, especially at scale.
For example, duplicated renewal tasks (that continuously fail) will not
pile up on each other: only one will operate, under exponential backoff.
Closes#50 and fixes#55
* Significant refactor
This refactoring expands the capabilities of the library for advanced
use cases, as well as improving the overall architecture, including
possible memory leak fixes if used over a long period with many certs
loaded into memory. This refactor enables using different configs
depending on the certificate.
The public API has changed slightly, however, and arguably it is
slightly less convenient/elegant. I have never quite found the perfect
design for this package, and this certainly isn't it, but I think it's
better than what we had before.
There is still work to be done, but this is a good step forward. I've
decoupled Storage from Cache, and made it easier and more correct for
Configs (and Storage values) to be short-lived. Cache is the only value
that should be long-lived.
Note that CertMagic no longer automatically takes care of storage (i.e.
it used to delete old OCSP staples, but now it doesn't). The functions
to do this are still there and even exported, and now we expect the
application to call the cleanup functions when it wants to.
* Fix little oopsies
* Create Manager abstraction so obtain/renew isn't limited to ACME
* Replace TryLock and Wait with Lock, and check for idempotency (issue #5)
* Fix logic of lock waiter creation in FileStorage (+ improve client log)
* Return from Wait() if lock file becomes stale
* Remove racy deletion of empty lock folder
* move all (FileStorage) methods to (*FileStorage) so assignments to fields like fileStorageNameLocks aren't lost
* rework lock acquisition
* Create lockDir just before lock file creation to reduce the chance that another process calls Unlock() and removes lockDir while we were waiting, preventing us from creating the lock file.
* Use the same strategy that Wait() uses to avoid depending on internal state.
* fix unlock of unlocked mutex
* Move fileStorageNameLocksMu into FileStorage struct
* implement new lockfile removal strategy and simplify the lock acquisition loop.
* readme: Add link to full examples
* Rework file lock obtaining and waiting logic
* Remove not-useful optimization to simplify file-locking logic
It's still pretty early (day 2!) of the library so I'm OK with adding
a necessary method that removes locks that would become stale.
Also handle stale locks in the FileStorage implementation of Storage.
Adding a recursive option to List(), which, if true, causes List to
act like a walk function.
Also differentiating between "terminal" keys and "non-terminal" in
KeyInfo, since sometimes directories are useful, like listing user
accounts.
Also adjust clients so that they use the configured HTTPPort or
HTTPSPort for solving challenges, if different from the default
challenge port (not as preferred as the Alt*Port values, of course)