# Recovering the control plane
To recover from broken nodes in the control plane, use the "recover-control-plane.yml" playbook.
Examples of what broken means in this context:
- One or more bare-metal nodes suffer from unrecoverable hardware failure
- One or more nodes fail during patching or upgrading
- Etcd database corruption
- Other node-related failures leaving your control plane degraded or nonfunctional
Note that you need at least one functional node to be able to recover using this method.
## Runbook
- Backup what you can
- Provision new nodes to replace the broken ones
- Copy any broken etcd nodes into the "broken_etcd" group and make sure the "etcd_member_name" variable is set.
- Copy any broken control plane nodes into the "broken_kube_control_plane" group.
- Place the surviving nodes of the control plane first in the "etcd" and "kube_control_plane" groups
- Add the new nodes below the surviving control plane nodes in the "etcd" and "kube_control_plane" groups (see the inventory sketch after this list)
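To make the grouping concrete, here is a minimal inventory sketch. The hostnames (node1 through node3) and etcd member names are hypothetical; adapt them to your own inventory layout.

```ini
# Hypothetical layout: node1 survived, node2 is broken (etcd + control plane),
# node3 is the freshly provisioned replacement.
[broken_etcd]
node2 etcd_member_name=etcd2

[broken_kube_control_plane]
node2

[etcd]
node1 etcd_member_name=etcd1   ; surviving node first
node3 etcd_member_name=etcd3   ; new node below

[kube_control_plane]
node1
node3
```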
Then run the playbook with `--limit etcd,kube_control_plane` and increase the number of etcd retries by setting `-e etcd_retries=10` or something even larger. The number of retries required is difficult to predict.
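For instance, with a hypothetical inventory path of `inventory/mycluster/hosts.ini`, the invocation might look roughly like this (adjust connection options such as the remote user or privilege escalation to your environment):

```shell
ansible-playbook -i inventory/mycluster/hosts.ini recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_retries=10
```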
When finished you should have a fully working control plane again.
## Recover from lost quorum
The playbook attempts to figure out if the etcd quorum is intact. If quorum is lost, it will attempt to take a snapshot from the first node in the "etcd" group and restore from that. If you would like to restore from an alternate snapshot, set the path to that snapshot in the "etcd_snapshot" variable.
```shell
-e etcd_snapshot=/tmp/etcd_snapshot
```
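Put together with the options above, a restore from a specific snapshot could look like the following sketch (the inventory path and snapshot path are placeholders):

```shell
ansible-playbook -i inventory/mycluster/hosts.ini recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_retries=10 \
  -e etcd_snapshot=/tmp/etcd_snapshot
```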
## Caveats
- The playbook has only been tested with fairly small etcd databases.
- There may be disruptions while running the playbook.
- There are absolutely no guarantees.
If possible, try to break a test cluster in the same way that your target cluster is broken and practice recovering it before attempting the procedure on the real target cluster.