* control-plane: fix first_kube_control_plane delegation with kube_override_hostname
When kube_override_hostname is configured, the node names reported by
`kubectl get nodes` differ from the inventory_hostname known to Ansible.
This causes delegation failures in subsequent tasks since Ansible cannot
resolve the hostname from kubectl output to an inventory host.
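A minimal sketch of the failure mode, with hypothetical task names (the real role tasks differ):
```yaml
# Hypothetical illustration: kubectl reports the overridden node name,
# which is not a key in the Ansible inventory, so delegate_to cannot resolve it.
- name: Find the first control plane node as seen by the API server
  command: kubectl get nodes -o jsonpath='{.items[0].metadata.name}'
  register: first_cp_node

- name: Run a follow-up task on that node
  command: /bin/true
  # Fails when kube_override_hostname != inventory_hostname
  delegate_to: "{{ first_cp_node.stdout }}"
```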
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
* control-plane: remove fragile first_control_plane selection logic
Current implementation breaks with kube_override_hostname and has
multiple edge cases. Drop until proper kubectl-based node lookup
can be implemented.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
---------
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
This should make 'no space left on device' problems easier to handle.
Use /tmp/releases as local_release_dir on the CI-created machines, while keeping
the same folder on the runner (needed for gitlab-ci runner pods).
* CI: Try a full ssh connection on hosts instead of only checking the port
If we only check the port, the playbook executed next can try to connect
even though the managed node has not yet completed its boot-up sequence
("System is booting up. Unprivileged users are not permitted to log in yet.
Please come back later. For technical details, see pam_nologin(8).")
This does not account for python-less hosts, but we don't use those in
CI anyway (for now, at least).
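A sketch of the difference, not the exact CI tasks (module parameters shown are the standard ones):
```yaml
# Checking only the TCP port can succeed while PAM still rejects logins:
- name: Wait for the SSH port only (insufficient)
  ansible.builtin.wait_for:
    host: "{{ ansible_host }}"
    port: 22
  delegate_to: localhost

# A full connection check exercises the SSH login and the Python interpreter:
- name: Wait until Ansible can actually connect
  ansible.builtin.wait_for_connection:
    timeout: 300
```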
* CI: Remove connection method override when creating VMs
This prevented wait_for_connection to work correctly by hijacking the
connection to localhost, thus bypassing the connection check.
Add missing RBAC permissions for Calico apiserver to function correctly
with Kubernetes 1.33+
Changes:
1. Add K8s 1.33 ValidatingAdmissionPolicy resources to calico-webhook-reader
- validatingadmissionpolicies
- validatingadmissionpolicybindings
Kubernetes 1.33 introduced ValidatingAdmissionPolicy resources (KEP-3488)
that require explicit RBAC permissions. Without these changes, Calico
apiserver on k8s 1.33+ will not work and needless errors are logged
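A sketch of the resulting ClusterRole rules; the pre-existing webhook resources and the exact metadata are assumptions, only the two added resources come from this change:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-webhook-reader
rules:
  - apiGroups: ["admissionregistration.k8s.io"]
    resources:
      - validatingwebhookconfigurations    # assumed pre-existing
      - mutatingwebhookconfigurations      # assumed pre-existing
      - validatingadmissionpolicies        # added for k8s 1.33+
      - validatingadmissionpolicybindings  # added for k8s 1.33+
    verbs: ["get", "list", "watch"]
```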
* Remove etcd member by peerURLs
The way to obtain the IP of a particular member is convoluted and depends
on multiple variables. The match is also textual, and it's not clear what
we're matching against.
It's also broken for etcd members which are not also Kubernetes nodes,
because the "Lookup node IP in kubernetes" task will fail and abort the
play.
Instead, match against 'peerURLs', which does not need a new variable, and
use json output.
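A minimal sketch of the approach, assuming kubespray-style variable names like `bin_dir` and `etcd_access_address` (the real tasks differ):
```yaml
- name: List etcd members as JSON
  command: "{{ bin_dir }}/etcdctl member list -w json"
  environment:
    ETCDCTL_API: "3"
  register: etcd_members

- name: Remove the member whose peerURLs match this node
  command: "{{ bin_dir }}/etcdctl member remove {{ '%x' | format(member.ID) }}"
  vars:
    # The JSON output reports the member ID as a decimal integer;
    # 'member remove' expects the hexadecimal form.
    member: >-
      {{ (etcd_members.stdout | from_json).members
         | selectattr('peerURLs', 'contains', 'https://' ~ etcd_access_address ~ ':2380')
         | first }}
```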
* Add testcase for etcd removal on external etcd
* do not merge
* fixup! Remove etcd member by peerURLs
* fixup! Remove etcd member by peerURLs
The 'old' playbook and the collection use '-' and '_' as separator,
which breaks the logic in scripts/testcases_run.sh.
Add aliases using the old schemes to make the test work and avoid
breaking anything.
Both '-' and '_' variants will be deleted once we switch to supporting
collection only.
fixed kubelet condition
CRI-O: fix for handling of container runtime switching
refactored kubelet start condition
stop/start kubelet and crio only when default runtime is changed
fixed condition for runtime_matches fact variable
fixed set facts for existing container runtime
added crio runtime switch variable
changed condition to use runtime switch variable
added comment for not-found for readers
Allow setting deployment replicas through `coredns_replicas` when
`enable_dns_autoscaler` is set to false.
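A minimal sketch of the intended usage (the replica count shown is just an example):
```yaml
enable_dns_autoscaler: false
coredns_replicas: 3
```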
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The option ara_default was still present in ansible.cfg under callbacks_enabled.
This is a leftover from commit b9e9364 ("Remove ara support in CI") and should
have been removed together with the rest of the ara integration.
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Enable reserved variable name checks and fix violations
Updated .ansible-lint configuration to skip only var-naming[pattern]
and var-naming[no-role-prefix] instead of skipping the entire var-naming rule.
This enables the check for reserved variable names.
Renamed variables that used reserved names to avoid conflicts.
Updated all references in tasks, variables, and templates.
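A sketch of the corresponding `.ansible-lint` change (only the two sub-rules are skipped instead of the whole rule):
```yaml
skip_list:
  - var-naming[pattern]
  - var-naming[no-role-prefix]
```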
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Rename namespace variable inside tasks instead of deleting it
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Change hosts variable to vm_hosts
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Use k8s_namespace instead of dashboard_namespace in dashboard.yml.j2 template
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
---------
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
Fixes a bug where `kube-apiserver` fails to start if the PodSecurity
configuration file doesn't have the `apiVersion` and `kind` keys.
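For reference, a minimal admission configuration in which both the outer file and the PodSecurity plugin configuration carry `apiVersion` and `kind` (the defaults shown are illustrative):
```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AdmissionConfiguration
plugins:
  - name: PodSecurity
    configuration:
      apiVersion: pod-security.admission.config.k8s.io/v1
      kind: PodSecurityConfiguration
      defaults:
        enforce: baseline
        enforce-version: latest
```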
Signed-off-by: Alejandro Macedo <alex.macedopereira@gmail.com>
Retrying to load the conntrack modules was bound to fail because of the way the current `when` conditions were written.
They were based on the assumption that, in case of success, the registered variable would have an `rc` attribute with the value `0`.
Unfortunately, the `rc` attribute is only present in case of a failure, where its value is non-zero.
The result of `community.general.modprobe` in case of success looks like this:
```
{
    "changed": false,
    "msg": "All items completed",
    "results": [
        {
            "ansible_loop_var": "item",
            "changed": false,
            "failed": false,
            "invocation": {
                "module_args": {
                    "name": "nf_conntrack",
                    "params": "",
                    "persistent": "present",
                    "state": "present"
                }
            },
            "item": "nf_conntrack",
            "name": "nf_conntrack",
            "params": "",
            "state": "present"
        }
    ],
    "skipped": false
}
```
While it looks like this in case of a failure:
```
{
    "changed": false,
    "failed": true,
    "msg": "One or more items failed",
    "results": [
        {
            "ansible_loop_var": "item",
            "attempts": 3,
            "changed": false,
            "failed": true,
            "invocation": {
                "module_args": {
                    "name": "nf_conntrack_doesnotexist",
                    "params": "",
                    "persistent": "present",
                    "state": "present"
                }
            },
            "item": "nf_conntrack_doesnotexist",
            "msg": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
            "name": "nf_conntrack_doesnotexist",
            "params": "",
            "rc": 1,
            "state": "present",
            "stderr": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
            "stderr_lines": [
                "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64"
            ],
            "stdout": "",
            "stdout_lines": []
        }
    ],
    "skipped": false
}
```
By evaluating `failed` instead, this issue can be prevented.
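One way to express the retry on `failed` (a sketch, not necessarily the exact task layout used in the role):
```yaml
- name: Load conntrack modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
    persistent: present
  loop:
    - nf_conntrack
  register: modprobe_result
  # Retry on the per-item 'failed' flag; 'rc' does not exist on success.
  until: modprobe_result is not failed
  retries: 3
  delay: 5
```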
See also:
- https://github.com/kubernetes-sigs/kubespray/issues/11340
Co-authored-by: Max Gautier <mg@max.gautier.name>
The Prometheus Operator CRDs are commonly used for monitoring and are
used by some CNIs (such as Cilium). Kubespray can be installed first,
and the subsequent installation of the operator can be handled by the
user (or later extensions).
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Add variable to set kubelet staticPodPath location.
It can be set to empty so that we can choose to disable it for some nodes.
STIG recommendation is to disable it.
Signed-off-by: Shaleen Bathla <shaleenbathla@gmail.com>
Co-authored-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Debian Trixie recently removed the package `software-properties-common`;
add a condition so it is not installed on Debian Trixie.
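A sketch of the guard (the real task and condition in the role may differ):
```yaml
- name: Install software-properties-common
  ansible.builtin.apt:
    name: software-properties-common
    state: present
  when: not (ansible_distribution == "Debian" and ansible_distribution_release == "trixie")
```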
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Remove `--anonymous-auth` if `kube_api_anonymous_auth` is undefined, to avoid
compatibility errors with other kube-apiserver arguments, such as
`--authentication-config` when the anonymous field is configured.
* Add header configuration in containerd hosts.toml
Signed-off-by: Alexander Gil <pando855@gmail.com>
* Disable log output on containerd mirrors settings if required
Signed-off-by: Alexander Gil <pando855@gmail.com>
---------
Signed-off-by: Alexander Gil <pando855@gmail.com>
* docs: remove obsolete reference to `gen_tags.sh`
`scripts/gen_tags.sh` was removed in 373b952a0c
* docs: fix 404 links
Merge the `Requirements` section with the `Usage` section and just
reference the inventory documentation, which then points to all further
information related to group vars etc.
* feat(subnet): Ensure Vagrant subnet not in use by localhost
This commit ensures that the Vagrantfile-supplied $subnet is not in use by
the localhost. Previously, if the subnet was in use by the localhost (e.g. a
bridge network), Vagrant VM boxes could not communicate.
* refactor(socket): Use ruby Socket library to find addrs
This commit reverts the usage of Ruby's .scan(), which may fail if the
scanned program is not available. Instead, it uses the Socket library to
determine the interfaces in use, then compares them with the
Vagrantfile-supplied subnets. Additionally, the commit adds support for
IPv6 comparisons.
This allows using kubespray_defaults (once) instead of redefining
defaults in the tests.
Test files become imported tasks rather than standalone
playbooks.
* Test: molecule replace ubuntu2004 with ubuntu2204 ubuntu2404
cri-dockerd, adduser and bastion-ssh-config can't run on ubuntu2404 yet; the login handling may need checking:
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Test: replace ubuntu-2004 with ubuntu-2404
All ubuntu-2004 tests are removed.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update ci.md
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update README.md
Remove Ubuntu 20.04 support
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
---------
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
There is not really a reason not to, and this actually breaks daily-ci because
some jobs depend on this one, so the whole pipeline is invalid if it's
not created.
This uses the same logic as the other versions, with simplifications for
crictl and crio, whose versioning scheme is tied to upstream kubernetes.
Also move some version variables into vars/ rather than defaults/, because
they are not used elsewhere and don't really make sense as modifiable by
the user.
Currently, there is no reliable way to obtain individual CRD files, so
the only solution is to update first.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
When installing or upgrading in the past, there was no validation
config. Check if the file exists first to prevent subsequent validation
errors.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
The validation step is moved to the end to avoid the loss of files that
may lead to verification failure.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
`cilium install` is equivalent to `helm install`; it will fail if a Cilium
release already exists. `cilium version` can detect an existing release
without the helm binary.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Give users two options: besides skipping Cilium, add
`cilium_remove_old_resources` (default `false`); when set to `true`,
it will remove the resources of the old version, but this causes
downtime, so it needs to be used carefully.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
This patch fixes the indentation in the `encryption` section.
Previously configuration like this:
```yml
cilium_encryption_enabled: true
cilium_encryption_type: wireguard
```
Would template to a `values.yaml` file with indentation that looks like this:
```yml
encryption:
enabled: True
type: wireguard
nodeEncryption: False
```
instead of this:
```yml
encryption:
enabled: true
type: wireguard
nodeEncryption: false
```
This syntax issue causes an error during Cilium installation.
This patch also makes all boolean values in this template file go through the `to_json` filter.
Since values like `True` and `False` are not compliant with the YAML v1.2 spec,
avoiding them is preferable.
`to_json` may be used for all other values in this template to ensure we end up with
a valid YAML document in all cases (even when various strings include special characters),
but this was left for another (future) patch.
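A minimal sketch of the templating pattern, assuming the variable names from the example above (the `nodeEncryption` source variable is hypothetical):
```yaml
encryption:
  enabled: {{ cilium_encryption_enabled | to_json }}
  type: {{ cilium_encryption_type }}
  nodeEncryption: {{ cilium_encryption_node_encryption | default(false) | to_json }}
```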
* Fix: check expiry before renew
Since certificate renewal and container restarts involve higher risks,
they should be executed with extra caution.
* squash to Fix: check expiry before renew
* squash to Fix: address more comments from VannTen
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
---------
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
* Remove tag master
Following its deprecation in 4b324cb0f (Rename master to control plane
- non-breaking changes only (#11394), 2024-09-06)
* Add fail fast path when using removed tags
- Used for the master tag, but this could be used for other things in
the future
The checksums are not defaults and are not meant to be changed from
the inventories.
Furthermore, role defaults have a lower priority than host facts, which
technically means a rogue host could hijack the hashes through its
facts.
* feat(cilium): add configurable Hubble export log rotation parameters
- Adds support for `cilium_hubble_export_file_max_backups` and `cilium_hubble_export_file_max_size_mb`
- Applies values only if `cilium_hubble_export_file_path` is defined
- Default values are set in role defaults
- Cleans up template logic by removing unnecessary conditionals
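A sketch of the resulting defaults; the values and the export path below are assumptions, only the variable names come from this change:
```yaml
cilium_hubble_export_file_path: /var/run/cilium/hubble/events.log
cilium_hubble_export_file_max_backups: 5
cilium_hubble_export_file_max_size_mb: 10
```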
* Fix indentation for hubble export settings
* Fix undefined variable issue with ipwrap in kubeconfig override that caused pre-commit errors
* Update main.yml
rollback
This is now handled directly at the failfast-ci level (== integration
Github <-> Gitlab).
The whole pipeline will not be triggered unless:
- The author is a maintainer
- The PR has the /ok-to-test label
dnsautoscaler should only be enabled when enable_dns_autoscaler is
set to true. Without this, it could be enabled without any manifest
actually using it, which makes it a false signal.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The switch away from system packages for containerd happened
multiple releases ago; there should not be any up-to-date installation
of kubespray needing that cleanup.
Remove those steps and variables only used by them.
* Delete unused scripts
- gen_tags.sh: not the right file, produces garbage even if the path is fixed
- premoderator.sh: not used since ef6d24a49 (CI require a 'lgtm' or
'ok-to-test' labels to pass (#11251), 2024-05-31)
- gitlab-branch-cleanup: unused AFAICT
* CI: inline molecule logs
Single use site -> less indirection makes it easier to read.
- This enables overriding the tolerations for the cilium-operator deployment
- The default behaviour is to leave the tolerations as-is unless the var is set
With the current github-workflow setup, workflows are triggered on every
forked repository (which is quite wasteful).
Add a condition to only run on the main repository.
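A sketch of the kind of condition involved (the workflow name and trigger are illustrative):
```yaml
name: some-check
on: [pull_request]
jobs:
  check:
    # Only run in the canonical repository, not on forks.
    if: github.repository == 'kubernetes-sigs/kubespray'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
```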
kubespray-defaults currently does two things:
- records a number of default variable values (in particular values used
  in several places)
- gathers and composes some complex network facts (in particular,
  `fallback_ip` and `no_proxy`)
There is no actual reason to couple those two things, and it makes using
defaults more difficult (because computing the network facts is somewhat
expensive, we don't want to do it willy-nilly).
Split the two and adjust import paths as needed.
The Gateway API needs to be installed first if you want to use Cilium's
Gateway API functionality. The Gateway API is just CRDs without any Pod,
Deployment, etc., so I think it can be brought forward to before the CNI
installation.
Signed-off-by: ChengHao Yang
The recommended usage of kubespray is to use the default versions.
So putting them in inventory/sample is not really very helpful, and
causes:
- churn (keeping the inventory/sample up to date)
- support issues (mismatch between defaults and sample inventory)
Remove all concrete versions from the inventory sample.
bootstrap-os does not do anything in sudoers since e2ad6aad5 (bootstrap:
rework role (#4045), 2019-02-11).
So SSH pipelining working is effectively a pre-requisite anyway.
The preinstall asserts cover a number of things, many of which depend
only on the inventory and can be run without any ansible_facts
collected.
Split them off to simplify re-ordering.
* docs: Fix offline-environment.md to add 'v' prefix of some versions
Some version variables (kube_version, etcd_version, etc.) no longer have the 'v' prefix,
so you need to add the 'v' prefix to the download URLs.
* fix: Fix offline.yml to add 'v' prefix of some versions
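A sketch of the resulting URL pattern, assuming the `files_repo` layout used in the sample offline configuration:
```yaml
kubeadm_download_url: "{{ files_repo }}/kubernetes/v{{ kube_version }}/kubeadm"
kubectl_download_url: "{{ files_repo }}/kubernetes/v{{ kube_version }}/kubectl"
kubelet_download_url: "{{ files_repo }}/kubernetes/v{{ kube_version }}/kubelet"
```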
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
This commit fixes the process to ensure that the CCM is installed first, to
avoid the chicken-and-egg problem.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* fix(containerd): always render NRI plugin block with conditional disable flag
* feat: enable Node Resource Interface plugin when using containerd
* fix: remove the
* fix: fix for linter
subscription-manager status can, in some circumstances, just never
terminate, with nothing in the Ansible playbook log indicating the
problem.
This makes it difficult to find the misbehaving hosts.
Add a timeout to the subscription checks (defaulting to 3 minutes). This
should be more than enough for normal circumstances while allowing
easier troubleshooting, as the hosts will be FAILED instead of the
playbook just waiting indefinitely.
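A sketch of the bounded check (the task name and exact timeout handling in the role may differ):
```yaml
- name: Check RHEL subscription status
  ansible.builtin.command: subscription-manager status
  register: rh_subscription_status
  changed_when: false
  # Fail the host instead of hanging the whole play.
  timeout: 180
```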
* terraform upcloud: Added possibility to set up nodes with only private IPs
* terraform upcloud: add support for gateway in private zone
* terraform upcloud: split LB proxy protocol config per backend
* terraform upcloud: fix flexible plans
* terraform upcloud: Removed overview of cluster setup
---------
Co-authored-by: davidumea <david.andersson@elastisys.com>
This is more in-line with dependabot and similar auto-updaters.
Reduce ci coverage on github action updating (it does not change
kubespray code, no need for testing).
* Remove heketi
Heketi is no longer developed or supported and should not be used
anymore.
Remove the contrib playbook.
* Remove contrib glusterfs
GlusterFS integration with Kubernetes is now either deprecated or
unsupported.
Other storage solutions should be preferred.
This commit enhances the node removal playbook's reliability and safety by implementing the following changes:
1. **Node Validation**: Added a validation step using assert to ensure the `node` variable is defined and contains nodes. If the list is empty or undefined, the playbook fails early, preventing accidental operations on the entire cluster.
2. **Removed Defaulting for Hosts**: Updated tasks to enforce explicit `node` variable input without defaulting to critical groups (e.g., `etcd:k8s_cluster:calico_rr`). By validating `node` beforehand, tasks now solely rely on user-provided input and safely avoid unintended targeting.
3. **Explicit User Confirmation**: Enhanced the confirmation prompt to clarify the scope of the operation. The admin is now required to explicitly confirm node state deletion, ensuring a deliberate decision before proceeding.
These improvements strengthen the reliability and safety of the `remove-node.yml` playbook by eliminating ambiguous behavior, preventing misconfigurations, and ensuring clear interaction during node removal tasks.
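A sketch of the kind of guards described above; the task names and prompt wording are illustrative:
```yaml
- name: Ensure an explicit node list was provided
  ansible.builtin.assert:
    that:
      - node is defined
      - node | length > 0
    fail_msg: "Pass the nodes to remove explicitly, e.g. -e node=node4,node5"

- name: Confirm node removal
  ansible.builtin.pause:
    prompt: "Delete the state of {{ node }}? Type 'yes' to confirm"
  register: confirmation
  failed_when: confirmation.user_input != "yes"
```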
Vagrant jobs need a big cache, which makes them slow / sometimes stuck
completely. Using the kubevirt provisioning playbook is now
significantly faster, so do just that.
Having only one provisioner in CI will also allow us to remove some of
the custom runner executors we use for vagrant, and more generally
reduce the CI maintenance.
Our kubevirt CI platform does not support IPv6 yet, so we keep the
relevant jobs on vagrant, but we'll migrate them as well as soon as
possible.
- Take advantage of `parallel:matrix` to make the jobs definition shorter
and more readable.
- Remove helper scripts which are no longer needed
- Remove redundant indirection in the gitlab-ci pipelines definitions
(only one user)
This commit upgrades ingress-nginx to version v1.12.1, addressing multiple critical vulnerabilities including CVE-2025-1974, CVE-2025-1097, CVE-2025-1098, CVE-2025-24513, and CVE-2025-24514 as detailed in the ingress-nginx release notes: https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.12.1
Important Notes:
- Fixing CVE-2025-1974 required disabling validation of the generated NGINX configuration during validation of Ingress resources. Invalid Ingress resources may stop the NGINX configuration from being updated.
- Recommended mitigations include enabling annotation validation and disabling snippet annotations.
Alongside this upgrade, the `ingress_nginx_kube_webhook_certgen_image_tag` has been updated to v1.5.2 for compatibility, based on: https://github.com/kubernetes/ingress-nginx/pull/13066
Changelog:
- Updated ingress-nginx version to v1.12.1 in Kubespray.
- Updated `ingress_nginx_kube_webhook_certgen_image_tag` in `roles/kubespray-defaults/defaults/main/download.yml` to v1.5.2.
Fixes: https://github.com/kubernetes-sigs/kubespray/issues/12073
* Refactor control plane upgrades with reconfiguration support
Adds revised support for:
- The previously removed `--config` argument for `kubeadm upgrade apply`
- Changes to `ClusterConfiguration` as part of the `upgrade-cluster.yml` playbook lifecycle
- kubeadm-config `v1beta4` `UpgradeConfiguration` for the `kubeadm upgrade apply` command: [UpgradeConfiguration v1beta4](https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta4/#kubeadm-k8s-io-v1beta4-UpgradeConfiguration).
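A minimal sketch of an `UpgradeConfiguration` passed via `--config` (the fields shown are illustrative, not the full set rendered by the role):
```yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: UpgradeConfiguration
apply:
  etcdUpgrade: true
  forceUpgrade: true
```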
* Add kubeadm upgrade node support
Per discussion:
- Use `kubeadm upgrade node` on secondary control plane upgrades
- Add support for UpgradeConfiguration.node in kubeadm-config.v1beta4
- Remove redundant `allowRCUpgrades` config
- Revert from `block` for first and secondary control plane back to unblocked tasks since they no longer share much code and it's more readable this way
* Add kubelet and kube-proxy reconfiguration to upgrades
* Fix task to use `kubeadm init phase etcd local`
* Rebase with changes from "Adapt checksums and versions to new hashes updater" PR
* Add `imagePullPolicy` and `imagePullSerial` to kubeadm-config v1beta4 `InitConfiguration.nodeRegistration`
* Ensure correct `AuthorizationConfiguration` API version during upgrades
Fixes an issue where the wrong AuthorizationConfiguration API version could be used by kube-apiserver prematurely during upgrades.
The `kubernetes/control-plane` role writes configuration for the target version before control plane pods are upgraded.
However, since the `AuthorizationConfiguration` file is reconciled continuously, this leads to a race condition where a new configuration version can be reconciled before kube-apiserver is upgraded to the compatible version.
This solution ensures the correct configuration is available throughout the process by writing each api version to a different file path. Unused file versions are cleaned up post-upgrade for better hygiene.
* Avoid from_json in cleanup task
The versions which are derived from `kube_version` by default can break
the assert if `kube_version` starts with `v`, because they use the start of
`kube_version` as a dict key.
By putting them in their own assert, the first assert should trigger on
`kube_version`, with a more explicit error.
[WARNING][1] kube-controllers/runconfig.go 193: unable to list KubeControllersConfiguration(default) error=connection is unauthorized: kubecontrollersconfigurations.crd.projectcalico.org "default" is forbidden: User "system:serviceaccount:kube-system:calico-kube-controllers" cannot list resource "kubecontrollersconfigurations" in API group "crd.projectcalico.org" at the cluster scope
* Upcloud: Added support for routers and gateways
* Upcloud: Added ipsec properties for UpCloud gateway VPN
* Upcloud: Added support for deprecated network field for loadbalancers
There is little reason to share an ssh key common to all CI jobs, so
generate one for each on the fly.
Also use a plain-text cloud-init config instead of base64, for readability.
To work with molecule, we need to use the name provided by molecule_yml
in the inventory.
Inject the name into the VirtualMachineInstance (with a default to handle
the non-molecule scenario) and get it back as part of the inventory.
Account for no ansible groups
The current templating of the kubevirt VirtualMachine relies on global
ansible variables, except for the group the nodes are meant to be in.
In order to have more flexibility (in particular, mixed-OS clusters, for
instance), expect an arbitrary dict to be passed to the template;
this allows embedding directly in the node definitions any variable used
by the template.
The script is obsoleted by 5d7236ea5 (Merge pull request #11890 from
VannTen/download_graphql_checksums_2, 2025-03-09), since the format of
checksums is no longer compatible.
Allow the use of different hash algorithms, as supported by the get_url
Ansible module.
Change the variable name accordingly to 'checksum', since it's not
exclusively sha256 anymore.
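A sketch of the download pattern this enables (the variable names are illustrative); `get_url` accepts an `<algorithm>:<hash>` string:
```yaml
- name: Download a component
  ansible.builtin.get_url:
    url: "{{ item.url }}"
    dest: "{{ local_release_dir }}/{{ item.dest }}"
    checksum: "{{ item.checksum }}"  # e.g. "sha256:ab12..." or "sha512:cd34..."
```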
The versions are nearly all .0 because of the gvisor release scheme.
This means they need to be quoted in yaml to be considered strings.
Special-casing by removing the .0 makes the tooling more complicated, and it
does not gain us anything apart from a nicer-looking file (I guess).
So just use the upstream gvisor version and quote it.
* CI: Put pre-commit cache under CI_PROJECT_DIR
Apparently gitlab-runner can't cache stuff outside of the project
directory.
Put the cache under CI_PROJECT_DIR to make it work (which also means we
need to ignore it from ansible-lint).
Also update the pre-commit image while we're at it.
Link: https://gitlab.com/gitlab-org/gitlab/-/issues/14151
* update ansible-lint pre-commit
This adds a new flag with default `kubeadm_config_validate_enabled: true`, to use when debugging features and enhancements affected by the `kubeadm config validate` command.
This new flag should be set to `false` only for development and testing scenarios where validation is expected to fail (pre-release Kubernetes versions, etc).
While working with development and test versions of Kubernetes and Kubespray, I found this option very useful.
* Automatically derive defaults versions from checksums
Currently, when updating checksums, we manually update the default
versions.
However, AFAICT, for all components where we have checksums, we're using
the newest version out of those checksums.
Codify this in the `_version` defaults variables definition to make the
process automatic and reduce manual steps (as well as the diff size
during reviews).
We assume the versions are sorted, with newest first. This should be
guaranteed by the pre-commit hooks.
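A sketch of the idea with hypothetical variable names (the real defaults are keyed differently, e.g. per architecture):
```yaml
# The newest version is listed first in the checksum map,
# so the default version can simply be derived from it.
component_version: "{{ component_checksums.keys() | first }}"
```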
* Validate checksums are ordered by versions, newest first
* Generalize render-readme-versions hook for other static files
The pre-commit hook introduced in a142f40e2 (Update versions in README.md
with pre-commit, 2025-01-21) allows updating our README with new
versions.
It turns out other "static" files (== which don't interpret Ansible
variables) also use the default versions (in that case, our Dockerfiles,
but there might be others).
The Dockerfile breaks if the variable it uses (`kube_version`) is a
Jinja template.
To help with automatic version upgrades, generalize the hook to deal
with other static files, and make a template out of the Dockerfile.
* Dockerfile: template kube_version with pre-commit instead of runtime
* Validate all versions/checksums are strings in pre-commit
All the ansible/python version tooling works on version strings. YAML
unhelpfully considers some values to be numbers, so enforce this.
* Stringify checksums versions
* exclude .ansible in ansible-lint
* remove `ctr i pull` workaround
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
---------
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
The build steps at the start of CI take about 2 minutes; now that we
have greatly reduced the overall duration, this is not an insignificant
impact.
Add timestamps to the build process to measure which steps of the
image build take the most time.
* Remove krew installation support
Krew exists fundamentally to install kubectl plugins, which are eminently a
client-side thing.
It's also not difficult to install on a client machine.
* Remove krew cleanup
This has been deprecated for a long time, time to pull the plug.
We leave an assert for one release to have a straightforward failure if
some users were still using the variable.
Since 'none' can be, for instance, a manual calico deployment, don't
check whether there are enough IPs for pods on a node, because the plugin
can use a mechanism other than the podCIDR to allocate IPs.
When the etcd group is not specified we assume it's kube_control_plane.
In that case, the etcd member count still can't be even, so instead of only
checking the etcd group we need to default to kube_control_plane.
ansible-lint and yamllint are run as pre-commit hooks, which are
installed by pre-commit directly. So there is no need to put them in
tests/requirements.txt.
So remove them and make it leaner.
Upstream calico isn't doing that, and:
- this can cause throttling
- the cpu needed by calico is very cluster / workload dependent
- missing cpu limits will not starve other pods (unlike missing memory
  requests), because the kernel scheduler will still give priority to
  other processes in pods not exceeding their requests
Currently, versions in README.md need to be manually updated, and we
check that it's done with a bash script.
Add a small utility playbook to set the versions in README.md from their
actual default values, automatically.
This is done in pre-commit and replaces the scripted check; instead it
will autofix README.md, and fail in CI if needed.
We move markdownlint behind the local hooks to give it the opportunity
to catch a problem with the rendering.
Since e8ee42280 (CI: remove deletion tasks of 'packet' VMs, 2024-09-13),
our tests appear to no longer be flaky.
The current retry slows down the testing feedback on pull requests.
Since it's not needed anymore, don't retry and fail fast.
This is handy when some component release is buggy (missing file at the
download links), so it does not block everything else.
Move the filtering up the stack so we don't have to do it multiple
times.
Gvisor releases, besides only being tags, have some particularities:
- they are of the form yyyymmdd.p -> this gets interpreted as a yaml
  float, so we need to explicitly convert it to a string to make it work.
- there is no semver-like attached to the version numbers, but the API
  (= OCI container runtime interface) is expected to be stable (see
  linked discussion)
- some older tags don't have hashes for some archs
Link: https://groups.google.com/g/gvisor-users/c/SxMeHt0Yb6Y/m/Xtv7seULCAAJ
Gvisor is the only one of our deployed components which uses tags instead
of proper releases. So the tag-scraping support will, for now, cater to
gvisor particularities, notably in the tag name format and the fact that
some older releases don't have the same URL scheme.
Containerd uses the same repository for releases of its gRPC API (which
we are not interested in).
Conveniently, those releases have tags which are not valid version
numbers (being prefixed with 'api/').
This could also be potentially useful for similar cases.
The risk of missing releases because of this is low, since it would
require that a project issue a new release with an invalid format, then
switch back to the previous format (or that we miss the fact it's not
updating for a long period of time).
The Github graphQL API needs IDs for querying a variable array of
repositories.
Use a dict for components instead of an array of URLs and record the
corresponding node ID for each component (there are duplicates because
some binaries are provided by the same project/repository).
Adds the ability to configure the Kubernetes API server with a structured authorization configuration file.
Structured AuthorizationConfiguration is a new feature in Kubernetes v1.29+ (GA in v1.32) that configures the API server's authorization modes with a structured configuration file.
AuthorizationConfiguration files offer features not available with the `--authorization-mode` flag, although Kubespray supports both methods and authorization-mode remains the default for now.
Note: Because the `--authorization-config` and `--authorization-mode` flags are mutually exclusive, the `authorization_modes` ansible variable is ignored when `kube_apiserver_use_authorization_config_file` is set to true. The two features cannot be used at the same time.
Docs: https://kubernetes.io/docs/reference/access-authn-authz/authorization/#configuring-the-api-server-using-an-authorization-config-file
Blog + Examples: https://kubernetes.io/blog/2024/04/26/multi-webhook-and-modular-authorization-made-much-easier/
KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3221-structured-authorization-configuration
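A minimal sketch of such a configuration file (`v1` is the GA version in Kubernetes 1.32; older releases use `v1beta1`/`v1alpha1` of the same group, and the webhook shown is purely illustrative):
```yaml
apiVersion: apiserver.config.k8s.io/v1
kind: AuthorizationConfiguration
authorizers:
  - type: Node
    name: node
  - type: RBAC
    name: rbac
  - type: Webhook
    name: example-webhook
    webhook:
      authorizedTTL: 300s
      unauthorizedTTL: 30s
      timeout: 3s
      subjectAccessReviewVersion: v1
      matchConditionSubjectAccessReviewVersion: v1
      failurePolicy: NoOpinion
      connectionInfo:
        type: KubeConfigFile
        kubeConfigFile: /etc/kubernetes/example-authz-webhook.kubeconfig
```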
I tested this all the way back to k8s v1.29 when AuthorizationConfiguration was first introduced as an alpha feature, although v1.29 required some additional workarounds with `kubeadm_patches`, which I included in example comments.
I also included some example comments with CEL expressions that allowed me to configure webhook authorizers without hitting kubeadm 1.29+ issues that block cluster creation and upgrades such as this one: https://github.com/kubernetes/cloud-provider-openstack/issues/2575.
My workaround configures the webhook to ignore requests from kubeadm and system components, which prevents fatal errors from webhooks that are not available yet, and should be authorized by Node or RBAC anyway.
@@ -135,7 +150,7 @@ Note: Upstart/SysV init based OS types are not supported.
## Requirements
-- **Minimum required version of Kubernetes is v1.29**
+- **Minimum required version of Kubernetes is v1.30**
- **Ansible v2.14+, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/operations/offline-environment.md))
- The target servers are configured to allow **IPv4 forwarding**.
@@ -149,10 +164,10 @@ Note: Upstart/SysV init based OS types are not supported.
Hardware:
These limits are safeguarded by Kubespray. Actual requirements for your workload can differ. For a sizing guide go to the [Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/#size-of-master-and-master-components) guide.
-- Master
-  - Memory: 1500 MB
-- Node
-  - Memory: 1024 MB
+- Control Plane
+  - Memory: 2 GB
+- Worker Node
+  - Memory: 1 GB
## Network Plugins
@@ -167,9 +182,6 @@ You can choose among ten network plugins. (default: `calico`, except Vagrant use
- [cilium](http://docs.cilium.io/en/latest/): layer 3/4 networking (as well as layer 7 to protect and secure application protocols), supports dynamic insertion of BPF bytecode into the Linux kernel to implement security services, networking and visibility logic.
- [weave](docs/CNI/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
(Please refer to `weave` [troubleshooting documentation](https://www.weave.works/docs/net/latest/troubleshooting/)).
- [kube-ovn](docs/CNI/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [kube-router](docs/CNI/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
@@ -12,7 +12,6 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
1. (For major releases) On the `master` branch: bump the version in `galaxy.yml` to the next expected major release (X.y.0 with y = Y + 1), make a Pull Request.
1. (For minor releases) On the `release-X.Y` branch: bump the version in `galaxy.yml` to the next expected minor release (X.Y.z with z = Z + 1), make a Pull Request.
1. The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) container images are built and tagged. See the following `Container image creation` section for the details.
1. (Only for major releases) The `KUBESPRAY_VERSION` in `.gitlab-ci.yml` is upgraded to the version we just released # TODO clarify this, this variable is for testing upgrades.
1. The release issue is closed
1. An announcement email is sent to `dev@kubernetes.io` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
1. The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
@@ -46,7 +45,7 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
* Minor releases can change components' versions, but not the major `kube_version`.
Greater `kube_version` requires a new major or minor release. For example, if Kubespray v2.0.0
-  is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: v3.0.6`,
+  is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: 3.0.6`,
then Kubespray v2.1.0 may be bound to only minor changes to `kube_version`, like v1.5.1
and *any* changes to other components, like etcd v4, or calico 1.2.3.
And Kubespray v3.x.x shall be bound to `kube_version: 2.x.x` respectively.
# libvirt__ipv6_address does not work as intended, the address is obtained with the desired prefix, but auto-generated(like fd3c:b398:698:756:5054:ff:fe48:c61e/64)
# add default route for detect ansible_default_ipv6
# TODO: fix libvirt__ipv6 or use $subnet in shell
config.vm.provision"shell",inline:"ip -6 r a fd3c:b398:698:756::/64 dev eth1;ip -6 r add default via fd3c:b398:0698:0756::1 dev eth1 || true"
# Disable swap for each vm
node.vm.provision"shell",inline:"swapoff -a"
@@ -291,9 +336,9 @@ Vagrant.configure("2") do |config|
# Deploying a Kubespray Kubernetes Cluster with GlusterFS
You can either deploy using Ansible on its own by supplying your own inventory file or by using Terraform to create the VMs and then providing a dynamic inventory to Ansible. The following two sections are self-contained, you don't need to go through one to use the other. So, if you want to provision with Terraform, you can skip the **Using an Ansible inventory** section, and if you want to provision with a pre-built ansible inventory, you can neglect the **Using Terraform and Ansible** section.
## Using an Ansible inventory
In the same directory as this README file you should find a file named `inventory.example` which contains an example setup. Please note that, in addition to the Kubernetes nodes/masters, we define a set of machines for GlusterFS and we add them to the group `[gfs-cluster]`, which in turn is added to the larger `[network-storage]` group as a child group.
Change that file to reflect your local setup (adding more machines or removing them and setting the adequate IP numbers), and save it to `inventory/sample/k8s_gfs_inventory`. Make sure that the settings in `inventory/sample/group_vars/all.yml` make sense for your deployment. Then change to the kubespray root folder, and execute (supposing that the machines are all using ubuntu):
If your machines are not using Ubuntu, you need to change the `--user=ubuntu` to the correct user. Alternatively, if your Kubernetes machines are using one OS and your GlusterFS a different one, you can instead specify the `ansible_ssh_user=<correct-user>` variable in the inventory file that you just created, for each machine/VM:
First step is to fill in a `my-kubespray-gluster-cluster.tfvars` file with the specification desired for your cluster. An example with all required variables would look like:
As explained in the general terraform/openstack guide, you need to source your OpenStack credentials file, add your ssh-key to the ssh-agent and setup environment variables for terraform:
```shell
$ source ~/.stackrc
$ eval $(ssh-agent -s)
$ ssh-add ~/.ssh/my-desired-key
$ echo Setting up Terraform creds && \
  export TF_VAR_username=${OS_USERNAME} && \
  export TF_VAR_password=${OS_PASSWORD} && \
  export TF_VAR_tenant=${OS_TENANT_NAME} && \
  export TF_VAR_auth_url=${OS_AUTH_URL}
```
Then, standing on the kubespray directory (root base of the Git checkout), issue the following terraform command to create the VMs for the cluster:
This will create both your Kubernetes and Gluster VMs. Make sure that the ansible file `contrib/terraform/openstack/group_vars/all.yml` includes any ansible variable that you want to setup (like, for instance, the type of machine for bootstrapping).
Then, provision your Kubernetes (kubespray) cluster with the following ansible call:
For GlusterFS to connect between servers, TCP ports `24007`, `24008`, and `24009`/`49152`+ (that port, plus an additional incremented port for each additional server in the cluster; the latter if GlusterFS is version 3.4+), and TCP/UDP port `111` must be open. You can open these using whatever firewall you wish (this can easily be configured using the `geerlingguy.firewall` role).
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/ansible/latest/collections/gluster/gluster/gluster_volume_module.html) module to ease the management of Gluster volumes.
## Role Variables
Available variables are listed below, along with default values (see `defaults/main.yml`):
```yaml
glusterfs_default_release: ""
```
You can specify a `default_release` for apt on Debian/Ubuntu by overriding this variable. This is helpful if you need a different package or version for the main GlusterFS packages (e.g. GlusterFS 3.5.x instead of 3.2.x with the `wheezy-backports` default release on Debian Wheezy).
```yaml
glusterfs_ppa_use: true
glusterfs_ppa_version: "3.5"
```
For Ubuntu, specify whether to use the official Gluster PPA, and which version of the PPA to use. See Gluster's [Getting Started Guide](https://docs.gluster.org/en/latest/Quick-Start-Guide/Quickstart/) for more info.
## Dependencies
None.
## Example Playbook
```yaml
- hosts: server
  roles:
    - geerlingguy.glusterfs
```
For a real-world use example, read through [Simple GlusterFS Setup with Ansible](http://www.jeffgeerling.com/blog/simple-glusterfs-setup-ansible), a blog post by this role's author, which is included in Chapter 8 of [Ansible for DevOps](https://www.ansiblefordevops.com/).
## License
MIT / BSD
## Author Information
This role was created in 2015 by [Jeff Geerling](http://www.jeffgeerling.com/), author of [Ansible for DevOps](https://www.ansiblefordevops.com/).
- name: Ensure GlusterFS is started and enabled at boot.
  service:
    name: "{{ glusterfs_daemon }}"
    state: started
    enabled: true

- name: Ensure Gluster brick and mount directories exist.
  file:
    path: "{{ item }}"
    state: directory
    mode: "0775"
  with_items:
    - "{{ gluster_brick_dir }}"
    - "{{ gluster_mount_dir }}"

- name: Configure Gluster volume with replicas
  gluster.gluster.gluster_volume:
    state: present
    name: "{{ gluster_brick_name }}"
    brick: "{{ gluster_brick_dir }}"
    replicas: "{{ groups['gfs-cluster'] | length }}"
    cluster: "{% for item in groups['gfs-cluster'] -%}{{ hostvars[item]['ip'] | default(hostvars[item].ansible_default_ipv4['address']) }}{% if not loop.last %},{% endif %}{%- endfor %}"
    host: "{{ inventory_hostname }}"
    force: true
  run_once: true
  when: groups['gfs-cluster'] | length > 1

- name: Configure Gluster volume without replicas
  gluster.gluster.gluster_volume:
    state: present
    name: "{{ gluster_brick_name }}"
    brick: "{{ gluster_brick_dir }}"
    cluster: "{% for item in groups['gfs-cluster'] -%}{{ hostvars[item]['ip'] | default(hostvars[item].ansible_default_ipv4['address']) }}{% if not loop.last %},{% endif %}{%- endfor %}"
    host: "{{ inventory_hostname }}"
    force: true
  run_once: true
  when: groups['gfs-cluster'] | length <= 1

- name: Mount glusterfs to retrieve disk size
  ansible.posix.mount:
    name: "{{ gluster_mount_dir }}"
    src: "{{ ip | default(ansible_default_ipv4['address']) }}:/gluster"
    fstype: glusterfs
    opts: "defaults,_netdev"
    state: mounted
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]

- name: Get Gluster disk size
  setup:
    filter: ansible_mounts
  register: mounts_data
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]

- name: Set Gluster disk size to variable
  set_fact:
    gluster_disk_size_gb: "{{ (mounts_data.ansible_facts.ansible_mounts | selectattr('mount', 'equalto', gluster_mount_dir) | map(attribute='size_total') | first | int / (1024 * 1024 * 1024)) | int }}"
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]

- name: Create file on GlusterFS
  template:
    dest: "{{ gluster_mount_dir }}/.test-file.txt"
    src: test-file.txt
    mode: "0644"
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]

- name: Unmount glusterfs
  ansible.posix.mount:
    name: "{{ gluster_mount_dir }}"
    fstype: glusterfs
    src: "{{ ip | default(ansible_default_ipv4['address']) }}:/gluster"
    state: unmounted
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
when: inventory_hostname == groups['kube_control_plane'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
- name: Kubernetes Apps | Set GlusterFS endpoint and PV
# Deploy Heketi/Glusterfs into Kubespray/Kubernetes
This playbook aims to automate [this](https://github.com/heketi/heketi/blob/master/docs/admin/install-kubernetes.md) tutorial. It deploys heketi/glusterfs into kubernetes and sets up a storageclass.
## Important notice
> Due to resource limits on the current project maintainers and general lack of contributions we are considering placing Heketi into a [near-maintenance mode](https://github.com/heketi/heketi#important-notice)
## Client Setup
Heketi provides a CLI that provides users with a means to administer the deployment and configuration of GlusterFS in Kubernetes. [Download and install the heketi-cli](https://github.com/heketi/heketi/releases) on your client machine.
## Install
Copy the inventory.yml.sample over to inventory/sample/k8s_heketi_inventory.yml and change it according to your setup.
An SSH keypair is required so Ansible can access the newly provisioned nodes (Equinix Metal hosts). By default, the public SSH key defined in cluster.tfvars will be installed in authorized_key on the newly provisioned nodes (~/.ssh/id_rsa.pub). Terraform will upload this public key and then it will be distributed out to all the nodes. If you have already set this public key in Equinix Metal (i.e. via the portal), then set the public keyfile name in cluster.tfvars to blank to prevent the duplicate key from being uploaded which will cause an error.
If you don't already have a keypair generated (~/.ssh/id_rsa and ~/.ssh/id_rsa.pub), then a new keypair can be generated with the command:
```ShellSession
ssh-keygen -f ~/.ssh/id_rsa
```
## Terraform
Terraform will be used to provision all of the Equinix Metal resources with base software as appropriate.
### Configuration
#### Inventory files
Create an inventory directory for your cluster by copying the existing sample and linking the `hosts` script (used to build the inventory based on Terraform state):
The Equinix Metal Project ID associated with the key will be set later in `cluster.tfvars`.
For more information about the API, please see [Equinix Metal API](https://metal.equinix.com/developers/api/).
For more information about terraform provider authentication, please see [the equinix provider documentation](https://registry.terraform.io/providers/equinix/equinix/latest/docs).
Example:
```ShellSession
export METAL_AUTH_TOKEN="Example-API-Token"
```
Note that to deploy several clusters within the same project you need to use [terraform workspace](https://www.terraform.io/docs/state/workspaces.html#using-workspaces).
#### Cluster variables
The construction of the cluster is driven by values found in
[variables.tf](variables.tf).
For your cluster, edit `inventory/$CLUSTER/cluster.tfvars`.
The `cluster_name` is used to set a tag on each server deployed as part of this cluster.
This helps when identifying which hosts are associated with each cluster.
While the defaults in variables.tf will successfully deploy a cluster, it is recommended to set the following values:
- cluster_name = the name of the inventory directory created above as $CLUSTER
- equinix_metal_project_id = the Equinix Metal Project ID associated with the Equinix Metal API token above
#### Enable localhost access
Kubespray will pull down a Kubernetes configuration file to access this cluster by enabling the
`kubeconfig_localhost: true` in the Kubespray configuration.
Edit `inventory/$CLUSTER/group_vars/k8s_cluster/k8s_cluster.yml` and comment back in the following line and change from `false` to `true`:
`# kubeconfig_localhost: false`
becomes:
`kubeconfig_localhost: true`
Once the Kubespray playbooks are run, a Kubernetes configuration file will be written to the local host at `inventory/$CLUSTER/artifacts/admin.conf`
#### Terraform state files
In the cluster's inventory folder, the following files might be created (either by Terraform
or manually); to prevent you from pushing them accidentally, they are listed in a
`.gitignore` file in the `contrib/terraform/equinix` directory:
- `.terraform`
- `.tfvars`
- `.tfstate`
- `.tfstate.backup`
- `.lock.hcl`
You can still add them manually if you want to.
### Initialization
Before Terraform can operate on your cluster you need to install the required
If you've started the Ansible run, it may also be a good idea to do some manual cleanup:
- Remove SSH keys from the destroyed cluster from your `~/.ssh/known_hosts` file
- Clean up any temporary cache files: `rm /tmp/$CLUSTER-*`
### Debugging
You can enable debugging output from Terraform by setting `TF_LOG` to `DEBUG` before running the Terraform command.
## Ansible
### Node access
#### SSH
Ensure your local ssh-agent is running and your ssh key has been added. This
step is required by the terraform provisioner:
```ShellSession
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa
```
If you have deployed and destroyed a previous iteration of your cluster, you will need to clear out any stale keys from your SSH "known hosts" file ( `~/.ssh/known_hosts`).
#### Test access
Make sure you can connect to the hosts. Note that Flatcar Container Linux by Kinvolk will have a state `FAILED` due to Python not being present. This is okay, because Python will be installed during bootstrapping, so long as the hosts are not `UNREACHABLE`.
```ShellSession
$ ansible -i inventory/$CLUSTER/hosts -m ping all
example-k8s_node-1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
example-etcd-1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
example-k8s-master-1 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
```
If it fails try to connect manually via SSH. It could be something as simple as a stale host key.
This will take some time as there are many tasks to run.
## Kubernetes
### Set up kubectl
- [Install kubectl](https://kubernetes.io/docs/tasks/tools/install-kubectl/) on the localhost.
- Verify that Kubectl runs correctly
```ShellSession
kubectl version
```
- Verify that the Kubernetes configuration file has been copied over
```ShellSession
cat inventory/$CLUSTER/artifacts/admin.conf
```
- Verify that all the nodes are running correctly.
```ShellSession
kubectl version
kubectl --kubeconfig=inventory/$CLUSTER/artifacts/admin.conf get nodes
```
## What's next
Try out your new Kubernetes cluster with the [Hello Kubernetes service](https://kubernetes.io/docs/tasks/access-application-cluster/service-access-application-cluster/).