* control-plane: fix first_kube_control_plane delegation with kube_override_hostname
When kube_override_hostname is configured, the node names reported by
`kubectl get nodes` differ from the inventory_hostname known to Ansible.
This causes delegation failures in subsequent tasks since Ansible cannot
resolve the hostname from kubectl output to an inventory host.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
* control-plane: remove fragile first_control_plane selection logic
Current implementation breaks with kube_override_hostname and has
multiple edge cases. Drop until proper kubectl-based node lookup
can be implemented.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
---------
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
This should make 'no space left on device' problems easier to handle.
Use /tmp/releases as local_release_dir on the CI-created machine, while keeping
the same folder on the runner (needed for gitlab-ci runner pods).
* CI: Try a full ssh connection on hosts instead of only checking the port
If we only check the port, the playbook executed next can try to connect
even though the managed node has not yet completed its boot-up sequence
("System is booting up. Unprivileged users are not permitted to log in yet.
Please come back later. For technical details, see pam_nologin(8).")
This does not account for python-less hosts, but we don't use those in
CI anyway (for now, at least).
* CI: Remove connection method override when creating VMs
This prevented wait_for_connection from working correctly by hijacking the
connection to localhost, thus bypassing the connection check.
Add missing RBAC permissions for Calico apiserver to function correctly
with Kubernetes 1.33+
Changes:
1. Add K8s 1.33 ValidatingAdmissionPolicy resources to calico-webhook-reader
- validatingadmissionpolicies
- validatingadmissionpolicybindings
Kubernetes 1.33 introduced ValidatingAdmissionPolicy resources (KEP-3488)
that require explicit RBAC permissions. Without these changes, the Calico
apiserver on k8s 1.33+ will not work and needless errors are logged.
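For illustration, the added rule looks roughly like this (a sketch of the ClusterRole addition, not necessarily the exact manifest shipped by Kubespray):

```yml
# Sketch: RBAC rule granting read access to ValidatingAdmissionPolicy resources
- apiGroups: ["admissionregistration.k8s.io"]
  resources:
    - validatingadmissionpolicies
    - validatingadmissionpolicybindings
  verbs: ["get", "list", "watch"]
```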
* Remove etcd member by peerURLs
The way to obtain the IP of a particular member is convoluted and depends
on multiple variables. The match is also textual and it's not clear
what we're matching against.
It's also broken for etcd members which are not also Kubernetes nodes,
because the "Lookup node IP in kubernetes" task will fail and abort the
play.
Instead, match against 'peerURLs', which does not need a new variable, and
use JSON output.
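A rough sketch of the approach (variable names such as `etcd_peer_url` and `bin_dir` are placeholders, not necessarily the exact Kubespray ones):

```yml
- name: List etcd members (JSON output)
  command: "{{ bin_dir }}/etcdctl member list --write-out=json"
  environment:
    ETCDCTL_API: "3"
  register: etcd_member_list
  changed_when: false

- name: Select the member whose peerURLs match this node
  set_fact:
    etcd_member_id: >-
      {{ (etcd_member_list.stdout | from_json).members
         | selectattr('peerURLs', 'contains', etcd_peer_url)
         | map(attribute='ID') | first | default('') }}
```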
* Add testcase for etcd removal on external etcd
* do not merge
* fixup! Remove etcd member by peerURLs
* fixup! Remove etcd member by peerURLs
The 'old' playbook and the collection use '-' and '_' as separator,
which breaks the logic in scripts/testcases_run.sh.
Add aliases using the old schemes to make the test work and avoid
breaking anything.
Both '-' and '_' variants will be deleted once we switch to supporting
collection only.
fixed kubelet condition
CRI-O: fix for handling of container runtime switching
refactored kubelet start condition
stop/start kubelet and crio only when default runtime is changed
fixed condition for runtime_matches fact variable
fixed set facts for existing container runtime
added crio runtime switch variable
changed condition to use runtime switch variable
added comment for not-found for readers
Allow setting deployment replicas through `coredns_replicas` when
`enable_dns_autoscaler` is set to false.
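For example (using the variables named above):

```yml
enable_dns_autoscaler: false
coredns_replicas: 3
```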
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The option ara_default was still present in ansible.cfg under callbacks_enabled.
This is a leftover from commit b9e9364 ("Remove ara support in CI") and should
have been removed together with the rest of the ara integration.
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Enable reserved variable name checks and fix violations
Updated .ansible-lint configuration to skip only var-naming[pattern]
and var-naming[no-role-prefix] instead of skipping the entire var-naming rule.
This enables the check for reserved variable names.
Renamed variables that used reserved names to avoid conflicts.
Updated all references in tasks, variables, and templates.
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Rename namespace variable inside tasks instead of deleting it
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Change hosts variable to vm_hosts
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Use k8s_namespace instead of dashboard_namespace in dashboard.yml.j2 template
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
---------
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
Fixes a bug where `kube-apiserver` fails to start if the PodSecurity
configuration file doesn't have the `apiVersion` and `kind` keys.
Signed-off-by: Alejandro Macedo <alex.macedopereira@gmail.com>
Retrying to load conntrack modules was bound to fail due to the way the current `when` conditions were utilized.
It was based on the assumption that in case of success, the registered variable would have an `rc` attribute with the value `0`.
Unfortunately, the `rc` attribute is only present in case of a failure, where its value is >= 1.
The result of `community.general.modprobe` in case of success looks like this:
```
{
    "changed": false,
    "msg": "All items completed",
    "results": [
        {
            "ansible_loop_var": "item",
            "changed": false,
            "failed": false,
            "invocation": {
                "module_args": {
                    "name": "nf_conntrack",
                    "params": "",
                    "persistent": "present",
                    "state": "present"
                }
            },
            "item": "nf_conntrack",
            "name": "nf_conntrack",
            "params": "",
            "state": "present"
        }
    ],
    "skipped": false
}
```
While it looks like this in case of a failure:
```
{
    "changed": false,
    "failed": true,
    "msg": "One or more items failed",
    "results": [
        {
            "ansible_loop_var": "item",
            "attempts": 3,
            "changed": false,
            "failed": true,
            "invocation": {
                "module_args": {
                    "name": "nf_conntrack_doesnotexist",
                    "params": "",
                    "persistent": "present",
                    "state": "present"
                }
            },
            "item": "nf_conntrack_doesnotexist",
            "msg": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
            "name": "nf_conntrack_doesnotexist",
            "params": "",
            "rc": 1,
            "state": "present",
            "stderr": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
            "stderr_lines": [
                "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64"
            ],
            "stdout": "",
            "stdout_lines": []
        }
    ],
    "skipped": false
}
```
By evaluating `failed` instead, this issue can be prevented.
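For illustration, a condition keyed on `failed` rather than `rc` could look like this (a sketch, not the exact Kubespray tasks; the fallback module name is only an example):

```yml
- name: Load conntrack kernel modules
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - nf_conntrack
  register: modprobe_conntrack
  ignore_errors: true

- name: Fall back to the legacy module name
  community.general.modprobe:
    name: nf_conntrack_ipv4
    state: present
  when: modprobe_conntrack is failed
```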
See also:
- https://github.com/kubernetes-sigs/kubespray/issues/11340
Co-authored-by: Max Gautier <mg@max.gautier.name>
The Prometheus Operator CRDs are commonly used for monitoring and are
used by some CNIs (such as Cilium). Kubespray can be installed first,
and the subsequent installation of the operator can be handled by the
user (or later extensions).
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Add a variable to set the kubelet staticPodPath location.
It can be set to empty so that we can choose to disable it for some nodes.
The STIG recommendation is to disable it.
Signed-off-by: Shaleen Bathla <shaleenbathla@gmail.com>
Co-authored-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Debian Trixie recently removed the `software-properties-common` package,
so add a condition to skip it on Debian Trixie.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Remove --anonymous-auth if kube_api_anonymous_auth is undefined, to avoid
compatibility errors with other arguments of the kube-apiserver, such as
--authentication-config when the anonymous field is configured.
* Add header configuration in containerd hosts.toml
Signed-off-by: Alexander Gil <pando855@gmail.com>
* Disable log output on containerd mirrors settings if required
Signed-off-by: Alexander Gil <pando855@gmail.com>
---------
Signed-off-by: Alexander Gil <pando855@gmail.com>
* docs: remove obsolete reference to `gen_tags.sh`
`scripts/gen_tags.sh` was removed in 373b952a0c
* docs: fix 404 links
Merge the `Requirements` section with the `Usage` section and just
reference the inventory documentation, which then points to all further
information related to group vars etc.
* feat(subnet): Ensure Vagrant subnet not in use by localhost
This commit ensures that the Vagrantfile-supplied $subnet is not in use by
the localhost. Previously, if the subnet was in use by localhost (e.g. a
bridge network), Vagrant VM boxes could not communicate.
* refactor(socket): Use ruby Socket library to find addrs
This commit reverts the usage of Ruby .scan(), which may result in
failure if the program is not provided. Instead, this commit refactors to
use the Socket library to determine interfaces in use, then proceeds to
compare with the Vagrantfile-supplied subnets. Additionally, the commit
supports IPv6 comparisons.
This allows using kubespray_defaults (once) instead of redefining
defaults in the tests.
Test files become imported tasks rather than standalone
playbooks.
* Test: molecule replace ubuntu2004 with ubuntu2204 ubuntu2404
cri-dockerd, adduser and bastion-ssh-config can't run on ubuntu2404; maybe the login handling needs to be checked.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Test: replace ubuntu-2004 with ubuntu-2404
All ubuntu-2004 tests are removed.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update ci.md
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update README.md
Remove Ubuntu 20.04 support
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
---------
Signed-off-by: ChengHao Yang
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Not really a reason not to, and this actually breaks daily-ci because
some jobs depend on this one, so the whole pipeline is invalid if it's
not created.
This uses the same logic as the other versions, with simplifications for
crictl and crio, whose versioning scheme is tied to upstream kubernetes.
Also move some version variables into vars/ rather than defaults/, because
they are not used elsewhere and don't really make sense as modifiable by
the user.
Currently, there is no reliable way to obtain individual CRD files, so
the only solution is to update first.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
When installing or upgrading in the past, there was no validation
config. Check if the file exists first to prevent subsequent validation
errors.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
The validation step is moved to the end to avoid the loss of files that
may lead to verification failure.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
`cilium install` is equivalent to `helm install`; it will fail if a
cilium release already exists. `cilium version` can detect an existing release
without the helm binary.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Give users two options: besides skipping Cilium, add
`cilium_remove_old_resources` (default `false`); when set to `true`,
it will remove the resources of the old version, but this causes
downtime, so it needs to be used with care.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
This patch fixes the indentation in the `encryption` section.
Previously configuration like this:
```yml
cilium_encryption_enabled: true
cilium_encryption_type: wireguard
```
Would template to a `values.yaml` file with indentation that looks like this:
```yml
encryption:
enabled: True
type: wireguard
nodeEncryption: False
```
instead of this:
```yml
encryption:
enabled: true
type: wireguard
nodeEncryption: false
```
This syntax issue causes an error during Cilium installation.
This patch also makes all boolean values in this template file go through the `to_json` filter.
Since values like `True` and `False` are not compliant with the YAML v1.2 spec,
avoiding them is preferable.
`to_json` may be used for all other values in this template to ensure we end up with
a valid YAML document in all cases (even when various strings include special characters),
but this was left for another (future) patch.
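Concretely, the template fragment after this change looks roughly like this (a sketch; `cilium_encryption_node_encryption` is a placeholder name, not necessarily the real variable):

```yml
encryption:
  enabled: {{ cilium_encryption_enabled | to_json }}
  type: {{ cilium_encryption_type }}
  nodeEncryption: {{ cilium_encryption_node_encryption | default(false) | to_json }}
```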
* Fix: check expiry before renew
Since certificate renewal and container restarts involve higher risks,
they should be executed with extra caution.
* squash to Fix: check expiry before renew
* squash to Fix: address more comments from VannTen
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
---------
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
* Remove tag master
Following its deprecation in 4b324cb0f (Rename master to control plane
- non-breaking changes only (#11394), 2024-09-06)
* Add fail fast path when using removed tags
- Used for the master tag, but this could be used for other things in
the future
The checksums are not defaults and are not meant to be changed from
the inventories.
Furthermore, role defaults have a lower priority than host facts, which
technically means a rogue host could hijack the hashes for its
variables.
* feat(cilium): add configurable Hubble export log rotation parameters
- Adds support for `cilium_hubble_export_file_max_backups` and `cilium_hubble_export_file_max_size_mb`
- Applies values only if `cilium_hubble_export_file_path` is defined
- Default values are set in role defaults
- Cleans up template logic by removing unnecessary conditionals
* Fix indentation for hubble export settings
* Fix undefined variable issue with ipwrap in kubeconfig override that caused pre-commit errors
* Update main.yml
rollback
This is now handled directly at the failfast-ci level (== integration
Github <-> Gitlab).
The whole pipeline will not be triggered unless:
- The author is a maintainer
- The PR has the /ok-to-test label
dnsautoscaler should only be enabled when enable_dns_autoscaler is
set to true. Without this, it could be enabled without any manifest
actually using it, which makes it a false signal.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The switch to not use system packages for containerd happened
multiple releases ago; there should not be any up-to-date installation
of kubespray needing that cleanup.
Remove those steps and variables only used by them.
* Delete unused scripts
- gen_tags.sh: not the right file, produces garbage even if the path is fixed
- premoderator.sh: not used since ef6d24a49 (CI require a 'lgtm' or
'ok-to-test' labels to pass (#11251), 2024-05-31)
- gitlab-branch-cleanup: unused AFAICT
* CI: inline molecule logs
Single use site -> less indirection makes it easier to read.
- This enables the override of the tolerations for the cilium-operator deployment
- default behaviour is to leave the tolerations as-is unless the var is set
With the current github-workflow setup, workflows are triggered on every
forked repository (which is quite wasteful).
Add a condition to only run on the main repository.
kubespray-defaults currently does two things:
- records a number of default variable values (in particular values used
in several places)
- gathers and composes some complex network facts (in particular,
`fallback_ip` and `no_proxy`)
There is no actual reason to couple those two things, and it makes using
defaults more difficult (because computing the network facts is somewhat
expensive, we don't want to do it willy-nilly)
Split the two and adjust import paths as needed.
The Gateway API needs to be installed first if you want to use Cilium's
Gateway API functionality. The Gateway API is just CRD without any Pod,
Deployment, etc., so I think it can be brought forward to before the CNI
installation.
Signed-off-by: ChengHao Yang
The recommended usage of kubespray is to use the default versions.
So putting them in inventory/sample is not really very helpful, and
causes:
- churn (keeping the inventory/sample up to date)
- support issues (mismatch between defaults and sample inventory)
Remove all concrete versions from the inventory sample.
bootstrap-os does not do anything in sudoers since e2ad6aad5 (bootstrap:
rework role (#4045), 2019-02-11).
So SSH pipelining working is effectively a pre-requisite anyway.
The preinstall asserts cover a number of things, many of which depend
only on the inventory, and can be run without any ansible_facts
collected.
Split them off to simplify re-ordering.
* docs: Fix offline-environment.md to add 'v' prefix of some versions
Now some version variables (kube_version, etcd_version, etc) don't have a 'v' prefix,
so you need to add the 'v' prefix to download URLs.
* fix: Fix offline.yml to add 'v' prefix of some versions
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
This commit fixes the process to ensure that the CCM is installed first to
avoid the chicken-and-egg problem.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* fix(containerd): always render NRI plugin block with conditional disable flag
* feat: enable Node Resource Interface plugin when using containerd
* fix: remove the
* fix: fix for linter
subscription-manager status can in some circumstances just never
terminate, with nothing in the Ansible playbook log indicating the
problem.
This makes it difficult to find the misbehaving hosts.
Add a timeout to the subscription checks (defaulting to 3 minutes). This
should be more than enough for normal circumstances while allowing
easier troubleshooting, as the hosts will be FAILED instead of the
playbook just waiting indefinitely.
* terraform upcloud: Added possibility to set up nodes with only private IPs
* terraform upcloud: add support for gateway in private zone
* terraform upcloud: split LB proxy protocol config per backend
* terraform upcloud: fix flexible plans
* terraform upcloud: Removed overview of cluster setup
---------
Co-authored-by: davidumea <david.andersson@elastisys.com>
This is more in-line with dependabot and similar auto-updaters.
Reduce CI coverage for GitHub Actions updates (they do not change
kubespray code, so there is no need for testing).
* Remove heketi
Heketi is no longer developed or supported and should not be used
anymore.
Remove the contrib playbook.
* Remove contrib glusterfs
The glusterfs integration is now either deprecated or
unsupported.
Other storage solutions should be preferred.
This commit enhances the node removal playbook's reliability and safety by implementing the following changes:
1. **Node Validation**: Added a validation step using assert to ensure the `node` variable is defined and contains nodes. If the list is empty or undefined, the playbook fails early, preventing accidental operations on the entire cluster.
2. **Removed Defaulting for Hosts**: Updated tasks to enforce explicit `node` variable input without defaulting to critical groups (e.g., `etcd:k8s_cluster:calico_rr`). By validating `node` beforehand, tasks now solely rely on user-provided input and safely avoid unintended targeting.
3. **Explicit User Confirmation**: Enhanced the confirmation prompt to clarify the scope of the operation. The admin is now required to explicitly confirm node state deletion, ensuring a deliberate decision before proceeding.
These improvements strengthen the reliability and safety of the `remove-node.yml` playbook by eliminating ambiguous behavior, preventing misconfigurations, and ensuring clear interaction during node removal tasks.
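A sketch of the kind of validation described in point 1 (illustrative, not the literal playbook task):

```yml
- name: Ensure an explicit list of nodes was provided
  assert:
    that:
      - node is defined
      - node | length > 0
    fail_msg: "Pass the nodes to remove explicitly, e.g. -e node=node4,node5"
  run_once: true
```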
Vagrant jobs need a big cache, which makes them slow / sometimes stuck
completely. Using the kubevirt provisioning playbook is now
significantly faster, so do just that.
Having only one provisioner in CI will also allow us to remove some of
the custom runner executors we use for vagrant, and more generally
reduce the CI maintenance.
Our kubevirt CI platform does not support ipv6 yet, so we keep the
relevant jobs in vagrant, but we'll migrate them as well as soon as
possible.
- Take advantage of `parallel:matrix` to make the jobs definition shorter
and more readable.
- Remove helper scripts which are no longer needed
- Remove redundant indirection in the gitlab-ci pipelines definitions
(only one user)
This commit upgrades ingress-nginx to version v1.12.1, addressing multiple critical vulnerabilities including CVE-2025-1974, CVE-2025-1097, CVE-2025-1098, CVE-2025-24513, and CVE-2025-24514 as detailed in the ingress-nginx release notes: https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.12.1
Important Notes:
- Fixing CVE-2025-1974 required disabling validation of the generated NGINX configuration during validation of Ingress resources. Invalid Ingress resources may stop the NGINX configuration from being updated.
- Recommended mitigations include enabling annotation validation and disabling snippet annotations.
Alongside this upgrade, the `ingress_nginx_kube_webhook_certgen_image_tag` has been updated to v1.5.2 for compatibility, based on: https://github.com/kubernetes/ingress-nginx/pull/13066
Changelog:
- Updated ingress-nginx version to v1.12.1 in Kubespray.
- Updated `ingress_nginx_kube_webhook_certgen_image_tag` in `roles/kubespray-defaults/defaults/main/download.yml` to v1.5.2.
Fixes: https://github.com/kubernetes-sigs/kubespray/issues/12073
* Refactor control plane upgrades with reconfiguration support
Adds revised support for:
- The previously removed `--config` argument for `kubeadm upgrade apply`
- Changes to `ClusterConfiguration` as part of the `upgrade-cluster.yml` playbook lifecycle
- kubeadm-config `v1beta4` `UpgradeConfiguration` for the `kubeadm upgrade apply` command: [UpgradeConfiguration v1beta4](https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta4/#kubeadm-k8s-io-v1beta4-UpgradeConfiguration).
* Add kubeadm upgrade node support
Per discussion:
- Use `kubeadm upgrade node` on secondary control plane upgrades
- Add support for UpgradeConfiguration.node in kubeadm-config.v1beta4
- Remove redundant `allowRCUpgrades` config
- Revert from `block` for first and secondary control plane back to unblocked tasks since they no longer share much code and it's more readable this way
* Add kubelet and kube-proxy reconfiguration to upgrades
* Fix task to use `kubeadm init phase etcd local`
* Rebase with changes from "Adapt checksums and versions to new hashes updater" PR
* Add `imagePullPolicy` and `imagePullSerial` to kubeadm-config v1beta4 `InitConfiguration.nodeRegistration`
* Ensure correct `AuthorizationConfiguration` API version during upgrades
Fixes an issue where the wrong AuthorizationConfiguration API version could be used by kube-apiserver prematurely during upgrades.
The `kubernetes/control-plane` role writes configuration for the target version before control plane pods are upgraded.
However, since the `AuthorizationConfiguration` file is reconciled continuously, this leads to a race condition where a new configuration version can be reconciled before kube-apiserver is upgraded to the compatible version.
This solution ensures the correct configuration is available throughout the process by writing each api version to a different file path. Unused file versions are cleaned up post-upgrade for better hygiene.
* Avoid from_json in cleanup task
The versions which are by default derived from `kube_version` can break
the assert if kube_version starts with `v`, because they use the start of
`kube_version` as a dict key.
By putting them in their own assert, the first assert should trigger on
`kube_version`, with a more explicit error.
[WARNING][1] kube-controllers/runconfig.go 193: unable to list KubeControllersConfiguration(default) error=connection is unauthorized: kubecontrollersconfigurations.crd.projectcalico.org "default" is forbidden: User "system:serviceaccount:kube-system:calico-kube-controllers" cannot list resource "kubecontrollersconfigurations" in API group "crd.projectcalico.org" at the cluster scope
* Upcloud: Added support for routers and gateways
* Upcloud: Added ipsec properties for UpCloud gateway VPN
* Upcloud: Added support for deprecated network field for loadbalancers
There is little reason to share an ssh key common to all CI jobs, so
generate one for each on the fly.
Also use plain-text cloud-init config instead of base64 for readability
To work with molecule, we need to use the name provided by molecule_yml
in inventory.
Inject the name in the VirtualMachineInstance (with a default to handle
the non-molecule scenario) and get it back as part of the inventory.
Account for no ansible groups.
The current templating of kubevirt VirtualMachine relies on global
ansible variables, except for the group the nodes are meant to be in.
In order to have more flexibility (in particular, mixed-OS clusters for
instance), expect now an arbitrary dict to be passed to the template;
this allows embedding directly in the nodes definition any variable used
by the template.
The script is obsoleted by 5d7236ea5 (Merge pull request #11890 from
VannTen/download_graphql_checksums_2, 2025-03-09), since the format of
checksums is no longer compatible.
Allow the use of different hashes, as supported by the get_url
Ansible module.
Change the variable name accordingly to 'checksum' since it's not
exclusively sha256 anymore.
The versions are nearly all .0 because of the gvisor release scheme.
This means they need to be quoted in yaml to be considered strings.
Special casing by removing the .0 makes tooling more complicated, and it
does not gain us anything apart from a nicer looking file (I guess).
So just use the version of upstream gvisor and quote it.
* CI: Put pre-commit cache under CI_PROJECT_DIR
Apparently gitlab-runner can't cache stuff outside of the project
directory.
Put the cache under CI_PROJECT_DIR to make it work (which also means we
need to ignore it from ansible-lint).
Also update the pre-commit image while we're at it.
Link: https://gitlab.com/gitlab-org/gitlab/-/issues/14151
* update ansible-lint pre-commit
This adds a new flag with default `kubeadm_config_validate_enabled: true` to use when debugging features and enhancements affected by the `kubeadm config validate` command.
This new flag should be set to `false` only for development and testing scenarios where validation is expected to fail (pre-release Kubernetes versions, etc).
While working with development and test versions of Kubernetes and Kubespray, I found this option very useful.
* Automatically derive default versions from checksums
Currently, when updating checksums, we manually update the default
versions.
However, AFAICT, for all components where we have checksums, we're using
the newest version out of those checksums.
Codify this in the `_version` defaults variables definition to make the
process automatic and reduce manual steps (as well as the diff size
during reviews).
We assume the versions are sorted, with newest first. This should be
guaranteed by the pre-commit hooks.
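Conceptually, the derivation looks something like this (variable names are illustrative, not the exact Kubespray definitions):

```yml
# With checksums ordered newest-first, the default version is simply the first key.
runc_version: "{{ runc_checksums['amd64'] | first }}"
```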
* Validate checksums are ordered by versions, newest first
* Generalize render-readme-versions hook for other static files
The pre-commit hook introduced in a142f40e2 (Update versions in README.md
with pre-commit, 2025-01-21) allows updating our README with new
versions.
It turns out other "static" files (== which don't interpret Ansible
variables) also use the default version (in that case, our Dockerfiles,
but there might be others).
The Dockerfile breaks if the variable it uses (`kube_version`) is a
Jinja template.
For helping with automatic version upgrade, generalize the hook to deal
with other static files, and make a template out of the Dockerfile.
* Dockerfile: template kube_version with pre-commit instead of runtime
* Validate all versions/checksums are strings in pre-commit
All the ansible/python tooling for versions is for version strings. YAML
unhelpfully considers some values as numbers, so enforce this.
* Stringify checksums versions
* exclude .ansible in ansible-lint
* remove `ctr i pull` workaround
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
---------
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
The build steps at the start of CI take about 2 minutes; now that we
have greatly reduced the overall duration, this is not an insignificant
impact.
Add timestamps to the build process to measure which steps of the
image build take the most time.
* Remove krew installation support
Krew fundamentally exists to install kubectl plugins, which are eminently a
client-side thing.
It's also not difficult to install on a client machine.
* Remove krew cleanup
This has been deprecated for a long time, time to pull the plug.
We leave an assert for one release to have a straightforward failure if
some users were still using the variable.
Since 'none' can be, for instance, a manual calico deployment, don't
check whether there are enough IPs for pods on a node, because the plugin
can use another mechanism than the podCIDR to allocate IPs.
When the etcd group is not specified we assume it's kube_control_plane.
In that case, the number of etcd members still can't be even, so instead of only checking the
etcd group we need to default to kube_control_plane.
ansible-lint and yamllint are run as pre-commit hooks, which are
installed by pre-commit directly. So there is no need to put them in
tests/requirements.txt.
So remove them and make it leaner.
Upstream calico isn't doing that, and:
- this can cause throttling
- the cpu needed by calico is very cluster / workload dependent
- missing cpu limits will not starve other pods (unlike missing memory
requests), because the kernel scheduler will still give priority to
other processes in pods not exceeding their requests
Currently, versions in README.md need to be manually updated, and we
check it's done with a bash script.
Add a small utility playbook to add versions in README.md from their
actual default values, automatically.
This is done in pre-commit, and replaces the scripted check; instead it
will autofix the README.md, and fail in CI if needed.
We switch markdownlint behind the local hooks to give it the opportunity
to catch a problem with the rendering.
Since e8ee42280 (CI: remove deletion tasks of 'packet' VMs, 2024-09-13),
our tests appear to not be flaky anymore.
The current retry slows down the testing feedback on pull requests.
Since it's not needed anymore, don't retry and fail fast.
This is handy when some component release is buggy (missing file at the
download link) to not block everything else.
Move the filtering up the stack so we don't have to do it multiple
times.
Gvisor releases, besides only being tags, have some particularities:
- they are of the form yyyymmdd.p -> this gets interpreted as a yaml
float, so we need to explicitly convert it to a string to make it work.
- there is no semver-like attached to the version numbers, but the API
(= OCI container runtime interface) is expected to be stable (see
linked discussion)
- some older tags don't have hashes for some archs
Link: https://groups.google.com/g/gvisor-users/c/SxMeHt0Yb6Y/m/Xtv7seULCAAJ
Gvisor is the only one of our deployed components which use tags instead
of proper releases. So the tags scraping support will, for now, cater to
gvisor particularities, notably in the tag name format and the fact that
some older releases don't have the same URL scheme.
Containerd uses the same repository for releases of its gRPC API (which
we are not interested in).
Conveniently, those releases have tags which are not valid version
numbers (being prefixed with 'api/').
This could also be potentially useful for similar cases.
The risk of missing releases because of this is low, since it would
require that a project issue a new release with an invalid format, then
switch back to the previous format (or we miss the fact it's not
updating for a long period of time).
The Github graphQL API needs IDs for querying a variable array of
repositories.
Use a dict for components instead of an array of URLs and record the
corresponding node ID for each component (there are duplicates because
some binaries are provided by the same project/repository).
Adds the ability to configure the Kubernetes API server with a structured authorization configuration file.
Structured AuthorizationConfiguration is a new feature in Kubernetes v1.29+ (GA in v1.32) that configures the API server's authorization modes with a structured configuration file.
AuthorizationConfiguration files offer features not available with the `--authorization-mode` flag, although Kubespray supports both methods and authorization-mode remains the default for now.
Note: Because the `--authorization-config` and `--authorization-mode` flags are mutually exclusive, the `authorization_modes` ansible variable is ignored when `kube_apiserver_use_authorization_config_file` is set to true. The two features cannot be used at the same time.
Docs: https://kubernetes.io/docs/reference/access-authn-authz/authorization/#configuring-the-api-server-using-an-authorization-config-file
Blog + Examples: https://kubernetes.io/blog/2024/04/26/multi-webhook-and-modular-authorization-made-much-easier/
KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3221-structured-authorization-configuration
I tested this all the way back to k8s v1.29 when AuthorizationConfiguration was first introduced as an alpha feature, although v1.29 required some additional workarounds with `kubeadm_patches`, which I included in example comments.
I also included some example comments with CEL expressions that allowed me to configure webhook authorizers without hitting kubeadm 1.29+ issues that block cluster creation and upgrades such as this one: https://github.com/kubernetes/cloud-provider-openstack/issues/2575.
My workaround configures the webhook to ignore requests from kubeadm and system components, which prevents fatal errors from webhooks that are not available yet, and should be authorized by Node or RBAC anyway.
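For reference, a minimal AuthorizationConfiguration file looks roughly like this (a sketch; the apiVersion depends on the Kubernetes release, e.g. v1beta1 before GA):

```yml
apiVersion: apiserver.config.k8s.io/v1
kind: AuthorizationConfiguration
authorizers:
  - type: Node
    name: node
  - type: RBAC
    name: rbac
```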
This avoids spurious failures with 'localhost'.
It should also be more correct when the inventory contains uncached hosts
which are not in `k8s_cluster` and therefore should not be Kubespray's
business.
(We still use hostvars for uncached hosts, because it's easier to select
on 'ansible_default_ipv4' that way and does not change the end result)
We use a lot of facts where variables are enough, and format too early,
which prevents reusing the variables in different contexts.
- Move set_fact variables to the vars directory, remove unnecessary
intermediate variables, and render them at usage sites to only do logic
on native Ansible/Jinja lists.
- Use defaults/ rather than default filters for several variables.
There is no test with IDEMPOT_CHECK=true since commit 7b78e6872 (disable
idempotency tests (#1872), 2017-10-26)
Remove the related infra from our CI scripts.
This displays the result of most tasks in a human-readable way, and
should be way more readable than what we have now, which is frequently a
bunch of unreadable json.
+ some small fixes (using delegate_to instead of when
<control_plane_host>)
- special casing should be in Kubespray, not in the test. It makes no
sense to do something in tests which won't be done in actual usage.
- We don't actually test CoreOS at all in the CI.
This is a followup to 2ba28a338 (Revert "Wait for available API token in
a new namespace (#7045)", 2024-10-25).
While checking for the serviceaccount token is not effective, there is
still a race when creating a Pod directly, because the ServiceAccount
itself might not be created yet.
More details at https://github.com/kubernetes/kubernetes/issues/66689.
This causes very frequent flakes in our CI with spurious failures.
Use a Deployment instead; it will take care of creating the Pods and
retrying; it also lets us use kubectl rollout status instead of manually
checking for the pods.
To reproduce this commit run in bash:
for file in $(ls tests/files/)
do
  if ! grep -Rq ${file%.*} .gitlab.ci; then
    rm tests/files/${file}
  fi
done
This also means that our CI matrix was not accurate.
This is needed for shutdown ordering: while at startup it's not a
problem that containerd starts before dbus (the dbus socket already
exists), it needs to shut down before dbus to do its cleanup (asking
systemd via dbus to clean up cgroups).
This is expected to be used in the command module this way:
  command:
    cmd: "{{ kubectl_apply_stdin }}"
    stdin: <... rendered manifests >   -> using the 'template' lookup plugin in most cases.
The advantages over the kube plugin module integrated in kubespray
(which this should replace eventually):
- way easier to modify to take advantage of new features (server-side
apply for instance)
- no need for a separate template task + checking the result (which can
introduce problems if the first playbook run encounters an error).
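A sketch of how this is meant to be consumed (the manifest path is a placeholder):

```yml
- name: Apply rendered manifests via kubectl
  command:
    cmd: "{{ kubectl_apply_stdin }}"
    stdin: "{{ lookup('template', 'some-manifest.yml.j2') }}"
  run_once: true
```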
Those files haven't been touched in roughly 5 years, and pip install on
Kubespray errors out.
The 'Requires:' are outdated, which suggests that no one is using this.
8ff4ad2d8 (preinstall: simplify OS packages selection, 2024-11-04)
removed all usages of ansible.utils.validate (not that many), so the
dependency is no longer necessary.
config_path was introduced in containerd 1.5.0, and registry.mirrors is
deprecated.
There is no reason to keep the old alternative, so just always use
config_path, and consequently remove the option.
Our README is currently pretty cluttered:
- Part of the README duplicates docs/getting_started/getting-started.md
-> Remove duplicates and extract useful info into the getting-started.md
- General info on Ansible environment troubleshooting
-> remove most of it as it's not specific to Kubespray, move to
docs/ansible/ansible.md
-> split inventory-related stuff of ansible.md into its own file. This
should host documentation on how to manage Kubespray inventories in the
future.
ansible.md:
- remove the list of "Unused" variables, as:
1. It's not accurate
2. What matters is where users should put their variables
contrib/dind uses inventory_builder, which is removed. It overlaps with
the function of kind (Kubernetes in Docker) and has not seen changes
(apart from linting-driven ones) for a long time.
It also does not seem to work (the provisioning playbook crashes).
This only really helps with the easiest part of building your inventory
(listing the hosts) as you still need to edit your group vars and
similar.
The opaqueness of the script does not really help our users to
understand their own inventory.
Furthermore, there is not really a reason that something which is common
to all the Ansible ecosystem should be done in a special way for
Kubespray.
* Add vars for configuring cilium IP load balancer pools and bgp peer policies
* Cilium 1.16+ Support - Add vars for configuring cilium bgpv2 api & handle cilium_kube_proxy_replacement unsupported values
When using
  dns_upstream_forward_extra_opts:
    prefer_udp: ""  # the option has no value, so use an empty string to just put the key
This is rendered in the dns configmap as ($ for end-of-line)
...
prefer_udp $
...
Note the trailing space.
This triggers https://github.com/kubernetes/kubernetes/issues/36222,
which makes the configmap hardly readable when editing them manually or
simply putting them in a yaml file for inspection.
Trim the concatenation of option + value to get rid of any trailing
space.
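A small self-contained illustration of the fix (the example option is the one from above):

```yml
- name: Show trimmed option/value pairs (illustration only)
  debug:
    msg: "{{ (item.key ~ ' ' ~ item.value) | trim }}"
  vars:
    dns_upstream_forward_extra_opts:
      prefer_udp: ""
  loop: "{{ dns_upstream_forward_extra_opts | dict2items }}"
```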
* Make Helm's 'atomic' parameter configurable from role variables
* Configure Helm with 'atomic' and 'wait' set to false for generic CNI to prevent kubelet-csr-approver installation failures
We use shell scripts and conf files in some roles (notably, certificates
provisioning), so we need to include them in order for the collection to
work when using the configurations depending on those roles.
* kubeadm: do not ignore preflight errors blindly
The "ignoring all errors" seems to date back to the inception of the
kubeadm support (it was --skip-preflight-check before).
This can mask real errors and prevent users from seeing them.
Do not ignore any errors by default and make the set of ignored errors
configurable.
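For illustration, the configurable set could be expressed like this (the variable name is an assumption, not necessarily the one Kubespray ended up using):

```yml
kubeadm_ignore_preflight_errors:
  - DirAvailable--etc-kubernetes-manifests
  # - Mem   # e.g. on memory-constrained CI machines
```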
* download/kubeadm: remove redundant task
The mode is already set by the previous `copy` task.
* Validate kubeadm configs
This should help to fail early when we have invalid kubeadm configs (from
a kubespray bug or a misconfiguration).
* kubeadm-upgrade: remove unnecessary bool cast
* Convert kubeadm join discovery timeout to v1beta4 config
* CI: Ignore kubeadm:Mem errors on some setups.
The new CI does not define the k8s_cluster group, so it relies on
kubernetes-sigs/kubespray#11559.
This does not work for upgrade testing (which uses the previous release).
We can revert this commit after 2.27.0
We should not roll back our test setup during upgrade tests.
The only reason to do that would be for incompatible changes in the test
inventory, and we already check out master for those (${CI_JOB_NAME}.yml).
Also do some cleanup by removing unnecessary intermediary variables.
VirtualMachineInstance resources sometimes temporarily lose their
IP (at least as far as the kubevirt controllers can see).
See https://github.com/kubevirt/kubevirt/issues/12698 for the upstream
bug.
This does not seem to affect actual connections (if it did, our current
CI would not work).
However, our CI execute multiple playbooks, and in particular:
1. The provisioning playbook (which checks that the IPs have been
provisioned by querying the K8S API)
2. Kubespray itself
If any of the VirtualMachineInstances loses its IP after 1
checked for it, and before 2 starts, the dynamic inventory (which is
invoked when the playbook is launched by ansible-playbook) will not have
an ip for that host, and will try to use the name for ssh, which of
course will not work.
Instead, when we have a valid state during provisioning (all IPs
present), use it to construct a static inventory which will be used for
the rest of the CI run.
This allows a single source of truth for the virtual machines in a
kubevirt ci-run.
`etcd_member_name` should be correctly handled in kubespray-defaults for
testing the recover cases.
Not constraining the inventory to .ini allows us to use dynamic
inventory, which is needed for simplifying kubevirt jobs inventory.
Also reduces the scope of the ANSIBLE_INVENTORY variable.
VMI in Kubevirt are the abstraction below VirtualMachine.
- We don't really need the extra abstraction of VirtualMachine objects
- Convert the wait for VM IP addresses to use kubernetes.core.k8s_info
and no shell pipeline
We're still getting bug reports circumventing the bug report template
and omitting information.
Blank issues (aka, not using the form templates) can still be created
using the gh cli, the API, etc. This only disables the possibility in the
Web UI.
- Lookup was not returning a list, making the difference filter spit out
garbage -> query always returns a list
- hostvars is a dictionary, so convert it to a list before selectattr and
map back to only get keys
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Remove kubeadm api version condition.
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
I added the kubeadm_config_api_version variable in the previous commit,
and remove kubeadm api version condition.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
v1beta4 has changed a lot in this file (e.g. ExtraArgs etc.), so it was implemented in separate files.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Since a2019c1c2 (Add a JSON schema describing the packages install
structure, 2024-04-25), we use a custom structure to select which
packages should be installed on a particular host OS.
This has proven too rigid in practice, and the query is pretty
complicated.
Replace this by simply using an array of jinja conditions for the
packages, which should be easier to understand for everyone and more
flexible.
Also remove the associated schema and validation which are no longer
needed.
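A hedged sketch of what the new structure could look like (key names and nesting are assumptions based on the description above, not the exact Kubespray schema):

```yml
pkgs:
  software-properties-common:
    - "ansible_os_family == 'Debian'"
    - "ansible_distribution_release != 'trixie'"
```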
* etcd: throttle restart for availability
During upgrade, etcd members are restarted all at once.
This can impact the availability of the etcd cluster and subsequently of
the Kubernetes cluster.
Limit the concurrent restart so that the etcd cluster can keep quorum.
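A sketch of the throttling idea (not the literal Kubespray handler):

```yml
- name: Restart etcd members one at a time to preserve quorum
  ansible.builtin.systemd:
    name: etcd
    state: restarted
  throttle: 1
```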
* Simplify etcd handlers
* Feat: add external OCI cloud controller manager template & variable
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* Feat: add external OCI cloud controller manager workflow
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* Feat: migrate external OCI CCM config check from OCI cloud provider
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* cloud_controller: oracle: simpler asserts
Make the asserts check for Oracle Cloud Infrastructure external cloud
controller more compact, and hence readable.
Allows putting them back in the main tasks for less back and forth when
reading the code.
---------
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
Co-authored-by: Max Gautier <mg@max.gautier.name>
This reverts commit 275c54e810.
Static tokens are no longer created automatically for service accounts in
Kubernetes. Instead, they are dynamically injected into pods using a
projected volume.
Thus there is no longer a need to check for this (it didn't work anyway,
since the describe output actually contains <none> when there are no
tokens:
  {
      "attempts": 1,
      "changed": false,
      "cmd": "set -o pipefail && /usr/local/bin/kubectl describe serviceaccounts default --namespace test | grep Tokens | awk '{print $2}'",
      "delta": "0:00:00.075633",
      "end": "2024-10-19 14:25:04.858871",
      "msg": "",
      "rc": 0,
      "start": "2024-10-19 14:25:04.783238",
      "stderr": "",
      "stderr_lines": [],
      "stdout": "<none>",
      "stdout_lines": [
          "<none>"
      ]
  }
)
This leverages the Kubernetes GC to delete kubevirt VMs, by using
ownerReferences, with the CI pod running the playbook as the owner.
This concretely means that the control plane in our CI cluster will
delete the kubevirt VMs associated with a particular ci job as soon as
that job's pod is deleted, which usually happens when the job terminates
(barring errors, which will be addressed in the cluster directly).
Upgrade to kubevirt.io/v1 for the VirtualMachine manifests, since the
alpha version is deprecated.
Before these changes, `ansible_facts.services["containerd.service"]` would not be defined, and the check for triggering the container stop and delete behaviors would fail.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Simplify registry mirror rendering in config.toml.
The map filter can extract the host list from mirrors so we can
just unique them and render them without needing to construct vars
for it.
For the registry mirror tls section, we can first extract mirrors
from the dict, then filter on only the ones having skip_verify defined,
and then filter on the ones having it set to true (as the dict might not
have skip_verify defined and that would cause undefined-variable errors).
This will speed up and simplify the templating.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
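Roughly, the extraction described above looks like this (the structure of `containerd_registries_mirrors` is assumed and may not match Kubespray exactly):

```yml
- name: Compute mirror host lists (illustration only)
  set_fact:
    mirror_hosts: >-
      {{ containerd_registries_mirrors | map(attribute='mirrors') | flatten
         | map(attribute='host') | unique | list }}
    insecure_mirror_hosts: >-
      {{ containerd_registries_mirrors | map(attribute='mirrors') | flatten
         | selectattr('skip_verify', 'defined') | selectattr('skip_verify')
         | map(attribute='host') | list }}
```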
Dropping the ansible dependencies for ansible-lint will allow us to
catch missing dependencies collections in galaxy.yml. For collections
needed for contrib/ or tests/ (i.e: not part of core kubespray
dependencies), we can just configure ansible-lint to mock them.
This means it won't check the mocked module parameters, but for those
areas of the code base it's an acceptable trade-off.
The fallback_ips tasks are essentially serializing the gathering of one
fact on all the hosts, which can have dramatic performance implications
on large clusters (several minutes).
This is essentially a reversal of 35f248dff0.
Being able to run without refreshing the cached facts is not worth it.
We keep fallback_ip for now, simply changing the access to a normal
hostvars variable instead of a custom dictionary.
Using the hosts directive at the play level prevents those tasks from
being run when using --limit and the group in question is not part of
the limit (ex: running scale.yml on new worker nodes only)
Instead, run on all hosts, and for each group, partition between that
group and '_' (generic group name which is not used; using an empty
string as the group is not supported by ansible.builtin.group_by)
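A minimal sketch of the partitioning trick (the group name is just an example):

```yml
- hosts: all
  gather_facts: false
  tasks:
    - name: Partition hosts between kube_control_plane and a throwaway group
      group_by:
        key: "{{ 'kube_control_plane' if 'kube_control_plane' in group_names else '_' }}"
```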
Reported-by: asteppat <asteppat@cisco.com>
* Update cluster-role for cilium to prevent errors in agent startup
ciliumloadbalancerippools permissions exist in the cilium helm chart for version 1.13.0
https://github.com/cilium/cilium/blob/v1.13.0/install/kubernetes/cilium/templates/cilium-agent/clusterrole.yaml#L71
The agent also needs permissions to read/watch secrets for bgp auth secrets when using CiliumBGPPeeringPolicy with a secret.
* Remove list/watch permissions for secrets
* Remove secrets from list/watch permissions
The old repository for these has been deleted, leaving the previous
configuration impossible to deploy, and even currently running clusters
fail after a restart as the DaemonSet has ImagePullPolicy: Always. More
details can be found here: kubernetes-sigs/vsphere-csi-driver#3053
As of writing, only CSI driver versions 3.1.2 to 3.3.1 are available in
this registry. These "officially" support Kubernetes 1.26 to 1.30. Since
older drivers are not available, I have removed some feature-gating for
those unavailable versions while I was at it. For the cloud provider,
the `latest` image is now missing, and only 1.28.0 to 1.31.0 are
available. I've set the latest of these as the new default.
I also updated the documented default versions, as they were all out of
date and not aligned with actual code defaults.
Node to api-server authentication relies by default on certificates and bootstrap
tokens, and there should be no need to generate tokens for every node,
even when enabling static token auth.
Testing for group membership with group names makes Kubespray more
tolerant towards the structure of the inventory.
Where 'inventory_hostname in groups["some_group"]' would fail if
"some_group" is not defined, '"some_group" in group_names' would not.
- Use proper syntax highlighting for config.rb examples
- Consistent shell style ($ as prompt)
- Use only one way to do things
- Remove OS specific details
The current way to handle a custom inventory in vagrant is a bit
hackish: it copies files around and can break Vagrantfile parsing in
corner-case scenarios (removing vagrant inventories, or the inventory
copied into the vagrant inventory).
Instead, simply pass additional inventories to the ansible-playbook
command lines as raw arguments with `-i`.
This also makes supporting multiple inventories trivial, so we add a
new `$inventories` variable for that purpose.
Specifying one directory for kubeadm patches is not ideal:
1. It does not allow working with multiple inventories easily
2. No ansible templating of the patch
3. Ansible path searching can sometimes be confusing
Instead, provide the patch directly in a variable, and add some quality
of life to handle components targeting and patch ordering more
explicitly (`target` and `type` which are translated to the kubeadm
scheme which is based on the file name)
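A hedged sketch of the described variable shape (the exact schema beyond `target` and `type` is an assumption):

```yml
kubeadm_patches:
  - target: kube-apiserver      # translated to the kubeadm patch file-name scheme
    type: strategic
    patch:
      metadata:
        annotations:
          example.com/patched: "true"
```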
kubernetes/control-plane and kubernetes/kubeadm roles both push kubeadm
patches in the same way.
Extract that code and make it a dependency of both.
This is safe because it's only configuration for kubeadm, which only
takes effect when kubeadm is run.
* Update multus to v4.1.0 and clarify cilium compatibility
* Fix: bug introduced by #10934 where the template would break if multus was defined
* Set priorityClassName to system-node-critical for multus pods
Allow the script to be called with a list of components, to only
download new versions checksums for those.
By default, we get new versions checksums for all supported (by the
script) components.
runc upstream does not provide one hash file per asset in their
releases, but one file with all the hashes.
To handle this (and/or any arbitrary format from upstreams), add a
dictionary mapping the name of the download to a lambda function which
transforms the file provided by upstream into a dictionary of hashes,
keyed by architecture.
The script is currently limited to one hardcoded URL for kubernetes
related binaries, and a fixed set of architectures.
The solution is three-fold:
1. Use a URL template dictionary for each download -> this allows easily
adding support for new downloads.
2. Source the architectures to search from the existing data
3. Enumerate the existing versions in the data and start searching from
the last one until no newer version is found (newer in the version
order sense, irrespective of actual age)
Remove system|kube_master_<resource>_reserved variables.
Those variables are unnecessary because users can simply use the
variables in group_vars if they wish to differentiate control plane
nodes from other nodes.
Set conservative defaults for ephemeral-storage and pids for both kube
and system reserved resources.
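For example, the group_vars approach suggested above could look like this (values are illustrative):

```yml
# group_vars/kube_control_plane.yml
kube_memory_reserved: 512Mi
kube_cpu_reserved: 200m
system_memory_reserved: 512Mi
```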
Working symlinks are dependent on git configuration (when using the playbook as
a git repository, which is common), precisely `git config
core.symlinks`.
While this is enabled by default, some company policies will disable it.
Instead, use import_tasks which should avoid that class of bugs.
* Simplify docker systemd unit
systemd handles missing units by ignoring the dependency, so we don't need
to template them.
* Remove RHEL 7/CentOS 7 support
- remove ref in kubespray roles
- move CI from centos 7 to 8
- remove docs related to centos7
* Remove container-storage-setup
Only used for RHEL 7 and CentOS 7
* Fix: fix testcases_run.sh for upgrade tests
Need to git checkout ${CI_COMMIT_SHA} before running upgrade playbook (revert #11173 partially)
* feat: add CI job to test upgrade
Add a packet_ubuntu22-calico-all-in-one-upgrade job
* make calico api server manifest backward compatible with versions older than 3.27.3
Add 3.28.1 checksums
Add 3.28.0 checksums
Change default version to 3.27.3
* change default calico version to 3.28.1
* Set mount type to DirectoryOrCreate for hostPath needed by Calico
Fixes https://github.com/kubernetes-sigs/kubespray/issues/10947
This patch aims to be minimal and intentionally:
- does not change the generation logic for `supersede_domain` and `supersede_search`
- does not change how `nameserverentries` (for NetworkManager) is built
It seems like `nameserverentries` in the "Generate nameservers for resolvconf, including cluster DNS"
task is built the same way as `dhclient_supersede_nameserver_entries_list`.
However, `nameserverentries` in the "Generate nameservers for resolvconf, not including cluster DNS"
task (below) is built differently for some reason. It includes `configured_nameservers` as well.
Due to these differences, I have refrained from reusing the same building logic
(`dhclient_supersede_nameserver_entries_list`) for both.
If the `configured_nameservers` addition can be removed or made to apply
to dhclient as well, we could potentially build a single list and then
generate the `nameserverentries` and `supersede_nameserver` strings from it.
* CI: reduce VM resources requests to improve scheduling
* CI: Reduce default jobs; add labels (ci-full/extended) to run more tests
* CI: use jobs dependencies instead of stages
* precommit one-job
* CI: Use Kubevirt VM to run Molecule and Vagrant jobs
The inventory file generated by Terraform produces the following warnings:
```
[WARNING]: * Failed to parse <PATH>/kubespray/contrib/terraform/hetzner/inventory.ini with ini plugin:
<PATH>/kubespray/contrib/terraform/hetzner/inventory.ini:21: Section [k8s_cluster:children] includes undefined group: kube-master
...
[WARNING]: Could not match supplied host pattern, ignoring: kube-master
PLAY [Add kube-master nodes to kube_control_plane] ********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: kube-node
PLAY [Add kube-node nodes to kube_node] *******************************************************************************************************************
skipping: no hosts matched
```
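The root cause is that the generated ini inventory still uses the pre-rename group names (`kube-master`, `kube-node`) while Kubespray now expects `kube_control_plane` and `kube_node`. Expressed as a YAML inventory purely for illustration (hosts are hypothetical), the expected group layout looks roughly like this:

```yaml
all:
  children:
    kube_control_plane:
      hosts:
        node1: {}
    kube_node:
      hosts:
        node2: {}
    k8s_cluster:
      children:
        kube_control_plane: {}
        kube_node: {}
```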
If you have questions, check the documentation at [kubespray.io](https://kubespray.io) and join us on the [kubernetes slack](https://kubernetes.slack.com), channel **\#kubespray**.
You can get your invite [here](http://slack.k8s.io/)
- Can be deployed on **[AWS](docs/cloud_providers/aws.md), GCE, [Azure](docs/cloud_providers/azure.md), [OpenStack](docs/cloud_providers/openstack.md), [vSphere](docs/cloud_providers/vsphere.md), [Equinix Metal](docs/cloud_providers/equinix-metal.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- Can be deployed on **[AWS](docs/cloud_providers/aws.md), GCE, [Azure](docs/cloud_providers/azure.md), [OpenStack](docs/cloud_controllers/openstack.md), [vSphere](docs/cloud_controllers/vsphere.md), [Equinix Metal](docs/cloud_providers/equinix-metal.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- **Highly available** cluster
- **Composable** (Choice of the network plugin for instance)
- Supports most popular **Linux distributions**
@@ -15,74 +15,23 @@ You can get your invite [here](http://slack.k8s.io/)
Below are several ways to use Kubespray to deploy a Kubernetes cluster.
### Docker
Ensure you have installed Docker then
```ShellSession
docker run --rm -it --mount type=bind,source="$(pwd)"/inventory/sample,dst=/inventory \
@@ -201,7 +150,7 @@ Note: Upstart/SysV init based OS types are not supported.
## Requirements
- **Minimum required version of Kubernetes is v1.28**
- **Minimum required version of Kubernetes is v1.30**
- **Ansible v2.14+, Jinja 2.11+ and python-netaddr are installed on the machine that will run Ansible commands**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/operations/offline-environment.md))
- The target servers are configured to allow **IPv4 forwarding**.
@@ -215,10 +164,10 @@ Note: Upstart/SysV init based OS types are not supported.
Hardware:
These limits are safeguarded by Kubespray. Actual requirements for your workload can differ. For a sizing guide go to the [Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/#size-of-master-and-master-components) guide.
- Master
- Memory: 1500 MB
- Node
- Memory: 1024 MB
- Control Plane
- Memory: 2 GB
- Worker Node
- Memory: 1 GB
## Network Plugins
@@ -233,9 +182,6 @@ You can choose among ten network plugins. (default: `calico`, except Vagrant use
- [cilium](http://docs.cilium.io/en/latest/): layer 3/4 networking (as well as layer 7 to protect and secure application protocols), supports dynamic insertion of BPF bytecode into the Linux kernel to implement security services, networking and visibility logic.
- [weave](docs/CNI/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
(Please refer to `weave` [troubleshooting documentation](https://www.weave.works/docs/net/latest/troubleshooting/)).
- [kube-ovn](docs/CNI/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [kube-router](docs/CNI/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
@@ -12,10 +12,10 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
1. (For major releases) On the `master` branch: bump the version in `galaxy.yml` to the next expected major release (X.y.0 with y = Y + 1), make a Pull Request.
1. (For minor releases) On the `release-X.Y` branch: bump the version in `galaxy.yml` to the next expected minor release (X.Y.z with z = Z + 1), make a Pull Request.
1. The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) container images are built and tagged. See the following `Container image creation` section for the details.
1. (Only for major releases) The `KUBESPRAY_VERSION` in `.gitlab-ci.yml` is upgraded to the version we just released # TODO clarify this, this variable is for testing upgrades.
1. The release issue is closed
1. An announcement email is sent to `dev@kubernetes.io` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
1. The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
1. Create/Update Issue for upgrading kubernetes and [k8s-conformance](https://github.com/cncf/k8s-conformance)
## Major/minor releases and milestones
@@ -45,7 +45,7 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
* Minor releases can change components' versions, but not the major `kube_version`.
Greater `kube_version` requires a new major or minor release. For example, if Kubespray v2.0.0
is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: v3.0.6`,
is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: 3.0.6`,
then Kubespray v2.1.0 may be bound to only minor changes to `kube_version`, like v1.5.1
and *any* changes to other components, like etcd v4, or calico 1.2.3.
And Kubespray v3.x.x shall be bound to `kube_version: 2.x.x` respectively.
# libvirt__ipv6_address does not work as intended, the address is obtained with the desired prefix, but auto-generated (like fd3c:b398:698:756:5054:ff:fe48:c61e/64)
# add a default route so ansible_default_ipv6 can be detected
# TODO: fix libvirt__ipv6 or use $subnet in shell
config.vm.provision "shell", inline: "ip -6 r a fd3c:b398:698:756::/64 dev eth1;ip -6 r add default via fd3c:b398:0698:0756::1 dev eth1 || true"
# Disable swap for each vm
node.vm.provision "shell", inline: "swapoff -a"
@@ -238,15 +271,16 @@ Vagrant.configure("2") do |config|
# Deploying a Kubespray Kubernetes Cluster with GlusterFS
You can either deploy using Ansible on its own by supplying your own inventory file, or use Terraform to create the VMs and then provide a dynamic inventory to Ansible. The following two sections are self-contained; you don't need to go through one to use the other. So, if you want to provision with Terraform, you can skip the **Using an Ansible inventory** section, and if you want to provision with a pre-built ansible inventory, you can skip the **Using Terraform and Ansible** section.
## Using an Ansible inventory
In the same directory as this README file you should find a file named `inventory.example`, which contains an example setup. Please note that, in addition to the Kubernetes nodes/masters, we define a set of machines for GlusterFS and add them to the group `[gfs-cluster]`, which in turn is added to the larger `[network-storage]` group as a child group.
Change that file to reflect your local setup (adding or removing machines and setting the appropriate IP addresses), and save it as `inventory/sample/k8s_gfs_inventory`. Make sure that the settings in `inventory/sample/group_vars/all.yml` make sense for your deployment. Then change to the kubespray root folder and execute (assuming the machines all run Ubuntu):
If your machines are not using Ubuntu, change `--user=ubuntu` to the correct user. Alternatively, if your Kubernetes machines use one OS and your GlusterFS machines a different one, you can instead set the `ansible_ssh_user=<correct-user>` variable in the inventory file that you just created, for each machine/VM:
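For example, a minimal sketch in YAML inventory form (host names and users are hypothetical; `inventory.example` itself is ini-format, where the equivalent is `<host> ansible_ssh_user=<correct-user>` on each host line):

```yaml
all:
  hosts:
    k8s-node-1:
      ansible_ssh_user: ubuntu   # user on the Kubernetes nodes
    gfs-node-1:
      ansible_ssh_user: debian   # user on the GlusterFS nodes
```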
First step is to fill in a `my-kubespray-gluster-cluster.tfvars` file with the specification desired for your cluster. An example with all required variables would look like:
As explained in the general terraform/openstack guide, you need to source your OpenStack credentials file, add your ssh-key to the ssh-agent and setup environment variables for terraform:
```shell
$ source ~/.stackrc
$ eval $(ssh-agent -s)
$ ssh-add ~/.ssh/my-desired-key
$ echo Setting up Terraform creds && \
  export TF_VAR_username=${OS_USERNAME} && \
  export TF_VAR_password=${OS_PASSWORD} && \
  export TF_VAR_tenant=${OS_TENANT_NAME} && \
  export TF_VAR_auth_url=${OS_AUTH_URL}
```
Then, standing on the kubespray directory (root base of the Git checkout), issue the following terraform command to create the VMs for the cluster:
This will create both your Kubernetes and Gluster VMs. Make sure that the ansible file `contrib/terraform/openstack/group_vars/all.yml` includes any ansible variable that you want to set up (for instance, the type of machine used for bootstrapping).
Then, provision your Kubernetes (kubespray) cluster with the following ansible call:
For GlusterFS to connect between servers, TCP ports `24007`, `24008`, and `24009`/`49152`+ (that port, plus an additional incremented port for each additional server in the cluster; the latter if GlusterFS is version 3.4+), and TCP/UDP port `111` must be open. You can open these using whatever firewall you wish (this can easily be configured using the `geerlingguy.firewall` role).
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/ansible/latest/collections/gluster/gluster/gluster_volume_module.html) module to ease the management of Gluster volumes.
## Role Variables
Available variables are listed below, along with default values (see `defaults/main.yml`):
```yaml
glusterfs_default_release: ""
```
You can specify a `default_release` for apt on Debian/Ubuntu by overriding this variable. This is helpful if you need a different package or version for the main GlusterFS packages (e.g. GlusterFS 3.5.x instead of 3.2.x with the `wheezy-backports` default release on Debian Wheezy).
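For example, a minimal illustration of such an override (the release name is only an example):

```yaml
# Resolve GlusterFS packages from wheezy-backports instead of the default release.
glusterfs_default_release: "wheezy-backports"
```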
```yaml
glusterfs_ppa_use: yes
glusterfs_ppa_version: "3.5"
```
For Ubuntu, specify whether to use the official Gluster PPA, and which version of the PPA to use. See Gluster's [Getting Started Guide](https://docs.gluster.org/en/latest/Quick-Start-Guide/Quickstart/) for more info.
## Dependencies
None.
## Example Playbook
```yaml
- hosts: server
  roles:
    - geerlingguy.glusterfs
```
For a real-world use example, read through [Simple GlusterFS Setup with Ansible](http://www.jeffgeerling.com/blog/simple-glusterfs-setup-ansible), a blog post by this role's author, which is included in Chapter 8 of [Ansible for DevOps](https://www.ansiblefordevops.com/).
## License
MIT / BSD
## Author Information
This role was created in 2015 by [Jeff Geerling](http://www.jeffgeerling.com/), author of [Ansible for DevOps](https://www.ansiblefordevops.com/).
- name: Ensure GlusterFS is started and enabled at boot.
  service:
    name: "{{ glusterfs_daemon }}"
    state: started
    enabled: yes
- name: Ensure Gluster brick and mount directories exist.
  file:
    path: "{{ item }}"
    state: directory
    mode: 0775
  with_items:
    - "{{ gluster_brick_dir }}"
    - "{{ gluster_mount_dir }}"
- name: Configure Gluster volume with replicas
  gluster.gluster.gluster_volume:
    state: present
    name: "{{ gluster_brick_name }}"
    brick: "{{ gluster_brick_dir }}"
    replicas: "{{ groups['gfs-cluster'] | length }}"
    cluster: "{% for item in groups['gfs-cluster'] -%}{{ hostvars[item]['ip'] | default(hostvars[item].ansible_default_ipv4['address']) }}{% if not loop.last %},{% endif %}{%- endfor %}"
    host: "{{ inventory_hostname }}"
    force: yes
  run_once: true
  when: groups['gfs-cluster'] | length > 1
- name: Configure Gluster volume without replicas
  gluster.gluster.gluster_volume:
    state: present
    name: "{{ gluster_brick_name }}"
    brick: "{{ gluster_brick_dir }}"
    cluster: "{% for item in groups['gfs-cluster'] -%}{{ hostvars[item]['ip'] | default(hostvars[item].ansible_default_ipv4['address']) }}{% if not loop.last %},{% endif %}{%- endfor %}"
    host: "{{ inventory_hostname }}"
    force: yes
  run_once: true
  when: groups['gfs-cluster'] | length <= 1
- name: Mount glusterfs to retrieve disk size
  ansible.posix.mount:
    name: "{{ gluster_mount_dir }}"
    src: "{{ ip | default(ansible_default_ipv4['address']) }}:/gluster"
    fstype: glusterfs
    opts: "defaults,_netdev"
    state: mounted
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
- name: Get Gluster disk size
  setup:
    filter: ansible_mounts
  register: mounts_data
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
- name: Set Gluster disk size to variable
  set_fact:
    gluster_disk_size_gb: "{{ (mounts_data.ansible_facts.ansible_mounts | selectattr('mount', 'equalto', gluster_mount_dir) | map(attribute='size_total') | first | int / (1024 * 1024 * 1024)) | int }}"
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
- name: Create file on GlusterFS
  template:
    dest: "{{ gluster_mount_dir }}/.test-file.txt"
    src: test-file.txt
    mode: 0644
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
- name: Unmount glusterfs
  ansible.posix.mount:
    name: "{{ gluster_mount_dir }}"
    fstype: glusterfs
    src: "{{ ip | default(ansible_default_ipv4['address']) }}:/gluster"
    state: unmounted
  when: groups['gfs-cluster'] is defined and inventory_hostname == groups['gfs-cluster'][0]
when: inventory_hostname == groups['kube_control_plane'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
- name: Kubernetes Apps | Set GlusterFS endpoint and PV
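For orientation, a rough sketch of the kind of objects that task creates; the names, address, and size below are hypothetical, not the playbook's actual templates:

```yaml
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs
subsets:
  - addresses:
      - ip: 10.0.0.11            # one address per gfs-cluster host
    ports:
      - port: 1                  # a dummy port is required by the API
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: glusterfs
spec:
  capacity:
    storage: 10Gi                # in practice derived from gluster_disk_size_gb
  accessModes:
    - ReadWriteMany
  glusterfs:
    endpoints: glusterfs         # must match the Endpoints object name
    path: gluster                # the Gluster volume name
```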
# Deploy Heketi/Glusterfs into Kubespray/Kubernetes
This playbook aims to automate [this](https://github.com/heketi/heketi/blob/master/docs/admin/install-kubernetes.md) tutorial. It deploys Heketi/GlusterFS into Kubernetes and sets up a StorageClass.
## Important notice
> Due to resource limits on the current project maintainers and general lack of contributions we are considering placing Heketi into a [near-maintenance mode](https://github.com/heketi/heketi#important-notice)
## Client Setup
Heketi provides a CLI that gives users a means to administer the deployment and configuration of GlusterFS in Kubernetes. [Download and install the heketi-cli](https://github.com/heketi/heketi/releases) on your client machine.
## Install
Copy the inventory.yml.sample over to inventory/sample/k8s_heketi_inventory.yml and change it according to your setup.