This should make 'no space left on device' problems easier to handle
Use /tmp/releases as local_release_dir CI created machine, while keeping
the same folder on the runner (needed for gitlab-ci runner pods)
Co-authored-by: Max Gautier <mg@max.gautier.name>
* CI: Try a full ssh connection on hosts instead of only checking the port
If we only try the port, we can try to connect in the playbook which is
executed next even though the managed node has not yet completed it's
boot-up sequence ("System is booting up. Unprivileged users are not
permitted to log in yet. Please come back later. For technical details,
see pam_nologin(8).")
This does not account for python-less hosts, but we don't use those in
CI anyway (for now, at least).
* CI: Remove connection method override when creating VMs
This prevented wait_for_connection to work correctly by hijacking the
connection to localhost, thus bypassing the connection check.
---------
Co-authored-by: Max Gautier <mg@max.gautier.name>
Add missing RBAC permissions for Calico apiserver to function correctly
with Kubernetes 1.33+
Changes:
1. Add K8s 1.33 ValidatingAdmissionPolicy resources to calico-webhook-reader
- validatingadmissionpolicies
- validatingadmissionpolicybindings
Kubernetes 1.33 introduced ValidatingAdmissionPolicy resources (KEP-3488)
that require explicit RBAC permissions. Without these changes, Calico
apiserver on k8s 1.33+ will not work and needless errors are logged
Co-authored-by: rickerc <chris.ricker@gmail.com>
The 'old' playbook and the collection use '-' and '_' as separator,
which breaks the logic in scripts/testcases_run.sh.
Add aliases using the old schemes to make the test work and avoid
breaking anything.
Both '-' and '_' variants will be deleted once we switch to supporting
collection only.
Co-authored-by: Max Gautier <mg@max.gautier.name>
* Remove etcd member by peerURLs
The way to obtain the IP of a particular member is convoluted and depend
on multiple variables. The match is also textual and it's not clear
against what we're matching
It's also broken for etcd member which are not also Kubernetes nodes,
because the "Lookup node IP in kubernetes" task will fail and abort the
play.
Instead, match against 'peerURLs', which does not need new variable, and
use json output.
* Add testcase for etcd removal on external etcd
* do not merge
* fixup! Remove etcd member by peerURLs
* fixup! Remove etcd member by peerURLs
---------
Co-authored-by: Max Gautier <mg@max.gautier.name>
fixed kubelet condition
CRI-O: fix for handling of container runtime switching
refactored kubelet start condition
stop/start kubelet and crio only when default runtime is changed
fixed condition for runtime_matches fact variable
fixed set facts for existing container runtime
added crio runtime switch variable
changed condition to use runtime switch variable
added comment for not-found for readers
Allow setting deployment replicas through `coredns_replicas` when
`enable_dns_autoscaler` is set to false.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The option ara_default was still present in ansible.cfg under callbacks_enabled.
This is a leftover from commit b9e9364 ("Remove ara support in CI") and should
have been removed together with the rest of the ara integration.
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Enable reserved variable name checks and fix violations
Updated .ansible-lint configuration to skip only var-naming[pattern]
and var-naming[no-role-prefix] instead of skipping the entire var-naming rule.
This enables the check for reserved variable names.
Renamed variables that used reserved names to avoid conflicts.
Updated all references in tasks, variables, and templates.
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Rename namespace variable inside tasks instead of deleting it
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Change hosts variable to vm_hosts
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
* Use k8s_namespace instead of dashboard_namespace in dashboard.yml.j2 template
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
---------
Signed-off-by: Ali Afsharzadeh <afsharzadeh8@gmail.com>
Fixes a bug where `kube-apiserver` fails to start if the PodSecurity
configuration file doesn't have the `apiVersion` and `kind` keys.
Signed-off-by: Alejandro Macedo <alex.macedopereira@gmail.com>
Retrying to load conntrack modules was bound to fail due to the way, the current `when` conditions were utilized.
It was based on the assumption, that in case of success, the registered variable would have an `rc` attribute with the value `0`.
Unfortunately, the `rc` attribute is only present in case of a failure, where it's value is >1.
The result of `community.general.modprobe` in case of success looks like this:
```
{
"changed": false,
"msg": "All items completed",
"results": [
{
"ansible_loop_var": "item",
"changed": false,
"failed": false,
"invocation": {
"module_args": {
"name": "nf_conntrack",
"params": "",
"persistent": "present",
"state": "present"
}
},
"item": "nf_conntrack",
"name": "nf_conntrack",
"params": "",
"state": "present"
}
],
"skipped": false
}
```
While it looks like this in case of a failure:
```
{
"changed": false,
"failed": true,
"msg": "One or more items failed",
"results": [
{
"ansible_loop_var": "item",
"attempts": 3,
"changed": false,
"failed": true,
"invocation": {
"module_args": {
"name": "nf_conntrack_doesnotexist",
"params": "",
"persistent": "present",
"state": "present"
}
},
"item": "nf_conntrack_doesnotexist",
"msg": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
"name": "nf_conntrack_doesnotexist",
"params": "",
"rc": 1,
"state": "present",
"stderr": "modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64\n",
"stderr_lines": [
"modprobe: FATAL: Module nf_conntrack_doesnotexist not found in directory /lib/modules/5.14.0-570.32.1.el9_6.x86_64"
],
"stdout": "",
"stdout_lines": []
}
],
"skipped": false
}
```
By evaluating `failed` instead, this issue can be prevented.
See also:
- https://github.com/kubernetes-sigs/kubespray/issues/11340
Co-authored-by: Max Gautier <mg@max.gautier.name>
The Prometheus Operator CRDs are commonly used for monitoring and are
used by some CNIs (such as Cilium). Kubespray can be installed first,
and the subsequent installation of the operator can be handled by the
user (or later extensions).
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Add variable to set kubelet staticPodPath location.
It can be set to empty so that we can choose to disable it for some nodes.
STIG recommendation is to disable it.
Signed-off-by: Shaleen Bathla <shaleenbathla@gmail.com>
Co-authored-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Debian Trixie recently removed the package `software-properties-common`,
add the condition not on Debian Trixie.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Remove --auth-anonymous if kube_api_anonymous_auth in undefined, to avoid
compatibility errors with other arguments of the kube-apiserver, such as
--authentication-config when anonymous field is configured.
* Add header configuration in containerd hosts.toml
Signed-off-by: Alexander Gil <pando855@gmail.com>
* Disable log output on containerd mirrors settings if required
Signed-off-by: Alexander Gil <pando855@gmail.com>
---------
Signed-off-by: Alexander Gil <pando855@gmail.com>
* docs: remove obsolete reference to `gen_tags.sh`
`scripts/gen_tags.sh` was removed in 373b952a0c
* docs: fix 404 links
Merge the `Requirements` section with the `Usage` section and just
reference the inventory documentation, which then points to all further
information related to group vars etc.
* feat(subnet): Ensure Vagrant subnet not in use by localhost
This commit ensures that Vagrantfile supplied $subnet is not in use by
the localhost. Previously, if the subnet is in use by localhost (i.e.
bridge network), Vagrant VM boxes can not communicate.
* refactor(socket): Use ruby Socket library to find addrs
This commit reverts the usage of Ruby .scan() which may result in
failure if program is not provided. Instead, this commit refactors to
use Socket library to determine interfaces in use, then proceeds to
compare with Vagrantfile supplied subnets. Additionally, the commit
supports IPv6 comparisons.
This allows to use kubespray_defaults (once) instead of redefining
defaults in the tests.
Test test files becomes imported tasks rather thand standalone
playbooks.
* Test: molecule replace ubuntu2004 with ubuntu2204 ubuntu2404
cri-dockerd, adduser and bastion-ssh-config can't run ubuntu2404, maybe needs to check login.
"System is booting up. Unprivileged users are not permitted to log in yet. Please come back later. For technical details, see pam_nologin(8)."
Signed-off-by: ChengHao Yang
<17496418+tico88612@users.noreply.github.com>
* Test: replace ubuntu-2004 with ubuntu-2404
All ubuntu-2004 tests are removed.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update ci.md
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* Docs: update README.md
Remove Ubuntu 20.04 support
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
---------
Signed-off-by: ChengHao Yang
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Not really a reason not to, and this actually breaks daily-ci because
some jobs depends on this one so the whole pipeline is invalid if it's
not created.
This uses the same logic than the other versions, with simplications for
crictl and crio whose versionning scheme is tied to upstream kubernetes.
Also move some version variables in vars/ rather than defaults/, because
they are not used elsewhere and don't really make sense as modifiable by
the user.
Currently, there is no reliable way to obtain individual CRD files, so
the only solution is to update first.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
When installing or upgrading in the past, there was no validation
config. Check if the file exists first to prevent subsequent validation
errors.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
The validation step is moved to the end to avoid the loss of files that
may lead to verification failure.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
`cilium install` is equivalent to `helm install`, it will failed if
cilium relase exist. `cilium version` can know the release exist without
helm binary
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Give users two options: besides skip Cilium, add
`cilium_remove_old_resources`, default is `false`, when set to `true`,
it will remove the content of the old version, but it will cause the
downtime, need to be careful to use.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
This patch fixes the indentation in the `encryption` section.
Previously configuration like this:
```yml
cilium_encryption_enabled: true
cilium_encryption_type: wireguard
```
Would template to a `values.yaml` file with indentation that looks like this:
```yml
encryption:
enabled: True
type: wireguard
nodeEncryption: False
```
instead of this:
```yml
encryption:
enabled: true
type: wireguard
nodeEncryption: false
```
This syntax issue causes an error during Cilium installation.
This patch also makes all boolean values in this template file go through the `to_json` filter.
Since values like `True` and `False` are not compliant with the YAML v1.2 spec,
avoiding them is preferable.
`to_json` may be used for all other values in this template to ensure we end up with
a valid YAML document in all cases (even when various strings include special characters),
but this was left for another (future) patch.
* Fix: check expiraty before renew
Since certificate renewal and container restarts involve higher risks,
they should be executed with extra caution.
* squash to Fix: check expiraty before renew
* squash to Fix: address more comments from VannTen
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
---------
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
* Remove tag master
Following it's deprecation in 4b324cb0f (Rename master to control plane
- non-breaking changes only (#11394), 2024-09-06)
* Add fail fast path when using removed tags
- Used for the master tag, but this could be used for other things in
the future
The checksums are not a defaults and are not meant to be changed from
the inventories.
Furthermore, role defaults have a lower priority that hosts facts, which
technically means a rogue hosts could hijack the hashes for its
variables.
* feat(cilium): add configurable Hubble export log rotation parameters
- Adds support for `cilium_hubble_export_file_max_backups` and `cilium_hubble_export_file_max_size_mb`
- Applies values only if `cilium_hubble_export_file_path` is defined
- Default values are set in role defaults
- Cleans up template logic by removing unnecessary conditionals
* Fix indentation for hubble export settings
* Fix undefined variable issue with ipwrap in kubeconfig override that caused pre-commit errors
* Update main.yml
rollback
This is now handled directly at the failfast-ci level (== integration
Github <-> Gitlab).
The whole pipeline will not be triggered unless:
- The author is a maintainer
- The PR has the /ok-to-test label
dnsautoscaler should only be enabled when enable_dns_autoscaler is
set to true. without this, it could be enabled without any manifest
actually using it, which makes it a false signal.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
The switch to not use system packages for containerd packages happened
multiples releases ago ; there should not be any up-to-date installation
of kubespray needing that cleanup.
Remove those steps and variables only used by them.
* Delete unused scripts
- gen_tags.sh: not the right file, produce garbage even if path is fixed
- premoderator.sh: not used since ef6d24a49 (CI require a 'lgtm' or
'ok-to-test' labels to pass (#11251), 2024-05-31)
- gitlab-branch-cleanup: unused AFAICT
* CI: inline molecule logs
Single use site -> less indirection makes it easier to read.
- This enables ithe override of the tolerations for the cilium-operator deployment
- default behaviour is to leave the toleration as is unless the var is set
With the current github-workflow setup, workflows are triggered on every
forked repository (which is quite wasteful).
Add a condition to only run on the main repository.
kubespray-defaults currently does two things:
- records a number of default variable values (in particular values used
in several places)
- gather and compose some complex network facts (in particular,
`fallback_ip` and `no_proxy`
There is no actual reason to couple those two things, and it makes using
defaults more difficult (because computing the network facts is somewhat
expensive, we don't want to do it willy-nilly)
Split the two and adjust import paths as needed.
The Gateway API needs to be installed first if you want to use Cilium's
Gateway API functionality. The Gateway API is just CRD without any Pod,
Deployment, etc., so I think it can be brought forward to before the CNI
installation.
Signed-off-by: ChengHao Yang
The recommended usage of kubespray is to use the default versions.
So putting them in inventory/sample is not really very helpful, and
causes:
- churn (keeping the inventory/sample up to date)
- support issues (mismatch between defaults and sample inventory)
Remove all concrete versions from the inventory sample.
bootstrap-os does not do anything in sudoers since e2ad6aad5 (bootstrap:
rework role (#4045), 2019-02-11).
So SSH pipelining working is effectively a pre-requisite anyway.
The preinstall assert cover a number of things, many of which depends
only on the inventory, and can be run without any ansible_facts
collected.
Split them off to simplify re-ordering.
* docs: Fix offline-environment.md to add 'v' prefix of some versions
Now some version variables (kube_version, etcd_version, etc) don't have 'v' prefix,
so you need to add 'v' prefix to download URLs.
* fix: Fix offline.yml to add 'v' prefix of some versions
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
* [Issue-12117]-Certificates for the new hosts are not generated during scale.yml
This commit fixed the process to ensure that CCM is installed first to
avoid the chicken-and-egg problem.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
* fix(containerd): always render NRI plugin block with conditional disable flag
* feat: enable Node Resource Interface plugin when using containerd
* fix: remove the
* fix: fix for linter
subscription-manager status can in some circumstances just never
terminates, with nothing indicating the problem from the Ansible
playbook log.
This makes it difficult to find the hosts misbehaving.
Add a timeout to the subscription checks (defaulting to 3 minutes). This
should be more than enough for normal circumstances while allowing
easier troubleshooting, as the hosts will be FAILED instead of the
playbook just waiting indefinitely.
* terraform upcloud: Added possibility to set up nodes with only private IPs
* terraform upcloud: add support for gateway in private zone
* terraform upcloud: split LB proxy protocol config per backend
* terraform upcloud: fix flexible plans
* terraform upcloud: Removed overview of cluster setup
---------
Co-authored-by: davidumea <david.andersson@elastisys.com>
This is more in-line with dependabot and similar auto-updaters.
Reduce ci coverage on github action updating (it does not change
kubespray code, no need for testing).
* Remove heketi
Heketi is no longer developed or supported and should not be used
anymore.
Remove the contrib playbook.
* Remove contrib glusterfs
Glusterfs integration with glusterfs is now either deprecated or
unsupported.
Other storage solutions should be preferred.
This commit enhances the node removal playbook's reliability and safety by implementing the following changes:
1. **Node Validation**: Added a validation step using assert to ensure the `node` variable is defined and contains nodes. If the list is empty or undefined, the playbook fails early, preventing accidental operations on the entire cluster.
2. **Removed Defaulting for Hosts**: Updated tasks to enforce explicit `node` variable input without defaulting to critical groups (e.g., `etcd:k8s_cluster:calico_rr`). By validating `node` beforehand, tasks now solely rely on user-provided input and safely avoid unintended targeting.
3. **Explicit User Confirmation**: Enhanced the confirmation prompt to clarify the scope of the operation. The admin is now required to explicitly confirm node state deletion, ensuring a deliberate decision before proceeding.
These improvements strengthen the reliability and safety of the `remove-node.yml` playbook by eliminating ambiguous behavior, preventing misconfigurations, and ensuring clear interaction during node removal tasks.
Vagrant jobs needs a big cache which makes them slow / sometimes stuck
completely. Using the kubevirt provisionning playbook is now
significantly faster, so do just that.
Having only one provisionner in CI will also allows us to remove some of
the custom runners executors we use for vagrant, and more generally
reduce the CI maintenance.
Our kubevirt CI platform does not support ivp6 yet, so we keep the
relevant jobs in vagrant, but we'll migrate them as well as soon as
possible.
- Take advantage of `parallel:matrix` to make the jobs definition shorter
and more readable.
- Remove helper scripts which are no longer needed
- Remove redundant indirection in the gitlab-ci pipelines definitions
(only one user)
This commit upgrades ingress-nginx to version v1.12.1, addressing multiple critical vulnerabilities including CVE-2025-1974, CVE-2025-1097, CVE-2025-1098, CVE-2025-24513, and CVE-2025-24514 as detailed in the ingress-nginx release notes: https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.12.1
Important Notes:
- Fixing CVE-2025-1974 required disabling validation of the generated NGINX configuration during validation of Ingress resources. Invalid Ingress resources may stop the NGINX configuration from being updated.
- Recommended mitigations include enabling annotation validation and disabling snippet annotations.
Alongside this upgrade, the `ingress_nginx_kube_webhook_certgen_image_tag` has been updated to v1.5.2 for compatibility, based on: https://github.com/kubernetes/ingress-nginx/pull/13066
Changelog:
- Updated ingress-nginx version to v1.12.1 in Kubespray.
- Updated `ingress_nginx_kube_webhook_certgen_image_tag` in `roles/kubespray-defaults/defaults/main/download.yml` to v1.5.2.
Fixes: https://github.com/kubernetes-sigs/kubespray/issues/12073
* Refactor control plane upgrades with reconfiguration support
Adds revised support for:
- The previously removed `--config` argument for `kubeadm upgrade apply`
- Changes to `ClusterConfiguration` as part of the `upgrade-cluster.yml` playbook lifecycle
- kubeadm-config `v1beta4` `UpgradeConfiguration` for the `kubeadm upgrade apply` command: [UpgradeConfiguration v1beta4](https://kubernetes.io/docs/reference/config-api/kubeadm-config.v1beta4/#kubeadm-k8s-io-v1beta4-UpgradeConfiguration).
* Add kubeadm upgrade node support
Per discussion:
- Use `kubeadm upgrade node` on secondary control plane upgrades
- Add support for UpgradeConfiguration.node in kubeadm-config.v1beta4
- Remove redundant `allowRCUpgrades` config
- Revert from `block` for first and secondary control plane back to unblocked tasks since they no longer share much code and it's more readable this way
* Add kubelet and kube-proxy reconfiguration to upgrades
* Fix task to use `kubeadm init phase etcd local`
* Rebase with changes from "Adapt checksums and versions to new hashes updater" PR
* Add `imagePullPolicy` and `imagePullSerial` to kubeadm-config v1beta4 `InitConfiguration.nodeRegistration`
* Ensure correct `AuthorizationConfiguration` API version during upgrades
Fixes an issue where the wrong AuthorizationConfiguration API version could be used by kube-apiserver prematurely during upgrades.
The `kubernets/control-plane` role writes configuration for the target version before control plane pods are upgraded.
However, since the `AuthorizationConfiguration` file is reconciled continuously, this leads to a race condition where a new configuration version can be reconciled before kube-apiserver is upgraded to the compatible version.
This solution ensures the correct configuration is available throughout the process by writing each api version to a different file path. Unused file versions are cleaned up post-upgrade for better hygiene.
* Avoid from_json in cleanup task
The versions which are by default derived from `kube_version` can break
the assert if kube_version start with `v`, because they use the start of
`kube_version` as dict key.
By putting them in their own assert, the first assert should trigger on
`kube_version`, with a more explicit error.
[WARNING][1] kube-controllers/runconfig.go 193: unable to list KubeControllersConfiguration(default) error=connection is unauthorized: kubecontrollersconfigurations.crd.projectcalico.org "default" is forbidden: User "system:serviceaccount:kube-system:calico-kube-controllers" cannot list resource "kubecontrollersconfigurations" in API group "crd.projectcalico.org" at the cluster scope
* Upcloud: Added support for routers and gateways
* Upcloud: Added ipsec properties for UpCloud gateway VPN
* Upcloud: Added support for deprecated network field for loadbalancers
There is litte reason to share an ssh key common to all CI jobs, so
generate one for each on the fly.
Also use plain-text cloud-init config instead of base64 for readability
To work with molecule, we need to use the name provided by molecule_yml
in inventory.
Inject the name in the VirtualMachineInstance (with a default to handle
non-molecule scenario) and get it back as part of inventory).
Account for no ansible groups
The current templating of kubevirt VirtualMachine relies on global
ansible variables, except for the group the nodes are meant to be in.
In order to have more flexibility (in particular, mixed OS cluster for
instances), expect now an abitrary dict to be passed to the template ;
this allows to embed directly in the nodes definition any variable used
by the template.
The script is obsoleted by 5d7236ea5 (Merge pull request #11890 from
VannTen/download_graphql_checksums_2, 2025-03-09), since the format of
checksums is no longer compatible.
Allow the use of different hashes, as support by the get_url
Ansible module.
Change the variable name accordingly to 'checksum' since it's not
exclusively sha256 anymore.
The versions are nearly all .0 because of the gvisor release scheme.
This means they need to be quoted in yaml to be considered strings.
Special casing by removing the .0 make tooling more complicated, and it
does not gain us anything apart from a nicer looking file (I guess).
So just use the version of upstream gvisor and quote it.
* CI: Put pre-commit cache under CI_PROJECT_DIR
Apparently gitlab-runner can't cache stuff outside of the project
directory.
Put the cache under CI_PROJECT_DIR to make it work (which also means we
need to ignore it from ansible-lint).
Also update the pre-commit image while we're at it.
Link: https://gitlab.com/gitlab-org/gitlab/-/issues/14151
* update ansible-lint pre-commit
This adds a new flag with default `kubeadm_config_validate_enabled: true` to use when debugging features and enhancements affected by the `kubeadm config validate command`.
This new flag should be set to `false` only for development and testing scenarios where validation is expected to fail (pre-release Kubernetes versions, etc).
While working with development and test versions of Kubernetes and Kubespray, I found this option very useful.
* Automatically derive defaults versions from checksums
Currently, when updating checksums, we manually update the default
versions.
However, AFAICT, for all components where we have checksums, we're using
the newest version out of those checksums.
Codify this in the `_version` defaults variables definition to make the
process automatic and reduce manual steps (as well as the diff size
during reviews).
We assume the versions are sorted, with newest first. This should be
guaranteed by the pre-commit hooks.
* Validate checksums are ordered by versions, newest first
* Generalize render-readme-versions hook for other static files
The pre-commit hook introduced a142f40e2 (Update versions in README.md
with pre-commit, 2025-01-21) allow to update our README with new
versions.
It turns out other "static" files (== which don't interpret Ansible
variables) also use the default version (in that case, our Dockefiles,
but there might be others)
The Dockerfile breaks if the variable they use (`kube_version`) is a
Jinja template.
For helping with automatic version upgrade, generalize the hook to deal
with other static files, and make a template out of the Dockerfile.
* Dockerfile: template kube_version with pre-commit instead of runtime
* Validate all versions/checksums are strings in pre-commit
All the ansible/python tooling for version is for version strings. YAML
unhelpfully consider some stuff as number, so enforce this.
* Stringify checksums versions
* exclude .ansible in ansible-lint
* remote ctr i pull workdaround
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
---------
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
The build steps at the start of CI takes about 2 minutes; now that we
have greatly reduced the overall duration, this is not an insignificant
impact.
Add timestamps to the build process to see measure which steps of the
image build take the most time.
* Remove krew installation support
Krew is fundamentally to install kubectl plugins, which are eminently a
client side things.
It's also not difficult to install on a client machine.
* Remove krew cleanup
This has been deprecated for a long time, time to pull the plug.
We leave an assert for one release to have a straightforward failure if
some users were still using the variable.
Since 'none' can be, for instance, a manual calico deployment, don't
check whether there is enough ip for pods on a node, because the plugin
can use another mechanism than the podCIDR to allocate IPs.
When the etcd group is not specified we assume it's kube_control_plane.
In that case, etcd still can't be even, so instead of only checking the
etcd group we need to default to kube_control_plane
ansible-lint and yamllint are run as pre-commit hooks, which are
installed by pre-commit directly. So there is no need to put them in
tests/requirements.txt.
So remove them and make it leaner.
Upstream calico isn't doing that, and:
- this can cause throttling
- the cpu needed by calico is very cluster / workload dependent
- missing cpu limits will not starve other pods (unlike missing memory
requests), because the kernel scheduler will still gives priority to
other process in pods not exceeding their requests
Currently, versions in README.md need to be manually updated, and we
check it's done with a bash script.
Add a small utility playbook to add versions in README.md from their
actual default values, automatically.
This is done in pre-commit, and replace the scripted check ; instead it
will autofix the README.md, and fails in CI if needed.
We switch markdownlint behind the local hooks to gave it the opportunity
to catch a problem with the rendering.
Since e8ee42280 (CI: remove deletion tasks of 'packet' VMs, 2024-09-13),
our tests appears to not be flakey anymore.
The current retry slow down the testing feedback on pull request.
Since it's not needed anymore, don't retry and fail fast.
This is handy when some component releases is buggy (missing file at the
download links) to not block everything else.
Move the filtering up the stack so we don't have to do it multiples
times.
Gvisor releases, besides only being tags, have some particularities:
- they are of the form yyyymmdd.p -> this get interpreted as a yaml
float, so we need to explicitely convert to string to make it work.
- there is no semver-like attached to the version numbers, but the API
(= OCI container runtime interface) is expected to be stable (see
linked discussion)
- some older tags don't have hashs for some archs
Link: https://groups.google.com/g/gvisor-users/c/SxMeHt0Yb6Y/m/Xtv7seULCAAJ
Gvisor is the only one of our deployed components which use tags instead
of proper releases. So the tags scraping support will, for now, cater to
gvisor particularities, notably in the tag name format and the fact that
some older releases don't have the same URL scheme.
Containerd use the same repository for releases of it's gRPC API (which
we are not interested in).
Conveniently, those releases have tags which are not valid version
number (being prefixed with 'api/').
This could also be potentially useful for similar cases.
The risk of missing releases because of this are low, since it would
require that a project issue a new release with an invalid format, then
switch back to the previous format (or we miss the fact it's not
updating for a long period of time).
The Github graphQL API needs IDs for querying a variable array of
repository.
Use a dict for components instead of an array of url and record the
corresponding node ID for each component (there are duplicates because
some binaries are provided by the same project/repository).
Adds the ability to configure the Kubernetes API server with a structured authorization configuration file.
Structured AuthorizationConfiguration is a new feature in Kubernetes v1.29+ (GA in v1.32) that configures the API server's authorization modes with a structured configuration file.
AuthorizationConfiguration files offer features not available with the `--authorization-mode` flag, although Kubespray supports both methods and authorization-mode remains the default for now.
Note: Because the `--authorization-config` and `--authorization-mode` flags are mutually exclusive, the `authorization_modes` ansible variable is ignored when `kube_apiserver_use_authorization_config_file` is set to true. The two features cannot be used at the same time.
Docs: https://kubernetes.io/docs/reference/access-authn-authz/authorization/#configuring-the-api-server-using-an-authorization-config-file
Blog + Examples: https://kubernetes.io/blog/2024/04/26/multi-webhook-and-modular-authorization-made-much-easier/
KEP: https://github.com/kubernetes/enhancements/tree/master/keps/sig-auth/3221-structured-authorization-configuration
I tested this all the way back to k8s v1.29 when AuthorizationConfiguration was first introduced as an alpha feature, although v1.29 required some additional workarounds with `kubeadm_patches`, which I included in example comments.
I also included some example comments with CEL expressions that allowed me to configure webhook authorizers without hitting kubeadm 1.29+ issues that block cluster creation and upgrades such as this one: https://github.com/kubernetes/cloud-provider-openstack/issues/2575.
My workaround configures the webhook to ignore requests from kubeadm and system components, which prevents fatal errors from webhooks that are not available yet, and should be authorized by Node or RBAC anyway.
This avoids spurious failure with 'localhost'.
It should also be more correct the inventory contains uncached hosts
which are not in `k8s_cluster` and therefore should not be Kubespray
business.
(We still use hostvars for uncached hosts, because it's easier to select
on 'ansible_default_ipv4' that way and does not change the end result)
We use a lot of facts where variables are enough, and format too early,
which prevent reusing the variables in different contexts.
- Moves set_fact variables to the vars directory, remove unnecessary
intermediate variables, and render them at usage sites to only do logic
on native Ansible/Jinja lists.
- Use defaults/ rather than default filters for several variables.
There is no test with IDEMPOT_CHECK=true since commit 7b78e6872 (disable
idempotency tests (#1872), 2017-10-26)
Remove the related infra from our CI scripts.
This display in a readable (by humans) way the result of most tasks, and
should be way more readable that what we have now, which is frequently a
bunch of unreadable json.
+ some small fixes (using delegated_to instead of when
<control_plane_host>)
- special casing should be in Kubespray, not in the test. It makes no
sense to do something in tests which won't be done in actual usage.
- We don't actually test CoreOS at all in the CI.
This is a followup to 2ba28a338 (Revert "Wait for available API token in
a new namespace (#7045)", 2024-10-25).
While checking for the serviceaccount token is not effective, there is
still a race when creating a Pod directly, because the ServiceAccount
itself might not be created yet.
More details at https://github.com/kubernetes/kubernetes/issues/66689.
This cause very frequent flakes in our CI with spurious failures.
Use a Deployment instead ; it will takes cares of creating the Pods and
retrying ; it also let us use kubectl rollout status instead of manually
checking for the pods.
To reproduce this commit run in bash:
for file in $(ls tests/files/)
do
if ! grep -Rq ${file%.*} .gitlab.ci; then
rm tests/files/${file}
fi
done
This also means that our CI matrix was not accurate.
This is needed for shutdown ordering: while at startup, it's not a
problem that containerd start before dbus (the dbus socket already
exists) it needs to shutdown before dbus to do its cleanup (asking
systemd via dbus to cleanup cgroups).
This is expected to be used in the command module this way:
command:
cmd: "{{ kubectl_apply_stdin }}"
stdin: <... rendered manifests > -> using the 'template' lookup plugin
in most cases.
The advantages over the kube plugin module integrated in kubespray
(which this should replace eventually):
- way easier to modify to take advantage of new features (server-side
apply for instance)
- no need for a separate template tasks + checking the result (which can
introduce problem if the first playbook runs encounters an error).
Those files haven't been touched in roughly 5 years, and pip install on
Kubespray errors out.
The 'Requires:' are outdated, which suggests that no one is using this.
8ff4ad2d8 (preinstall: simplify OS packages selection, 2024-11-04)
removed all usages of ansible.utils.validate (not that many), so the
dependencies is no longer necessary.
config_path was introduced in containerd 1.5.0, and registry.mirrors is
deprecated.
There is no reason to keep the old alternative, so just always use
config_path, and consequently remove the option.
Our README is currently pretty cluttered:
- Part of the README duplicates docs/getting_started/getting-started.md
-> Remove duplicates and extract useful info into the getting-started.md
- General info on Ansible environment troubleshooting
-> remove most of it as it's not specific to Kubespray, move to
docs/ansible/ansible.md
-> split inventory-related stuff of ansible.md into it's own file. This
should host documentation on how to manages Kubespray inventories in the
future.
ansible.md:
- remove the list of "Unused" variables, as:
1. It's not accurate
2. What matters is where users should put their variables
contrib/dind use inventory_builder, which is removed. It overlaps with
the function of kind (Kubernetes in Docker) and has not see change
(apart from linting driven ones) for a long time.
It also does not seem to work (provisioning playbook crash).
This only really help with the easiest part of building your inventory
(listing the hosts) as you still need to edit your groups vars and
similar.
The opaqueness of the script does not really help our users to
understand their own inventory.
Furthermore, there is not really a reason that something which is common
to all the Ansible ecosystem should be done in a special way for
Kubespray.
* Add vars for configuring cilium IP load balancer pools and bgp peer policies
* Cilium 1.16+ Support - Add vars for configuring cilium bgpv2 api & handle cilium_kube_proxy_replacement unsupported values
When using
dns_upstream_forward_extra_opts:
prefer_udp: "" # the option as no value so use empty string to just
# put the key
This is rendered in the dns configmap as ($ for end-of-line)
...
prefer_udp $
...
Note the trailing space.
This triggers https://github.com/kubernetes/kubernetes/issues/36222,
which makes the configmap hardly readable when editing them manually or
simply putting them in a yaml file for inspection.
Trim the concatenation of option + value to get rid of any trailing
space.
* Make Helm's 'atomic' parameter configurable from role variables
* Configure Helm with 'atomic' and 'wait' set to false for generic CNI to prevent kubelet-csr-approver installation failures
We use shell scripts and conf files in some roles (notably, certificates
provisioning), so we need to include them in order for the collection to
work when using the configurations depending on those roles.
* kubeadm: do not ignore preflight errors blindly
The "ignoring all errors" seems to date back to the inception of the
kubeadm support (it was --skip-preflight-check before).
This can mask real errors and prevent users from seeing them.
Do not ignore any errors by default and make the set of ignored errors
configurable.
* download/kubeadm: remove redundant task
The mode is already set by the previous `copy` task.
* Validate kubeadm configs
This should help to fail early when we have invalid kubeadm configs (from
a kubespray bug or a misconfiguration).
* kubeadm-upgrade: remove unnecessary bool cast
* Convert kubeadm join discovery timeout to v1beta4 config
* CI: Ignore kubeadm:Mem errors on some setup.
The new CI does not define k8s_cluster group, so it relies on
kubernetes-sigs/kubespray#11559.
This does not work for upgrade testing (which use the previous release).
We can revert this commit after 2.27.0
We should not rollback our test setup during upgrade test.
The only reason to do that would be for incompatible changes in the test
inventory, and we already checkout master for those (${CI_JOB_NAME}.yml)
Also do some cleanup by removing unnecessary intermediary variables
VirtualMachineInstance resources sometimes temporarily loose their
IP (at least as far as the kubevirt controllers can see).
See https://github.com/kubevirt/kubevirt/issues/12698 for the upstream
bug.
This does not seems to affect actual connection (if it did, our current
CI would not work).
However, our CI execute multiple playbooks, and in particular:
1. The provisioning playbook (which checks that the IPs have been
provisioned by querying the K8S API)
2. Kubespray itself
If any of the VirtualMachineInstance looses its IP between after 1
checked for it, and before 2 starts, the dynamic inventory (which is
invoked when the playbook is launched by ansible-playbook) will not have
an ip for that host, and will try to use the name for ssh, which of
course will not work.
Instead, when we have a valid state during provisioning (all IPs
presents), use it to construct a static inventory which will be used for
the rest of the CI run.
This allows a single source of truth for the virtual machines in a
kubevirt ci-run.
`etcd_member_name` should be correctly handled in kubespray-defaults for
testing the recover cases.
Not constraining the inventory to .ini allows us to use dynamic
inventory, which is needed for simplifying kubevirt jobs inventory.
Also reduces the scope of the ANSIBLE_INVENTORY variable.
VMI in Kubevirt are the abstraction below VirtualMachine.
- We don't really need the extra abstraction of VirtualMachine objects
- Convert the waiting for VMs ip address to use kubernetes.core.k8s_info
and no shell pipeline
We're still getting bug reports circumventing the bug report template
and omitting informations.
Blank issue (aka, not using the form templates) can still be created
using the gh cli, the API, etc. This only disable the possibility in the
Web UI.
- Lookup was not returning a list, making the difference filter spit out
garbage -> query always return a list
- hostvars is a dictionnary, so convert to list before selectattr and
map back to only get keys
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Remove kubeadm api version condition.
Currently there is not much difference between the files, if there are more changes in the future,
please use different files to distinguish them (you can use the kubeadm_config_api_version variable)
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
I added the kubeadm_config_api_version variable in the previous commit,
and remove kubeadm api version condition.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
v1beta4 has changed a lot in this file (e.g. ExtraArgs etc.), so it was implemented in separate files.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Since a2019c1c2 (Add a JSON schema describing the packages install
structure, 2024-04-25), we use a custom structure to select which
packages should be installed on a particular host OS.
This has proven too rigid in practice, and the query is pretty
complicated.
Replace this by simply using an array of jinja conditions for the
packages, which should be easier to understand for everyone and more
flexible.
Also remove the associated schema and validation which are no longer
needed.
* etcd: throttle restart for availability
During upgrade, etcd member are restarted all at once.
This can impact the availability of the etcd cluster and subsequently of
the Kubernetes cluster.
Limit the concurrent restart so that the etcd cluster can keep quorum.
* Simplify etcd handlers
* Feat: add external OCI cloud controller manager template & variable
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* Feat: add external OCI cloud controller manager workflow
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* Feat: migrate external OCI CCM config check from OCI cloud provider
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
* cloud_controller: oracle: simpler asserts
Make the asserts check for Oracle Cloud Infrastructure external cloud
controller more compact, and hence readable.
Allows to put them back in the main tasks for less back and forth when
reading the code.
---------
Signed-off-by: tico88612 <17496418+tico88612@users.noreply.github.com>
Co-authored-by: Max Gautier <mg@max.gautier.name>
This reverts commit 275c54e810.
Static tokens are no longer created automatically for service account in
Kubernetes. Instead, they are dynamically injected into pods using a
projected volume.
Thus there is no longer a need to check for this (it didn't work anyway,
since the describe output actually contains <none> when there is no
tokens:
{
"attempts": 1,
"changed": false,
"cmd": "set -o pipefail && /usr/local/bin/kubectl describe serviceaccounts default --namespace test | grep Tokens | awk '{print $2}'",
"delta": "0:00:00.075633",
"end": "2024-10-19 14:25:04.858871",
"msg": "",
"rc": 0,
"start": "2024-10-19 14:25:04.783238",
"stderr": "",
"stderr_lines": [],
"stdout": "<none>",
"stdout_lines": [
"<none>"
]
}
)
This leverage the Kubernetes GC to delete kubevirt VMs, by using
ownerReferences, with the CI pod running the playbook as the owner.
This concretely means that the control plane in our CI cluster will
delete the kubevirt VMs associated with a particular ci job as soon as
that pod job is deleted, which usually happens when the job terminates,
(barring errors, which will be addressed in the cluster directly)
Upgrade to kubevirt.io/v1 for the VirtualMachine manifests, since the
alpha version is deprecated.
Before adding these changes, `ansible_facts.services["containerd.service"]` will not defined and fail to check for triggering the container stop and delete behaviors.
Signed-off-by: ChengHao Yang <17496418+tico88612@users.noreply.github.com>
Simplify registry mirror rendering in config.toml.
The map filter can extract the host list from mirrors so we can
just unique them and render them without needing to construct vars
for it.
For the registry mirror tls section, we can first extract mirrors
from the dict then filter on only the ones having skip_veridy defined
first and then filter on the ones having true (as the dict might not
have skip_verify defined and that would cause errors of undefined var).
This will speed up and simply the templating.
Signed-off-by: Seena Fallah <seenafallah@gmail.com>
Dropping the ansible dependencies for ansible-lint will allow us to
catch missing dependencies collections in galaxy.yml. For collections
needed for contrib/ or tests/ (i.e: not part of core kubespray
dependencies), we can just configure ansible-lint to mock them.
This mean it won't check the mocked module parameters, but for those
area of the code base it's an acceptable trade-off.
The fallback_ips tasks are essentially serializing the gathering of one
fact on all the hosts, which can have dramatic performance implications
on large clusters (several minutes).
This is essentially a reversal of 35f248dff0
Being able to run without refreshing the cache facts is not worth it.
We keep fallback_ip for now, simply changing the access to a normal
hostvars variable instead of a custom dictionnary.
Using the hosts directive at the play level prevent those tasks from
being run when using --limit and the group in question is not part of
the limit (ex: running scale.yml on new worker nodes only)
Instead, run on all hosts, and for each group, partition between that
group and '_' (generic group name which is not used; using an empty
string as the group is not supported by ansible.builtin.group_by)
Reported-by: asteppat <asteppat@cisco.com>
* Update cluster-role for cilium to prevent errors in agent startup
ciliumloadbalancerippools permissions exists in the cilium helm chart for version 1.13.0
https://github.com/cilium/cilium/blob/v1.13.0/install/kubernetes/cilium/templates/cilium-agent/clusterrole.yaml#L71
The agent also needs permissions to read/watch secrets for bgp auth secrets when using CiliumBGPPeeringPolicy with a secret.
* Remove list/watch permissions for secrets
* Remove secrets from list/watch permissions
The old repository for these has been deleted, leaving the previous
configuration not possible to deploy, and even currently running clusters
fail after a restart as the DeameonSet has ImagePullPolicy: Always. More
details can be found here: kubernetes-sigs/vsphere-csi-driver#3053
As of writing, only CSI driver versions 3.1.2 to 3.3.1 is available in
this registry. This "officially" supports Kubernetes 1.26 to 1.30. Since
older drivers are not available, I have removed some feature-gating for
those unavailable versions while I was at it. For the cloud provider,
the `latest` image is now missing, and only 1.28.0 to 1.31.0 are
available. I've set the latest of these as the new default.
I also updated the documented default versions, as they were all out of
date and not aligned with actual code defaults.
Nodes to api-server relies by default certificates, and bootstrap
tokens, and there should be no need to generate tokens for every nodes,
even when enabling static token auth.
Testing for group membership with group names makes Kubespray more
tolerant towards the structure of the inventory.
Where 'inventory_hostname in groups["some_group"] would fail if
"some_group" is not defined, '"some_group" in group_names' would not.
- Use proper syntax highlighting for config.rb examples
- Consistent shell style ($ as prompt)
- Use only one way to do things
- Remove OS specific details
The current way to handle a custom inventory in vagrant is a bit
hackish, copy files around and can break Vagrantfile parsing in
cornercase scenarios (removing vagrant inventories, or the inventory
copied into vagrant inventory).
Instead, simply pass additional inventories to the ansible-playbook
command lines as raw arguments with `-i`.
This also makes supporting multiples inventories trivial, so we add a
new `$inventories` variable for that purpose.
Specifying one directory for kubeadm patches is not ideal:
1. It does not allow working with multiples inventories easily
2. No ansible templating of the patch
3. Ansible path searching can sometimes be confusing
Instead, provide the patch directly in a variable, and add some quality
of life to handle components targeting and patch ordering more
explicitly (`target` and `type` which are translated to the kubeadm
scheme which is based on the file name)
kubernetes/control-plane and kubernetes/kubeadm roles both push kubeadm
patches in the same way.
Extract that code and make it a dependency of both.
This is safe because it's only configuration for kubeadm, which only
takes effect when kubeadm is run.
* Update multus to v4.1.0 and clarify cilium compatibility
* Fix: bug introduced by #10934 where the template would break if multus was defined
* Set priorityClassName to system-node-critical for multus pods
Allow the script to be called with a list of components, to only
download new versions checksums for those.
By default, we get new versions checksums for all supported (by the
script) components.
runc upstream does not provide one hash file per assets in their
releases, but one file with all the hashes.
To handle this (and/or any arbitrary format from upstreams), add a
dictionary mapping the name of the download to a lambda function which
transform the file provided by upstream into a dictionary of hashes,
keyed by architecture.
The script is currently limited to one hardcoded URL for kubernetes
related binaries, and a fixed set of architectures.
The solution is three-fold:
1. Use an url template dictionary for each download -> this allow to easily
add support for new downloads.
2. Source the architectures to search from the existing data
3. Enumerate the existing versions in the data and start searching from
the last one until no newer version is found (newer in the version
order sense, irrespective of actual age)
Remove system|kube_master_<resource>_reserved variables.
Those variables are unnecessary because users can simply use the
variables in group_vars if they which to differentiate control plane
nodes from other nodes.
Set conservative defaults for ephemeral-storage and pids for both kube
and system reserved resources.
Working symlinks are dependant on git configuration (when using the playbook as
a git repository, which is common), precisely `git config
core.symlinks`.
While this is enabled by default, some company policies will disable it.
Instead, use import_tasks which should avoid that class of bugs.
* Simplify docker systemd unit
systemd handles missing unit by ignoring the dependency so we don't need
to template them.
* Remove RHEL 7/CentOS 7 support
- remove ref in kubespray roles
- move CI from centos 7 to 8
- remove docs related to centos7
* Remove container-storage-setup
Only used for RHEL 7 and CentOS 7
* Fix: fix testcases_run.sh for upgrade tests
Need to git checkout ${CI_COMMIT_SHA} before running upgrade playbook (revert #11173 partially)
* feat: add CI job to test upgrade
Add a packet_ubuntu22-calico-all-in-one-upgrade job
* make calico api server manifest backward compatible with version older than 3.27.3
Add 3.28.1 checksums
Add 3.28.0 checksums
Change default version to 3.27.3
* change default calico version to 3.28.1
* Set mount type to DirectoryOrCreate for hostPath needed by Calico
Fixes https://github.com/kubernetes-sigs/kubespray/issues/10947
This patch aims to be minimal and intentionally:
- does not change the generation logic for `supersede_domain` and `supersede_search`
- does not change how `nameserverentries` (for NetworkManager) is built
It seems like `nameserverentries` in the "Generate nameservers for resolvconf, including cluster DNS"
task is built the same way as `dhclient_supersede_nameserver_entries_list`.
However, `nameserverentries` in the "Generate nameservers for resolvconf, not including cluster DNS"
task (below) is built differently for some reason. It includes `configured_nameservers` as well.
Due to these differences, I have refrained from reusing the same building logic
(`dhclient_supersede_nameserver_entries_list`) for both.
If the `configured_nameservers` addition can be removed or made to apply
to dhclient as well, we could potentially build a single list and then
generate the `nameserverentries` and `supersede_nameserver` strings from it.
* CI: reduce VM resources requests to improve scheduling
* CI: Reduce default jobs; add labels(ci-full/extended) to run more test
* CI: use jobs dependencies instead of stages
* precommit one-job
* CI: Use Kubevirt VM to run Molecule and Vagrant jobs
The inventory file generated by Terraform produces the following warnings:
```
[WARNING]: * Failed to parse <PATH>/kubespray/contrib/terraform/hetzner/inventory.ini with ini plugin:
<PATH>/kubespray/contrib/terraform/hetzner/inventory.ini:21: Section [k8s_cluster:children] includes undefined group: kube-master
...
[WARNING]: Could not match supplied host pattern, ignoring: kube-master
PLAY [Add kube-master nodes to kube_control_plane] ********************************************************************************************************
skipping: no hosts matched
[WARNING]: Could not match supplied host pattern, ignoring: kube-node
PLAY [Add kube-node nodes to kube_node] *******************************************************************************************************************
skipping: no hosts matched
```
Use a style file as recommended by upstream. This makes for only one
source of truth.
Conserve previous upstream default for MD007 (upstream default changed
here https://github.com/markdownlint/markdownlint/pull/373)
Use gitlab dynamic child pipelines feature to have one source of truth
for the pre-commit jobs, the pre-commit config file.
Use one cache per pre-commit. This should reduce the "fetching cache"
time steps in gitlab-ci, since each job will have a separate cache with
only its hook installed.
We have inconsistent sets of options passed to the playbooks during our
CI runs.
Don't run ansible-playbook directly, instead factorize the execution in
a bash function using all the common flags.
Also remove various ENABLE_* variables and instead directly test for the
relevant conditions at execution time, as this makes it more obvious and
does not force one to go back and forth in the script.
With CentOS, kubespray currently produces the following warning:
[WARNING]: TASK: bootstrap-os : Enable Oracle Linux repo: The loop variable
'item' is already in use. You should set the `loop_var` value in the
`loop_control` option for the task to something else to avoid variable
collisions and unexpected behavior.
This could bites us in nasty ways, so fix it.
c58497cde (Refactor bootstrap-os (#10983), 2024-03-27) refactored the
boostrap-os include but didn't adapt the amazon linux tasks to the
actual ID of amazon linux ('amzn')
Re-enable the CI so we can avoid that kind of breakage.
if node.projectcalico.org already existe patch node to set asNumber
instead of apply resource to prevent remove of existing fields feed by
calico-node pods
✅ Closes: 11096
When using the kube_node_instancers_with_disks* variables, there were
no configuration block using those to provision disks with the
VirtualBox provider.
This commit fixes it.
Some packages requirements depends on inventory variables
(`kube_proxy_mode` in that case but it could apply to others).
As the case seems pretty rare, instead of adding complexity to pkgs, we
add an escape hatch to use jinja conditions.
That should be revisited if we find ourselves shoehorning lots of logic
in this later on.
Uses the logic introduced in the previous patch to convert all
kubernetes/preinstall/vars/* os specific files to the `pkgs`
dictionary.
Some niceties for devs:
- always validate the `pkgs` variable to catch mistakes in CI.
- ensure that `pkgs` is always sorted. This makes it easier to find the
packages you're looking for.
Adds infrastructure to install OS packages depending not only on OS
(family, versions, etc) but on groups.
All the informations related to a particular package should reside in
the `pkgs` dictionnary, which takes inspiration from the `downloads`
dictionary structure.
Since the structure we're setting in place for installing packages has
some complexity, add a JSON schema to avoid frustrating errors when
modifying the informations (adding/removing packages install).
This reverts commit 4b0a134bc9.
The mentionned PR break scale.yml. This goes back to the status quo
until a proper fix can be provided, at which point we'll reapply the
PR.
Upgrade Snapshot controller installed for all supported Kubernetes
versions to v7.0.2. Also update the manifests used to deploy the
Snapshot controller.
* feat: add user facing variable with default
* feat: remove rolebinding to anonymous users after init and upgrade
* feat: use file discovery for secondary control plane nodes
* feat: use file discovery for nodes
* fix: do not fail if rolebinding does not exist
* docs: add warning about kube_api_anonymous_auth
* style: improve readability of delegate_to parameter
* refactor: rename discovery kubeconfig file
* test: enable new variable in hardening and upgrade test cases
* docs: add option to config parameters
* test: multiple instances and upgrade
* Move fedora ansible python install to bootstrap-os
* /bin/dir is set in bootstrap-os
* Removing ansible_os_family workarounds
Support for these distributions was merged in Ansible, no need to
override it ourselves now.
https://github.com/ansible/ansible/pull/69324 openEuler
https://github.com/ansible/ansible/pull/77275/ UnionTech OS Server 20
https://github.com/ansible/ansible/pull/78232/ Kylin
* Don't unconditionnaly set VARIANT_ID=coreos in os-release
WTF, this is so wrong.
Furthermore, is_fedora_coreos is already handled in boostrap-os
* Handle Clearlinux generically
Followup of 4eec302e86 (since we're using
package module anyway, let's get rid of the custom task)
* Remove leftover files for Coreos
Coreos was replaced by flatcar in 058438a25 but the file was copied
instead of moved.
* Remove workarounds for resolved ansible issues
* boostrap: Use first_found to include per distro
Using directly ID and VARIANT_ID with first_found allow for less manual
includes.
Distro "families" are simply handled by symlinks.
* boostrap: don't set ansible_python_interpreter
- Allows users to override the chosen python_interpreter with group_vars
easily (group_vars have lesser precedence than facts)
- Allows us to use vars at the task scope to use a virtual env
Ansible python discovery has improved, so those workarounds should not
be necessary anymore.
Special workaround for Flatcar, due to upstream ansible not willing to
support it.
* upgrade ansible version
Needed for with_first_found to work correctly:
https://github.com/ansible/ansible/issues/70772 fixed in 2.16
* Remove unused google cloud cloud_playbook
* Fix dpkg_selection on non-existing packages
Needed since ansible-core>2.16, see:
f10d11bcdc
* scripts: ignore download_hash download failures
Binary names on github releases often change and this script might break
because of that, this commit allow to ignore these failures as a mean to
be able to run the script anyway.
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* scripts: use sha256sums for crio as well
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* scripts: add ppc64le support for crio
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
The new version brings the following improvements:
- remove having to resort to python python to limit tags (it it slower than
the sh equivalent as python has a somewhat significant startup time).
- Introduce a concept of min version so that it can only get Kubernetes
version supported by Kubespray.
- Fix an issue with kata changing their file scheme (the arch
specifically)
- Now download sha256/sha256sum files if provided rather than
downloading the full file and computing the hash
- A few minor style tweaks
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr.fr>
Fixes bug for retrieving images with tags containing image digests.
Script now gets images from jobs and cronjobs as well.
New env variable DESTINATION_REGISTRY to push to another registry
instead of local registry.
New env variable IMAGES_FROM_FILE to pull images listed in a file
instead of getting images from a running k8s environment.
New env variable REGISTRY_PORT to override port (default is 5000).
Under the original code, leader election failed for ingress controllers
as a result of mismatch between election-id in the controller config,
and the resourceName in the relevant rule of role 'ingress-nginx'.
This appeared in the controller logs.
To fix the issue, a command-line option was added to container
execution (--election-id=...).
Now, the election-id agrees with the resourceName provided in
the role-ingress-nginx.yml file. A comment in that file was
changed to reflect the new logic.
Co-authored-by: Vasilis Samoladas <vsam@softnet.tuc.gr>
Co-authored-by: Mohamed Omar Zaian <mohamedzaian@gmail.com>
This should avoid permissions problems when the user creating the
directory and the user creating the content are different (when
containers images are saved by root for instances, because the user
can't use the container runtime).
* Refactor of kubeadm images listing
Instead of setting multiples facts, we directly create the dict we need from
kubeadm output.
* Remove useless 'default' filters in roles/download
* Only download kubeadm images where needed
The current state waiting method is bad to implement.
When changing the deployment version, which is execute with the upgrade_cluster in the previous ansible task: "Kubernetes Apps | Install and configure MetalLB", next ansible task: "Kubernetes Apps | Wait for MetalLB controller to be running" may fall with an error.
* [kubernetes] Make kubernetes 1.29.1 default
* [cri-o]: support cri-o 1.29
Use "crio status" instead of "crio-status" for cri-o >=1.29.0
* Remove GAed feature gates SecCompDefault
The SecCompDefault feature gate was removed since k8s 1.29
https://github.com/kubernetes/kubernetes/pull/121246
* Update upgrades.md
modify env serial to have real rolling upgrades
* Update upgrades.md
change section for serial
* Update docs/upgrades.md
Co-authored-by: Kundan Kumar <kundan.kumar@india.nec.com>
---------
Co-authored-by: Kundan Kumar <kundan.kumar@india.nec.com>
Handle all old dns domains:
- for nodelocaldns: in the same server block as the current dns_domain
- for coredns: uffix rewrite of each of the old dns domains to the
current one
The old version of the script downloaded all binaries and generated file checksums locally.
This was a slow process since all binaries of all architectures needed to be downloaded.
The new version simply downloads the .sha256 files containing the binary checksum in text
form which saves a lot of traffic and time.
* Enable configuring mountOptions, reclaimPolicy and volumeBindingMode for cinder-csi StorageClasses
* Check if class.mount_options is defined at all, before generating the option list
In the case were vagrant is not invoked directly from the repository,
but from another location, and the Vagrantfile is "included" into
another, we need to be able to specify where the location of the vagrant
directory is, as of now it's hardcoded relative to the Vagrantfile
location. This commit fix it.
* ignore_unreachable for etcd dir cleanup
ignore_errors ignores errors occur within "file" module. However, when
the target node is offline, the playbook will still fail at this task
with node "unreachable" state. Setting "ignore_unreachable: true" allows
the playbook to bypass offline nodes and move on to proceed recovery
tasks on remaining online nodes.
* Re-arrange control plane recovery runbook steps
* Remove suggestion to manually update IP addresses
The suggestion was added in 48a182844c 4
years ago. But a new task added 2 years ago, in
ee0f1e9d58, automatically update API
server arg with updated etcd node ip addresses. This suggestion is no
longer needed.
* ci: redefine multinode to node-etcd-client
This should allow to catch several class of problem rather than just
one -> from network plugin such as calico or cilium talking directly to
the etcd.
* Dynamically define etcd host range
This has two benefits:
- We don't play the etcd role twice for no reason
- We have access to the whole cluster (if needed) to use things like
group_by.
* Convert the bug-report template to issue form
* Convert the enchancement issue template to form
* Convert "Failing Test" template to issue form
* github: Remove support request template, direct to slack instead
* Remove checks for docs using exact tags
Instead use a more generic documentation for installing kubespray as a
collection from git.
* Check that we upgraded galaxy.yml to next version
This is only intented to check for human error. The version in galaxy
should be the next (which does not mean the same if we're on master or a
release branch).
* Set collection version to KUBESPRAY_NEXT_VERSION
* update cilium configmap template for new routing mode and tunnel-protocol options
Ryan Lonergan ryan.tlonergan@gmail.com
* add rbac for new cilium crd in 1.14
Ryan Lonergan ryan.tlonergan@gmail.com
* add conditional for cni-install.sh that's no longer included in cilium 1.14
Ryan Lonergan ryan.tlonergan@gmail.com
* Update roles/network_plugin/cilium/templates/cilium/ds.yml.j2
Co-authored-by: Cyclinder <qifeng.guo@daocloud.io>
---------
Co-authored-by: Cyclinder <qifeng.guo@daocloud.io>
* Disable control plane allocating podCIDR for nodes when using calico
Calico does not use the .spec.podCIDR field for its IP address
management.
Furthermore, it can false positives from the kube controller manager if
kube_network_node_prefix and calico_pool_blocksize are unaligned, which
is the case with the default shipped by kubespray.
If the subnets obtained from using kube_network_node_prefix are bigger,
this would result at some point in the control plane thinking it does
not have subnets left for a new node, while calico will work without
problems.
Explicitely set a default value of false for calico_ipam_host_local to
facilitate its use in templates.
* Don't default to kube_network_node_prefix for calico_pool_blocksize
They have different semantics: kube_network_node_prefix is intended to
be the size of the subnet for all pods on a node, while there can be
more than on calico block of the specified size (they are allocated on
demand).
Besides, this commit does not actually change anything, because the
current code is buggy: we don't ever default to
kube_network_node_prefix, since the variable is defined in the role
defaults.
We take advantage of group_by to create the list of nodes needing new
certs, instead of manually looping inside a Jinja template.
This should make the role more readable and less susceptible to
white space problems.
* Decouple role kubespray-defaults from download
Avoids doing re-importing the download role on every invocation of
kubespray-defaults (and skipping everything).
This has a measurable effect on playbook performance.
* Update docs refering to moved download defaults
* Mask systemd swap.target do disable swap
This is a more generic way to disable swap, since it pulls .swap units
in systemd distributions; fstab is only one way to generate .swap units.
* Unconditionally disable swap
We only care to disable it (the "swapon" registered variable is not used
anywhere else.
This allows to get rid of the ignore_errors, since this was added
because swapon.stdout does not exist in check_mode (see issue #6642).
* Don't explicitly disable swapOnZram
We're already masking the swap.target, which would pull the zram unit,
hence no need to handle zram-generator specifically.
Allow to fail early (pre-commit time) for jinja error, rather than
waiting until executing the playbook and the invalid template.
I could not find a simple jinja pre-commit hook in the wild.
* Clean up redondant defaulting
drain_{timeout,grace_period}_after_failure don't exist at this point, so
they always default.
* Remove useless facts
The drain_*_after_failure are never used
* Try both conntrack modules instead of checking kernel version
Depending on kernel distributor, the kernel version might not be a
correct indicator of the conntrack module use.
Instead, we check both (and use the first found).
* Use modproble.persistent rather than manual persistence
When installed as an ansible collection, roles in
ansible_play_role_names will be designated by their FQDN (i.e
'kubernetes-sigs.kubespray.<role-name>).
It means we need to check for both when checking for roles in the play.
* Validate systemd unit files
This ensure that we fail early if we have a bad systemd unit file
(syntax error, using a version not available in the local version, etc)
* Hack to check systemd version for service files validation
factory-reset.target was introduced in system 250, same version as the
aliasing feature we need for verifying systemd services with ansible.
So we only actually executes the validation if that target is present.
This is an horrible hack which should be reverted as soon as we drop
support for distributions with systemd<250.
* ansible: upgrade to version >= 2.15.5
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: update requirements
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* contrib/openstack: fix wrong gitignore pattern
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: add missing tzdata requirement
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* tests: remove some molecules tests
Those doesn't work in Ansible 2.15. Ansible can't load builtin now
apparently and these tests are not worth it.
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
Sets ignore_unreachable: true to `Gather ansible_default_ipv4 from all hosts`
task from fallback_ips.yml
Without this scale.yml will fail if a single node in the cluster is down, which
for large clusters happens often.
Remove cri-o apt repo job has state present but need absent
Uninstall CRI-O packages job has undefined variable crio_packages
replaced by list of packages
* metallb --lb-class cmd arg to support multiple load balancer implementations
* removed loadbalancer_class from metallb_config; metallb_loadbalancer_class in role defaults
* Use RandomizedDelaySec to spread out control certificates renewal plane
If the number of control plane node is superior to 6, using (index * 10
minutes) will fail (03:60:00 is not a valid timestamp).
Compared to just fixing the jinja expression (to use a modulo for
example), this should avoid having two control planes certificates
update node being triggered at the same time.
* Make k8s-certs-renew.timer Persistent
If the control plane happens to be offline during the scheduled
certificates renewal (node failure or anything like that), we still want
the renewal to happen.
* containerd: refactor handlers to use 'listen'
* cri-dockerd: refactor handlers to use 'listen'
* cri-o: refactor handlers to use 'listen'
* docker: refactor handlers to use 'listen'
* etcd: refactor handlers to use 'listen'
* control-plane: refactor handlers to use 'listen'
* kubeadm: refactor handlers to use 'listen'
* node: refactor handlers to use 'listen'
* preinstall: refactor handlers to use 'listen'
* calico: refactor handlers to use 'listen'
* kube-router: refactor handlers to use 'listen'
* macvlan: refactor handlers to use 'listen'
It was not 'false', which made some tasks (e.g. using systemd-resolved
template) to effectively remove default search domains; caused DNS loop
after rebooting the node/restarting cluster, so localdns service didn't
run correctly.
This make native ansible features (dry-run, changed state) easier to
have, and should have a minimal performance impact, since it only runs
on the etcd members.
* Specify the runc path when we use the containerd container engine
and change the bin_dir path.
Signed-off-by: Jin Li <qlijin@gmail.com>
* Update roles/container-engine/containerd/templates/config.toml.j2
Co-authored-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
---------
Signed-off-by: Jin Li <qlijin@gmail.com>
Co-authored-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
The blockSize attribute from Calico IPPool resources cannot be changed
once set [1]. Consequently, we use the one currently defined when
configuring the existing IPPool, avoiding upgrade errors by trying to
change it.
In particular, this can be useful when calico_pool_blocksize default
changes in kubespray, which would otherwise force users to add an
explicit setting to their inventories.
[1]: https://docs.tigera.io/calico/latest/reference/resources/ippool#spec
* modify variables.tf to accept AMI attributes via variables
* update README to guide users on utilizing variable-driven AMI configuration
* fix markdown lint error
This allows this task to work with a forks count > 10 and the default
configuration of sshd, which is to limit sessions to 10. (see
MaxSessions in sshd_config).
Since this is a delegate_to task, it connects to the same host (first
etcd) for each node in the cluster, thus easily going above 10.
Raising the ssh connection attempts allow for more robustness, without
decreasing the forks count or serialising the tasks, which could slow
the task (or the playbook as a whole, if decreasing forks).
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
* Migrate node-role.kubernetes.io/master to node-role.kubernetes.io/control-plane
Refactor NRI (Node Resource Interface) activation in CRI-O and
containerd. Introduce a shared variable, nri_enabled, to streamline
the process. Currently, enabling NRI requires a separate update of
defaults for each container runtime independently, without any
verification of NRI support for the specific version of containerd
or CRI-O in use.
With this commit, the previous approach is replaced. Now, a single
variable, nri_enabled, handles this functionality. Also, this commit
separates the responsibility of verifying NRI supported versions of
containerd and CRI-O from cluster administrators, and leaves it to
Ansible.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* [containerd] Add Configuration option for Node Resource Interface
Node Resource Interface (NRI) is a common is a common framework for
plugging domain or vendor-specific custom logic into container
runtime like containerd. With this commit, we introduce the
containerd_disable_nri configuration flag, providing cluster
administrators the flexibility to opt in or out (defaulted to 'out')
of this feature in containerd. In line with containerd's default
configuration, NRI is disabled by default in this containerd role
defaults.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* [cri-o] Add configuration option for Node Resource Interface
Node Resource Interface (NRI) is a common is a common framework for
plugging domain or vendor-specific custom logic into container
runtimes like containerd/crio. With this commit, we introduce the
crio_enable_nri configuration flag, providing cluster
administrators the flexibility to opt in or out (defaulted to 'out')
of this feature in cri-o runtime. In line with crio's default
configuration, NRI is disabled by default in this cri-o role
defaults.
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
---------
Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@intel.com>
* terraform-openstack: Updated extra partitions to use empty list by default
* terraform-openstack: Added possibility to enable dhcp flag critical on one interface
when people run playbook with option `--tags=kubelet`, the kubelet config may changed, because some variables used in task populating `kubelet-config.yml` could be different with running task(`Fetch facts`)
* Fix containerd_registries in config_path for mirrors and remove nerdctl global insecure_registry setting
* Make containerd hosts.toml mode 0640
* Add containerd_registries_mirrors and keep containerd_registries to pass packet_debian11-calico-upgrade
Set owner/group to root/root when unarchiving kata-containers binary to prevent kata-containers binaries/directories and especially / from getting chowned to 1001:123, the file owner specified in the kata-containers archive
* tests: replace fedora35 with fedora37
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: replace fedora36 with fedora38
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* docs: update fedora version in docs
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* molecule: upgrade fedora version
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: upgrade fedora images for vagrant and kubevirt
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* vagrant: workaround to fix private network ip address in fedora
Fedora stop supporting syconfig network script so we added a workaround
here
https://github.com/hashicorp/vagrant/issues/12762#issuecomment-1535957837
to fix it.
* netowrkmanager: do not configure dns if using systemd-resolved
We should not configure dns if we point to systemd-resolved.
Systemd-resolved is using NetworkManager to infer the upstream DNS
server so if we set NetworkManager to 127.0.0.53 it will prevent
systemd-resolved to get the correct network DNS server.
Thus if we are in this case we just don't set this setting.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* image-builder: update centos7 image
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* gitlab-ci: mark fedora packet jobs as allow failure
Fedora networking is still broken on Packet, let's mark it as allow
failure for now.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Update reset.yml
reset confirmation user input fix
* Update reset.yml
added default for non-interactive run in ci/cd
* fix reset_confirmation in reset.yml
* skip reset_confirmation promtp when reset_confirmation is defined via extra-vars option (for tests)
* check both string type and object type with user_input for reset_confirmation var
* reset_confirmation_prompt in conjunction with reset_confirmation
improvement inspired by:
https://github.com/kubernetes-sigs/kubespray/pull/10288#issuecomment-1637056880
* project: update all dependencies including ansible
Upgrade to ansible 7.x and ansible-core 2.14.x. There seems to be issue
with ansible 8/ansible-core 2.15 so we remain on those versions for now.
It's quite a big bump already anyway.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: install aws galaxy collection
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* ansible-lint: disable various rules after ansible upgrade
Temporarily disable a bunch of linting action following ansible upgrade.
Those should be taken care of separately.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve deprecated-module ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve no-free-form ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[meta] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[playbook] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve schema[tasks] ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve risky-file-permissions ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve risky-shell-pipe ansible-lint error
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: remove deprecated warn args
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: use fqcn for non builtin tasks
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: resolve syntax-check[missing-file] for contrib playbook
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* project: use arithmetic inside jinja to fix ansible 6 upgrade
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: cleanup stale packet namespace automatically
Cancelled job on Gitlab can produce stale VMs as the delete playbook
will never be executed. This commits allow removing old vms by getting
all the namespace created from the same branch with an older pipeline
id.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: cleanup stale packet namespace after 2 hours
This ensure that we don't have any packet namespace remaining for more
than 2 hours. All the jobs complete usually within 30min-1hour so 2
hours is enough to detect a stale namespace.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: ignore vm cleanup failure
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* tests: use pipeline_id var instead of fetching namespace for cleanup packet vm
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
New line was not inserted between image and imagePullPolicy for some
reasons with the jinja. Simplifying this altogether should fix this.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Fix upgrade-path for kubelet-csr-approver
Fixes an error when you enable kubelet-csr-approver when upgrading.
It hangs waiting for the certificate to be approved since the
kubelet-csr-approver is not installed yet.
* Add missing package when using helm role
Molecule 5.0 require ansible-core 2.12.10.
So this commit we update ansible-core from 2.12.5 to 2.12.10.
We also drop supporting two ansible-core version. Also we now use the "oldest"
still supported ansible-core version as both 2.11 is EOL and not
supported by molecule.
tests/molecule: remove linting in molecule to support molecule 5
tests/molecule: remove role name check for molecule 5 support
Kubespray doesn't use ansible galaxy style naming so we have to disable
that check.
contrib/inventory_builder: fix tox.ini for tox4
tests/molecule: fix get_playbook in testinfra tests
tests: upgrade most tests requirements
Exclude ansible-lint for now, I will do that in a separate PR.
tests/molecule: force kvm driver option
If we don't do this it fallbacks to qemu emulated on our CI for some
reasons.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Fix `The task includes an option with an undefined variable` for 1.27
* delete old flag --container-runtime
Signed-off-by: Victor Login <batazor@evrone.com>
---------
Signed-off-by: Victor Login <batazor@evrone.com>
Add option to configure class as the default class
Add option to disable wathcing for ingresses without class
Remove redundant if that always evaluates to true
Fix default value missing for ingress_nginx_default
``
"msg": "Failed to template loop_control.label: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'item'. 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'item'", "skip_reason": "Conditional result was False"}
``
fixes case when multus should NOT be included.
Calling bootstrap in facts.yaml so that we can always collect facts even on
new nodes. This is useful when you want to add nodes to an inventory
beforehand and then collect facts and scale the cluster with the scale
playbook and --limits. With dynamic inventory sometimes it might be more
difficult to add the nodes after running the facts playbook in this
specific situation.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
This feature no longer works on Ansible 6 / ansible-core 2.13. We do not
support these version officially yet but this will help for the future
upgrade and may help some people running those inadvertently.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
- Test with new version: 37.20230322.3.0. Both containerd and
cri-o is tested
- bugfix: when we use crio and the var bin_dir is changed,
there will be some error about the new bin dir.
* remove-debian9-support
* Add six module into openstack-cleanup/requirements.txt (#10099)
To fix tf-elastx_cleanup job which was failed with the following error:
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/generic/password.py", line 16, in <module>
from keystoneauth1.identity import v3
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/__init__.py", line 27, in <module>
from keystoneauth1.identity.v3.oauth2_mtls_client_credential import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/oauth2_mtls_client_credential.py", line 17, in <module>
import six
ModuleNotFoundError: No module named 'six'
---------
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
According to the canal github[1] the repo is not maintained over 5 years.
In addition, the README says
```
Originally, we thought we might more deeply integrate the two projects
(possibly even going as far as a rebranding!). However, over time it
became clear that that wasn't really necessary to fulfil our goal of
making them work well together. Ultimately, we decided to focus on
adding features to both projects rather than doing work just to
combine them.
```
So it is difficult to support canal by Kubespray at this situation.
[1]: https://github.com/projectcalico/canal
To fix tf-elastx_cleanup job which was failed with the following error:
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/generic/password.py", line 16, in <module>
from keystoneauth1.identity import v3
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/__init__.py", line 27, in <module>
from keystoneauth1.identity.v3.oauth2_mtls_client_credential import * # noqa
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/keystoneauth1/identity/v3/oauth2_mtls_client_credential.py", line 17, in <module>
import six
ModuleNotFoundError: No module named 'six'
* Drop CI jobs related to canal
According to the canal github[1] the repo is not maintained over 5 years.
In addition, the README says
Originally, we thought we might more deeply integrate the two projects
(possibly even going as far as a rebranding!). However, over time it
became clear that that wasn't really necessary to fulfil our goal of
making them work well together. Ultimately, we decided to focus on
adding features to both projects rather than doing work just to
combine them.
So we don't need to run CI jobs related to the canal at this situation.
[1]: https://github.com/projectcalico/canal
* Update ci.md
* chore(helm-apps): fix README example
README shows a non-working example according to the specs for this role.
* Add support for kubelet-csr-approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Add tests for kubelet-csr-approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Add Documentation for Kubelet CSR Approver
Co-Authored-By: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Co-authored-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* Remove deprecated provider and fix flatcar configs
* Refactor for DRYness
* Add missing line endings
* Enable tests for hetzner terraform in CI
* Add missing inventory for CI tests
* Fix: vSphere Error: `Apply a CSI secret manifest`
This PR will fix an issue that you will see on 2nd deploy when deploying External vSphere
How to re-produce:
1. Set custom `vsphere_csi_namespace: "vmware-system-csi"`
2. Deploy as usual
3. Observe no errors
4. Deploy 2nd time without `reset`
5. Playbook fails with:
```
TASK [kubernetes-apps/csi_driver/vsphere : vSphere CSI Driver | Apply a CSI secret manifest]
fatal: [node-00]: FAILED! => changed=true
censored: 'the output has been hidden due to the fact that ''no_log: true'' was specified for this result'
```
* create namespace if does not exist
* lint fix
* try to fix lint errors
* fix `too few spaces before comment`
* change the order of applied manifests
* typo
* [cilium] fix rbac and upgrade hubble v0.11.0 (#3)
* [cilium] fix rbac for LB bgp ipam
* [cilium] Upgrade Hubble to v0.11.0 and add mTLS between Hubble UI and Hubble Relay
* fix dns domain hubble for tls
---------
Co-authored-by: Thuon Jeremy <d107869@olinfra1.infra.bdm.outscale.c1.dav.fr>
* Fix blank line
---------
Co-authored-by: Thuon Jeremy <d107869@olinfra1.infra.bdm.outscale.c1.dav.fr>
Cilium 1.13.1 changed how the cilium-cni binary gets placed in /opt/cni/bin,
so that it takes place in an init container rather than in the main agent.
This commit removes the variable `use_localhost_as_kubeapi_loadbalancer`
and rather detects that we are in a situation where we can use the
localhost apiserver loadbalancer (meaning that we use the localhost load
balancer and that the same ports are used for both the load balancer and
the kube-apiserver).
This also cleanups the calico code to use `kube_apiserver_global_endpoint`
rather than implementing the same logic all over again.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* node: fix default kubelet/runtime cgroups when kube_reserved is false (default)
Commit 1c4db6132d introduced a notion of
kube_reserved. This introduced a breaking change defaulting to use
kube.slice for the container_manager and the kubelet as if kube_reserved
was always enabled whereas it is disabled by default.
This commit fixes this by bringing back system.slice whenever
kube_reserved is disabled.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* inventory/sample: change false for kube_reserved as its the default
Changing the commented value in sample inventory to the actual default
value.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* network_plugin/custom_cni: add CNI to apply provided manifests
Add a new simple custom_cni to install provided Kubernetes manifests.
This could be useful to use manifests directly provided by a CNI when
there are not support by Kubespray (i.e.: helm chart or any other manifests
generation method).
Co-authored-by: James Landrein <james.landrein@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* network_plugin/custom_cni: add test with cilium
Co-authored-by: James Landrein <james.landrein@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
---------
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Co-authored-by: James Landrein <james.landrein@proton.ch>
crun_bin_dir was used to specify the destination of the crun binary during the
download process. This path must match with the value provided in the CRI-O
configuration file. So changing its value to bin_dir helps to mismatch errors.
Signed-off-by: Victor Morales <chipahuac@hotmail.com>
requirements-$ANSIBLE_VERSION.yml doesn't exist in Kubespray repo.
That was for supporting ansible 2.10-, and now Kubespray supports
2.11+. So this drops the part to avoid confusion.
Reducing the number of layers, increasing readability, reducing the size of the image (how much I can’t check, it’s impossible for me to build due to the unavailability of the vagrant repository)
* Fix yq install in argocd role: use download_file instead of get_url
* Fix use download_file instead of get_url to download argocd-install manifest in argocd role
* Fix order and add arm64 checksum
* Fix: Failed to template loop_control.label: 'None'
Since Kubespray v2.21.0, the commit a98ab40434 removes copying
contrib/ accidentaly. The contrib/ contains useful tools like offline
tools etc. This adds the contrib/ to Dockerfile again.
1.5.7 was released Aug 2, 2021 and 1.6.1 came out on Dec 13, 2022.
There's been a good amount of new features, improvements and fixes since
1.5.7 and the changelogs for each version are available in the docs:
https://ara.readthedocs.io/en/latest/changelog-release-notes.html
This fixes the CrashLoopBackoff error that appears because envoy
configuration has changed a lot and upstream removed the envoy proxy to
use nginx only instead. Those changes are based on upstream cilium helm.
It is quite confusing that there's an all-caps, bolded comment that seems to imply that `etcd_download_url` is relevant only when not using host-based deployment. The opposite is true: of course the artifact download URL is relevant and required for host-based etcd.
Perhaps the entire comment can be read in a different way, and should perhaps be reworded entirely, cf. 374438a3d6/docs/offline-environment.md (L38)
Removing the "**DON'T**" matches the way the other comments in this file are written and matches my personal interpretation.
This commit uses a kubeadm join config to pull down cert for etcd in
workers nodes (which is needed in some circumstances, for instance with
calico or cilium).
The previous way didn't allow us to pass certain parameters which was
typically given in the config in other kubeadm invokations in Kubespray.
This made kubeadm produced some errors for some edge cases.
For example, in our deployment we don't have a default route and even
though it's only to download the certificates, kubeadm produce an error
`unable to select an IP from default routes` (these command are kubeadm
controlplane command, so kubeadm does some additional checks). This is
fixed by specifying `advertiseAddress` within the kubeadm config.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Add a variable `calico_felix_floatingIPs` which permit to enable calico feature `floatingIPs`
(disabled per default).
Signed-off-by: MatthieuFin <matthieu2717@gmail.com>
#9679
In 6db6c8678c, this was disabled becaue
kubesrpay gave too much permissions that were not needed. This commit
re-enable back this option by default and also removes the extra
permissions that kubespray gave that were in fact not needed.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* optimize cgroups settings for node reserved
* fix
* set cgroup slice for multi container engine
* set cgroup slice for crio
* add reserved cgroups variables to sample files
* Compatible with cgroup path for different container managers
* add cgroups doc
* fix markdown
At the upstream calico development, the v3.21 branch is not updated
over 2 monthes. In addition, unnecessary error message is output at
Kubespray deployment due to different URLs for calico v3.21 or v3.22+
This drops the v3.21 support to solve the issue.
Without minimal cluster configuration, even on a one node control plane,
the health check of the ovn-cental container always fails as it queries the
cluster/status.
Crio registries configuration changed from crio_registries_mirrors to
crio_registries. The configuration in the test was however forgotten.
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@proton.ch>
* feat(): Add wireguard backend to flannel cni
As described in the flannel docs:
https://github.com/flannel-io/flannel/blob/master/Documentation/backends.md#wireguard
This does not support optional configuration methods like:
- setting a psk (will be autogenerated by default)
- chang listening ports
- change mode (defaults to 'separate')
- change PersistentKeepaliveInterval (defaults to 0)
* Add supported backends to flannel docs
* Fix markdown in docs
This fixes a task failure in the rescue block that uncordons nodes after an unsuccessful drain. The issue occurs when `kube_override_hostname` is set and does not match `inventory_hostname`.
The `containerd-common` role is responsible for gathering OS specific variables from the vars directory of the roles that include or import it. `containerd-common` is imported via role dependency by a total of two roles, `container-engine/docker`, and `container-engine/containerd`.
containerd-common is needed by both the docker and containerd roles as a dependency when:
- containerd is selected as the container engine
- a docker install is detected and needs to be removed
- apt is the package manager
However, by default, roles can not be invoked more than once in the same play, unless `allow_duplicates: true` is set for that role. This results in the failure of the `containerd | Remove containerd repository` task, since only the docker vars will be loaded in the play, and `containerd_repo_info.repos`, normally populated by containerd/vars, is left empty.
This change sets `allow_duplicates: true` for `containerd-common` which fixes the currently failing containerd tasks if docker was detected and removed in the same play.
During the reset, restart network was not completing in distros
like RHEL/CentOS/AlmaLinux with major version higher than 8.
Example:
kubespray> ansible-playbook -i inventory/mydomain/hosts.yml reset.yml -b -v
fatal: [mynode]: FAILED! => {"changed": false, "msg": "Could not find the requested service network: host"}
Signed-off-by: Douglas Schilling Landgraf <dlandgra@redhat.com>
Signed-off-by: Douglas Schilling Landgraf <dlandgra@redhat.com>
There is a wrong directory path to all.yml and vsphere.yml. The wrong directory is `inventory/sample/group_vars/all.yml` and `inventory/sample/group_vars/all/vsphere.yml` which should be `inventory/sample/group_vars/all/all.yml` and `inventory/sample/group_vars/all/vsphere.yml`.
When operating kubespray from kubespray image with docker run,
we need to checkout the specific kubespray version as the same as
the image, because the sample inventory contains kubernetes version
and the version of master branch could not be supported on the released
kubespray, for example.
The certified-conformance mode took 2+ hours and that was too long
by comparing Quick mode which was specified previously.
So this updates the mode to Quick again.
The latest version of sonobuoy is v0.56.11.
This updates the version to the latest.
As the file name, this makes it use certified-conformance mode
clearly for the latest version of sonobuoy.
by setting a default runtime spec with a patch for RLIMIT_NOFILE.
- Introduces containerd_base_runtime_spec_rlimit_nofile.
- Generates base_runtime_spec on-the-fly, to use the containerd version
of the node.
- Update and re-work the documentation:
- Update links
- Fix formatting (especially for lists)
- Remove documentation about `useAlphaApi`,
a flag only for k8s versions < v1.10
- Attempt to clarify the doc
- Update to version 1.5.0
- Remove PodSecurityPolicy (deprecated in k8s v1.21+)
- Update ClusterRole following upstream
(cf https://github.com/kubernetes-sigs/sig-storage-local-static-provisioner/pull/292)
- Add nodeSelector to DaemonSet (following upstream)
We made all vagrant jobs non-voting because those jobs were not stable.
However the setting allowed a pull request which broke vagrant jobs
completely merged into the master branch.
To avoid such situation, this makes one of vagrant jobs voting.
Let's see the stability of the job.
The install-cni-plugin image was not updated to the corresponding
arch when building the different DS.
Fixes issue #9460
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Signed-off-by: Fred Rolland <frolland@nvidia.com>
`test` is is a internal Python package (see [doc]), and as such should not be
used here. It make tests fail in some environments.
[doc]: https://docs.python.org/3/library/test.html
* Fix inconsistent handling of admission plugin list
* Adjust hardening doc with the normalized admission plugin list
* Add pre-check for admission plugins format change
* Ignore checking admission plugins value when variable is not defined
On hardening environments, cert-manager pods could not be created
from the corresponding deployments. This adds the securityContext
to solve the issue.
To verify the hardening method works always.
The configuration comes from docs/hardening.md
Fix yaml format of hardening.yml
Add condition to skip 040 test for hardening
busybox container requires a root permission for ping.
For testing hardening method at CI, we need to switch to another image
which doesn't require the root permission for network testing.
On kubernetes/kubernetes repo, we are using agnhost which doesn't
require it. So this makes the test use aghhost image.
In addition, this updates the test manifest to specify securityContext
without any privilege.
* Avoid MetalLB speaker image download when metallb_speaker_enabled is set to
* Move metallb_speaker_enabled var to allow outside metalLB role references
* Move metallb_speaker_enabled var to allow outside metalLB role references
* Improve metallb_speaker_enabled default values
When trying to add a hardening CI job by copying configuration from
hardening.md, yamllint CI job deleted invalid format.
This fixes it for maintaining the CI job.
* Fix: install policy controller on kdd too
* Remove the calico_policy_version condition altogether
* Install policy controller both on canal and calico under same condition
* Support kubeadm patches in v1beta3
* Update kubeadm patches sample files in inventory
* Fix pre-commit syntax
* Set kubeadm_patches enabled to false in sample inventory
* Add optional NAT support in calico router mode
* Add a blank line in front of lists
* Remove mutual exclusivity: NAT and router mode
* Ignore router mode from NAT
* Update calico doc
Since the commit fad296616c cri_dockerd_enabled
has not been used. But the packet_ubuntu22-aio-docker.yml still contains
the configuration and causes confusions.
This removes the configuration for cleanup.
It seems that PR #8839 broke `calico_datastore: etcd` when it removed ipamconfig support for etcd mode.
This PR fixes some failing tasks when `calico_datastore == etcd`, but it does not restore ipamconfig support for calico in etcd mode. If someone wants to restore ipamconfig support for `calico_datastore: etcd` please submit a follow up PR for that.
For the following configuration
```
containerd_insecure_registries:
docker.io:
- dockerhubcache.example.com
```
the rendered /etc/containerd/config.toml contains
```
[plugins."io.containerd.grpc.v1.cri".registry.configs."docker.io".tls]
insecure_skip_verify = true
```
but it needs to be
```
[plugins."io.containerd.grpc.v1.cri".registry.configs."dockerhubcache.example.com".tls]
insecure_skip_verify = true
```
* Add the option to enable default Pod Security Configuration
Enable Pod Security in all namespaces by default with the option to
exempt some namespaces. Without the change only namespaces explicitly
configured will receive the admission plugin treatment.
* Fix the PR according to code review comments
* Revert the latest changes
- leave the empty file when kube_pod_security_use_default, but add comment explaining the empty file
- don't attempt magic at conditionally adding PodSecurity to kube_apiserver_admission_plugins_needs_configuration
* Add 'flush ip6tables' task in reset role
If enable_dual_stack_networks is set to true and ip6 is defined,ip6tables will be created. But when reset the kubernetes cluster, kubespray doesn't flush ip6tables.
* [CI] fix molecule tests on opensuse by upgrading to 15.4 (#9175)
* [CI] fix molecule tests on opensuse by upgrading to 15.4
* [opensuse] use correct python crytography package name depending on distribution version
Co-authored-by: Cristian Calin <6627509+cristicalin@users.noreply.github.com>
This condition blocks the creation of the `etcd` user in certain conditions.
Specifically, when you have a `etcd_deployment_type: kubeadm` and `kube_owner: root`.
Being the `root` user already present on the system, this will not be a problem (due to the idempotency of ansible).
Today we have many contributions to contrib/offline/ and some PRs
contained invalid coding style for those scripts.
This enables shellcheck to make such invalid coding style easily.
* Update main.yaml
* remove version in dpkg_selection name
* make lint happy
* Fix typo
* add comment / remove useless contition
* remove dpkg hold in reset tasks
* This release removes support for Kubernetes v1.19.0
* This release adds support for Kubernetes v1.24.0
* Starting with this release, we will need permissions on the coordination.k8s.io/leases resource for leaderelection lock
The commit 1ce2f04 tried to merge multiple SUSE OS checks including
"openSUSE Leap" and "openSUSE Tumbleweed" into a single SUSE, but
that was a perfect change.
Then the commit c16efc9 tried to fix it for "openSUSE Leap", but it
didn't take care of "openSUSE Tumbleweed".
Then this adds "openSUSE Tumbleweed" to the OS check.
* Fix vcloud-csi bug related to #9046
Signed-off-by: yasintahaerol <yasintahaerol@gmail.com>
* add supervisor-fss-namespace=kube-system flag to vsphere-csi-controller-deployment
Signed-off-by: yasintahaerol <yasintahaerol@gmail.com>
This adds target components on check_readme_versions.sh after
merging https://github.com/kubernetes-sigs/kubespray/pull/9044
In addition, this fixes typo on check_readme_versions.sh
This adds `foo_version` variables for some components because
check_readme_versions.sh verifies the corresponding version for
`<component name>_version` from main.yml. This change also makes
consistency in the main.yml. In long-term, we will be able to
remove the existing `foo_image_tag` variables, but that is not now
for backwards compatibility for users.
During code-review, reviwers needed to take care of README.md also
should be updated when the pull request updated component versions.
This adds the corresponding check to reduce reviwer's burden.
* Added new configuration item for extra tolerations in policy controllers
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* Added new configuration item for extra tolerations in DNS autoscaler
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* Aligned existing handling of extra DNS tolerations
Signed-off-by: Sébastien Masset <smt.masset@gmail.com>
* feat: make kubernetes owner parametrized
* docs: update hardening guide with configuration for CIS 1.1.19
* fix: set etcd data directory permissions to be compliant to CIS 1.1.12
* extra admission controls now don't have a version in their file names
eventratelimit.v1beta2.yaml.j2 -> eventratelimit.yaml.j2
* cri_socket variable includes the unix:// prefix to be conformat with
upstream
When running molecule jobs, we saw the folloing warning message:
[DEPRECATION WARNING]: [defaults]callback_whitelist option, normalizing names
to new standard, use callbacks_enabled instead. This feature will be removed
from ansible-core in version 2.15. Deprecation warnings can be disabled by
setting deprecation_warnings=False in ansible.cfg.
callbacks_enabled has been added since Ansible 2.11 and Kubespray is using
Ansible 2.12 at master branch. So we can use callbacks_enabled safely to
avoid the warning message.
* Allow disabling calico CNI logs with calico_cni_log_file_path
Calico CNI logs up to 1G if it log a lot with current default settings:
log_file_max_size 100 Max file size in MB log files can reach before they are rotated.
log_file_max_age 30 Max age in days that old log files will be kept on the host before they are removed.
log_file_max_count 10 Max number of rotated log files allowed on the host before they are cleaned up.
See https://projectcalico.docs.tigera.io/reference/cni-plugin/configuration#logging
To save disk space, make the path configurable and allow disabling this log by setting
`calico_cni_log_file_path: false`
* Fix markdown
* Update roles/network_plugin/canal/templates/cni-canal.conflist.j2
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
Co-authored-by: Kenichi Omichi <ken1ohmichi@gmail.com>
This reverts commit e375678674.
The workaround of explicitly specifying root for the kubelet unit was
for pulling images from private registry. Kubernetes now have a
dedicated mechanism with imagePullSecret.
Current Kubespray supports the Kubernetes version 1.21 or upper with
`kube_version_min_required: v1.21.0`
Then kube_version v1.20- related code is not used at all.
This deletes those code for cleanup.
* [etcd] Add extra documentation for `etcd_memory_limit` and `etcd_quota_backend_bytes`
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [etcd] Add support for setting ETCD_MAX_REQUEST_BYTES
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [upcloud] add option to use preconfigured cpu/mem plan
* [upcloud] add option to use firewall rules for API server/SSH access
* [upcloud] add option to use managed load balancer
* [cilium] Separate templates for cilium, cilium-operator, and hubble installations
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update cilium-operator templates
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Allow using custom args and mounting extra volumes for the Cilium Operator
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update the cilium configmap to filter out the deprecated variables, and add the new variables
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Add an option to use Wireguard encryption on Cilium 1.10 and up
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Update cilium-agent templates
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* [cilium] Bump Cilium version to 1.11.3
Signed-off-by: necatican <necaticanyildirim@gmail.com>
* Assert that IP range is enough for the nodes
Co-authored-by: Necatican Yıldırım <necaticanyildirim@gmail.com>
* Fixed whitespace
* Fixed errors
* Fixed errors
Co-authored-by: Necatican Yıldırım <necaticanyildirim@gmail.com>
The version 1.4 of containerd has been End of Life since March 3, 2022
as https://containerd.io/releases/#support-horizon
It is nice to drop the support from Kubespray also to follow containerd.
tests/requirements.txt links to tests/requirements-2.12.txt, so
Kubespray uses ansible 2.12 by default for testing. However we
forgot to update testcases_prepare.sh to use ansible 2.12.
This updates testcases_prepare to use ansible 2.12.
* Force containerd service unmasking
Force systemd to unmask and start service when adding containerd service
* Eliminate restart and move unmasking step
Switch to start instead of restart
Move unmasking to restart handler
* Add unmasking to similar container runtimes
* Add missing service names
aufs-tools was required for docker.io package originally,
but Kubespray installs docker-ce package instead today.
In addition, Ubuntu 20.04 doesn't provide aufs-tools as [1].
Then this removes aufs-tools from Ubuntu requirement.
[1]: https://bugs.launchpad.net/ubuntu/+source/aufs-tools/+bug/1947004
* Update verbs for volumeattachments resource
Update verbs for volumeattachments resource so that the kubelet can create volumeattachments and mount volumes when deploying Kubernetes on VMware vSphere.
* Update verbs for volumeattachments resource
Update verbs for volumeattachments resource to match upstream
* Update vsphere-csi-controller-rbac.yml.j2
* - add ability to specify the network_zone in hetzner terraform
- Export the network id from hetzner terraform the the generated inventory.ini
* - Add with_networks variable to allow different deployments of hcloud controller manager
- Add network id to hcloud controller secret (added via the inventory)
- Don't include extra_args if it's not set
Current ansible.tags 'facts' is for skipping actual Kubespray deployment
at vagrant CI because the deployment takes much time. However the static
'facts' skips the deployment for normal usage of vagrant also.
That causes confusions.
This adds VAGRANT_ANSIBLE_TAGS to skip the deployment for vagrant CI.
The quotations in the variable nerdctl_extra_flags are not required for the `nerdctl_image_pull_command` and throw the following error when executing the cluster-playbook with `container_insecure_registries` set:
unknown flag: --insecure-registry\\\"
This happens as the complete nerdctl_image_pull_command string variable gets split into an array string for the cmd task. The escaped quotation doesn't get escaped properly and is added to the cmd-string array as part of the command. This leads to a wrong written insecure-registry flag, which throws this error.
Due to missing quotation of nerdctl_extra_flags, ansible-playbook was failed:
Using module file /usr/local/lib/python3.6/dist-packages/ansible/modules/command.py
Pipelining is enabled.
[..]
File "/usr/lib/python3.8/shlex.py", line 191, in read_token
raise ValueError("No closing quotation")
This fixes the issue.
T-Eberle investigated the issue and found the solution.
Thank you T-Eberle!
* [ansible] make ansible 5.x the new default version and move different versions tested to nightly jobs
* [CI] jobs were missing proper ansible cleanup
If running Kubespray on static IP environments, a task was failed like:
TASK [kubernetes/preinstall : Configure dhclient hooks for resolv.conf (RH-only)]
fatal: [ak8s2]: FAILED! => {
"changed": false, "checksum": "..",
"msg": "Destination directory /etc/dhcp/dhclient.d does not exist"}
This adds a check for dhclientconffile for running 0100-dhclient-hooks to
run the task only if dhcpclient is enabled.
* terraform/openstack: Use path.module for ansible_bastion_template.txt
This extends on #7643 by not using path.root, but switching to path.module
to allow use of the terraform code as a module itself. This change then keeps
all calls to the template file stable even for that use-case.
* terraform/openstack: Make sed calls fail on errors
By using a single call with two replacements to use of sed will create proper exit codes
and allowing for errors to be recognized by terraform.
When running cluster.yml for new machines what containerd is already
install but Kubernetes cluster were not installed before, the task
"remove-node | List nodes" is failed like
"changed": false,
"cmd": [
"/usr/local/bin/kubectl", "--kubeconfig",
"/etc/kubernetes/admin.conf", "get", "nodes", "-o",
"go-template={{ range .items }}{{ .metadata.name }}
{{ "\n" }}{{ end }}"
],
..
"stderr": "error: stat /etc/kubernetes/admin.conf: no such file or directory",
That was due to lack to check the existing Kubernetes cluster exists
or not before running "kubectl drain" command.
This adds the check to avoid the issue.
* [calico] make vxlan encapsulation the default
* don't enable ipip encapsulation by default
* set calico_network_backend by default to vxlan
* update sample inventory and documentation
* [CI] pin default calico parameters for upgrade tests to ensure proper upgrade
* [CI] improve netchecker connectivity testing
* [CI] show logs for tests
* [calico] tweak task name
* [CI] Don't run the provisioner from vagrant since we run it in testcases_run.sh
* [CI] move kube-router tests to vagrant to avoid network connectivity issues during netchecker check
* service proxy mode still fails connectivity tests so keeping it manual mode
* [kube-router] account for containerd use-case
* Sketch of helm-apps role interface
* helm-apps: Early implementation and settings
* helm-apps: Fix README.md example playbook
* fixup! Sketch of helm-apps role interface
* Make the argument specs more explicit
* Remove exposed options from hardcoded default
* Simplify example playbook in README.md
- Define directly the roles parameters
- Add an example of option override for one chart only
* Use release instead of charts
Make explicit that the role is mananing releases, not charts.
Simplify parameters naming
* Add epoch to docker-ce and docker-ce-cli packages to ensure docker upgrade
* Split container-engine redhat vars to support legacy RHEL 7 version management
* Support ansible_distribution_major_version when disvering vars with ansible_os_family
* Update ansible-lint to 5.4.0 (#8607)
It seems that the Rich version 11.0.0 has a breaking change.
So need to update ansible-lint to 5.3.2 or later.
* Fix for ansible-lint no-changed-when rule (#8607)
* There is an issue with etcd v3.5.0 where it resurrects ancient members see: https://github.com/etcd-io/etcd/issues/13196
This issue is clearly fixed in etcd v3.5.2
* Just keep the checksums
* [containerd] add hashes for 1.6.1
* [contained] make 1.6.1 the default
* [containerd] add hashes for 1.5.10
* [containerd] add hashes for 1.4.13
* [nerdct] bump to 0.17.1
* [containerd] add checksums for 1.6.0
* [containerd] promote 1.6.0 as the new default
* [runc] promote 1.1.0 as the new default to allow arm deployments out of the box
* [nerdctl] bump to 0.17.0 to align with containerd 1.6.0
* [reset] allow crictl stopp and rmp commands to fail
As far as I can tell this is simply a typo that has existed from the beginning. Having it this way around (`etcd` group as a child and thus subset of `k8s_cluster`) mirrors what is written in the preceeding sentence.
When trying to run print_hostnames of inventory.py, it outputs the following
error:
$ CONFIG_FILE=./test-hosts.yaml python3 ./inventory.py print_hostnames
Traceback (most recent call last):
File "./inventory.py", line 472, in <module>
sys.exit(main())
File "./inventory.py", line 467, in main
KubesprayInventory(argv, CONFIG_FILE)
File "./inventory.py", line 92, in __init__
self.parse_command(changed_hosts[0], changed_hosts[1:])
File "./inventory.py", line 415, in parse_command
self.print_hostnames()
File "./inventory.py", line 455, in print_hostnames
print(' '.join(self.yaml_config['all']['hosts'].keys()))
KeyError: 'all'
because it is missed to load a hosts config file before printing hostnames.
This fixes the issue.
* [calico] upgrade 3.19.x to 3.19.4
* [calico] upgrade 3.20.x to 3.20.4
* [calico] upgrade 3.21.x to 3.21.4 and make it the default
* [calico] add 3.22.0 checksums
* [calico] account for path changes in calico 3.21.4 crd archive and above
Fixed the problem when call ansible-playbook contrib/mitogen/mitogen.yml
"The error was: 'dict object' has no attribute 'section'"
What type of PR is this?
/kind bug
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Use openstack_networking_port_v2 and openstack_networking_floatingip_associate_v2
to attach floating ips. This gives us more flexibility on disabling port security
when binding instances directly on provider networks in private cloud scenario.
If kubelet is run with systemd (as it always is when using kubespray),
it starts in systemd's /system.slice/kubelet.service cgroup.
This commit prevents a creation and usage of a second unrelated cgroup.
* terraform/gcp: Do not create unused subnetworks
By default terraform creates a subnetwork in each 39 regions
* terraform/gcp: Upgrade to latest google provider
... where "one of source_tags, source_ranges, or source_service_accounts must be defined"
* Use sysctl_file_path variable for all sysctl_file locations
* Add sysctl_file_path variable to kubespay-defaults
* Remove previously used sysctl file locations if present
* Use explicit filename in roles/kubernetes/node/defaults/main.yml
* Defaults: use explicit value
targetRef on endpoints surfaces as
__meta_kubernetes_endpoint_address_target_kind/__meta_kubernetes_endpoint_address_target_name
in prometheus and gets converted to the label `node` by
prometheus-operator
* Fix terraform Warning
Version constraints inside provider configuration blocks are deprecated
Terraform 0.13 and earlier allowed provider version constraints inside the
provider configuration block, but that is now deprecated and will be removed
in a future version of Terraform. To silence this warning, move the provider
version constraint into the required_providers block.
* Fix terraform Warning: Quoted references are deprecated
* terraform: Update GCP Ubuntu to latest LTS
The tf-elastx_cleanup test job was failed with error message:
Port xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx cannot be deleted
directly via the port API: has device owner network:router_interface.
That means necessary to remove a subnet from the router before
deleting the port.
This adds a method to removes a subnet from the router automatically.
This allow to workaround #8375 by using image_command_tool=crictl
when containerd_registries is used for containerd.
Also changes image_info_command_on_localhost for docker to return digests.
This fixes the following types of failures:
- empty-string-compare
- literal-compare
- risky-file-permissions
- risky-shell-pipe
- var-spacing
In addition, this changes .gitlab-ci/lint.yml to block the same issue
by using the same method at Kubespray CI.
When running ansible-lint directly, we can see a lot of warning
message like
risky-file-permissions File permissions unset or incorrect
This fixes the warning messages.
All container image versions were defined in download/defaults/main.yml
except containerd.
The inconsistency caused the offline script(generate_list.sh) could not
output the URL of containerd image.
This moves the definition into a valid file.
In addition, this adds host_os to generate_list.sh for downloading
krew from a valid URL.
* Update config.toml.j2
i think this commit code is not completed works
exam registry address : a.com:5000
insecure registry must be http://a.com:5000
but this code add insecure a.com:5000 (without http://)
If there is no http, containerd accesses with https even if insecure_skip_verify = true
solution is code edit
* Update config.toml.j2
* Update containerd.yml
* Update containerd.yml
* Update containerd.yml
* Update config.toml.j2
I was deploy my cluster with separate etcd cluster and not intersect with kube_control_plane or kube_node. And I want to run etcd cluster in docker but still used containerd to make container runtime for all other nodes. Therefore, I was added note to this doc for everyone
Thank !
- Use builtin task scheduling of ansible (same task on each host)
instead of manual looping on master
Benefits:
- One less play in remove-node.yml playbook
- Parralel node drain
- Drain parameters (timeout, grace period, retries,
allow_ungraceful_removal) can be adjusted separately for each node
with ansible variables
* Ensure entries for 1.23 are added for supported_versions vars
* cri-o: add support for kubernetes 1.23 but still use cri-o 1.22
* kubescheduler-config: diferentiate config versions based on kube_version
* registry: service add clusterIP, nodePort, loadBalancer support
* modify camelcase name to underscore
* Add registry service type compatibility check
* containerd: change default resolvconf_mode to host_resolvconf
* Wait for kube-apiserver to come back after pod refresh
* Handle resolv.conf gracefully
* Retain currently configured DNS entries to ensure we don't break the resolvers
* Suse uses wickedd for network management so no dhcp hooks
* Molecule: increase ansible timeout
* CI: Increase ansible timeout to 120s for Packet jobs
* Improve control plane scale flow (#13)
* Added version 1.20.10 of K8s
* Setting first_kube_control_plane to a existing one
* Setting first_kube_control_plane to a existing one
* change first_kube_master for first_kube_control_plane
* Ansible-lint changes
If trying to pull k8scsi/csi-resizer image from gcr.io, we face the error
like:
$ docker pull gcr.io/k8scsi/csi-resizer:v1.0.0
Error response from daemon: Head https://gcr.io/v2/k8scsi/csi-resizer/
manifests/v1.0.0: unknown: Project 'project:k8scsi' not found or deleted.
$
We can pull the image from quay.io instead.
This fixes the issue.
* containerd: add hashes for 1.5.8 and 1.4.12 and make 1.5.8 the new default
* containerd: make nerdctl mandatory for container_manager = containerd
* nerdctl: bump to version 0.14.0
* containerd: use nerdctl for image manipulation
* OpenSuSE: install basic nerdctl dependencies
* set ingress-nginx default terminationGracePeriodSeconds to 5 min for the drain of connection
* Add ingress_nginx_termination_grace_period_seconds at sample inventory
install containerd on centos need to binary download it
but offline.yml has no that value
binary download url default in
roles/download/defaults/main.yml:runc_download_url: "https://github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
roles/download/defaults/main.yml:containerd_download_url: "https://github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
if i use default offlie.yml, it's error from task download files
because runc,containerd down url is none offline
i want fix this
just add 2 new line
* Docs: update CONTRIBUTING.md
* Docs: clean up outdated roadmap and point to github issues instead
* Docs: update note on kubelet_cgroup_driver
* Docs: update kata containers docs with note about cgroup driver
* Docs: note about CI specific overrides
* Defaults: replace docker with containerd as our default container_manager
* CI: Use docker for download_localhost test
* Defaults: with container_manager=containerd we need etcd_deployment_type=host
* CI: Run weave jobs with docker
* CI: Vagrant don't download_force_cache
* CI: Fix upgrade tests
* should run compatible with old settings, this means docker
* we need to run with a distro that has at least modern containerd,
this means move from debian9 to debian10 to allow `containerd_version`
to match between 2.17 and master
* add metallb auto-assign property for main IP range & update addons.yml for sample inventory
* add new line at the end of file roles\kubernetes-apps\metallb\defaults\main.yml
* set default value for matallb_auto_assign = true
* Fixes various issues in vSphere Terraform code
Provided to address various shortcomings and to fix the following
issue in upstream Kubespray:
https://github.com/kubernetes-sigs/kubespray/issues/8176
* Resolves Terraform formatting issues
* Sets default prefix to human-readable name
* Documents new default prefix in README
* Ansible: separate requirements files for supported ansible versions
* Ansible: allow using ansible 2.11
* CI: Exercise Ansible 2.9 and Ansible 2.11 in a basic AIO CI job
* CI: Allow running a reset test outside of idempotency tests and running it in stage1
* CI: move ubuntu18-calico-aio job to stage2 and relay only on ubuntu20 with the variously supported ansible versions for stage1
* CI: add capability to install collections or roles from ansible-galaxy to mitigate missing behavior in older ansible versions
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults
Make CI tests actually pass
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
* Limit kubectl delete node to k8s nodes
This avoids the use of `kubectl delete node` when removing etcd nodes
which are not part of the cluser (separate etcd)
* Take errors into account when deleting node
There should not be error now that we're limiting the deletion to nodes
actually in the cluster
* Retrying on error
* Ensure addon-resizer 1.8.11 only effective at arch amd64.
k8s.gcr.io/addon-resizer:1.8.11 returns the amd64 image which is not executable at arm64.
Disable addon-resizer when the platform is not amd64.
When metrics-server upgrade and use addon-resizer:2.3, then revert this
commit and `image_arch` will determine the `addon_resizer_image_tag`.
* Add metrics_server_resizer architectures check
* nginx-ingress: bump to 1.0.4
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* nginx-ingress: bump to 1.0.4 in the README
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* Containerd: download containerd from upstream instead of using distro specific packages
split runc download to separate role
make bootstrap-os role deploy container-selinux and seccomp libraries
clean up package manager provided containerd
move variables to docker role that are no longer common with containerd
* Containerd: make molecule testing more relevant
* replace ubuntu18 with ubuntu20
* add centos8 and debian11 to molecule tests
* run kubernetes/preinstall role to ensure relevancy
of test including dependency packages
* CI: adjust test scenarios for downloaded containerd
kube-bench scan outputs warning related to Calico like:
* text: "Ensure that the Container Network Interface file
permissions are set to 644 or more restrictive (Manual)"
* text: "Ensure that the Container Network Interface file
ownership is set to root:root (Manual)"
This fixes these warnings.
* netchecker: update images to 1.2.2 from Mirantis which is slightly less ancinet than the l23networks images
* Netchecker: use local etcd instead of kubernetes v1beta1 crds which are no longer suported by kube 1.22+
* Add Rocky as a known OS
* Make sure Rocky includes bootstrap-centos.yml
* Update docs with Rocky Linux
* Rocky Linux wireguard and EPEL
* Rocky Linux in the list of supported distributions
If the etcd cluster is separate and the etcd_deployment_type is "host",
there is no need for a container engine on the etcd nodes
Do not rely on a 'default(true)' filter, but define a proper default in
kubespray-defaults depending on etcd deployment method and if internal
or external etcd is used
to remove deprecation warning:
> Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
* Fix invalid link to Ansible documentation
* Fix invalid link to mitogen doc page
* Fix invalid link to calico doc page
* Fix all invalid links to doc pages
The addon-resizer container can reduce resource limits of cpu and
memory of metrics-server container in the pod, and that caused
OOMKilled.
In addition, the original metrics-server manifest doesn't contain
the addon-resizer container as [1].
So this adds metrics_server_resizer option to control the addon-resizer
container deployment and the default value is false to make it stable
for most environments.
[1]: 527679e5e8/manifests/base/deployment.yaml
"allowPrivilegeEscalation: false" blocks deploying metrics-server
on CentOS7. In addition, the original metrics-server manifest doesn't
contain it as [1]. This removes it.
[1]: 527679e5e8/manifests/base/deployment.yaml
* Kata-Containers: add 2.2.0 hashes and make default
* Kata-Containers: replace 2.1.0 with bugfix version 2.1.1
* Kata-Containers: move to q35 a more modern VM architecture as 'pc' is removed in 2.2.0
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
The path of kubeconfig should be configurable, and its default value
is /etc/kubernetes/admin.conf. Most paths of the file are configurable
but some were not. This make those configurable.
* Calico: make calico_min_version check relevant
* Calico: only check currently installed version against the oldest supported version by the previous release
Modify connection_strings_etcd to only return etcd nodes - not master nodes - since this results in duplicate hosts in the generated Ansible inventory and is unnecessary.
On Debian 11, `ipset` just recommend `iptables` so on the system that apt is configured with `APT::Install-Recommends "0";` iptables will not install automatically.
* Fix: adding new ips with inventory builder (#7577)
* moved conflig loading logic
to after checking whether the config
should be loaded, and added check for
whether the config should be loaded
* added check for removing nodes from config
if the user wants to remove a node, we
need to load the config
* Fix tox errors
* Fix missing file mode (risky-file-permissions)
Found this using ansible-lint.
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* Fix another missing file mode (risky-file-permissions)
This one fixes `/etc/crio/config.json`
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* CSI: update CSI snapshot CRDs
* CSI: update snapshot controller tag version with kubernetes specific versions
* CSI: allow enabling csi_snapshot_controller independent of Cinder CSI
* CSI: Align csi-snapshot-controller with upstream and use a Deployment instead of a StatefulSet
When using Calico with:
- `calico_network_backend: vxlan`,
- `calico_ipip_mode: "Never"`,
- `calico_vxlan_mode: "Always"`,
the `FelixConfiguration` object has `ipipEnabled: true`, when it should be false:
This is caused by an error in the `| bool` conversion in the install task:
when `calico_ipip_mode` is `Never`,
`{{ calico_ipip_mode != 'Never' | bool }}` evaluates to `true`:
* Fedora and RHEL use etc_t and the convention is <type_name>_t
* Docs: specify all values for preinstall_selinux_state
* CI: Add Fedora 34 with SELinux in enforcing mode
using --no-cache-dir flag in pip install ,make sure downloaded packages
by pip don't cached on system . This is a best practice which make sure
to fetch from repo instead of using local cached one . Further , in case
of Docker Containers , by restricting caching , we can reduce image size.
In term of stats , it depends upon the number of python packages
multiplied by their respective size . e.g for heavy packages with a lot
of dependencies it reduce a lot by don't caching pip packages.
Further , more detail information can be found at
https://medium.com/sciforce/strategies-of-docker-images-optimization-2ca9cc5719b6
Signed-off-by: Pratik Raj <rajpratik71@gmail.com>
Fix task 'Cert Manager | Wait for Webhook pods become ready' failed due to webhook pods don't exist yet by using `retries..until` trick like kubernetes-sigs/kubespray#7842
This fix should be removed in the future if the kubernetes/kubernetes#83242 is resolved.
Signed-off-by: rtsp <git@rtsp.us>
The main functions are wrapped by a sys.exit function which expects and
argument. The curent implementation isn't returning values in all cases.
This change ensures main functions return a value in all cases.
Fix task 'Cert Manager | Apply ClusterIssuer manifest' failed due to service/endpoints updating delayed even though the wekhook pod status is ready.
Signed-off-by: rtsp <git@rtsp.us>
Changes:
* ClusterRole updated according to the latest manifests from
https://github.com/kubernetes/cloud-provider-vsphere
* vSphere CPI/CSI default versions bumped and
tested successfully on K8S 1.21.1
* vSphere documentation updated
Signed-off-by: Vitaliy D <vi7alya@gmail.com>
non-kubeadm mode has been removed since ddffdb63bf
2.5 years ago. The non-kubeadm makes unnecessary confusion today, then
this updates the documentation.
Previously IDs of container images were gotten from tar files of container
images but that way was wrong. If multiple json files are contained in a
tar file, the script got multiple IDs and tried to pass these IDs on
`docker tag` command. Then the command was failed.
This updates the script to get image IDs from `docker image inspect` command
to fix this issue.
In addition, this adds a check a registry container exists already or not
before deploying registry container to avoid a container conflict failure.
* CRI-O: Install libseccomp2 from backports on Debian 10
libseccomp2 is a required dependency of cri-o-runc package
The one provided in Debian 10 repositories is outdated
* 7816: Remove useless when condition
As this condition is handled by block
* fix(misc): terraform/aws
- handles deployment with a single availability zone
- handles deployment with more than two availability zone
- handles etcd collocation with control-plane nodes (`aws_etcd_num=0`)
- allows to set a bastion instances count (`aws_bastion_num`)
- allows to set bastion/etcd/control-plane/workers rootfs volume size
- removes variables from terraform.tfvars that were not re-used
- adds .terraform.lock.hcl to .gitignore
- changes/updates base image from ubuntu-18.03 to debian-10
tested by a few coworkers of mine, and myself: thanks for the outstanding
work, on both those terraform samples and kubespray playbooks.
I did not test ubuntu deployments, I could still swap from buster to
focal. LMK.
* fix(gitlab-ci)
AFAIU, terraform.tfvars indentation should be fixed for / no diff
returned running `terraform fmt -check -diff`
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/1445622114
To download necessary files in advance for offline deployment,
we can see all file URLs with contrib/offline/generate_list.sh
Most URLs are downloadable, but gvisor's one is not because the
URL is a part of full URLs for gvisor.
To download gvisor's files from the URLs directory, this separates
into two URLs for runsc and the shim.
tf-elax_ubuntu18-calico is so flake today. The test job is failed
due to SSH connectivity check error after deploying virtual machines
which are used for Kubernetes nodes.
This allows failure on the job to see the test situation without
pull request merger failures.
When running the script, I faced the following error but it was
difficult to know the root problem due to lack of error handling.
docker tag" requires exactly 2 arguments.
See 'docker tag --help'.
Usage: docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
To investigate such errors easily, this adds an error handling.
* csi-driver: Added possibility to use application credentials for cinder
* external-cloud-controller: Added env vars for openstack application credentials
* set selinux type t_etc if selinux state is enforcing
* workaround with update repo is no longer needed
remove comments about failing playbook
* grubby is not available in distros using ostree
* remove docker support because removed in fcos
update install script example with live rootfs
* do not call grubby on ostree based distro
* update docs enabling containerd on fedora coreos
* Ansible: move to Ansible 3.4.0 which uses ansible-base 2.10.10
* Docs: add a note about ansible upgrade post 2.9.x
* CI: ensure ansible is removed before ansible 3.x is installed to avoid pip failures
* Ansible: use newer ansible-lint
* Fix ansible-lint 5.0.11 found issues
* syntax issues
* risky-file-permissions
* var-naming
* role-name
* molecule tests
* Mitogen: use 0.3.0rc1 which adds support for ansible 2.10+
* Pin ansible-base to 2.10.11 to get package fix on RHEL8
* terraform/openstack: Use path.root for ansible_bastion_template.txt
The path.root variable points to the root module path. Using this
instead of a relative path makes less assumptions about the current
working directory.
* terraform/openstack: Add group_vars_path variable
Previously, the group_vars path was assumed to be in CWD. The
default value for the group_vars_path variable is still relative
to CWD and thus should be backwards compatible if unset.
* Calico: align manifests with upstream
* allow enabling typha prometheus metrics
* Calico: enable eBPF support
* manage the kubernetes-services-endpoint configmap
* Calico: document the use of eBPF dataplane
* Calico: improve checks before deployment
* enforce disabling kube-proxy when using eBPF dataplane
* ensure calico_version is supported
* Kata: add Kata 2.x checksums and adjust download urls for 2.x
* Kata: drop 1.x version which is no longer supported
* Kata: set default version 2.1.0
* Packet->Equinix Metal rename #6901
Updates throughout to reflect #6901 renaming for Packet to Equinix Metal.
* Rename Packet to Equinix Metal throughout the project #6901
Packet is renamed to Equinix Metal in more contexts including
documentation links. The Terraform provider used is still the Packet
provider. The environment variables and configuration options still
refer to the Packet name.
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Co-authored-by: Edward Vielmetti <ed@packet.net>
* Calico: add v3.19.1 hashes
* enable liveness probe for calico-kube-controllers
3.19.1
* Calico: drop support for v3.16.x
* Calico: promote v3.18.3 as default
* Override the default value of containerd's root, state, and oom_score configurations
* Add tests data for containerd_storage_dir, containerd_state_dir and containerd_oom_score variables
Basically we need to make necessary TCP/UDP ports open.
However the necessary ports are so many, and sometimes it is difficult
to figure out that is due to firewall issues or not if facing deployment
issues.
To distinguish a root problem on such situation, this adds contrib
playbook to disable the service firewall for Kubespray development
and test.
* add support for using ansible 2.10.x for deploying kubespray
* move dns-autoscaler-clusterrole{binding}.yml to files/ folder
* note that ansible 2.10 is now experimentally supported
* coredns: move files to templates like before #4341
* add initial MetalLB docs
* metallb allow disabling the deployment of the metallb speaker
* calico>=3.18 allow using calico to advertise service loadbalancer IPs
* Document the use of MetalLB and Calico
* clean MetalLB docs
Since K8S 1.21, BoundServiceAccountTokenVolume feature gate is in beta stage, thus activated by default (anyone who follows CSI guidelines has enabled AllAlpha and faced the issue before 1.21).
With this feature, SA tokens are regenerated every hour.
As a consequence for Calico CNI, token in /etc/cni/net.d/calico-kubeconfig copied from /var/run/secrets/kubernetes.io/serviceaccount in install-cni initContainer expires after one hour and any pod creation fails due to unauthorization.
Calico pods need to be restarted so that /etc/cni/net.d/calico-kubeconfig is updated with the new SA token.
follow new naming conventions for gcr's coredns image.
starting from 1.21 kubeadm assumes it to be `coredns/coredns`:
this causes the kubeadm deployment being unable to pull image, beacuse `v`
was also added in image tag, until the role `kubernetes-apps` ovverides
it with the old name, which is only compatible with <=1.7.
Backward comptability with kubeadm <=1.20 is mantained checking
kubernetes version and falling back to old names (`coredns:1.xx`) when
the version is less than 1.21
* rename ansible groups to use _ instead of -
k8s-cluster -> k8s_cluster
k8s-node -> k8s_node
calico-rr -> calico_rr
no-floating -> no_floating
Note: kube-node,k8s-cluster groups in upgrade CI
need clean-up after v2.16 is tagged
* ensure old groups are mapped to the new ones
* crio: add supported versions 1.20 and 1.21 and align default with k8s version
* cri-o: drop versions 1.17 and 1.18 from version matrix
* update note on cri-o version alignment
* calico: drop support for version 3.15
* drop check for calico version >= 3.3, we are at 3.16 minimum now
* we moved to calico 3.16+ so we can default to /opt/cni/bin/install
* AlmaLinux: ansible>2.9.19 is needed to know about AlmaLinux
* AlmaLinux: identify as a centos derrivative
* AlmaLinux: add AlmaLinux to checks for CentOS
* Use ansible_os_family to compare family and not distribution
As the official document[1], the parameter keepcache should be
'0' or '1' as string. To avoid the following warning message,
this fixes the parameter value:
[WARNING]: The value False (type bool) in a string field was
converted to u'False' (type string). If this does not look
like what you expect, quote the entire value to ensure it
does not change.
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_repository_module.html
Context: Load-balancing in Exoscale is performed by associating many
workers with the same EIP. This works, however, the workers cannot access
themselves via the EIP, which is needed at least for cert-managers
"self-test".
Problem: The old iptables based workaround felt fragile and disappointed
me at least once.
New solution: Add the EIP to a loopback interface on each worker.
* Add containerd_extra_args
This is useful for custom containerd config, e.g. auth
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
* Make containerd config.toml mode 0640
It may contain sensitive information like password
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
This PR is to move the cilium kvstore options to the configmap
rather than specifying them in the deployment as args. This
is not technically necessary but keeping all the options in
one place is probably not a bad idea.
Tested with cilium 1.9.5.
When attempting a fresh install without cilium_ipsec_enabled I ran
into the following error:
failed: [k8m01] (item={'name': 'cilium', 'file': 'cilium-secret.yml', 'type': 'secret', 'when': 'cilium_ipsec_enabled'}) =>
{"ansible_loop_var": "item", "changed": false, "item": {"file": "cilium-secret.yml", "name": "cilium", "type": "secret",
"when": "cilium_ipsec_enabled"},"msg": "AnsibleUndefinedVariable: 'cilium_ipsec_key' is undefined"}
Moving the when condition from the item level to the task level solved
the issue.
* Add KubeSchedulerConfiguration for k8s 1.19 and up
With release of version 1.19.0 of kubernetes KubeSchedulerConfiguration
was graduated to beta. It allows to extend different stages of
scheduling with profiles. Such effect is achieved by using plugins and
extensions.
This patch adds KubeSchedulerConfiguration for versions 1.19 and later.
Configuration is set to k8s defaults or to kubespray vars. Moving those
defaults to new vars will be done in following patch.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* KubeSchedulerConfiguration: add defaults
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* Add documentation for audit webhook variables
* Enclose the value of audit_webhook_server_url in a codeblock
* Add default value for audit_webhook_batch_max_wait
Starting with Cilium v1.9 the default ipam mode has changed to "Cluster
Scope". See:
https://docs.cilium.io/en/v1.9/concepts/networking/ipam/
With this ipam mode Cilium handles assigning subnets to nodes to use
for pod ip addresses. The default Kubespray deploy uses the Kube
Controller Manager for this (the --allocate-node-cidrs
kube-controller-manager flag is set). This makes the proper ipam mode
for kubespray using cilium v1.9+ "kubernetes".
Tested with Cilium 1.9.5.
This PR also mounts the cilium-config ConfigMap for this variable
to be read properly.
In the future we can probably remove the kvstore and kvstore-opt
Cilium Operator args since they can be in the ConfigMap. I will tackle
that after this merges.
When upgrading cilium from 1.8.8 to 1.9.5 I ran into the following
error:
level=error msg="Unable to update CRD" error="customresourcedefinitions.apiextensions.k8s.io
\"ciliumnodes.cilium.io\" is forbidden: User \"system:serviceaccount:kube-system:cilium-operator\"
cannot update resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the
cluster scope" name=CiliumNode/v2 subsys=k8s
The fix was to add the update verb to the clusterrole. I also added
create to match the clusterrole created by the cilium helm chart.
DNSSEC is off by default on ubuntu/bionic64 (18.04) as per resolved.conf(5).
These tasks are artefacts of obsolete infra configuration, and no longer needed.
Further removing these tasks resolves the issue that the tasks always reports
'changed' and bounces systemd-resolved unneccesarily, even if there was no
actual modification of /etc/systemd/resolved.conf.
* Remove contrib/vault
This is marked as broken since 2018 / 3dcb914607
This still reference apiserver.pem, not used since ddffdb63bf
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Finish nuking vault from the codebase
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This replaces kube-master with kube_control_plane because of [1]:
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
NOTE: The reason why this changes it to kube_control_plane not
kube-control-plane is for valid group names on ansible.
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
While at it remove force_certificate_regeneration
This boolean only forced the renewal of the apiserver certs
Either manually use k8s-certs-renew.sh or set auto_renew_certificates
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add crun download_url and checksum
* Change versioning format to crun native versioning
* Download crun using download_file.yml
* Get crun version from download defaults
* Delegate crun binary copy task to crun role
* Download Calico KDD CRDs
* Replace kustomize with lineinfile and use ansible assemble module
* Replace find+lineinfile by sed in shell module to avoid nested loop
* add condition on sed
* use block for kdd tasks + remove supernumerary kdd manifest apply in start "Start Calico resources"
"The error was: 'proxy_disable_env' is undefined\n\nThe error appears to
be in '<censored>scale.yml': line 72, column 7"
Fixes 067db686f6
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
upgrades.md explains how to do upgrade from v1.4.3 to v1.4.6 as an
example. The versions are a little old, and the doc readers would
have a concern the upgrade works fine or not.
This updates versions after verifying the way works fine by hands.
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* Updates to README.md and main.tf files
* formatting and updating readme
* added a .terraform_validate CI job
* fixed format issue
* added sample inventory
* added symbolic link to group_vars
* added missing tf variables and minor fixes
* added text formatting
* minor formatting fixes
* add nodeselector and tolerations for metallb
* remove unnecessary commented lines in metallb template
* set default speaker toleration to match original manifest
When privileged is enabled for a container, all the `/dev/*` block
devices from the host are mounted into the guest. The
`privileged_without_host_devices` flag prevents host devices from
being passed to privileged containers.
More information:
* https://github.com/containerd/cri/pull/1225
* 1d0f68156b
The important action in kubeadm-version.yml is the templating of the configuration,
not finding / setting the version
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
kubeadm is the default for a long time now,
and admin.conf is created by it, so let kubeadm handle it
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Using `kubeadm init phase kubeconfig all` breaks kubelet client certificate rotation
as we are missing `kubeadm init phase kubelet-finalize all` to point to `kubelet-client-current.pem`
kubeconfig format is stable so let's just use lineinfile,
this will avoid other future breakage
This revert to the logic before 6fe2248314
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
On CentOS 8 they seem to be ignored by default, but better be extra safe
This also make it easy to exclude other network plugin interfaces
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add terraform scripts for vSphere
* Fixup: Add terraform scripts for vSphere
* Add inventory generation
* Use machines var to provide IPs
* Add README file
* Add default.tfvars file
* Fix newlines at the end of files
* Remove master.count and worker.count variables
* Fixup cloud-init formatting
* Fixes after initial review
* Add warning about disabled DHCP
* Fixes after second review
* Add sample-inventory
* use external_openstack_lbaas_use_octavia for template openstack-cloud-config
* Delete external_openstack_lbaas_use_octavia from default values. Added description and default values of variables to docs
* markdown fix
* make this simple
* set external_openstack_lbaas_use_octavia in default values
* duplicated variable in doc
This replaces KUBE_MASTERS with KUBE_CONTROL_HOSTS because of [1]:
```
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
```
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
Since a790935d02 all proxy users
should be properly configured
Now when you have *_PROXY vars in your environment it can leads to failure
if NO_PROXY is not correct, or to persistent configuration changes
as seen with kubeadm in 1c5391dda7
Instead of playing constant whack-a-bug, inject empty *_PROXY vars everywhere
at the play level, and override at the task level when needed
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Before this commit, we were gathering:
1 !all
7 network
7 hardware
After we are gathering:
1 !all
1 network
1 hardware
ansible_distribution_major_version is gathered by '!all'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Move proxy_env to kubespray-defaults/defaults
There is no reasons to use set_facts here
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Ensure kubeadm doesn't use proxy
*_proxy variables might be present in the environment (/etc/environment, bash profile, ...)
When this is the case we end up with those proxy configuration in /etc/kubernetes/manifests/kube-*.yaml manifests
We cannot unset env variables, but kubeadm is nice enough to ignore empty vars
93d288e2a4/cmd/kubeadm/app/util/env.go (L27)
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Ubuntu 18.04 crio package ships with 'mountopt = "nodev,metacopy=on"'
even if GA kernel is 4.15 (HWE Kernel can be more recent)
Fedora package ships without metacopy=on
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
By default Ansible stat module compute checksum, list extended attributes and find mime type
To find all stat invocations that really use one of those:
git grep -F stat. | grep -vE 'stat.(islnk|exists|lnk_source|writeable)'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
`containerd.io` is the companion package of `docker-ce` and is the
proper package name. This is needed to avoid apt upgrade/dist-upgrade
from breaking kubernetes.
Running remove-node.yml tasks for clean up cluster on Fedora CoreOS.
The task failed to restart network daemon (task name: "reset | Restart network").
Fedora CoreOS is essentially using NetworkManager, but this task returns network.
Signed-off-by: Takashi IIGUNI <iiguni.tks@gmail.com>
* Add unique annotation on coredns deployment and only remove existing deployment if annotation is missing.
* Ignore errors when gathering coredns deployment details to handle case where it doesn't exist yet
* Remove run_once, deletegate_to and add to when statement
* Added force_etcd_cert_refresh var to maintain existing functionality. Broke out etcd node cert syncing from member and admin cert sync logic. Now first etcd will sync node certs to other etcd members on every run to keep all etcds up to date after adding additional worker nodes to the cluster
* Updated etcd cert check tasks to better detect when new certificates need to be generated
* Move usage of force_etcd_cert_refresh var to gen_certs fact set
* Force etcd cert generation per server if force_etcd_cert_refresh is set to true
* Include gathering of node certs even if k8s-cluster member and in etcd group.
* Removed run_once due to when statement
Helm v3.5.2 is a security (patch) release. Users are strongly
recommended to update to this release. It fixes two security issues in
upstream dependencies and one security issue in the Helm codebase.
See https://github.com/helm/helm/releases/tag/v3.5.2
This makes the docker role work the same as the containerd role.
Being able to override this is needed when you have your own debian
repository. E.g. when performing an airgapped installation
* contrib/terraform/exoscale: Rework SSH public keys
Exoscale has a few limitations with `exoscale_ssh_keypair` resources.
Creating several clusters with these scripts may lead to an error like:
```
Error: API error ParamError 431 (InvalidParameterValueException 4350): The key pair "lj-sc-ssh-key" already has this fingerprint
```
This patch reworks handling of SSH public keys. Specifically, we rely on
the more cloud-agnostic way of configuring SSH public keys via
`cloud-init`.
* contrib/terraform/exoscale: terraform fmt
* contrib/terraform/exoscale: Add terraform validate
* contrib/terraform/exoscale: Inline public SSH keys
The Terraform scripts need to install some SSH key, so that Kubespray
(i.e., the "Ansible part") can take over. Initially, we pointed the
Terraform scripts to `~/.ssh/id_rsa.pub`. This proved to be suboptimal:
Operators sharing responbility for a cluster risk unnecessarily replacing resources.
Therefore, it has been determined that it's best to inline the public
SSH keys. The chosen variable `ssh_public_keys` provides some uniformity
with `contrib/azurerm`.
* Fix Terraform Exoscale test
* Fix Terraform 0.14 test
* update local-path-storage config template to version v0.0.19
* changes local_path_provisioner image tag to v0.0.19
* removes copy paste example from rancher local-path-provisioner repo
According to the following recommendation, this moves the directory
to control-plane:
The Kubernetes project is moving away from wording that is considered
offensive. A new working group WG Naming was created to track this work,
and the word "master" was declared as offensive. A proposal was formalized
for replacing the word "master" with "control plane".
Fixes the following error when using Bastion Node with the sample config.
```
fatal: [bastion]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'bastion'\n\nThe error appears to be in '/home/felix/inovex/kubespray/roles/bastion-ssh-config/tasks/main.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: set bastion host IP\n ^ here\n"}
```
Previous check for presence of NM assumed "systemctl show
NetworkManager" would exit with a nonzero status code, which seems not
the case anymore with recent Flatcar Container Linux.
This new check also checks the activeness of network manager, as
`is-active` implies presence.
Signed-off-by Jorik Jonker <jorik@kippendief.biz>
This was introduced in 143e2272ff
Extra repo is enabled by default in CentOS, and is not the right repo for EL8
Instead of adding a CentOS repo to RHEL, enable the needed RHEL repos with rhsm_repository
For RHEL 7, we need the "extras" repo for container-selinux
For RHEL 8, we need the "appstream" repo for container-selinux, ipvsadm and socat
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Only checking the kubernetes api on the first master when upgrading is not enough.
Each master needs to be checked before it's upgrade.
Signed-off-by: Rick Haan <rickhaan94@gmail.com>
yum_repository expect really different params, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
we don't need rpm_key, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Here the desciption from Ansible docs
Corresponds to the --force-yes to apt-get and implies allow_unauthenticated: yes
This option will disable checking both the packages' signatures and the certificates of the web servers they are downloaded from.
This option *is not* the equivalent of passing the -f flag to apt-get on the command line
**This is a destructive operation with the potential to destroy your system, and it should almost never be used.** Please also see man apt-get for more information.
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
This variable was added as KUBE_MASTERS_MASTERS. That's probably a typo.
Remove the redundant `_MASTERS` suffix. Also, document the variable in the
help message.
TASK [Generate a list of information about the images on a node]
registers list of container images to docker_images.
Then the next TASK [Set pull_required if the desired image is not
yet loaded] does based on expecting images are registered.
However sometimes the first TASK was failed as [1] but the failure
is ignored due to failed_when:false and it makes another issue.
This removes this unnecessary failed_when to detect the failure
at the point.
In addition, this removes no_log:true also because the output doesn't
contain any sensitive data and now it just makes debugging difficult.
[1]: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/934714534#L2953
no_proxy is a pain to get right, and having proxy variables present causes issues
(k8s components get proxy configuration after upgrade, see #7100)
It's better to only configure what require proxy:
- the runtime (containerd/docker/crio)
- the package manager + apt_key
- the download tasks
Tested with the following clusters
- 4 CentOS 8 nodes
- 1 Ubuntu 20.04 node
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
In some environments, it might not be possible to ping the IP address
of the nodes, e.g., because ICMP echo is blocked.
This commit allows kubespray to be configured to disable the ping
check, while performing all other checks.
TASK [network_plugin/calico : Calico | Configure calico network pool] **********
task path: /builds/kargo-ci/kubernetes-sigs-kubespray/roles/network_plugin/calico/tasks/install.yml:138
Friday 08 January 2021 17:10:12 +0000 (0:00:01.521) 0:11:36.885 ********
[WARNING]: The value {'kind': 'IPPool', 'apiVersion': 'projectcalico.org/v3',
'metadata': {'name': 'default-pool'}, 'spec': {'blockSize': 24, 'cidr':
'10.233.64.0/18', 'ipipMode': 'Always', 'vxlanMode': 'Never', 'natOutgoing':
True}} (type dict) in a string field was converted to "{'kind': 'IPPool',
'apiVersion': 'projectcalico.org/v3', 'metadata': {'name': 'default-pool'},
'spec': {'blockSize': 24, 'cidr': '10.233.64.0/18', 'ipipMode': 'Always',
'vxlanMode': 'Never', 'natOutgoing': True}}" (type string). If this does not
look like what you expect, quote the entire value to ensure it does not change.
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
This fixes the following failures:
./contrib/offline/README.md:14:1 MD014/commands-show-output Dollar signs used before commands without showing output [Context: "$ ./manage-offline-container-i..."]
./contrib/offline/README.md:20:1 MD014/commands-show-output Dollar signs used before commands without showing output [Context: "$ ./manage-offline-container-i..."]
* Remove because of empty need_http_proxy.rc if http/https_proxy and skip_http_proxy_on_os_packages=true is set
* Modify sample for debian and centos skip_http_proxy
* Modify sample for debian and centos skip_http_proxy
If some settings were changed from the default but not commited into an inventory repo,
we risk breaking the cluster / cause downtime, so add some extra checks
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Upgrading docker / containerd without adapting the configuration might break the node,
so disable docker-ce repo by default.
We are already using dpkg hold for Debian.
All containerd.io packages provide /usr/bin/runc, so no need to check
yum_conf was never used for containerd
module_hotfixes should not be needed with the EL8 repo
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* [terraform/aws] Fix Terraform >=0.13 warnings
Terraform >=0.13 gives the following warning:
```
Warning: Interpolation-only expressions are deprecated
```
The fix was tested as follows:
```
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
which gave no errors nor warnings.
* [terraform/openstack] Fixes for Terraform >=0.13
Terraform >=0.13 gives the following error:
```
Error: Failed to install providers
Could not find required providers, but found possible alternatives:
hashicorp/openstack -> terraform-provider-openstack/openstack
```
This patch fixes these errors.
This fix was tested as follows:
```
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
which gave no errors nor warnings for Terraform 0.13.5 and Terraform
0.14.3. Unfortunately, 0.12.x gives a harmless warning, but
with 0.14.3 out the door, I guess we need to move on.
* [terraform/packet] Fixes for Terraform >=0.13
This fix was tested as follows:
```
export PACKET_AUTH_TOKEN=blah-blah
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
Errors are gone, but warnings still remain. It is impossible to please
all three versions of Terraform.
* Add tests for Terraform >=0.13
* Fedora CoreOS: Fix for ethtool pre-installed
Fix error in rpm-ostree when ethtool is already insatlled (FCOS >= 32.20201104.3.0)
* Fedora CoreOS: Fix connection lost
Fedora CoreOS: Ignore connection lost due to reboot and continues the playbook
Now markdownlint covers ./README.md and md files under ./docs only.
However we have a lot of md files under different directories also.
This enables markdownlint for other md files also.
We are currently setting the IP variable to hostIP,
Before https://github.com/projectcalico/node/pull/593 (not yet released)
Calico interpret that as hostIP/32
Using 'can-reach' we get the future behavior
This fixes vxlan and IPIP CrossSubnet modes
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Just after creating a namespace, the corresponding token could not be
created and sometimes the pod creation might be failed.
This adds check of the token in the new namespace to make this test
case stable.
* update files to handle multi-asn bgp peering conditions.
* put back in the serviceClusterIPs. Bad merge.
* remove extraneous environment var.
* update files as discussed with mirwan
* update titles.
* add not in.
* add a conditional for using bgp to advertise cluster ips.
Co-authored-by: marlow-h <mweston@habana.ai>
If cluster-name is not set, the default value "kubernetes" is used.
The loadbalancees created by Kubernetes follow the format:
kube_service_clusterName_serviceNamespace_serviceName
If 2 clusters create a loadbalancer for the same service in the same
namespace, they will share the same non-working loadbalancer.
Signed-off-by: Cedric Hnyda <cedric.hnyda@itera.io>
tests/scripts/ansible-lint.sh was written on the doc, but there was
not such file actually. We can use ansible-lint command to check
ansible yml files without any options.
This updates to use the command.
If a branch name contains '.sh', current shellcheck checks the branch
file under .git/ and outputs error because the format is not shell
script one.
This makes shellcheck exclude files under .git/ to avoid this issue.
* Update hashes and set default version to 1.19.5
Signed-off-by: anthr76 <hello@anthonyrabbito.com>
* Reorder hashes
1.19.5 hashes should be near 1.19.x
* Added back blank line
This fixes the following warning:
[kubernetes/client : Generate admin kubeconfig with external api endpoint]
[WARNING]: Consider using the file module with state=directory rather than
running 'mkdir'. If you need to use command because file is insufficient
you can
RedHat 8.3 merged nf_conntrack_ipv4 in nf_conntrack but still advertise 4.18
so just try to modprobe and decide depending on the success
Also nf_conntrack is a dependency of ip_vs, so no need to care about it
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* Ensure libseccomp is installed before starting containerd on CentOS 8
* Simplify libseccomp install on CentOS 8
- Uses `package` module
- Replaces complex version check with 'state: latest'. The version must
be > 2.3 when using with cri-o.
- Removes unnecessary `not is_ostree` condition as CentOS 8 does not use
ostree
* copying ssh key no longer required, works with password auth
* use copy module instead of synchronize (which requires sshpass)
* less tasks and always changed tasks
* containerd docker hub registry mirror support
* add docs
* fix typo
* fix yamllint
* fix indent in sample
and ansible-playbook param in testcases_run
* fix md
* mv common vars to tests/common/_docker_hub_registry_mirror.yml
* checkout vars to upgrade tests
If crictl (and docker) binaries are deployed to the directories
that are not in standard PATH (e.g. /usr/local/bin), it is required
to specify full path to the binaries.
The task outputs the following warning:
TASK [kubernetes/preinstall : Enable ip forwarding]
[WARNING]: The value 1 (type int) in a string field was converted
to u'1' (type string). If this does not look like what you expect,
quote the entire value to ensure it does not change.
This new version uses the same base image as kube-proxy
(k8s.gcr.io/build-image/debian-iptables)
This allow to automatically pick iptables-legacy or iptables-nft,
and be compatible with RHEL/CentOS 8
https://github.com/kubernetes/dns/pull/367
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* fix flake8 errors in Kubespray CI - tox-inventory-builder
* Invalidate CRI-O kubic repo's cache
Signed-off-by: Victor Morales <v.morales@samsung.com>
* add support to configure pkg install retries
and use in CI job tf-ovh_ubuntu18-calico (due to it failing often)
* Switch Calico, Cilium and MetalLB image repos to Quay.io
Co-authored-by: Victor Morales <v.morales@samsung.com>
Co-authored-by: Barry Melbourne <9964974+bmelbourne@users.noreply.github.com>
calico PODs are first started and then in a handler killed and
restarted for no reason, nothing has changed.
By using the existing variable 'calico_cni_config' (only defined when
calico has already started) the restart can be skipped.
* create a wrapper script with pki options
* supports all kubespray managed container engines
Co-authored-by: Hans Feldt <hafe@users.noreply.github.com>
* Allow the eventRecordQPS setting to be set.
The eventRecordQPS parameter controls rate limiting for event recording. When zero, unlimited events can cause denial-of-service situations. For my situation, I don't need more than a setting of "5". This change allows me to configure the setting before creating the cluster.
* Allow the eventRecordQPS setting to be set.
The default settings (see types.go) is five. So, this change does not affect the cluster provisioning. However, it does allow for the setting to be changed.
Fedora 31 uses Cgroups v2 by default. This change by passes the kernel
parameter systemd.unified_cgroup_hierarchy=0.
Signed-off-by: Victor Morales <v.morales@samsung.com>
When https://kubespray.io/#/docs/comparisons is generated, having the link in the heading creates the following HTML. When displayed there is no space between "vs" and the link. I simply moved the link into the following paragraph.
```
<h2 id="kubespray-vs-kops"><a href="#/docs/comparisons?id=kubespray-vs-kops" data-id="kubespray-vs-kops" class="anchor"><span>Kubespray vs </span></a><a href="https://github.com/kubernetes/kops" target="_blank" rel="noopener">Kops</a></h2>
```
* Add note about changing private IP in admin.conf.
When I run kubespray, a load balancer is created which should be used instead of the ip of the controller node.
* Procedure to find load balancer and update admin.conf
When I run kubespray, a load balancer is used instead of the private ip of the controller.
* update version of ingress-nginx controller.
Change tag from controller-v0.34.0 to controller-v0.40.2 to use newest tag.
* Update docs about aws deploy templates.
In the yaml templates, there is no mention of idle timeouts. This is why I removed the documentation about it. This might be a mistake. Please verify this. I don't know enough to verify it myself.
* Change label when checking version.
When checking for `app.kubernetes.io/name=ingress-nginx`, a completed pod was selected which is not helpful when trying to `exec`. Changing the label selects the running controller pod.
* put back the information about ELB Idle Timeouts.
When I removed the information, I had overlooked that it was mentioned in the L7 yaml file. Thanks.
and thereby support upgrade from e.g. 1.18.x to 1.19.y
Included OSes:
- Centos7/8
- Ubuntu18/20
New variables for overriding by default installed packages:
- centos_crio_packages
- ubuntu_crio_packages
* Enable Kata Containers for CRI-O runtime
Kata Containers is an OCI runtime where containers are run inside
lightweight VMs. This runtime has been enabled for containerd runtime
thru the kata_containers_enabled variable. This change enables Kata
Containers to CRI-O container runtime.
Signed-off-by: Victor Morales <v.morales@samsung.com>
* Set appropiate conmon_cgroup when crio_cgroup_manager is 'cgroupfs'
* Set manage_ns_lifecycle=true when KataContainers is enabed
* Add preinstall check for katacontainers
Signed-off-by: Victor Morales <v.morales@samsung.com>
Co-authored-by: Pasquale Toscano <pasqualetoscano90@gmail.com>
Command line flags aren't added to kube-proxy which results in missing
feature gates set in this component. Add appropriate setting to
ConfigMap instead.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'kubeadm_upload_cert'
kubeadm_upload_cert will never be found as a hostvar for the first
master since the task is executed for a worker.
Fix by executing the upload task for the first master and register
the needed key. After that, workers can read hostvars for the master
Var kubeadm_etcd_refresh_cert_key removed since it no longer has
any use.
* Adding option to disable gloablly applying a proxy to etc/yum.conf
* Change made to proxy_yum_globaly basedon reviewer feedback
* fix trailing spaces in ymllint
crio refuses to delete pods when cni is unavailable which is the
case e.g. using calico with kdd datastore. See:
https://github.com/cri-o/cri-o/issues/4084
Fix by deleting storage associated with containers. Stop and disable
crio service so switching container runtime can be done.
* Added option to force apiserver and respective client certificate to be regenerated without necessarily needing to bump the K8S cluster version
* Removed extra blank line
Handlers with the same name (Kubeadm | restart kubelet) leads to incorrect playbook execution. As a result, after completing the tasks, kubelet does not restart. This PR fix this behavior
After upgrading to newer Kubernetes(v1.17 at least), kubectl command
shows the following warning message:
WARNING: Kubernetes configuration file is group-readable.
This is insecure. Location: /home/foo/.kube/config
The kubeconfig was copied from {{ artifacts_dir }}/admin.conf with
kubeconfig_localhost feature. It is better to set valid file mode
at getting it on Kubespray.
The label triage/support has been reclassified as kind/support. The
kind/* family of labels makes more logical sense, as they describe the
"kind" of thing an issue or PR is.
For more information, see the announcement email:
https://groups.google.com/g/kubernetes-dev/c/YcaJpsjjLKw/m/i15cLLx5CAAJ
If the `mitogen.yml` playbook is run, it installs Mitogen in this path, causing Git to believe there to 500+ changes. This simply excludes that external module from git
The 0d0cc8cf9c change creates several
DaemonSets to cover the Flannel CNI installation for different CPU
architectures. This change removes the unnecessary architecture value
from the docker tag value.
Signed-off-by: Victor Morales <v.morales@samsung.com>
In case multiple nodeselectors are specified in ingress_nginx_nodeselector, the generated daemonset yaml template for nginx is invalid due to missing indentation starting with the second nodeselector
When stopping at the check of "Stop if ip var does not match local ips"
the error message is like:
fatal: [single-k8s]: FAILED! => {
"assertion": "ip in ansible_all_ipv4_addresses",
"changed": false,
"evaluated_to": false,
"msg": "Assertion failed"
}
That doesn't contain actual IP addresses and it is difficult to understand
what was wrong. This adds the error message which contain actual IP addresses
to investigate the issue if happens.
* calico: add constant calico_min_version_required
and verify current deployed version against it.
* calico: remove upgrade support with data migration
The tool was used pre v3.0.0 and is no longer needed.
* calico: remove old version support from tasks
* calico: remove old ver support from policy ctrl
* calico: remove old ver support from node
* canal: remove old ver support
* remove unused calicoctl download checksums
calico_min_version_required is the oldest version that can be installed
Older versions can be removed.
* Add retries to update calico-rr data in etcd through calicoctl
* Update update-node yaml syntax
* Add comment to clarify ansible block loop
* Remove trailing space
* Fix reserved memory unit in kubelet configuration
Signed-off-by: Wang Zhen <lazybetrayer@gmail.com>
* Move systemReserved default values from template
Signed-off-by: Wang Zhen <lazybetrayer@gmail.com>
* Added ability to set calico vxlan vni and port. defaults to calico's documented defaults.
* Check if calico_network_backend is defined prior to checking value
* Removed calico hidden defaults for vxlan port and vni
* Fixed FELIX_VXLANVNI typo
I kept seeing `TLS handshake error from 10.250.250.158:63770: EOF` from two IP addresses that correlate to my ELB. Changing the health check from TCP to HTTPS stopped the errors from being generated.
* Added support for setting tiller_service_account and tiller_replicas
* Specify helm 2 version to ensure we have a test path that still hits helm 2 code
* Moved tiller_service_account to defaults.yml. Fixed is tiller_replicas defined check.
It was documented as if it were an Ansible variable, but it is a Terraform variable.
This also means the colon syntax was incorrect. TF variables are assigned with an equals sign.
Co-authored-by: rptaylor <rptaylor@uvic.ca>
* Make metallb image repos configurable
* Moved metallb image repo definitions to download role defaults
* Removed comment. These are set in download defaults
After host reboot kubelet and crio goes into a loop and no container is started.
storage_driver in crio.conf overrides system defaults in etc/containers/storage.conf
/etc/containers/storage.conf is installed by package containers-common dependency
installed from cri-o (centos7) and contains "overlay".
Hosts already configured with overlay2 should be reconfigured and the
/var/lib/containers content removed.
* Add comment from roles/kubespray-defaults/defaults/main.yaml clarifying network allocation and sizes
Signed-off-by: Mikael Johansson <mik.json@gmail.com>
* Rewrite of the comment and added new examples
Signed-off-by: Mikael Johansson <mik.json@gmail.com>
* remove podman cni plugin
* configure networkamanger global dns
* allow installation of python3-libselinux by disabling update repo temporary
* remove ipv4 section because it is not a valid configuration
Removes these startup warnings:
Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead
Using "/var/run/crio/crio.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/crio/crio.sock".
* add snapshot-controller and v1beta1 snapshot api
* fix typo
* udpate manifest to v1beta1
* update
* update manifests
* fix spelling
* wait until crd is applied
* fix missing info in kube module
* revert snapshotclass
* add snapshot crds before applying the csi driver
* add crds, missed them in last commit
* use pull policy from kubespray
* Update CustomResourceDefinition for kubecontrollersconfigurations.crd.projectcalico.org to v1
* Align ClusterRole for kube-controllers with upstream (calico)
* Add support for openstack application credentials
* Add some lines for readability
* Update external_openstack_tenant_id check
Do not check external_openstack_tenant_id when application credentials are defined
* Add check for external_openstack_domain_id
* Fix typo
By default do not allow "unqualified" (without a registry) images
because it is considered unsecure and subject to mitm attacks.
To enable insecure pull configure for example:
crio_registries:
- "docker.io"
- "quay.io"
* Use proper openssl command to differentiate between host and ip in current certificate check
* fixup! Use proper openssl command to differentiate between host and ip in current certificate check
The bootstrap-os role uses a bootstrap script to provision a
python interpreter on flatcar and container os hosts. As the
pypy project switched to another hoster, the download url changed.
If applied this will use the new proper pypy download url in bootstrap script
* Update the cilium svc proxy test to HA mode
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* Fix cilium strict kube-proxy in HA
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* Add a single global endpoint variable
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* Add cilium docs about kube-proxy replacement
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
* Fix issues in docs
Signed-off-by: Arthur Outhenin-Chalandre <arthur@cri.epita.fr>
Nit alert. Sample inventory throws an error when processed
by yamllint. The default line is currently commented out.
However, when uncommenting it our linters fail.
* Option for MetalLB to talk BGP
* Check for BGP peers when metallb_protocol is bgp
* README clarification
* Commented values as documentation only in the sample inventory
* layer 2 or BGP, not both
* log level by default increased to 'info'
* cgroup manager by default set to 'systemd'
* stream port (used by kubelet) bound to 127.0.0.1 for security reasons
* metrics can be enabled and port specified
To make it less confusing for users who uncommented whole block of
local path provisioner [1] the samples should point at least to
version 0.0.3 which supports helper image [2] configured by
local_path_provisioner_helper_image_repo variable. As 0.0.3 is a bit old
samples could point to current newest release 0.0.14.
[1] 45a177e2a0 (commitcomment-38625688)
[2] 315d67fa8c
Trying to layer this package on Fedora 32 causes the install to crash
and furthermore it looks like the original bug linked to in the comment
has been resolved for Fedora 31
* Update calico_veth_mtu to FELIX_IPINIP variable
calico_veth_mtu is specified in the configuration, but since it only works for wireguard, modify it to work for IP-in-IP users.
* Update template with more cleaner expression
* Add additional metadata configuration option to external Openstack CCM (kubernetes-sigs#6338)
* Set the variable external_openstack_metadata_search_order undefined by default
* Fix kubelet cgroup driver detection for crio
Remove fact standalone_kubelet since it is not used
* Fix yamllint complaints of roles/kubernetes/node/tasks/facts.yml
Co-authored-by: Hans Feldt <hafe@users.noreply.github.com>
This changes MetalLB contrib to one of addons for deploying MetalLB with
Kubernetes cluster deployment. By the default, Kubespray doesn't deploy
MetalLB addon.
inventory_builder creates hosts.yaml file with hostnames like "node1",
"node2", etc. Even if specifying override_system_hostname=false, the
output of "kubectl get nodes" shows those hostnames ("node1", etc.)
without using actual hostnames.
To solve this issue, this adds an option USE_REAL_HOSTNAME to get
actual hostnames when creating hosts.yaml file instead of "node1", etc.
Support for Ambassador OSS as an Ingress Controller when
settings `ingress_ambassador_enabled: true`.
Signed-off-by: Alvaro Saurin <alvaro.saurin@gmail.com>
Now the in-tree cloud provider is deprecated and it is recommended to
the external cloud provider for OpenStack instead.
The doc described how to upgrade from the in-tree cloud provider, but
it is better to describe how to deploy the external cloud provider from
scratch instead for current situation.
This updates the OpenStack doc for this usecase.
* Install Kata Containers as additional container runtime
* Create RuntimeClasses for Kata Containers
* Updated Vagrant to optionally run without Docker as container manager
* Updated Vagrant to optionally use Libvirt nested virtualization
* Add Kata Containers documentation
* Fix lint errors
* Add kata_containers_enabled to kubespray-defaults
* Fixed typo error
* Fixed typo error
We need to specify either external_openstack_tenant_name or
external_openstack_tenant_id. Those values were checked by seeing they
are defined or they have actual values separately.
However those values are always defined because of the following code
of openstack/defaults/main.yml:
external_openstack_tenant_id: "{{ lookup('env','OS_TENANT_ID')| default(lookup('env','OS_PROJECT_ID'),true) }}"
external_openstack_tenant_name: "{{ lookup('env','OS_TENANT_NAME')| default(lookup('env','OS_PROJECT_NAME'),true) }}"
So even if not specifying both values, those checks could not detect
the misconfiguration. This fixes this to detect the misconfiguration.
* MINOR: Check kernel version before enable modprobe nf_conntrack
* CLEANUP: no more need to ignore error of this task
* MINOR: Fixing yaml and ansible lint error - remove trailling-space
If the special parameter "$@" is not quoted, the following command will not work:
./kubectl.sh patch storageclass my-storage-class -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
* Add oraclelinux8 and disable firewalld
Add oraclelinux8 image and disable firewalld on oraclelinux VMs
* Fix Oracle Linux repositories
As documented in: http://yum.oracle.com/getting-started.html#installing-software-from-oracle-linux-yum-server
public-yum-ol7.repo was deprecated on release 7.6. Some repos were integrated into oracle-linux-ol7.repo (i.e.: ol7_latest, ol7_addons) and other are available as packages (epel). This also adds support for oraclelinux8
* Fix to use ansible_distribution_version
Instead of ansible_distribution_major_version
* Update README.md
On OpenStack history, we used to call "tenant" for separeted namespace.
However we use "project" now instead.
Then we have replaced "tenant" with "project". Then all "TENANT" variables
also are renamed to "PROJECT".
This makes Kubespray search "PROJECT" variable also for newer OpenStack
clouds.
with the Python ruamel.yml library
- Change True/False to true/false in a few places so file can
be more easily round-tripped with the Python ruamel.yml library
flannel, ovn and multus network plugins did not support all taint keys. This
update changes the tolerations to support them all.
According to the documentation:
```
There are two special cases: An empty key with operator Exists matches all keys,
values and effects which means this will tolerate everything. An empty effect matches
all effects with key key.
```
Usage of the empty `key` and `effect` ensures the network plugin daemonset will
be deployed on every nodes (ex: in case of custom taints, or NoExecute effect)
Since MetalLB v0.8[1], metallb:speaker has started publishing an event
nodeAssigned on k8s resource.
To support MetalLB v0.8+, this allows metallb:speaker to create events.
[1]: 5cc6e23776 (diff-60053ad6fecb5a3cfabb6f3d9e720899R246)
Since weave 2.5.1, `NoExecute` taint effect is no more supported,
this changes the daemonset tolerations to change this behavior.
Also remove the toleration key `CriticalAddonsOnly` not required anymore.
If running MetalLB v0.7.3 on k8s v1.18.2, metallb pods output the
following parsing error of v1.ServiceList:
$ kubectl logs controller-dbb46cf84-fw8h8 -n metallb-system
{
"caller":"reflector.go:205",
"level":"error",
"msg":"go.universe.tf/metallb/internal/k8s/k8s.go:231:
Failed to list *v1.Service: v1.ServiceList:
Items: []v1.Service: v1.Service: ObjectMeta:
v1.ObjectMeta: readObjectFieldAsBytes:
expect : after object field, parsing 1605
Then an external IP address is never allocated to the Service of
LoadBalancer type.
By updating MetalLB version to the latest v0.9[1] today, this issue
can be solved.
[1]: https://hub.docker.com/r/metallb/controller/tags
* fix(kubelet): exec notify restart kubelet service when kube-config.yml changed
* Revert "refactor(kubelet handler): change task name("reload kubelet") this is misleading"
This reverts commit 8f5d29560802c7c997293adb1ce9f84d3b20b6cb.
* fix(handlers,kubelet): setting right notify task name
* update documentation to add and remove nodes
* add information about parameters to change when adding multiple etcd nodes
* add information about reset_nodes
* add documentation about adding existing nodes to ectd masters.
* Add additional network configuration options to external Openstack CCM (#6083)
* Change the default version of external openstack cloud controller image to v1.18.1 since there was an issue in v1.18.0 where some IPs of the private network were ignored
* Change Network section in external-openstack-cloud-config.j2 to Networking
* Add networking customization information in the openstack documentation
This updates MetalLB README as following
- Remove unnecessary markdown to read it easily on github
- Make words consistency (kubernetes, loadbalancer)
- Add change-required option
Due to lack of requirements installation on Azure README, the error
can happen:
"The ipaddr filter requires python's netaddr be installed on the
ansible controller"
It is nice to add the installation for Azure users.
The 98e7a07fba commit udpates the
dashboard version to 2.0.0 but it enable skip login flag wasn't
updated. This change updates its identation to avoid issues when
dashboard_skip_login is enabled.
apply-rg.sh was for Azure command version 1("azure" command) and the
command is old and version 2("az" command) is officially used today.
apply-rg_2.sh was for the version 2. In addition, the README[1] says
we need to run apply-rg.sh for applying templates.
This renames apply-rg_2.sh to apply-rg.sh for common usages of the
version 2.
[1]: https://github.com/kubernetes-sigs/kubespray/tree/master/contrib/azurerm#generating-and-applying
about: Support request or question relating to Kubespray
labels: triage/support
---
<!--
STOP -- PLEASE READ!
GitHub is not the right place for support requests.
If you're looking for help, check [Stack Overflow](https://stackoverflow.com/questions/tagged/kubespray) and the [troubleshooting guide](https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/).
You can also post your question on the [Kubernetes Slack](http://slack.k8s.io/) or the [Discuss Kubernetes](https://discuss.kubernetes.io/) forum.
If the matter is security related, please disclose it privately via https://kubernetes.io/security/.
It is recommended to use filter to manage the GitHub email notification, see [examples for setting filters to Kubernetes Github notifications](https://github.com/kubernetes/community/blob/master/communication/best-practices.md#examples-for-setting-filters-to-kubernetes-github-notifications)
To install development dependencies you can use`pip install -r tests/requirements.txt`
To install development dependencies you can set up a python virtual env with the necessary dependencies:
```ShellSession
virtualenv venv
source venv/bin/activate
pip install -r tests/requirements.txt
ansible-galaxy install -r tests/requirements.yml
```
#### Linting
Kubespray uses `yamllint` and `ansible-lint`. To run them locally use `yamllint .` and `./tests/scripts/ansible-lint.sh`
Kubespray uses [pre-commit](https://pre-commit.com) hook configuration to run several linters, please install this tool and use it to run validation tests before submitting a PR.
```ShellSession
pre-commit install
pre-commit run -a # To run pre-commit hook on all files in the repository, even if they were not modified
```
#### Molecule
@@ -27,5 +39,9 @@ Vagrant with VirtualBox or libvirt driver helps you to quickly spin test cluster
1. Submit an issue describing your proposed change to the repo in question.
2. The [repo owners](OWNERS) will respond to your issue promptly.
3. Fork the desired repo, develop and test your code changes.
4.Sign the CNCF CLA (<https://git.k8s.io/community/CLA.md#the-contributor-license-agreement>)
5.Submit a pull request.
4.Install [pre-commit](https://pre-commit.com) and install it in your development repo.
5.Address any pre-commit validation failures.
6. Sign the CNCF CLA (<https://git.k8s.io/community/CLA.md#the-contributor-license-agreement>)
7. Submit a pull request.
8. Work with the reviewers on their suggestions.
9. Ensure to rebase to the HEAD of your target branch and squash un-necessary commits (<https://blog.carbonfive.com/always-squash-and-rebase-your-git-commits/>) before final merger of your contribution.
If you have questions, check the documentation at [kubespray.io](https://kubespray.io) and join us on the [kubernetes slack](https://kubernetes.slack.com), channel **\#kubespray**.
You can get your invite [here](http://slack.k8s.io/)
- Can be deployed on **AWS, GCE, Azure, OpenStack, vSphere, Packet (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- Can be deployed on **[AWS](docs/cloud_providers/aws.md), GCE, [Azure](docs/cloud_providers/azure.md), [OpenStack](docs/cloud_controllers/openstack.md), [vSphere](docs/cloud_controllers/vsphere.md), [Equinix Metal](docs/cloud_providers/equinix-metal.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- **Highly available** cluster
- **Composable** (Choice of the network plugin for instance)
- Supports most popular **Linux distributions**
@@ -13,187 +13,198 @@ You can get your invite [here](http://slack.k8s.io/)
## Quick Start
To deploy the cluster you can use :
Below are several ways to use Kubespray to deploy a Kubernetes cluster.
### Docker
Ensure you have installed Docker then
```ShellSession
docker run --rm -it --mount type=bind,source="$(pwd)"/inventory/sample,dst=/inventory \
Note: When Ansible is already installed via system packages on the control machine, other python packages installed via `sudo pip install -r requirements.txt` will go to a different directory tree (e.g. `/usr/local/lib/python2.7/dist-packages` on Ubuntu) from Ansible's (e.g. `/usr/lib/python2.7/dist-packages/ansible` still on Ubuntu).
As a consequence, `ansible-playbook` command will fail with:
```raw
ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.
```
probably pointing on a task depending on a module present in requirements.txt (i.e. "unseal vault").
One way of solving this would be to uninstall the Ansible package and then, to install it via pip but it is not always possible.
A workaround consists of setting `ANSIBLE_LIBRARY` and `ANSIBLE_MODULE_UTILS` environment variables respectively to the `ansible/modules` and `ansible/module_utils` subdirectories of pip packages installation location, which can be found in the Location field of the output of `pip show [package]` before executing `ansible-playbook`.
See [here](docs/ansible/ansible_collection.md) if you wish to use this repository as an Ansible collection
### Vagrant
For Vagrant we need to install python dependencies for provisioning tasks.
Check if Python and pip are installed:
For Vagrant we need to install Python dependencies for provisioning tasks.
Check that ``Python`` and ``pip`` are installed:
```ShellSession
python -V && pip -V
```
If this returns the version of the software, you're good to go. If not, download and install Python from here <https://www.python.org/downloads/source/>
Install the necessary requirements
Install Ansible according to [Ansible installation guide](/docs/ansible/ansible.md#installing-ansible)
then run the following step:
```ShellSession
sudo pip install -r requirements.txt
vagrant up
```
## Documents
- [Requirements](#requirements)
- [Kubespray vs ...](docs/comparisons.md)
- [Getting started](docs/getting-started.md)
- [Ansible inventory and tags](docs/ansible.md)
- [Integration with existing ansible repo](docs/integration.md)
- [Deployment data variables](docs/vars.md)
- [DNS stack](docs/dns-stack.md)
- [HA mode](docs/ha-mode.md)
- [Kubespray vs ...](docs/getting_started/comparisons.md)
Note: The list of validated [docker versions](https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1.16.md) was updated to 1.13.1, 17.03, 17.06, 17.09, 18.06, 18.09. kubeadm now properly recognizes Docker 18.09.0 and newer, but still treats 18.06 as the default supported version. The kubelet might break on docker's non-standard version numbering (it no longer uses semantic versioning). To ensure auto-updates don't break your cluster look into e.g. yum versionlock plugin or apt pin).
- The cri-o version should be aligned with the respective kubernetes version (i.e. kube_version=1.20.x, crio_version=1.20)
## Requirements
- **Minimum required version of Kubernetes is v1.15**
- **Ansible v2.9+, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](https://github.com/kubernetes-sigs/kubespray/blob/master/docs/downloads.md#offline-environment))
- **Minimum required version of Kubernetes is v1.30**
- **Ansible v2.14+, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/operations/offline-environment.md))
- The target servers are configured to allow **IPv4 forwarding**.
- **Your ssh key must be copied** to all the servers part of your inventory.
- If using IPv6 for pods and services, the target servers are configured to allow **IPv6 forwarding**.
- The **firewalls are not managed**, you'll need to implement your own rules the way you used to.
in order to avoid any issue during deployment you should disable your firewall.
- If kubespray is ran from non-root user account, correct privilege escalation method
- If kubespray is run from non-root user account, correct privilege escalation method
should be configured in the target servers. Then the `ansible_become` flag
or command parameters `--become or -b` should be specified.
Hardware:
These limits are safeguarded by Kubespray. Actual requirements for your workload can differ. For a sizing guide go to the [Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/#size-of-master-and-master-components) guide.
These limits are safeguarded by Kubespray. Actual requirements for your workload can differ. For a sizing guide go to the [Building Large Clusters](https://kubernetes.io/docs/setup/cluster-large/#size-of-master-and-master-components) guide.
- Master
- Memory: 1500 MB
- Node
- Memory: 1024 MB
- Control Plane
- Memory: 2 GB
- Worker Node
- Memory: 1 GB
## Network Plugins
You can choose between 10 network plugins. (default: `calico`, except Vagrant uses `flannel`)
You can choose among ten network plugins. (default: `calico`, except Vagrant uses `flannel`)
- [Calico](https://docs.projectcalico.org/latest/introduction/) is a networking and network policy provider. Calico supports a flexible set of networking options
- [Calico](https://docs.tigera.io/calico/latest/about/) is a networking and network policy provider. Calico supports a flexible set of networking options
designed to give you the most efficient networking across a range of situations, including non-overlay
and overlay networks, with or without BGP. Calico uses the same engine to enforce network policy for hosts,
pods, and (if using Istio and Envoy) applications at the service mesh layer.
- [canal](https://github.com/projectcalico/canal): a composition of calico and flannel plugins.
- [cilium](http://docs.cilium.io/en/latest/): layer 3/4 networking (as well as layer 7 to protect and secure application protocols), supports dynamic insertion of BPF bytecode into the Linux kernel to implement security services, networking and visibility logic.
- [contiv](docs/contiv.md): supports vlan, vxlan, bgp and Cisco SDN networking. This plugin is able to
apply firewall policies, segregate containers in multiple network and bridging pods onto physical networks.
- [kube-ovn](docs/CNI/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [weave](docs/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
(Please refer to `weave` [troubleshooting documentation](https://www.weave.works/docs/net/latest/troubleshooting/)).
- [kube-ovn](docs/kube-ovn.md): Kube-OVN integrates the OVN-based Network Virtualization with Kubernetes. It offers an advanced Container Network Fabric for Enterprises.
- [kube-router](docs/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
- [kube-router](docs/CNI/kube-router.md): Kube-router is a L3 CNI for Kubernetes networking aiming to provide operational
simplicity and high performance: it uses IPVS to provide Kube Services Proxy (if setup to replace kube-proxy),
iptables for network policies, and BGP for ods L3 networking (with optionally BGP peering with out-of-cluster BGP peers).
It can also optionally advertise routes to Kubernetes cluster Pods CIDRs, ClusterIPs, ExternalIPs and LoadBalancerIPs.
- [macvlan](docs/macvlan.md): Macvlan is a Linux network driver. Pods have their own unique Mac and Ip address, connected directly the physical (layer 2) network.
- [macvlan](docs/CNI/macvlan.md): Macvlan is a Linux network driver. Pods have their own unique Mac and Ip address, connected directly the physical (layer 2) network.
- [multus](docs/multus.md): Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface Multus delegates CNI calls to secondary CNI plugins such as Calico, macvlan, etc.
- [multus](docs/CNI/multus.md): Multus is a meta CNI plugin that provides multiple network interface support to pods. For each interface Multus delegates CNI calls to secondary CNI plugins such as Calico, macvlan, etc.
The choice is defined with the variable `kube_network_plugin`. There is also an
- [custom_cni](roles/network-plugin/custom_cni/) : You can specify some manifests that will be applied to the clusters to bring you own CNI and use non-supported ones by Kubespray.
See `tests/files/custom_cni/README.md` and `tests/files/custom_cni/values.yaml`for an example with a CNI provided by a Helm Chart.
The network plugin to use is defined by the variable `kube_network_plugin`. There is also an
option to leverage built-in cloud provider networking instead.
See also [Network checker](docs/netcheck.md).
See also [Network checker](docs/advanced/netcheck.md).
## Ingress Plugins
- [nginx](https://kubernetes.github.io/ingress-nginx): the NGINX Ingress Controller.
- [metallb](docs/ingress/metallb.md): the MetalLB bare-metal service LoadBalancer provider.
## Community docs and resources
@@ -206,10 +217,12 @@ See also [Network checker](docs/netcheck.md).
The Kubespray Project is released on an as-needed basis. The process is as follows:
1. An issue is proposing a new release with a changelog since the last release
2. At least one of the [approvers](OWNERS_ALIASES) must approve this release
3. The `kube_version_min_required` variable is set to `n-1`
4. Remove hashes for [EOL versions](https://github.com/kubernetes/sig-release/blob/master/releases/patch-releases.md) of kubernetes from `*_checksums` variables.
5.An approver creates [new release in GitHub](https://github.com/kubernetes-sigs/kubespray/releases/new) using a version and tag name like `vX.Y.Z` and attaching the release notes
6. An approver creates a release branch in the form `release-X.Y`
7.The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) docker images are built and tagged
8.The `KUBESPRAY_VERSION` variable is updated in `.gitlab-ci.yml`
9.The release issue is closed
10.An announcement email is sent to `kubernetes-dev@googlegroups.com` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
11. The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
1. An issue is proposing a new release with a changelog since the last release. Please see [a good sample issue](https://github.com/kubernetes-sigs/kubespray/issues/8325)
1. At least one of the [approvers](OWNERS_ALIASES) must approve this release
1. (Only for major releases) The `kube_version_min_required` variable is set to `n-1`
1. (Only for major releases) Remove hashes for [EOL versions](https://github.com/kubernetes/website/blob/main/content/en/releases/patch-releases.md) of kubernetes from `*_checksums` variables.
1.Create the release note with [Kubernetes Release Notes Generator](https://github.com/kubernetes/release/blob/master/cmd/release-notes/README.md). See the following `Release note creation` section for the details.
1. An approver creates [new release in GitHub](https://github.com/kubernetes-sigs/kubespray/releases/new) using a version and tag name like `vX.Y.Z` and attaching the release notes
1.(Only for major releases) An approver creates a release branch in the form `release-X.Y`
1.(For major releases) On the `master` branch: bump the version in `galaxy.yml` to the next expected major release (X.y.0 with y = Y + 1), make a Pull Request.
1.(For minor releases) On the `release-X.Y` branch: bump the version in `galaxy.yml` to the next expected minor release (X.Y.z with z = Z + 1), make a Pull Request.
1.The corresponding version of [quay.io/kubespray/kubespray:vX.Y.Z](https://quay.io/repository/kubespray/kubespray) and [quay.io/kubespray/vagrant:vX.Y.Z](https://quay.io/repository/kubespray/vagrant) container images are built and tagged. See the following `Container image creation` section for the details.
1. The release issue is closed
1. An announcement email is sent to `dev@kubernetes.io` with the subject `[ANNOUNCE] Kubespray $VERSION is released`
1. The topic of the #kubespray channel is updated with `vX.Y.Z is released! | ...`
1. Create/Update Issue for upgradeing kubernetes and [k8s-conformance](https://github.com/cncf/k8s-conformance)
## Major/minor releases and milestones
@@ -42,7 +45,41 @@ The Kubespray Project is released on an as-needed basis. The process is as follo
* Minor releases can change components' versions, but not the major `kube_version`.
Greater `kube_version` requires a new major or minor release. For example, if Kubespray v2.0.0
is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: v3.0.6`,
is bound to `kube_version: 1.4.x`, `calico_version: 0.22.0`, `etcd_version: 3.0.6`,
then Kubespray v2.1.0 may be bound to only minor changes to `kube_version`, like v1.5.1
and *any* changes to other components, like etcd v4, or calico 1.2.3.
And Kubespray v3.x.x shall be bound to `kube_version: 2.x.x` respectively.
If the release note file(/tmp/kubespray-release-note) contains "### Uncategorized" pull requests, those pull requests don't have a valid kind label(`kind/feature`, etc.).
It is necessary to put a valid label on each pull request and run the above release-notes command again to get a better release note
## Container image creation
The container image `quay.io/kubespray/kubespray:vX.Y.Z` can be created from Dockerfile of the kubespray root directory:
# libvirt__ipv6_address does not work as intended, the address is obtained with the desired prefix, but auto-generated(like fd3c:b398:698:756:5054:ff:fe48:c61e/64)
# add default route for detect ansible_default_ipv6
# TODO: fix libvirt__ipv6 or use $subnet in shell
config.vm.provision"shell",inline:"ip -6 r a fd3c:b398:698:756::/64 dev eth1;ip -6 r add default via fd3c:b398:0698:0756::1 dev eth1 || true"
# Disable swap for each vm
node.vm.provision"shell",inline:"swapoff -a"
# ubuntu2004 and ubuntu2204 have IPv6 explicitly disabled. This undoes that.
MetalLB hooks into your Kubernetes cluster, and provides a network load-balancer implementation. In short, it allows you to create Kubernetes services of type “LoadBalancer” in clusters that don’t run on a cloud provider, and thus cannot simply hook into paid products to provide load-balancers.
```
This playbook aims to automate [this](https://metallb.universe.tf/concepts/layer2/). It deploys MetalLB into kubernetes and sets up a layer 2 loadbalancer.
## Install
```
Defaults can be found in contrib/metallb/roles/provision/defaults/main.yml. You can override the defaults by copying the contents of this file to somewhere in inventory/mycluster/group_vars such as inventory/mycluster/groups_vars/k8s-cluster/addons.yml and making any adjustments as required.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.