* Ensure entries for 1.23 are added for supported_versions vars
* cri-o: add support for kubernetes 1.23 but still use cri-o 1.22
* kubescheduler-config: diferentiate config versions based on kube_version
* registry: service add clusterIP, nodePort, loadBalancer support
* modify camelcase name to underscore
* Add registry service type compatibility check
* containerd: change default resolvconf_mode to host_resolvconf
* Wait for kube-apiserver to come back after pod refresh
* Handle resolv.conf gracefully
* Retain currently configured DNS entries to ensure we don't break the resolvers
* Suse uses wickedd for network management so no dhcp hooks
* Molecule: increase ansible timeout
* CI: Increase ansible timeout to 120s for Packet jobs
* Improve control plane scale flow (#13)
* Added version 1.20.10 of K8s
* Setting first_kube_control_plane to a existing one
* Setting first_kube_control_plane to a existing one
* change first_kube_master for first_kube_control_plane
* Ansible-lint changes
If trying to pull k8scsi/csi-resizer image from gcr.io, we face the error
like:
$ docker pull gcr.io/k8scsi/csi-resizer:v1.0.0
Error response from daemon: Head https://gcr.io/v2/k8scsi/csi-resizer/
manifests/v1.0.0: unknown: Project 'project:k8scsi' not found or deleted.
$
We can pull the image from quay.io instead.
This fixes the issue.
* containerd: add hashes for 1.5.8 and 1.4.12 and make 1.5.8 the new default
* containerd: make nerdctl mandatory for container_manager = containerd
* nerdctl: bump to version 0.14.0
* containerd: use nerdctl for image manipulation
* OpenSuSE: install basic nerdctl dependencies
* set ingress-nginx default terminationGracePeriodSeconds to 5 min for the drain of connection
* Add ingress_nginx_termination_grace_period_seconds at sample inventory
install containerd on centos need to binary download it
but offline.yml has no that value
binary download url default in
roles/download/defaults/main.yml:runc_download_url: "https://github.com/opencontainers/runc/releases/download/{{ runc_version }}/runc.{{ image_arch }}"
roles/download/defaults/main.yml:containerd_download_url: "https://github.com/containerd/containerd/releases/download/v{{ containerd_version }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
if i use default offlie.yml, it's error from task download files
because runc,containerd down url is none offline
i want fix this
just add 2 new line
* Docs: update CONTRIBUTING.md
* Docs: clean up outdated roadmap and point to github issues instead
* Docs: update note on kubelet_cgroup_driver
* Docs: update kata containers docs with note about cgroup driver
* Docs: note about CI specific overrides
* Defaults: replace docker with containerd as our default container_manager
* CI: Use docker for download_localhost test
* Defaults: with container_manager=containerd we need etcd_deployment_type=host
* CI: Run weave jobs with docker
* CI: Vagrant don't download_force_cache
* CI: Fix upgrade tests
* should run compatible with old settings, this means docker
* we need to run with a distro that has at least modern containerd,
this means move from debian9 to debian10 to allow `containerd_version`
to match between 2.17 and master
* add metallb auto-assign property for main IP range & update addons.yml for sample inventory
* add new line at the end of file roles\kubernetes-apps\metallb\defaults\main.yml
* set default value for matallb_auto_assign = true
* Fixes various issues in vSphere Terraform code
Provided to address various shortcomings and to fix the following
issue in upstream Kubespray:
https://github.com/kubernetes-sigs/kubespray/issues/8176
* Resolves Terraform formatting issues
* Sets default prefix to human-readable name
* Documents new default prefix in README
* Ansible: separate requirements files for supported ansible versions
* Ansible: allow using ansible 2.11
* CI: Exercise Ansible 2.9 and Ansible 2.11 in a basic AIO CI job
* CI: Allow running a reset test outside of idempotency tests and running it in stage1
* CI: move ubuntu18-calico-aio job to stage2 and relay only on ubuntu20 with the variously supported ansible versions for stage1
* CI: add capability to install collections or roles from ansible-galaxy to mitigate missing behavior in older ansible versions
* Kata-containes: Fix for ubuntu and centos sometimes kata containers fail to start because of access errors to /dev/vhost-vsock and /dev/vhost-net
* Kata-containers: use similar testing strategy as gvisor
* Kata-Containers: adjust values for 2.2.0 defaults
Make CI tests actually pass
* Kata-Containers: bump to 2.2.2 to fix sandbox_cgroup_only issue
* Limit kubectl delete node to k8s nodes
This avoids the use of `kubectl delete node` when removing etcd nodes
which are not part of the cluser (separate etcd)
* Take errors into account when deleting node
There should not be error now that we're limiting the deletion to nodes
actually in the cluster
* Retrying on error
* Ensure addon-resizer 1.8.11 only effective at arch amd64.
k8s.gcr.io/addon-resizer:1.8.11 returns the amd64 image which is not executable at arm64.
Disable addon-resizer when the platform is not amd64.
When metrics-server upgrade and use addon-resizer:2.3, then revert this
commit and `image_arch` will determine the `addon_resizer_image_tag`.
* Add metrics_server_resizer architectures check
* nginx-ingress: bump to 1.0.4
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* nginx-ingress: bump to 1.0.4 in the README
* Disable builtin ssl_session_cache solving the problem with OpenSSL consuming memory.
* Print warning only instead of error if no IngressClass permission is available.
* Containerd: download containerd from upstream instead of using distro specific packages
split runc download to separate role
make bootstrap-os role deploy container-selinux and seccomp libraries
clean up package manager provided containerd
move variables to docker role that are no longer common with containerd
* Containerd: make molecule testing more relevant
* replace ubuntu18 with ubuntu20
* add centos8 and debian11 to molecule tests
* run kubernetes/preinstall role to ensure relevancy
of test including dependency packages
* CI: adjust test scenarios for downloaded containerd
kube-bench scan outputs warning related to Calico like:
* text: "Ensure that the Container Network Interface file
permissions are set to 644 or more restrictive (Manual)"
* text: "Ensure that the Container Network Interface file
ownership is set to root:root (Manual)"
This fixes these warnings.
* netchecker: update images to 1.2.2 from Mirantis which is slightly less ancinet than the l23networks images
* Netchecker: use local etcd instead of kubernetes v1beta1 crds which are no longer suported by kube 1.22+
* Add Rocky as a known OS
* Make sure Rocky includes bootstrap-centos.yml
* Update docs with Rocky Linux
* Rocky Linux wireguard and EPEL
* Rocky Linux in the list of supported distributions
If the etcd cluster is separate and the etcd_deployment_type is "host",
there is no need for a container engine on the etcd nodes
Do not rely on a 'default(true)' filter, but define a proper default in
kubespray-defaults depending on etcd deployment method and if internal
or external etcd is used
to remove deprecation warning:
> Flag --feature-gates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag.
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
* Fix invalid link to Ansible documentation
* Fix invalid link to mitogen doc page
* Fix invalid link to calico doc page
* Fix all invalid links to doc pages
The addon-resizer container can reduce resource limits of cpu and
memory of metrics-server container in the pod, and that caused
OOMKilled.
In addition, the original metrics-server manifest doesn't contain
the addon-resizer container as [1].
So this adds metrics_server_resizer option to control the addon-resizer
container deployment and the default value is false to make it stable
for most environments.
[1]: 527679e5e8/manifests/base/deployment.yaml
"allowPrivilegeEscalation: false" blocks deploying metrics-server
on CentOS7. In addition, the original metrics-server manifest doesn't
contain it as [1]. This removes it.
[1]: 527679e5e8/manifests/base/deployment.yaml
* Kata-Containers: add 2.2.0 hashes and make default
* Kata-Containers: replace 2.1.0 with bugfix version 2.1.1
* Kata-Containers: move to q35 a more modern VM architecture as 'pc' is removed in 2.2.0
Kubespray deployment failed when using containerd backend on nodes that apparmor was not installed or previously removed. This PR ensure apparmor is installed by adding it into required_pkgs var.
The path of kubeconfig should be configurable, and its default value
is /etc/kubernetes/admin.conf. Most paths of the file are configurable
but some were not. This make those configurable.
* Calico: make calico_min_version check relevant
* Calico: only check currently installed version against the oldest supported version by the previous release
Modify connection_strings_etcd to only return etcd nodes - not master nodes - since this results in duplicate hosts in the generated Ansible inventory and is unnecessary.
On Debian 11, `ipset` just recommend `iptables` so on the system that apt is configured with `APT::Install-Recommends "0";` iptables will not install automatically.
* Fix: adding new ips with inventory builder (#7577)
* moved conflig loading logic
to after checking whether the config
should be loaded, and added check for
whether the config should be loaded
* added check for removing nodes from config
if the user wants to remove a node, we
need to load the config
* Fix tox errors
* Fix missing file mode (risky-file-permissions)
Found this using ansible-lint.
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* Fix another missing file mode (risky-file-permissions)
This one fixes `/etc/crio/config.json`
Signed-off-by: Bryan Hundven <bryanhundven@gmail.com>
* CSI: update CSI snapshot CRDs
* CSI: update snapshot controller tag version with kubernetes specific versions
* CSI: allow enabling csi_snapshot_controller independent of Cinder CSI
* CSI: Align csi-snapshot-controller with upstream and use a Deployment instead of a StatefulSet
When using Calico with:
- `calico_network_backend: vxlan`,
- `calico_ipip_mode: "Never"`,
- `calico_vxlan_mode: "Always"`,
the `FelixConfiguration` object has `ipipEnabled: true`, when it should be false:
This is caused by an error in the `| bool` conversion in the install task:
when `calico_ipip_mode` is `Never`,
`{{ calico_ipip_mode != 'Never' | bool }}` evaluates to `true`:
* Fedora and RHEL use etc_t and the convention is <type_name>_t
* Docs: specify all values for preinstall_selinux_state
* CI: Add Fedora 34 with SELinux in enforcing mode
using --no-cache-dir flag in pip install ,make sure downloaded packages
by pip don't cached on system . This is a best practice which make sure
to fetch from repo instead of using local cached one . Further , in case
of Docker Containers , by restricting caching , we can reduce image size.
In term of stats , it depends upon the number of python packages
multiplied by their respective size . e.g for heavy packages with a lot
of dependencies it reduce a lot by don't caching pip packages.
Further , more detail information can be found at
https://medium.com/sciforce/strategies-of-docker-images-optimization-2ca9cc5719b6
Signed-off-by: Pratik Raj <rajpratik71@gmail.com>
Fix task 'Cert Manager | Wait for Webhook pods become ready' failed due to webhook pods don't exist yet by using `retries..until` trick like kubernetes-sigs/kubespray#7842
This fix should be removed in the future if the kubernetes/kubernetes#83242 is resolved.
Signed-off-by: rtsp <git@rtsp.us>
The main functions are wrapped by a sys.exit function which expects and
argument. The curent implementation isn't returning values in all cases.
This change ensures main functions return a value in all cases.
Fix task 'Cert Manager | Apply ClusterIssuer manifest' failed due to service/endpoints updating delayed even though the wekhook pod status is ready.
Signed-off-by: rtsp <git@rtsp.us>
Changes:
* ClusterRole updated according to the latest manifests from
https://github.com/kubernetes/cloud-provider-vsphere
* vSphere CPI/CSI default versions bumped and
tested successfully on K8S 1.21.1
* vSphere documentation updated
Signed-off-by: Vitaliy D <vi7alya@gmail.com>
non-kubeadm mode has been removed since ddffdb63bf
2.5 years ago. The non-kubeadm makes unnecessary confusion today, then
this updates the documentation.
Previously IDs of container images were gotten from tar files of container
images but that way was wrong. If multiple json files are contained in a
tar file, the script got multiple IDs and tried to pass these IDs on
`docker tag` command. Then the command was failed.
This updates the script to get image IDs from `docker image inspect` command
to fix this issue.
In addition, this adds a check a registry container exists already or not
before deploying registry container to avoid a container conflict failure.
* CRI-O: Install libseccomp2 from backports on Debian 10
libseccomp2 is a required dependency of cri-o-runc package
The one provided in Debian 10 repositories is outdated
* 7816: Remove useless when condition
As this condition is handled by block
* fix(misc): terraform/aws
- handles deployment with a single availability zone
- handles deployment with more than two availability zone
- handles etcd collocation with control-plane nodes (`aws_etcd_num=0`)
- allows to set a bastion instances count (`aws_bastion_num`)
- allows to set bastion/etcd/control-plane/workers rootfs volume size
- removes variables from terraform.tfvars that were not re-used
- adds .terraform.lock.hcl to .gitignore
- changes/updates base image from ubuntu-18.03 to debian-10
tested by a few coworkers of mine, and myself: thanks for the outstanding
work, on both those terraform samples and kubespray playbooks.
I did not test ubuntu deployments, I could still swap from buster to
focal. LMK.
* fix(gitlab-ci)
AFAIU, terraform.tfvars indentation should be fixed for / no diff
returned running `terraform fmt -check -diff`
https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/1445622114
To download necessary files in advance for offline deployment,
we can see all file URLs with contrib/offline/generate_list.sh
Most URLs are downloadable, but gvisor's one is not because the
URL is a part of full URLs for gvisor.
To download gvisor's files from the URLs directory, this separates
into two URLs for runsc and the shim.
tf-elax_ubuntu18-calico is so flake today. The test job is failed
due to SSH connectivity check error after deploying virtual machines
which are used for Kubernetes nodes.
This allows failure on the job to see the test situation without
pull request merger failures.
When running the script, I faced the following error but it was
difficult to know the root problem due to lack of error handling.
docker tag" requires exactly 2 arguments.
See 'docker tag --help'.
Usage: docker tag SOURCE_IMAGE[:TAG] TARGET_IMAGE[:TAG]
Create a tag TARGET_IMAGE that refers to SOURCE_IMAGE
To investigate such errors easily, this adds an error handling.
* csi-driver: Added possibility to use application credentials for cinder
* external-cloud-controller: Added env vars for openstack application credentials
* set selinux type t_etc if selinux state is enforcing
* workaround with update repo is no longer needed
remove comments about failing playbook
* grubby is not available in distros using ostree
* remove docker support because removed in fcos
update install script example with live rootfs
* do not call grubby on ostree based distro
* update docs enabling containerd on fedora coreos
* Ansible: move to Ansible 3.4.0 which uses ansible-base 2.10.10
* Docs: add a note about ansible upgrade post 2.9.x
* CI: ensure ansible is removed before ansible 3.x is installed to avoid pip failures
* Ansible: use newer ansible-lint
* Fix ansible-lint 5.0.11 found issues
* syntax issues
* risky-file-permissions
* var-naming
* role-name
* molecule tests
* Mitogen: use 0.3.0rc1 which adds support for ansible 2.10+
* Pin ansible-base to 2.10.11 to get package fix on RHEL8
* terraform/openstack: Use path.root for ansible_bastion_template.txt
The path.root variable points to the root module path. Using this
instead of a relative path makes less assumptions about the current
working directory.
* terraform/openstack: Add group_vars_path variable
Previously, the group_vars path was assumed to be in CWD. The
default value for the group_vars_path variable is still relative
to CWD and thus should be backwards compatible if unset.
* Calico: align manifests with upstream
* allow enabling typha prometheus metrics
* Calico: enable eBPF support
* manage the kubernetes-services-endpoint configmap
* Calico: document the use of eBPF dataplane
* Calico: improve checks before deployment
* enforce disabling kube-proxy when using eBPF dataplane
* ensure calico_version is supported
* Kata: add Kata 2.x checksums and adjust download urls for 2.x
* Kata: drop 1.x version which is no longer supported
* Kata: set default version 2.1.0
* Packet->Equinix Metal rename #6901
Updates throughout to reflect #6901 renaming for Packet to Equinix Metal.
* Rename Packet to Equinix Metal throughout the project #6901
Packet is renamed to Equinix Metal in more contexts including
documentation links. The Terraform provider used is still the Packet
provider. The environment variables and configuration options still
refer to the Packet name.
Signed-off-by: Marques Johansson <mjohansson@equinix.com>
Co-authored-by: Edward Vielmetti <ed@packet.net>
* Calico: add v3.19.1 hashes
* enable liveness probe for calico-kube-controllers
3.19.1
* Calico: drop support for v3.16.x
* Calico: promote v3.18.3 as default
* Override the default value of containerd's root, state, and oom_score configurations
* Add tests data for containerd_storage_dir, containerd_state_dir and containerd_oom_score variables
Basically we need to make necessary TCP/UDP ports open.
However the necessary ports are so many, and sometimes it is difficult
to figure out that is due to firewall issues or not if facing deployment
issues.
To distinguish a root problem on such situation, this adds contrib
playbook to disable the service firewall for Kubespray development
and test.
* add support for using ansible 2.10.x for deploying kubespray
* move dns-autoscaler-clusterrole{binding}.yml to files/ folder
* note that ansible 2.10 is now experimentally supported
* coredns: move files to templates like before #4341
* add initial MetalLB docs
* metallb allow disabling the deployment of the metallb speaker
* calico>=3.18 allow using calico to advertise service loadbalancer IPs
* Document the use of MetalLB and Calico
* clean MetalLB docs
Since K8S 1.21, BoundServiceAccountTokenVolume feature gate is in beta stage, thus activated by default (anyone who follows CSI guidelines has enabled AllAlpha and faced the issue before 1.21).
With this feature, SA tokens are regenerated every hour.
As a consequence for Calico CNI, token in /etc/cni/net.d/calico-kubeconfig copied from /var/run/secrets/kubernetes.io/serviceaccount in install-cni initContainer expires after one hour and any pod creation fails due to unauthorization.
Calico pods need to be restarted so that /etc/cni/net.d/calico-kubeconfig is updated with the new SA token.
follow new naming conventions for gcr's coredns image.
starting from 1.21 kubeadm assumes it to be `coredns/coredns`:
this causes the kubeadm deployment being unable to pull image, beacuse `v`
was also added in image tag, until the role `kubernetes-apps` ovverides
it with the old name, which is only compatible with <=1.7.
Backward comptability with kubeadm <=1.20 is mantained checking
kubernetes version and falling back to old names (`coredns:1.xx`) when
the version is less than 1.21
* rename ansible groups to use _ instead of -
k8s-cluster -> k8s_cluster
k8s-node -> k8s_node
calico-rr -> calico_rr
no-floating -> no_floating
Note: kube-node,k8s-cluster groups in upgrade CI
need clean-up after v2.16 is tagged
* ensure old groups are mapped to the new ones
* crio: add supported versions 1.20 and 1.21 and align default with k8s version
* cri-o: drop versions 1.17 and 1.18 from version matrix
* update note on cri-o version alignment
* calico: drop support for version 3.15
* drop check for calico version >= 3.3, we are at 3.16 minimum now
* we moved to calico 3.16+ so we can default to /opt/cni/bin/install
* AlmaLinux: ansible>2.9.19 is needed to know about AlmaLinux
* AlmaLinux: identify as a centos derrivative
* AlmaLinux: add AlmaLinux to checks for CentOS
* Use ansible_os_family to compare family and not distribution
As the official document[1], the parameter keepcache should be
'0' or '1' as string. To avoid the following warning message,
this fixes the parameter value:
[WARNING]: The value False (type bool) in a string field was
converted to u'False' (type string). If this does not look
like what you expect, quote the entire value to ensure it
does not change.
https://docs.ansible.com/ansible/latest/collections/ansible/builtin/yum_repository_module.html
Context: Load-balancing in Exoscale is performed by associating many
workers with the same EIP. This works, however, the workers cannot access
themselves via the EIP, which is needed at least for cert-managers
"self-test".
Problem: The old iptables based workaround felt fragile and disappointed
me at least once.
New solution: Add the EIP to a loopback interface on each worker.
* Add containerd_extra_args
This is useful for custom containerd config, e.g. auth
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
* Make containerd config.toml mode 0640
It may contain sensitive information like password
Signed-off-by: Zhong Jianxin <azuwis@gmail.com>
This PR is to move the cilium kvstore options to the configmap
rather than specifying them in the deployment as args. This
is not technically necessary but keeping all the options in
one place is probably not a bad idea.
Tested with cilium 1.9.5.
When attempting a fresh install without cilium_ipsec_enabled I ran
into the following error:
failed: [k8m01] (item={'name': 'cilium', 'file': 'cilium-secret.yml', 'type': 'secret', 'when': 'cilium_ipsec_enabled'}) =>
{"ansible_loop_var": "item", "changed": false, "item": {"file": "cilium-secret.yml", "name": "cilium", "type": "secret",
"when": "cilium_ipsec_enabled"},"msg": "AnsibleUndefinedVariable: 'cilium_ipsec_key' is undefined"}
Moving the when condition from the item level to the task level solved
the issue.
* Add KubeSchedulerConfiguration for k8s 1.19 and up
With release of version 1.19.0 of kubernetes KubeSchedulerConfiguration
was graduated to beta. It allows to extend different stages of
scheduling with profiles. Such effect is achieved by using plugins and
extensions.
This patch adds KubeSchedulerConfiguration for versions 1.19 and later.
Configuration is set to k8s defaults or to kubespray vars. Moving those
defaults to new vars will be done in following patch.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* KubeSchedulerConfiguration: add defaults
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
* Add documentation for audit webhook variables
* Enclose the value of audit_webhook_server_url in a codeblock
* Add default value for audit_webhook_batch_max_wait
Starting with Cilium v1.9 the default ipam mode has changed to "Cluster
Scope". See:
https://docs.cilium.io/en/v1.9/concepts/networking/ipam/
With this ipam mode Cilium handles assigning subnets to nodes to use
for pod ip addresses. The default Kubespray deploy uses the Kube
Controller Manager for this (the --allocate-node-cidrs
kube-controller-manager flag is set). This makes the proper ipam mode
for kubespray using cilium v1.9+ "kubernetes".
Tested with Cilium 1.9.5.
This PR also mounts the cilium-config ConfigMap for this variable
to be read properly.
In the future we can probably remove the kvstore and kvstore-opt
Cilium Operator args since they can be in the ConfigMap. I will tackle
that after this merges.
When upgrading cilium from 1.8.8 to 1.9.5 I ran into the following
error:
level=error msg="Unable to update CRD" error="customresourcedefinitions.apiextensions.k8s.io
\"ciliumnodes.cilium.io\" is forbidden: User \"system:serviceaccount:kube-system:cilium-operator\"
cannot update resource \"customresourcedefinitions\" in API group \"apiextensions.k8s.io\" at the
cluster scope" name=CiliumNode/v2 subsys=k8s
The fix was to add the update verb to the clusterrole. I also added
create to match the clusterrole created by the cilium helm chart.
DNSSEC is off by default on ubuntu/bionic64 (18.04) as per resolved.conf(5).
These tasks are artefacts of obsolete infra configuration, and no longer needed.
Further removing these tasks resolves the issue that the tasks always reports
'changed' and bounces systemd-resolved unneccesarily, even if there was no
actual modification of /etc/systemd/resolved.conf.
* Remove contrib/vault
This is marked as broken since 2018 / 3dcb914607
This still reference apiserver.pem, not used since ddffdb63bf
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Finish nuking vault from the codebase
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
This replaces kube-master with kube_control_plane because of [1]:
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
NOTE: The reason why this changes it to kube_control_plane not
kube-control-plane is for valid group names on ansible.
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
While at it remove force_certificate_regeneration
This boolean only forced the renewal of the apiserver certs
Either manually use k8s-certs-renew.sh or set auto_renew_certificates
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add crun download_url and checksum
* Change versioning format to crun native versioning
* Download crun using download_file.yml
* Get crun version from download defaults
* Delegate crun binary copy task to crun role
* Download Calico KDD CRDs
* Replace kustomize with lineinfile and use ansible assemble module
* Replace find+lineinfile by sed in shell module to avoid nested loop
* add condition on sed
* use block for kdd tasks + remove supernumerary kdd manifest apply in start "Start Calico resources"
"The error was: 'proxy_disable_env' is undefined\n\nThe error appears to
be in '<censored>scale.yml': line 72, column 7"
Fixes 067db686f6
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
upgrades.md explains how to do upgrade from v1.4.3 to v1.4.6 as an
example. The versions are a little old, and the doc readers would
have a concern the upgrade works fine or not.
This updates versions after verifying the way works fine by hands.
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* terraform support for UpCloud
* Updates to README.md and main.tf files
* formatting and updating readme
* added a .terraform_validate CI job
* fixed format issue
* added sample inventory
* added symbolic link to group_vars
* added missing tf variables and minor fixes
* added text formatting
* minor formatting fixes
* add nodeselector and tolerations for metallb
* remove unnecessary commented lines in metallb template
* set default speaker toleration to match original manifest
When privileged is enabled for a container, all the `/dev/*` block
devices from the host are mounted into the guest. The
`privileged_without_host_devices` flag prevents host devices from
being passed to privileged containers.
More information:
* https://github.com/containerd/cri/pull/1225
* 1d0f68156b
The important action in kubeadm-version.yml is the templating of the configuration,
not finding / setting the version
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
kubeadm is the default for a long time now,
and admin.conf is created by it, so let kubeadm handle it
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Using `kubeadm init phase kubeconfig all` breaks kubelet client certificate rotation
as we are missing `kubeadm init phase kubelet-finalize all` to point to `kubelet-client-current.pem`
kubeconfig format is stable so let's just use lineinfile,
this will avoid other future breakage
This revert to the logic before 6fe2248314
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
On CentOS 8 they seem to be ignored by default, but better be extra safe
This also make it easy to exclude other network plugin interfaces
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Add terraform scripts for vSphere
* Fixup: Add terraform scripts for vSphere
* Add inventory generation
* Use machines var to provide IPs
* Add README file
* Add default.tfvars file
* Fix newlines at the end of files
* Remove master.count and worker.count variables
* Fixup cloud-init formatting
* Fixes after initial review
* Add warning about disabled DHCP
* Fixes after second review
* Add sample-inventory
* use external_openstack_lbaas_use_octavia for template openstack-cloud-config
* Delete external_openstack_lbaas_use_octavia from default values. Added description and default values of variables to docs
* markdown fix
* make this simple
* set external_openstack_lbaas_use_octavia in default values
* duplicated variable in doc
This replaces KUBE_MASTERS with KUBE_CONTROL_HOSTS because of [1]:
```
The Kubernetes project is moving away from wording that is
considered offensive. A new working group WG Naming was created
to track this work, and the word "master" was declared as offensive.
A proposal was formalized for replacing the word "master" with
"control plane". This means it should be removed from source code,
documentation, and user-facing configuration from Kubernetes and
its sub-projects.
```
[1]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cluster-lifecycle/kubeadm/2067-rename-master-label-taint/README.md#motivation
Since a790935d02 all proxy users
should be properly configured
Now when you have *_PROXY vars in your environment it can leads to failure
if NO_PROXY is not correct, or to persistent configuration changes
as seen with kubeadm in 1c5391dda7
Instead of playing constant whack-a-bug, inject empty *_PROXY vars everywhere
at the play level, and override at the task level when needed
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Before this commit, we were gathering:
1 !all
7 network
7 hardware
After we are gathering:
1 !all
1 network
1 hardware
ansible_distribution_major_version is gathered by '!all'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Move proxy_env to kubespray-defaults/defaults
There is no reasons to use set_facts here
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
* Ensure kubeadm doesn't use proxy
*_proxy variables might be present in the environment (/etc/environment, bash profile, ...)
When this is the case we end up with those proxy configuration in /etc/kubernetes/manifests/kube-*.yaml manifests
We cannot unset env variables, but kubeadm is nice enough to ignore empty vars
93d288e2a4/cmd/kubeadm/app/util/env.go (L27)
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
Ubuntu 18.04 crio package ships with 'mountopt = "nodev,metacopy=on"'
even if GA kernel is 4.15 (HWE Kernel can be more recent)
Fedora package ships without metacopy=on
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
By default Ansible stat module compute checksum, list extended attributes and find mime type
To find all stat invocations that really use one of those:
git grep -F stat. | grep -vE 'stat.(islnk|exists|lnk_source|writeable)'
Signed-off-by: Etienne Champetier <e.champetier@ateme.com>
`containerd.io` is the companion package of `docker-ce` and is the
proper package name. This is needed to avoid apt upgrade/dist-upgrade
from breaking kubernetes.
Running remove-node.yml tasks for clean up cluster on Fedora CoreOS.
The task failed to restart network daemon (task name: "reset | Restart network").
Fedora CoreOS is essentially using NetworkManager, but this task returns network.
Signed-off-by: Takashi IIGUNI <iiguni.tks@gmail.com>
* Add unique annotation on coredns deployment and only remove existing deployment if annotation is missing.
* Ignore errors when gathering coredns deployment details to handle case where it doesn't exist yet
* Remove run_once, deletegate_to and add to when statement
* Added force_etcd_cert_refresh var to maintain existing functionality. Broke out etcd node cert syncing from member and admin cert sync logic. Now first etcd will sync node certs to other etcd members on every run to keep all etcds up to date after adding additional worker nodes to the cluster
* Updated etcd cert check tasks to better detect when new certificates need to be generated
* Move usage of force_etcd_cert_refresh var to gen_certs fact set
* Force etcd cert generation per server if force_etcd_cert_refresh is set to true
* Include gathering of node certs even if k8s-cluster member and in etcd group.
* Removed run_once due to when statement
Helm v3.5.2 is a security (patch) release. Users are strongly
recommended to update to this release. It fixes two security issues in
upstream dependencies and one security issue in the Helm codebase.
See https://github.com/helm/helm/releases/tag/v3.5.2
This makes the docker role work the same as the containerd role.
Being able to override this is needed when you have your own debian
repository. E.g. when performing an airgapped installation
* contrib/terraform/exoscale: Rework SSH public keys
Exoscale has a few limitations with `exoscale_ssh_keypair` resources.
Creating several clusters with these scripts may lead to an error like:
```
Error: API error ParamError 431 (InvalidParameterValueException 4350): The key pair "lj-sc-ssh-key" already has this fingerprint
```
This patch reworks handling of SSH public keys. Specifically, we rely on
the more cloud-agnostic way of configuring SSH public keys via
`cloud-init`.
* contrib/terraform/exoscale: terraform fmt
* contrib/terraform/exoscale: Add terraform validate
* contrib/terraform/exoscale: Inline public SSH keys
The Terraform scripts need to install some SSH key, so that Kubespray
(i.e., the "Ansible part") can take over. Initially, we pointed the
Terraform scripts to `~/.ssh/id_rsa.pub`. This proved to be suboptimal:
Operators sharing responbility for a cluster risk unnecessarily replacing resources.
Therefore, it has been determined that it's best to inline the public
SSH keys. The chosen variable `ssh_public_keys` provides some uniformity
with `contrib/azurerm`.
* Fix Terraform Exoscale test
* Fix Terraform 0.14 test
* update local-path-storage config template to version v0.0.19
* changes local_path_provisioner image tag to v0.0.19
* removes copy paste example from rancher local-path-provisioner repo
According to the following recommendation, this moves the directory
to control-plane:
The Kubernetes project is moving away from wording that is considered
offensive. A new working group WG Naming was created to track this work,
and the word "master" was declared as offensive. A proposal was formalized
for replacing the word "master" with "control plane".
Fixes the following error when using Bastion Node with the sample config.
```
fatal: [bastion]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'bastion'\n\nThe error appears to be in '/home/felix/inovex/kubespray/roles/bastion-ssh-config/tasks/main.yml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: set bastion host IP\n ^ here\n"}
```
Previous check for presence of NM assumed "systemctl show
NetworkManager" would exit with a nonzero status code, which seems not
the case anymore with recent Flatcar Container Linux.
This new check also checks the activeness of network manager, as
`is-active` implies presence.
Signed-off-by Jorik Jonker <jorik@kippendief.biz>
This was introduced in 143e2272ff
Extra repo is enabled by default in CentOS, and is not the right repo for EL8
Instead of adding a CentOS repo to RHEL, enable the needed RHEL repos with rhsm_repository
For RHEL 7, we need the "extras" repo for container-selinux
For RHEL 8, we need the "appstream" repo for container-selinux, ipvsadm and socat
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Only checking the kubernetes api on the first master when upgrading is not enough.
Each master needs to be checked before it's upgrade.
Signed-off-by: Rick Haan <rickhaan94@gmail.com>
yum_repository expect really different params, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
we don't need rpm_key, so nothing to factor here
Ubuntu is not an ansible_os_family, the OS family for Ubuntu is Debian
Check for ansible_pkg_mgr == apt
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Here the desciption from Ansible docs
Corresponds to the --force-yes to apt-get and implies allow_unauthenticated: yes
This option will disable checking both the packages' signatures and the certificates of the web servers they are downloaded from.
This option *is not* the equivalent of passing the -f flag to apt-get on the command line
**This is a destructive operation with the potential to destroy your system, and it should almost never be used.** Please also see man apt-get for more information.
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
This variable was added as KUBE_MASTERS_MASTERS. That's probably a typo.
Remove the redundant `_MASTERS` suffix. Also, document the variable in the
help message.
TASK [Generate a list of information about the images on a node]
registers list of container images to docker_images.
Then the next TASK [Set pull_required if the desired image is not
yet loaded] does based on expecting images are registered.
However sometimes the first TASK was failed as [1] but the failure
is ignored due to failed_when:false and it makes another issue.
This removes this unnecessary failed_when to detect the failure
at the point.
In addition, this removes no_log:true also because the output doesn't
contain any sensitive data and now it just makes debugging difficult.
[1]: https://gitlab.com/kargo-ci/kubernetes-sigs-kubespray/-/jobs/934714534#L2953
no_proxy is a pain to get right, and having proxy variables present causes issues
(k8s components get proxy configuration after upgrade, see #7100)
It's better to only configure what require proxy:
- the runtime (containerd/docker/crio)
- the package manager + apt_key
- the download tasks
Tested with the following clusters
- 4 CentOS 8 nodes
- 1 Ubuntu 20.04 node
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
In some environments, it might not be possible to ping the IP address
of the nodes, e.g., because ICMP echo is blocked.
This commit allows kubespray to be configured to disable the ping
check, while performing all other checks.
TASK [network_plugin/calico : Calico | Configure calico network pool] **********
task path: /builds/kargo-ci/kubernetes-sigs-kubespray/roles/network_plugin/calico/tasks/install.yml:138
Friday 08 January 2021 17:10:12 +0000 (0:00:01.521) 0:11:36.885 ********
[WARNING]: The value {'kind': 'IPPool', 'apiVersion': 'projectcalico.org/v3',
'metadata': {'name': 'default-pool'}, 'spec': {'blockSize': 24, 'cidr':
'10.233.64.0/18', 'ipipMode': 'Always', 'vxlanMode': 'Never', 'natOutgoing':
True}} (type dict) in a string field was converted to "{'kind': 'IPPool',
'apiVersion': 'projectcalico.org/v3', 'metadata': {'name': 'default-pool'},
'spec': {'blockSize': 24, 'cidr': '10.233.64.0/18', 'ipipMode': 'Always',
'vxlanMode': 'Never', 'natOutgoing': True}}" (type string). If this does not
look like what you expect, quote the entire value to ensure it does not change.
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
This fixes the following failures:
./contrib/offline/README.md:14:1 MD014/commands-show-output Dollar signs used before commands without showing output [Context: "$ ./manage-offline-container-i..."]
./contrib/offline/README.md:20:1 MD014/commands-show-output Dollar signs used before commands without showing output [Context: "$ ./manage-offline-container-i..."]
* Remove because of empty need_http_proxy.rc if http/https_proxy and skip_http_proxy_on_os_packages=true is set
* Modify sample for debian and centos skip_http_proxy
* Modify sample for debian and centos skip_http_proxy
If some settings were changed from the default but not commited into an inventory repo,
we risk breaking the cluster / cause downtime, so add some extra checks
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Upgrading docker / containerd without adapting the configuration might break the node,
so disable docker-ce repo by default.
We are already using dpkg hold for Debian.
All containerd.io packages provide /usr/bin/runc, so no need to check
yum_conf was never used for containerd
module_hotfixes should not be needed with the EL8 repo
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* [terraform/aws] Fix Terraform >=0.13 warnings
Terraform >=0.13 gives the following warning:
```
Warning: Interpolation-only expressions are deprecated
```
The fix was tested as follows:
```
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
which gave no errors nor warnings.
* [terraform/openstack] Fixes for Terraform >=0.13
Terraform >=0.13 gives the following error:
```
Error: Failed to install providers
Could not find required providers, but found possible alternatives:
hashicorp/openstack -> terraform-provider-openstack/openstack
```
This patch fixes these errors.
This fix was tested as follows:
```
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
which gave no errors nor warnings for Terraform 0.13.5 and Terraform
0.14.3. Unfortunately, 0.12.x gives a harmless warning, but
with 0.14.3 out the door, I guess we need to move on.
* [terraform/packet] Fixes for Terraform >=0.13
This fix was tested as follows:
```
export PACKET_AUTH_TOKEN=blah-blah
rm -rf .terraform && terraform0.12.26 init && terraform0.12.26 validate
rm -rf .terraform && terraform0.13.5 init && terraform0.13.5 validate
rm -rf .terraform && terraform0.14.3 init && terraform0.14.3 validate
```
Errors are gone, but warnings still remain. It is impossible to please
all three versions of Terraform.
* Add tests for Terraform >=0.13
* Fedora CoreOS: Fix for ethtool pre-installed
Fix error in rpm-ostree when ethtool is already insatlled (FCOS >= 32.20201104.3.0)
* Fedora CoreOS: Fix connection lost
Fedora CoreOS: Ignore connection lost due to reboot and continues the playbook
Now markdownlint covers ./README.md and md files under ./docs only.
However we have a lot of md files under different directories also.
This enables markdownlint for other md files also.
We are currently setting the IP variable to hostIP,
Before https://github.com/projectcalico/node/pull/593 (not yet released)
Calico interpret that as hostIP/32
Using 'can-reach' we get the future behavior
This fixes vxlan and IPIP CrossSubnet modes
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
Just after creating a namespace, the corresponding token could not be
created and sometimes the pod creation might be failed.
This adds check of the token in the new namespace to make this test
case stable.
* update files to handle multi-asn bgp peering conditions.
* put back in the serviceClusterIPs. Bad merge.
* remove extraneous environment var.
* update files as discussed with mirwan
* update titles.
* add not in.
* add a conditional for using bgp to advertise cluster ips.
Co-authored-by: marlow-h <mweston@habana.ai>
If cluster-name is not set, the default value "kubernetes" is used.
The loadbalancees created by Kubernetes follow the format:
kube_service_clusterName_serviceNamespace_serviceName
If 2 clusters create a loadbalancer for the same service in the same
namespace, they will share the same non-working loadbalancer.
Signed-off-by: Cedric Hnyda <cedric.hnyda@itera.io>
tests/scripts/ansible-lint.sh was written on the doc, but there was
not such file actually. We can use ansible-lint command to check
ansible yml files without any options.
This updates to use the command.
If a branch name contains '.sh', current shellcheck checks the branch
file under .git/ and outputs error because the format is not shell
script one.
This makes shellcheck exclude files under .git/ to avoid this issue.
* Update hashes and set default version to 1.19.5
Signed-off-by: anthr76 <hello@anthonyrabbito.com>
* Reorder hashes
1.19.5 hashes should be near 1.19.x
* Added back blank line
This fixes the following warning:
[kubernetes/client : Generate admin kubeconfig with external api endpoint]
[WARNING]: Consider using the file module with state=directory rather than
running 'mkdir'. If you need to use command because file is insufficient
you can
RedHat 8.3 merged nf_conntrack_ipv4 in nf_conntrack but still advertise 4.18
so just try to modprobe and decide depending on the success
Also nf_conntrack is a dependency of ip_vs, so no need to care about it
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* Ensure libseccomp is installed before starting containerd on CentOS 8
* Simplify libseccomp install on CentOS 8
- Uses `package` module
- Replaces complex version check with 'state: latest'. The version must
be > 2.3 when using with cri-o.
- Removes unnecessary `not is_ostree` condition as CentOS 8 does not use
ostree
* copying ssh key no longer required, works with password auth
* use copy module instead of synchronize (which requires sshpass)
* less tasks and always changed tasks
* containerd docker hub registry mirror support
* add docs
* fix typo
* fix yamllint
* fix indent in sample
and ansible-playbook param in testcases_run
* fix md
* mv common vars to tests/common/_docker_hub_registry_mirror.yml
* checkout vars to upgrade tests
If crictl (and docker) binaries are deployed to the directories
that are not in standard PATH (e.g. /usr/local/bin), it is required
to specify full path to the binaries.
The task outputs the following warning:
TASK [kubernetes/preinstall : Enable ip forwarding]
[WARNING]: The value 1 (type int) in a string field was converted
to u'1' (type string). If this does not look like what you expect,
quote the entire value to ensure it does not change.
This new version uses the same base image as kube-proxy
(k8s.gcr.io/build-image/debian-iptables)
This allow to automatically pick iptables-legacy or iptables-nft,
and be compatible with RHEL/CentOS 8
https://github.com/kubernetes/dns/pull/367
Signed-off-by: Etienne Champetier <champetier.etienne@gmail.com>
* fix flake8 errors in Kubespray CI - tox-inventory-builder
* Invalidate CRI-O kubic repo's cache
Signed-off-by: Victor Morales <v.morales@samsung.com>
* add support to configure pkg install retries
and use in CI job tf-ovh_ubuntu18-calico (due to it failing often)
* Switch Calico, Cilium and MetalLB image repos to Quay.io
Co-authored-by: Victor Morales <v.morales@samsung.com>
Co-authored-by: Barry Melbourne <9964974+bmelbourne@users.noreply.github.com>
calico PODs are first started and then in a handler killed and
restarted for no reason, nothing has changed.
By using the existing variable 'calico_cni_config' (only defined when
calico has already started) the restart can be skipped.
* create a wrapper script with pki options
* supports all kubespray managed container engines
Co-authored-by: Hans Feldt <hafe@users.noreply.github.com>
* Allow the eventRecordQPS setting to be set.
The eventRecordQPS parameter controls rate limiting for event recording. When zero, unlimited events can cause denial-of-service situations. For my situation, I don't need more than a setting of "5". This change allows me to configure the setting before creating the cluster.
* Allow the eventRecordQPS setting to be set.
The default settings (see types.go) is five. So, this change does not affect the cluster provisioning. However, it does allow for the setting to be changed.
Fedora 31 uses Cgroups v2 by default. This change by passes the kernel
parameter systemd.unified_cgroup_hierarchy=0.
Signed-off-by: Victor Morales <v.morales@samsung.com>
When https://kubespray.io/#/docs/comparisons is generated, having the link in the heading creates the following HTML. When displayed there is no space between "vs" and the link. I simply moved the link into the following paragraph.
```
<h2 id="kubespray-vs-kops"><a href="#/docs/comparisons?id=kubespray-vs-kops" data-id="kubespray-vs-kops" class="anchor"><span>Kubespray vs </span></a><a href="https://github.com/kubernetes/kops" target="_blank" rel="noopener">Kops</a></h2>
```
* Add note about changing private IP in admin.conf.
When I run kubespray, a load balancer is created which should be used instead of the ip of the controller node.
* Procedure to find load balancer and update admin.conf
When I run kubespray, a load balancer is used instead of the private ip of the controller.
* update version of ingress-nginx controller.
Change tag from controller-v0.34.0 to controller-v0.40.2 to use newest tag.
* Update docs about aws deploy templates.
In the yaml templates, there is no mention of idle timeouts. This is why I removed the documentation about it. This might be a mistake. Please verify this. I don't know enough to verify it myself.
* Change label when checking version.
When checking for `app.kubernetes.io/name=ingress-nginx`, a completed pod was selected which is not helpful when trying to `exec`. Changing the label selects the running controller pod.
* put back the information about ELB Idle Timeouts.
When I removed the information, I had overlooked that it was mentioned in the L7 yaml file. Thanks.
and thereby support upgrade from e.g. 1.18.x to 1.19.y
Included OSes:
- Centos7/8
- Ubuntu18/20
New variables for overriding by default installed packages:
- centos_crio_packages
- ubuntu_crio_packages
* Enable Kata Containers for CRI-O runtime
Kata Containers is an OCI runtime where containers are run inside
lightweight VMs. This runtime has been enabled for containerd runtime
thru the kata_containers_enabled variable. This change enables Kata
Containers to CRI-O container runtime.
Signed-off-by: Victor Morales <v.morales@samsung.com>
* Set appropiate conmon_cgroup when crio_cgroup_manager is 'cgroupfs'
* Set manage_ns_lifecycle=true when KataContainers is enabed
* Add preinstall check for katacontainers
Signed-off-by: Victor Morales <v.morales@samsung.com>
Co-authored-by: Pasquale Toscano <pasqualetoscano90@gmail.com>
Command line flags aren't added to kube-proxy which results in missing
feature gates set in this component. Add appropriate setting to
ConfigMap instead.
Signed-off-by: Maciej Wereski <m.wereski@partner.samsung.com>
'ansible.vars.hostvars.HostVarsVars object' has no attribute 'kubeadm_upload_cert'
kubeadm_upload_cert will never be found as a hostvar for the first
master since the task is executed for a worker.
Fix by executing the upload task for the first master and register
the needed key. After that, workers can read hostvars for the master
Var kubeadm_etcd_refresh_cert_key removed since it no longer has
any use.
* Adding option to disable gloablly applying a proxy to etc/yum.conf
* Change made to proxy_yum_globaly basedon reviewer feedback
* fix trailing spaces in ymllint
crio refuses to delete pods when cni is unavailable which is the
case e.g. using calico with kdd datastore. See:
https://github.com/cri-o/cri-o/issues/4084
Fix by deleting storage associated with containers. Stop and disable
crio service so switching container runtime can be done.
* Added option to force apiserver and respective client certificate to be regenerated without necessarily needing to bump the K8S cluster version
* Removed extra blank line
Handlers with the same name (Kubeadm | restart kubelet) leads to incorrect playbook execution. As a result, after completing the tasks, kubelet does not restart. This PR fix this behavior
After upgrading to newer Kubernetes(v1.17 at least), kubectl command
shows the following warning message:
WARNING: Kubernetes configuration file is group-readable.
This is insecure. Location: /home/foo/.kube/config
The kubeconfig was copied from {{ artifacts_dir }}/admin.conf with
kubeconfig_localhost feature. It is better to set valid file mode
at getting it on Kubespray.
The label triage/support has been reclassified as kind/support. The
kind/* family of labels makes more logical sense, as they describe the
"kind" of thing an issue or PR is.
For more information, see the announcement email:
https://groups.google.com/g/kubernetes-dev/c/YcaJpsjjLKw/m/i15cLLx5CAAJ
If the `mitogen.yml` playbook is run, it installs Mitogen in this path, causing Git to believe there to 500+ changes. This simply excludes that external module from git
The 0d0cc8cf9c change creates several
DaemonSets to cover the Flannel CNI installation for different CPU
architectures. This change removes the unnecessary architecture value
from the docker tag value.
Signed-off-by: Victor Morales <v.morales@samsung.com>
In case multiple nodeselectors are specified in ingress_nginx_nodeselector, the generated daemonset yaml template for nginx is invalid due to missing indentation starting with the second nodeselector
When stopping at the check of "Stop if ip var does not match local ips"
the error message is like:
fatal: [single-k8s]: FAILED! => {
"assertion": "ip in ansible_all_ipv4_addresses",
"changed": false,
"evaluated_to": false,
"msg": "Assertion failed"
}
That doesn't contain actual IP addresses and it is difficult to understand
what was wrong. This adds the error message which contain actual IP addresses
to investigate the issue if happens.
* calico: add constant calico_min_version_required
and verify current deployed version against it.
* calico: remove upgrade support with data migration
The tool was used pre v3.0.0 and is no longer needed.
* calico: remove old version support from tasks
* calico: remove old ver support from policy ctrl
* calico: remove old ver support from node
* canal: remove old ver support
* remove unused calicoctl download checksums
calico_min_version_required is the oldest version that can be installed
Older versions can be removed.
* Add retries to update calico-rr data in etcd through calicoctl
* Update update-node yaml syntax
* Add comment to clarify ansible block loop
* Remove trailing space
* Fix reserved memory unit in kubelet configuration
Signed-off-by: Wang Zhen <lazybetrayer@gmail.com>
* Move systemReserved default values from template
Signed-off-by: Wang Zhen <lazybetrayer@gmail.com>
* Added ability to set calico vxlan vni and port. defaults to calico's documented defaults.
* Check if calico_network_backend is defined prior to checking value
* Removed calico hidden defaults for vxlan port and vni
* Fixed FELIX_VXLANVNI typo
I kept seeing `TLS handshake error from 10.250.250.158:63770: EOF` from two IP addresses that correlate to my ELB. Changing the health check from TCP to HTTPS stopped the errors from being generated.
* Added support for setting tiller_service_account and tiller_replicas
* Specify helm 2 version to ensure we have a test path that still hits helm 2 code
* Moved tiller_service_account to defaults.yml. Fixed is tiller_replicas defined check.
It was documented as if it were an Ansible variable, but it is a Terraform variable.
This also means the colon syntax was incorrect. TF variables are assigned with an equals sign.
Co-authored-by: rptaylor <rptaylor@uvic.ca>
* Make metallb image repos configurable
* Moved metallb image repo definitions to download role defaults
* Removed comment. These are set in download defaults
After host reboot kubelet and crio goes into a loop and no container is started.
storage_driver in crio.conf overrides system defaults in etc/containers/storage.conf
/etc/containers/storage.conf is installed by package containers-common dependency
installed from cri-o (centos7) and contains "overlay".
Hosts already configured with overlay2 should be reconfigured and the
/var/lib/containers content removed.
* Add comment from roles/kubespray-defaults/defaults/main.yaml clarifying network allocation and sizes
Signed-off-by: Mikael Johansson <mik.json@gmail.com>
* Rewrite of the comment and added new examples
Signed-off-by: Mikael Johansson <mik.json@gmail.com>
It is recommended to use filter to manage the GitHub email notification, see [examples for setting filters to Kubernetes Github notifications](https://github.com/kubernetes/community/blob/master/communication/best-practices.md#examples-for-setting-filters-to-kubernetes-github-notifications)
To install development dependencies you can use`pip install -r tests/requirements.txt`
To install development dependencies you can set up a python virtual env with the necessary dependencies:
```ShellSession
virtualenv venv
source venv/bin/activate
pip install -r tests/requirements.txt
```
#### Linting
Kubespray uses `yamllint` and `ansible-lint`. To run them locally use `yamllint .` and `./tests/scripts/ansible-lint.sh`
Kubespray uses `yamllint` and `ansible-lint`. To run them locally use `yamllint .` and `ansible-lint`. It is a good idea to add call these tools as part of your pre-commit hook and avoid a lot of back end forth on fixing linting issues (<https://support.gitkraken.com/working-with-repositories/githooksexample/>).
#### Molecule
@@ -29,3 +35,5 @@ Vagrant with VirtualBox or libvirt driver helps you to quickly spin test cluster
3. Fork the desired repo, develop and test your code changes.
4. Sign the CNCF CLA (<https://git.k8s.io/community/CLA.md#the-contributor-license-agreement>)
5. Submit a pull request.
6. Work with the reviewers on their suggestions.
7. Ensure to rebase to the HEAD of your target branch and squash un-necessary commits (<https://blog.carbonfive.com/always-squash-and-rebase-your-git-commits/>) before final merger of your contribution.
If you have questions, check the documentation at [kubespray.io](https://kubespray.io) and join us on the [kubernetes slack](https://kubernetes.slack.com), channel **\#kubespray**.
You can get your invite [here](http://slack.k8s.io/)
- Can be deployed on **[AWS](docs/aws.md), GCE, [Azure](docs/azure.md), [OpenStack](docs/openstack.md), [vSphere](docs/vsphere.md), [Packet](docs/packet.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- Can be deployed on **[AWS](docs/aws.md), GCE, [Azure](docs/azure.md), [OpenStack](docs/openstack.md), [vSphere](docs/vsphere.md), [Equinix Metal](docs/equinix-metal.md) (bare metal), Oracle Cloud Infrastructure (Experimental), or Baremetal**
- **Highly available** cluster
- **Composable** (Choice of the network plugin for instance)
# Deploy Kubespray with Ansible Playbook - run the playbook as root
# The option `--become` is required, as for example writing SSL keys in /etc/,
@@ -48,11 +48,23 @@ As a consequence, `ansible-playbook` command will fail with:
ERROR! no action detected in task. This often indicates a misspelled module name, or incorrect module path.
```
probably pointing on a task depending on a module present in requirements.txt (i.e. "unseal vault").
probably pointing on a task depending on a module present in requirements.txt.
One way of solving this would be to uninstall the Ansible package and then, to install it via pip but it is not always possible.
A workaround consists of setting `ANSIBLE_LIBRARY` and `ANSIBLE_MODULE_UTILS` environment variables respectively to the `ansible/modules` and `ansible/module_utils` subdirectories of pip packages installation location, which can be found in the Location field of the output of `pip show [package]` before executing `ansible-playbook`.
A simple way to ensure you get all the correct version of Ansible is to use the [pre-built docker image from Quay](https://quay.io/repository/kubespray/kubespray?tab=tags).
You will then need to use [bind mounts](https://docs.docker.com/storage/bind-mounts/) to get the inventory and ssh key into the container, like this:
```ShellSession
docker pull quay.io/kubespray/kubespray:v2.17.1
docker run --rm -it --mount type=bind,source="$(pwd)"/inventory/sample,dst=/inventory \
Note: The list of validated [docker versions](https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker) is 1.13.1, 17.03, 17.06, 17.09, 18.06, 18.09 and 19.03. The recommended docker version is 19.03. The kubelet might break on docker's non-standard version numbering (it no longer uses semantic versioning). To ensure auto-updates don't break your cluster look into e.g. yum versionlock plugin or apt pin).
## Container Runtime Notes
- The list of available docker version is 18.09, 19.03 and 20.10. The recommended docker version is 20.10. The kubelet might break on docker's non-standard version numbering (it no longer uses semantic versioning). To ensure auto-updates don't break your cluster look into e.g. yum versionlock plugin or apt pin).
- The cri-o version should be aligned with the respective kubernetes version (i.e. kube_version=1.20.x, crio_version=1.20)
## Requirements
- **Minimum required version of Kubernetes is v1.17**
- **Ansible v2.9+, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands**
- **Minimum required version of Kubernetes is v1.20**
- **Ansible v2.9.x, Jinja 2.11+ and python-netaddr is installed on the machine that will run Ansible commands, Ansible 2.10.x is experimentally supported for now**
- The target servers must have **access to the Internet** in order to pull docker images. Otherwise, additional configuration is required (See [Offline Environment](docs/offline-environment.md))
- The target servers are configured to allow **IPv4 forwarding**.
-**Your ssh key must be copied** to all the servers part of your inventory.
-If using IPv6 for pods and services, the target servers are configured to allow **IPv6 forwarding**.
- The **firewalls are not managed**, you'll need to implement your own rules the way you used to.
in order to avoid any issue during deployment you should disable your firewall.
- If kubespray is ran from non-root user account, correct privilege escalation method
@@ -179,11 +194,6 @@ You can choose between 10 network plugins. (default: `calico`, except Vagrant us
- [cilium](http://docs.cilium.io/en/latest/): layer 3/4 networking (as well as layer 7 to protect and secure application protocols), supports dynamic insertion of BPF bytecode into the Linux kernel to implement security services, networking and visibility logic.
- [contiv](docs/contiv.md): supports vlan, vxlan, bgp and Cisco SDN networking. This plugin is able to
apply firewall policies, segregate containers in multiple network and bridging pods onto physical networks.
- [ovn4nfv](docs/ovn4nfv.md): [ovn4nfv-k8s-plugins](https://github.com/opnfv/ovn4nfv-k8s-plugin) is the network controller, OVS agent and CNI server to offer basic SFC and OVN overlay networking.
- [weave](docs/weave.md): Weave is a lightweight container overlay network that doesn't require an external K/V database cluster.
(Please refer to `weave` [troubleshooting documentation](https://www.weave.works/docs/net/latest/troubleshooting/)).
@@ -204,10 +214,10 @@ See also [Network checker](docs/netcheck.md).
## Ingress Plugins
- [ambassador](docs/ambassador.md): the Ambassador Ingress Controller and API gateway.
- [nginx](https://kubernetes.github.io/ingress-nginx): the NGINX Ingress Controller.
- [metallb](docs/metallb.md): the MetalLB bare-metal service LoadBalancer provider.
@@ -8,19 +8,19 @@ In the same directory of this ReadMe file you should find a file named `inventor
Change that file to reflect your local setup (adding more machines or removing them and setting the adequate ip numbers), and save it to `inventory/sample/k8s_gfs_inventory`. Make sure that the settings on `inventory/sample/group_vars/all.yml` make sense with your deployment. Then execute change to the kubespray root folder, and execute (supposing that the machines are all using ubuntu):
If your machines are not using Ubuntu, you need to change the `--user=ubuntu` to the correct user. Alternatively, if your Kubernetes machines are using one OS and your GlusterFS a different one, you can instead specify the `ansible_ssh_user=<correct-user>` variable in the inventory file that you just created, for each machine/VM:
First step is to fill in a `my-kubespray-gluster-cluster.tfvars` file with the specification desired for your cluster. An example with all required variables would look like:
```
```ini
cluster_name="cluster1"
number_of_k8s_masters="1"
number_of_k8s_masters_no_floating_ip="2"
@@ -54,7 +54,7 @@ ssh_user_gfs = "ubuntu"
As explained in the general terraform/openstack guide, you need to source your OpenStack credentials file, add your ssh-key to the ssh-agent and setup environment variables for terraform:
@@ -8,7 +8,7 @@ Installs and configures GlusterFS on Linux.
For GlusterFS to connect between servers, TCP ports `24007`, `24008`, and `24009`/`49152`+ (that port, plus an additional incremented port for each additional server in the cluster; the latter if GlusterFS is version 3.4+), and TCP/UDP port `111` must be open. You can open these using whatever firewall you wish (this can easily be configured using the `geerlingguy.firewall` role).
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/gluster_volume_module.html) module to ease the management of Gluster volumes.
This role performs basic installation and setup of Gluster, but it does not configure or mount bricks (volumes), since that step is easier to do in a series of plays in your own playbook. Ansible 1.9+ includes the [`gluster_volume`](https://docs.ansible.com/ansible/latest/collections/gluster/gluster/gluster_volume_module.html) module to ease the management of Gluster volumes.
when:inventory_hostname == groups['kube-master'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
when:inventory_hostname == groups['kube_control_plane'][0] and groups['gfs-cluster'] is defined and hostvars[groups['gfs-cluster'][0]].gluster_disk_size_gb is defined
- name:Kubernetes Apps | Set GlusterFS endpoint and PV
# Deploy Heketi/Glusterfs into Kubespray/Kubernetes
This playbook aims to automate [this](https://github.com/heketi/heketi/blob/master/docs/admin/install-kubernetes.md) tutorial. It deploys heketi/glusterfs into kubernetes and sets up a storageclass.
## Important notice
> Due to resource limits on the current project maintainers and general lack of contributions we are considering placing Heketi into a [near-maintenance mode](https://github.com/heketi/heketi#important-notice)
## Client Setup
Heketi provides a CLI that provides users with a means to administer the deployment and configuration of GlusterFS in Kubernetes. [Download and install the heketi-cli](https://github.com/heketi/heketi/releases) on your client machine.
## Install
Copy the inventory.yml.sample over to inventory/sample/k8s_heketi_inventory.yml and change it according to your setup.
Container image collecting script for offline deployment
This script has two features:
(1) Get container images from an environment which is deployed online.
(2) Deploy local container registry and register the container images to the registry.
Step(1) should be done online site as a preparation, then we bring the gotten images
to the target offline environment.
Then we will run step(2) for registering the images to local registry.
Step(1) can be operated with:
```shell
manage-offline-container-images.sh create
```
Step(2) can be operated with:
```shell
manage-offline-container-images.sh register
```
## generate_list.sh
This script generates the list of downloaded files and the list of container images by `roles/download/defaults/main.yml` file.
Run this script will generates three files, all downloaded files url in files.list, all container images in images.list, all component version in generate.sh.
```shell
bash generate_list.sh
tree temp
temp
├── files.list
├── generate.sh
└── images.list
0 directories, 3 files
```
In some cases you may want to update some component version, you can edit `generate.sh` file, then run `bash generate.sh | grep 'https' > files.list` to update file.list or run `bash generate.sh | grep -v 'https'> images.list` to update images.list.
* VPC with Public and Private Subnets in # Availability Zones
* Bastion Hosts and NAT Gateways in the Public Subnet
* A dynamic number of masters, etcd, and worker nodes in the Private Subnet
* even distributed over the # of Availability Zones
* AWS ELB in the Public Subnet for accessing the Kubernetes API from the internet
**Requirements**
- VPC with Public and Private Subnets in # Availability Zones
- Bastion Hosts and NAT Gateways in the Public Subnet
- A dynamic number of masters, etcd, and worker nodes in the Private Subnet
- even distributed over the # of Availability Zones
- AWS ELB in the Public Subnet for accessing the Kubernetes API from the internet
## Requirements
- Terraform 0.12.0 or newer
**How to Use:**
## How to Use
- Export the variables for your AWS credentials or edit `credentials.tfvars`:
```
```commandline
export TF_VAR_AWS_ACCESS_KEY_ID="www"
export TF_VAR_AWS_SECRET_ACCESS_KEY ="xxx"
export TF_VAR_AWS_SSH_KEY_NAME="yyy"
export TF_VAR_AWS_DEFAULT_REGION="zzz"
```
- Update `contrib/terraform/aws/terraform.tfvars` with your data. By default, the Terraform scripts use Ubuntu 18.04 LTS (Bionic) as base image. If you want to change this behaviour, see note "Using other distrib than Ubuntu" below.
- Create an AWS EC2 SSH Key
- Run with `terraform apply --var-file="credentials.tfvars"` or `terraform apply` depending if you exported your AWS credentials
Example:
```commandline
terraform apply -var-file=credentials.tfvars
```
- Terraform automatically creates an Ansible Inventory file called `hosts` with the created infrastructure in the directory `inventory`
- Ansible will automatically generate an ssh config file for your bastion hosts. To connect to hosts with ssh using bastion host use generated ssh-bastion.conf.
Ansible automatically detects bastion and changes ssh_args
If you want to use another distribution than Ubuntu 18.04 (Bionic) LTS, you can modify the search filters of the 'data "aws_ami" "distro"' in variables.tf.
For example, to use:
- Debian Jessie, replace 'data "aws_ami" "distro"' in variables.tf with
```ini
data "aws_ami" "distro" {
most_recent=true
@@ -65,8 +74,11 @@ data "aws_ami" "distro" {
owners=["379101102735"]
}
```
- Ubuntu 16.04, replace 'data "aws_ami" "distro"' in variables.tf with
```ini
data "aws_ami" "distro" {
most_recent=true
@@ -82,8 +94,11 @@ data "aws_ami" "distro" {
owners=["099720109477"]
}
```
- Centos 7, replace 'data "aws_ami" "distro"' in variables.tf with
```ini
data "aws_ami" "distro" {
most_recent=true
@@ -99,23 +114,49 @@ data "aws_ami" "distro" {
owners=["688023202711"]
}
```
**Troubleshooting**
## Connecting to Kubernetes
***Remaining AWS IAM Instance Profile***:
You can use the following set of commands to get the kubeconfig file from your newly created cluster. Before running the commands, make sure you are in the project's root folder.
sed -i "s^server:.*^server: https://$LB_HOST:6443^" ~/.kube/config
kubectl get nodes
```
## Troubleshooting
### Remaining AWS IAM Instance Profile
If the cluster was destroyed without using Terraform it is possible that
the AWS IAM Instance Profiles still remain. To delete them you can use
the `AWS CLI` with the following command:
```
```commandline
aws iam delete-instance-profile --region <region_name> --instance-profile-name <profile_name>
```
***Ansible Inventory doesn't get created:***
### Ansible Inventory doesn't get created
It could happen that Terraform doesn't create an Ansible Inventory file automatically. If this is the case copy the output after `inventory=` and create a file named `hosts`in the directory `inventory` and paste the inventory into the file.
**Architecture**
## Architecture
Pictured is an AWS Infrastructure created with this Terraform project distributed over two Availability Zones.
*`ssh_public_keys`: List of public SSH keys to install on all machines
*`zone`: The zone where to run the cluster
*`machines`: Machines to provision. Key of this object will be used as the name of the machine
*`node_type`: The role of this node *(master|worker)*
*`size`: The size to use
*`boot_disk`: The boot disk to use
*`image_name`: Name of the image
*`root_partition_size`: Size *(in GB)* for the root partition
*`ceph_partition_size`: Size *(in GB)* for the partition for rook to use as ceph storage. *(Set to 0 to disable)*
*`node_local_partition_size`: Size *(in GB)* for the partition for node-local-storage. *(Set to 0 to disable)*
*`ssh_whitelist`: List of IP ranges (CIDR) that will be allowed to ssh to the nodes
*`api_server_whitelist`: List of IP ranges (CIDR) that will be allowed to connect to the API server
*`nodeport_whitelist`: List of IP ranges (CIDR) that will be allowed to connect to the kubernetes nodes on port 30000-32767 (kubernetes nodeports)
### Optional
*`prefix`: Prefix to use for all resources, required to be unique for all clusters in the same project *(Defaults to `default`)*
An example variables file can be found `default.tfvars`
## Known limitations
### Only single disk
Since Exoscale doesn't support additional disks to be mounted onto an instance, this script has the ability to create partitions for [Rook](https://rook.io/) and [node-local-storage](https://kubernetes.io/docs/concepts/storage/volumes/#local).
### No Kubernetes API
The current solution doesn't use the [Exoscale Kubernetes cloud controller](https://github.com/exoscale/exoscale-cloud-controller-manager).
This means that we need to set up a HTTP(S) loadbalancer in front of all workers and set the Ingress controller to DaemonSet mode.
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.