modify doc structure and update existing doc-links as preparation for new doc generation script

docs/operations/cgroups.md (new file)
@@ -0,0 +1,72 @@

# cgroups

To prevent containers from competing for resources with each other, or from impacting the host, the kubelet in Kubernetes relies on cgroups to limit container resource usage.

## Enforcing Node Allocatable

You can use `kubelet_enforce_node_allocatable` to set node allocatable enforcement.

```yaml
# A comma-separated list of levels of node allocatable enforcement to be enforced by kubelet.
kubelet_enforce_node_allocatable: "pods"
# kubelet_enforce_node_allocatable: "pods,kube-reserved"
# kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"
```

Note that to enforce kube-reserved or system-reserved, `kube_reserved_cgroups` or `system_reserved_cgroups` must be specified, respectively.

Here is an example:

```yaml
kubelet_enforce_node_allocatable: "pods,kube-reserved,system-reserved"

# Reserve this space for kube resources
# Set to true to reserve resources for kube daemons
kube_reserved: true
kube_reserved_cgroups_for_service_slice: kube.slice
kube_reserved_cgroups: "/{{ kube_reserved_cgroups_for_service_slice }}"
kube_memory_reserved: 256Mi
kube_cpu_reserved: 100m
# kube_ephemeral_storage_reserved: 2Gi
# kube_pid_reserved: "1000"
# Reservation for master hosts
kube_master_memory_reserved: 512Mi
kube_master_cpu_reserved: 200m
# kube_master_ephemeral_storage_reserved: 2Gi
# kube_master_pid_reserved: "1000"

# Set to true to reserve resources for system daemons
system_reserved: true
system_reserved_cgroups_for_service_slice: system.slice
system_reserved_cgroups: "/{{ system_reserved_cgroups_for_service_slice }}"
system_memory_reserved: 512Mi
system_cpu_reserved: 500m
# system_ephemeral_storage_reserved: 2Gi
# system_pid_reserved: "1000"
# Reservation for master hosts
system_master_memory_reserved: 256Mi
system_master_cpu_reserved: 250m
# system_master_ephemeral_storage_reserved: 2Gi
# system_master_pid_reserved: "1000"
```

After the setup, the cgroups hierarchy is as follows:

```bash
/ (Cgroups Root)
├── kubepods.slice
│   ├── ...
│   ├── kubepods-besteffort.slice
│   ├── kubepods-burstable.slice
│   └── ...
├── kube.slice
│   ├── ...
│   ├── {{container_manager}}.service
│   ├── kubelet.service
│   └── ...
├── system.slice
│   └── ...
└── ...
```

You can learn more in the [official kubernetes documentation](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/).

docs/operations/encrypting-secret-data-at-rest.md (new file)
@@ -0,0 +1,22 @@

# Encrypting Secret Data at Rest

Before enabling Encrypting Secret Data at Rest, please read the following documentation carefully:

<https://kubernetes.io/docs/tasks/administer-cluster/encrypt-data/>

As described in the documentation above, five encryption providers are supported as of today (22.02.2022).

We have chosen `secretbox` as the default value for the provider.

Alternatively you can use the values `identity`, `aesgcm`, `aescbc` or `kms`.

| Provider | Why we have decided against the value as default |
|----------|---------------------------------------------------|
| identity | No encryption. |
| aesgcm   | Must be rotated every 200k writes. |
| aescbc   | Not recommended due to CBC's vulnerability to padding oracle attacks. |
| kms      | The officially recommended way, but it assumes a key management service independent of Kubernetes. We cannot assume this in all environments, so it is not a suitable default value. |

## Details about Secretbox

Secretbox uses [XSalsa20](https://www.xsalsa20.com/) for secret-key encryption and [Poly1305](https://cr.yp.to/mac.html) as the message-authentication code, providing secret-key authenticated encryption.
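
As a minimal sketch of how this ties into a Kubespray inventory, the variables below (the same ones used in the hardening example later in this commit) enable encryption of Secrets at rest with the default `secretbox` provider:

```yaml
# group_vars/k8s_cluster/k8s-cluster.yml (the path is illustrative)
kube_encrypt_secret_data: true
kube_encryption_resources: [secrets]
kube_encryption_algorithm: "secretbox"
```
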

docs/operations/etcd.md (new file)
@@ -0,0 +1,52 @@

# etcd

## Deployment Types

It is possible to deploy etcd with three methods. To change the default deployment method (`host`), use the `etcd_deployment_type` variable. Possible values are `host`, `kubeadm`, and `docker`.

### Host

Host deployment is the default method. Using this method results in etcd being installed as a systemd service.

### Docker

Installs Docker on etcd group members and runs etcd in Docker containers. Only usable when `container_manager` is set to `docker`.

### Kubeadm

This deployment method is experimental and is only available for new deployments. It deploys etcd as a static pod on master hosts.
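
For example, to switch to the kubeadm method, set the variable in your inventory group vars (a minimal sketch; the same value is used in the hardening example in this commit):

```yaml
etcd_deployment_type: kubeadm
```
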
## Metrics

To expose metrics on a separate HTTP port, define it in the inventory with:

```yaml
etcd_metrics_port: 2381
```

To create a service `etcd-metrics` and associated endpoints in the `kube-system` namespace,
define its labels in the inventory with:

```yaml
etcd_metrics_service_labels:
  k8s-app: etcd
  app.kubernetes.io/managed-by: Kubespray
  app: kube-prometheus-stack-kube-etcd
  release: prometheus-stack
```

The last two labels in the above example allow scraping the metrics from the
[kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
chart with the following Helm `values.yaml`:

```yaml
kubeEtcd:
  service:
    enabled: false
```

To fully override the metrics exposition URLs, define them in the inventory with:

```yaml
etcd_listen_metrics_urls: "http://0.0.0.0:2381"
```

docs/operations/ha-mode.md (new file)
@@ -0,0 +1,158 @@

# HA endpoints for K8s

The following components require highly available endpoints:

* etcd cluster,
* kube-apiserver service instances.

The latter relies on a third-party reverse proxy, like Nginx or HAProxy, to
achieve the same goal.

## Etcd

The etcd clients (kube-api-masters) are configured with the list of all etcd peers. If the etcd cluster has multiple instances, it is already configured for HA.

## Kube-apiserver

K8s components require a loadbalancer to access the apiservers via a reverse
proxy. Kubespray includes support for an nginx-based proxy that resides on each
non-master Kubernetes node. This is referred to as localhost loadbalancing. It
is less efficient than a dedicated load balancer because it creates extra
health checks on the Kubernetes apiserver, but is more practical for scenarios
where an external LB or virtual IP management is inconvenient. This option is
configured by the variable `loadbalancer_apiserver_localhost` (defaults to
`True`, or `False` if there is an external `loadbalancer_apiserver` defined).
You may also define the port the local internal loadbalancer uses by changing
`loadbalancer_apiserver_port`. This defaults to the value of
`kube_apiserver_port`. It is also important to note that Kubespray will only
configure kubelet and kube-proxy on non-master nodes to use the local internal
loadbalancer. If you wish to control the name of the loadbalancer container,
you can set the variable `loadbalancer_apiserver_pod_name`.
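
For illustration, these variables could be set in your inventory group vars like this (a sketch; the port value is arbitrary and only needed if you want to deviate from `kube_apiserver_port`):

```yaml
# Keep the nginx-based local proxy on non-master nodes (the default)
loadbalancer_apiserver_localhost: true
# Override the port the local internal loadbalancer listens on
loadbalancer_apiserver_port: 8443
```
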
If you choose to NOT use the local internal loadbalancer, you will need to
use the [kube-vip](kube-vip.md) ansible role or configure your own loadbalancer to achieve HA. By default, Kubespray only configures a non-HA endpoint, which points to the
`access_ip` or IP address of the first server node in the `kube_control_plane` group.
It can also configure clients to use endpoints for a given loadbalancer type.
The following diagram shows how traffic to the apiserver is directed.

![Image](figures/loadbalancer_localhost.png?raw=true)

A user may opt to use an external loadbalancer (LB) instead. An external LB
provides access for external clients, while the internal LB accepts client
connections only to the localhost.
Given a frontend `VIP` address and `IP1, IP2` addresses of backends, here is
an example configuration for an HAProxy service acting as an external LB:

```raw
listen kubernetes-apiserver-https
  bind <VIP>:8383
  mode tcp
  option log-health-checks
  timeout client 3h
  timeout server 3h
  server master1 <IP1>:6443 check check-ssl verify none inter 10000
  server master2 <IP2>:6443 check check-ssl verify none inter 10000
  balance roundrobin
```

Note: this is an example config managed elsewhere, outside of Kubespray.

And here are the corresponding example global vars for such a "cluster-aware"
external LB with the cluster API access modes configured in Kubespray:

```yml
apiserver_loadbalancer_domain_name: "my-apiserver-lb.example.com"
loadbalancer_apiserver:
  address: <VIP>
  port: 8383
```

Note: The default kubernetes apiserver configuration binds to all interfaces,
so you will need to use a different port for the VIP than the one the API is
listening on, or set `kube_apiserver_bind_address` so that the API only
listens on a specific interface (to avoid a conflict with haproxy binding the
port on the VIP address).

This domain name, or the default "lb-apiserver.kubernetes.local", will be inserted
into the `/etc/hosts` file of all servers in the `k8s_cluster` group and wired
into the generated self-signed TLS/SSL certificates as well. Note that
the HAProxy service should itself be HA and requires VIP management, which
is out of the scope of this doc.

There is a special case for an internal and an externally configured (not with
Kubespray) LB used simultaneously. Keep in mind that the cluster is not aware
of such an external LB and you do not need to specify any configuration variables
for it.

Note: TLS/SSL termination for externally accessed API endpoints will **not**
be covered by Kubespray for that case. Make sure your external LB provides it.
Alternatively you may specify external load balanced VIPs in the
`supplementary_addresses_in_ssl_keys` list. Then, kubespray will add them into
the generated cluster certificates as well.
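
For example (the address below is illustrative):

```yaml
supplementary_addresses_in_ssl_keys:
  - "203.0.113.50"
```
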
Aside from that specific case, `loadbalancer_apiserver` is considered mutually
exclusive with `loadbalancer_apiserver_localhost`.

Access API endpoints are evaluated automatically, as follows:

| Endpoint type                | kube_control_plane                     | non-master              | external              |
|------------------------------|----------------------------------------|-------------------------|-----------------------|
| Local LB (default)           | `https://dbip:sp`                      | `https://lc:nsp`        | `https://m[0].aip:sp` |
| Local LB (default) + cbip    | `https://cbip:sp` and `https://lc:nsp` | `https://lc:nsp`        | `https://m[0].aip:sp` |
| Local LB + Unmanaged here LB | `https://dbip:sp`                      | `https://lc:nsp`        | `https://ext`         |
| External LB, no internal     | `https://dbip:sp`                      | `<https://lb:lp>`       | `https://lb:lp`       |
| No ext/int LB                | `https://dbip:sp`                      | `<https://m[0].aip:sp>` | `https://m[0].aip:sp` |

Where:

* `m[0]` - the first node in the `kube_control_plane` group;
* `lb` - LB FQDN, `apiserver_loadbalancer_domain_name`;
* `ext` - externally load balanced VIP:port and FQDN, not managed by Kubespray;
* `lc` - localhost;
* `cbip` - a custom bind IP, `kube_apiserver_bind_address`;
* `dbip` - localhost for the default bind IP '0.0.0.0';
* `nsp` - nginx secure port, `loadbalancer_apiserver_port`, defers to `sp`;
* `sp` - secure port, `kube_apiserver_port`;
* `lp` - LB port, `loadbalancer_apiserver.port`, defers to the secure port;
* `ip` - the node IP, defers to the ansible IP;
* `aip` - `access_ip`, defers to the ip.

The second and third columns represent internal cluster access modes. The last
column illustrates an example URI to access the cluster APIs externally.
Kubespray has nothing to do with it; this is informational only.

As you can see, the masters' internal API endpoints are always
contacted via the local bind IP, which is `https://bip:sp`.

## Optional configurations

### ETCD with a LB

In order to use an external loadbalancer (L4/TCP or L7 w/ SSL Passthrough VIP), the following variables need to be overridden in group_vars:

* `etcd_access_addresses`
* `etcd_client_url`
* `etcd_cert_alt_names`
* `etcd_cert_alt_ips`

#### Example of a VIP w/ FQDN

```yaml
etcd_access_addresses: https://etcd.example.com:2379
etcd_client_url: https://etcd.example.com:2379
etcd_cert_alt_names:
  - "etcd.kube-system.svc.{{ dns_domain }}"
  - "etcd.kube-system.svc"
  - "etcd.kube-system"
  - "etcd"
  - "etcd.example.com" # This one needs to be added to the default etcd_cert_alt_names
```

#### Example of a VIP w/o FQDN (IP only)

```yaml
etcd_access_addresses: https://2.3.7.9:2379
etcd_client_url: https://2.3.7.9:2379
etcd_cert_alt_ips:
  - "2.3.7.9"
```

docs/operations/hardening.md (new file)
@@ -0,0 +1,144 @@

# Cluster Hardening

If you want to improve the security of your cluster and make it compliant with the [CIS Benchmarks](https://learn.cisecurity.org/benchmarks), here you can find a configuration to harden your **kubernetes** installation.

To apply the hardening configuration, create a file (eg. `hardening.yaml`) and paste the content of the following code snippet into it.

## Minimum Requirements

The **kubernetes** version should be at least `v1.23.6` to have all the most recent security features (eg. the new `PodSecurity` admission plugin, etc).

**N.B.** Some of these configurations have only recently been added to **kubespray**, so ensure that you have the latest version to make them work properly. Also, ensure that other configurations don't override these.

`hardening.yaml`:

```yaml
# Hardening
---

## kube-apiserver
authorization_modes: ['Node', 'RBAC']
# AppArmor-based OS
# kube_apiserver_feature_gates: ['AppArmor=true']
kube_apiserver_request_timeout: 120s
kube_apiserver_service_account_lookup: true

# enable kubernetes audit
kubernetes_audit: true
audit_log_path: "/var/log/kube-apiserver-log.json"
audit_log_maxage: 30
audit_log_maxbackups: 10
audit_log_maxsize: 100

tls_min_version: VersionTLS12
tls_cipher_suites:
  - TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  - TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305

# enable encryption at rest
kube_encrypt_secret_data: true
kube_encryption_resources: [secrets]
kube_encryption_algorithm: "secretbox"

kube_apiserver_enable_admission_plugins:
  - EventRateLimit
  - AlwaysPullImages
  - ServiceAccount
  - NamespaceLifecycle
  - NodeRestriction
  - LimitRanger
  - ResourceQuota
  - MutatingAdmissionWebhook
  - ValidatingAdmissionWebhook
  - PodNodeSelector
  - PodSecurity
kube_apiserver_admission_control_config_file: true
# Creates config file for PodNodeSelector
# kube_apiserver_admission_plugins_needs_configuration: [PodNodeSelector]
# Define the default node selector, by default all the workloads will be scheduled on nodes
# with label network=srv1
# kube_apiserver_admission_plugins_podnodeselector_default_node_selector: "network=srv1"
# EventRateLimit plugin configuration
kube_apiserver_admission_event_rate_limits:
  limit_1:
    type: Namespace
    qps: 50
    burst: 100
    cache_size: 2000
  limit_2:
    type: User
    qps: 50
    burst: 100
kube_profiling: false
# Remove anonymous access to cluster
remove_anonymous_access: true

## kube-controller-manager
kube_controller_manager_bind_address: 127.0.0.1
kube_controller_terminated_pod_gc_threshold: 50
# AppArmor-based OS
# kube_controller_feature_gates: ["RotateKubeletServerCertificate=true", "AppArmor=true"]
kube_controller_feature_gates: ["RotateKubeletServerCertificate=true"]

## kube-scheduler
kube_scheduler_bind_address: 127.0.0.1
# AppArmor-based OS
# kube_scheduler_feature_gates: ["AppArmor=true"]

## etcd
etcd_deployment_type: kubeadm

## kubelet
kubelet_authorization_mode_webhook: true
kubelet_authentication_token_webhook: true
kube_read_only_port: 0
kubelet_rotate_server_certificates: true
kubelet_protect_kernel_defaults: true
kubelet_event_record_qps: 1
kubelet_rotate_certificates: true
kubelet_streaming_connection_idle_timeout: "5m"
kubelet_make_iptables_util_chains: true
kubelet_feature_gates: ["RotateKubeletServerCertificate=true"]
kubelet_seccomp_default: true
kubelet_systemd_hardening: true
# In case you have multiple interfaces in your
# control plane nodes and you want to specify the right
# IP addresses, kubelet_secure_addresses allows you
# to specify the IP from which the kubelet
# will receive the packets.
kubelet_secure_addresses: "localhost link-local {{ kube_pods_subnet }} 192.168.10.110 192.168.10.111 192.168.10.112"

# additional configurations
kube_owner: root
kube_cert_group: root

# create a default Pod Security Configuration and deny running of insecure pods
# kube_system namespace is exempted by default
kube_pod_security_use_default: true
kube_pod_security_default_enforce: restricted
```

Let's take a deep look at the resulting **kubernetes** configuration:

* The `anonymous-auth` (on `kube-apiserver`) is set to `true` by default. This is fine, because it is considered safe if you enable `RBAC` for the `authorization-mode`.
* The `enable-admission-plugins` includes `PodSecurity` (for more details, please take a look here: <https://kubernetes.io/docs/concepts/security/pod-security-admission/>). Then, we set the `EventRateLimit` plugin, providing additional configuration files (that are automatically created under the hood and mounted inside the `kube-apiserver` container) to make it work.
* The `encryption-provider-config` provides encryption at rest. This means that the `kube-apiserver` encrypts data before it is stored, so the data is completely unreadable from `etcd` (in case an attacker is able to exploit it).
* The `rotateCertificates` in `KubeletConfiguration` is set to `true` along with `serverTLSBootstrap`. This can be used as an alternative to the `tlsCertFile` and `tlsPrivateKeyFile` parameters. Additionally, it automatically generates certificates by itself. By default the CSRs are approved automatically via [kubelet-csr-approver](https://github.com/postfinance/kubelet-csr-approver). You can customize the approval configuration by modifying the Helm values via `kubelet_csr_approver_values`.
  See <https://kubernetes.io/docs/reference/access-authn-authz/kubelet-tls-bootstrapping/> for more information on the subject.
* If you are installing **kubernetes** on an AppArmor-based OS (eg. Debian/Ubuntu) you can enable the `AppArmor` feature gate by uncommenting the lines with the comment `# AppArmor-based OS` on top.
* The `kubelet_systemd_hardening`, together with `kubelet_secure_addresses`, sets up a minimal firewall on the system. To better understand how these variables work, here's an explanatory image:
  ![kubelet hardening](img/kubelet-hardening.png)

Once you have the file properly filled in, you can run the **Ansible** command to start the installation:

```bash
ansible-playbook -v cluster.yml \
    -i inventory.ini \
    -b --become-user=root \
    --private-key ~/.ssh/id_ecdsa \
    -e "@vars.yaml" \
    -e "@hardening.yaml"
```

**N.B.** The `vars.yaml` contains our general cluster information (SANs, load balancer, dns, etc..) and `hardening.yaml` is the file described above.

docs/operations/integration.md (new file)
@@ -0,0 +1,188 @@

# Kubespray (kubespray) in your own ansible playbooks repo

1. Fork the [kubespray repo](https://github.com/kubernetes-sigs/kubespray) to your personal/organisation account on github.

    Note:

    * All forked public repos on github will also be public, so **never commit sensitive data to your public forks**.
    * The list of all forked repos can be retrieved from the github page of the original project.

2. Add the **forked repo** as a submodule to the desired folder in your existing ansible repo (for example 3d/kubespray):

    ```ShellSession
    git submodule add https://github.com/YOUR_GITHUB/kubespray.git kubespray
    ```

    Git will create a `.gitmodules` file in your existing ansible repo:

    ```ini
    [submodule "3d/kubespray"]
      path = 3d/kubespray
      url = https://github.com/YOUR_GITHUB/kubespray.git
    ```

3. Configure git to show submodule status:

    ```ShellSession
    git config --global status.submoduleSummary true
    ```

4. Add the *original* kubespray repo as upstream:

    ```ShellSession
    cd kubespray && git remote add upstream https://github.com/kubernetes-sigs/kubespray.git
    ```

5. Sync your master branch with upstream:

    ```ShellSession
    git checkout master
    git fetch upstream
    git merge upstream/master
    git push origin master
    ```

6. Create a new branch which you will use in your working environment:

    ```ShellSession
    git checkout -b work
    ```

    ***Never*** use the master branch of your repository for your commits.

7. Modify the path to the library and roles in your ansible.cfg file (role naming should be unique; you may have to rename your existing roles if they have the same names as roles in the kubespray project). If you had roles in your existing ansible project before, you can add the paths to those separated with `:`:

    ```ini
    ...
    library = ./library/:3d/kubespray/library/
    roles_path = ./roles/:3d/kubespray/roles/
    ...
    ```

8. Copy and modify configs from the kubespray `group_vars` folder to the corresponding `group_vars` folder in your existing project.

    You could rename the *all.yml* config to something else, e.g. *kubespray.yml*, and create a corresponding group in your inventory file, which will include all host groups related to the kubernetes setup.
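
    For example, a sketch of copying the sample group vars shipped with kubespray into your own repo (paths are illustrative):

    ```ShellSession
    mkdir -p group_vars
    cp -r 3d/kubespray/inventory/sample/group_vars/* ./group_vars/
    ```
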
9. Modify your ansible inventory file by adding a mapping of your existing groups (if any) to the kubespray naming.
    For example:

    ```ini
    ...
    #Kubespray groups:
    [kube_node:children]
    kubenode

    [k8s_cluster:children]
    kubernetes

    [etcd:children]
    kubemaster
    kubemaster-ha

    [kube_control_plane:children]
    kubemaster
    kubemaster-ha

    [kubespray:children]
    kubernetes
    ```

    * The last entry here is needed to apply the kubespray.yml config file, renamed from all.yml of the kubespray project.

10. Now you can include kubespray tasks in your existing playbooks by importing the cluster.yml file:

    ```yml
    - name: Import kubespray playbook
      ansible.builtin.import_playbook: 3d/kubespray/cluster.yml
    ```

    Or you could copy separate tasks from cluster.yml into your ansible repository.

11. Commit the changes to your ansible repo. Keep in mind that the submodule folder is just a link to the git commit hash of your forked repo.

    When you update your "work" branch you need to commit changes to the ansible repo as well.
    Other members of your team should use `git submodule sync` and `git submodule update --init` to get the actual code from the submodule.

## Contributing

If you made useful changes or fixed a bug in the existing kubespray repo, use this flow for PRs to the original kubespray repo.

1. Sign the [CNCF CLA](https://git.k8s.io/community/CLA.md).

2. Change the working directory to the git submodule directory (3d/kubespray).

3. Set up the desired user.name and user.email for the submodule.

    If kubespray is the only submodule in your repo you could use something like:

    ```ShellSession
    git submodule foreach --recursive 'git config user.name "First Last" && git config user.email "your-email-address@used.for.cncf"'
    ```

4. Sync with the upstream master:

    ```ShellSession
    git fetch upstream
    git merge upstream/master
    git push origin master
    ```

5. Create a new branch for the specific fixes that you want to contribute:

    ```ShellSession
    git checkout -b fixes-name-date-index
    ```

    The branch name should be self-explanatory to you; adding a date and/or index will help you to track/delete your old PRs.

6. Find the git hash of your commit in the "work" branch and apply it to the newly created "fix" branch:

    ```ShellSession
    git cherry-pick <COMMIT_HASH>
    ```

7. If you have several temporary-stage commits, squash them using [git rebase -i](https://eli.thegreenplace.net/2014/02/19/squashing-github-pull-requests-into-a-single-commit).

    You could also use an interactive rebase

    ```ShellSession
    git rebase -i HEAD~10
    ```

    to delete commits which you don't want to contribute to the original repo.

8. When your changes are in place, you need to check the upstream repo one more time because it could have changed during your work.

    Check that you're on the correct branch:

    ```ShellSession
    git status
    ```

    And pull changes from upstream (if any):

    ```ShellSession
    git pull --rebase upstream master
    ```

9. Now push your changes to your **fork** repo with

    ```ShellSession
    git push
    ```

    If your branch doesn't exist on github, git will propose that you use something like

    ```ShellSession
    git push --set-upstream origin fixes-name-date-index
    ```

10. Open your forked repo in a browser; on the main page you will see a proposal to create a pull request for your newly created branch. Check the proposed diff of your PR. If something is wrong, you can safely delete the "fix" branch on github using

    ```ShellSession
    git push origin --delete fixes-name-date-index
    git branch -D fixes-name-date-index
    ```

    and start the whole process from the beginning.

    If everything is fine, add a description of your changes (what they do and why they're needed) and confirm the pull request creation.

docs/operations/large-deployments.md (new file)
@@ -0,0 +1,52 @@

Large deployments of K8s
========================

For large scale deployments, consider the following configuration changes:

* Tune [ansible settings](https://docs.ansible.com/ansible/latest/intro_configuration.html)
  for `forks` and `timeout` vars to fit the large number of nodes being deployed.

* Override containers' `foo_image_repo` vars to point to an intranet registry.

* Override the ``download_run_once: true`` and/or ``download_localhost: true``.
  See [Downloading binaries and containers](/docs/advanced/downloads.md) for details.

* Adjust the `retry_stagger` global var as appropriate. It should provide a sane
  load on the delegate (the first K8s control plane node) when retrying failed
  push or download operations.

* Tune parameters for DNS related applications.
  Those are ``dns_replicas``, ``dns_cpu_limit``,
  ``dns_cpu_requests``, ``dns_memory_limit``, ``dns_memory_requests``.
  Please note that limits must always be greater than or equal to requests.

* Tune CPU/memory limits and requests. Those are located in roles' defaults
  and named like ``foo_memory_limit``, ``foo_memory_requests`` and
  ``foo_cpu_limit``, ``foo_cpu_requests``. Note that 'Mi' memory units for K8s
  will be submitted as 'M', if applied for ``docker run``, and cpu K8s units
  will end up with the 'm' skipped for docker as well. This is required as
  docker does not understand k8s units well.

* Tune ``kubelet_status_update_frequency`` to increase the reliability of the kubelet.
  Tune ``kube_controller_node_monitor_grace_period``,
  ``kube_controller_node_monitor_period``,
  ``kube_apiserver_pod_eviction_not_ready_timeout_seconds`` and
  ``kube_apiserver_pod_eviction_unreachable_timeout_seconds`` for better Kubernetes reliability.
  Check out [Kubernetes Reliability](/docs/advanced/kubernetes-reliability.md).

* Tune network prefix sizes. Those are ``kube_network_node_prefix``,
  ``kube_service_addresses`` and ``kube_pods_subnet``.

* Add calico_rr nodes if you are deploying with Calico or Canal. Nodes recover
  from host/network interruption much quicker with calico_rr.

* Check out the
  [Inventory](/docs/getting_started/getting-started.md#building-your-own-inventory)
  section of the Getting started guide for tips on creating a large scale
  Ansible inventory.

* Override ``etcd_events_cluster_setup: true`` to store events in a separate
  dedicated etcd instance.

For example, when deploying 200 nodes, you may want to run ansible with
``--forks=50``, ``--timeout=600`` and define ``retry_stagger: 60``.
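
As a rough sketch, such an invocation could look like the following (the inventory path is illustrative, and `retry_stagger` could equally be set in your group vars):

```ShellSession
ansible-playbook -i inventory/mycluster/hosts.yaml -b \
  --forks=50 --timeout=600 \
  -e retry_stagger=60 \
  cluster.yml
```
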

docs/operations/mirror.md (new file)
@@ -0,0 +1,66 @@

# Public Download Mirror

Public mirrors are useful for downloading public resources quickly in some areas of the world (such as China).

## Configuring Kubespray to use a mirror site

You can follow the [offline environment](offline-environment.md) guide to point the image/file download configuration to a public mirror site. If you want to download quickly in China, the configuration can look like this:

```yaml
gcr_image_repo: "gcr.m.daocloud.io"
kube_image_repo: "k8s.m.daocloud.io"
docker_image_repo: "docker.m.daocloud.io"
quay_image_repo: "quay.m.daocloud.io"
github_image_repo: "ghcr.m.daocloud.io"

files_repo: "https://files.m.daocloud.io"
```

Use mirror sites only if you trust the provider. The Kubespray team cannot verify their reliability or security.
You can replace `m.daocloud.io` with any site you want.

## Example Usage Full Steps

You can follow the full steps below to use kubespray with a mirror. For example:

Install Ansible according to the Ansible installation guide, then run the following steps:

```shell
# Copy ``inventory/sample`` as ``inventory/mycluster``
cp -rfp inventory/sample inventory/mycluster

# Update Ansible inventory file with inventory builder
declare -a IPS=(10.10.1.3 10.10.1.4 10.10.1.5)
CONFIG_FILE=inventory/mycluster/hosts.yaml python3 contrib/inventory_builder/inventory.py ${IPS[@]}

# Use the download mirror
cp inventory/mycluster/group_vars/all/offline.yml inventory/mycluster/group_vars/all/mirror.yml
sed -i -E '/# .*\{\{ files_repo/s/^# //g' inventory/mycluster/group_vars/all/mirror.yml
tee -a inventory/mycluster/group_vars/all/mirror.yml <<EOF
gcr_image_repo: "gcr.m.daocloud.io"
kube_image_repo: "k8s.m.daocloud.io"
docker_image_repo: "docker.m.daocloud.io"
quay_image_repo: "quay.m.daocloud.io"
github_image_repo: "ghcr.m.daocloud.io"
files_repo: "https://files.m.daocloud.io"
EOF

# Review and change parameters under ``inventory/mycluster/group_vars``
cat inventory/mycluster/group_vars/all/all.yml
cat inventory/mycluster/group_vars/k8s_cluster/k8s-cluster.yml

# Deploy Kubespray with Ansible Playbook - run the playbook as root
# The option `--become` is required, as for example writing SSL keys in /etc/,
# installing packages and interacting with various systemd daemons.
# Without --become the playbook will fail to run!
ansible-playbook -i inventory/mycluster/hosts.yaml --become --become-user=root cluster.yml
```

The above steps are the [README.md](../README.md) steps with the "Use the download mirror" step added.

## Community-run mirror sites

DaoCloud (China)

* [image-mirror](https://github.com/DaoCloud/public-image-mirror)
* [files-mirror](https://github.com/DaoCloud/public-binary-files-mirror)

docs/operations/nodes.md (new file)
@@ -0,0 +1,185 @@

# Adding/replacing a node

Modified from [comments in #3471](https://github.com/kubernetes-sigs/kubespray/issues/3471#issuecomment-530036084)

## Limitation: Removal of first kube_control_plane and etcd-master

Currently you can't remove the first node in your kube_control_plane and etcd-master list. If you still want to remove this node, you have to:

### 1) Change the order of the current control planes

Modify the order of your control plane list by pushing your first entry to any other position, e.g. if you want to remove `node-1` of the following example:

```yaml
children:
  kube_control_plane:
    hosts:
      node-1:
      node-2:
      node-3:
  kube_node:
    hosts:
      node-1:
      node-2:
      node-3:
  etcd:
    hosts:
      node-1:
      node-2:
      node-3:
```

change your inventory to:

```yaml
children:
  kube_control_plane:
    hosts:
      node-2:
      node-3:
      node-1:
  kube_node:
    hosts:
      node-2:
      node-3:
      node-1:
  etcd:
    hosts:
      node-2:
      node-3:
      node-1:
```

### 2) Upgrade the cluster

Run `upgrade-cluster.yml` or `cluster.yml`. Now you are good to go on with the removal.

## Adding/replacing a worker node

This should be the easiest.

### 1) Add new node to the inventory

### 2) Run `scale.yml`

You can use `--limit=NODE_NAME` to limit Kubespray to avoid disturbing other nodes in the cluster.

Before using `--limit`, run the playbook `facts.yml` without the limit to refresh the facts cache for all nodes.

### 3) Remove an old node with remove-node.yml

With the old node still in the inventory, run `remove-node.yml`. You need to pass `-e node=NODE_NAME` to the playbook to limit the execution to the node being removed.

If the node you want to remove is not online, you should add `reset_nodes=false` and `allow_ungraceful_removal=true` to your extra-vars: `-e node=NODE_NAME -e reset_nodes=false -e allow_ungraceful_removal=true`.
Use these flags even when you remove other types of nodes like a control plane or etcd nodes.
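
For example, a sketch of removing an unreachable worker (the inventory path is illustrative):

```sh
ansible-playbook -i inventory/mycluster/hosts.yaml remove-node.yml \
  -e node=NODE_NAME -e reset_nodes=false -e allow_ungraceful_removal=true
```
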
### 4) Remove the node from the inventory

That's it.

## Adding/replacing a control plane node

### 1) Run `cluster.yml`

Append the new host to the inventory and run `cluster.yml`. You can NOT use `scale.yml` for that.

### 2) Restart kube-system/nginx-proxy

On all hosts, restart the nginx-proxy pod. This pod is a local proxy for the apiserver. Kubespray will update its static config, but it needs to be restarted in order to reload.

```sh
# run on every host
docker ps | grep k8s_nginx-proxy_nginx-proxy | awk '{print $1}' | xargs docker restart

# or with containerd
crictl ps | grep nginx-proxy | awk '{print $1}' | xargs crictl stop
```

### 3) Remove old control plane nodes

With the old node still in the inventory, run `remove-node.yml`. You need to pass `-e node=NODE_NAME` to the playbook to limit the execution to the node being removed.
If the node you want to remove is not online, you should add `reset_nodes=false` and `allow_ungraceful_removal=true` to your extra-vars.

## Replacing the first control plane node

### 1) Change the control plane nodes order in the inventory

from

```ini
[kube_control_plane]
node-1
node-2
node-3
```

to

```ini
[kube_control_plane]
node-2
node-3
node-1
```

### 2) Remove the old first control plane node from the cluster

With the old node still in the inventory, run `remove-node.yml`. You need to pass `-e node=node-1` to the playbook to limit the execution to the node being removed.
If the node you want to remove is not online, you should add `reset_nodes=false` and `allow_ungraceful_removal=true` to your extra-vars.

### 3) Edit the cluster-info configmap in the kube-public namespace

`kubectl edit cm -n kube-public cluster-info`

Replace the IP of the old kube_control_plane node with the IP of a live kube_control_plane node (`server` field). Also, update the `certificate-authority-data` field if you changed the certs.

### 4) Add the new control plane node

Update the inventory (if needed).

Run `cluster.yml` with `--limit=kube_control_plane`.

## Adding an etcd node

You need to make sure there is always an odd number of etcd nodes in the cluster. As a result, this is always a replacement or scale-up operation: either add two new nodes or remove an old one.

### 1) Add the new node running cluster.yml

Update the inventory and run `cluster.yml` passing `--limit=etcd,kube_control_plane -e ignore_assert_errors=yes`.
If the node you want to add as an etcd node is already a worker or control plane node in your cluster, you have to remove it first using `remove-node.yml`.

Run `upgrade-cluster.yml` also passing `--limit=etcd,kube_control_plane -e ignore_assert_errors=yes`. This is necessary to update all etcd configuration in the cluster.
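
For example (the inventory path is illustrative):

```sh
ansible-playbook -i inventory/mycluster/hosts.yaml cluster.yml \
  --limit=etcd,kube_control_plane -e ignore_assert_errors=yes

ansible-playbook -i inventory/mycluster/hosts.yaml upgrade-cluster.yml \
  --limit=etcd,kube_control_plane -e ignore_assert_errors=yes
```
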
At this point, you will have an even number of nodes.
Everything should still be working, and you should only have problems if the cluster decides to elect a new etcd leader before you remove a node.
Even so, running applications should continue to be available.

If you add multiple etcd nodes with one run, you might want to append `-e etcd_retries=10` to increase the number of retries between each etcd node join.
Otherwise the etcd cluster might still be processing the first join and fail on subsequent nodes. `etcd_retries=10` might work to join 3 new nodes.

### 2) Add the new node to the apiserver config

On every control plane node, edit `/etc/kubernetes/manifests/kube-apiserver.yaml`. Make sure the new etcd nodes are present in the apiserver command line parameter `--etcd-servers=...`.

## Removing an etcd node

### 1) Remove an old etcd node

With the node still in the inventory, run `remove-node.yml` passing `-e node=NODE_NAME` as the name of the node that should be removed.
If the node you want to remove is not online, you should add `reset_nodes=false` and `allow_ungraceful_removal=true` to your extra-vars.

### 2) Make sure only the remaining nodes are in your inventory

Remove `NODE_NAME` from your inventory file.

### 3) Update the kubernetes and network configuration files with the valid list of etcd members

Run `cluster.yml` to regenerate the configuration files on all remaining nodes.

### 4) Remove the old etcd node from the apiserver config

On every control plane node, edit `/etc/kubernetes/manifests/kube-apiserver.yaml`. Make sure only active etcd nodes are still present in the apiserver command line parameter `--etcd-servers=...`.

### 5) Shutdown the old instance

That's it.

docs/operations/offline-environment.md (new file)
@@ -0,0 +1,152 @@

# Offline environment

In case your servers don't have direct access to the internet (for example
when deploying on premises with security constraints), you need to get the
following artifacts in advance from another environment that has access to the internet:

* Some static files (zips and binaries)
* OS packages (rpm/deb files)
* Container images used by Kubespray. The exhaustive list depends on your setup
* [Optional] Python packages used by Kubespray (only required if your OS doesn't provide all python packages/versions
  listed in `requirements.txt`)
* [Optional] Helm chart files (only required if `helm_enabled=true`)

Then you need to set up the following services in your offline environment:

* an HTTP reverse proxy/cache/mirror to serve some static files (zips and binaries)
* an internal Yum/Deb repository for OS packages
* an internal container image registry that needs to be populated with all container images used by Kubespray
* [Optional] an internal PyPi server for python packages used by Kubespray
* [Optional] an internal Helm registry for Helm chart files

You can get the artifact lists with the [generate_list.sh](/contrib/offline/generate_list.sh) script.
In addition, you can find some tools for offline deployment under [contrib/offline](/contrib/offline/README.md).

## Configure Inventory

Once all artifacts are accessible from your internal network, **adjust** the following variables
in [your inventory](/inventory/sample/group_vars/all/offline.yml) to match your environment:

```yaml
# Registry overrides
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"
github_image_repo: "{{ registry_host }}"

kubeadm_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubeadm"
kubectl_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubectl"
kubelet_download_url: "{{ files_repo }}/kubernetes/{{ kube_version }}/kubelet"
# etcd is optional if you **DON'T** use etcd_deployment=host
etcd_download_url: "{{ files_repo }}/kubernetes/etcd/etcd-{{ etcd_version }}-linux-{{ image_arch }}.tar.gz"
cni_download_url: "{{ files_repo }}/kubernetes/cni/cni-plugins-linux-{{ image_arch }}-{{ cni_version }}.tgz"
crictl_download_url: "{{ files_repo }}/kubernetes/cri-tools/crictl-{{ crictl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"
# If using Calico
calicoctl_download_url: "{{ files_repo }}/kubernetes/calico/{{ calico_ctl_version }}/calicoctl-linux-{{ image_arch }}"
# If using Calico with kdd
calico_crds_download_url: "{{ files_repo }}/kubernetes/calico/{{ calico_version }}.tar.gz"
# Containerd
containerd_download_url: "{{ files_repo }}/containerd-{{ containerd_version }}-linux-{{ image_arch }}.tar.gz"
runc_download_url: "{{ files_repo }}/runc.{{ image_arch }}"
nerdctl_download_url: "{{ files_repo }}/nerdctl-{{ nerdctl_version }}-{{ ansible_system | lower }}-{{ image_arch }}.tar.gz"
# Insecure registries for containerd
containerd_registries_mirrors:
  - prefix: "{{ registry_addr }}"
    mirrors:
      - host: "{{ registry_host }}"
        capabilities: ["pull", "resolve"]
        skip_verify: true

# CentOS/Redhat/AlmaLinux/Rocky Linux
## Docker / Containerd
docker_rh_repo_base_url: "{{ yum_repo }}/docker-ce/$releasever/$basearch"
docker_rh_repo_gpgkey: "{{ yum_repo }}/docker-ce/gpg"

# Fedora
## Docker
docker_fedora_repo_base_url: "{{ yum_repo }}/docker-ce/{{ ansible_distribution_major_version }}/{{ ansible_architecture }}"
docker_fedora_repo_gpgkey: "{{ yum_repo }}/docker-ce/gpg"
## Containerd
containerd_fedora_repo_base_url: "{{ yum_repo }}/containerd"
containerd_fedora_repo_gpgkey: "{{ yum_repo }}/docker-ce/gpg"

# Debian
## Docker
docker_debian_repo_base_url: "{{ debian_repo }}/docker-ce"
docker_debian_repo_gpgkey: "{{ debian_repo }}/docker-ce/gpg"
## Containerd
containerd_debian_repo_base_url: "{{ ubuntu_repo }}/containerd"
containerd_debian_repo_gpgkey: "{{ ubuntu_repo }}/containerd/gpg"
containerd_debian_repo_repokey: 'YOURREPOKEY'

# Ubuntu
## Docker
docker_ubuntu_repo_base_url: "{{ ubuntu_repo }}/docker-ce"
docker_ubuntu_repo_gpgkey: "{{ ubuntu_repo }}/docker-ce/gpg"
## Containerd
containerd_ubuntu_repo_base_url: "{{ ubuntu_repo }}/containerd"
containerd_ubuntu_repo_gpgkey: "{{ ubuntu_repo }}/containerd/gpg"
containerd_ubuntu_repo_repokey: 'YOURREPOKEY'
```

For the OS specific settings, just define the ones matching your OS.
If you use settings like the ones above, you'll need to define the following variables in your inventory:

* `registry_host`: Container image registry. If you _don't_ use the same repository path for the container images as
  the ones defined
  in [kubespray-defaults's role defaults](https://github.com/kubernetes-sigs/kubespray/blob/master/roles/kubespray-defaults/defaults/main/download.yml),
  you need to override the `*_image_repo` for these container images. If you want to make your life easier, use the
  same repository path: you won't have to override anything else.
* `registry_addr`: Container image registry, but only the [domain or ip]:[port] part.
* `files_repo`: HTTP webserver or reverse proxy that is able to serve the files listed above. The path is not important; you
  can store them anywhere as long as it's accessible by kubespray. It's recommended to use `*_version` in the path so
  that you don't need to modify this setting every time kubespray upgrades one of these components.
* `yum_repo`/`debian_repo`/`ubuntu_repo`: OS package repository depending on your OS, should point to your internal
  repository. Adjust the path accordingly.

## Install Kubespray Python Packages

### Recommended way: Kubespray Container Image

The easiest way is to use the [kubespray container image](https://quay.io/kubespray/kubespray), as all the required packages
are baked into the image.
Just copy the container image to your private container image registry and you are all set!

### Manual installation

Look at the `requirements.txt` file and check if your OS provides all the packages out-of-the-box (using the OS package
manager). For those missing, you need to either use a proxy that has Internet access (typically from a DMZ) or set up a
PyPi server in your network that will host these packages.

If you're using an HTTP(S) proxy to download your python packages:

```bash
sudo pip install --proxy=https://[username:password@]proxyserver:port -r requirements.txt
```

When using an internal PyPi server:

```bash
# If you host all required packages
pip install -i https://pypiserver/pypi -r requirements.txt

# If you only need the ones missing from the OS package manager
pip install -i https://pypiserver/pypi package_you_miss
```

## Run Kubespray as usual

Once all artifacts are in place and your inventory is properly set up, you can run kubespray with the
regular `cluster.yml` command:

```bash
ansible-playbook -i inventory/my_airgap_cluster/hosts.yaml -b cluster.yml
```

If you use the [Kubespray Container Image](#recommended-way-kubespray-container-image), you can mount your inventory inside
the container:

```bash
docker run --rm -it -v path_to_inventory/my_airgap_cluster:inventory/my_airgap_cluster myprivateregistry.com/kubespray/kubespray:v2.14.0 ansible-playbook -i inventory/my_airgap_cluster/hosts.yaml -b cluster.yml
```

docs/operations/port-requirements.md (new file)
@@ -0,0 +1,70 @@

# Port Requirements

To operate properly, Kubespray requires some ports to be opened. If the network is configured with firewall rules, you need to ensure that infrastructure components can communicate with each other through specific ports.

Ensure the following ports required by Kubespray are open on the network and configured to allow access between hosts. Some ports are optional depending on the configuration and usage.

## Kubernetes

### Control plane

| Protocol | Port  | Description |
|----------|-------|-------------|
| TCP      | 22    | ssh for ansible |
| TCP      | 2379  | etcd client port |
| TCP      | 2380  | etcd peer port |
| TCP      | 6443  | kubernetes api |
| TCP      | 10250 | kubelet api |
| TCP      | 10257 | kube-controller-manager |
| TCP      | 10259 | kube-scheduler |

### Worker node(s)

| Protocol | Port        | Description |
|----------|-------------|-------------|
| TCP      | 22          | ssh for ansible |
| TCP      | 10250       | kubelet api |
| TCP      | 30000-32767 | kube nodePort range |

Refer to: [Kubernetes Docs](https://kubernetes.io/docs/reference/networking/ports-and-protocols/)
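
As an illustration only (adapt to your own firewall tooling), the control plane ports above could be opened with firewalld like this:

```ShellSession
sudo firewall-cmd --permanent --add-port=6443/tcp --add-port=2379-2380/tcp
sudo firewall-cmd --permanent --add-port=10250/tcp --add-port=10257/tcp --add-port=10259/tcp
sudo firewall-cmd --reload
```
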
## Calico

If Calico is used, it requires:

| Protocol       | Port  | Description |
|----------------|-------|-------------|
| TCP            | 179   | Calico networking (BGP) |
| UDP            | 4789  | Calico CNI with VXLAN enabled |
| TCP            | 5473  | Calico CNI with Typha enabled |
| UDP            | 51820 | Calico with IPv4 Wireguard enabled |
| UDP            | 51821 | Calico with IPv6 Wireguard enabled |
| IPENCAP / IPIP | -     | Calico CNI with IPIP enabled |

Refer to: [Calico Docs](https://docs.tigera.io/calico/latest/getting-started/kubernetes/requirements#network-requirements)

## Cilium

If Cilium is used, it requires:

| Protocol | Port  | Description |
|----------|-------|-------------|
| TCP      | 4240  | Cilium Health checks (``cilium-health``) |
| TCP      | 4244  | Hubble server |
| TCP      | 4245  | Hubble Relay |
| UDP      | 8472  | VXLAN overlay |
| TCP      | 9962  | Cilium-agent Prometheus metrics |
| TCP      | 9963  | Cilium-operator Prometheus metrics |
| TCP      | 9964  | Cilium-proxy Prometheus metrics |
| UDP      | 51871 | WireGuard encryption tunnel endpoint |
| ICMP     | -     | health checks |

Refer to: [Cilium Docs](https://docs.cilium.io/en/v1.13/operations/system_requirements/)

## Addons

| Protocol | Port | Description |
|----------|------|-------------|
| TCP      | 9100 | node exporter |
| TCP/UDP  | 7472 | metallb metrics ports |
| TCP/UDP  | 7946 | metallb L2 operating mode |

docs/operations/recover-control-plane.md (new file)
@@ -0,0 +1,41 @@

# Recovering the control plane

To recover from broken nodes in the control plane, use the `recover-control-plane.yml` playbook.

Examples of what broken means in this context:

* One or more bare metal node(s) suffer from unrecoverable hardware failure
* One or more node(s) fail during patching or upgrading
* Etcd database corruption
* Other node related failures leaving your control plane degraded or nonfunctional

__Note that you need at least one functional node to be able to recover using this method.__

## Runbook

* Back up what you can
* Provision new nodes to replace the broken ones
* Copy any broken etcd nodes into the `broken_etcd` group, and make sure the `etcd_member_name` variable is set
* Copy any broken control plane nodes into the `broken_kube_control_plane` group
* Place the surviving nodes of the control plane first in the `etcd` and `kube_control_plane` groups
* Add the new nodes below the surviving control plane nodes in the `etcd` and `kube_control_plane` groups

Then run the playbook with `--limit etcd,kube_control_plane` and increase the number of etcd retries by setting `-e etcd_retries=10` or something even larger. The number of retries required is difficult to predict.
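
As a sketch (the inventory path is illustrative):

```ShellSession
ansible-playbook -i inventory/mycluster/hosts.yaml recover-control-plane.yml \
  --limit etcd,kube_control_plane -e etcd_retries=10
```
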
When finished you should have a fully working control plane again.

## Recover from lost quorum

The playbook attempts to figure out if the etcd quorum is intact. If quorum is lost, it will attempt to take a snapshot from the first node in the `etcd` group and restore from that. If you would like to restore from an alternate snapshot, set the path to that snapshot in the `etcd_snapshot` variable:

`-e etcd_snapshot=/tmp/etcd_snapshot`

## Caveats

* The playbook has only been tested with fairly small etcd databases.
* There may be disruptions while running the playbook.
* There are absolutely no guarantees.

If possible, try to break a cluster in the same way that your target cluster is broken and test recovering that before trying on the real target cluster.

docs/operations/upgrades.md (new file)
@@ -0,0 +1,441 @@

# Upgrading Kubernetes in Kubespray
|
||||
|
||||
Kubespray handles upgrades the same way it handles initial deployment. That is to
|
||||
say that each component is laid down in a fixed order.
|
||||
|
||||
You can also individually control versions of components by explicitly defining their
|
||||
versions. Here are all version vars for each component:
|
||||
|
||||
* docker_version
|
||||
* docker_containerd_version (relevant when `container_manager` == `docker`)
|
||||
* containerd_version (relevant when `container_manager` == `containerd`)
|
||||
* kube_version
|
||||
* etcd_version
|
||||
* calico_version
|
||||
* calico_cni_version
|
||||
* weave_version
|
||||
* flannel_version
|
||||
* kubedns_version
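
For illustration, a single component version can be pinned on the command line. This is only a sketch: `vX.Y.Z` is a placeholder, and the version you choose should be one supported by your Kubespray release (otherwise matching download checksums typically have to be provided as well).

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e etcd_version=vX.Y.Z
```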

> **Warning**
> [Attempting to upgrade from an older release straight to the latest release is unsupported and likely to break something](https://github.com/kubernetes-sigs/kubespray/issues/3849#issuecomment-451386515)

See [Multiple Upgrades](#multiple-upgrades) for how to upgrade from an older Kubespray release to the latest release.

## Unsafe upgrade example

If you wanted to upgrade just kube_version from v1.18.10 to v1.19.7, you could
deploy the following way:

```ShellSession
ansible-playbook cluster.yml -i inventory/sample/hosts.ini -e kube_version=v1.18.10 -e upgrade_cluster_setup=true
```

And then repeat with v1.19.7 as kube_version:

```ShellSession
ansible-playbook cluster.yml -i inventory/sample/hosts.ini -e kube_version=v1.19.7 -e upgrade_cluster_setup=true
```

The var `-e upgrade_cluster_setup=true` needs to be set in order to migrate the deploys of e.g. kube-apiserver inside the cluster immediately, which is usually only done during a graceful upgrade. (Refer to [#4139](https://github.com/kubernetes-sigs/kubespray/issues/4139) and [#4736](https://github.com/kubernetes-sigs/kubespray/issues/4736))

## Graceful upgrade

Kubespray also supports cordoning, draining and uncordoning of nodes when performing
a cluster upgrade. There is a separate playbook used for this purpose. It is
important to note that upgrade-cluster.yml can only be used for upgrading an
existing cluster. That means there must be at least 1 kube_control_plane already
deployed.

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.19.7
```

After a successful upgrade, the Server Version should be updated:

```ShellSession
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:23:52Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.7", GitCommit:"1dd5338295409edcfff11505e7bb246f0d325d15", GitTreeState:"clean", BuildDate:"2021-01-13T13:15:20Z", GoVersion:"go1.15.5", Compiler:"gc", Platform:"linux/amd64"}
```

You can control how many nodes are upgraded at the same time by modifying the ansible variable named `serial`, as explained [here](https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_strategies.html#setting-the-batch-size-with-serial). If you don't set this variable, it will upgrade the cluster nodes in batches of 20% of the available nodes. Setting `serial=1` would mean upgrading one node at a time.

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.20.7 -e "serial=1"
```

### Pausing the upgrade

If you want to manually control the upgrade procedure, you can set some variables to pause the upgrade playbook. Pausing *before* upgrading each node may be useful for inspecting pods running on that node, or for performing manual actions on the node:

* `upgrade_node_confirm: true` - This will pause the playbook execution prior to upgrading each node. The play will resume when manually approved by typing "yes" at the terminal.
* `upgrade_node_pause_seconds: 60` - This will pause the playbook execution for 60 seconds prior to upgrading each node. The play will resume automatically after 60 seconds.

Pausing *after* upgrading each node may be useful for rebooting the node to apply kernel updates, or for testing the still-cordoned node:

* `upgrade_node_post_upgrade_confirm: true` - This will pause the playbook execution after upgrading each node, but before the node is uncordoned. The play will resume when manually approved by typing "yes" at the terminal.
* `upgrade_node_post_upgrade_pause_seconds: 60` - This will pause the playbook execution for 60 seconds after upgrading each node, but before the node is uncordoned. The play will resume automatically after 60 seconds.
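
These variables can be set in your inventory group vars or, as a rough sketch, passed straight on the command line together with the flags you already use:

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.20.7 -e upgrade_node_confirm=true
```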

## Node-based upgrade

If you don't want to upgrade all nodes in one run, you can use `--limit` [patterns](https://docs.ansible.com/ansible/latest/user_guide/intro_patterns.html#patterns-and-ansible-playbook-flags).

Before using `--limit`, run the `facts.yml` playbook without the limit to refresh the facts cache for all nodes:

```ShellSession
ansible-playbook facts.yml -b -i inventory/sample/hosts.ini
```

After this, upgrade the control plane and etcd groups ([#5147](https://github.com/kubernetes-sigs/kubespray/issues/5147)):

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.20.7 --limit "kube_control_plane:etcd"
```

Now you can upgrade other nodes in any order and quantity:

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.20.7 --limit "node4:node6:node7:node12"
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e kube_version=v1.20.7 --limit "node5*"
```

## Multiple upgrades

> **Warning**
> [Do not skip minor releases (patch releases are ok) when upgrading--upgrade by one tag at a
> time.](https://github.com/kubernetes-sigs/kubespray/issues/3849#issuecomment-451386515)

For instance, given the tag list:

```console
$ git tag
v2.20.0
v2.21.0
v2.22.0
v2.22.1
v2.23.0
v2.23.1
v2.23.2
v2.24.0
...
```

v2.22.0 -> v2.23.2 -> v2.24.0 : ✓
v2.22.0 -> v2.24.0 : ✕

Assuming you don't explicitly define a Kubernetes version in your k8s_cluster.yml, you can simply check out the next tag and run the upgrade-cluster.yml playbook.

* If you do define a Kubernetes version in your inventory (e.g. group_vars/k8s_cluster.yml), then either make sure to update it before running upgrade-cluster, or specify the new version you're upgrading to: `ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml -e kube_version=v1.11.3`

Otherwise, the upgrade will leave your cluster at the same k8s version defined in your inventory vars.

The below example shows taking a cluster that was set up for v2.6.0 up to v2.10.0.

```ShellSession
$ kubectl get node
NAME      STATUS   ROLES         AGE   VERSION
apollo    Ready    master,node   1h    v1.10.4
boomer    Ready    master,node   42m   v1.10.4
caprica   Ready    master,node   42m   v1.10.4

$ git describe --tags
v2.6.0

$ git tag
...
v2.6.0
v2.7.0
v2.8.0
v2.8.1
v2.8.2
...

$ git checkout v2.7.0
Previous HEAD position was 8b3ce6e4 bump upgrade tests to v2.5.0 commit (#3087)
HEAD is now at 05dabb7e Fix Bionic networking restart error #3430 (#3431)

# NOTE: May need to `pip3 install -r requirements.txt` when upgrading.

ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml

...

$ kubectl get node
NAME      STATUS   ROLES         AGE   VERSION
apollo    Ready    master,node   1h    v1.11.3
boomer    Ready    master,node   1h    v1.11.3
caprica   Ready    master,node   1h    v1.11.3

$ git checkout v2.8.0
Previous HEAD position was 05dabb7e Fix Bionic networking restart error #3430 (#3431)
HEAD is now at 9051aa52 Fix ubuntu-contiv test failed (#3808)
```

> **Note**
> Review changes between the sample inventory and your inventory when upgrading versions.

There are some deprecations between versions that mean you can't just upgrade straight from 2.7.0 to 2.8.0 if you started with the sample inventory.

In this case, I set "kubeadm_enabled" to false, knowing that it is deprecated and removed by 2.9.0, to delay converting the cluster to kubeadm as long as I could.

```ShellSession
$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE    VERSION
apollo    Ready    master,node   114m   v1.12.3
boomer    Ready    master,node   114m   v1.12.3
caprica   Ready    master,node   114m   v1.12.3

$ git checkout v2.8.1
Previous HEAD position was 9051aa52 Fix ubuntu-contiv test failed (#3808)
HEAD is now at 2ac1c756 More Feature/2.8 backports for 2.8.1 (#3911)

$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   2h36m   v1.12.4
boomer    Ready    master,node   2h36m   v1.12.4
caprica   Ready    master,node   2h36m   v1.12.4

$ git checkout v2.8.2
Previous HEAD position was 2ac1c756 More Feature/2.8 backports for 2.8.1 (#3911)
HEAD is now at 4167807f Upgrade to 1.12.5 (#4066)

$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE    VERSION
apollo    Ready    master,node   3h3m   v1.12.5
boomer    Ready    master,node   3h3m   v1.12.5
caprica   Ready    master,node   3h3m   v1.12.5

$ git checkout v2.8.3
Previous HEAD position was 4167807f Upgrade to 1.12.5 (#4066)
HEAD is now at ea41fc5e backport cve-2019-5736 to release-2.8 (#4234)

$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   5h18m   v1.12.5
boomer    Ready    master,node   5h18m   v1.12.5
caprica   Ready    master,node   5h18m   v1.12.5

$ git checkout v2.8.4
Previous HEAD position was ea41fc5e backport cve-2019-5736 to release-2.8 (#4234)
HEAD is now at 3901480b go to k8s 1.12.7 (#4400)

$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   5h37m   v1.12.7
boomer    Ready    master,node   5h37m   v1.12.7
caprica   Ready    master,node   5h37m   v1.12.7

$ git checkout v2.8.5
Previous HEAD position was 3901480b go to k8s 1.12.7 (#4400)
HEAD is now at 6f97687d Release 2.8 robust san handling (#4478)

$ ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml
...
"msg": "DEPRECATION: non-kubeadm deployment is deprecated from v2.9. Will be removed in next release."
...
Are you sure you want to deploy cluster using the deprecated non-kubeadm mode. (output is hidden):
yes
...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   5h45m   v1.12.7
boomer    Ready    master,node   5h45m   v1.12.7
caprica   Ready    master,node   5h45m   v1.12.7

$ git checkout v2.9.0
Previous HEAD position was 6f97687d Release 2.8 robust san handling (#4478)
HEAD is now at a4e65c7c Upgrade to Ansible >2.7.0 (#4471)
```

> **Warning**
> IMPORTANT: Some variable formats changed in the k8s_cluster.yml between 2.8.5 and 2.9.0

If you do not keep your inventory copy up to date, **your upgrade will fail** and your first master will be left non-functional until fixed and re-run.

It is at this point that the cluster was upgraded from non-kubeadm to kubeadm as per the deprecation warning.

```ShellSession
ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml

...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   6h54m   v1.13.5
boomer    Ready    master,node   6h55m   v1.13.5
caprica   Ready    master,node   6h54m   v1.13.5

# Watch out: 2.10.0 is hiding between 2.1.2 and 2.2.0

$ git tag
...
v2.1.0
v2.1.1
v2.1.2
v2.10.0
v2.2.0
...

$ git checkout v2.10.0
Previous HEAD position was a4e65c7c Upgrade to Ansible >2.7.0 (#4471)
HEAD is now at dcd9c950 Add etcd role dependency on kube user to avoid etcd role failure when running scale.yml with a fresh node. (#3240) (#4479)

ansible-playbook -i inventory/mycluster/hosts.ini -b upgrade-cluster.yml

...

$ kubectl get node
NAME      STATUS   ROLES         AGE     VERSION
apollo    Ready    master,node   7h40m   v1.14.1
boomer    Ready    master,node   7h40m   v1.14.1
caprica   Ready    master,node   7h40m   v1.14.1
```

## Upgrading to v2.19

`etcd_kubeadm_enabled` is deprecated as of v2.19. The same functionality is achievable by setting `etcd_deployment_type` to `kubeadm`.
Deploying etcd using kubeadm is experimental and is only available for new deployments, or for deployments where `etcd_kubeadm_enabled` was set to `true` while deploying the cluster.

From 2.19 onward, the `etcd_deployment_type` variable will be placed in `group_vars/all/etcd.yml` instead of `group_vars/etcd.yml`, due to scope issues.
The placement of the variable is only important for `etcd_deployment_type: kubeadm` right now. However, since this might change in future updates, it is recommended to move the variable.

Upgrading is straightforward; no changes are required if `etcd_kubeadm_enabled` was not set to `true` when deploying.

If you have a cluster where etcd was deployed using kubeadm, you will need to remove the `etcd_kubeadm_enabled` variable. Then move the `etcd_deployment_type` variable from `group_vars/etcd.yml` to `group_vars/all/etcd.yml` and set it to `kubeadm`.
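
A minimal sketch of that move, assuming the standard `inventory/mycluster/` layout and that both variables currently live in `group_vars/etcd.yml` (the paths are hypothetical; adjust them to wherever the variables are actually defined in your inventory):

```ShellSession
# hypothetical inventory paths; adjust to your own layout
echo 'etcd_deployment_type: kubeadm' >> inventory/mycluster/group_vars/all/etcd.yml
sed -i '/^etcd_deployment_type:/d' inventory/mycluster/group_vars/etcd.yml
sed -i '/^etcd_kubeadm_enabled:/d' inventory/mycluster/group_vars/etcd.yml   # or wherever it was set
```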

## Upgrade order

As mentioned above, components are upgraded in the order in which they were
installed in the Ansible playbook. The order of component installation is as
follows:

* Docker
* Containerd
* etcd
* kubelet and kube-proxy
* network_plugin (such as Calico or Weave)
* kube-apiserver, kube-scheduler, and kube-controller-manager
* Add-ons (such as KubeDNS)

### Component-based upgrades

A deployer may want to upgrade specific components in order to minimize risk
or save time. This strategy is not covered by CI as of this writing, so it is
not guaranteed to work.

These commands are useful only for upgrading fully-deployed, healthy, existing
hosts. They will definitely not work for undeployed or partially deployed
hosts.

Upgrade docker:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=docker
```

Upgrade etcd:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=etcd
```

Upgrade etcd without rotating etcd certs:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=etcd --limit=etcd --skip-tags=etcd-secrets
```

Upgrade kubelet:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=node --skip-tags=k8s-gen-certs,k8s-gen-tokens
```

Upgrade Kubernetes master components:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=master
```

Upgrade network plugins:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=network
```

Upgrade all add-ons:

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=apps
```

Upgrade just helm (assuming `helm_enabled` is true):

```ShellSession
ansible-playbook -b -i inventory/sample/hosts.ini cluster.yml --tags=helm
```

## Migrate from Docker to Containerd

Please note that **migrating container engines is not officially supported by Kubespray**. While this procedure can be used to migrate your cluster, it applies to one particular scenario and will likely evolve over time. At the moment, these steps are intended as an additional resource to provide insight into how they could eventually be integrated into the Kubespray playbooks.

As of Kubespray 2.18.0, containerd is already the default container engine. If you have the chance, it is advisable and safer to reset and redeploy the entire cluster with a new container engine.

* [Migrating from Docker to Containerd](upgrades/migrate_docker2containerd.md)

## System upgrade

If you want to upgrade the APT or YUM packages while the nodes are cordoned, you can use:

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e system_upgrade=true
```

Nodes will be rebooted when there are package upgrades (`system_upgrade_reboot: on-upgrade`).
This can be changed to `always` or `never`.
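
For instance, to run the system upgrade without any reboots, the reboot behaviour can be overridden on the command line (a sketch only; combine it with whatever other flags you normally pass):

```ShellSession
ansible-playbook upgrade-cluster.yml -b -i inventory/sample/hosts.ini -e system_upgrade=true -e system_upgrade_reboot=never
```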

Note: Downloads will happen twice unless `system_upgrade_reboot` is `never`.