Kubernetes Reliability Improvements

- Exclude kubelet CPU/RAM (kube-reserved) from cgroup. It decreases a
  chance of overcommitment
- Add a possibility to modify Kubelet node-status-update-frequency
- Add a posibility to configure node-monitor-grace-period,
  node-monitor-period, pod-eviction-timeout for Kubernetes controller
  manager
- Add Kubernetes Relaibility Documentation with recomendations for
  various scenarios.

Signed-off-by: Sergii Golovatiuk <sgolovatiuk@mirantis.com>
This commit is contained in:
Sergii Golovatiuk
2017-02-07 15:01:02 +01:00
parent ef10ce04e2
commit c07d60bc90
6 changed files with 126 additions and 6 deletions

View File

@@ -3,7 +3,8 @@ Large deployments of K8s
For a large scaled deployments, consider the following configuration changes:
* Tune [ansible settings](http://docs.ansible.com/ansible/intro_configuration.html)
* Tune [ansible settings]
(http://docs.ansible.com/ansible/intro_configuration.html)
for `forks` and `timeout` vars to fit large numbers of nodes being deployed.
* Override containers' `foo_image_repo` vars to point to intranet registry.
@@ -23,9 +24,15 @@ For a large scaled deployments, consider the following configuration changes:
* Tune CPU/memory limits and requests. Those are located in roles' defaults
and named like ``foo_memory_limit``, ``foo_memory_requests`` and
``foo_cpu_limit``, ``foo_cpu_requests``. Note that 'Mi' memory units for K8s
will be submitted as 'M', if applied for ``docker run``, and cpu K8s units will
end up with the 'm' skipped for docker as well. This is required as docker does not
understand k8s units well.
will be submitted as 'M', if applied for ``docker run``, and cpu K8s units
will end up with the 'm' skipped for docker as well. This is required as
docker does not understand k8s units well.
* Tune ``kubelet_status_update_frequency`` to increase reliability of kubelet.
``kube_controller_node_monitor_grace_period``,
``kube_controller_node_monitor_period``,
``kube_controller_pod_eviction_timeout`` for better Kubernetes reliability.
Check out [Kubernetes Reliability](kubernetes-reliability.md)
* Add calico-rr nodes if you are deploying with Calico or Canal. Nodes recover
from host/network interruption much quicker with calico-rr. Note that
@@ -33,7 +40,7 @@ For a large scaled deployments, consider the following configuration changes:
etcd role is okay).
* Check out the
[Inventory](https://github.com/kubernetes-incubator/kargo/blob/master/docs/getting-started.md#building-your-own-inventory)
[Inventory](getting-started.md#building-your-own-inventory)
section of the Getting started guide for tips on creating a large scale
Ansible inventory.