Kubernetes (1.8.3) Series: Integrating Calico

Posted on 2018-01-04 (Thursday) 14:50 in Data

Background: why integrate Calico

The current working environment is a deep-learning cluster of tensorflow-1.4.1-compile-py36 Docker images managed by k8s (1.8.3). It is a cluster in name, but day-to-day jobs still run as scripts on a single machine, so GPU utilization is low. After some research, we settled on openmpi-2.1.2 + nccl2 + nccl_2.1.2-1+cuda8.0_x86_64 to parallelize GPU computation. Deploying this onto the K8s cluster hit a problem: pods started on the same node run the job successfully, but pods on different nodes fail with the following error:

Warning: Permanently added '10.233.22.6' (ECDSA) to the list of known hosts.
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          keras-chun-185-1-779b9d7b69-9kd9v
  Local PID:           467
  Peer hostname:       keras-chun-189-6769c56b7-pt9pm ([[33290,1],1])
  Source IP of socket: 10.233.22.0
  Known IPs of peer:   ae9:1606::ae9:1606::
--------------------------------------------------------------------------
[keras-chun-185-1-779b9d7b69-9kd9v:00461] 1 more process has sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[keras-chun-185-1-779b9d7b69-9kd9v:00461] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

In the end, after consulting the Open MPI documentation, the conclusion was that the peers need to be on the same subnet. The cluster's underlying network, however, was flannel: every node is assigned a different subnet, from which the docker0 bridge hands out each pod's IP. So hosts and pods can all reach one another, but they are not actually on one subnet. Swapping in a network component that puts everything on the same subnet therefore became the focus of the work.
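
You can see the per-node subnetting directly on any flannel node: each host draws a different /24 from the cluster network, and docker0 is configured from it. A quick check (paths assume a standard flanneld setup; the values shown are illustrative):

cat /run/flannel/subnet.env
# FLANNEL_NETWORK=10.233.0.0/16
# FLANNEL_SUBNET=10.233.22.1/24    <- this line differs on every node
ip addr show docker0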

PS: there is an example on GitHub that gets tensorflow-mnist running over flannel; see it for reference.

K8s exposes the CNI network model: any component that implements this interface can be integrated into a k8s cluster, and many open-source components already support it. Common choices are Flannel, Open vSwitch, direct routing, and Calico. Since Flannel gives every node its own subnet and Open vSwitch is rather complex, the final choice was Calico.

Common open-source network components

Flannel

Flannel can provide the underlying network that Kubernetes depends on because it accomplishes two things:

(1) It helps Kubernetes assign IP addresses to the Docker containers on each Node without conflicts.

(2) It builds an overlay network between these IP addresses, over which packets are delivered unchanged to the target container.

Open vSwitch

Open vSwitch is an open-source virtual switch, a bit like the Linux bridge but much more capable. An OVS bridge can directly establish several kinds of communication channels (tunnels), for example Open vSwitch with GRE/VxLAN, and these tunnels are easy to set up with OVS configuration commands. In Kubernetes/Docker scenarios we mainly build L3-to-L3 tunnels.
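
Purely to illustrate how lightweight this is (OVS is not used in this deployment; the bridge name and peer IP are placeholders), a GRE tunnel takes two commands:

ovs-vsctl add-br br0
ovs-vsctl add-port br0 gre0 -- set interface gre0 type=gre options:remote_ip=192.168.1.189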

Calico

Calico is a virtual networking tool built on the BGP protocol. Virtual machines, containers, and bare-metal machines in a data center (all called workloads here) need only one IP address each to interconnect through Calico.

Project homepage: https://www.projectcalico.org/

Network isolation between workloads is implemented with iptables. Compared with other networks that emulate layer 2, Calico is simpler and has several interesting properties:

  • 1. Packets in Calico are never encapsulated or decapsulated.

  • 2. As long as policy allows it, a packet can travel between workloads of different tenants, leave directly for the Internet, or enter the Calico network from the Internet; there is no need, as in overlay solutions, for packets to pass through designated nodes that rewrite certain attributes.

  • 3. Because data travels directly over layer 3, troubleshooting is easier, and ordinary tools such as ping and Wireshark can be used for operation and management as-is, with no de-encapsulation to think about.

  • 4. Network security policy is defined with ACLs and implemented on iptables, which is more intuitive and easier to operate than the elaborate machinery of overlay solutions.

Integrating Calico step by step

Versions

  • centos 7.4
  • kubernetes 1.8.3
  • calico 2.6.5

Removing the flannel component

  • On every node:

    # stop and disable the flanneld service
    systemctl stop flanneld
    systemctl status flanneld
    systemctl disable flanneld
    

    Stopping flanneld also takes Docker down, so start Docker again:

    systemctl start docker
    
  • On one of the nodes that runs etcd:

    # remove flannel's network config from etcd
    etcdctl --endpoints=https://192.168.1.184:2379 \
            --cert-file=/etc/kubernetes/ssl/etcd.pem \
            --ca-file=/etc/kubernetes/ssl/ca.pem \
            --key-file=/etc/kubernetes/ssl/etcd-key.pem \
            rm /kubernetes/network/config
    
    # restart etcd (optional)
    systemctl restart etcd
    systemctl status etcd
    
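    An optional sanity check that the key is really gone (etcdctl v2 syntax, same TLS flags):

    etcdctl --endpoints=https://192.168.1.184:2379 \
            --cert-file=/etc/kubernetes/ssl/etcd.pem \
            --ca-file=/etc/kubernetes/ssl/ca.pem \
            --key-file=/etc/kubernetes/ssl/etcd-key.pem \
            get /kubernetes/network/config
    # expect 'Error: 100: Key not found' once the config has been removed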

Installing and configuring Calico (on the node with kubectl installed)

1. Prerequisites

Calico can run on any Kubernetes cluster which meets the following criteria:

  • The kubelet must be configured to use CNI network plugins (e.g. --network-plugin=cni).
  • The kube-proxy must be started in iptables proxy mode. This is the default as of Kubernetes v1.2.0.
  • The kube-proxy must be started without the --masquerade-all flag, which conflicts with Calico policy.
  • The Kubernetes NetworkPolicy API requires at least Kubernetes version v1.3.0.
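
A quick way to check these criteria on an existing node (the unit path is an assumption based on this cluster's systemd setup):

grep -- '--network-plugin' /etc/systemd/system/kubelet.service
ps -ef | grep [k]ube-proxy   # iptables mode is the default; make sure --masquerade-all is absent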

2. Download and modify the Calico configuration

# fetch the manifests
mkdir -p /etc/kubernetes/calico && cd /etc/kubernetes/calico
wget http://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/hosted/calico.yaml
wget http://docs.projectcalico.org/v2.6/getting-started/kubernetes/installation/rbac.yaml

# modify the configuration (etcd is served over https)

# 1. set the cluster's etcd address
etcd_endpoints: "https://192.168.1.184:2379"

# 2. TLS authentication is in use, so enable these
etcd_ca: "/calico-secrets/etcd-ca"
etcd_cert: "/calico-secrets/etcd-cert"
etcd_key: "/calico-secrets/etcd-key"

# 3. add the certificate data under data in calico-etcd-secrets (values must be base64-encoded)

etcd-key: (output of: cat /etc/kubernetes/ssl/etcd-key.pem | base64 | tr -d '\n')
etcd-cert: (output of: cat /etc/kubernetes/ssl/etcd.pem | base64 | tr -d '\n')
etcd-ca: (output of: cat /etc/kubernetes/ssl/ca.pem | base64 | tr -d '\n')

If kubectl apply then fails with a JSON-parsing error like the one below, the command text was pasted into the Secret instead of its output:

[root@k8s_master184 2.6]# kubectl apply -f calico.yaml
configmap "calico-config" created
daemonset "calico-node" created
deployment "calico-kube-controllers" created
deployment "calico-policy-controller" created
serviceaccount "calico-kube-controllers" created
serviceaccount "calico-node" created
Error from server (BadRequest): error when creating "calico.yaml": Secret in version "v1" 
cannot be handled as a Secret: v1.Secret: Data: decode base64: illegal base64 data at inp
ut byte 3, parsing 92 ... -d '\\n'"... at {"apiVersion":"v1","data":{"etcd-ca":"cat /etc/
kubernetes/ssl/ca.pem | base64 | tr -d '\\n'","etcd-cert":"cat /etc/kubernetes/ssl/etcd.p
em | base64 | tr -d '\\n'","etcd-key":"cat /etc/kubernetes/ssl/etcd-key.pem | base64 | tr
-d '\\n'"},"kind":"Secret","metadata":{"annotations":{"kubectl.kubernetes.io/last-applied
-configuration":"{\"apiVersion\":\"v1\",\"data\":{\"etcd-ca\":\"cat /etc/kubernetes/ssl/c
a.pem | base64 | tr -d '\\\\n'\",\"etcd-cert\":\"cat /etc/kubernetes/ssl/etcd.pem | base6
4 | tr -d '\\\\n'\",\"etcd-key\":\"cat /etc/kubernetes/ssl/etcd-key.pem | base64 | tr -d 
'\\\\n'\"},\"kind\":\"Secret\",\"metadata\":{\"annotations\":{},\"name\":\"calico-etcd-se
crets\",\"namespace\":\"kube-system\"},\"type\":\"Opaque\"}\n"},"name":"calico-etcd-secre
ts","namespace":"kube-system"},"type":"Opaque"}

In that case, paste the output of cat /etc/kubernetes/ssl/etcd-key.pem | base64 | tr -d '\n' (and likewise for the other two) directly into the file, as below (PS: the line breaks here are a layout artifact; the values must contain no newlines):

etcd-key: LS0tLS1CRUdJTiBSU0EgUFJJVkFURSBLRVktLS0tLQpNSUlFcEFJQkFBS0NBUUVBMEZjcTZCU253d2lJN2oxc1lobm9lelhpNzFiTzFYM0U1QVRtZ0xCMlQyQTlvSEFTCnZPQWdDMlFQQ2hPd3RCMUs0ZWlXMmJZRmdzS0pjKzJoYStyR293V0xGSkZNeEN0NXJ1VUxDWjRjaEN3YU1Rck8KRnBEdFp0
MFNDMmRJMnBOMkZPVjJONE9zblJ6MW1zenM0dHRpUEpvT213aWVvL1BOVVRkTlMyeFRtK2RzOTRJKwpaeWJlZXNGeGNZWnNreUlQd0lLcHd0blErb3FlaUJhRGZHOWc0SmtpSC9QaUFzVjlsUzVjSXlwRGNaVVo0ZmU1CmZIbTVHU2ZMdWpGVkEyeXN1WW5LaWtWNWtmU01DL3VhdGVObzlUcmtKR25ab1dTbUlGVHYr
b2ZzT3VraEVuNksKYmJsczY4cHVuRTJ4K3B0MDllQmZNWGowaEpsRnJnT3RHR1gxWlFJREFRQUJBb0lCQVFDcVprRC9wTFU1dldkUgpoQWQ3alRrcVhQNkpSdlRRaEpkZTcrc2ZZalRCNHpORVg2WFR1WFE4SE5CNEszYWhPandlM1Q5VVBaM3dQdkJ4ClV2QnQ5WTRWazlrWEwyZ2NJbnJaNHhmTisvWFMzTWRuU1RF
YUg5c3NBTEJiaDFSOUFaTFlzSHlxRnhZOVFveFcKMmpqOXF2V0VIM1RHdEp3YitMSDdOVUlRNkQ3U0NWSWZSelFOSDdUOWNhcGZTajhGNy9pdThuQ3ZLNVU0K3Y0TQptb21EdFdxdWJWWTNVOTAvMXBWUGNMcWRMaEJ4dHp6YkE0KzFwMGNkTWd2TU5oZFVjc0VsRE4xSXJjUXl0QUpICktvZFZQNkNSMjgzMHZsY3or
UzFQSVlLYkMwdDVhODFzRVp3VFdsSlZGc3huSDdCc1ZDMHhSVFR0aUtvQURyRDQKM3EvVlBWcFpBb0dCQVBTRU5uNkJEdmhROGRDU2NwSjVXYmY3MUdidDBzK1doU05yNkFLQ1ErK0JmR2tVL1h4SQpEQ2dZeWRzU1hJWHRMSURvTTdodFlzdmtpanFEUk5lckwyR0tUMk4yRTBpd2ZFQTRLdFZzNFBoeVlJS2RPMjNU
CmNxM2NCUFFnaWNrUzV1QmxqcExZZlU3OFd2WVBOMEQySnBXNlVxRm9RZmlja2Z5QjkrMWY4TDczQW9HQkFOb2cKQkNsVXR3RVZwMGc1c1h3MGIraVc2L3l4ekhxckU2dUtRVGNqMXhHNFFCSStvMFl1VWxjTlk5WFVXTGpVQzAwMQplMldXTWp0VC8xaFoySk4rTUV1blNTTzloR3M5a1hPQUZIRi9ST2x6TWhzNVBJ
N2pGT244UFZTSndlUWhLZDZzCmdLcjdWN0haV0ppNDVHUkxVYTltUWh2Tk9BNVkvRmdReHVMUzlXdURBb0dCQUx1cjRwdVQrT1owVWpWd3djbFUKcEowSEI0NTh5UW9Wa0ZpUWtNR2tNL3BYR3lNWVBqcXVuYzRFd0tHSlpVUlJ0bysyS1VSTGlNSFB4cHlFZGtsRwpGWmE2N3BYN1lXK3dMWjJvdm8rVEF0VU9ESzhU
ZVRLaFVXcko3VzltcmZxTHJITGMzK0lya1hvWFRNV0JCanF3Cjh3cUd1TGN6NnphakRaV09ON21Vb3BZZkFvR0FZUnNQdnphdm9oUDV2UFd5UmhFeUlPSFBmVmZLS0hJdzk1VTkKSTBjWllCSWV0QUNldjRldnNJR05pSXhZVXpCVE43UXZreklpZXJjU1hrcmhXQWc5aC9DWlp3ZmdBNzROR3RaUwpRNVRkSVBEZnhh
N2RmdDhwV0dHckRBK24rZCtwdkRBZnQvN2RNNWdIRVRaK3R3ZXcvZDBRWVVBalRIL2hGM09nCmx5cERoL0VDZ1lCZG1JM2tFOWgzSGV6QWl0V1Nxc1ZGVm1YOVRwdzhCSG9vcUpQZmUwd3A5aUxxV3pYaUtHNHgKeXY5bnQ3N0cvU1NGSUkybWhQbVpIWGhBZHpmQjBkNHF6VFVCcWczTmwxei9lVXdyL1NrVWRSdnRP
VzdLMFI4UApVTGIreWxmR0dZWk1GZ1JFcGdFRjkvR25GVEMzUFNQOU9XUTU2QldybWpxZXpDVVpKd2NQYUE9PQotLS0tLUVORCBSU0EgUFJJVkFURSBLRVktLS0tLQo=
etcd-cert: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUQ2VENDQXRHZ0F3SUJBZ0lUVUg3d0pqc2NwQWxBR1NaYTkraEo4MUxUZVRBTkJna3Foa2lHOXcwQkFRc0YKQURCbE1Rc3dDUVlEVlFRR0V3SkRUakVRTUE0R0ExVUVDQk1IVkdsaGJrcHBiakVRTUE0R0ExVUVCeE1IVkdsaApia3BwYmpFTU1
Bb0dBMVVFQ2hNRGF6aHpNUTh3RFFZRFZRUUxFd1pUZVhOMFpXMHhFekFSQmdOVkJBTVRDbXQxClltVnlibVYwWlhNd0hoY05NVGN4TWpFNU1EWXdNakF3V2hjTk1qY3hNakUzTURZd01qQXdXakJmTVFzd0NRWUQKVlFRR0V3SkRUakVRTUE0R0ExVUVDQk1IVkdsaGJrcHBiakVRTUE0R0ExVUVCeE1IVkdsaGJrcHB
iakVNTUFvRwpBMVVFQ2hNRGF6aHpNUTh3RFFZRFZRUUxFd1pUZVhOMFpXMHhEVEFMQmdOVkJBTVRCR1YwWTJRd2dnRWlNQTBHCkNTcUdTSWIzRFFFQkFRVUFBNElCRHdBd2dnRUtBb0lCQVFEUVZ5cm9GS2ZEQ0lqdVBXeGlHZWg3TmVMdlZzN1YKZmNUa0JPYUFzSFpQWUQyZ2NCSzg0Q0FMWkE4S0U3QzBIVXJoNkp
iWnRnV0N3b2x6N2FGcjZzYWpCWXNVa1V6RQpLM211NVFzSm5oeUVMQm94Q3M0V2tPMW0zUklMWjBqYWszWVU1WFkzZzZ5ZEhQV2F6T3ppMjJJOG1nNmJDSjZqCjg4MVJOMDFMYkZPYjUyejNnajVuSnQ1NndYRnhobXlUSWcvQWdxbkMyZEQ2aXA2SUZvTjhiMkRnbVNJZjgrSUMKeFgyVkxsd2pLa054bFJuaDk3bDh
lYmtaSjh1Nk1WVURiS3k1aWNxS1JYbVI5SXdMKzVxMTQyajFPdVFrYWRtaApaS1lnVk8vNmgrdzY2U0VTZm9wdHVXenJ5bTZjVGJINm0zVDE0Rjh4ZVBTRW1VV3VBNjBZWmZWbEFnTUJBQUdqCmdaY3dnWlF3RGdZRFZSMFBBUUgvQkFRREFnV2dNQjBHQTFVZEpRUVdNQlFHQ0NzR0FRVUZCd01CQmdnckJnRUYKQlF
jREFqQU1CZ05WSFJNQkFmOEVBakFBTUIwR0ExVWREZ1FXQkJSQngxaUtCRWJqN0syTHJ4dklKYWxCMlljWQpaekFmQmdOVkhTTUVHREFXZ0JUWFZZNWpBTndBNzllZFNQRWlwZytuSWNuMENqQVZCZ05WSFJFRURqQU1od1IvCkFBQUJod1RBcUFHNE1BMEdDU3FHU0liM0RRRUJDd1VBQTRJQkFRQ284bjRjYnI1cm1
nRHBTVHRhZTc0RjAyVmcKYWcwTXA2clVsbko5TFV4NTdXOWJnZXhZaHAwNzZFVks0WnpHa1JSZ2U3emNrb2IrRE02ZzNpbjBWTU1ObU0weAp4cXJzRFZuNnZHQUpyWWZHTWNtRnk5UzVrazM4bjhvSkhXc1oraGRmeFk5Mzd2TUZvQXFiUWJWTU1wd2hHbjYvCkVlRXEyQ3J2NzhCNU1IV0tOMkhRZmQ2S0huam9VQUl
EdXlJdGZLZ3ltVHgxazMxdmxCQWN4MW5SOXAyWTN5bUgKQWczZXl1ZkFZOXhQbEx5NHQ4SmlyZUt4OUpGRk5QOVpJc1VNTnpGMkp3Z0FlcTNkS1ZTMWxjdG9nbkFpR2FhOQpuemJzOSsrVFZkcmJUd3kyMVp1dVNjejdaOExacGNlMDVYM3hUSE1maXBmNEE4ZGFaVkFYMUwzYUF5ZWoKLS0tLS1FTkQgQ0VSVElGSUN
BVEUtLS0tLQo=
etcd-ca: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUR2akNDQXFhZ0F3SUJBZ0lVVUM3dkt0Y05Ec0NkY2QxUmxsdVI4aVllQ3FNd0RRWUpLb1pJaHZjTkFRRUwKQlFBd1pURUxNQWtHQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjFScFlXNUthVzR4RURBT0JnTlZCQWNUQjFScApZVzVLYVc0eEREQ
UtCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByCmRXSmxjbTVsZEdWek1CNFhEVEUzTVRJeE9UQTFOVEV3TUZvWERUSXlNVEl4T0RBMU5URXdNRm93WlRFTE1Ba0cKQTFVRUJoTUNRMDR4RURBT0JnTlZCQWdUQjFScFlXNUthVzR4RURBT0JnTlZCQWNUQjFScFlXNUthV
zR4RERBSwpCZ05WQkFvVEEyczRjekVQTUEwR0ExVUVDeE1HVTNsemRHVnRNUk13RVFZRFZRUURFd3ByZFdKbGNtNWxkR1Z6Ck1JSUJJakFOQmdrcWhraUc5dzBCQVFFRkFBT0NBUThBTUlJQkNnS0NBUUVBeVREMENqR2E0SXpqTTV0NUhUWnQKNWV2Z2N6bXR2TDVCaHhjdGY4VUU1cDVWSFFRblVxM1dLZk9uaVNiN
015Z1d1VEh0KzlCSFE3bW1YZnllaG8vbwpJU1MrZjlsS1pzdUFDejJORU5SeGlvSm5HVnphSWdqYWVZaENlQTMxdXpad1g2ZHJUa2d6VEdVdU9BaEljNmZhCm0zOEdSU3h5NmpySGkwVFVva1R3cXpaaXRGRjZkOWZZUE4xOU9xZ0ZLSXZISWdIbzFmcGdHd1JabmxLNXpMTy8KYUtWMVVUR3BTc0RDUFNwbXhLT2UvS
UpLTHZSV2M4Tmc2clRYMjh0cGE1RG1vWUhKeEg0elNwRmpCZTdnWFd0QQoyVnpXb2dEOCt4YlU4ekJFT3M2WmJlSURpOG8vVGVyNUpydEZsRG1GbzltQ3F6UGY4R2ZaSmFld2F3eFZ0d3dXCjNRSURBUUFCbzJZd1pEQU9CZ05WSFE4QkFmOEVCQU1DQVFZd0VnWURWUjBUQVFIL0JBZ3dCZ0VCL3dJQkFqQWQKQmdOV
khRNEVGZ1FVMTFXT1l3RGNBTy9YblVqeElxWVBweUhKOUFvd0h3WURWUjBqQkJnd0ZvQVUxMVdPWXdEYwpBTy9YblVqeElxWVBweUhKOUFvd0RRWUpLb1pJaHZjTkFRRUxCUUFEZ2dFQkFBdFh5NXNEVHZ4RGlkK0UvZ0dKCk16OWRuM0t1OEFQTnJEaXA4NlIxaTBrNmJJNlV3VzVTVlBRMENabnY1azJwRURDR3B5S
UNPZFBjTXE2UEZSVUEKelN2alJreHo5aTRqVDZ1WnZuTmo4WTFkOXFSZzBqbGJJdnpKVzRET3MrYTFLRklSOERiZGhBVzBxSkE1NU1legpZVjdxNFpFZGhMU2pxNU5zZStqSXpXUEd2WVZVWkZmelJTMEhkb2NISUtnSUJJMkdid3NqWG9wWkp0Ymp6OVFxCnBaakdnbHB2b2hHaXVFOGVDVURzYTlzT3ZodUcwOTliN
nUwSk5tVThQejVubVJjUWE2QXp2elF6UHMyMW5yaGwKeml3QzRQdTRVRkp5WG1aMkhVeCsrREtqMWswVFhXS3lQdk9jYzhuc2NWUkdQSXJXMVQyNlFCL3pjcnMzbUFQZgo3dWc9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0K

# 4. add the following environment variable to the calico-node container in the DaemonSet (it makes IP detection pick the right address in complex network environments; the details are explained later)

- name: IP_AUTODETECTION_METHOD
  value: "can-reach=8.8.8.8"

3. Pull the images

The downloaded calico.yaml names the Calico versions in use:

#   calico/node:v2.6.5
#   calico/cni:v1.11.2
#   calico/kube-controllers:v1.0.2

Pull the images (foreign registries can be slow, so be patient or look for a domestic mirror; this download was tolerable, so no mirror was used):

docker pull calico/node:v2.6.5
docker pull calico/cni:v1.11.2
docker pull calico/kube-controllers:v1.0.2

4. Apply the YAML manifests

[root@k8s_master184 calico]# kubectl create -f rbac.yaml
clusterrole "calico-kube-controllers" created
clusterrolebinding "calico-kube-controllers" created
clusterrole "calico-node" created
clusterrolebinding "calico-node" created
[root@k8s_master184 calico]# kubectl create -f calico.yaml
configmap "calico-config" created
secret "calico-etcd-secrets" created
daemonset "calico-node" created
deployment "calico-kube-controllers" created
deployment "calico-policy-controller" created
serviceaccount "calico-kube-controllers" created
serviceaccount "calico-node" created

5. Configure kubelet.service

vi /etc/systemd/system/kubelet.service

# add the following flag to the kubelet arguments

  --network-plugin=cni \

# reload systemd and restart kubelet
systemctl daemon-reload
systemctl restart kubelet.service
systemctl status kubelet.service
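
For context, after the edit the ExecStart of the unit looks roughly like this (a sketch: the surrounding flags come from whatever your existing unit contains, and the two CNI directory flags are shown with their default values):

ExecStart=/usr/local/bin/kubelet \
  --network-plugin=cni \
  --cni-conf-dir=/etc/cni/net.d \
  --cni-bin-dir=/opt/cni/bin \
  ...            # the rest of the existing flags stay unchanged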

6. Verify Calico

[root@k8s_master184 calico]# kubectl get pods -n kube-system
NAME                                       READY     STATUS             RESTARTS   AGE
calico-kube-controllers-7485cd7966-968jz   1/1       Running            0          2h
calico-node-2sxn7                          2/2       Running            0          2h
calico-node-6jxhq                          2/2       Running            0          2h
calico-node-7mpwv                          2/2       Running            0          2h
calico-node-hgl7n                          2/2       Running            0          2h
calico-node-tkkkv                          2/2       Running            0          2h

7. Install the Calico command-line tool calicoctl (optional)

cd /usr/bin/
wget -c  https://github.com/projectcalico/calicoctl/releases/download/v1.6.1/calicoctl
chmod +x calicoctl

# create the calicoctl config file (on a node where Calico is installed)

mkdir /etc/calico
vi /etc/calico/calicoctl.cfg

apiVersion: v1
kind: calicoApiConfig
metadata:
spec:
  datastoreType: "etcdv2"
  etcdEndpoints: "https://192.168.1.184:2379"
  etcdKeyFile: "/etc/kubernetes/ssl/etcd-key.pem"
  etcdCertFile: "/etc/kubernetes/ssl/etcd.pem"
  etcdCACertFile: "/etc/kubernetes/ssl/ca.pem"


# check Calico status
[root@k8s_master184 bin]# calicoctl node status
Calico process is running.

IPv4 BGP status
+---------------+-------------------+-------+----------+-------------+
| PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+---------------+-------------------+-------+----------+-------------+
| 192.168.1.185 | node-to-node mesh | up    | 04:13:08 | Established |
| 192.168.1.186 | node-to-node mesh | up    | 04:13:08 | Established |
| 192.168.1.187 | node-to-node mesh | up    | 04:13:11 | Established |
| 192.168.1.189 | node-to-node mesh | up    | 04:13:11 | Established |
+---------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

8. Test the cluster

# start a container on each of three nodes and check that they can ping one another
# (node 185)
kubectl run calico-test-1 --rm -ti --image busybox /bin/sh  
# (node 186)
kubectl run calico-test-2 --rm -ti --image busybox /bin/sh
# (node 187)
kubectl run calico-test-3 --rm -ti --image busybox /bin/sh
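
From any of the three shells, ping the other two pods by the addresses shown in kubectl get pod -o wide (the IP below is illustrative):

/ # ping -c 3 10.233.2.3
# 3 packets transmitted, 3 packets received, 0% packet loss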

By default the network is fully open: pods can reach pods, and pods and nodes can reach each other freely. To add restrictions, see Calico's Profile Resource (profile) and Policy Resource (policy).
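
For a flavor of what such a restriction looks like, here is a minimal sketch of a Policy resource in calicoctl v1.x syntax (the selector matches the app label used later in this post; the port and policy name are illustrative); it would be applied with calicoctl apply -f policy.yaml:

apiVersion: v1
kind: policy
metadata:
  name: allow-keras-8088
spec:
  order: 100
  selector: app == 'keras-chun'
  ingress:
  - action: allow
    protocol: tcp
    destination:
      ports: [8088]
  egress:
  - action: allow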

Verifying Open MPI

Test method: take two nodes, each with one GPU. If a job can run on both machines at the same time, the setup works.

YAML manifests

node 188:

apiVersion: extensions/v1beta1
kind: Deployment
metadata: 
  labels:  
    app: keras-chun 
  name: keras-chun-188-0  
  namespace: bigdata
spec:  
  replicas: 1  
  selector:  
    matchLabels:  
      app: keras-chun
  template:  
    metadata:  
      labels:  
        app: keras-chun
    spec:  
      containers: 
      - name: keras-chun-188-0 
        image: 192.168.1.184:5000/bigdata/tensorflow-gpu-1.4.1-compile-py36:eth0 
        # securityContext:
        #   privileged: true
        workingDir: /opt/keras_chun
        resources: 
          limits: 
            alpha.kubernetes.io/nvidia-gpu: 1
        command: 
        - /startssh.sh
        ports:  
        - containerPort: 8088  
          protocol: TCP  
        volumeMounts:
        - mountPath: /opt/keras_chun
          name: keras
        - mountPath: /usr/local/openmpi-2.1.2/
          name: openmpi
        - mountPath: /root/test_data/
          name: test-data
      nodeName: 192.168.1.188
      volumes:
      - hostPath:
          path: /opt/keras_chun
        name: keras
      - hostPath:
          path: /usr/local/openmpi-2.1.2/
        name: openmpi
      - hostPath:
          path: /root/test_data/
        name: test-data

---  
kind: Service  
apiVersion: v1  
metadata:  
  labels:  
    app: keras-chun 
  name: keras-chun 
  namespace: bigdata
spec:  
  type: NodePort  
  ports: 
  - port: 8088  
    targetPort: 8088 
    nodePort: 30010
  selector:  
    app: keras-chun

node 189:

apiVersion: extensions/v1beta1
kind: Deployment
metadata: 
  labels:  
    app: keras-chun 
  name: keras-chun-189-0  
  namespace: bigdata
spec:  
  replicas: 1  
  selector:  
    matchLabels:  
      app: keras-chun
  template:  
    metadata:  
      labels:  
        app: keras-chun
    spec:  
      containers: 
      - name: keras-chun-189-0 
        image: 192.168.1.184:5000/bigdata/tensorflow-gpu-1.4.1-compile-py36:eth0 
        # securityContext:
        #   privileged: true
        workingDir: /opt/keras_chun
        resources: 
          limits: 
            alpha.kubernetes.io/nvidia-gpu: 1
        command: 
        - /startssh.sh
        ports:  
        - containerPort: 8088  
          protocol: TCP  
        volumeMounts:
        - mountPath: /opt/keras_chun
          name: keras
        - mountPath: /usr/local/openmpi-2.1.2/
          name: openmpi
        - mountPath: /root/test_data/
          name: test-data
      nodeName: 192.168.1.189
      volumes:
      - hostPath:
          path: /opt/keras_chun
        name: keras
      - hostPath:
          path: /usr/local/openmpi-2.1.2/
        name: openmpi
      - hostPath:
          path: /root/test_data/
        name: test-data

Deploy the pods:

[root@k8s_master184 calico-test]# kubectl create -f node188.yaml
deployment "keras-chun-188-0" created
service "keras-chun" created

[root@k8s_master184 calico-test]# kubectl create -f node189.yaml
deployment "keras-chun-189-0" created

[root@k8s_master184 calico-test]# kubectl get pod -n bigdata -o wide
NAME                                      READY     STATUS    RESTARTS   AGE       IP             NODE
keras-chun-188-0-665668495f-gft7h         1/1       Running   0          22s       10.233.1.68    192.168.1.188
keras-chun-189-0-768c9bf77-67fn4          1/1       Running   0          35m       10.233.1.135   192.168.1.189

Verify MPI:

[root@keras-chun-188-0-665668495f-gft7h keras_chun]# mpirun -np 2 -H 10.233.1.68:1,10.233.1.135:1 --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH --allow-run-as-root python3 /opt/keras_chun/train.py
WARNING:tensorflow:From /opt/keras_chun/vlife/bigdata/tfd/horovodhelper.py:24: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
WARNING:tensorflow:From /opt/keras_chun/vlife/bigdata/tfd/horovodhelper.py:24: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.get_or_create_global_step
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Create CheckpointSaverHook.
2018-01-04 06:28:27.412859: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-04 06:28:27.413167: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.69GiB
2018-01-04 06:28:27.413183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-01-04 06:28:27.867927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-01-04 06:28:27.868319: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 10.75GiB
2018-01-04 06:28:27.868344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)

keras-chun-188-0-665668495f-gft7h:151:158 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO Using internal Network Socket
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO NET : Using interface eth0:10.233.1.68<0>
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO NET/Socket : 1 interfaces found
NCCL version 2.1.2+cuda8.0

keras-chun-189-0-768c9bf77-67fn4:87:94 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
keras-chun-189-0-768c9bf77-67fn4:87:94 [0] INFO Using internal Network Socket
keras-chun-189-0-768c9bf77-67fn4:87:94 [0] INFO Using NCCL Low-latency algorithm for sizes below 16384
keras-chun-189-0-768c9bf77-67fn4:87:94 [0] INFO NET : Using interface eth0:10.233.1.135<0>
keras-chun-189-0-768c9bf77-67fn4:87:94 [0] INFO NET/Socket : 1 interfaces found
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO Using 256 threads
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO Min Comp Cap 6
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO NCCL_SINGLE_RING_THRESHOLD=131072
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO [0] Ring 0 :    0   1
keras-chun-189-0-768c9bf77-67fn4:87:94 [0] INFO 0 -> 1 via NET/Socket/0
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO 1 -> 0 via NET/Socket/0
keras-chun-188-0-665668495f-gft7h:151:158 [0] INFO Launch mode Parallel
INFO:tensorflow:Saving checkpoints for 1 into checkpoint/model.ckpt.
INFO:tensorflow:Saving checkpoints for 1 into checkpoint/model.ckpt.
INFO:tensorflow:d_loss = 1.9829, g_loss = 189.575, l1_loss = 99.0771, l12_loss = 90.4983, const_loss = 18.9435, cheat_loss = 0.510973, d_loss_real = 0.730531, d_loss_fake = 1.25237
INFO:tensorflow:d_loss = 1.80019, g_loss = 189.344, l1_loss = 99.0133, l12_loss = 90.3308, const_loss = 19.1294, cheat_loss = 0.533686, d_loss_real = 0.722731, d_loss_fake = 1.07746
INFO:tensorflow:d_loss = 1.68, g_loss = 42.1723, l1_loss = 21.8061, l12_loss = 20.3662, const_loss = 17.9448, cheat_loss = 0.487852, d_loss_real = 0.684348, d_loss_fake = 0.995656 (115.616 sec)
INFO:tensorflow:d_loss = 1.59902, g_loss = 45.8866, l1_loss = 23.6325, l12_loss = 22.2541, const_loss = 18.128, cheat_loss = 0.47778, d_loss_real = 0.612673, d_loss_fake = 0.986349 (76.636 sec)

Check the GPUs on the two nodes:

[root@node188 bin]# nvidia-smi
Thu Jan  4 15:44:47 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
| 52%   68C    P2   248W / 250W |  10799MiB / 11172MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     27946    C   python                                       10781MiB |
+-----------------------------------------------------------------------------+

[root@node189 bin]# nvidia-smi
Thu Jan  4 15:45:47 2018 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.69                 Driver Version: 384.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 55%   70C    P2   206W / 250W |  10794MiB / 11172MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1362    G   /usr/bin/X                                      45MiB |
|    0      2092    G   /usr/bin/gnome-shell                            13MiB |
|    0     28193    C   python                                       10723MiB |
+-----------------------------------------------------------------------------+

The job runs normally. Done.

Notes on the problems encountered

Reflections

Installing Calico ought to have been a simple job, yet it took a hard five days to get working, which deserves some reflection. These days most of a programmer's work is to google or baidu, experiment, draw conclusions, and turn them into experience. Nothing wrong with that, but sifting the useful out of a flood of information requires a sense of priority. When picking up a new technology, google it first: a pile of Chinese blog posts means plenty of people already use it, but for open-source software most of those posts are badly outdated, and that is the first thing to watch for, versions. Second, be wary of the clever tricks proposed in Chinese blog posts. That is not to say Chinese material is necessarily low quality, only that some authors solve a problem merely to make it go away, offering surface-level workarounds without fully understanding the software, robbing Peter to pay Paul, until the real problem becomes impossible to localize. So if your English is weak, Chinese posts (which tend to be more detailed) are a fine starting point, but always check them against the official documentation.

1. Starting calico-node via systemd

Following the blog post a colleague had used to install Kubernetes, 《kubernetes 1.8.3》 (see [小炒肉 kubernetes 1.8.3 安装过程]), I began replacing flannel with Calico. The process looked simple. One suggestion it cited was: create the related components with the official calico.yml, so the ConfigMap, etcd configuration, Calico policy and so on are created in one go, then pull the calico-node container out of the DaemonSet and run it separately under systemd; see [Calico 部署踩坑记录].

As of this writing Calico has reached 3.0.0, but for Kubernetes integration the official site still recommends v2.6.5. The v2.6 calico.yaml downloaded from the official site references v2.6.5; the blog post, however, used v2.6.0, an early point release of the v2.6 line, and to keep the experiment consistent I used that version too. Pod creation failed, and the cause was exactly the version: reading the source showed that one case of a switch statement did not exist in 2.6.0 and was only added in 2.6.2. Keep versions and configuration consistent, especially for open-source software that moves quickly.
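
A trivial guard that would have caught this earlier: before deploying, confirm that the image tags referenced by the manifest match what you intend to run, e.g.:

grep 'image:' calico.yaml
# expect tags consistent with the versions you pulled, e.g. calico/node:v2.6.5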

Following the article, the environment came up as expected: each node acquired a tunl interface, the nodes' tunl addresses could ping one another, and creating a test pod produced a corresponding cali* virtual network device on its node. But a problem surfaced at once: pods could not ping other pods, nodes, or the outside world; a pod could only ping itself. Looking at the routing table on the nodes, the expected "via cali*" entries were missing, so the lack of connectivity was consistent. The next four and a half days revolved around this problem, along the following lines:

  • First, versions again: I tried the newest 3.0.0 as well as the somewhat older 2.5.1 and 2.4.1, with identical results, so the version was not the problem.

  • Next, back on v2.6.5, dig into where the traffic actually goes. For analyzing the network, iptables rules combined with a packet-capture tool work well. See [docker 容器网络方案:calico 网络模型], which analyzes the packet flow in detail; I followed its procedure. The differences were:

    • First: when inspecting the interfaces inside the container and on the node, the article says the node-side calif24874aae57 has a randomly generated alphanumeric MAC while the container-side end of the veth pair is ee:ee:ee:ee:ee:ee; my test showed exactly the opposite. The v2.6.5 release notes later revealed that v2.6.5 changed the MAC of the node-side cali* devices to ee:ee:ee:ee:ee:ee, so this mismatch was version-induced and not the crux.

    • Second: capturing with tcpdump -nn -i calif24874aae57 -e showed no ARP replies, nor the ICMP packets that should normally follow.

  • Finally, analyze the iptables rules together with Calico's policy and profile: ping one pod from another while running watch -n 1 iptables -vL to see whether traffic is being discarded. Nothing was: the drop counters stayed at 0 the whole time (the commands used in these checks are sketched below).
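
The commands behind these checks, collected in one place (the cali* interface name and the pod name come from this environment and will differ):

# compare the MACs on the two ends of a pod's veth pair
ip -d link show calif24874aae57                    # node side
kubectl exec -ti <pod-name> -- ip link show eth0   # container side

# watch for ARP replies and ICMP on the node-side interface while pinging from the pod
tcpdump -nn -i calif24874aae57 -e

# watch the packet counters of DROP rules while the ping runs
watch -n 1 'iptables -vL'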

2. Network problems with the official installation method

Problem 1 showed that with calico-node started via systemd, the biggest issue was still the network itself. Besides, once calico-node is handed to systemd, its recovery after a failure is less natural than under Kubernetes. So I abandoned that approach and switched to the official installation, see [Installing Calico on Kubernetes - Standard Hosted Install], in the end following the steps in [centos7安装kubernetes-v1.7安装配置calico网络组件]. That made pods and nodes mutually reachable, but pods on different nodes still could not ping each other.

This recalled the following entry in the official Troubleshooting guide:

No ping between containers on different hosts

If you have connectivity between containers on the same host, and between containers and the Internet, but not between containers on different hosts, it probably indicates a problem in the BIRD setup.

Look at calicoctl node status on each host. It should include output like this:

Calico process is running.

IPv4 BGP status
+--------------+-------------------+-------+----------+-------------+
| PEER ADDRESS |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+--------------+-------------------+-------+----------+-------------+
| 172.17.8.102 | node-to-node mesh | up    | 23:30:04 | Established |
+--------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.
If you do not see this, please check the following.

Can your hosts ping each other? There must be IP connectivity between the hosts.

Your hosts’ names must be different. Calico uses hostname as a key in the etcd data, and the etcd data is used to autogenerate the correct BIRD config - so a duplicate hostname will prevent correct BIRD setup.

There must not be iptables rules, or any kind of firewall, preventing communication between the hosts on TCP port 179. (179 is the BGP port.)

Running calicoctl node status showed unwanted addresses from the 10.233.0.0/16 range in the PEER ADDRESS column, and the INFO column showed Connect instead of Established. That brought back the point made in [Calico 部署踩坑记录]: in the calico.yml created straight from the official docs, calico-node runs as a DaemonSet with both its IP setting and its NODENAME setting left empty, so calico-node autodetects them, and in a complex network environment the autodetection can go wrong. [centos7安装kubernetes-v1.7安装配置calico网络组件] mentions the environment variable IP_AUTODETECTION_METHOD. Per the official docs, it defaults to first-found, which walks all interfaces and returns the first valid IP address, so on a host with a complicated network layout an unexpected IP is no surprise. Another option is to name the interface explicitly (wildcards supported), but my hosts' interface names differed, so I used the last method, can-reach=DESTINATION: adding the following to calico-node's env makes it pick the correct address (works with a single usable NIC; not suitable for multi-homed hosts).

- name: IP_AUTODETECTION_METHOD
  value: "can-reach=8.8.8.8"
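
For reference, the interface-based style mentioned above would look like this (only usable when all hosts share a naming scheme, which was not the case here; the pattern is illustrative):

- name: IP_AUTODETECTION_METHOD
  value: "interface=eth.*"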

With that, pod-to-pod networking across nodes worked.

References

[Official] Installing Calico on Kubernetes - Standard Hosted Install

[Official] Calico Configuring Felix

[Official] IP Autodetection methods

centos7安装kubernetes-v1.7安装配置calico网络组件

docker 容器网络方案:calico 网络模型

Calico 部署踩坑记录

抓包神器 tcpdump 使用介绍

Kubernetes1.8.3 集群环境搭建(CentOS)

小炒肉 kubernetes 1.8.3 安装过程

Calico网络的原理、组网方式与使用

《Kubernetes权威指南-从Docker到Kubernetes实践全接触纪念版》, pp. 345-363