Preface
SuperEdge service group leverages application-grid-wrapper to implement topology awareness, enabling closed-loop access to services within the same nodeunit.
Before diving into application-grid-wrapper, this section briefly introduces the topology-awareness feature natively supported by upstream Kubernetes.
The Kubernetes service topology awareness feature was released as alpha in v1.17 to provide topology-aware routing and nearest-endpoint access. Users add a topologyKeys field to a service to declare the type of topology key; only endpoints within the same topology domain will be accessed. Currently three topologyKeys are available:
- "kubernetes.io/hostname":訪問本節點內(
kubernetes.io/hostname
label value相同)的 endpoint,如果沒有則 service 訪問失敗 - "topology.kubernetes.io/zone":訪問相同zone域內(
topology.kubernetes.io/zone
label value 相同)的 endpoint,如果沒有則 service 訪問失敗 - "topology.kubernetes.io/region":訪問相同region域內(
topology.kubernetes.io/region
label value相同)的 endpoint,如果沒有則 service 訪問失敗
除了單獨填寫如上某一個拓撲key之外,還可以將這些key構造成列表進行填寫,例如:["kubernetes.io/hostname", "topology.kubernetes.io/zone", "topology.kubernetes.io/region"]
,這表示:優先訪問本節點內的 endpoint;如果不存在,則訪問同一個 zone 內的 endpoint;如果再不存在,則訪問同一個 region 內的 endpoint,如果都不存在則訪問失敗
In addition, "*" may be appended as the last entry of the list (and only as the last entry), meaning: if all preceding topology domains fail to match, access any valid endpoint, i.e., without any topology restriction. Example:
# A Service that prefers node local, zonal, then regional endpoints but falls back to cluster wide endpoints.
apiVersion: v1
kind: Service
metadata:
name: my-service
spec:
selector:
app: my-app
ports:
- protocol: TCP
port: 80
targetPort: 9376
topologyKeys:
- "kubernetes.io/hostname"
- "topology.kubernetes.io/zone"
- "topology.kubernetes.io/region"
- "*"
Compared with the upstream implementation, the topology awareness implemented by service group differs as follows:
- The service group topology key is customizable (namely gridUniqKey) and therefore more flexible to use, while the upstream implementation currently offers only three choices: "kubernetes.io/hostname", "topology.kubernetes.io/zone", and "topology.kubernetes.io/region"
- service group accepts only one topology key, meaning only valid endpoints within the local topology domain can be accessed and endpoints in other topology domains cannot; upstream can fall back to endpoints of alternative topology domains through the topologyKeys list and "*"
For the topology awareness implemented by service group, the service is configured as follows:
# A Service that only prefers endpoints within the same zone1 topology domain.
apiVersion: v1
kind: Service
metadata:
annotations:
topologyKeys: '["zone1"]'
labels:
superedge.io/grid-selector: servicegrid-demo
name: servicegrid-demo-svc
spec:
ports:
- port: 80
protocol: TCP
targetPort: 8080
selector:
appGrid: echo
Having introduced the topology awareness implemented by service group, we now dive into the source code for the implementation details. As before, the analysis starts from a usage example:
# step1: label edge nodes
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node0 Ready <none> 16d v1.16.7
node1 Ready <none> 16d v1.16.7
node2 Ready <none> 16d v1.16.7
# nodeunit1(nodegroup and servicegroup zone1)
$ kubectl --kubeconfig config label nodes node0 zone1=nodeunit1
# nodeunit2(nodegroup and servicegroup zone1)
$ kubectl --kubeconfig config label nodes node1 zone1=nodeunit2
$ kubectl --kubeconfig config label nodes node2 zone1=nodeunit2
...
# step3: deploy echo ServiceGrid
$ cat <<EOF | kubectl --kubeconfig config apply -f -
apiVersion: superedge.io/v1
kind: ServiceGrid
metadata:
name: servicegrid-demo
namespace: default
spec:
gridUniqKey: zone1
template:
selector:
appGrid: echo
ports:
- protocol: TCP
port: 80
targetPort: 8080
EOF
servicegrid.superedge.io/servicegrid-demo created
# note that there is only one relevant service generated
$ kubectl get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 192.168.0.1 <none> 443/TCP 16d
servicegrid-demo-svc ClusterIP 192.168.6.139 <none> 80/TCP 10m
# step4: access servicegrid-demo-svc (topology-aware, closed-loop access)
# execute on node0
$ curl 192.168.6.139|grep "node name"
node name: node0
# execute on node1 and node2
$ curl 192.168.6.139|grep "node name"
node name: node2
$ curl 192.168.6.139|grep "node name"
node name: node1
After the ServiceGrid CR is created, the ServiceGrid Controller generates the corresponding service from it (including the topologyKeys annotation built from serviceGrid.Spec.GridUniqKey); application-grid-wrapper then implements topology awareness based on that service. The two are analyzed in turn below.
ServiceGrid Controller Analysis
The overall logic of ServiceGrid Controller is consistent with DeploymentGrid Controller:
- 1. Create and maintain the CRDs required by service group (including ServiceGrid)
- 2. Watch ServiceGrid events and enqueue ServiceGrids into the work queue; in a loop, dequeue ServiceGrids, parse them, and create and maintain the corresponding services
- 3. Watch service events and enqueue the related ServiceGrids into the work queue for the above processing, helping achieve the overall reconcile logic
Note the differences from DeploymentGrid Controller:
- A ServiceGrid object generates exactly one service
- Only service events need an extra watch; node events need not be watched, since node CRUD is irrelevant to ServiceGrid
- The service generated from a ServiceGrid is named {ServiceGrid}-svc (see the sketch below)
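Based on this naming rule, a minimal sketch of what util.GetServiceName does might look like the following (an assumption inferred from the convention above, not the quoted source):
func GetServiceName(sg *crdv1.ServiceGrid) string {
	// e.g. ServiceGrid "servicegrid-demo" -> service "servicegrid-demo-svc"
	return fmt.Sprintf("%s-svc", sg.Name)
}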
func (sgc *ServiceGridController) syncServiceGrid(key string) error {
startTime := time.Now()
klog.V(4).Infof("Started syncing service grid %q (%v)", key, startTime)
defer func() {
klog.V(4).Infof("Finished syncing service grid %q (%v)", key, time.Since(startTime))
}()
namespace, name, err := cache.SplitMetaNamespaceKey(key)
if err != nil {
return err
}
sg, err := sgc.svcGridLister.ServiceGrids(namespace).Get(name)
if errors.IsNotFound(err) {
klog.V(2).Infof("service grid %v has been deleted", key)
return nil
}
if err != nil {
return err
}
if sg.Spec.GridUniqKey == "" {
sgc.eventRecorder.Eventf(sg, corev1.EventTypeWarning, "Empty", "This service grid has an empty grid key")
return nil
}
// get service workload list of this grid
svcList, err := sgc.getServiceForGrid(sg)
if err != nil {
return err
}
if sg.DeletionTimestamp != nil {
return nil
}
// sync service grid relevant services workload
return sgc.reconcile(sg, svcList)
}
func (sgc *ServiceGridController) getServiceForGrid(sg *crdv1.ServiceGrid) ([]*corev1.Service, error) {
svcList, err := sgc.svcLister.Services(sg.Namespace).List(labels.Everything())
if err != nil {
return nil, err
}
labelSelector, err := common.GetDefaultSelector(sg.Name)
if err != nil {
return nil, err
}
canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
fresh, err := sgc.crdClient.SuperedgeV1().ServiceGrids(sg.Namespace).Get(context.TODO(), sg.Name, metav1.GetOptions{})
if err != nil {
return nil, err
}
if fresh.UID != sg.UID {
return nil, fmt.Errorf("orignal service grid %v/%v is gone: got uid %v, wanted %v", sg.Namespace,
sg.Name, fresh.UID, sg.UID)
}
return fresh, nil
})
cm := controller.NewServiceControllerRefManager(sgc.svcClient, sg, labelSelector, util.ControllerKind, canAdoptFunc)
return cm.ClaimService(svcList)
}
func (sgc *ServiceGridController) reconcile(g *crdv1.ServiceGrid, svcList []*corev1.Service) error {
var (
adds []*corev1.Service
updates []*corev1.Service
deletes []*corev1.Service
)
sgTargetSvcName := util.GetServiceName(g)
isExistingSvc := false
for _, svc := range svcList {
if svc.Name == sgTargetSvcName {
isExistingSvc = true
template := util.KeepConsistence(g, svc)
if !apiequality.Semantic.DeepEqual(template, svc) {
updates = append(updates, template)
}
} else {
deletes = append(deletes, svc)
}
}
if !isExistingSvc {
adds = append(adds, util.CreateService(g))
}
return sgc.syncService(adds, updates, deletes)
}
func CreateService(sg *crdv1.ServiceGrid) *corev1.Service {
svc := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: GetServiceName(sg),
Namespace: sg.Namespace,
// Append existed ServiceGrid labels to service to be created
Labels: func() map[string]string {
if sg.Labels != nil {
newLabels := sg.Labels
newLabels[common.GridSelectorName] = sg.Name
newLabels[common.GridSelectorUniqKeyName] = sg.Spec.GridUniqKey
return newLabels
} else {
return map[string]string{
common.GridSelectorName: sg.Name,
common.GridSelectorUniqKeyName: sg.Spec.GridUniqKey,
}
}
}(),
Annotations: make(map[string]string),
},
Spec: sg.Spec.Template,
}
keys := make([]string, 1)
keys[0] = sg.Spec.GridUniqKey
keyData, _ := json.Marshal(keys)
svc.Annotations[common.TopologyAnnotationsKey] = string(keyData)
return svc
}
Since the logic is similar to DeploymentGrid, we do not expand on the details here and instead focus on the application-grid-wrapper part.
application-grid-wrapper Analysis
After the ServiceGrid Controller has created the service, application-grid-wrapper comes into play:
apiVersion: v1
kind: Service
metadata:
annotations:
topologyKeys: '["zone1"]'
creationTimestamp: "2021-03-03T07:33:30Z"
labels:
superedge.io/grid-selector: servicegrid-demo
name: servicegrid-demo-svc
namespace: default
ownerReferences:
- apiVersion: superedge.io/v1
blockOwnerDeletion: true
controller: true
kind: ServiceGrid
name: servicegrid-demo
uid: 78c74d3c-72ac-4e68-8c79-f1396af5a581
resourceVersion: "127987090"
selfLink: /api/v1/namespaces/default/services/servicegrid-demo-svc
uid: 8130ba7b-c27e-4c3a-8ceb-4f6dd0178dfc
spec:
clusterIP: 192.168.161.1
ports:
- port: 80
protocol: TCP
targetPort: 8080
selector:
appGrid: echo
sessionAffinity: None
type: ClusterIP
status:
loadBalancer: {}
To keep the solution completely non-intrusive to Kubernetes, a wrapper layer is added to the communication between kube-proxy and the apiserver. The call chain is as follows:
kube-proxy -> application-grid-wrapper -> lite-apiserver -> kube-apiserver
Accordingly, application-grid-wrapper runs a server that accepts requests from kube-proxy, as follows:
func (s *interceptorServer) Run(debug bool, bindAddress string, insecure bool, caFile, certFile, keyFile string) error {
...
klog.Infof("Start to run interceptor server")
/* filter
*/
server := &http.Server{Addr: bindAddress, Handler: s.buildFilterChains(debug)}
if insecure {
return server.ListenAndServe()
}
...
server.TLSConfig = tlsConfig
return server.ListenAndServeTLS("", "")
}
func (s *interceptorServer) buildFilterChains(debug bool) http.Handler {
handler := http.Handler(http.NewServeMux())
handler = s.interceptEndpointsRequest(handler)
handler = s.interceptServiceRequest(handler)
handler = s.interceptEventRequest(handler)
handler = s.interceptNodeRequest(handler)
handler = s.logger(handler)
if debug {
handler = s.debugger(handler)
}
return handler
}
Here an interceptorServer is created first, and then the handler functions are registered, from outermost to innermost:
- debug: accepts debug requests and returns wrapper pprof runtime information
- logger: logs requests
- node: accepts kube-proxy node GET (/api/v1/nodes/{node}) requests and returns node information
- event: accepts kube-proxy events POST (/events) requests and forwards them to lite-apiserver
func (s *interceptorServer) interceptEventRequest(handler http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || !strings.HasSuffix(r.URL.Path, "/events") {
			handler.ServeHTTP(w, r)
			return
		}
		targetURL, _ := url.Parse(s.restConfig.Host)
		reverseProxy := httputil.NewSingleHostReverseProxy(targetURL)
		reverseProxy.Transport, _ = rest.TransportFor(s.restConfig)
		reverseProxy.ServeHTTP(w, r)
	})
}
- service: accepts kube-proxy service List&Watch (/api/v1/services) requests and responds from the storageCache contents (GetServices)
- endpoint: accepts kube-proxy endpoints List&Watch (/api/v1/endpoints) requests and responds from the storageCache contents (GetEndpoints)
Below we first focus on the logic of the cache, and then come back to the concrete List&Watch handling logic of the HTTP handlers.
To implement topology awareness, the wrapper maintains its own cache of three resource types: node, service, and endpoints. As shown in setupInformers, handlers for these three resources are registered:
type storageCache struct {
// hostName is the nodeName of node which application-grid-wrapper deploys on
hostName string
wrapperInCluster bool
// mu lock protect the following map structure
mu sync.RWMutex
servicesMap map[types.NamespacedName]*serviceContainer
endpointsMap map[types.NamespacedName]*endpointsContainer
nodesMap map[types.NamespacedName]*nodeContainer
// service watch channel
serviceChan chan<- watch.Event
// endpoints watch channel
endpointsChan chan<- watch.Event
}
...
func NewStorageCache(hostName string, wrapperInCluster bool, serviceNotifier, endpointsNotifier chan watch.Event) *storageCache {
msc := &storageCache{
hostName: hostName,
wrapperInCluster: wrapperInCluster,
servicesMap: make(map[types.NamespacedName]*serviceContainer),
endpointsMap: make(map[types.NamespacedName]*endpointsContainer),
nodesMap: make(map[types.NamespacedName]*nodeContainer),
serviceChan: serviceNotifier,
endpointsChan: endpointsNotifier,
}
return msc
}
...
func (s *interceptorServer) Run(debug bool, bindAddress string, insecure bool, caFile, certFile, keyFile string) error {
...
if err := s.setupInformers(ctx.Done()); err != nil {
return err
}
klog.Infof("Start to run interceptor server")
/* filter
*/
server := &http.Server{Addr: bindAddress, Handler: s.buildFilterChains(debug)}
...
return server.ListenAndServeTLS("", "")
}
func (s *interceptorServer) setupInformers(stop <-chan struct{}) error {
klog.Infof("Start to run service and endpoints informers")
noProxyName, err := labels.NewRequirement(apis.LabelServiceProxyName, selection.DoesNotExist, nil)
if err != nil {
klog.Errorf("can't parse proxy label, %v", err)
return err
}
noHeadlessEndpoints, err := labels.NewRequirement(v1.IsHeadlessService, selection.DoesNotExist, nil)
if err != nil {
klog.Errorf("can't parse headless label, %v", err)
return err
}
labelSelector := labels.NewSelector()
labelSelector = labelSelector.Add(*noProxyName, *noHeadlessEndpoints)
resyncPeriod := time.Minute * 5
client := kubernetes.NewForConfigOrDie(s.restConfig)
nodeInformerFactory := informers.NewSharedInformerFactory(client, resyncPeriod)
informerFactory := informers.NewSharedInformerFactoryWithOptions(client, resyncPeriod,
informers.WithTweakListOptions(func(options *metav1.ListOptions) {
options.LabelSelector = labelSelector.String()
}))
nodeInformer := nodeInformerFactory.Core().V1().Nodes().Informer()
serviceInformer := informerFactory.Core().V1().Services().Informer()
endpointsInformer := informerFactory.Core().V1().Endpoints().Informer()
/*
*/
nodeInformer.AddEventHandlerWithResyncPeriod(s.cache.NodeEventHandler(), resyncPeriod)
serviceInformer.AddEventHandlerWithResyncPeriod(s.cache.ServiceEventHandler(), resyncPeriod)
endpointsInformer.AddEventHandlerWithResyncPeriod(s.cache.EndpointsEventHandler(), resyncPeriod)
go nodeInformer.Run(stop)
go serviceInformer.Run(stop)
go endpointsInformer.Run(stop)
if !cache.WaitForNamedCacheSync("node", stop,
nodeInformer.HasSynced,
serviceInformer.HasSynced,
endpointsInformer.HasSynced) {
return fmt.Errorf("can't sync informers")
}
return nil
}
func (sc *storageCache) NodeEventHandler() cache.ResourceEventHandler {
return &nodeHandler{cache: sc}
}
func (sc *storageCache) ServiceEventHandler() cache.ResourceEventHandler {
return &serviceHandler{cache: sc}
}
func (sc *storageCache) EndpointsEventHandler() cache.ResourceEventHandler {
return &endpointsHandler{cache: sc}
}
We now analyze NodeEventHandler, ServiceEventHandler, and EndpointsEventHandler in turn:
1. NodeEventHandler
NodeEventHandler watches node resource events and adds the node together with its labels to storageCache.nodesMap (keyed by nodeName; the value is the node and its labels).
func (nh *nodeHandler) add(node *v1.Node) {
sc := nh.cache
sc.mu.Lock()
nodeKey := types.NamespacedName{Namespace: node.Namespace, Name: node.Name}
klog.Infof("Adding node %v", nodeKey)
sc.nodesMap[nodeKey] = &nodeContainer{
node: node,
labels: node.Labels,
}
// update endpoints
changedEps := sc.rebuildEndpointsMap()
sc.mu.Unlock()
for _, eps := range changedEps {
sc.endpointsChan <- eps
}
}
func (nh *nodeHandler) update(node *v1.Node) {
sc := nh.cache
sc.mu.Lock()
nodeKey := types.NamespacedName{Namespace: node.Namespace, Name: node.Name}
klog.Infof("Updating node %v", nodeKey)
nodeContainer, found := sc.nodesMap[nodeKey]
if !found {
sc.mu.Unlock()
klog.Errorf("Updating non-existed node %v", nodeKey)
return
}
nodeContainer.node = node
// return directly when labels of node stay unchanged
if reflect.DeepEqual(node.Labels, nodeContainer.labels) {
sc.mu.Unlock()
return
}
nodeContainer.labels = node.Labels
// update endpoints
changedEps := sc.rebuildEndpointsMap()
sc.mu.Unlock()
for _, eps := range changedEps {
sc.endpointsChan <- eps
}
}
...
Since node changes can affect endpoints, rebuildEndpointsMap is then called to refresh storageCache.endpointsMap.
// rebuildEndpointsMap updates all endpoints stored in storageCache.endpointsMap dynamically and constructs relevant modified events
func (sc *storageCache) rebuildEndpointsMap() []watch.Event {
evts := make([]watch.Event, 0)
for name, endpointsContainer := range sc.endpointsMap {
newEps := pruneEndpoints(sc.hostName, sc.nodesMap, sc.servicesMap, endpointsContainer.endpoints, sc.wrapperInCluster)
if apiequality.Semantic.DeepEqual(newEps, endpointsContainer.modified) {
continue
}
sc.endpointsMap[name].modified = newEps
evts = append(evts, watch.Event{
Type: watch.Modified,
Object: newEps,
})
}
return evts
}
rebuildEndpointsMap is the core function of the cache as well as the implementation of the topology-awareness algorithm:
// pruneEndpoints filters endpoints using serviceTopology rules combined by services topologyKeys and node labels
func pruneEndpoints(hostName string,
nodes map[types.NamespacedName]*nodeContainer,
services map[types.NamespacedName]*serviceContainer,
eps *v1.Endpoints, wrapperInCluster bool) *v1.Endpoints {
epsKey := types.NamespacedName{Namespace: eps.Namespace, Name: eps.Name}
if wrapperInCluster {
eps = genLocalEndpoints(eps)
}
// dangling endpoints
svc, ok := services[epsKey]
if !ok {
klog.V(4).Infof("Dangling endpoints %s, %+#v", eps.Name, eps.Subsets)
return eps
}
// normal service
if len(svc.keys) == 0 {
klog.V(4).Infof("Normal endpoints %s, %+#v", eps.Name, eps.Subsets)
return eps
}
// topology endpoints
newEps := eps.DeepCopy()
for si := range newEps.Subsets {
subnet := &newEps.Subsets[si]
subnet.Addresses = filterConcernedAddresses(svc.keys, hostName, nodes, subnet.Addresses)
subnet.NotReadyAddresses = filterConcernedAddresses(svc.keys, hostName, nodes, subnet.NotReadyAddresses)
}
klog.V(4).Infof("Topology endpoints %s: subnets from %+#v to %+#v", eps.Name, eps.Subsets, newEps.Subsets)
return newEps
}
// filterConcernedAddresses aims to filter out endpoints addresses within the same node unit
func filterConcernedAddresses(topologyKeys []string, hostName string, nodes map[types.NamespacedName]*nodeContainer,
addresses []v1.EndpointAddress) []v1.EndpointAddress {
hostNode, found := nodes[types.NamespacedName{Name: hostName}]
if !found {
return nil
}
filteredEndpointAddresses := make([]v1.EndpointAddress, 0)
for i := range addresses {
addr := addresses[i]
if nodeName := addr.NodeName; nodeName != nil {
epsNode, found := nodes[types.NamespacedName{Name: *nodeName}]
if !found {
continue
}
if hasIntersectionLabel(topologyKeys, hostNode.labels, epsNode.labels) {
filteredEndpointAddresses = append(filteredEndpointAddresses, addr)
}
}
}
return filteredEndpointAddresses
}
func hasIntersectionLabel(keys []string, n1, n2 map[string]string) bool {
if n1 == nil || n2 == nil {
return false
}
for _, key := range keys {
val1, v1found := n1[key]
val2, v2found := n2[key]
if v1found && v2found && val1 == val2 {
return true
}
}
return false
}
The algorithm logic is as follows:
- Determine whether the endpoints object belongs to the default kubernetes service; if so, convert the endpoints to the address (127.0.0.1) and port (51003) of the lite-apiserver on the edge node where the wrapper resides
apiVersion: v1
kind: Endpoints
metadata:
annotations:
superedge.io/local-endpoint: 127.0.0.1
superedge.io/local-port: "51003"
name: kubernetes
namespace: default
subsets:
- addresses:
- ip: 172.31.0.60
ports:
- name: https
port: xxx
protocol: TCP
func genLocalEndpoints(eps *v1.Endpoints) *v1.Endpoints {
if eps.Namespace != metav1.NamespaceDefault || eps.Name != MasterEndpointName {
return eps
}
klog.V(4).Infof("begin to gen local ep %v", eps)
ipAddress, e := eps.Annotations[EdgeLocalEndpoint]
if !e {
return eps
}
portStr, e := eps.Annotations[EdgeLocalPort]
if !e {
return eps
}
klog.V(4).Infof("get local endpoint %s:%s", ipAddress, portStr)
port, err := strconv.ParseInt(portStr, 10, 32)
if err != nil {
klog.Errorf("parse int %s err %v", portStr, err)
return eps
}
ip := net.ParseIP(ipAddress)
if ip == nil {
klog.Warningf("parse ip %s nil", ipAddress)
return eps
}
nep := eps.DeepCopy()
nep.Subsets = []v1.EndpointSubset{
{
Addresses: []v1.EndpointAddress{
{
IP: ipAddress,
},
},
Ports: []v1.EndpointPort{
{
Protocol: v1.ProtocolTCP,
Port: int32(port),
Name: "https",
},
},
},
}
klog.V(4).Infof("gen new endpoint complete %v", nep)
return nep
}
The purpose of this is to make services on edge nodes that access the apiserver the InCluster way reach the local lite-apiserver instead of the cloud apiserver.
- From the storageCache.servicesMap cache, look up the service by the endpoints name (namespace/name); if the service carries no topologyKeys, no topology transformation is needed (it is not part of a service group)
func getTopologyKeys(objectMeta *metav1.ObjectMeta) []string {
if !hasTopologyKey(objectMeta) {
return nil
}
var keys []string
keyData := objectMeta.Annotations[TopologyAnnotationsKey]
if err := json.Unmarshal([]byte(keyData), &keys); err != nil {
klog.Errorf("can't parse topology keys %s, %v", keyData, err)
return nil
}
return keys
}
- Call filterConcernedAddresses to filter endpoints.Subsets Addresses and NotReadyAddresses, keeping only the endpoint addresses that share a topology domain defined by the service topologyKeys
// filterConcernedAddresses aims to filter out endpoints addresses within the same node unit
func filterConcernedAddresses(topologyKeys []string, hostName string, nodes map[types.NamespacedName]*nodeContainer,
addresses []v1.EndpointAddress) []v1.EndpointAddress {
hostNode, found := nodes[types.NamespacedName{Name: hostName}]
if !found {
return nil
}
filteredEndpointAddresses := make([]v1.EndpointAddress, 0)
for i := range addresses {
addr := addresses[i]
if nodeName := addr.NodeName; nodeName != nil {
epsNode, found := nodes[types.NamespacedName{Name: *nodeName}]
if !found {
continue
}
if hasIntersectionLabel(topologyKeys, hostNode.labels, epsNode.labels) {
filteredEndpointAddresses = append(filteredEndpointAddresses, addr)
}
}
}
return filteredEndpointAddresses
}
func hasIntersectionLabel(keys []string, n1, n2 map[string]string) bool {
if n1 == nil || n2 == nil {
return false
}
for _, key := range keys {
val1, v1found := n1[key]
val2, v2found := n2[key]
if v1found && v2found && val1 == val2 {
return true
}
}
return false
}
Note: if the edge node where the wrapper resides does not carry the service topologyKeys label, that node cannot access the service either.
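To illustrate this note, here is a minimal runnable sketch of the hasIntersectionLabel behavior; the key name "zone1" and the label values are assumptions for demonstration only:
package main

import "fmt"

// hasIntersectionLabel is copied from the wrapper source quoted above.
func hasIntersectionLabel(keys []string, n1, n2 map[string]string) bool {
	if n1 == nil || n2 == nil {
		return false
	}
	for _, key := range keys {
		val1, v1found := n1[key]
		val2, v2found := n2[key]
		if v1found && v2found && val1 == val2 {
			return true
		}
	}
	return false
}

func main() {
	keys := []string{"zone1"}
	labeled := map[string]string{"zone1": "nodeunit1"} // endpoint's node
	unlabeled := map[string]string{}                   // host node missing the "zone1" label
	fmt.Println(hasIntersectionLabel(keys, labeled, labeled))   // true: same node unit, address kept
	fmt.Println(hasIntersectionLabel(keys, unlabeled, labeled)) // false: every address is filtered out
}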
Back in rebuildEndpointsMap: after pruneEndpoints refreshes the endpoints within the same topology domain, the modified endpoints are assigned to storageCache.endpointsMap[endpoints].modified (this field records the endpoints as modified by topology awareness).
func (nh *nodeHandler) add(node *v1.Node) {
sc := nh.cache
sc.mu.Lock()
nodeKey := types.NamespacedName{Namespace: node.Namespace, Name: node.Name}
klog.Infof("Adding node %v", nodeKey)
sc.nodesMap[nodeKey] = &nodeContainer{
node: node,
labels: node.Labels,
}
// update endpoints
changedEps := sc.rebuildEndpointsMap()
sc.mu.Unlock()
for _, eps := range changedEps {
sc.endpointsChan <- eps
}
}
// rebuildEndpointsMap updates all endpoints stored in storageCache.endpointsMap dynamically and constructs relevant modified events
func (sc *storageCache) rebuildEndpointsMap() []watch.Event {
evts := make([]watch.Event, 0)
for name, endpointsContainer := range sc.endpointsMap {
newEps := pruneEndpoints(sc.hostName, sc.nodesMap, sc.servicesMap, endpointsContainer.endpoints, sc.wrapperInCluster)
if apiequality.Semantic.DeepEqual(newEps, endpointsContainer.modified) {
continue
}
sc.endpointsMap[name].modified = newEps
evts = append(evts, watch.Event{
Type: watch.Modified,
Object: newEps,
})
}
return evts
}
Additionally, if the (topology-modified) endpoints change, a watch event is constructed and passed to the endpoints handler (interceptEndpointsRequest) for processing.
2. ServiceEventHandler
The storageCache.servicesMap map is keyed by service name (namespace/name); the value is a serviceContainer holding the following data (sketched after this list):
- svc: the service object
- keys: the service topologyKeys
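Put together, serviceContainer can be sketched as follows (field names inferred from the code quoted in this article; treat the exact definition as an assumption):
type serviceContainer struct {
	svc  *v1.Service // the service object itself
	keys []string    // topologyKeys parsed from the service annotation
}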
For changes to service resources, the Update event serves as illustration:
func (sh *serviceHandler) update(service *v1.Service) {
sc := sh.cache
sc.mu.Lock()
serviceKey := types.NamespacedName{Namespace: service.Namespace, Name: service.Name}
klog.Infof("Updating service %v", serviceKey)
newTopologyKeys := getTopologyKeys(&service.ObjectMeta)
serviceContainer, found := sc.servicesMap[serviceKey]
if !found {
sc.mu.Unlock()
klog.Errorf("update non-existed service, %v", serviceKey)
return
}
sc.serviceChan <- watch.Event{
Type: watch.Modified,
Object: service,
}
serviceContainer.svc = service
// return directly when topologyKeys of service stay unchanged
if reflect.DeepEqual(serviceContainer.keys, newTopologyKeys) {
sc.mu.Unlock()
return
}
serviceContainer.keys = newTopologyKeys
// update endpoints
changedEps := sc.rebuildEndpointsMap()
sc.mu.Unlock()
for _, eps := range changedEps {
sc.endpointsChan <- eps
}
}
The logic is as follows:
- Obtain the service topologyKeys
- Construct a service watch.Modified event
- Compare the service topologyKeys with the stored ones for differences
- If they differ, update the stored topologyKeys and call rebuildEndpointsMap to refresh the endpoints of that service; if the endpoints change, construct an endpoints watch event and pass it to the endpoints handler (interceptEndpointsRequest)
3. EndpointsEventHandler
The storageCache.endpointsMap map is keyed by endpoints name (namespace/name); the value is an endpointsContainer holding the following data (sketched after this list):
- endpoints: the endpoints before topology modification
- modified: the endpoints after topology modification
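Analogously, endpointsContainer can be sketched as follows (field names inferred from the rebuildEndpointsMap and handler code above; the exact definition is an assumption):
type endpointsContainer struct {
	endpoints *v1.Endpoints // endpoints as received from the informer, before topology pruning
	modified  *v1.Endpoints // endpoints after topology pruning
}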
For changes to endpoints resources, the Update event serves as illustration:
func (eh *endpointsHandler) update(endpoints *v1.Endpoints) {
sc := eh.cache
sc.mu.Lock()
endpointsKey := types.NamespacedName{Namespace: endpoints.Namespace, Name: endpoints.Name}
klog.Infof("Updating endpoints %v", endpointsKey)
endpointsContainer, found := sc.endpointsMap[endpointsKey]
if !found {
sc.mu.Unlock()
klog.Errorf("Updating non-existed endpoints %v", endpointsKey)
return
}
endpointsContainer.endpoints = endpoints
newEps := pruneEndpoints(sc.hostName, sc.nodesMap, sc.servicesMap, endpoints, sc.wrapperInCluster)
changed := !apiequality.Semantic.DeepEqual(endpointsContainer.modified, newEps)
if changed {
endpointsContainer.modified = newEps
}
sc.mu.Unlock()
if changed {
sc.endpointsChan <- watch.Event{
Type: watch.Modified,
Object: newEps,
}
}
}
The logic is as follows:
- Update endpointsContainer.endpoints to the new endpoints object
- Call pruneEndpoints to obtain the topology-refreshed endpoints
- Compare endpointsContainer.modified with the freshly refreshed endpoints
- If they differ, update endpointsContainer.modified, construct an endpoints watch event, and pass it to the endpoints handler (interceptEndpointsRequest)
Having analyzed NodeEventHandler, ServiceEventHandler, and EndpointsEventHandler, we now return to the concrete List&Watch handling logic of the HTTP handlers, taking endpoints as the example:
func (s *interceptorServer) interceptEndpointsRequest(handler http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
if r.Method != http.MethodGet || !strings.HasPrefix(r.URL.Path, "/api/v1/endpoints") {
handler.ServeHTTP(w, r)
return
}
queries := r.URL.Query()
acceptType := r.Header.Get("Accept")
info, found := s.parseAccept(acceptType, s.mediaSerializer)
if !found {
klog.Errorf("can't find %s serializer", acceptType)
w.WriteHeader(http.StatusBadRequest)
return
}
encoder := scheme.Codecs.EncoderForVersion(info.Serializer, v1.SchemeGroupVersion)
// list request
if queries.Get("watch") == "" {
w.Header().Set("Content-Type", info.MediaType)
allEndpoints := s.cache.GetEndpoints()
epsItems := make([]v1.Endpoints, 0, len(allEndpoints))
for _, eps := range allEndpoints {
epsItems = append(epsItems, *eps)
}
epsList := &v1.EndpointsList{
Items: epsItems,
}
err := encoder.Encode(epsList, w)
if err != nil {
klog.Errorf("can't marshal endpoints list, %v", err)
w.WriteHeader(http.StatusInternalServerError)
return
}
return
}
// watch request
timeoutSecondsStr := r.URL.Query().Get("timeoutSeconds")
timeout := time.Minute
if timeoutSecondsStr != "" {
timeout, _ = time.ParseDuration(fmt.Sprintf("%ss", timeoutSecondsStr))
}
timer := time.NewTimer(timeout)
defer timer.Stop()
flusher, ok := w.(http.Flusher)
if !ok {
klog.Errorf("unable to start watch - can't get http.Flusher: %#v", w)
w.WriteHeader(http.StatusMethodNotAllowed)
return
}
e := restclientwatch.NewEncoder(
streaming.NewEncoder(info.StreamSerializer.Framer.NewFrameWriter(w),
scheme.Codecs.EncoderForVersion(info.StreamSerializer, v1.SchemeGroupVersion)),
encoder)
if info.MediaType == runtime.ContentTypeProtobuf {
w.Header().Set("Content-Type", runtime.ContentTypeProtobuf+";stream=watch")
} else {
w.Header().Set("Content-Type", runtime.ContentTypeJSON)
}
w.Header().Set("Transfer-Encoding", "chunked")
w.WriteHeader(http.StatusOK)
flusher.Flush()
for {
select {
case <-r.Context().Done():
return
case <-timer.C:
return
case evt := <-s.endpointsWatchCh:
klog.V(4).Infof("Send endpoint watch event: %+#v", evt)
err := e.Encode(&evt)
if err != nil {
klog.Errorf("can't encode watch event, %v", err)
return
}
if len(s.endpointsWatchCh) == 0 {
flusher.Flush()
}
}
}
})
}
The logic is as follows:
- For a List request, call GetEndpoints to obtain the topology-modified endpoints list and return it
func (sc *storageCache) GetEndpoints() []*v1.Endpoints {
sc.mu.RLock()
defer sc.mu.RUnlock()
epList := make([]*v1.Endpoints, 0, len(sc.endpointsMap))
for _, v := range sc.endpointsMap {
epList = append(epList, v.modified)
}
return epList
}
- For a Watch request, continuously receive watch events from the storageCache.endpointsWatchCh channel and return them (a client-side sketch follows)
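The two request forms can be exercised with a small client sketch; note that the wrapper address 127.0.0.1:51006 and the insecure HTTP scheme are assumptions for illustration only:
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// List: a single EndpointsList response built from the modified cache entries
	resp, err := http.Get("http://127.0.0.1:51006/api/v1/endpoints")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Printf("list response: %d bytes\n", len(body))

	// Watch: a chunked stream of watch events fed from endpointsWatchCh,
	// closed once timeoutSeconds elapses
	resp, err = http.Get("http://127.0.0.1:51006/api/v1/endpoints?watch=true&timeoutSeconds=30")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() { // one JSON-encoded watch event per frame
		fmt.Println("watch event:", scanner.Text())
	}
}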
The interceptServiceRequest logic is identical to interceptEndpointsRequest and is not repeated here.
Summary
- SuperEdge service group leverages application-grid-wrapper to implement topology awareness, enabling closed-loop access to services within the same nodeunit
- Compared with the native Kubernetes upstream implementation, the topology awareness implemented by service group differs as follows:
- The service group topology key is customizable (namely gridUniqKey) and therefore more flexible to use, while the upstream implementation currently offers only three choices: "kubernetes.io/hostname", "topology.kubernetes.io/zone", and "topology.kubernetes.io/region"
- service group accepts only one topology key, meaning only valid endpoints within the local topology domain can be accessed; upstream can fall back to endpoints of alternative topology domains through the topologyKeys list and "*"
- ServiceGrid Controller generates the corresponding service from a ServiceGrid (including the topologyKeys annotation built from serviceGrid.Spec.GridUniqKey); its logic is overall consistent with DeploymentGrid Controller:
- Create and maintain the CRDs required by service group (including ServiceGrid)
- Watch ServiceGrid events and enqueue ServiceGrids into the work queue; in a loop, dequeue ServiceGrids, parse them, and create and maintain the corresponding services
- Watch service events and enqueue the related ServiceGrids into the work queue for the above processing, helping achieve the overall reconcile logic
- To remain non-intrusive to Kubernetes, a wrapper layer is added to the communication between kube-proxy and the apiserver, with the following call chain:
kube-proxy -> application-grid-wrapper -> lite-apiserver -> kube-apiserver
- application-grid-wrapper is an HTTP server that accepts requests from kube-proxy while maintaining a resource cache; its handler functions, from outermost to innermost, are:
- debug: accepts debug requests and returns wrapper pprof runtime information
- logger: logs requests
- node: accepts kube-proxy node GET (/api/v1/nodes/{node}) requests and returns node information
- event: accepts kube-proxy events POST (/events) requests and forwards them to lite-apiserver
- service: accepts kube-proxy service List&Watch (/api/v1/services) requests and responds from the storageCache contents (GetServices)
- endpoint: accepts kube-proxy endpoints List&Watch (/api/v1/endpoints) requests and responds from the storageCache contents (GetEndpoints)
- To implement topology awareness, the wrapper maintains a resource cache of node, service, and endpoints, and registers the corresponding event handlers. The core topology algorithm is: call filterConcernedAddresses to filter endpoints.Subsets Addresses and NotReadyAddresses, keeping only the endpoint addresses within the same topology domain defined by the service topologyKeys. Additionally, if the edge node where the wrapper resides lacks the service topologyKeys label, that node cannot access the service either.
- The wrapper accepts List&Watch requests from kube-proxy for endpoints and services. Taking endpoints as an example: for a List request, GetEndpoints is called to obtain the topology-modified endpoints list, which is returned; for a Watch request, watch events are continuously received from the storageCache.endpointsWatchCh channel and returned. The service logic is identical to endpoints.
Outlook
The topology algorithm currently implemented by SuperEdge service group is more flexible and convenient to use. How it should relate to the upstream Kubernetes service topology awareness feature is worth exploring; we recommend pushing the SuperEdge topology algorithm to the upstream community.