<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://cnbailian.github.io</id>
    <title>白联</title>
    <updated>2022-01-13T01:36:45.540Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://cnbailian.github.io"/>
    <link rel="self" href="https://cnbailian.github.io/atom.xml"/>
    <subtitle>Keep moving forward</subtitle>
    <logo>https://cnbailian.github.io/images/avatar.png</logo>
    <icon>https://cnbailian.github.io/favicon.ico</icon>
    <rights>All rights reserved 2022, 白联</rights>
    <entry>
        <title type="html"><![CDATA[Notes on Kubernetes Controller Design, Seen Through the SampleController Project]]></title>
        <id>https://cnbailian.github.io/post/kubernetes-samplecontroller/</id>
        <link href="https://cnbailian.github.io/post/kubernetes-samplecontroller/">
        </link>
        <updated>2021-05-08T03:10:53.000Z</updated>
        <summary type="html"><![CDATA[<h2 id="总结">Summary</h2>
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/0081Kckwly1glrrw7x90hj31va0ton44.jpg" alt="Kubernetes Informer" loading="lazy"></figure>
]]></summary>
        <content type="html"><![CDATA[<h2 id="总结">Summary</h2>
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/0081Kckwly1glrrw7x90hj31va0ton44.jpg" alt="Kubernetes Informer" loading="lazy"></figure>
<!--more-->
<h2 id="设计理念">Design Concepts</h2>
<p>client-go informer</p>
<h3 id="kubernetes-resource-type">Kubernetes Resource Type</h3>
<p><strong>Scheme</strong></p>
<p>A Scheme maps Go types to their GVK (Group/Version/Kind) and back: given a Go type you can look up its GVK, and given a GVK you can look up the Go type.</p>
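<p>A minimal sketch of this two-way mapping, using only the standard library. The names <code>toyScheme</code>, <code>AddKnownType</code>, <code>KindFor</code> and <code>New</code> are illustrative assumptions, not the real <code>runtime.Scheme</code> API, and <code>Deployment</code> here is an empty placeholder type:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"reflect"
)

// GroupVersionKind is a simplified stand-in for the real GVK type.
type GroupVersionKind struct {
	Group, Version, Kind string
}

// Deployment is a placeholder; the real type lives in k8s.io/api/apps/v1.
type Deployment struct{}

// toyScheme is a hypothetical two-way registry, not the real runtime.Scheme.
type toyScheme struct {
	gvkToType map[GroupVersionKind]reflect.Type
	typeToGVK map[reflect.Type]GroupVersionKind
}

func newToyScheme() *toyScheme {
	s := new(toyScheme)
	s.gvkToType = map[GroupVersionKind]reflect.Type{}
	s.typeToGVK = map[reflect.Type]GroupVersionKind{}
	return s
}

// AddKnownType records the mapping in both directions.
func (s *toyScheme) AddKnownType(gvk GroupVersionKind, obj interface{}) {
	t := reflect.TypeOf(obj)
	s.gvkToType[gvk] = t
	s.typeToGVK[t] = gvk
}

// KindFor answers: which GVK belongs to this Go type?
func (s *toyScheme) KindFor(obj interface{}) (GroupVersionKind, bool) {
	gvk, ok := s.typeToGVK[reflect.TypeOf(obj)]
	return gvk, ok
}

// New answers: which Go type belongs to this GVK? It returns a zero value of that type.
func (s *toyScheme) New(gvk GroupVersionKind) (interface{}, bool) {
	t, ok := s.gvkToType[gvk]
	if !ok {
		return nil, false
	}
	return reflect.New(t).Elem().Interface(), true
}

func main() {
	s := newToyScheme()
	gvk := GroupVersionKind{Group: "apps", Version: "v1", Kind: "Deployment"}
	s.AddKnownType(gvk, Deployment{})

	got, _ := s.KindFor(Deployment{})
	fmt.Println(got.Kind)
	obj, _ := s.New(gvk)
	fmt.Printf("%T\n", obj)
}
</code></pre>
<p>Keeping two maps in step is what lets the lookup run in either direction in constant time.</p>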
<h3 id="informer">Informer</h3>
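<p><strong>The list/watch mechanism</strong></p>
<p>Kubernetes stores its data in etcd and exposes the apiserver as the single entry point: every operation on a resource must go through the apiserver.</p>
<p>For each resource the apiserver offers two interfaces, list and watch. list is an ordinary short-lived HTTP request that returns the current set of resources. watch is a long-lived HTTP connection that streams changes to those resources.</p>
<p>watch keeps the connection open with HTTP chunked transfer encoding; each chunk the server sends carries one resource event.</p>
<p>Design goals</p>
<p>Combining list and watch makes delivery reliable: if a watch event is lost or the connection drops, a fresh list restores the full state, so a lost message does not leave the client in an inconsistent state.</p>
<p>Messages must be timely: whenever the apiserver produces a resource change event, it pushes the event to clients immediately.</p>
<p>Every resource event carries a resourceVersion. In practice this value increases with each change, so a client that handles events for the same resource concurrently can compare resourceVersion values to keep them in order.</p>
<p>The client lists the resources once, writes them into a cache, and then keeps the cache current through watch, which avoids the cost of fetching resources over and over. A periodic resync (resyncPeriod) guards against inconsistencies.</p>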
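<p>The "list once, then maintain the cache through watch" idea can be sketched roughly as follows. This is a toy model under stated assumptions: <code>toyCache</code>, <code>replace</code> and <code>apply</code> are hypothetical names, events are fed in by hand instead of arriving over HTTP, and the resourceVersion is only stored, never interpreted:</p>
<pre><code class="language-go">package main

import "fmt"

// object is a hypothetical, stripped-down resource.
type object struct {
	Key             string
	Spec            string
	ResourceVersion string
}

// event mimics one watch event.
type event struct {
	Type   string // "ADDED", "MODIFIED" or "DELETED"
	Object object
}

// toyCache is the client-side cache that list fills and watch maintains.
type toyCache struct {
	objects map[string]object
	lastRV  string // the point from which the next watch would resume
}

// replace handles a list result: it rebuilds the cache from scratch.
func (c *toyCache) replace(list []object, listRV string) {
	c.objects = map[string]object{}
	for _, o := range list {
		c.objects[o.Key] = o
	}
	c.lastRV = listRV
}

// apply handles one watch event and remembers its resourceVersion.
func (c *toyCache) apply(e event) {
	switch e.Type {
	case "ADDED", "MODIFIED":
		c.objects[e.Object.Key] = e.Object
	case "DELETED":
		delete(c.objects, e.Object.Key)
	}
	c.lastRV = e.Object.ResourceVersion
}

func main() {
	c := new(toyCache)
	// Step 1: list fills the cache.
	c.replace([]object{{Key: "default/a", Spec: "v1", ResourceVersion: "10"}}, "10")
	// Step 2: watch events keep it current without another list.
	c.apply(event{Type: "MODIFIED", Object: object{Key: "default/a", Spec: "v2", ResourceVersion: "11"}})
	c.apply(event{Type: "ADDED", Object: object{Key: "default/b", Spec: "v1", ResourceVersion: "12"}})
	fmt.Println(len(c.objects), c.objects["default/a"].Spec, c.lastRV)
}
</code></pre>
<p>If the watch breaks, the stored <code>lastRV</code> is where a new watch would resume; if that version is no longer available, calling <code>replace</code> with a fresh list restores a consistent cache.</p>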
<p><strong>Informer workflow</strong></p>
<ol>
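<li>The Informer uses a Reflector to connect to the apiserver. The Reflector calls ListAndWatch to monitor every object of the given resource type: the initial list sets resourceVersion to 0, and watch then follows every change after the resourceVersion that the list returned. If the connection fails midway, the Reflector tries to replay all changes from the point where it broke off. When the Reflector receives an event for an object, it combines the event type with the object (this pair is called a Delta) and puts it into the DeltaFIFO queue.</li>
<li>The Informer pops the Deltas off the DeltaFIFO queue and updates the cache through the Indexer according to the event type.</li>
<li>At the same time it invokes the ResourceEventHandler callbacks that were registered beforehand.</li>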
</ol>
<p><strong>Custom Controller workflow</strong></p>
<ol>
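<li>The ResourceEventHandler callbacks only do some light filtering and then put each Object whose change matters (in practice, its key) into the workqueue.</li>
<li>The Controller takes the Object off the workqueue and starts a worker to run its own business logic.</li>
<li>Inside the worker, the lister can be used to read resources without hitting the apiserver every time, because every change to a resource in the apiserver is reflected in the local cache.</li>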
</ol>
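<p>The three steps above can be sketched with a toy controller. Everything here is a simplified assumption made for illustration: <code>toyQueue</code> stands in for the workqueue, a plain map stands in for the informer-maintained cache behind the lister, and events are triggered by hand:</p>
<pre><code class="language-go">package main

import "fmt"

// toyQueue is a hypothetical stand-in for a workqueue: FIFO order, duplicate keys collapsed.
type toyQueue struct {
	keys    []string
	pending map[string]bool
}

func (q *toyQueue) Add(key string) {
	if q.pending[key] {
		return // already waiting, no need to queue it twice
	}
	q.pending[key] = true
	q.keys = append(q.keys, key)
}

func (q *toyQueue) Get() (string, bool) {
	if len(q.keys) == 0 {
		return "", false
	}
	key := q.keys[0]
	q.keys = q.keys[1:]
	delete(q.pending, key)
	return key, true
}

// toyController wires the three steps together.
type toyController struct {
	cache  map[string]string // local cache kept current by the informer (key to spec)
	queue  *toyQueue
	synced []string // what the worker reconciled, recorded for inspection
}

func newToyController() *toyController {
	c := new(toyController)
	c.cache = map[string]string{}
	c.queue = new(toyQueue)
	c.queue.pending = map[string]bool{}
	return c
}

// onEvent plays the role of the event handler: it only enqueues the key.
func (c *toyController) onEvent(key string) {
	c.queue.Add(key)
}

// processNext plays the role of a worker: it takes a key and reads the latest state from the cache.
func (c *toyController) processNext() bool {
	key, ok := c.queue.Get()
	if !ok {
		return false
	}
	spec, exists := c.cache[key]
	if !exists {
		c.synced = append(c.synced, key+" deleted")
		return true
	}
	c.synced = append(c.synced, key+" = "+spec)
	return true
}

func main() {
	c := newToyController()
	c.cache["default/foo"] = "replicas=1"
	c.onEvent("default/foo")
	c.cache["default/foo"] = "replicas=3" // a second change arrives before the worker runs
	c.onEvent("default/foo")
	for c.processNext() {
	}
	fmt.Println(c.synced)
}
</code></pre>
<p>Because only the key is queued and the worker reads the current state from the cache, two quick changes to the same object collapse into one reconciliation against the newest state.</p>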
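<h2 id="源码">Source Code</h2>
<p>This section studies the Informer by combining the book 《Kubernetes 源码剖析》 (Kubernetes Source Code Analysis) with the way sampleController actually uses it.</p>
<h3 id="使用">Usage</h3>
<p>Create an InformerFactory by calling <code>NewSharedInformerFactory</code>, either from <code>k8s.io/client-go/informers</code> or from the generated <code>informers</code> package.</p>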
<pre><code class="language-go">kubeInformerFactory := kubeinformers.NewSharedInformerFactory(kubeClient, time.Second*30)
exampleInformerFactory := informers.NewSharedInformerFactory(exampleClient, time.Second*30)
</code></pre>
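<p>Register event handlers (callback functions) on the specific resources. Normally the callback adds the object to the workqueue.</p>
<p>There are three kinds of events: Added, Updated and Deleted. So what is kubebuilder's generic event...</p>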
<pre><code class="language-go">kubeInformerFactory.Apps().V1().Deployments().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: controller.handleObject,
		UpdateFunc: func(old, new interface{}) {
			newDepl := new.(*appsv1.Deployment)
			oldDepl := old.(*appsv1.Deployment)
			if newDepl.ResourceVersion == oldDepl.ResourceVersion {
				// Periodic resync will send update events for all known Deployments.
				// Two different versions of the same Deployment will always have different RVs.
				return
			}
			controller.handleObject(new)
		},
		DeleteFunc: controller.handleObject,
	})

exampleInformerFactory.Samplecontroller().V1alpha1().Foos().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: controller.enqueueFoo,
		UpdateFunc: func(old, new interface{}) {
			controller.enqueueFoo(new)
		},
	})


func (c *Controller) enqueueFoo(obj interface{}) {
	var key string
	var err error
	if key, err = cache.MetaNamespaceKeyFunc(obj); err != nil {
		utilruntime.HandleError(err)
		return
	}
	c.workqueue.Add(key)
}
</code></pre>
<p>Finally, start the factories. Informers run indefinitely, so a channel is needed to send the stop signal.</p>
<pre><code class="language-go">kubeInformerFactory.Start(stopCh)
exampleInformerFactory.Start(stopCh)
</code></pre>
<h3 id="sharedinformerfactory">SharedInformerFactory</h3>
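<p>The <code>informers.NewSharedInformerFactory</code> function instantiates a <code>SharedInformerFactory</code> object. It takes two parameters. The first, <code>clientset</code>, is the client used to talk to the Kubernetes API Server. The second is a <code>time.Duration</code> (for example <code>time.Minute</code>) that sets how often a resync happens. A resync periodically replays every object held in the <code>Informer Store</code> to the event handlers; as the <code>DeltaFIFO::Resync</code> code later in this post shows, it reads from the local store rather than issuing a new List to the API Server. Setting the parameter to 0 disables resync.</p>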
<pre><code class="language-go">func NewSharedInformerFactory(client kubernetes.Interface, defaultResync time.Duration) SharedInformerFactory {
	return NewSharedInformerFactoryWithOptions(client, defaultResync)
}
func NewSharedInformerFactoryWithOptions(client kubernetes.Interface, defaultResync time.Duration, options ...SharedInformerOption) SharedInformerFactory {
	factory := &amp;sharedInformerFactory{
		client:           client,
		namespace:        v1.NamespaceAll,
		defaultResync:    defaultResync,
		informers:        make(map[reflect.Type]cache.SharedIndexInformer),
		startedInformers: make(map[reflect.Type]bool),
		customResync:     make(map[reflect.Type]time.Duration),
	}

	// Apply all options
	// If this way of passing parameters is unfamiliar, see Rob Pike's article: https://commandcenter.blogspot.com/2014/01/self-referential-functions-and-design.html
	// A related article in Chinese: https://driverzhang.github.io/post/golang友好的设计api参数可选项/
	for _, opt := range options {
		factory = opt(factory)
	}

	return factory
}
</code></pre>
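<p>For readers new to the option pattern used by <code>NewSharedInformerFactoryWithOptions</code>, here is a minimal standalone sketch. The <code>factory</code> struct and <code>NewFactory</code> are hypothetical stand-ins; <code>WithNamespace</code> is written here as a toy option that mirrors the shape of <code>SharedInformerOption</code>, not the client-go implementation:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"time"
)

// factory is a hypothetical stand-in for sharedInformerFactory.
type factory struct {
	namespace     string
	defaultResync time.Duration
}

// Option has the same shape as the option type in the code above:
// it receives the factory and returns it after changing one setting.
type Option func(*factory) *factory

// WithNamespace is an illustrative option that restricts the factory to one namespace.
func WithNamespace(ns string) Option {
	return func(f *factory) *factory {
		f.namespace = ns
		return f
	}
}

// NewFactory sets the defaults first and then applies every option in order.
func NewFactory(defaultResync time.Duration, opts ...Option) *factory {
	f := new(factory)
	f.defaultResync = defaultResync
	for _, opt := range opts {
		f = opt(f)
	}
	return f
}

func main() {
	all := NewFactory(30 * time.Second)
	one := NewFactory(30*time.Second, WithNamespace("demo"))
	fmt.Printf("%q %q\n", all.namespace, one.namespace)
}
</code></pre>
<p>The benefit is that new settings can be added later without changing the constructor's signature or breaking existing callers.</p>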
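<p><strong>How Informers are shared</strong></p>
<p>As the code above shows, what we create is a SharedInformerFactory, and it is meant to be shared.</p>
<p>A SharedInformerFactory lets all consumers of the same resource type share a single Informer, which saves a lot of resources.</p>
<pre><code class="language-go">// This actually calls the factory's InformerFor method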
kubeInformerFactory.Apps().V1().Deployments().Informer()

func (f *deploymentInformer) Informer() cache.SharedIndexInformer {
	return f.factory.InformerFor(&amp;appsv1.Deployment{}, f.defaultInformer)
}

type sharedInformerFactory struct {
  ......
	informers map[reflect.Type]cache.SharedIndexInformer
  ......
}

// InformerFor stores Informers in a map, so requesting the same type several times still shares one informer.
func (f *sharedInformerFactory) InformerFor(obj runtime.Object, newFunc internalinterfaces.NewInformerFunc) cache.SharedIndexInformer {
	f.lock.Lock()
	defer f.lock.Unlock()

	informerType := reflect.TypeOf(obj)
	informer, exists := f.informers[informerType]
	if exists {
		return informer
	}

	resyncPeriod, exists := f.customResync[informerType]
	if !exists {
		resyncPeriod = f.defaultResync
	}

	informer = newFunc(f.client, resyncPeriod)
	f.informers[informerType] = informer

	return informer
}
</code></pre>
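<p>The sharing logic boils down to a map keyed by <code>reflect.Type</code>. A stripped-down sketch, assuming placeholder types (<code>Deployment</code>, <code>Pod</code>, <code>toyInformer</code>, <code>toyFactory</code> are all hypothetical) and leaving out the locking that the real method does:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"reflect"
)

// Placeholder resource types for the sketch.
type Deployment struct{}
type Pod struct{}

// toyInformer is a hypothetical placeholder for cache.SharedIndexInformer.
type toyInformer struct {
	id int
}

// toyFactory keeps one informer per Go type. No mutex: single goroutine only.
type toyFactory struct {
	informers map[reflect.Type]*toyInformer
	created   int
}

func newToyFactory() *toyFactory {
	f := new(toyFactory)
	f.informers = map[reflect.Type]*toyInformer{}
	return f
}

// InformerFor returns the cached informer for obj's type and builds one on first use only.
func (f *toyFactory) InformerFor(obj interface{}) *toyInformer {
	t := reflect.TypeOf(obj)
	if inf, ok := f.informers[t]; ok {
		return inf
	}
	f.created++
	inf := new(toyInformer)
	inf.id = f.created
	f.informers[t] = inf
	return inf
}

func main() {
	f := newToyFactory()
	a := f.InformerFor(Deployment{})
	b := f.InformerFor(Deployment{})
	c := f.InformerFor(Pod{})
	fmt.Println(a == b, a == c, f.created)
}
</code></pre>
<p>Two controllers that both ask for the Deployment informer therefore end up registering their handlers on the same instance.</p>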
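<p>As shown above, the sharedInformerFactory's InformerFor method instantiates an informer and stores it in <code>f.informers</code>. Next, look at the <code>newFunc</code> that the Deployment informer passes in:</p>
<pre><code class="language-go">// As you can see, what gets passed in is the method value deploymentInformer.defaultInformer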
func (f *deploymentInformer) Informer() cache.SharedIndexInformer {
	return f.factory.InformerFor(&amp;appsv1.Deployment{}, f.defaultInformer)
}

// defaultInformer takes the same parameters as NewSharedInformerFactory, and they serve the same purpose
func (f *deploymentInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	return NewFilteredDeploymentInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)
}

// Instantiates the SharedIndexInformer, passing in the List and Watch functions for the resource
func NewFilteredDeploymentInformer(client kubernetes.Interface, namespace string, resyncPeriod time.Duration, indexers cache.Indexers, tweakListOptions internalinterfaces.TweakListOptionsFunc) cache.SharedIndexInformer {
	return cache.NewSharedIndexInformer(
		&amp;cache.ListWatch{
			ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
				if tweakListOptions != nil {
					tweakListOptions(&amp;options)
				}
				return client.AppsV1().Deployments(namespace).List(context.TODO(), options)
			},
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				if tweakListOptions != nil {
					tweakListOptions(&amp;options)
				}
				return client.AppsV1().Deployments(namespace).Watch(context.TODO(), options)
			},
		},
		&amp;appsv1.Deployment{},
		resyncPeriod,
		indexers,
	)
}
</code></pre>
<h3 id="sharedindexinformer">SharedIndexInformer</h3>
<p>Finally, look at how the informers are started and how an informer makes use of List and Watch.</p>
<pre><code class="language-go">func (f *sharedInformerFactory) Start(stopCh &lt;-chan struct{}) {
	f.lock.Lock()
	defer f.lock.Unlock()

	for informerType, informer := range f.informers {
		if !f.startedInformers[informerType] {
			go informer.Run(stopCh)
			f.startedInformers[informerType] = true
		}
	}
}

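// As the code above shows, the informer being run is the SharedIndexInformer created earlier
// A SharedIndexInformer has three main components:
// The first is the indexed local cache, the Indexer
// The second is the controller, which fetches resources through the ListerWatcher and pushes them into the DeltaFIFO,
// and at the same time pops Deltas values from the FIFO and handles them in sharedIndexInformer::HandleDeltas
// Each Deltas value updates the local cache and sends the matching notifications to the sharedProcessor
// The third is the sharedProcessor, which forwards those notifications to the listeners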
func (s *sharedIndexInformer) Run(stopCh &lt;-chan struct{}) {
	defer utilruntime.HandleCrash()

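	// The name DeltaFIFO can be read in two parts
	// FIFO: it is a first-in, first-out queue
	// Delta: the queue stores Delta objects; a Delta holds a resource object together with the type of operation applied to it,
	// for example Added, Updated, Deleted or Sync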
	fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
		KnownObjects:          s.indexer,
		EmitDeltaTypeReplaced: true,
	})

	cfg := &amp;Config{
		Queue:            fifo,
		ListerWatcher:    s.listerWatcher,
		ObjectType:       s.objectType,
		FullResyncPeriod: s.resyncCheckPeriod,
		RetryOnError:     false,
		ShouldResync:     s.processor.shouldResync,

		Process:           s.HandleDeltas,
		WatchErrorHandler: s.watchErrorHandler,
	}

	func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()
		// Instantiate the controller component
		s.controller = New(cfg)
		s.controller.(*controller).clock = s.clock
		s.started = true
	}()

	// Start the processor component
	// Separate stop channel because Processor should be stopped strictly after controller
	processorStopCh := make(chan struct{})
	var wg wait.Group
	defer wg.Wait()              // Wait for Processor to stop
	defer close(processorStopCh) // Tell Processor to stop
	wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
	wg.StartWithChannel(processorStopCh, s.processor.run)

	defer func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()
		s.stopped = true // Don't want any new listeners
	}()
	s.controller.Run(stopCh)
}
</code></pre>
<h4 id="reflector">Reflector</h4>
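<p>The controller component does most of its work through the Reflector.</p>
<p>The Reflector watches the specified resource. When a watched resource changes, it produces the corresponding change event, such as Added, Updated or Deleted, and stores the resource object in the DeltaFIFO.</p>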
<pre><code class="language-go">func (c *controller) Run(stopCh &lt;-chan struct{}) {
	defer utilruntime.HandleCrash()
	go func() {
		&lt;-stopCh
		c.config.Queue.Close()
	}()
	// NewReflector needs a ListerWatcher when it is instantiated
	// It is the one each resource passes to NewSharedIndexInformer, and it implements List and Watch for that resource
	// Queue is the DeltaFIFO instantiated above
	r := NewReflector(
		c.config.ListerWatcher,
		c.config.ObjectType,
		c.config.Queue,
		c.config.FullResyncPeriod,
	)
	r.ShouldResync = c.config.ShouldResync
	r.WatchListPageSize = c.config.WatchListPageSize
	r.clock = c.clock
	if c.config.WatchErrorHandler != nil {
		r.watchErrorHandler = c.config.WatchErrorHandler
	}

	c.reflectorMutex.Lock()
	c.reflector = r
	c.reflectorMutex.Unlock()

	var wg wait.Group
	// Start the reflector
	wg.StartWithChannel(stopCh, r.Run)
	// Start the process loop
	wait.Until(c.processLoop, time.Second, stopCh)
	wg.Wait()
}
</code></pre>
<p><strong>Reflector run</strong></p>
<pre><code class="language-go">func (r *Reflector) Run(stopCh &lt;-chan struct{}) {
	klog.V(2).Infof(&quot;Starting reflector %s (%s) from %s&quot;, r.expectedTypeName, r.resyncPeriod, r.name)
	wait.BackoffUntil(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			r.watchErrorHandler(r, err)
		}
	}, r.backoffManager, true, stopCh)
	klog.V(2).Infof(&quot;Stopping reflector %s (%s) from %s&quot;, r.expectedTypeName, r.resyncPeriod, r.name)
}

// ListAndWatch has two parts: the first fetches the list data, the second watches the resource objects
func (r *Reflector) ListAndWatch(stopCh &lt;-chan struct{}) error {
	klog.V(3).Infof(&quot;Listing and watching %v from %s&quot;, r.expectedTypeName, r.name)
	var resourceVersion string

	options := metav1.ListOptions{ResourceVersion: r.relistResourceVersion()}
	// Part 1: fetch the list data
	if err := func() error {
		initTrace := trace.New(&quot;Reflector ListAndWatch&quot;, trace.Field{&quot;name&quot;, r.name})
		defer initTrace.LogIfLong(10 * time.Second)
		var list runtime.Object
		var paginatedResult bool
		var err error
		listCh := make(chan struct{}, 1)
		panicCh := make(chan interface{}, 1)
		go func() {
			defer func() {
				if r := recover(); r != nil {
					panicCh &lt;- r
				}
			}()
			// Attempt to gather list in chunks, if supported by listerWatcher, if not, the first
			// list request will return the full response.
			pager := pager.New(pager.SimplePageFunc(func(opts metav1.ListOptions) (runtime.Object, error) {
				return r.listerWatcher.List(opts)
			}))
			switch {
			case r.WatchListPageSize != 0:
				pager.PageSize = r.WatchListPageSize
			case r.paginatedResult:
				// We got a paginated result initially. Assume this resource and server honor
				// paging requests (i.e. watch cache is probably disabled) and leave the default
				// pager size set.
			case options.ResourceVersion != &quot;&quot; &amp;&amp; options.ResourceVersion != &quot;0&quot;:
				// User didn't explicitly request pagination.
				//
				// With ResourceVersion != &quot;&quot;, we have a possibility to list from watch cache,
				// but we do that (for ResourceVersion != &quot;0&quot;) only if Limit is unset.
				// To avoid thundering herd on etcd (e.g. on master upgrades), we explicitly
				// switch off pagination to force listing from watch cache (if enabled).
				// With the existing semantic of RV (result is at least as fresh as provided RV),
				// this is correct and doesn't lead to going back in time.
				//
				// We also don't turn off pagination for ResourceVersion=&quot;0&quot;, since watch cache
				// is ignoring Limit in that case anyway, and if watch cache is not enabled
				// we don't introduce regression.
				pager.PageSize = 0
			}
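			// Fetch the list data; which data is returned is controlled by options.ResourceVersion
			// If ResourceVersion is 0, all resource data is fetched; if it is non-zero, fetching continues from that resource version
			// This works much like resuming an interrupted download: after a network failure cuts the transfer short, the next connection continues from the resource version and transfers the part that is still missing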
			list, paginatedResult, err = pager.List(context.Background(), options)
			if isExpiredError(err) || isTooLargeResourceVersionError(err) {
				r.setIsLastSyncResourceVersionUnavailable(true)
				// Retry immediately if the resource version used to list is unavailable.
				// The pager already falls back to full list if paginated list calls fail due to an &quot;Expired&quot; error on
				// continuation pages, but the pager might not be enabled, the full list might fail because the
				// resource version it is listing at is expired or the cache may not yet be synced to the provided
				// resource version. So we need to fallback to resourceVersion=&quot;&quot; in all to recover and ensure
				// the reflector makes forward progress.
				list, paginatedResult, err = pager.List(context.Background(), metav1.ListOptions{ResourceVersion: r.relistResourceVersion()})
			}
			close(listCh)
		}()
		select {
		case &lt;-stopCh:
			return nil
		case r := &lt;-panicCh:
			panic(r)
		case &lt;-listCh:
		}
		if err != nil {
			return fmt.Errorf(&quot;failed to list %v: %v&quot;, r.expectedTypeName, err)
		}

		// We check if the list was paginated and if so set the paginatedResult based on that.
		// However, we want to do that only for the initial list (which is the only case
		// when we set ResourceVersion=&quot;0&quot;). The reasoning behind it is that later, in some
		// situations we may force listing directly from etcd (by setting ResourceVersion=&quot;&quot;)
		// which will return paginated result, even if watch cache is enabled. However, in
		// that case, we still want to prefer sending requests to watch cache if possible.
		//
		// Paginated result returned for request with ResourceVersion=&quot;0&quot; mean that watch
		// cache is disabled and there are a lot of objects of a given type. In such case,
		// there is no need to prefer listing from watch cache.
		if options.ResourceVersion == &quot;0&quot; &amp;&amp; paginatedResult {
			r.paginatedResult = true
		}

		r.setIsLastSyncResourceVersionUnavailable(false) // list was successful
		initTrace.Step(&quot;Objects listed&quot;)
		listMetaInterface, err := meta.ListAccessor(list)
		if err != nil {
			return fmt.Errorf(&quot;unable to understand list result %#v: %v&quot;, list, err)
		}
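		// Get the ResourceVersion. It is very important: every resource in Kubernetes has this field
		// It identifies the version of the resource object. The apiserver changes ResourceVersion every time the object is modified,
		// so that when client-go runs a watch it can tell from ResourceVersion whether the object has changed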
		resourceVersion = listMetaInterface.GetResourceVersion()
		initTrace.Step(&quot;Resource version extracted&quot;)
		// Convert the returned resource objects into a list
		items, err := meta.ExtractList(list)
		if err != nil {
			return fmt.Errorf(&quot;unable to understand list result %#v (%v)&quot;, list, err)
		}
		initTrace.Step(&quot;Objects extracted&quot;)
		// Store the resource objects in the DeltaFIFO, replacing any objects already there
		// This is done by calling the Replace method of the DeltaFIFO that was passed in
		if err := r.syncWith(items, resourceVersion); err != nil {
			return fmt.Errorf(&quot;unable to sync list result: %v&quot;, err)
		}
		initTrace.Step(&quot;SyncWith done&quot;)
		r.setLastSyncResourceVersion(resourceVersion)
		initTrace.Step(&quot;Resource version updated&quot;)
		return nil
	}(); err != nil {
		return err
	}

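	// Extra part: the resync mechanism. If a resyncPeriod was given when the SharedIndexInformer was instantiated,
	// a goroutine is started here to force a periodic resync, which is also fed into the DeltaFIFO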
	resyncerrc := make(chan error, 1)
	cancelCh := make(chan struct{})
	defer close(cancelCh)
	go func() {
		resyncCh, cleanup := r.resyncChan()
		defer func() {
			cleanup() // Call the last one written into cleanup
		}()
		for {
			select {
			case &lt;-resyncCh:
			case &lt;-stopCh:
				return
			case &lt;-cancelCh:
				return
			}
			if r.ShouldResync == nil || r.ShouldResync() {
				klog.V(4).Infof(&quot;%s: forcing resync&quot;, r.name)
				if err := r.store.Resync(); err != nil {
					resyncerrc &lt;- err
					return
				}
			}
			cleanup()
			resyncCh, cleanup = r.resyncChan()
		}
	}()

	// Part 2: watch the resource objects
	for {
		// give the stopCh a chance to stop the loop, even in case of continue statements further down on errors
		select {
		case &lt;-stopCh:
			return nil
		default:
		}

		timeoutSeconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
		options = metav1.ListOptions{
			ResourceVersion: resourceVersion,
      // typo issue: wachers =&gt; watchers
			// We want to avoid situations of hanging watchers. Stop any wachers that do not
			// receive any events within the timeout window.
			TimeoutSeconds: &amp;timeoutSeconds,
			// To reduce load on kube-apiserver on watch restarts, you may enable watch bookmarks.
			// Reflector doesn't assume bookmarks are returned at all (if the server do not support
			// watch bookmarks, it will ignore this field).
			AllowWatchBookmarks: true,
		}

		// start the clock before sending the request, since some proxies won't flush headers until after the first watch event is sent
		start := r.clock.Now()
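		// Watch calls the Watch function of the resource's client, which opens a long-lived HTTP connection to kube-apiserver
		// Watch is implemented on top of HTTP chunked transfer encoding
		// The client's Watch method returns an implementation of the watch interface, which is handed to the code below to read data from its channel
		// If the error here means the apiserver is not responding, the request is retried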
		w, err := r.listerWatcher.Watch(options)
		if err != nil {
			// If this is &quot;connection refused&quot; error, it means that most likely apiserver is not responsive.
			// It doesn't make sense to re-list all objects because most likely we will be able to restart
			// watch where we ended.
			// If that's the case begin exponentially backing off and resend watch request.
			if utilnet.IsConnectionRefused(err) {
				&lt;-r.initConnBackoffManager.Backoff().C()
				continue
			}
			return err
		}
		// watchHandler handles the resource change events. On an Added, Updated or Deleted event it pushes the resource object
		// into the DeltaFIFO and updates the ResourceVersion
		if err := r.watchHandler(start, w, &amp;resourceVersion, resyncerrc, stopCh); err != nil {
			if err != errorStopRequested {
				switch {
				case isExpiredError(err):
					// Don't set LastSyncResourceVersionUnavailable - LIST call with ResourceVersion=RV already
					// has a semantic that it returns data at least as fresh as provided RV.
					// So first try to LIST with setting RV to resource version of last observed object.
					klog.V(4).Infof(&quot;%s: watch of %v closed with: %v&quot;, r.name, r.expectedTypeName, err)
				default:
					klog.Warningf(&quot;%s: watch of %v ended with: %v&quot;, r.name, r.expectedTypeName, err)
				}
			}
			return nil
		}
	}
}

// Reads the chunked data from watcher.ResultChan and calls the DeltaFIFO method that matches each event type.
func (r *Reflector) watchHandler(start time.Time, w watch.Interface, resourceVersion *string, errc chan error, stopCh &lt;-chan struct{}) error {
	eventCount := 0

	// Stopping the watcher should be idempotent and if we return from this function there's no way
	// we're coming back in with the same watch interface.
	defer w.Stop()

loop:
	for {
		select {
		case &lt;-stopCh:
			return errorStopRequested
		case err := &lt;-errc:
			return err
		case event, ok := &lt;-w.ResultChan():
			if !ok {
				break loop
			}
      ......
			newResourceVersion := meta.GetResourceVersion()
			switch event.Type {
			case watch.Added:
        // DeltaFIFO::Add
				err := r.store.Add(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf(&quot;%s: unable to add watch event object (%#v) to store: %v&quot;, r.name, event.Object, err))
				}
			case watch.Modified:
        // DeltaFIFO::Update
				err := r.store.Update(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf(&quot;%s: unable to update watch event object (%#v) to store: %v&quot;, r.name, event.Object, err))
				}
			case watch.Deleted:
				// TODO: Will any consumers need access to the &quot;last known
				// state&quot;, which is passed in event.Object? If so, may need
				// to change this.
        // 
        // DeltaFIFO::Delete
				err := r.store.Delete(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf(&quot;%s: unable to delete watch event object (%#v) from store: %v&quot;, r.name, event.Object, err))
				}
			case watch.Bookmark:
				// A `Bookmark` means watch has synced here, just update the resourceVersion
			default:
				utilruntime.HandleError(fmt.Errorf(&quot;%s: unable to understand watch event %#v&quot;, r.name, event))
			}
      ......
		}
	}
  ......
	return nil
}
</code></pre>
<h4 id="deltafifo">DeltaFIFO</h4>
<p>Next, look at what DeltaFIFO does in detail, that is, what the Resync, Add, Update and other operations seen in the source above actually do.</p>
<p>Go back to the sharedIndexInformer::Run method:</p>
<pre><code class="language-go">func (s *sharedIndexInformer) Run(stopCh &lt;-chan struct{}) {
	......
	fifo := NewDeltaFIFOWithOptions(DeltaFIFOOptions{
		// The Indexer is created when the resource's Informer instantiates the SharedIndexInformer; more on this below
		KnownObjects:          s.indexer,
		EmitDeltaTypeReplaced: true,
	})
  ......
}

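// DeltaFIFO is a producer-consumer queue: the Reflector is the producer, and the consumer is whoever calls Pop
// DeltaFIFO can hand over all pending operations on one resource object at once, which follows from how it stores data
// The queue field stores the keys of the resource objects; a key is computed by the KeyOf function. The items field is a map
// whose keys match the entries in queue and whose values are the Deltas slice of each object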
type DeltaFIFO struct {
  ......
	items map[string]Deltas
	queue []string
  ......
}

// Add, Update and Delete are all producer methods. Each produces an incremental update by calling queueActionLocked;
// only the DeltaType passed in differs
func (f *DeltaFIFO) Add(obj interface{}) error {
	f.lock.Lock()
	defer f.lock.Unlock()
	f.populated = true
	return f.queueActionLocked(Added, obj)
}

func (f *DeltaFIFO) queueActionLocked(actionType DeltaType, obj interface{}) error {
	// Compute the key
	id, err := f.KeyOf(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	oldDeltas := f.items[id]
	newDeltas := append(oldDeltas, Delta{actionType, obj})
	// Not only watch but also resync changes the values in items, so incoming events have to be deduplicated
	newDeltas = dedupDeltas(newDeltas)

	if len(newDeltas) &gt; 0 {
		// Update the queue field
		if _, exists := f.items[id]; !exists {
			f.queue = append(f.queue, id)
		}
		// Update items
		f.items[id] = newDeltas
		// Wake up the blocked goroutines
		f.cond.Broadcast()
	} else {
		// This never happens, because dedupDeltas never returns an empty list
		// when given a non-empty list (as it is here).
		// If somehow it happens anyway, deal with it but complain.
		if oldDeltas == nil {
			klog.Errorf(&quot;Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; ignoring&quot;, id, oldDeltas, obj)
			return nil
		}
		klog.Errorf(&quot;Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; breaking invariant by storing empty Deltas&quot;, id, oldDeltas, obj)
		f.items[id] = newDeltas
		return fmt.Errorf(&quot;Impossible dedupDeltas for id=%q: oldDeltas=%#+v, obj=%#+v; broke DeltaFIFO invariant by storing empty Deltas&quot;, id, oldDeltas, obj)
	}
	return nil
}

// Consumer method
func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for {
		// Block while the queue is empty
		for len(f.queue) == 0 {
			// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
			// When Close() is called, the f.closed is set and the condition is broadcasted.
			// Which causes this loop to continue and return from the Pop().
			if f.closed {
				return nil, ErrFIFOClosed
			}
			// Block; woken up by f.cond.Broadcast
			f.cond.Wait()
		}
		// Take the key of the resource object at the head
		id := f.queue[0]
		// The lock is held, so the queue can be updated directly
		f.queue = f.queue[1:]
		if f.initialPopulationCount &gt; 0 {
			f.initialPopulationCount--
		}
		// Look up the deltas by key
		item, ok := f.items[id]
		if !ok {
			// This should never happen
			klog.Errorf(&quot;Inconceivable! %q was in f.queue but not f.items; ignoring.&quot;, id)
			continue
		}
		delete(f.items, id)
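		// process is the callback passed in by the upper-level consumer (the controller)
		// This is where the third component, the sharedProcessor, comes in: through this callback the listeners get notified
		err := process(item)
		// If the callback returns an ErrRequeue, put the item back into the queue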
		if e, ok := err.(ErrRequeue); ok {
			f.addIfNotPresent(id, item)
			err = e.Err
		}
		// Don't need to copyDeltas here, because we're transferring
		// ownership to the caller.
		return item, err
	}
}
</code></pre>
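<p>The storage layout (a <code>queue</code> of keys plus an <code>items</code> map of per-key deltas) is easier to see in a toy version. The sketch below is a deliberately reduced assumption: <code>toyFIFO</code>, <code>Push</code> and its non-blocking <code>Pop</code> are hypothetical, objects are plain strings, and there is no locking, deduplication or requeue:</p>
<pre><code class="language-go">package main

import "fmt"

type DeltaType string

const (
	Added   DeltaType = "Added"
	Updated DeltaType = "Updated"
	Deleted DeltaType = "Deleted"
)

// Delta pairs an operation type with the object it applies to (a string in this sketch).
type Delta struct {
	Type   DeltaType
	Object string
}

// toyFIFO is a hypothetical single-goroutine sketch of the items/queue layout.
type toyFIFO struct {
	items map[string][]Delta // key to all pending deltas for that key
	queue []string           // keys in arrival order, each key at most once
}

func newToyFIFO() *toyFIFO {
	f := new(toyFIFO)
	f.items = map[string][]Delta{}
	return f
}

// Push appends a delta under key; the key enters the queue only the first time.
func (f *toyFIFO) Push(key string, d Delta) {
	if _, exists := f.items[key]; !exists {
		f.queue = append(f.queue, key)
	}
	f.items[key] = append(f.items[key], d)
}

// Pop returns every pending delta of the oldest key, oldest delta first.
func (f *toyFIFO) Pop() (string, []Delta, bool) {
	if len(f.queue) == 0 {
		return "", nil, false
	}
	key := f.queue[0]
	f.queue = f.queue[1:]
	deltas := f.items[key]
	delete(f.items, key)
	return key, deltas, true
}

func main() {
	f := newToyFIFO()
	f.Push("default/a", Delta{Added, "a rev1"})
	f.Push("default/b", Delta{Added, "b rev1"})
	f.Push("default/a", Delta{Updated, "a rev2"})

	key, deltas, _ := f.Pop()
	fmt.Println(key, len(deltas))
}
</code></pre>
<p>Because the second change to <code>default/a</code> is appended to the existing entry instead of being queued again, a single Pop hands the consumer the full history of that object in order.</p>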
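<p><strong>DeltaFIFO Resync</strong></p>
<p>Is Resync the same thing as kubebuilder's retry?</p>
<p>How does a Deleted event deliver the complete resource?</p>
<p>As seen above, <code>Reflector::ListAndWatch</code> starts a goroutine for periodic synchronization; what it calls is the <code>DeltaFIFO::Resync</code> method.</p>
<p>Resync copies the resource objects in the indexer back into the DeltaFIFO so that events whose handling failed get another chance to be processed.</p>
<pre><code class="language-go">// Resync in effect adds a delta of type Sync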
func (f *DeltaFIFO) Resync() error {
	f.lock.Lock()
	defer f.lock.Unlock()

	// The knownObjects interface lists all known resource objects; what is actually passed in is the indexer,
	// supplied when the DeltaFIFO is instantiated in sharedIndexInformer::Run
	if f.knownObjects == nil {
		return nil
	}

	keys := f.knownObjects.ListKeys()
	for _, k := range keys {
		if err := f.syncKeyLocked(k); err != nil {
			return err
		}
	}
	return nil
}

// syncKeyLocked copies one resource object from the indexer into the DeltaFIFO and marks it with the Sync type
func (f *DeltaFIFO) syncKeyLocked(key string) error {
	obj, exists, err := f.knownObjects.GetByKey(key)
  ......

	// If we are doing Resync() and there is already an event queued for that object,
	// we ignore the Resync for it. This is to avoid the race, in which the resync
	// comes with the previous value of object (since queueing an event for the object
	// doesn't trigger changing the underlying store &lt;knownObjects&gt;.
	id, err := f.KeyOf(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	if len(f.items[id]) &gt; 0 {
		return nil
	}

	if err := f.queueActionLocked(Sync, obj); err != nil {
		return fmt.Errorf(&quot;couldn't queue object: %v&quot;, err)
	}
	return nil
}
</code></pre>
<p><strong>Replace</strong></p>
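<p>In <code>Reflector::ListAndWatch</code>, the List part replaces the resources in the DeltaFIFO through the <code>DeltaFIFO::Replace</code> method. This method handles the data from the first List and resynchronizes the data after a broken connection.</p>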
<pre><code class="language-go">func (f *DeltaFIFO) Replace(list []interface{}, resourceVersion string) error {
  ......
	for _, item := range list {
		key, err := f.KeyOf(item)
		if err != nil {
			return KeyError{item, err}
		}
		keys.Insert(key)
		if err := f.queueActionLocked(action, item); err != nil {
			return fmt.Errorf(&quot;couldn't enqueue object: %v&quot;, err)
		}
	}
	......
	// Detect deletions not already in the queue.
	knownKeys := f.knownObjects.ListKeys()
	queuedDeletions := 0
	for _, k := range knownKeys {
		if keys.Has(k) {
			continue
		}
    ......
		// Use the Indexer to detect resources that have been deleted
		if err := f.queueActionLocked(Deleted, DeletedFinalStateUnknown{k, deletedObj}); err != nil {
			return err
		}
	}
  ......
}
</code></pre>
<h4 id="indexer">Indexer</h4>
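<p>As introduced above, the Indexer is the component responsible for the local cache: it stores resource objects and has built-in indexing. The data in the Indexer is kept consistent with the data in the etcd cluster, mainly through the Reflector.</p>
<p>The indexers are passed in as an argument when the <code>sharedIndexInformer</code> is instantiated, and the Indexer is created along with it. Below is the code of the Deployment Informer:</p>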
<pre><code class="language-go">func (f *deploymentInformer) defaultInformer(client kubernetes.Interface, resyncPeriod time.Duration) cache.SharedIndexInformer {
	return NewFilteredDeploymentInformer(client, f.namespace, resyncPeriod, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc}, f.tweakListOptions)
}

func NewSharedIndexInformer(lw ListerWatcher, exampleObject runtime.Object, defaultEventHandlerResyncPeriod time.Duration, indexers Indexers) SharedIndexInformer {
	realClock := &amp;clock.RealClock{}
	sharedIndexInformer := &amp;sharedIndexInformer{
    ......
		indexer:                         NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers),
    ......
	}
	return sharedIndexInformer
}

func NewIndexer(keyFunc KeyFunc, indexers Indexers) Indexer {
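	// cache stores its data in cacheStorage and wraps it with convenience methods for indexing
	return &amp;cache{
		// cacheStorage is a ThreadSafeStore (implemented by threadSafeMap), a thread-safe store
		cacheStorage: NewThreadSafeStore(indexers, Indices{}),
		// The function used to compute an object's key
		keyFunc:      keyFunc,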
	}
}

func NewThreadSafeStore(indexers Indexers, indices Indices) ThreadSafeStore {
	return &amp;threadSafeMap{
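		// Storage
		items:    map[string]interface{}{},
		// Indexers: a map from indexer name to indexer function, passed in by the resource's informer
		// An indexer function takes a resource object and returns the list of index values for it
		indexers: indexers,
		// Indices: a map from indexer name to its Index
		// An Index holds the indexed cache data as sets; Go has no set type, so a map is used to get set semantics (no duplicates)
		indices:  indices,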
	}
}

func (c *threadSafeMap) Add(key string, obj interface{}) {
	// The lock keeps the data consistent
	c.lock.Lock()
	defer c.lock.Unlock()
	oldObject := c.items[key]
	c.items[key] = obj
	// Update the indices
	c.updateIndices(oldObject, obj, key)
}
</code></pre>
<p>The indexer is easiest to understand through an example:</p>
<figure data-type="image" tabindex="2"><img src="https://tva1.sinaimg.cn/large/0081Kckwly1glqkusf9nvj30su10odib.jpg" alt="example" loading="lazy"></figure>
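<p>The relationship between <code>items</code>, <code>indexers</code> and <code>indices</code> can also be worked through in code. This is a rough sketch under simplifying assumptions: <code>toyIndexer</code>, <code>Pod</code> and <code>ByIndex</code> as written here are hypothetical, objects carry only a namespace and a name, and locking is left out:</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"sort"
)

// Pod is a placeholder object with just enough fields to index.
type Pod struct {
	Namespace, Name string
}

// IndexFunc computes the index values of an object.
type IndexFunc func(p Pod) []string

// toyIndexer is a hypothetical sketch of the three maps described above.
type toyIndexer struct {
	items    map[string]Pod                        // key to object
	indexers map[string]IndexFunc                  // index name to index function
	indices  map[string]map[string]map[string]bool // index name to index value to set of keys
}

func newToyIndexer(indexers map[string]IndexFunc) *toyIndexer {
	i := new(toyIndexer)
	i.items = map[string]Pod{}
	i.indexers = indexers
	i.indices = map[string]map[string]map[string]bool{}
	return i
}

// Add stores the object and refreshes every index entry that refers to it.
func (i *toyIndexer) Add(p Pod) {
	key := p.Namespace + "/" + p.Name
	if old, ok := i.items[key]; ok {
		// Drop the index entries of the previous version first.
		for name, fn := range i.indexers {
			for _, v := range fn(old) {
				delete(i.indices[name][v], key)
			}
		}
	}
	i.items[key] = p
	for name, fn := range i.indexers {
		if i.indices[name] == nil {
			i.indices[name] = map[string]map[string]bool{}
		}
		for _, v := range fn(p) {
			if i.indices[name][v] == nil {
				i.indices[name][v] = map[string]bool{}
			}
			i.indices[name][v][key] = true
		}
	}
}

// ByIndex returns the sorted keys of all objects whose index value matches.
func (i *toyIndexer) ByIndex(name, value string) []string {
	keys := []string{}
	for k := range i.indices[name][value] {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	byNamespace := func(p Pod) []string { return []string{p.Namespace} }
	idx := newToyIndexer(map[string]IndexFunc{"namespace": byNamespace})
	idx.Add(Pod{"default", "a"})
	idx.Add(Pod{"default", "b"})
	idx.Add(Pod{"kube-system", "c"})
	fmt.Println(idx.ByIndex("namespace", "default"))
}
</code></pre>
<p>A lookup by namespace then touches only the matching key set instead of scanning every cached object.</p>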
<h4 id="processor">Processor</h4>
<p>The last component is the sharedProcessor. It is reached through the callback that is invoked when the DeltaFIFO is popped. Go back to <code>controller::Run</code>:</p>
<pre><code class="language-go">func (c *controller) Run(stopCh &lt;-chan struct{}) {
  ......
	var wg wait.Group
	// Reflector::Run was covered above
	wg.StartWithChannel(stopCh, r.Run)
	// Consume the data in the DeltaFIFO
	wait.Until(c.processLoop, time.Second, stopCh)
	wg.Wait()
}

func (c *controller) processLoop() {
	for {
		// pass in the callback, converted to PopProcessFunc; c.config.Process is supplied by sharedIndexInformer
		obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))
		if err != nil {
			if err == ErrFIFOClosed {
				return
			}
			// DeltaFIFO::Pop itself only requeues on ErrRequeue; here any error is retried when RetryOnError is set
			if c.config.RetryOnError {
				// safe to call: the object is not added again if its key is already queued
				c.config.Queue.AddIfNotPresent(obj)
			}
		}
	}
}

// where config.Process comes from
func (s *sharedIndexInformer) Run(stopCh &lt;-chan struct{}) {
  ......
	cfg := &amp;Config{
		Queue:            fifo,
		ListerWatcher:    s.listerWatcher,
		ObjectType:       s.objectType,
		FullResyncPeriod: s.resyncCheckPeriod,
		RetryOnError:     false,
		ShouldResync:     s.processor.shouldResync,
    // here
		Process:           s.HandleDeltas,
		WatchErrorHandler: s.watchErrorHandler,
	}
  ......
}

// the callback the informer registers for items popped from the DeltaFIFO
func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
	s.blockDeltas.Lock()
	defer s.blockDeltas.Unlock()

	// from oldest to newest
	for _, d := range obj.(Deltas) {
		switch d.Type {
		// add to or update the local cache
		case Sync, Replaced, Added, Updated:
			s.cacheMutationDetector.AddObject(d.Object)
			if old, exists, err := s.indexer.Get(d.Object); err == nil &amp;&amp; exists {
				if err := s.indexer.Update(d.Object); err != nil {
					return err
				}

				isSync := false
				switch {
				case d.Type == Sync:
					// Sync events are only propagated to listeners that requested resync
					isSync = true
				case d.Type == Replaced:
					if accessor, err := meta.Accessor(d.Object); err == nil {
						if oldAccessor, err := meta.Accessor(old); err == nil {
							// Replaced events that didn't change resourceVersion are treated as resync events
							// and only propagated to listeners that requested resync
							isSync = accessor.GetResourceVersion() == oldAccessor.GetResourceVersion()
						}
					}
				}
				// hand the notification to sharedProcessor
				s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
			} else {
				if err := s.indexer.Add(d.Object); err != nil {
					return err
				}
				s.processor.distribute(addNotification{newObj: d.Object}, false)
			}
		// remove the object from the local cache
		case Deleted:
			if err := s.indexer.Delete(d.Object); err != nil {
				return err
			}
			s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
		}
	}
	return nil
}
</code></pre>
<p><strong>sharedProcessor</strong></p>
<pre><code class="language-go">// sharedProcessor is created together with the SharedIndexInformer
func NewSharedIndexInformer(lw ListerWatcher, exampleObject runtime.Object, defaultEventHandlerResyncPeriod time.Duration, indexers Indexers) SharedIndexInformer {
	realClock := &amp;clock.RealClock{}
	sharedIndexInformer := &amp;sharedIndexInformer{
		processor:                       &amp;sharedProcessor{clock: realClock},
		indexer:                         NewIndexer(DeletionHandlingMetaNamespaceKeyFunc, indexers),
		listerWatcher:                   lw,
		objectType:                      exampleObject,
		resyncCheckPeriod:               defaultEventHandlerResyncPeriod,
		defaultEventHandlerResyncPeriod: defaultEventHandlerResyncPeriod,
		cacheMutationDetector:           NewCacheMutationDetector(fmt.Sprintf(&quot;%T&quot;, exampleObject)),
		clock:                           realClock,
	}
	return sharedIndexInformer
}

// calling SharedIndexInformer::AddEventHandler adds a listener to sharedProcessor
func NewController(
	kubeclientset kubernetes.Interface,
	sampleclientset clientset.Interface,
	deploymentInformer appsinformers.DeploymentInformer,
	fooInformer informers.FooInformer) *Controller {
  ......
  deploymentInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: controller.handleObject,
    UpdateFunc: func(old, new interface{}) {
      newDepl := new.(*appsv1.Deployment)
      oldDepl := old.(*appsv1.Deployment)
      if newDepl.ResourceVersion == oldDepl.ResourceVersion {
        // Periodic resync will send update events for all known Deployments.
        // Two different versions of the same Deployment will always have different RVs.
        return
      }
      controller.handleObject(new)
    },
    DeleteFunc: controller.handleObject,
  })
  ......
}

func (s *sharedIndexInformer) AddEventHandler(handler ResourceEventHandler) {
	s.AddEventHandlerWithResyncPeriod(handler, s.defaultEventHandlerResyncPeriod)
}

// add a listener to the processor
func (s *sharedIndexInformer) AddEventHandlerWithResyncPeriod(handler ResourceEventHandler, resyncPeriod time.Duration) {
	......
	listener := newProcessListener(handler, resyncPeriod, determineResyncPeriod(resyncPeriod, s.resyncCheckPeriod), s.clock.Now(), initialBufferSize)
	.......
	s.processor.addListener(listener)
	for _, item := range s.indexer.List() {
		listener.add(addNotification{newObj: item})
	}
}

// a newly added listener is started right away if the processor is already running
func (p *sharedProcessor) addListener(listener *processorListener) {
	p.listenersLock.Lock()
	defer p.listenersLock.Unlock()

	p.addListenerLocked(listener)
	if p.listenersStarted {
		p.wg.Start(listener.run)
		p.wg.Start(listener.pop)
	}
}

// processorListener::run waits for notifications from pop and invokes the matching handler callback for each type
func (p *processorListener) run() {
	stopCh := make(chan struct{})
	wait.Until(func() {
		for next := range p.nextCh {
			switch notification := next.(type) {
			case updateNotification:
				p.handler.OnUpdate(notification.oldObj, notification.newObj)
			case addNotification:
				p.handler.OnAdd(notification.newObj)
			case deleteNotification:
				p.handler.OnDelete(notification.oldObj)
			default:
				utilruntime.HandleError(fmt.Errorf(&quot;unrecognized notification: %T&quot;, next))
			}
		}
		// the only way to get here is if the p.nextCh is empty and closed
		close(stopCh)
	}, 1*time.Second, stopCh)
}

// processorListener::pop receives notifications and forwards them to the channel that run is waiting on
func (p *processorListener) pop() {
	defer utilruntime.HandleCrash()
	defer close(p.nextCh) // Tell .run() to stop

	var nextCh chan&lt;- interface{}
	var notification interface{}
	for {
		select {
		case nextCh &lt;- notification:
			// Notification dispatched
			var ok bool
			notification, ok = p.pendingNotifications.ReadOne()
			if !ok { // Nothing to pop
				nextCh = nil // Disable this select case
			}
		case notificationToAdd, ok := &lt;-p.addCh:
			if !ok {
				return
			}
			if notification == nil { // No notification to pop (and pendingNotifications is empty)
				// Optimize the case - skip adding to pendingNotifications
				notification = notificationToAdd
				nextCh = p.nextCh
			} else { // There is already a notification waiting to be dispatched
				p.pendingNotifications.WriteOne(notificationToAdd)
			}
		}
	}
}
</code></pre>
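<p>The trick in <code>pop</code> is the nil channel: a send on a nil channel blocks forever, so setting <code>nextCh = nil</code> switches that select case off while there is nothing to deliver. The sketch below isolates the pattern with plain integers; <code>relay</code> and its slice buffer are hypothetical simplifications (client-go uses a growable ring buffer for <code>pendingNotifications</code>).</p>

```go
package main

import "fmt"

// relay forwards values from in to out without ever blocking the sender:
// values that out is not ready for are parked in an unbounded slice.
func relay(in <-chan int, out chan<- int) {
	defer close(out)
	var pending []int   // buffered values, oldest first
	var next chan<- int // nil while there is nothing to send
	var head int        // the value currently offered on next
	for {
		select {
		case next <- head:
			// head was delivered; load the following value, if any
			if len(pending) == 0 {
				next = nil // a nil channel disables this select case
			} else {
				head, pending = pending[0], pending[1:]
			}
		case v, ok := <-in:
			if !ok {
				return // like pop, anything still buffered is dropped
			}
			if next == nil {
				head, next = v, out // nothing queued: offer v directly
			} else {
				pending = append(pending, v)
			}
		}
	}
}

func main() {
	in, out := make(chan int), make(chan int)
	go relay(in, out)
	for i := 1; i <= 3; i++ {
		in <- i // returns even though nobody is reading out yet
	}
	for i := 0; i < 3; i++ {
		fmt.Println(<-out)
	}
	close(in)
}
```

<p>The design keeps a slow event handler from blocking the informer: producers only ever wait for <code>relay</code> itself, never for the consumer.</p>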
<h2 id="workqueue">WorkQueue</h2>
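<p>WorkQueue is the controller's work queue. Compared with a plain FIFO queue, the Kubernetes WorkQueue is somewhat more involved. Its core job is marking and deduplication, and it supports the following features:</p>
<ul>
<li><strong>Ordering</strong>: items are processed in the order they were added</li>
<li><strong>Deduplication</strong>: the same item is never processed more than once at the same time; if an item is added several times before it is processed, it is processed only once</li>
<li><strong>Concurrency</strong>: multiple producers and multiple consumers</li>
<li><strong>Marking</strong>: an item can be marked as being processed, and an item can be requeued while it is being processed</li>
<li><strong>Notification</strong>: ShutDown signals the queue to stop accepting new items and tells the metrics goroutine to exit</li>
<li><strong>Delay</strong>: a delaying queue adds an item to the queue only after a given delay</li>
<li><strong>Rate limiting</strong>: a rate-limiting queue throttles items as they are added and limits how often an item can be requeued</li>
<li><strong>Metrics</strong>: monitoring metrics are supported</li>
</ul>
<p>WorkQueue exposes three queue interfaces, each layered on the previous one:</p>
<ul>
<li><strong>Interface</strong>: the FIFO queue interface, first in first out, with deduplication</li>
<li><strong>DelayingInterface</strong>: the delaying queue interface, built on Interface; an item is added to the queue after a delay</li>
<li><strong>RateLimitingInterface</strong>: the rate-limiting queue interface, built on DelayingInterface; items are rate limited as they are added</li>
</ul>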
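<p>The marking and deduplication described above come down to two sets plus a slice. The following is a minimal single-goroutine sketch under that assumption; the lowercase <code>queue</code> type and its methods are illustrative names, and the real implementation adds a mutex, a condition variable, shutdown handling and metrics.</p>

```go
package main

import "fmt"

type set map[string]struct{}

// queue is a single-goroutine sketch of the dirty/processing bookkeeping.
type queue struct {
	order      []string // keys waiting to be handed out, in arrival order
	dirty      set      // keys that still need processing
	processing set      // keys currently held by a worker
}

func newQueue() *queue {
	return &queue{dirty: set{}, processing: set{}}
}

// add marks key dirty; it is queued only if it is neither pending nor in flight.
func (q *queue) add(key string) {
	if _, ok := q.dirty[key]; ok {
		return // already pending: deduplicated
	}
	q.dirty[key] = struct{}{}
	if _, ok := q.processing[key]; ok {
		return // in flight: done() will requeue it
	}
	q.order = append(q.order, key)
}

// get hands out the oldest key and moves it from dirty to processing.
func (q *queue) get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = struct{}{}
	delete(q.dirty, key)
	return key, true
}

// done releases key and requeues it if it was re-added while in flight.
func (q *queue) done(key string) {
	delete(q.processing, key)
	if _, ok := q.dirty[key]; ok {
		q.order = append(q.order, key)
	}
}

func main() {
	q := newQueue()
	q.add("default/foo")
	q.add("default/foo") // dropped: already dirty
	key, _ := q.get()
	q.add(key) // re-added while being processed: deferred, not queued
	fmt.Println(len(q.order))
	q.done(key)
	fmt.Println(len(q.order))
}
```

<p>Because a key that is re-added during processing is only requeued by <code>done</code>, two workers can never hold the same key at once, which is what lets a controller run several workers safely.</p>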
<h3 id="源码-2">Source code</h3>
<p>When SampleController creates its Controller it also instantiates a <code>NamedRateLimitingQueue</code>, that is, a named rate-limiting queue.</p>
<pre><code class="language-go">func NewController(
	kubeclientset kubernetes.Interface,
	sampleclientset clientset.Interface,
	deploymentInformer appsinformers.DeploymentInformer,
	fooInformer informers.FooInformer) *Controller {
  ......
	controller := &amp;Controller{
    ......
		// DefaultControllerRateLimiter is the default limiter: it combines per-item exponential backoff with an overall token bucket
		workqueue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), &quot;Foos&quot;),
    ......
	}
  fooInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		// the callbacks put the object's key into the workqueue
		AddFunc: controller.enqueueFoo,
		UpdateFunc: func(old, new interface{}) {
			controller.enqueueFoo(new)
		},
	})
  ......
}

func NewNamedRateLimitingQueue(rateLimiter RateLimiter, name string) RateLimitingInterface {
	return &amp;rateLimitingType{
		// the delaying queue is instantiated at the same time
		DelayingInterface: NewNamedDelayingQueue(name),
		rateLimiter:       rateLimiter,
	}
}

func NewNamedDelayingQueue(name string) DelayingInterface {
	return NewDelayingQueueWithCustomClock(clock.RealClock{}, name)
}

func NewDelayingQueueWithCustomClock(clock clock.Clock, name string) DelayingInterface {
	// NewNamed instantiates the basic FIFO work queue
	return newDelayingQueue(clock, NewNamed(name), name)
}

func newDelayingQueue(clock clock.Clock, q Interface, name string) *delayingType {
	ret := &amp;delayingType{
		// the underlying basic queue
		Interface:       q,
		clock:           clock,
		// the heartbeat ensures we never wait longer than maxWait (const maxWait = 10 * time.Second)
		heartbeat:       clock.NewTicker(maxWait),
		stopCh:          make(chan struct{}),
		waitingForAddCh: make(chan *waitFor, 1000),
		metrics:         newRetryMetrics(name),
	}
	// handle delayed items
	go ret.waitingLoop()
	return ret
}

// That covers instantiation; next, adding items
func (c *Controller) enqueueFoo(obj interface{}) {
	var key string
	var err error
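	// obj is the resource passed in; it is converted to a key here and the key goes into the WorkQueue.
	// Why a key rather than the whole object?
	// My current thinking: objects are awkward to deduplicate and are large, and even if the object were stored,
	// the Reconciler would still need a Get to check whether the resource has been deleted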
	if key, err = cache.MetaNamespaceKeyFunc(obj); err != nil {
		utilruntime.HandleError(err)
		return
	}
	// the RateLimiting and Delaying queues both wrap the basic workqueue,
	// so this call ends up in Interface::Add()
	c.workqueue.Add(key)
}

// Now back to what NewNamed actually instantiates
func NewNamed(name string) *Type {
	rc := clock.RealClock{}
	return newQueue(
		rc,
		globalMetricsFactory.newQueueMetrics(name, rc),
		defaultUnfinishedWorkUpdatePeriod,
	)
}
func newQueue(c clock.Clock, metrics queueMetrics, updatePeriod time.Duration) *Type {
	t := &amp;Type{
		clock:                      c,
		dirty:                      set{},
		processing:                 set{},
		cond:                       sync.NewCond(&amp;sync.Mutex{}),
		metrics:                    metrics,
		unfinishedWorkUpdatePeriod: updatePeriod,
	}
	go t.updateUnfinishedWorkLoop()
	return t
}

func (q *Type) Add(item interface{}) {
	q.cond.L.Lock()
	defer q.cond.L.Unlock()
	if q.shuttingDown {
		return
	}
	// deduplicate; this is another reason not to enqueue whole objects
	if q.dirty.has(item) {
		return
	}

	q.metrics.add(item)

	q.dirty.insert(item)
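	// if the item is currently being processed, stop here: Done() will requeue it because it is marked dirty
	if q.processing.has(item) {
		return
	}

	// append the item to the queue; the queue only holds items waiting to be processed, never items in processing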
	q.queue = append(q.queue, item)
	q.cond.Signal()
}

// Now for how the workqueue is consumed.
// Controller::Run() starts as many Controller::runWorker() goroutines as the given thread count
func (c *Controller) Run(threadiness int, stopCh &lt;-chan struct{}) error {
	......
	for i := 0; i &lt; threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
  ......
}

// a long-running method that keeps consuming items from the workqueue
func (c *Controller) runWorker() {
	for c.processNextWorkItem() {
	}
}

// The handling here is deliberately minimal; kubebuilder's source is worth reading for a fuller version
func (c *Controller) processNextWorkItem() bool {
	// take an item off the queue
	obj, shutdown := c.workqueue.Get()
  ......
	err := func(obj interface{}) error {
		......
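		// Controller::syncHandler plays the role of a Reconciler
		if err := c.syncHandler(key); err != nil {
			// on failure, requeue with rate limiting so the key is retried later
			c.workqueue.AddRateLimited(key)
			return fmt.Errorf(&quot;error syncing '%s': %s, requeuing&quot;, key, err.Error())
		}
		// on success, Forget clears the rate limiter's retry history for this item
		c.workqueue.Forget(obj)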
    ......
	}(obj)
  ......
}
</code></pre>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Why We Use Kubernetes Custom Resources]]></title>
        <id>https://cnbailian.github.io/post/why-do-use-kubernetes-crd/</id>
        <link href="https://cnbailian.github.io/post/why-do-use-kubernetes-crd/">
        </link>
        <updated>2021-05-08T03:09:27.000Z</updated>
        <summary type="html"><![CDATA[<p>Kubernetes offers CRDs (CustomResourceDefinitions) as an extension mechanism for users who want to enhance Kubernetes, and the familiar Kubernetes Operator pattern is built on this mechanism.</p>
<p>This article discusses when we should use a CRD, and why.</p>
]]></summary>
        <content type="html"><![CDATA[<p>Kubernetes offers CRDs (CustomResourceDefinitions) as an extension mechanism for users who want to enhance Kubernetes, and the familiar Kubernetes Operator pattern is built on this mechanism.</p>
<p>This article discusses when we should use a CRD, and why.</p>
<!--more-->
<h2 id="我是否应该向我的-kubernetes-集群添加定制资源">Should I add a custom resource to my Kubernetes cluster?</h2>
<p>The table below, from the <a href="https://kubernetes.io/zh/docs/concepts/extend-kubernetes/api-extension/custom-resources/#%E6%88%91%E6%98%AF%E5%90%A6%E5%BA%94%E8%AF%A5%E5%90%91%E6%88%91%E7%9A%84-kubernetes-%E9%9B%86%E7%BE%A4%E6%B7%BB%E5%8A%A0%E5%AE%9A%E5%88%B6%E8%B5%84%E6%BA%90">Kubernetes documentation</a>, lists the situations in which a custom resource is a good choice. The most important criterion, and the hardest to grasp, is the notion of a <strong>declarative API</strong>.</p>
<table>
<thead>
<tr>
<th>Consider API aggregation if:</th>
<th>Prefer a stand-alone API if:</th>
</tr>
</thead>
<tbody>
<tr>
<td>Your API is <a href="https://kubernetes.io/zh/docs/concepts/extend-kubernetes/api-extension/custom-resources/#declarative-apis">declarative</a>.</td>
<td>Your API does not fit the <a href="https://kubernetes.io/zh/docs/concepts/extend-kubernetes/api-extension/custom-resources/#declarative-apis">declarative</a> model.</td>
</tr>
<tr>
<td>You want your new types to be readable and writable using <code>kubectl</code>.</td>
<td><code>kubectl</code> support is not required.</td>
</tr>
<tr>
<td>You want to view your new types in a Kubernetes UI, such as the dashboard, alongside built-in types.</td>
<td>Kubernetes UI support is not required.</td>
</tr>
<tr>
<td>You are developing a new API.</td>
<td>You already have a program that serves your API and works well.</td>
</tr>
<tr>
<td>You are willing to accept the format restrictions that Kubernetes puts on REST resource paths, such as API groups and namespaces. (See the <a href="https://kubernetes.io/zh/docs/concepts/overview/kubernetes-api/">API overview</a>.)</td>
<td>You need specific REST paths to stay compatible with an already defined REST API.</td>
</tr>
<tr>
<td>Your resources are naturally scoped to a cluster or to namespaces of a cluster.</td>
<td>Cluster-scoped or namespace-scoped resources are a poor fit; you need control over the specifics of resource paths.</td>
</tr>
<tr>
<td>You want to reuse <a href="https://kubernetes.io/zh/docs/concepts/extend-kubernetes/api-extension/custom-resources/#common-features">Kubernetes API support features</a>.</td>
<td>You do not need those features.</td>
</tr>
</tbody>
</table>
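<h2 id="声明式">Declarative</h2>
<p>Declarative refers to a software design philosophy and practice: <strong>make our actions describe rather than command</strong>.</p>
<p>Declarative is usually contrasted with imperative, and the two differ in emphasis. Imperative programming tells the tool in detail how (How) to do something in order to reach the result you want (What). Declarative programming tells the tool only the desired result (What) and lets the tool decide how to get there (How).</p>
<img src="https://tva1.sinaimg.cn/large/008i3skNly1gq8n3f5etmj30f008374t.jpg" alt="img"  />
<p>Take hailing a taxi as an everyday example. Most of the time we do not direct the driver street by street (which road to take, how many meters to go, which junction to turn at); we simply say "I want to go to XXX". That is imperative versus declarative in daily life. In programming, most of us first meet imperative languages, which makes the declarative style feel unfamiliar. The next sections use two important declarative achievements in programming to show why it matters.</p>
<h3 id="dsl">DSL</h3>
<p>DSL stands for Domain Specific Language. Its counterpart is the GPPL (General Purpose Programming Language), meaning the familiar languages such as Java, C and Go.</p>
<p>The definition of a DSL is not very precise. A workable reading is "a computer language designed specifically to solve one class of tasks". The most common DSLs include SQL, HTML and CSS.</p>
<p>Many DSLs, including these three, are declarative. When you write a SQL statement you only tell the database what result you want; the database plans the execution path that produces the result set and returns it. Fetching data with SQL is, as everyone knows, far easier than hand-writing the procedure that fetches it.</p>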
<pre><code class="language-sql">SELECT * FROM user WHERE user_name = 'Ben'
</code></pre>
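<p>The equivalent in Go:</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;
type user struct{ userName string }
func main() {
	users := []user{{&quot;Ann&quot;}, {&quot;Ben&quot;}} // stand-in for a real data source
	for _, u := range users {
		if u.userName == &quot;Ben&quot; { // comparison, not assignment
			fmt.Println(&quot;found&quot;)
			break
		}
	}
}
</code></pre>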
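<h4 id="内部-dsl">Internal DSLs</h4>
<p>SQL, HTML and CSS are external DSLs. An external DSL is a self-contained language with its own syntax, parser, lexer and so on. Its counterpart is the internal DSL, which relies on the abstraction facilities of a host language. It is closer to a label for a particular kind of API and usage pattern.</p>
<p>Examples are LINQ (C#), Ruby on Rails (Ruby) and jQuery (JavaScript). What they have in common is that they are really just a set of APIs that you can "pretend" is a DSL. This blurs the line between framework and DSL, because the two look the same, and there is little point in arguing over which is which.</p>
<p><em>In my own experience, if leaving a framework and implementing the same feature in the host language feels uncomfortable, that is probably a sign the framework has the character of an internal DSL.</em></p>
<h3 id="函数式编程">Functional programming</h3>
<p>Functional programming is the other important declarative achievement. Its style leans toward describing rather than issuing commands. The example below shows React building a UI declaratively:</p>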
<pre><code class="language-javascript">// Building the UI with the plain DOM API
const div = document.createElement('div')
const p = document.createElement('p')
p.textContent = 'hello world'
div.append(p) // append() returns undefined, so keep using div
const domUI = div

// Building the UI with React
const h = React.createElement
const reactUI = h('div', null, h('p', null, 'hello world'))
</code></pre>
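<p>React is built on JavaScript, which is not a purely functional language, and I have no experience with functional languages such as Haskell, so my understanding here is limited. Here are two articles to learn from together.</p>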
<p><a href="http://blog.zhaojie.me/2010/05/trends-and-future-directions-in-programming-languages-by-anders-3-functional-programming-and-fsharp.html">编程语言的发展趋势及未来方向（3）：函数式编程</a></p>
<p><a href="https://lutaonan.com/blog/declarative-programming-is-the-future/">未来属于声明式编程</a></p>
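<h2 id="kubernetes-声明式-api">Kubernetes declarative APIs</h2>
<p>The examples above should have made the declarative idea clear. Kubernetes declarative APIs apply exactly this approach: <strong>we describe the desired state we want something to reach, and Kubernetes works out internally how to bring its actual state there</strong>.</p>
<p>Declarative APIs follow the RESTful design style: the thing being described is abstracted as a resource, and the state of the resource object is changed through CRUD-style operations. This is the essence of REST, <strong>Representational State Transfer</strong>: a resource undergoes state transfer in some representation. In Kubernetes, the representation of a custom resource is defined by its CRD.</p>
<p><em>A representation covers both the format and the attributes. The format is defined uniformly across Kubernetes, so what we mainly configure in a CRD is the attributes: the object's configuration, that is, the properties of the desired state we want the object to reach.</em></p>
<h3 id="关于-replace-和-apply">About replace and apply</h3>
<p>With this understanding of declarative design and declarative APIs, the difference between replace and apply that Zhang Lei describes in his Geek Time (极客时间) course is easier to follow. replace is imperative: it submits a complete object that overwrites the existing one. apply updates the desired state of the resource object.</p>
<p>The example from the course makes this concrete:</p>
<pre><code class="language-yaml"># nginx.yaml: change the Nginx container image to 1.7.9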
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
</code></pre>
<pre><code class="language-shell"># 这个命令所表达的语义，是要将 nginx 资源强制替换为修改后的资源
# 明确表示了执行过程：先删除，然后重建
$ kubectl replace -f nginx.yaml

# apply only declares the new desired state of the nginx resource; how to get there is left to Kubernetes
$ kubectl apply -f nginx.yaml
</code></pre>
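<p>In practice we should avoid <code>replace -f</code> where possible, and avoid directly updating lower-level objects that are managed by a higher-level abstraction.</p>
<h3 id="声明式-api-特点">Characteristics of a declarative API</h3>
<p>We can now make better sense of what the <a href="https://kubernetes.io/zh/docs/concepts/extend-kubernetes/api-extension/custom-resources/#declarative-apis">Kubernetes documentation</a> says about declarative APIs. My own reading is added in italics:</p>
<ul>
<li>Your API consists of a relatively small number of relatively small objects (resources).
<ul>
<li><em>The point of being declarative is description. A description may be detailed, but it should describe metadata rather than store the data itself.</em></li>
</ul>
</li>
<li>The objects define configuration of applications or infrastructure.</li>
<li>The objects are updated relatively infrequently.</li>
<li>Humans often need to read and write the objects.</li>
<li>The main operations on the objects are CRUD-style (create, read, update and delete).</li>
<li>Transactions across objects are not required: the API object represents a desired state, not an exact actual state.
<ul>
<li><em>In other words, if creating a resource depends on the actual state of another resource, consider making it part of the resource it depends on.</em></li>
</ul>
</li>
</ul>
<p>It also becomes clearer what is not a declarative API:</p>
<ul>
<li>The client says "do this" and gets a synchronous response back when the operation is done.
<ul>
<li><em>A declarative API always declares a desired state, so it cannot return an immediate "processing succeeded" response. It is a poor fit when results are needed in real time.</em></li>
</ul>
</li>
<li>The client says "do this", gets an operation ID back, and has to check a separate Operation object to determine whether the request completed.
<ul>
<li><em>We have to trust that the desired state can be reached, and we should not need to run extra logic only after it has been reached. If we do, either move that logic into the declarative API or do not use one.</em></li>
</ul>
</li>
<li>You think of your API in terms of Remote Procedure Calls (RPCs).
<ul>
<li><em>This one is obvious: a procedure call emphasizes the procedure. If your API cares a great deal about how things are processed, a declarative API is not a good fit.</em></li>
</ul>
</li>
<li>Directly storing large amounts of data, for example more than a few kB per object, or more than thousands of objects.</li>
<li>High-bandwidth access is needed (tens of requests per second, sustained).</li>
<li>Storing end-user data processed by applications (such as images or personally identifiable information (PII)) or other large-scale data.</li>
<li>The natural operations on the objects are not CRUD-style.
<ul>
<li><em>With a declarative API the operations on a resource object are limited: all we can do is transfer its state, which confines us to CRUD. If an operation cannot be modeled as a change of state, it does not suit a declarative API.</em></li>
</ul>
</li>
<li>The API is not easily modeled as objects.</li>
<li>You chose to represent pending operations with an operation ID or an operation object.
<ul>
<li><em>A pending operation is one that is still unresolved. Needing to suspend an operation means you know it depends on other operations finishing first, which does not match what a declarative API asks for.</em></li>
</ul>
</li>
</ul>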
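<h2 id="控制器模式">The controller pattern</h2>
<p>As shown above, a declarative API lets us describe the desired state of a resource object. So how does Kubernetes turn the desired state into the actual state? The answer is the controller pattern, the core mechanism of Kubernetes, also called the control loop or reconcile loop.</p>
<p>The <a href="https://kubernetes.io/zh/docs/concepts/architecture/controller/">Kubernetes documentation</a> explains the control loop in detail:</p>
<blockquote>
<p>In robotics and automation, a control loop is a non-terminating loop that regulates the state of a system.</p>
<p>Here is one example of a control loop: a thermostat in a room.</p>
<p>When you set the temperature, that is telling the thermostat about your <em>desired state</em>. The actual room temperature is the <em>current state</em>. The thermostat acts to bring the current state closer to the desired state, by turning equipment on or off.</p>
</blockquote>
<p>The controller pattern is exactly such a control loop. A Kubernetes controller uses the List&amp;Watch mechanism to follow changes to the resources it cares about. A change triggers the controller's logic, which works toward what the user asked for and keeps the resource's status up to date so the user can see it. Kubernetes' own built-in resources are implemented the same way.</p>
<p>The control loop keeps the actual state consistent with the desired state. The gradual move of the actual state toward the desired state is called Reconcile, which is why the control loop is also called the reconcile loop. Because Reconcile keeps running the cycle "observe -&gt; diff -&gt; update the actual state", the system can always compare its current state with the desired state and take whatever action is needed.</p>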
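<p>The "observe -&gt; diff -&gt; act" cycle can be sketched as a tiny loop. This is an illustrative toy, not Kubernetes code: <code>cluster</code>, its fields and <code>reconcile</code> are hypothetical names, and the "infrastructure" is an in-memory slice.</p>

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for real infrastructure.
type cluster struct {
	desired int      // replicas requested by the user (spec)
	pods    []string // replicas that actually exist (status)
	nextID  int
}

// reconcile performs one observe -> diff -> act step and reports
// whether the actual state already matched the desired state.
func (c *cluster) reconcile() bool {
	diff := c.desired - len(c.pods)
	switch {
	case diff > 0: // too few replicas: create one
		c.nextID++
		c.pods = append(c.pods, fmt.Sprintf("pod-%d", c.nextID))
	case diff < 0: // too many replicas: delete one
		c.pods = c.pods[:len(c.pods)-1]
	default:
		return true // nothing to do
	}
	return false
}

func main() {
	c := &cluster{desired: 3}
	for !c.reconcile() {
	}
	fmt.Println(c.pods)

	c.pods = c.pods[:1] // simulate two pods crashing
	for !c.reconcile() {
	}
	fmt.Println(c.pods) // the loop repairs the drift without being told what happened
}
```

<p>Note that <code>reconcile</code> never receives an instruction such as "create two pods". It only compares the two states, which is why the same loop handles both the initial rollout and the later failure.</p>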
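<h4 id="期望状态与实际状态">Desired state versus actual state</h4>
<blockquote>
<p>Kubernetes takes a cloud-native view of systems and is able to handle constant change.</p>
<p>Your cluster could be changing at any point as work happens and control loops automatically fix failures. This means that, potentially, your cluster never reaches a stable state.</p>
<p>As long as the controllers for your cluster are running and able to make useful changes, it does not matter whether the overall state is stable or not.</p>
</blockquote>
<h3 id="关于控制器的实现原理">How controllers are implemented</h3>
<p>That is out of scope here. If you are interested, see my other note on controller internals and source code: <a href="/post/kubernetes-samplecontroller">Notes on Kubernetes Controller Design as Seen in the SampleController Project</a>.</p>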
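<h2 id="声明式的优点">Advantages of the declarative style</h2>
<h3 id="可读性">Readability</h3>
<p>A declarative description is usually more readable than a sequence of commands.</p>
<h4 id="dsl-2">DSL</h4>
<p>A DSL is usually closer to natural language than imperative code and easier for non-programmers to learn. Internal DSLs, too, are usually easier to read than the host-language code that implements the same thing.</p>
<h4 id="函数式编程-2">Functional programming</h4>
<p>Functional programming is also more readable, because state is immutable. You declare a value but cannot change it, so functional programs have no need for mutable variables. Discussions of functional programming read more like mathematics and formulas than like program statements.</p>
<pre><code class="language-c">x = x + 1
</code></pre>
<p>Show this line to a mathematician and they will see an equation that cannot hold. In functional form:</p>
<pre><code class="language-c">y = x + 1
</code></pre>
<p>the mathematician understands that y is the result of x + 1. It will not change: once declared, y always stands for x + 1.</p>
<h4 id="声明式-api">Declarative APIs</h4>
<p>The readability of an end-state-oriented declarative API is beyond doubt. What we care about is the final running state of an object, and we can now read it directly from the object's description instead of deriving it from a sequence of steps.</p>
<h3 id="简单">Simplicity</h3>
<p>The simpler a piece of code is, the easier it is to read, to spot errors in, and to change. That is why we encourage meaningful variable names, clear code structure, clean architecture and so on. For the same reason, the essence of a DSL is to <strong>trade generality for efficiency within one domain</strong>. A DSL's simplicity shows in its limited expressiveness. It does not need to do everything; on the contrary, it only needs to solve problems in one domain of the system. Only within that domain is the DSL useful, and there it is the recommended choice.</p>
<h3 id="幂等性">Idempotence</h3>
<p>Because we target the end state, an operation that sets the state is idempotent. Repeating it has no additional effect, so the outcome of repeated operations is stable, which makes distributed environments and concurrency much easier to deal with.</p>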
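<p>A small contrast makes the idempotence point concrete. The names below (<code>deployment</code>, <code>scaleUp</code>, <code>setReplicas</code>) are made up for illustration and are not a Kubernetes API.</p>

```go
package main

import "fmt"

type deployment struct{ replicas int }

// scaleUp is imperative: every call changes the state again.
func scaleUp(d *deployment, n int) { d.replicas += n }

// setReplicas is declarative: it states the end result,
// so repeating it is harmless.
func setReplicas(d *deployment, n int) { d.replicas = n }

func main() {
	a, b := &deployment{1}, &deployment{1}
	for i := 0; i < 3; i++ { // e.g. one request retried three times
		scaleUp(a, 2)
		setReplicas(b, 3)
	}
	fmt.Println(a.replicas, b.replicas) // 7 3
}
```

<p>Both callers wanted three replicas. Only the declarative version survives a retry, which is exactly the situation a distributed system produces all the time.</p>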
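<h3 id="可交换性">Commutativity</h3>
<p>As mentioned above, a declarative API does not need transactions across objects. Put differently, <strong>a declarative API does not need the fixed execution order that a transaction imposes</strong>. Because we always describe a desired state, when several objects cooperate there is no need to guarantee the order in which each object is created or changes state.</p>
<h3 id="关于控制器模式的优点">Advantages of the controller pattern</h3>
<p>If an API of our own design is well abstracted and looks, from the outside, just like a declarative API, why would we still use a CRD?</p>
<p>To answer that we need to think about what the controller pattern offers over an imperative execution model.</p>
<p>With one-shot command execution, a failed instruction is hard to handle. An error response usually has to be followed by logging, alerting, rollback and so on. When the caller receives the error it also has a hard time knowing what state the object is in, which makes follow-up handling difficult.</p>
<p>The controller pattern is a loop that never terminates. In it, the controller observes the object's state and keeps trying to reconcile until the actual state matches the desired state. Error handling is part of this process and the caller need not worry about it. The caller can also check the object's status field at any time to see its current state.</p>
<p><em>So in my view, once your API is declarative enough, a CRD is always the first choice.</em></p>
<h2 id="相关链接">Related links</h2>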
<p><a href="https://skyao.io/learning-cloudnative/declarative/">声明式设计</a></p>
<p><a href="http://blog.zhaojie.me/2010/04/trends-and-future-directions-in-programming-languages-by-anders-2-declarative-programming-and-dsl.html">编程语言的发展趋势及未来方向（2）：声明式编程与DSL</a></p>
<p><a href="https://www.toptal.com/software/declarative-programming">Declarative Programming: Is It A Real Thing?</a></p>
<p><a href="https://www.cnblogs.com/lisperl/archive/2011/11/21/2257360.html">浅析函数式编程与命令式编程的区别（一）计算模型的区别</a></p>
<p><a href="https://draveness.me/dsl/">谈谈 DSL 以及 DSL 的应用（以 CocoaPods 为例）</a></p>
<p><a href="https://i.cloudnative.to/oam/event">【网络研讨会】GitOps 及 OAM 的落地实践</a></p>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Notes on Go Stacks]]></title>
        <id>https://cnbailian.github.io/post/go-stack-notes/</id>
        <link href="https://cnbailian.github.io/post/go-stack-notes/">
        </link>
        <updated>2021-02-08T06:58:40.000Z</updated>
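        <summary type="html"><![CDATA[<p>This post is an ongoing record of my notes from reading the Go stack source code.</p>
]]></summary>
        <content type="html"><![CDATA[<p>This post is an ongoing record of my notes from reading the Go stack source code.</p>
<!--more-->
<h2 id="goroutine-执行栈结构">Goroutine execution stack layout</h2>
<p>A goroutine is a <code>g</code> object, and the first three fields of <code>g</code> describe its execution stack:</p>
<pre><code class="language-go">// stack describes a goroutine's execution stack, bounded by [lo, hi), with no implicit data structures on either side.
// Go stacks are managed by the runtime and are effectively allocated from the heap, so they can grow beyond ulimit -s.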
type stack struct {
	lo uintptr
	hi uintptr
}
// gobuf holds a goroutine's saved execution context
type gobuf struct {
	sp   uintptr
	pc   uintptr
	g    guintptr
	ctxt unsafe.Pointer
	ret  sys.Uintreg
	lr   uintptr
	bp   uintptr
}

type g struct {
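	// stack describes the actual stack memory: [stack.lo, stack.hi)
	stack       stack   // offset known to runtime/cgo
	// stackguard0 is the stack pointer compared in the Go stack growth prologue.
	// Because the stack grows toward lower addresses, an SP below stackguard0 triggers stack copying (or scheduling).
	// Normally stackguard0 = stack.lo + StackGuard, but it is set to StackPreempt to request preemption.
	stackguard0 uintptr // offset known to liblink
	// stackguard1 is the stack pointer compared in the C stack growth prologue.
	// On g0 and gsignal stacks it is stack.lo + StackGuard.
	// On other goroutine stacks it is ~0, to trigger a call to morestackc (and crash).
	stackguard1 uintptr // offset known to liblink
	...
	// sched holds the saved execution context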
	sched       gobuf
}
</code></pre>
<h2 id="go-调用栈帧内存布局">Memory layout of a Go call frame</h2>
<h3 id="栈帧布局">Frame layout</h3>
<p><code>runtime/stack.go</code> contains a diagram of the frame layout on x86:</p>
<pre><code class="language-go">// (x86)
// +------------------+
// | args from caller |
// +------------------+ &lt;- frame-&gt;argp
// |  return address  |
// +------------------+
// |  caller's BP (*) | (*) if framepointer_enabled &amp;&amp; varp &lt; sp
// +------------------+ &lt;- frame-&gt;varp
// |     locals       |
// +------------------+
// |  args to callee  |
// +------------------+ &lt;- frame-&gt;sp
</code></pre>
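<p><strong>On x86, a Go stack frame contains, from top (high addresses) to bottom (low addresses): the arguments passed in by the caller, the frame's return address, the caller's BP saved at call time (see the <code>FP</code> discussion above), the frame's local variables, and the arguments the frame passes to the functions it calls.</strong></p>
<h4 id="完整的栈结构图">The complete stack diagram</h4>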
<pre><code>                       -----------------                                           
                       current func arg0                                           
                       ----------------- &lt;----------- FP(pseudo FP)                
                        caller ret addr                                            
                       +---------------+ &lt;----------- is this _g_.stack.hi?
                       | caller BP(*)  |                                           
                       ----------------- &lt;----------- SP (pseudo SP; actually the BP position of the current frame)
                       |   Local Var0  |                                           
                       -----------------                                           
                       |   Local Var1  |                                           
                       -----------------                                           
                       |   Local Var2  |                                           
                       -----------------                -                          
                       |   ........    |                                           
                       -----------------                                           
                       |   Local VarN  |                                           
                       -----------------                                           
                       |               |                                           
                       |               |                                           
                       |  temporarily  |                                           
                       |  unused space |                                           
                       |               |                                           
                       |               |                                           
                       -----------------                                           
                       |  call retn    |                                           
                       -----------------                                           
                       |  call ret(n-1)|                                           
                       -----------------                                           
                       |  ..........   |                                           
                       -----------------                                           
                       |  call ret1    |                                           
                       -----------------                                           
                       |  call argn    |                                           
                       -----------------                                           
                       |   .....       |                                           
                       -----------------                                           
                       |  call arg3    |                                           
                       -----------------                                           
                       |  call arg2    |                                           
                       |---------------|                                           
                       |  call arg1    |                                           
                       -----------------   &lt;------------  hardware SP position      
                       | return addr   |                                           
                       +---------------+   &lt;----------- is this _g_.stack.lo?
                                                                                   
</code></pre>
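<h3 id="go-栈管理机制">How Go manages stacks</h3>
<p>Go manages stack space with <strong>contiguous stacks</strong>.</p>
<p>Before Go 1.3 it used <strong>segmented stacks</strong>:</p>
<p>When the stack ran out, a new block of memory was allocated, and the new segment recorded the address of the old one.</p>
<p>The problem: this design easily breaks cache locality and so hurts run-time performance. Shrinking is also too expensive: a loop that repeatedly splits, shrinks and frees a segment pays a heavy price. This is the <strong>hot split problem</strong>.</p>
<p>Starting with Go 1.3, the runtime uses <strong>contiguous stacks</strong>, also known as stack copying.</p>
<p>Stack copying creates a new stack twice the size of the old one and copies the old stack into it in full. When usage later drops back, nothing needs to be done, and when it grows again the space that was already allocated is reused. (The garbage collector may still shrink a stack that stays mostly unused.)</p>
<h4 id="栈是如何拷贝的">How a stack is copied</h4>
<p>Because Go lets you take the address of a variable on the stack, pointers into the stack will eventually exist. If the stack were copied naively, every pointer into the old stack would become invalid.</p>
<p>Go's memory-safety rules therefore guarantee that any pointer into a stack lives on that same stack.</p>
<p>During escape analysis the compiler allocates every variable that might escape on the heap. The pointers left on the stack all point to data on the stack, so the runtime can find and adjust them during the copy.</p>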
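<p>The pointer adjustment can be illustrated with plain integers standing in for addresses. This is only a sketch of the idea: <code>grow</code>, <code>adjust</code> and the <code>base</code> parameter are hypothetical, and the real runtime walks stack frames using pointer maps produced by the compiler.</p>

```go
package main

import "fmt"

// stack mirrors the [lo, hi) bounds of runtime.stack using plain integers.
type stack struct{ lo, hi uint64 }

// grow returns a stack twice as large as old. Addresses are simulated:
// base is where a hypothetical allocator placed the new block.
func grow(old stack, base uint64) stack {
	size := (old.hi - old.lo) * 2
	return stack{lo: base, hi: base + size}
}

// adjust relocates p if it points into old. Stacks grow downward, so the
// used part is copied such that its distance from hi is preserved.
func adjust(p uint64, old, bigger stack) uint64 {
	if p >= old.lo && p < old.hi {
		return p + (bigger.hi - old.hi) // unsigned wraparound keeps this correct either way
	}
	return p // heap or global pointer: left untouched
}

func main() {
	old := stack{lo: 0x1000, hi: 0x1800} // 2 KiB
	bigger := grow(old, 0x8000)          // 4 KiB at a new address
	local := uint64(0x17f0)              // 16 bytes below old.hi
	heap := uint64(0x900000)
	fmt.Printf("%#x %#x\n", adjust(local, old, bigger), adjust(heap, old, bigger))
}
```

<p>The stack-local pointer ends up 16 bytes below the new <code>hi</code>, exactly where the copied data lives, while the heap pointer is unchanged.</p>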
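<h5 id="go-没有采用-x86-64-架构函数传参优化">Go did not adopt the x86-64 register argument-passing optimization</h5>
<p>x86-64 added many general-purpose registers, and C-family languages use them to pass some arguments (up to 6) directly as an optimization. In the Go versions these notes cover (before Go 1.17), however, the compiler mandates that <strong>all function arguments are passed on the stack, never in registers</strong>. Go 1.17 later introduced a register-based calling convention on amd64.</p>
<h2 id="执行栈分配过程">How an execution stack is allocated</h2>
<h3 id="从创建-goroutine-开始">Starting from goroutine creation</h3>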
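<pre><code class="language-go">// The compiler turns a go statement into a call to runtime.newproc,
// which creates a goroutine running fn with siz bytes of arguments.
// The stack layout of this call is unusual: it assumes the arguments for fn sit on the stack immediately after &amp;fn,
// so they are logically part of newproc's argument frame.
func newproc(siz int32, fn *funcval) {
	// step one pointer past the address of fn to get the address of the first argument
	argp := add(unsafe.Pointer(&amp;fn), sys.PtrSize)
	// get the current g; the compiler rewrites getg into a load from TLS or a dedicated register
	// (is this the caller's g?)
	gp := getg()
	// get the caller's PC
	pc := getcallerpc()
	// create the new goroutine on the g0 system stack
	systemstack(func() {
		// newproc1 creates the g. It receives fn (the function entry), argp (start of the arguments), siz (argument size),
		// gp (the caller's g) and pc (the caller's PC, i.e. the address of the go statement)
		newg := newproc1(fn, argp, siz, gp, pc)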

		_p_ := getg().m.p.ptr()
		runqput(_p_, newg, true)

		if mainStarted {
			wakep()
		}
	})
}
</code></pre>
<h4 id="解析-newproc-调用前的过程">What happens before newproc is called</h4>
<p>That is: where fn lives, and how the arguments above fn are laid out.</p>
<p>The case with arguments:</p>
<pre><code class="language-go">package main

func hello(msg string) {
	println(msg)
}

func main() { // line 7
	go hello(&quot;hello world&quot;) // line 8
}
</code></pre>
<pre><code class="language-assembly">&quot;&quot;.main STEXT size=91 args=0x0 locals=0x28
	......
	0x000f 00015 (hello.go:7)	SUBQ	$40, SP // grow the frame by 40 bytes
	0x0013 00019 (hello.go:7)	MOVQ	BP, 32(SP) // save the caller's BP (the callee does the saving)
	0x0018 00024 (hello.go:7)	LEAQ	32(SP), BP // BP now points at 32(SP), the base of main's frame
	......
	0x001d 00029 (hello.go:8)	MOVL	$16, (SP) // store 16 at (SP): this is the first argument, siz. It is an int32, hence MOVL. The value is 16 because string.data and string.len together take 16 bytes
	0x0024 00036 (hello.go:8)	LEAQ	&quot;&quot;.hello·f(SB), AX // load the address of hello's funcval into AX
	0x002b 00043 (hello.go:8)	MOVQ	AX, 8(SP) // store it at 8(SP)
	0x0030 00048 (hello.go:8)	LEAQ	go.string.&quot;hello world&quot;(SB), AX // load the address of &quot;hello world&quot; into AX
	0x0037 00055 (hello.go:8)	MOVQ	AX, 16(SP) // store the string's data pointer at 16(SP)
	0x003c 00060 (hello.go:8)	MOVQ	$11, 24(SP) // store 11 at 24(SP): the string's length. A string is a struct, and structs are flattened into separate arguments when passed
	0x0045 00069 (hello.go:8)	CALL	runtime.newproc(SB) // CALL = push + jmp: it pushes the return address (0x004a, the next instruction in main) and jumps to newproc
	0x004a 00074 (hello.go:9)	MOVQ	32(SP), BP // restore the caller's BP
	0x004f 00079 (hello.go:9)	ADDQ	$40, SP // release the frame
</code></pre>
<p>The resulting (somewhat unusual) stack layout:</p>
<pre><code>           Stack layout
40(SP)+-----------------+      high addresses
      |    caller BP    |       
32(SP)+-----------------+ &lt;-- main.BP
      |  11 string.len  |
24(SP)+-----------------+ 
      | &amp;&quot;hello world&quot;  |
16(SP)+-----------------+ &lt;-- fn + sys.PtrSize
      |      hello      |
8(SP) +-----------------+ &lt;-- fn
      |       siz       |
(SP)  +-----------------+ &lt;-- SP
      | return address  |  
      +-----------------+ callerpc: the PC in main right after the CALL, pushed by the CALL itself
      |                 |
      |                 |       low addresses
</code></pre>
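<p>To make the <code>siz</code> value of 16 concrete, here is a minimal sketch (not part of the runtime) that checks how many bytes one string argument occupies. The <code>stringHeader</code> type and the <code>argBytes</code> helper are illustrative names, and the value 16 assumes a 64-bit platform.</p>
<pre><code class="language-go">package main

import (
	&quot;fmt&quot;
	&quot;unsafe&quot;
)

// stringHeader mirrors how a string value is laid out in memory:
// a data pointer followed by a length. The type name is illustrative only.
type stringHeader struct {
	data unsafe.Pointer
	len  int
}

// argBytes reports how many bytes a single string argument occupies.
func argBytes() uintptr {
	var s string
	return unsafe.Sizeof(s)
}

func main() {
	// On a 64-bit platform both values are 16: one 8-byte pointer plus one 8-byte length.
	fmt.Println(argBytes(), unsafe.Sizeof(stringHeader{}))
}
</code></pre>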
<h4 id="newproc1-调用">The newproc1 call</h4>
<pre><code class="language-go">func newproc1(fn *funcval, argp unsafe.Pointer, narg int32, callergp *g, callerpc uintptr) *g {
  // On the system stack, getg() returns g0
	_g_ := getg()
  ......
	siz := narg
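	siz = (siz + 7) &amp;^ 7 // round up to a multiple of 8 for alignment

	// The arguments may not exceed 2048-4*8-8 bytes (on 64-bit). A larger stack could be allocated, but it is not worth it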
	if siz &gt;= _StackMin-4*sys.RegSize-sys.RegSize {
		throw(&quot;newproc: function arguments too large for new goroutine&quot;)
	}

	_p_ := _g_.m.p.ptr()
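  // Try to reuse a g whose goroutine has already finished
	newg := gfget(_p_)
	if newg == nil {
    // Allocate a new g together with a stack of stacksize bytes.
    // A new goroutine always starts with the minimum stack, _StackMin (2KB)
		newg = malg(_StackMin)
		casgstatus(newg, _Gidle, _Gdead)
		allgadd(newg) // publish newg in allg with status _Gdead, so the GC scanner does not look at its uninitialized stack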
	}
  ......
}

func malg(stacksize int32) *g {
	newg := new(g)
	if stacksize &gt;= 0 {
    // Some operating systems need extra stack space (_StackSystem).
    // Rounding up to a power of two cancels out the effect of _StackSystem on the stack size
		stacksize = round2(_StackSystem + stacksize)
		systemstack(func() {
			newg.stack = stackalloc(uint32(stacksize))
		})
    // Set up the stack guards
		newg.stackguard0 = newg.stack.lo + _StackGuard
		newg.stackguard1 = ^uintptr(0)
    ......
	}
	return newg
}
</code></pre>
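<p>The two rounding steps above are easy to check in isolation. The sketch below reimplements them outside the runtime: <code>alignUp8</code> is a name chosen here for the <code>(siz + 7) &amp;^ 7</code> expression, and <code>round2</code> follows the behaviour of the runtime helper of the same name (smallest power of two that is not less than the input).</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// alignUp8 rounds siz up to the next multiple of 8, like (siz + 7) &amp;^ 7 in newproc1.
func alignUp8(siz int32) int32 {
	return (siz + 7) &amp;^ 7
}

// round2 returns the smallest power of two that is &gt;= x,
// which is what malg applies to the requested stack size.
func round2(x int32) int32 {
	s := uint(0)
	for int32(1)&lt;&lt;s &lt; x {
		s++
	}
	return 1 &lt;&lt; s
}

func main() {
	fmt.Println(alignUp8(16), alignUp8(17)) // 16 24
	fmt.Println(round2(2048), round2(2049)) // 2048 4096
}
</code></pre>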
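<h3 id="执行栈的分配">Allocating the execution stack</h3>
<p>Prerequisite reading: [[Go 内存分配器]]</p>
<p>A stack can be allocated from two different places, depending on whether it is small or large. Small stacks are 2K/4K/8K/16K in size; anything bigger is a large stack. <code>stackalloc</code> mostly decides where the execution stack should come from, and returns the low and high addresses of the stack it allocated.</p>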
<pre><code class="language-go">func stackalloc(n uint32) stack {
  // g0
	thisg := getg()
  ......

	// Small stacks are served by fixed-size free-list allocators.
	// Large stacks get a dedicated span
	var v unsafe.Pointer
	if n &lt; _FixedStack&lt;&lt;_NumStackOrders &amp;&amp; n &lt; _StackCacheSize {
    // small-stack allocation
	} else {
    // large-stack allocation
	}
  ......
	return stack{uintptr(v), uintptr(v) + uintptr(n)}
}
</code></pre>
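<p>The size check at the top of <code>stackalloc</code> can be reproduced with a few constants. The sketch below is a standalone model, not runtime code: the constant values are assumptions for linux/amd64 in the Go version discussed here, and <code>isSmallStack</code> and <code>stackOrder</code> are names chosen for this example. <code>stackOrder</code> follows the same loop the small-stack path uses to pick a size class.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// Assumed values for linux/amd64; other platforms and versions may differ.
const (
	fixedStack     = 2048  // _FixedStack
	numStackOrders = 4     // _NumStackOrders
	stackCacheSize = 32768 // _StackCacheSize
)

// isSmallStack mirrors the condition that selects the small-stack path.
func isSmallStack(n uint32) bool {
	return n &lt; fixedStack&lt;&lt;numStackOrders &amp;&amp; n &lt; stackCacheSize
}

// stackOrder returns the size-class index: 0 for 2KB, 1 for 4KB, and so on.
func stackOrder(n uint32) uint8 {
	order := uint8(0)
	for n2 := n; n2 &gt; fixedStack; n2 &gt;&gt;= 1 {
		order++
	}
	return order
}

func main() {
	for _, n := range []uint32{2048, 4096, 8192, 16384, 32768} {
		fmt.Println(n, isSmallStack(n), stackOrder(n))
	}
}
</code></pre>
<p>With these values, 2KB to 16KB stacks take the small path with orders 0 to 3, and a 32KB stack is already a large stack.</p>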
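<h4 id="小栈">Small stacks</h4>
<p>A small stack is allocated either from the global <code>stackpool</code> or from the per-P <code>stackcache</code>. The global pool is used when the M has no P at the time of the allocation (which can happen deep inside <code>exitsyscall</code> or <code>procresize</code>), or when preemption is disabled (<code>thisg.m.preemptoff != &quot;&quot;</code>, for example during GC, when the stack cache may be flushed concurrently).</p>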
<pre><code class="language-go">order := uint8(0)
n2 := n
for n2 &gt; _FixedStack {
  order++
  n2 &gt;&gt;= 1
}
var x gclinkptr
// Decide whether the stack must come from the global pool (stackpool)
if stackNoCache != 0 || thisg.m.p == 0 || thisg.m.preemptoff != &quot;&quot; {
  lock(&amp;stackpool[order].item.mu)
  x = stackpoolalloc(order)
  unlock(&amp;stackpool[order].item.mu)
} else {
  // Otherwise allocate from mcache.stackcache
  c := thisg.m.p.ptr().mcache
  x = c.stackcache[order].list
  if x.ptr() == nil { // the cache is empty: refill it and retry
    stackcacherefill(c, order)
    x = c.stackcache[order].list
  }
  c.stackcache[order].list = x.ptr().next
  c.stackcache[order].size -= uintptr(n)
}
v = unsafe.Pointer(x)

// Refill mcache.stackcache
func stackcacherefill(c *mcache, order uint8) {
  ......
	var list gclinkptr
	var size uintptr
	lock(&amp;stackpool[order].item.mu)
  // Grab some stacks from the global pool (stackpool).
	// Take half of the allowed capacity to prevent thrashing
	for size &lt; _StackCacheSize/2 {
		x := stackpoolalloc(order)
		x.ptr().next = list
		list = x
		size += _FixedStack &lt;&lt; order
	}
	unlock(&amp;stackpool[order].item.mu)
	c.stackcache[order].list = list
	c.stackcache[order].size = size
}

// Allocate a stack from the free pool. Must be called with stackpool[order].item.mu held
func stackpoolalloc(order uint8) gclinkptr {
	list := &amp;stackpool[order].item.span // an mSpanList stores the head and tail of a list of mspans
	s := list.first // head of the list
	lockWithRankMayAcquire(&amp;mheap_.lock, lockRankMheap)
  // No span with free stacks is cached
	if s == nil {
    // Allocate from mheap, 32KB at a time, i.e. 4 pages ((32*1024) &gt;&gt; 13)
		s = mheap_.allocManual(_StackCacheSize&gt;&gt;_PageShift, &amp;memstats.stacks_inuse)
    ......
    // OpenBSD 6.4+ has special requirements for stack memory, so stack memory taken from the heap needs some extra handling after allocation
		osStackAlloc(s)
		s.elemsize = _FixedStack &lt;&lt; order
		for i := uintptr(0); i &lt; _StackCacheSize; i += s.elemsize {
      // gclinkptr is also a pointer type;
      // its purpose is to hide the pointer from GC scanning
			x := gclinkptr(s.base() + i)
      // Insert at the head of the list
			x.ptr().next = s.manualFreeList
			s.manualFreeList = x
		}
		list.insert(s)
	}
	x := s.manualFreeList
	if x.ptr() == nil {
		throw(&quot;span has no free stacks&quot;)
	}
	s.manualFreeList = x.ptr().next
	s.allocCount++
	if s.manualFreeList.ptr() == nil {
		// Every stack in this span has been handed out: remove node s
		list.remove(s)
	}
	return x
}
</code></pre>
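<p>The head-insertion free list built in <code>stackpoolalloc</code> can be modelled without any runtime types. The following is a toy sketch under that simplification: <code>span</code>, <code>newSpan</code>, and <code>alloc</code> are names invented for this example, and offsets stand in for real addresses.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// span is a toy stand-in for an mspan dedicated to stacks. It only tracks
// the offsets of the free fixed-size stacks carved out of it.
type span struct {
	elemSize uintptr
	free     []uintptr // used as a LIFO free list, like manualFreeList
}

// newSpan carves spanSize bytes into elemSize chunks and pushes each chunk
// onto the free list in address order, so the last chunk ends up at the head.
func newSpan(spanSize, elemSize uintptr) *span {
	s := &amp;span{elemSize: elemSize}
	for off := uintptr(0); off &lt; spanSize; off += elemSize {
		s.free = append(s.free, off)
	}
	return s
}

// alloc pops the chunk at the head of the list.
// ok is false when the span has no free stacks left.
func (s *span) alloc() (off uintptr, ok bool) {
	if len(s.free) == 0 {
		return 0, false
	}
	off = s.free[len(s.free)-1]
	s.free = s.free[:len(s.free)-1]
	return off, true
}

func main() {
	s := newSpan(32768, 2048) // one 32KB span holds sixteen 2KB stacks
	total := len(s.free)
	first, _ := s.alloc()
	fmt.Println(total, first) // 16 30720
}
</code></pre>
<p>Because chunks are pushed in address order, the first stack handed out is the one at the highest offset, which matches the head-insertion loop in the runtime code above.</p>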
<h4 id="大栈">Large stacks</h4>
<p>Large stacks are allocated from <code>stackLarge</code>:</p>
<pre><code class="language-go">var s *mspan
npage := uintptr(n) &gt;&gt; _PageShift
log2npage := stacklog2(npage)

// Try to get a stack from the stackLarge cache.
lock(&amp;stackLarge.lock)
if !stackLarge.free[log2npage].isEmpty() {
  s = stackLarge.free[log2npage].first
  stackLarge.free[log2npage].remove(s)
}
unlock(&amp;stackLarge.lock)

lockWithRankMayAcquire(&amp;mheap_.lock, lockRankMheap)

if s == nil {
  // Nothing usable in the cache: allocate a new stack from the heap
  s = mheap_.allocManual(npage, &amp;memstats.stacks_inuse)
  if s == nil {
    throw(&quot;out of memory&quot;)
  }
  osStackAlloc(s)
  s.elemsize = uintptr(n)
}
v = unsafe.Pointer(s.base())
</code></pre>
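<p>The index into <code>stackLarge.free</code> is the base-2 logarithm of the page count. A small standalone sketch follows; <code>pageShift = 13</code> (8KB pages) is an assumption taken from the <code>&gt;&gt; 13</code> shown earlier, and <code>largeStackBucket</code> is a name made up for this example.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

const pageShift = 13 // assumed _PageShift: 8KB pages

// stacklog2 returns floor(log2(n)), like the runtime helper of the same name.
func stacklog2(n uintptr) int {
	log2 := 0
	for n &gt; 1 {
		n &gt;&gt;= 1
		log2++
	}
	return log2
}

// largeStackBucket returns the stackLarge.free index for an n-byte stack.
func largeStackBucket(n uintptr) int {
	return stacklog2(n &gt;&gt; pageShift)
}

func main() {
	// 32KB = 4 pages, 64KB = 8 pages, 1MB = 128 pages
	fmt.Println(largeStackBucket(32768), largeStackBucket(65536), largeStackBucket(1&lt;&lt;20)) // 2 3 7
}
</code></pre>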
<h4 id="堆上分配">Allocating from the heap</h4>
<p>Whether the stack is small or large, the underlying memory is ultimately obtained from <code>mheap</code> through the <code>allocManual</code> method:</p>
<pre><code class="language-go">func (h *mheap) allocManual(npages uintptr, stat *uint64) *mspan {
	return h.allocSpan(npages, true, 0, stat)
}
</code></pre>
<p>[[Go 内存分配器]]</p>
<p><strong>Summary</strong></p>
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/008eGmZEly1gmv4h8v5k2j30v10l4wgo.jpg" alt="summary" loading="lazy"></figure>
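<h2 id="栈管理">Stack management</h2>
<p>Stack management changed several times in the early releases:</p>
<ul>
<li>v1.0 ~ v1.1: the minimum stack size was 4KB;</li>
<li>v1.2: the minimum stack size was raised to 8KB;</li>
<li>v1.3: <strong>contiguous stacks</strong> replaced the segmented stacks of earlier versions;</li>
<li>v1.4: the minimum stack size was lowered to 2KB;</li>
</ul>
<p>The initial stack size of a goroutine was changed repeatedly in those early versions. Raising it from 4KB to 8KB was a stopgap meant to reduce the performance cost of stack splits under segmented stacks. Once contiguous stacks had been introduced in v1.3, the initial stack size could be lowered to 2KB (in v1.4), which further reduced the memory a goroutine occupies.</p>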
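<h3 id="分段栈">Segmented stacks</h3>
<p>At compile time Go adds a stack-space check to the entry of every Go function. If the stack is used up, the check calls <code>morestack</code>. <code>morestack</code> allocates a new block of memory to use as stack space, then writes bookkeeping data about the stack, including the address of the previous segment, into a struct at the bottom of the new segment. It then resumes the goroutine by retrying the function that ran out of stack. This is a &quot;stack split&quot;.</p>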
<pre><code>  +---------------+
  |               |  &lt;---+ new segment
  |   unused      |
  |   stack       |
  |   space       |
  +---------------+
  |    Foobar     |
  |               |
  +---------------+
  |               |
  |  lessstack    |
  +---------------+
  | Stack info    |
  |               |-----+
  +---------------+     |
                        |
                        |
  +---------------+     |
  |    Foobar     |     |
  |               | &lt;---+
  +---------------+
  | rest of stack | &lt;---+ old segment
  |               |
</code></pre>
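<p>How a segmented stack unwinds: a <code>lessstack</code> frame is inserted at the bottom of the new segment. When the function that caused the split returns, it returns into <code>lessstack</code>, which looks up the struct at the bottom of the segment and adjusts the stack pointer (rsp) so that execution continues on the previous segment. The new segment can then be freed.</p>
<p>Segmented stacks have flaws. The segments are not contiguous with each other, which easily hurts cache <strong>locality</strong> and lowers runtime performance.</p>
<p>In addition, <strong>shrinking the stack is a relatively expensive operation</strong>, and the cost is most visible when a split happens inside a loop. The function grows the stack, splits it, returns, and frees the segment. If one iteration goes through all of this and the next iteration runs out of stack again, a segment is allocated and freed again, over and over, and the overhead becomes very large.</p>
<p>This is the well-known <strong><code>hot split problem</code></strong>. It is the main reason the Go team switched to a new way of managing stacks, called <strong>stack copying</strong>.</p>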
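<h3 id="连续栈栈拷贝">Contiguous stacks (stack copying)</h3>
<p>Stack copying starts out much like segmented stacks. The goroutine runs and uses stack space, and when the stack is about to run out, the same overflow check fires. But instead of keeping a link back to the previous segment, <strong>stack copying allocates a new stack twice the size of the old one and copies the whole old stack into it.</strong></p>
<p>Stack copying is not as simple as it sounds. Go allows taking the address of a variable on the stack, so pointers into the stack will exist. If the stack were simply copied and moved, every pointer into the old stack would become invalid.</p>
<p>Go's memory-safety rules therefore guarantee that any pointer that can point into a stack must itself live on a stack. This lets stack copying rely on the garbage collector: the collector already has to know which words are pointers, so the runtime can tell which parts of a stack hold pointers. When a stack is copied, every pointer into the old stack is updated to point at the new location.</p>
<p>One special case: when this design was introduced, many core scheduler functions and the core of the GC in <code>runtime</code> were written in C (the runtime has been written in Go since v1.5). No pointer information was available for those functions, so their frames could not be copied. Such code runs on a special stack (g0) whose size is fixed by the <code>runtime</code> developers.</p>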
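<h4 id="汇编中的连续栈">Contiguous stacks in assembly</h4>
<p>At the machine level, many operations common to all functions are factored into fixed code sequences that the compiler inserts before and after the function body. The sequence inserted before the body is called the <code>prolog</code> (prologue), and there is normally exactly one. The sequence inserted after the body is called the <code>epilog</code> (epilogue), and there can be several.</p>
<p><strong>Go uses this <code>prolog + epilog</code> mechanism to implement the overflow check and the copying for contiguous stacks.</strong></p>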
<pre><code class="language-assembly">&quot;&quot;.main STEXT size=105 args=0x0 locals=0x20
	0x0000 00000 (main.go:23)	TEXT	&quot;&quot;.main(SB), ABIInternal, $32-0
	0x0000 00000 (main.go:23)	MOVQ	(TLS), CX
	0x0009 00009 (main.go:23)	CMPQ	SP, 16(CX)
	0x000d 00013 (main.go:23)	JLS	98
	// main func body
	0x0062 00098 (main.go:26)	NOP
	0x0062 00098 (main.go:23)	PCDATA	$1, $-1
	0x0062 00098 (main.go:23)	PCDATA	$0, $-1
	0x0062 00098 (main.go:23)	CALL	runtime.morestack_noctxt(SB)
	0x0067 00103 (main.go:23)	JMP	0
</code></pre>
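<h4 id="栈溢出检测实现">How the stack overflow check works</h4>
<p>TLS (thread-local storage) is a pseudo-register that holds the location of the current <code>g</code> struct. It can only be loaded into another register (presumably because it is really a memory location rather than a register). After <code>MOVQ (TLS), CX</code>, <code>16(CX)</code> refers to <code>g.stackguard0</code>. As the source above shows, <code>g.stackguard0</code> is set to <code>g.stack.lo + _StackGuard</code>, which reserves <code>_StackGuard</code> bytes at the low end of the stack, the end the stack grows toward. Every function that is not marked <code>nosplit</code> gets a check in its compiled code that compares SP with <code>g.stackguard0</code>.</p>
<p>This means that <strong>a stack overflow is detected before the function body runs at all, not when some statement inside the function executes.</strong></p>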
<figure data-type="image" tabindex="2"><img src="https://tva1.sinaimg.cn/large/008eGmZEly1gmwi88j939j30cu0c0dg4.jpg" alt="StackGuard" loading="lazy"></figure>
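<h2 id="执行栈的伸缩">Growing and shrinking the execution stack</h2>
<h3 id="栈的扩张">Stack growth</h3>
<p>When the overflow check fails, control jumps to a function implemented in assembly that grows the stack. If the function does not need the <code>g.sched.ctxt</code> field, <code>runtime.morestack_noctxt</code> is called; otherwise the call is compiled as a direct call to <code>runtime.morestack</code>.</p>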
<pre><code class="language-assembly">TEXT runtime·morestack_noctxt(SB),NOSPLIT,$0
	MOVL	$0, DX // DX carries the value saved into g.sched.ctxt; zero means there is no context to save.
	JMP	runtime·morestack(SB)

TEXT runtime·morestack(SB),NOSPLIT,$0-0
	// Check whether the stack being grown is the g0 stack, which cannot be grown
	get_tls(CX)
	MOVQ	g(CX), BX
	MOVQ	g_m(BX), BX
	MOVQ	m_g0(BX), SI
	CMPQ	g(CX), SI
	JNE	3(PC)
	CALL	runtime·badmorestackg0(SB)
	CALL	runtime·abort(SB)

	// The signal stack (gsignal stack) cannot be grown either
	MOVQ	m_gsignal(BX), SI
	CMPQ	g(CX), SI
	JNE	3(PC)
	CALL	runtime·badmorestackgsignal(SB)
	CALL	runtime·abort(SB)

	// Called from f.
	// Set m-&gt;morebuf to f's caller
	NOP	SP	// tell vet SP changed - stop checking offsets
	MOVQ	8(SP), AX	// f's caller's PC
	MOVQ	AX, (m_morebuf+gobuf_pc)(BX)
	LEAQ	16(SP), AX	// f's caller's SP
	MOVQ	AX, (m_morebuf+gobuf_sp)(BX)
	get_tls(CX)
	MOVQ	g(CX), SI
	MOVQ	SI, (m_morebuf+gobuf_g)(BX)

	// Set the saved execution context (g.sched) to f
	MOVQ	0(SP), AX // f's PC
	MOVQ	AX, (g_sched+gobuf_pc)(SI)
	MOVQ	SI, (g_sched+gobuf_g)(SI)
	LEAQ	8(SP), AX // f's SP
	MOVQ	AX, (g_sched+gobuf_sp)(SI)
	MOVQ	BP, (g_sched+gobuf_bp)(SI)
	MOVQ	DX, (g_sched+gobuf_ctxt)(SI)

	// Switch to the g0 stack and call newstack
	MOVQ	m_g0(BX), BX
	MOVQ	BX, g(CX)
	MOVQ	(g_sched+gobuf_sp)(BX), SP
	CALL	runtime·newstack(SB)
	CALL	runtime·abort(SB)	// crash if newstack returns
	RET
</code></pre>
<p>The first half of <code>newstack</code> handles preemption of the goroutine (see [[Go 协作与抢占]]); the second half performs the actual growth.</p>
<pre><code class="language-go">func newstack() {
  // g0
	thisg := getg()
  ......

  // Find the g that was running
	gp := thisg.m.curg

  ......

	morebuf := thisg.m.morebuf
	thisg.m.morebuf.pc = 0
	thisg.m.morebuf.lr = 0
	thisg.m.morebuf.sp = 0
	thisg.m.morebuf.g = 0

	......
	sp := gp.sched.sp
	if sys.ArchFamily == sys.AMD64 || sys.ArchFamily == sys.I386 || sys.ArchFamily == sys.WASM {
		// The call to morestack cost one word: the CALL instruction pushed a return address
		sp -= sys.PtrSize
	}

	// Allocate a bigger (2x) stack and move to it
	oldsize := gp.stack.hi - gp.stack.lo
	newsize := oldsize * 2

  // The required stack is too large: report a stack overflow
	if newsize &gt; maxstacksize {
		print(&quot;runtime: goroutine stack exceeds &quot;, maxstacksize, &quot;-byte limit\n&quot;)
		print(&quot;runtime: sp=&quot;, hex(sp), &quot; stack=[&quot;, hex(gp.stack.lo), &quot;, &quot;, hex(gp.stack.hi), &quot;]\n&quot;)
		throw(&quot;stack overflow&quot;)
	}

  // A goroutine must be running in order to call newstack, so its status must be Grunning (or Gscanrunning).
  // Switch it to Gcopystack
	casgstatus(gp, _Grunning, _Gcopystack)

	// Because the goroutine is in Gcopystack, the concurrent GC will not interfere while the stack is being copied.
	copystack(gp, newsize)
  ......
  // Resume execution
	casgstatus(gp, _Gcopystack, _Grunning)
	gogo(&amp;gp.sched)
}
</code></pre>
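<h3 id="栈的拷贝">Copying the stack</h3>
<p>The hard part of copying a stack is that variables on a Go stack can have their addresses taken. If a pointer into the old stack were copied verbatim, it would be invalid afterwards. Go's strategy is therefore that <strong>only pointers that live on a stack may point to stack addresses; otherwise the pointed-to object is allocated on the heap instead (it escapes).</strong></p>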
<pre><code class="language-go">func copystack(gp *g, newsize uintptr) {
  // the old stack
	old := gp.stack
	used := old.hi - gp.sched.sp

	// allocate the new stack
	new := stackalloc(uint32(newsize))

	// compute the adjustment delta
	var adjinfo adjustinfo
	adjinfo.old = old
	adjinfo.delta = new.hi - old.hi

	// adjust sudogs, synchronizing with channel operations if necessary
	ncopy := used
	if !gp.activeStackChans {
		adjustsudogs(gp, &amp;adjinfo)
	} else {
		adjinfo.sghi = findsghi(gp, old)
		ncopy -= syncadjustsudogs(gp, used, &amp;adjinfo)
	}

	// copy the stack contents
	memmove(unsafe.Pointer(new.hi-ncopy), unsafe.Pointer(old.hi-ncopy), ncopy)

	// swap in the new stack
	gp.stack = new
	gp.stackguard0 = new.lo + _StackGuard // NOTE: this might clobber a preemption request
	gp.sched.sp = new.hi - used
	gp.stktopsp += adjinfo.delta

	// free the old stack
	if stackPoisonCopy != 0 {
		fillstack(old, 0xfc)
	}
	stackfree(old)
}
</code></pre>
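<p>The role of <code>adjinfo.delta</code> is easier to see with numbers. The sketch below is a toy model of the pointer adjustment, not the runtime's implementation: the <code>stack</code> type and <code>adjustPointer</code> are names chosen for this example, and the addresses are made up.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// stack is a toy version of the runtime's stack bounds: [lo, hi).
type stack struct{ lo, hi uintptr }

// adjustPointer rebases p by delta when it points into the old stack and
// leaves every other value (for example a heap address) untouched.
func adjustPointer(p uintptr, old stack, delta uintptr) uintptr {
	if old.lo &lt;= p &amp;&amp; p &lt; old.hi {
		return p + delta
	}
	return p
}

func main() {
	old := stack{lo: 0x1000, hi: 0x1800}   // 2KB stack
	grown := stack{lo: 0x4000, hi: 0x5000} // 4KB replacement
	delta := grown.hi - old.hi             // same formula as adjinfo.delta

	// A pointer 16 bytes below old.hi ends up 16 bytes below grown.hi.
	fmt.Printf(&quot;%#x\n&quot;, adjustPointer(0x17f0, old, delta)) // 0x4ff0
	// A value outside the old stack is not changed.
	fmt.Printf(&quot;%#x\n&quot;, adjustPointer(0x9000, old, delta)) // 0x9000
}
</code></pre>
<p>Stacks grow downward, so the used part sits next to <code>hi</code>. Measuring the delta between the two <code>hi</code> values keeps every frame at the same distance from the top of its stack.</p>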
<h3 id="栈的收缩">Stack shrinking</h3>
<p>Stacks are shrunk during GC:</p>
<pre><code class="language-go">func scanstack(gp *g, gcw *gcWork) {
  ......
	switch readgstatus(gp) &amp;^ _Gscan {
	default:
		print(&quot;runtime: gp=&quot;, gp, &quot;, goid=&quot;, gp.goid, &quot;, gp-&gt;atomicstatus=&quot;, readgstatus(gp), &quot;\n&quot;)
		throw(&quot;mark - bad status&quot;)
	case _Gdead:
		return
	case _Grunning:
		print(&quot;runtime: gp=&quot;, gp, &quot;, goid=&quot;, gp.goid, &quot;, gp-&gt;atomicstatus=&quot;, readgstatus(gp), &quot;\n&quot;)
		throw(&quot;scanstack: goroutine not stopped&quot;)
	case _Grunnable, _Gsyscall, _Gwaiting:
		// shrinking is only possible in these three states
	}
  ......
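  // Check whether the stack can be shrunk safely. During a system call, for example, it cannot, because there may be pointers into the stack.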
	if isShrinkStackSafe(gp) {
		// Shrink the stack if not much of it is being used.
		shrinkstack(gp)
	} else {
		// Otherwise, shrink the stack at the next sync safe point.
		gp.preemptShrink = true
	}
  ......
}

func shrinkstack(gp *g) {
	oldsize := gp.stack.hi - gp.stack.lo
	newsize := oldsize / 2 // shrink by half
	// but never below the minimum stack size
	if newsize &lt; _FixedStack {
		return
	}
	// Only shrink when less than a quarter of the stack is in use
	avail := gp.stack.hi - gp.stack.lo
	if used := gp.stack.hi - gp.sched.sp + _StackLimit; used &gt;= avail/4 {
		return
	}

	copystack(gp, newsize)
}
</code></pre>
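<p>The shrink decision boils down to two comparisons. Here is a standalone sketch of that rule; <code>shouldShrink</code> is a name invented for this example, and both constants are assumptions (the value used for <code>_StackLimit</code> in particular depends on the Go version and platform).</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// Assumed values; the real ones depend on the Go version and platform.
const (
	fixedStack = 2048 // _FixedStack: the minimum stack size
	stackLimit = 800  // _StackLimit: extra headroom counted as used
)

// shouldShrink reports whether a stack of the given size, with used bytes
// in use, would be halved under the rule in shrinkstack.
func shouldShrink(size, used uintptr) bool {
	if size/2 &lt; fixedStack {
		return false // already at the minimum size
	}
	return used+stackLimit &lt; size/4
}

func main() {
	fmt.Println(shouldShrink(8192, 100))  // true: 900 counted bytes is under the 2048-byte quarter
	fmt.Println(shouldShrink(8192, 2000)) // false: more than a quarter is in use
	fmt.Println(shouldShrink(2048, 0))    // false: cannot go below the minimum
}
</code></pre>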
<h2 id="goroutine-执行现场">Goroutine execution context</h2>
<p>The <code>morestack</code> code above already shows some handling of the <code>g.sched(gobuf)</code> fields:</p>
<pre><code class="language-assembly">	// Set the saved execution context (g.sched) to f
	MOVQ	0(SP), AX // f's PC
	MOVQ	AX, (g_sched+gobuf_pc)(SI)
	MOVQ	SI, (g_sched+gobuf_g)(SI)
	LEAQ	8(SP), AX // f's SP
	MOVQ	AX, (g_sched+gobuf_sp)(SI)
	MOVQ	BP, (g_sched+gobuf_bp)(SI)
	MOVQ	DX, (g_sched+gobuf_ctxt)(SI)
</code></pre>
<p>And <code>gogo</code> must be given a <code>gobuf</code> when it is called:</p>
<pre><code class="language-go">func newstack() {
	gogo(&amp;gp.sched)
}
</code></pre>
<p>The rest is also covered in [[Go runtime]].</p>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Kubernetes networking notes]]></title>
        <id>https://cnbailian.github.io/post/kubernetes-network-notes/</id>
        <link href="https://cnbailian.github.io/post/kubernetes-network-notes/">
        </link>
        <updated>2021-02-07T11:00:00.000Z</updated>
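        <summary type="html"><![CDATA[<p>The cluster network is a core part of Kubernetes. Kubernetes does not implement Pod-to-Pod communication itself; it delegates that to external components. What Kubernetes requires of this part of the network model is that a Pod on one node can communicate with Pods on any other node without NAT. That calls for a cross-host container network.</p>
<p>The first half of these notes covers VXLAN. VXLAN stands for <code>Virtual eXtensible Local Area Network</code>. It is an overlay technology that builds a layer-2 network on top of a layer-3 network. The second half builds an example cross-host container network by hand, based on a reading of the <a href="https://github.com/coreos/flannel">Flannel</a> source code.</p>
<p>The vxlan material in these notes was learned from two blog posts by <a href="https://cizixs.com/about/">cizixs</a>: one <a href="https://cizixs.com/2017/09/25/vxlan-protocol-introduction/">introducing the protocol</a> and one <a href="https://cizixs.com/2017/09/28/linux-vxlan/">putting it into practice</a>. Both are detailed and approachable, so readers may prefer to read the originals for the vxlan part.</p>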
]]></summary>
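        <content type="html"><![CDATA[<p>The cluster network is a core part of Kubernetes. Kubernetes does not implement Pod-to-Pod communication itself; it delegates that to external components. What Kubernetes requires of this part of the network model is that a Pod on one node can communicate with Pods on any other node without NAT. That calls for a cross-host container network.</p>
<p>The first half of these notes covers VXLAN. VXLAN stands for <code>Virtual eXtensible Local Area Network</code>. It is an overlay technology that builds a layer-2 network on top of a layer-3 network. The second half builds an example cross-host container network by hand, based on a reading of the <a href="https://github.com/coreos/flannel">Flannel</a> source code.</p>
<p>The vxlan material in these notes was learned from two blog posts by <a href="https://cizixs.com/about/">cizixs</a>: one <a href="https://cizixs.com/2017/09/25/vxlan-protocol-introduction/">introducing the protocol</a> and one <a href="https://cizixs.com/2017/09/28/linux-vxlan/">putting it into practice</a>. Both are detailed and approachable, so readers may prefer to read the originals for the vxlan part.</p>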
<!--more-->
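<h2 id="vxlan-协议原理">How the VXLAN protocol works</h2>
<p>As mentioned above, vxlan is an overlay technology. An overlay network is a virtual network built on top of an existing physical network (the underlay). It has its own control and forwarding planes, and to the devices attached to the overlay the physical network is transparent.</p>
<p>So which problems do overlay networks such as vxlan solve?</p>
<ul>
<li>Traditional VLAN cannot scale to the size of a virtualized data center: VLAN supports at most 4096 networks.</li>
<li>Data centers need to offer multi-tenancy, where different tenants allocate IP and MAC addresses independently.</li>
<li>Cloud workloads need flexibility: virtual machines may migrate at scale, and the network has to stay available throughout.</li>
</ul>
<p>vxlan works by having VTEP devices encapsulate the packets a server sends and decapsulate the packets it receives. Tunnel networks like vxlan therefore have little impact on the existing network architecture: the original network needs no changes, and a new network is layered on top of it.</p>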
<h3 id="vxlan-模型">The VXLAN model</h3>
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1gi3vmfk4nmj30g808ft9m.jpg" alt="vxlan" loading="lazy"></figure>
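<p>Multiple vxlan networks can be created on one physical network. Each can be thought of as a tunnel through which virtual machines on different nodes connect directly. Every endpoint has a vtep that encapsulates and decapsulates vxlan packets, that is, it wraps the virtual network's frames in the headers used for vtep-to-vtep communication. Each vxlan network is identified by a unique VNI, so different vxlan networks do not interfere with each other.</p>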
<ul>
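<li>VTEP (VXLAN Tunnel Endpoints): the edge device of a vxlan network, which processes vxlan packets (encapsulation and decapsulation). A vtep can be a network device (such as a switch) or a machine (such as a host in a virtualization cluster).</li>
<li>VNI (VXLAN Network Identifier): the identifier of a vxlan network, a 24-bit integer, which allows 2^24 = 16,777,216 values (over sixteen million). Usually each VNI maps to one tenant, so a public cloud built on vxlan can in theory support tenants on the order of ten million.</li>
<li>Tunnel: a tunnel is a logical concept with no physical counterpart in the vxlan model. It can be seen as a virtual channel: the two communicating parties (the virtual machines in the figure) believe they are talking to each other directly and are unaware of the underlying network. Seen as a whole, each vxlan network looks like a dedicated channel, a tunnel, built for the virtual machines that communicate over it.</li>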
</ul>
<h3 id="vxlan-报文解析">Anatomy of a VXLAN packet</h3>
<figure data-type="image" tabindex="2"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1gi3w2klximj30vn0f040c.jpg" alt="post-img" loading="lazy"></figure>
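<p>The white part is the original frame sent by the virtual machine (a layer-2 frame containing the MAC header, IP header, and transport-layer header). In front of it is the vxlan header, which carries the vxlan-specific fields, and in front of that are the standard outer headers (UDP header, IP header, and MAC header) used to carry the packet across the underlying network.</p>
<p>The outermost UDP layer carries the packet over the underlay and is the basis for vtep-to-vtep communication. In the middle is the VXLAN header: when a vtep receives the packet it uses this part to apply the vxlan logic, mainly delivering the frame to the right virtual machine based on the VNI. The innermost part is the original frame, which is what the virtual machine sees.</p>
<p>The parts of the packet mean the following:</p>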
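<ul>
<li>VXLAN header: 8 bytes
<ul>
<li>VXLAN flags: flag bits (8 bits)</li>
<li>Reserved: reserved bits (24 bits)</li>
<li>VNID: the 24-bit VNI</li>
<li>Reserved: reserved bits (8 bits)</li>
</ul>
</li>
<li>UDP header: 8 bytes
<ul>
<li>UDP: the two UDP endpoints are the vtep applications; the vxlan port assigned by IANA is 4789</li>
</ul>
</li>
<li>IP header: 20 bytes
<ul>
<li>Destination address: the IP address of the vtep on the host where the destination virtual machine runs</li>
</ul>
</li>
<li>MAC header: 14 bytes
<ul>
<li>MAC address: the MAC addresses used for host-to-host communication</li>
</ul>
</li>
</ul>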
<p>So vxlan adds 50 bytes (8 + 8 + 20 + 14) on top of the original frame, which lowers the share of useful data carried on the link.</p>
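<p>The header layout and the 50-byte figure can be checked with a short sketch. It follows the field order listed above (flags, reserved, VNI, reserved); the function name <code>encodeVXLANHeader</code> and the constant names are chosen for this example, and setting the flags byte to <code>0x08</code> (the bit that marks the VNI as valid) follows the VXLAN specification rather than anything stated in these notes.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// Sizes of the headers that VXLAN encapsulation adds, in bytes.
const (
	outerEthernet = 14
	outerIPv4     = 20
	outerUDP      = 8
	vxlanHeader   = 8
	overhead      = outerEthernet + outerIPv4 + outerUDP + vxlanHeader
)

// encodeVXLANHeader builds the 8-byte VXLAN header for a 24-bit VNI.
func encodeVXLANHeader(vni uint32) ([8]byte, error) {
	var h [8]byte
	if vni &gt;= 1&lt;&lt;24 {
		return h, fmt.Errorf(&quot;VNI %d does not fit in 24 bits&quot;, vni)
	}
	h[0] = 0x08 // flags: the VNI field is valid
	// h[1..3] stay zero: reserved
	h[4] = byte(vni &gt;&gt; 16)
	h[5] = byte(vni &gt;&gt; 8)
	h[6] = byte(vni)
	// h[7] stays zero: reserved
	return h, nil
}

func main() {
	h, err := encodeVXLANHeader(1)
	fmt.Println(overhead, h, err)
}
</code></pre>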
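<h2 id="实现-vxlan">Implementing VXLAN</h2>
<p>Linux has supported vxlan only since version 3.7.0. Use a reasonably recent kernel where possible, to avoid functional or performance problems caused by an old kernel.</p>
<p>My test environment is two AWS instances running Debian:</p>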
<pre><code class="language-shell">$ uname -r
4.19.0-14-cloud-amd64
$ echo ${HOST1_IP}
172.16.3.142
$ echo ${HOST2_IP}
172.16.2.21
</code></pre>
<p>To experiment with container networking, each host also has a network namespace (net0) connected to a bridge (br0). How these are created is covered in the previous note.</p>
<pre><code class="language-shell">$ ip netns
net0 (id: 0)
$ ip link
veth1
br0
$ ip netns exec net0 ip addr # host1
veth0
  link/ether 4e:3d:fd:29:55:38
  inet 192.168.2.11/24 scope global veth0
$ ip netns exec net0 ip addr # host2
veth0
  link/ether 46:63:12:3e:fa:da
  inet 192.168.2.12/24 scope global veth0
</code></pre>
<h3 id="点对点-vxlan">Point-to-point VXLAN</h3>
<p>First create a point-to-point VXLAN device on host1. A point-to-point device is a vxlan device created with the <code>remote</code> parameter:</p>
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 remote 172.16.2.21
</code></pre>
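<p><code>id 1</code> is the VNI. For point-to-point devices it must be the same on both sides.</p>
<p><code>dstport 4789</code>: 4789 is the vxlan port assigned by IANA. Linux defaults to 8472, so the port is set explicitly here.</p>
<p><code>dev eth0</code> names the network device this node uses for vxlan traffic. Its IP is used as the local address, which here has the same effect as passing <code>local 172.16.3.142</code>.</p>
<p><code>remote 172.16.2.21</code> explicitly sets the IP of the vxlan peer, so packets are only ever sent to that address, much like a point-to-point protocol.</p>
<p>The same device must be created on host2. Note that the <code>id</code> and <code>dstport</code> parameters must match, and <code>remote</code> must be host1's IP:</p>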
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 remote 172.16.3.142
</code></pre>
<p>On both hosts, attach the vxlan device to the bridge and bring it up:</p>
<pre><code class="language-shell">$ ip link set vxlan0 master br0
$ ip link set up vxlan0
</code></pre>
<p>Try a ping:</p>
<pre><code class="language-shell">$ ip netns exec net0 ping -c1 192.168.2.12
PING 192.168.2.12 (192.168.2.12) 56(84) bytes of data.
64 bytes from 192.168.2.12: icmp_seq=1 ttl=64 time=1.31 ms

--- 192.168.2.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
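<h3 id="vxlan-网络">VXLAN networks</h3>
<p>Point-to-point devices can only talk in pairs, which is of limited practical use. What we need is a vxlan network, and a vxlan network raises a question: how do vteps learn about each other and choose the right path for a packet? Looking at the encapsulated packet above, two addresses are unknown at send time:</p>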
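<ol>
<li>The IP address of the peer vtep
<ul>
<li>The outer IP header needs the IP addresses of both vteps. The source address is easy to determine. The destination is the <strong>IP address of the vtep on the host where the destination virtual machine runs</strong>, but at send time only the IP address of the destination virtual machine is known.</li>
</ul>
</li>
<li>The MAC address of the peer virtual machine
<ul>
<li>Inside the original frame, both sides know each other's IP address, but for communication within the same subnet the sender also needs the <strong>MAC address of the peer virtual machine</strong>.</li>
</ul>
</li>
</ol>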
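<p>Why does the point-to-point vxlan device not have this problem?</p>
<p>With a point-to-point device, the peer vtep's IP address is given by the <code>remote</code> parameter when the vxlan device is created. Because both ends are in the same subnet, the vxlan device also sends ARP requests to the peer vtep, so the ARP reply comes straight back.</p>
<p><a href="https://cizixs.com/2017/09/25/vxlan-protocol-introduction/">《vxlan 协议原理简介》</a> proposes two solutions: multicast and a distributed control center. Multicast needs support from the underlying network devices, which limits where it can be used, and it wastes packets, so it is rarely used in production. vxlan with distributed control is a typical <a href="https://baike.baidu.com/item/%E8%BD%AF%E4%BB%B6%E5%AE%9A%E4%B9%89%E7%BD%91%E7%BB%9C/9117977">SDN</a> architecture and is the most widely used approach today.</p>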
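<h3 id="分布式控制中心">Distributed control center</h3>
<p>The multicast solution learns addresses automatically by flooding before a packet is sent, which is wasteful. The distributed control center solution is to know the address information in advance and tell the vtep directly, so that multicast is not needed.</p>
<p>Typically an agent runs on every node that hosts a vtep. The agent talks to the control center, obtains the information the vtep needs, and passes it to the vtep in some way. Implementations differ both in how and in when the vtep is told. There are two common timings: the more common one is to tell the vtep as soon as the information is known, even if it may never be used, usually before any communication has taken place; the other is for the vtep to notify the agent on first communication, and only then does the agent supply the information.</p>
<h4 id="arp-和-fdb">ARP and FDB</h4>
<p>First, a word on the ARP table and the FDB (layer-2 forwarding database).</p>
<p>The ARP table is used by layer-3 devices (routers, layer-3 switches, servers, PCs) to store the mapping from IP addresses to MAC addresses.</p>
<p>The FDB is the layer-2 forwarding table. Layer-2 devices (layer-2 switches) use it to store the mapping from MAC addresses to switch ports, which tells the switch which port a frame should be sent out of. The FDB of a Linux vxlan device differs slightly from that of a switch: it maps MAC addresses to the vtep addresses of other vxlan devices.</p>
<h4 id="手动维护-fdb-表">Maintaining the FDB by hand</h4>
<p>With multicast, the host IP address is discovered by flooding. If the MAC address of the destination virtual machine and the IP address of its host are known in advance, FDB entries can be installed to cut down on flooded packets. This solves the first problem.</p>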
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 nolearning
</code></pre>
<p>The <code>nolearning</code> parameter tells the vtep not to learn FDB entries from the packets it receives, because the table will be maintained by hand.</p>
<pre><code class="language-shell">$ bridge fdb append 4e:3d:fd:29:55:38 dev vxlan0 dst 172.16.3.142 # maps the MAC in host1's netns to host1's IP
$ bridge fdb append 46:63:12:3e:fa:da dev vxlan0 dst 172.16.2.21 # maps the MAC in host2's netns to host2's IP
</code></pre>
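<p>With this mapping, the vtep looks up the FDB when sending a packet and knows which remote vtep to send it to. Note that a default entry is also needed, so that when no mapping is known the vtep can still send out an ARP request to find the peer's MAC address.</p>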
<pre><code class="language-shell">$ bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 172.16.3.142
$ bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 172.16.2.21
</code></pre>
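<h4 id="手动维护-arp-表">Maintaining the ARP table by hand</h4>
<p>Maintaining only the FDB is not enough: as long as the peer virtual machine's MAC address is unknown, a large number of ARP requests is still flooded. So the ARP table needs to be maintained by hand as well. This solves the second problem.</p>
<p>Maintaining ARP entries is different from maintaining the FDB, because the parties that actually communicate are containers. Updating the ARP table inside every container would be a lot of work, and containers are created and deleted dynamically. Linux offers a solution: the vtep can act as an ARP proxy and answer ARP requests. As long as the vtep interface knows the <code>IP - MAC</code> mapping, it can reply directly to an ARP request from a container. Only the ARP entries on the vtep interface need to be updated.</p>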
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 nolearning proxy
</code></pre>
<p>The <code>proxy</code> parameter tells the vtep to act as an ARP proxy: when it receives an ARP request whose answer it knows, it replies directly.</p>
<pre><code class="language-shell">$ ip neigh add 192.168.2.11 lladdr 4e:3d:fd:29:55:38 dev vxlan0
$ ip neigh add 192.168.2.12 lladdr 46:63:12:3e:fa:da dev vxlan0
</code></pre>
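<p>Once every node that needs to communicate is configured, the containers can ping each other. When a container first reaches out to another and sends an ARP request, the request is not sent to every vtep; the local vtep answers it, which greatly reduces traffic on the network.</p>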
<pre><code class="language-shell">$ ip netns exec net0 ping -c1 192.168.2.12
PING 192.168.2.12 (192.168.2.12) 56(84) bytes of data.
64 bytes from 192.168.2.12: icmp_seq=1 ttl=64 time=1.15 ms

--- 192.168.2.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
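<p>How the two tables cooperate can be modelled with two maps. The sketch below is a toy model of the lookup order described above, not how the kernel implements it: the <code>vtep</code> type, <code>resolve</code>, and <code>allZeroMAC</code> are names invented here, and the addresses are the ones used in this walkthrough.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// allZeroMAC is the key of the default FDB entries.
const allZeroMAC = &quot;00:00:00:00:00:00&quot;

// vtep holds the two hand-maintained tables.
type vtep struct {
	arp map[string]string   // container IP -&gt; container MAC
	fdb map[string][]string // container MAC -&gt; remote VTEP IPs
}

// resolve returns the inner destination MAC and the outer destination IPs
// for a container IP. known is false when there is no ARP entry, in which
// case the default FDB entries are returned (where an ARP request would go).
func (v vtep) resolve(ip string) (mac string, remotes []string, known bool) {
	mac, known = v.arp[ip]
	if !known {
		return &quot;&quot;, v.fdb[allZeroMAC], false
	}
	if r, ok := v.fdb[mac]; ok {
		return mac, r, true
	}
	return mac, v.fdb[allZeroMAC], true
}

func main() {
	host1 := vtep{
		arp: map[string]string{&quot;192.168.2.12&quot;: &quot;46:63:12:3e:fa:da&quot;},
		fdb: map[string][]string{
			&quot;46:63:12:3e:fa:da&quot;: {&quot;172.16.2.21&quot;},
			allZeroMAC:          {&quot;172.16.3.142&quot;, &quot;172.16.2.21&quot;},
		},
	}
	fmt.Println(host1.resolve(&quot;192.168.2.12&quot;))
	fmt.Println(host1.resolve(&quot;192.168.2.99&quot;))
}
</code></pre>
<p>A known IP resolves to exactly one remote vtep, while an unknown IP falls back to the default entries, which is the case the hand-maintained ARP entries are meant to avoid.</p>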
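<p>Note that in the examples so far every netns is in the same subnet, <code>192.168.2.0/24</code>. A real deployment needs a larger address range, and traffic between subnets has to go through a gateway.</p>
<h4 id="动态维护-arp-和-fdb-表">Maintaining ARP and FDB dynamically</h4>
<p>Maintaining the FDB and ARP tables by hand avoids unnecessary traffic, but one problem remains: for all containers to work, every container that might ever be contacted has to be added to the ARP and FDB tables in advance. Not all containers on the network talk to each other, so some of those entries are never used.</p>
<p>Linux offers a mechanism for this: the kernel can notify user space about which container a node is trying to reach. An application subscribes to these events; when the kernel finds that a required ARP or FDB entry is missing, it sends an event to the subscriber, which can then fetch the information from the control center and update the tables, giving finer-grained control.</p>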
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 nolearning proxy l2miss l3miss
</code></pre>
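<p>This time there are two more parameters, <code>l2miss</code> and <code>l3miss</code>:</p>
<ul>
<li><code>l2miss</code>: if the device cannot find the vtep address for a MAC address, it sends a notification event</li>
<li><code>l3miss</code>: if the device cannot find the MAC address for an IP address, it sends a notification event</li>
</ul>
<p>The <code>ip monitor</code> command can listen for events on an interface:</p>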
<pre><code class="language-shell">$ ip monitor all dev vxlan0
</code></pre>
<p>If a container on this node pings a container on another node, an l3miss happens first:</p>
<pre><code class="language-shell">$ ip monitor all dev vxlan0
[nsid current]miss 10.20.1.3  STALE
</code></pre>
<p><code>l3miss</code> means the vtep does not know the MAC address for this IP address, so an ARP entry has to be added by hand:</p>
<pre><code class="language-shell">$ ip neigh add 192.168.2.12 lladdr 46:63:12:3e:fa:da dev vxlan0 nud reachable
</code></pre>
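<p>The <code>nud reachable</code> parameter means the entry is not permanent: some time after the system finds it is no longer valid, it is deleted automatically.</p>
<p>Communication still does not work after the ARP entry is added. Next, an l2miss notification shows up:</p>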
<pre><code class="language-shell">$ ip monitor all dev vxlan0
[nsid current]miss lladdr 46:63:12:3e:fa:da STALE
</code></pre>
<p>This event means the vtep does not know which node the container with this MAC address is on, so an FDB entry has to be added by hand:</p>
<pre><code class="language-shell">$ bridge fdb append 46:63:12:3e:fa:da dev vxlan0 dst 172.16.2.21
</code></pre>
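<p>The on-demand flow above (l3miss, then l2miss) can be sketched as a small agent that fills its tables only when asked. This is a toy model under stated assumptions: the <code>record</code> and <code>agent</code> types and the <code>onL3Miss</code>/<code>onL2Miss</code> handlers are hypothetical names, the in-memory <code>directory</code> stands in for the control center, and no real netlink subscription is involved.</p>
<pre><code class="language-go">package main

import &quot;fmt&quot;

// record is what the hypothetical control center knows about one container.
type record struct{ ip, mac, vtepIP string }

// agent keeps the local ARP and FDB tables up to date on demand.
type agent struct {
	directory []record          // stand-in for the control center
	arp       map[string]string // container IP -&gt; container MAC
	fdb       map[string]string // container MAC -&gt; remote VTEP IP
}

func newAgent(directory []record) *agent {
	return &amp;agent{directory: directory, arp: map[string]string{}, fdb: map[string]string{}}
}

// onL3Miss handles &quot;no MAC known for this IP&quot; by installing an ARP entry.
func (a *agent) onL3Miss(ip string) bool {
	for _, r := range a.directory {
		if r.ip == ip {
			a.arp[ip] = r.mac
			return true
		}
	}
	return false
}

// onL2Miss handles &quot;no VTEP known for this MAC&quot; by installing an FDB entry.
func (a *agent) onL2Miss(mac string) bool {
	for _, r := range a.directory {
		if r.mac == mac {
			a.fdb[mac] = r.vtepIP
			return true
		}
	}
	return false
}

func main() {
	a := newAgent([]record{{&quot;192.168.2.12&quot;, &quot;46:63:12:3e:fa:da&quot;, &quot;172.16.2.21&quot;}})
	fmt.Println(a.onL3Miss(&quot;192.168.2.12&quot;), a.arp)
	fmt.Println(a.onL2Miss(&quot;46:63:12:3e:fa:da&quot;), a.fdb)
}
</code></pre>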
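<h2 id="flannel">Flannel</h2>
<p>Flannel is a network plugin designed by CoreOS for Kubernetes. It is simple to implement and easy to configure. The community is not very active, but it is still a good project to learn from.</p>
<h3 id="some-design-notes-and-history">Some design notes and history</h3>
<p>Flannel implements networking through different <code>backend</code>s. The vxlan implementation lives in <code>backend/vxlan</code>, and the comments in the source file <code>vxlan.go</code> record some of its history:</p>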
<ol>
<li>
<p>The first version of Flannel handled l3miss by looking up the MAC for the ARP table, and handled l2miss by obtaining the public IP of the VTEP.</p>
</li>
<li>
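<p>The second version of Flannel removed the need for l3miss handling: when a remote host comes online, the corresponding ARP entry is simply added directly, with no lookup or learning.</p>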
</li>
<li>
<p>The latest version of Flannel removed the need for l2miss handling as well, and no longer listens for netlink messages.</p>
<p>How it works:</p>
<ol>
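<li>Create the vxlan device, without listening for any l2miss or l3miss events</li>
<li>Create a route for each remote subnet</li>
<li>Create a static ARP entry for each remote host</li>
<li>Create an FDB entry containing the VTEP MAC and the public IP of the remote Flannel node</li>
<li>Within one VNI, each remote host needs only 1 route, 1 ARP entry, and 1 FDB entry.</li>
<li>There is also an option to skip vxlan for hosts on the same subnet, called &quot;directRouting&quot;</li>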
</ol>
</li>
</ol>
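<p>Drawbacks of the l2miss and l3miss approach:</p>
<ol>
<li>Every host has to hold entries for every guest it needs to reach, so the tables balloon, which does not suit large networks</li>
<li>Learning entries through netlink notifications is inefficient</li>
<li>If the Flannel daemon fails, the ARP and FDB tables are no longer maintained, and the network breaks</li>
</ol>
<p>The latest scheme has an option to skip vxlan for hosts on the same subnet, called &quot;directRouting&quot;.</p>
<h3 id="源码分析">Source code walkthrough</h3>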
<pre><code>func main() {
	// Create the SubnetManager, which manages subnets. It has two modes, selected by kube-subnet-mgr: kubeSubnetMgr manages subnets through Kubernetes, and etcdSubnetMgr manages them through etcd.
	sm, err := newSubnetManager()
	if err != nil {
		log.Error(&quot;Failed to create SubnetManager: &quot;, err)
		os.Exit(1)
	}
	log.Infof(&quot;Created subnet manager: %s&quot;, sm.Name())

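	// Create the BackendManager, then get a BackendNetwork by type, which is used to create the network on the node. Backends register themselves through backend.Register in their init functions; the BackendManager returns the backend of the requested type through GetBackend, and the type comes from the BackendType field of the Flannel config file.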
	// Create a backend manager then use it to create the backend and register the network with it.
	bm := backend.NewManager(ctx, sm, extIface)
	be, err := bm.GetBackend(config.BackendType)
	if err != nil {
		log.Errorf(&quot;Error fetching backend: %s&quot;, err)
		cancel()
		wg.Wait()
		os.Exit(1)
	}

	// With the backend in hand, call RegisterNetwork to create the host network.
	bn, err := be.RegisterNetwork(ctx, &amp;wg, config)
	if err != nil {
		log.Errorf(&quot;Error registering network: %s&quot;, err)
		cancel()
		wg.Wait()
		os.Exit(1)
	}

	// Start &quot;Running&quot; the backend network. This will block until the context is done so run in another goroutine.
	log.Info(&quot;Running backend.&quot;)
	wg.Add(1)
	go func() {
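		// Watch subnet events; handleSubnetEvents creates the static route, ARP entry, and FDB entry for each host.
		// kubeSubnetManager watches Node events through an informer set up in newKubeSubnetManager and sends the corresponding objects to events for processing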
		bn.Run(ctx)
		wg.Done()
	}()

	daemon.SdNotify(false, &quot;READY=1&quot;)

	// Kube subnet mgr doesn't lease the subnet for this node - it just uses the podCidr that's already assigned.
	if !opts.kubeSubnetMgr {
		err = MonitorLease(ctx, sm, bn, &amp;wg)
		if err == errInterrupted {
			// The lease was &quot;revoked&quot; - shut everything down
			cancel()
		}
	}
}
</code></pre>
<p><code>vxlan.go RegisterNetwork()</code></p>
<pre><code>func (be *VXLANBackend) RegisterNetwork(ctx context.Context, wg *sync.WaitGroup, config *subnet.Config) (backend.Network, error) {
  // Read the settings from the Backend field of the config file. Fill in vxlanDeviceAttrs, using the VNI as part of the device name.
  devAttrs := vxlanDeviceAttrs{
		vni:       uint32(cfg.VNI),
		name:      fmt.Sprintf(&quot;flannel.%v&quot;, cfg.VNI),
		vtepIndex: be.extIface.Iface.Index,
		vtepAddr:  be.extIface.IfaceAddr,
		vtepPort:  cfg.Port,
		gbp:       cfg.GBP,
		learning:  cfg.Learning,
	}
  // Create the vxlanDevice from vxlanDeviceAttrs.
  // newVXLANDevice creates the vxlan device through the github.com/vishvananda/netlink package, then sets net/ipv6/conf/${device_name}/accept_ra.
	dev, err := newVXLANDevice(&amp;devAttrs)
	if err != nil {
		return nil, err
	}
	dev.directRouting = cfg.DirectRouting

  // Build the attributes with newSubnetAttrs, then use subnetMgr to set up the subnet and obtain a Lease.
	subnetAttrs, err := newSubnetAttrs(be.extIface.ExtAddr, dev.MACAddr())
	if err != nil {
		return nil, err
	}

	lease, err := be.subnetMgr.AcquireLease(ctx, subnetAttrs)
	switch err {
	case nil:
	case context.Canceled, context.DeadlineExceeded:
		return nil, err
	default:
		return nil, fmt.Errorf(&quot;failed to acquire lease: %v&quot;, err)
	}

	// Ensure that the device has a /32 address so that no broadcast routes are created.
	// This IP is just used as a source address for host to workload traffic (so
	// the return path for the traffic has an address on the flannel network to use as the destination)
  // Configure the vxlan device's address and bring the device up. The device gets a /32 address inside the subnet
	if err := dev.Configure(ip.IP4Net{IP: lease.Subnet.IP, PrefixLen: 32}); err != nil {
		return nil, fmt.Errorf(&quot;failed to configure interface %s: %s&quot;, dev.link.Attrs().Name, err)
	}

	return newNetwork(be.subnetMgr, be.extIface, dev, ip.IP4Net{}, lease)
}
</code></pre>
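<h3 id="linux-实现-flannel-网络">Building a flannel-style network on Linux</h3>
<p>A Flannel network does not need many table entries: within one VNI, each host only needs one route, one ARP entry, and one FDB entry. Fewer entries solves the problem of too many unused entries that comes with maintaining the FDB and ARP tables by hand, at the cost of sending a few more packets. This is a trade-off in Flannel's design.</p>
<h4 id="环境">Environment</h4>
<p>Flannel configuration:</p>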
<pre><code class="language-json">{
  &quot;Network&quot;: &quot;10.244.0.0/16&quot;,
  &quot;Backend&quot;: {
    &quot;Type&quot;: &quot;vxlan&quot;
  }
}
</code></pre>
<p>Flannel uses a <code>/16</code> CIDR and assigns each node a <code>/24</code> subnet, so the network namespaces now look like this:</p>
<pre><code class="language-shell">$ ip netns exec net0 ip addr # host1
veth0:
  inet 10.244.0.2/24 scope global veth0
$ ip netns exec net0 ip addr # host2
veth0:
  inet 10.244.1.2/24 scope global veth0
</code></pre>
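<p>The split of a <code>/16</code> into per-node <code>/24</code> subnets can be sketched in a few lines. This is only an illustration of the address arithmetic: <code>nodeSubnet</code> is a name made up here, and picking the subnet by a node index is a simplification (Flannel hands out subnets through leases, as the source above shows).</p>
<pre><code class="language-go">package main

import (
	&quot;fmt&quot;
	&quot;net&quot;
)

// nodeSubnet returns the index-th /24 inside an IPv4 /16 network.
func nodeSubnet(network string, index int) (string, error) {
	_, ipnet, err := net.ParseCIDR(network)
	if err != nil {
		return &quot;&quot;, err
	}
	ip := ipnet.IP.To4()
	ones, _ := ipnet.Mask.Size()
	if ip == nil || ones != 16 {
		return &quot;&quot;, fmt.Errorf(&quot;%s is not an IPv4 /16 network&quot;, network)
	}
	if index &lt; 0 || index &gt; 255 {
		return &quot;&quot;, fmt.Errorf(&quot;index %d is outside 0..255&quot;, index)
	}
	sub := net.IPNet{
		IP:   net.IPv4(ip[0], ip[1], byte(index), 0),
		Mask: net.CIDRMask(24, 32),
	}
	return sub.String(), nil
}

func main() {
	for i := 0; i &lt; 2; i++ {
		s, err := nodeSubnet(&quot;10.244.0.0/16&quot;, i)
		fmt.Println(s, err)
	}
}
</code></pre>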
<p>Because traffic now crosses subnets, give br0 an IP address and add a route so that it acts as the gateway:</p>
<pre><code class="language-shell">$ ip netns exec net0 ip route add 10.244.0.0/16 via 10.244.0.1 dev veth0 onlink # host1
$ ip addr
br0:
  inet 10.244.0.1/24 scope global br0
$ ip netns exec net0 ip route add 10.244.0.0/16 via 10.244.1.1 dev veth0 onlink # host2
$ ip addr
br0:
  inet 10.244.1.1/24 scope global br0
</code></pre>
<h4 id="示例">Example</h4>
<p>Configure the vxlan device:</p>
<pre><code class="language-shell">$ ip link add vxlan0 type vxlan id 1 dstport 4789 dev eth0 nolearning
$ ip link set up vxlan0
$ ip link # host1
vxlan0:
  link/ether c2:cb:69:f5:a6:e4
$ ip link # host2
vxlan0:
  link/ether 66:8e:33:ac:7a:22
</code></pre>
<p>Set up the routes:</p>
<pre><code class="language-shell">$ ip addr add 10.244.0.0 dev vxlan0 # the vxlan IP of this host
$ ip route add 10.244.1.0/24 via 10.244.1.0 dev vxlan0 onlink # on host1, the route to host2's subnet
# $ ip addr add 10.244.1.0 dev vxlan0
# $ ip route add 10.244.0.0/24 via 10.244.0.0 dev vxlan0 onlink
</code></pre>
<p>Set up the FDB:</p>
<pre><code class="language-shell">$ bridge fdb append 66:8e:33:ac:7a:22 dev vxlan0 dst 172.16.2.21 # the vxlan MAC of host2 and host2's IP
# bridge fdb append c2:cb:69:f5:a6:e4 dev vxlan0 dst 172.16.3.142
</code></pre>
<p>Set up the ARP table:</p>
<pre><code class="language-shell">$ ip neigh add 10.244.1.0 dev vxlan0 lladdr 66:8e:33:ac:7a:22 # host2's vxlan MAC and vxlan IP
# $ ip neigh add 10.244.0.0 dev vxlan0 lladdr c2:cb:69:f5:a6:e4
</code></pre>
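<p>The three per-peer entries (route, FDB, ARP) follow one pattern, which the following Go sketch makes explicit by rendering the commands for a single peer. The <code>peer</code> type and <code>peerCommands</code> function are hypothetical names; the sketch assumes, as these notes do, that a peer's vxlan IP is the network address of its subnet.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"net/netip"
)

// peer describes one remote host; the field names are illustrative.
type peer struct {
	subnet  string // the peer's /24, e.g. "10.244.1.0/24"
	vtepMAC string // MAC address of the peer's vxlan device
	hostIP  string // the peer's underlay address
}

// peerCommands renders the route, FDB and ARP commands for one peer.
// The peer's vxlan IP is taken to be the network address of its subnet.
func peerCommands(dev string, p peer) []string {
	vtepIP := netip.MustParsePrefix(p.subnet).Masked().Addr()
	return []string{
		fmt.Sprintf("ip route add %s via %s dev %s onlink", p.subnet, vtepIP, dev),
		fmt.Sprintf("bridge fdb append %s dev %s dst %s", p.vtepMAC, dev, p.hostIP),
		fmt.Sprintf("ip neigh add %s dev %s lladdr %s", vtepIP, dev, p.vtepMAC),
	}
}

func main() {
	host2 := peer{subnet: "10.244.1.0/24", vtepMAC: "66:8e:33:ac:7a:22", hostIP: "172.16.2.21"}
	for _, cmd := range peerCommands("vxlan0", host2) {
		fmt.Println(cmd)
	}
}
</code></pre>
<p>Run on host1, this prints the same three commands used above for host2; the number of entries grows with the number of hosts, not with the number of containers.</p>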
<p>Test with ping:</p>
<pre><code class="language-shell">$ ip netns exec net0 ping -c1 10.244.1.2
PING 10.244.1.2 (10.244.1.2) 56(84) bytes of data.
64 bytes from 10.244.1.2: icmp_seq=1 ttl=62 time=1.11 ms

--- 10.244.1.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
<p>If needed, remember to enable the kernel's IP forwarding parameter:</p>
<pre><code class="language-shell">$ sysctl -w net.ipv4.ip_forward=1
</code></pre>
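<h4 id="总结">Summary</h4>
<p>Because each node owns one <code>/24</code> subnet, flannel greatly reduces the work of maintaining ARP and FDB entries. The only added cost is a few ARP requests after a packet reaches the destination host, and adding or removing containers does not trigger any maintenance. Compared with a fully manual setup, this is much better.</p>
<h2 id="参考文章">References</h2>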
<p><a href="https://cizixs.com/2017/09/25/vxlan-protocol-introduction/">vxlan 协议原理简介</a></p>
<p><a href="https://cizixs.com/2017/09/28/linux-vxlan/">linux 上实现 vxlan 网络</a></p>
<p><a href="https://zdyxry.github.io/2020/01/03/%e4%b8%ba%e4%bb%80%e4%b9%88-flannel-1-%e4%b8%a2%e5%a4%b1%e5%90%8e%e4%b8%8d%e4%bc%9a%e8%87%aa%e5%8a%a8%e9%87%8d%e5%bb%ba/">为什么 flannel.1 丢失后不会自动重建</a></p>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Container networking study notes]]></title>
        <id>https://cnbailian.github.io/post/container-netwrok-notes/</id>
        <link href="https://cnbailian.github.io/post/container-netwrok-notes/">
        </link>
        <updated>2021-02-04T11:00:00.000Z</updated>
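        <summary type="html"><![CDATA[<p>Container technology touches many topics, including process isolation, container networking, and layered storage. Container networking interests me most and I have studied it in some depth; this post records my notes.</p>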
]]></summary>
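        <content type="html"><![CDATA[<p>Container technology touches many topics, including process isolation, container networking, and layered storage. Container networking interests me most and I have studied it in some depth; this post records my notes.</p>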
<!--more-->
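<h2 id="概述">Overview</h2>
<p>Containers are known for using Linux namespaces to isolate resources. A namespace is the kernel's way of wrapping and isolating a global system resource such as PIDs, users, or the network. Changing an isolated resource inside a namespace affects only the processes in that namespace, not processes in other namespaces.</p>
<p>The network namespace is the one this post is mainly about. It isolates network devices, IP addresses, ports, and so on; each namespace has its own network protocol stack, IP routing table, firewall rules, and sockets.</p>
<p>Separate network namespaces give network isolation, but a completely isolated network environment is of little practical use. Linux virtual network devices act as the "network cards" that connect it to other networks. There are many kinds of virtual network devices; this post covers the two used to build container networks, veth and bridge. The former connects two isolated network namespaces, and the latter lets more network namespaces join.</p>
<h2 id="linux-veth">Linux Veth</h2>
<h3 id="linux-网络设备">Linux network devices</h3>
<p>A Linux network device is like a bidirectional pipe: data that enters one end leaves from the other, and what matters is what sits at each end. Take the familiar eth0: one end connects to the network protocol stack and the other to the NIC. A user program calls the socket API, the data passes through the Linux network protocol stack into the eth0 device, and is finally sent to the NIC.</p>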
<pre><code>+-------------------------------------------+
|                                           |
|        +-------------------+              |
|        | User Application  |              |
|        +-------------------+              |   
|                 |                         |     
|.................|.........................|
|                 ↓                         |     
|           +----------+                    |     
|           | socket   |                    |     
|           +----------+                    |     
|                 |                         |     
|.................|.........................|
|                 ↓                         |     
|      +------------------------+           |     
|      | Network Protocol Stack |           |     
|      +------------------------+           |     
|                 |                         |     
|.................|.........................|
|                 ↓                         |     
|        +----------------+                 |     
|        |      eth0      |                 |     
|        +----------------+                 |     
|                 |                         |
|                 |                         |
|                 |                         |
+-----------------|-------------------------+
                  ↓
          Physical Network
</code></pre>
<h3 id="veth-pair">Veth Pair</h3>
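<p>Veth is a Linux virtual network device that always comes in pairs. One end of each device connects to the network protocol stack, and at the other end the two devices are connected to each other. As a result, when one device receives data from the protocol stack, it sends that data to its peer.</p>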
<pre><code>+----------------------------------------------------------------+
|                                                                |
|       +------------------------------------------------+       |
|       |             Network Protocol Stack             |       |
|       +------------------------------------------------+       |
|              ↑               ↑               ↑                 |
|..............|...............|...............|.................|
|              ↓               ↓               ↓                 |
|        +----------+    +-----------+   +-----------+           |
|        |   eth0   |    |   veth0   |   |   veth1   |           |
|        +----------+    +-----------+   +-----------+           |
|              ↑               ↑               ↑                 |
|              |               +---------------+                 |
|              |         192.168.2.11     192.168.2.1            |
+--------------|-------------------------------------------------+
               ↓
         Physical Network
</code></pre>
<p>This property can be used to connect the networks of two network namespaces.</p>
<h4 id="示例">Example</h4>
<p>The example below creates two network namespaces and a veth pair, then connects them.</p>
<pre><code class="language-shell"># Create the network namespaces
root@ubuntu:~$ ip netns add net0
root@ubuntu:~$ ip netns add net1
# Create a veth pair
# No names are given, so veth0 and veth1 are generated; if other veth devices exist, the numbering continues from them
# To choose the names: ip link add vethfoo type veth peer name vethbar
root@ubuntu:~$ ip link add type veth
# Move veth0 into the net0 namespace
root@ubuntu:~$ ip link set dev veth0 netns net0
# Move veth1 into the net1 namespace
root@ubuntu:~$ ip link set dev veth1 netns net1
# Assign an IP to each device
# ip netns exec runs a command inside a network namespace
root@ubuntu:~$ ip netns exec net0 ip addr add 192.168.2.11/24 dev veth0
root@ubuntu:~$ ip netns exec net1 ip addr add 192.168.2.1/24 dev veth1
# Bring up the veth pair
root@ubuntu:~$ ip netns exec net0 ip link set dev veth0 up
root@ubuntu:~$ ip netns exec net1 ip link set dev veth1 up
# Try a ping
root@ubuntu:~$ ip netns exec net0 ping -c1 192.168.2.1
PING 192.168.2.1 (192.168.2.1) from 192.168.2.11 veth0: 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.032 ms

--- 192.168.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
<h3 id="network-namespace">Network namespace</h3>
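<p>A few more notes on network namespaces.</p>
<p>A newly created network namespace has only a lo device by default; any other network device must be created in it or moved into it. Note that the lo device is down by default and has to be brought up manually.</p>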
<pre><code class="language-shell">root@ubuntu:~$ ip netns add net0
root@ubuntu:~$ ip netns exec net0 ip link
lo: &lt;LOOPBACK&gt; mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
</code></pre>
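<p>The example above moved each end of the veth pair into a namespace, but devices marked as "local devices" cannot be moved, for example loopback, bridge, and ppp. The <code>netns-local</code> feature of a device can be checked with <code>ethtool -k</code>:</p>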
<pre><code class="language-shell">root@ubuntu:~$ ethtool -k lo|grep netns-local
netns-local: on [fixed]
root@ubuntu:~$ ethtool -k veth0|grep netns-local
netns-local: off [fixed]
</code></pre>
<h2 id="linux-bridge">Linux Bridge</h2>
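<p>A veth pair lets two network namespaces communicate, but when more namespaces need to talk to each other, a bridge is needed. A bridge is also a Linux virtual network device and has the properties of one: it can be given an IP address, a MAC address, and so on. At the same time it is a virtual switch with functions similar to a physical switch.</p>
<p>An ordinary network device has only two ends: data that enters one end leaves from the other. A bridge has multiple ports, and data can enter on any of them. Which port it leaves from is decided much as in a physical switch, by MAC address.</p>
<p>So a bridge, acting as a virtual switch, is what enables communication among multiple network namespaces.</p>
<h3 id="使用-bridge-连接不同的-namespace">Connecting namespaces with a bridge</h3>
<p>First create and bring up a bridge named br0:</p>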
<pre><code class="language-shell">root@ubuntu:~$ ip link add name br0 type bridge
root@ubuntu:~$ ip link set br0 up
root@ubuntu:~$ ip link
br0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 9a:a8:84:37:d4:56 brd ff:ff:ff:ff:ff:ff
</code></pre>
<p>Prepare the network namespaces as well:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip netns add net0
root@ubuntu:~$ ip netns add net1
</code></pre>
<h4 id="示例-2">Example</h4>
<p>With the two network environments and the virtual switch ready, connect them using veth pairs:</p>
<pre><code class="language-shell"># Create the veth pair for net0
root@ubuntu:~$ ip link add type veth
# Move veth0 into net0
root@ubuntu:~$ ip link set dev veth0 netns net0
# Assign an IP and bring it up
root@ubuntu:~$ ip netns exec net0 ip addr add 192.168.2.11/24 dev veth0
root@ubuntu:~$ ip netns exec net0 ip link set dev veth0 up
# Attach the peer device to the bridge and bring it up
root@ubuntu:~$ ip link set dev veth1 master br0
root@ubuntu:~$ ip link set dev veth1 up
# Same for net1 (the name veth0 is free again on the host, so the new pair is veth0 and veth2)
root@ubuntu:~$ ip link add type veth
root@ubuntu:~$ ip link set dev veth0 netns net1
root@ubuntu:~$ ip netns exec net1 ip addr add 192.168.2.1/24 dev veth0
root@ubuntu:~$ ip netns exec net1 ip link set dev veth0 up
root@ubuntu:~$ ip link set dev veth2 master br0
root@ubuntu:~$ ip link set dev veth2 up
</code></pre>
<p>Test with ping:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip netns exec net0 ping -c1 192.168.2.1
PING 192.168.2.1 (192.168.2.1) 56(84) bytes of data.
64 bytes from 192.168.2.1: icmp_seq=1 ttl=64 time=0.045 ms

--- 192.168.2.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
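<p>Here the veth pair works like a network cable: one end (veth0) plugs into the container (the network namespace) and the other end (veth1) plugs into the switch (the bridge). Attaching a device to the bridge is like plugging a new cable into a switch. When a frame arrives at the bridge, the bridge uses the MAC addresses in it to decide whether to flood, forward, or drop it.</p>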
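<p>The flood, forward, or drop decision can be illustrated with a toy learning switch in Go. This is a minimal sketch of the idea only, not the kernel's bridge implementation; the <code>bridge</code> type, its <code>receive</code> method, and the short MAC addresses are invented for the example.</p>
<pre><code class="language-go">package main

import "fmt"

// bridge is a toy model of a learning switch; it only captures the
// forwarding decision, not how the kernel implements br0.
type bridge struct {
	ports []string          // attached devices, e.g. veth1 and veth2
	fdb   map[string]string // learned MAC address -> port
}

func newBridge(ports ...string) bridge {
	return bridge{ports: ports, fdb: map[string]string{}}
}

// receive learns the source MAC and returns the ports the frame leaves on.
func (b bridge) receive(inPort, srcMAC, dstMAC string) []string {
	b.fdb[srcMAC] = inPort
	if out, known := b.fdb[dstMAC]; known {
		if out == inPort {
			return nil // destination sits behind the ingress port: drop
		}
		return []string{out} // known destination: forward to one port
	}
	var flood []string // unknown or broadcast destination: flood
	for _, port := range b.ports {
		if port != inPort {
			flood = append(flood, port)
		}
	}
	return flood
}

func main() {
	br0 := newBridge("veth1", "veth2")
	// First frame: the destination is unknown, so it is flooded.
	fmt.Println(br0.receive("veth1", "aa:aa:aa:aa:aa:01", "ff:ff:ff:ff:ff:ff")) // [veth2]
	// The reply goes straight to the port learned from the first frame.
	fmt.Println(br0.receive("veth2", "aa:aa:aa:aa:aa:02", "aa:aa:aa:aa:aa:01")) // [veth1]
}
</code></pre>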
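<h3 id="给-bridge-配上-ip">Giving the bridge an IP</h3>
<p>A bridge differs from a real layer-2 switch in one way: data can be sent to the bridge directly instead of being received through a port. One way to see it is that the bridge has its own MAC address and can send frames itself; another is that the bridge has a hidden port connected to the host Linux system, through which programs on the host can send data to the other ports.</p>
<p>An interesting consequence is that a bridge can be given an IP address. IP addresses belong to layer 3 and normally have no place on a layer-2 device, but a bridge is a virtual switch and one kind of generic network device, and any network device can be assigned an IP address.</p>
<p>Once a bridge has an IP, Linux can reach it at layer 3 through the routing table or iptables rules. It is as if Linux had another hidden virtual NIC connected to the bridge's hidden port; that NIC is the generic network device named br0, and the IP can be seen as belonging to it. When data addressed to this IP arrives at br0, the kernel protocol stack treats it as destined for the local host, and applications can receive it through a socket.</p>
<h4 id="示例-3">Example</h4>
<p>Continuing with the environment above, configure an IP on the bridge:</p>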
<pre><code class="language-shell">root@ubuntu:~$ ip addr add 192.168.2.12/24 dev br0
</code></pre>
<p>Ping net0 from the host:</p>
<pre><code class="language-shell">root@ubuntu:~$ ping -I br0 -c1 192.168.2.11
PING 192.168.2.11 (192.168.2.11) from 192.168.2.12 br0: 56(84) bytes of data.
64 bytes from 192.168.2.11: icmp_seq=1 ttl=64 time=0.057 ms

--- 192.168.2.11 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
<p>Ping br0 from inside net1:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip netns exec net1 ping -c1 192.168.2.12
PING 192.168.2.12 (192.168.2.12) 56(84) bytes of data.
64 bytes from 192.168.2.12: icmp_seq=1 ttl=64 time=0.061 ms

--- 192.168.2.12 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
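<h2 id="与外部网络通信">Communicating with external networks</h2>
<p>With an IP on the bridge, the network namespaces can already talk to the host's network protocol stack through br0, but they still need to reach external networks.</p>
<p>One approach is to attach the physical NIC device eth0 to br0 as well. br0 does not care whether an attached device is physical or virtual; to it they are all network devices, so this is like giving br0 a cable to the external physical network. Every network namespace connected to br0 can then reach external networks through it. I am using a cloud host over ssh, which makes this hard to debug, so I have not tried this approach.</p>
<p>That approach sends data out through eth0 directly without going through the host's network protocol stack. The second approach leaves eth0 off the bridge and forwards the data with IP forwarding instead. Because network namespaces are assigned private IPs, packets usually also need NAT before they leave.</p>
<h3 id="ip-forward">IP forward</h3>
<p>"IP forwarding" is a synonym for "routing"; because it is a Linux kernel feature it is also called "kernel IP forwarding". Forwarding means the Linux kernel acts as a router: based on a packet's IP address it passes the data from one network to another, which in turn keeps sending the packet according to its own routing table.</p>
<p>For security reasons, Linux disables packet forwarding by default. To enable it, change the kernel parameter <code>net.ipv4.ip_forward</code>: 0 disables forwarding and 1 enables it.</p>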
<pre><code class="language-shell">root@ubuntu:~$ sysctl net.ipv4.ip_forward
net.ipv4.ip_forward = 0
# It can also be read from /proc
root@ubuntu:~$ cat /proc/sys/net/ipv4/ip_forward
0
</code></pre>
<h4 id="修改内核参数">Changing the kernel parameter</h4>
<p><strong>Temporary</strong></p>
<pre><code class="language-shell">root@ubuntu:~$ sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
# Or write to the /proc file directly
root@ubuntu:~$ echo 1 &gt; /proc/sys/net/ipv4/ip_forward
</code></pre>
<p><strong>Permanent</strong></p>
<p>Edit <code>sysctl.conf</code>, find the <code>net.ipv4.ip_forward</code> entry, and set it to 1:</p>
<pre><code class="language-shell">root@ubuntu:~$ vi /etc/sysctl.conf
# Reload the file to apply the change to the running system
root@ubuntu:~$ sysctl -p /etc/sysctl.conf
</code></pre>
<h3 id="nat">NAT</h3>
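<p><strong>Network address translation</strong> (NAT) here converts the private IP of a network namespace in a packet to the public IP owned by the host.</p>
<p>NAT comes in two kinds depending on direction: SNAT rewrites the source IP of outgoing packets to the public IP, and DNAT rewrites the destination IP of incoming packets from the public IP to the private IP of the network namespace.</p>
<h3 id="netfilteriptables">netfilter/iptables</h3>
<p>Both IP forwarding and NAT are configured on Linux through netfilter/iptables rules. The two names can be taken apart: netfilter is the name of the whole <a href="https://www.netfilter.org">project</a>, and within the project it refers specifically to the netfilter framework in the kernel, while the more familiar iptables is the user-space tool for configuring that framework.</p>
<h4 id="netfilter-框架">The netfilter framework</h4>
<p>netfilter adds several hooks to the IP layer of the kernel protocol stack and lets kernel modules register callbacks on them. Every packet passing a hook is handled by the functions registered there, which may modify or drop it.</p>
<p>The netfilter framework keeps track of the functions or modules registered on each hook and their priorities.</p>
<h4 id="iptables">iptables</h4>
<p>iptables is a user-space program that talks to the kernel's netfilter framework and configures the callbacks on the hooks according to rules.</p>
<p>iptables groups its rules into tables by purpose, such as the filter table for filtering packets and the nat table for NAT rules.</p>
<h4 id="conntrack">conntrack</h4>
<p>conntrack is the foundation of NAT in netfilter. Once the kernel module <code>nf_conntrack</code> is loaded, connection tracking starts working at the <code>NF_IP_PRE_ROUTING</code> and <code>NF_IP_LOCAL_OUT</code> hooks. It tracks every packet (except those marked by a rule in the raw table) and creates a conntrack entry for the connection; for later packets, the kernel updates the matching entry if the packet belongs to a known connection.</p>
<p>All conntrack entries live in one table, the connection tracking table, which can be viewed with <code>cat /proc/net/nf_conntrack</code>. The connection states are:</p>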
<ul>
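<li>NEW: a packet that is not associated with any existing connection and is a valid connection-opening packet causes a new connection to be stored and marked NEW.</li>
<li>ESTABLISHED: when a packet in the opposite direction is seen for a connection in state NEW, the state changes to ESTABLISHED, meaning the connection was set up successfully. For TCP this means a SYN/ACK was received; for UDP and ICMP any packet in the reverse direction is enough.</li>
<li>RELATED: the packet does not belong to any existing connection but is related to an ESTABLISHED one. A new connection is created for it and marked RELATED. These are usually helper connections, such as the FTP data connection (FTP has two connections; the other is the control connection), or ICMP messages related to some connection.</li>
<li>INVALID: the packet is not associated with any existing connection and is not a valid connection-opening packet, so it is marked INVALID. These are usually junk, such as a TCP RST for which no TCP connection exists, or an ICMP packet sent here by mistake.</li>
<li>UNTRACKED: packets marked by a rule in the raw table as not needing tracking are labelled UNTRACKED.</li>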
</ul>
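<p>The NEW, ESTABLISHED, and INVALID transitions can be sketched as a small state table in Go. This is a deliberately simplified model: the <code>flow</code> and <code>tracker</code> types are invented for illustration, real conntrack tuples also include protocol and ports, and RELATED and UNTRACKED are left out.</p>
<pre><code class="language-go">package main

import "fmt"

// flow identifies one direction of a connection. Real conntrack tuples also
// carry protocol and ports; they are left out to keep the sketch short.
type flow struct{ src, dst string }

// tracker maps the original direction of each connection to its state.
type tracker map[flow]string

// observe classifies one packet. validOpen says whether the packet may
// legally start a connection (for TCP, a SYN).
func (t tracker) observe(f flow, validOpen bool) string {
	if state, seen := t[f]; seen {
		return state // another packet in the original direction
	}
	reverse := flow{src: f.dst, dst: f.src}
	if _, seen := t[reverse]; seen {
		t[reverse] = "ESTABLISHED" // the first reply promotes NEW
		return "ESTABLISHED"
	}
	if validOpen {
		t[f] = "NEW"
		return "NEW"
	}
	return "INVALID"
}

func main() {
	t := tracker{}
	out := flow{src: "192.168.2.11", dst: "110.242.68.4"}
	in := flow{src: "110.242.68.4", dst: "192.168.2.11"}
	fmt.Println(t.observe(out, true))  // NEW
	fmt.Println(t.observe(in, false))  // ESTABLISHED
	fmt.Println(t.observe(out, false)) // ESTABLISHED
	// A stray packet that matches nothing and cannot open a connection.
	fmt.Println(t.observe(flow{src: "198.51.100.7", dst: "192.168.2.11"}, false)) // INVALID
}
</code></pre>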
<h3 id="示例-4">Example</h3>
<p>Create a bridge and give it an IP:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip link add br0 type bridge
root@ubuntu:~$ ip link set dev br0 up
root@ubuntu:~$ ip addr add 192.168.2.1/24 dev br0
</code></pre>
<p>Create a network namespace and connect it to the bridge:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip netns add net0
root@ubuntu:~$ ip link add type veth
root@ubuntu:~$ ip link set veth0 netns net0
root@ubuntu:~$ ip netns exec net0 ip link set dev veth0 up
root@ubuntu:~$ ip link set veth1 up
root@ubuntu:~$ ip link set veth1 master br0
root@ubuntu:~$ ip netns exec net0 ip addr add 192.168.2.11/24 dev veth0
</code></pre>
<p>Change net0's routing table so the default gateway is br0:</p>
<pre><code class="language-shell">root@ubuntu:~$ ip netns exec net0 ip route add 0.0.0.0/0 via 192.168.2.1 dev veth0 onlink
</code></pre>
<p>Make sure IP forwarding is enabled:</p>
<pre><code class="language-shell">root@ubuntu:~$ sysctl -w net.ipv4.ip_forward=1
net.ipv4.ip_forward = 1
</code></pre>
<p>To rule out interference from the environment, deny forwarding by default first:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -P FORWARD DROP
</code></pre>
<p>Now configure the iptables rules. First the forwarding rule for the bridge, which allows traffic from br0 to be forwarded to eth0:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -A FORWARD -i br0 -o eth0 -j ACCEPT
</code></pre>
<p>Next, configure the SNAT rule:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -t nat -A POSTROUTING -s 192.168.2.0/24 -j SNAT --to # append the IP of eth0 after --to
# Alternatively, masquerade on eth0 directly
root@ubuntu:~$ # iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
</code></pre>
<p>netfilter implements NAT on top of conntrack, so packets in the <code>RELATED,ESTABLISHED</code> states must be let through:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -A FORWARD -o br0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
</code></pre>
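<p>With this rule, any packet that conntrack identifies as a reply is let through. The reply is then mapped back to the original addresses through the conntrack table, which has the same effect as DNAT.</p>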
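<p>How a recorded translation lets replies find their way back can be sketched in Go. This is a toy model under stated assumptions: the <code>snat</code> and <code>packet</code> types are invented, <code>203.0.113.10</code> is a placeholder public IP, and port rewriting is left out, so it is not how netfilter stores its entries.</p>
<pre><code class="language-go">package main

import "fmt"

// packet carries only the two addresses that NAT rewrites here.
type packet struct{ src, dst string }

// snat is a toy source-NAT box. Port rewriting is deliberately left out,
// so two inner hosts talking to the same remote host would collide.
type snat struct {
	publicIP string
	replies  map[packet]string // expected reply -> original inner source
}

// outbound rewrites the source address and records how to undo it.
func (n snat) outbound(p packet) packet {
	n.replies[packet{src: p.dst, dst: n.publicIP}] = p.src
	return packet{src: n.publicIP, dst: p.dst}
}

// inbound restores the inner address on a reply; packets that match no
// recorded translation are refused.
func (n snat) inbound(p packet) (packet, bool) {
	inner, tracked := n.replies[p]
	if !tracked {
		return packet{}, false
	}
	return packet{src: p.src, dst: inner}, true
}

func main() {
	nat := snat{publicIP: "203.0.113.10", replies: map[packet]string{}}
	out := nat.outbound(packet{src: "192.168.2.11", dst: "110.242.68.4"})
	fmt.Println(out) // {203.0.113.10 110.242.68.4}
	back, ok := nat.inbound(packet{src: "110.242.68.4", dst: "203.0.113.10"})
	fmt.Println(back, ok) // {110.242.68.4 192.168.2.11} true
}
</code></pre>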
<p>Test access to an external network with ping from inside the network namespace:</p>
<pre><code class="language-bash">root@ubuntu:~$ ip netns exec net0 ping -c1 110.242.68.4 # one of Baidu's IPs
PING 110.242.68.4 (110.242.68.4) 56(84) bytes of data.
64 bytes from 110.242.68.4: icmp_seq=1 ttl=34 time=56.7 ms

--- 110.242.68.4 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
</code></pre>
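<h3 id="端口转发">Port forwarding</h3>
<p>The example above reaches external networks from inside a network namespace, where conntrack can stand in for DNAT. To let external requests reach an internal service, a DNAT mapping rule is required. Such a mapping is one-to-one, one host IP to one private IP of a network namespace, so exposing several internal services to the public network calls for NAPT rules.</p>
<h4 id="napt">NAPT</h4>
<p>Network Address and Port Translation (NAPT) is NAT that also uses port numbers. With ports in the mapping, many private IPs can map to one public IP, each on a different port.</p>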
<table>
<thead>
<tr>
<th style="text-align:left">Private IP:port</th>
<th style="text-align:left">Public IP:port</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:left">192.168.2.11:80</td>
<td style="text-align:left">x.x.x.x:8080</td>
</tr>
<tr>
<td style="text-align:left">192.168.2.1:80</td>
<td style="text-align:left">x.x.x.x:8081</td>
</tr>
</tbody>
</table>
<h4 id="示例-5">Example</h4>
<p>Everything except the iptables rules stays the same. First start an HTTP server inside the network namespace:</p>
<pre><code class="language-shell"># Note: this exposes the files in the current directory
root@ubuntu:~$ ip netns exec net0 python -m SimpleHTTPServer 80
</code></pre>
<p>Add a DNAT rule that maps host port 8080 to port 80 in net0:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to 192.168.2.11:80
</code></pre>
<p>Add the forwarding rule:</p>
<pre><code class="language-shell">root@ubuntu:~$ iptables -A FORWARD -i eth0 -d 192.168.2.0/24 -o br0 -p tcp --dport 80 -j ACCEPT
</code></pre>
<p>The service can now be reached through the host's IP.</p>
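<p>With several services, each row of the NAPT table becomes one DNAT rule. The Go sketch below renders those rules from a mapping list; the <code>forward</code> type and <code>dnatRule</code> function are hypothetical helpers, and only the rule format comes from the command used in this example.</p>
<pre><code class="language-go">package main

import "fmt"

// forward maps one public port to one internal endpoint.
type forward struct {
	publicPort int
	internal   string // ip:port inside the namespace network
}

// dnatRule renders the PREROUTING rule for one mapping, in the same form
// as the DNAT command used in this example.
func dnatRule(f forward) string {
	return fmt.Sprintf("iptables -t nat -A PREROUTING -p tcp --dport %d -j DNAT --to %s",
		f.publicPort, f.internal)
}

func main() {
	// The two rows of the NAPT table.
	table := []forward{
		{publicPort: 8080, internal: "192.168.2.11:80"},
		{publicPort: 8081, internal: "192.168.2.1:80"},
	}
	for _, f := range table {
		fmt.Println(dnatRule(f))
	}
}
</code></pre>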
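<h2 id="写在最后">Closing notes</h2>
<p>The examples above are for learning and differ from real container configurations, but the underlying techniques are the same. These notes draw mainly on the Linux column by Segmentfault user <a href="https://segmentfault.com/u/public0821">public0821</a>, plus some related articles found online.</p>
<p>Next I will continue with cross-host container networking, this time based on a real project, Flannel.</p>
<h3 id="参考文章">References</h3>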
<ol>
<li><a href="https://segmentfault.com/a/1190000009251098">Linux虚拟网络设备之veth</a></li>
<li><a href="https://segmentfault.com/a/1190000009491002">Linux虚拟网络设备之bridge(桥)</a></li>
<li><a href="https://segmentfault.com/a/1190000009043962">netfilter/iptables简介</a></li>
<li><a href="http://xstarcd.github.io/wiki/Linux/iptables_forward_internetshare.html">通过iptables实现端口转发和内网共享上网</a></li>
</ol>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Go Runtime notes]]></title>
        <id>https://cnbailian.github.io/post/go-runtime-notes/</id>
        <link href="https://cnbailian.github.io/post/go-runtime-notes/">
        </link>
        <updated>2021-01-25T02:46:20.000Z</updated>
        <summary type="html"><![CDATA[<p>These are my notes from reading the source code of the Go runtime and scheduler.</p>
]]></summary>
        <content type="html"><![CDATA[<p>These are my notes from reading the source code of the Go runtime and scheduler.</p>
<!--more-->
<h2 id="启动过程">Startup</h2>
<p>Debug the program with gdb. On macOS, build with <code>-ldflags=-compressdwarf=false</code> and <a href="https://segmentfault.com/q/1010000004136334">create a self-signed certificate for gdb</a>.</p>
<h4 id="寻找入口">Finding the entry point</h4>
<p>Use <code>info files</code> to inspect the executable, then set a breakpoint on the entry point address to find the source file it lives in.</p>
<pre><code class="language-shell">(gdb) info files
Symbols from &quot;/Users/bailian/GoProject/go-build/go-build&quot;.
Local exec file:
	`/Users/bailian/GoProject/go-build/go-build', file type mach-o-x86-64.
	Entry point: 0x1063f40
	0x0000000001001000 - 0x00000000010a6f0a is .text
	......
(gdb) b *0x1063f40
Breakpoint 2 at 0x1063f40: file /Users/bailian/GoProject/go/src/runtime/rt0_darwin_amd64.s, line 8.
</code></pre>
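<p><em>Go uses Plan 9 assembly......</em></p>
<p>The assembly file shows the initialization sequence of the executable:</p>
<pre><code class="language-asm">// rt0 is short for runtime0, the genesis of the runtime; everything created afterwards carries the suffix 1.
// The operating system talks to the application through the entry-argument convention. To pass arguments from the system to the runtime, a Go program processes them during bootstrap.
// At program start, the first two values at the stack pointer SP are argc and argv: the number of arguments and the argument values.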
TEXT _rt0_amd64(SB),NOSPLIT,$-8
	MOVQ	0(SP), DI	// argc
	LEAQ	8(SP), SI	// argv
	JMP	runtime·rt0_go(SB)
TEXT runtime·rt0_go(SB),NOSPLIT,$0
	// copy arguments forward on an even stack
	MOVQ	DI, AX		// argc
	MOVQ	SI, BX		// argv
	SUBQ	$(4*8+7), SP		// 2args 2auto
	ANDQ	$~15, SP
	MOVQ	AX, 16(SP)
	MOVQ	BX, 24(SP)

	// initialize the g0 execution stack
	MOVQ	$runtime·g0(SB), DI
	LEAQ	(-64*1024+104)(SP), BX
	MOVQ	BX, g_stackguard0(DI)
	MOVQ	BX, g_stackguard1(DI)
	MOVQ	BX, (g_stack+stack_lo)(DI)
	MOVQ	SP, (g_stack+stack_hi)(DI)

	// determine CPU information
	MOVL	$0, AX
	CPUID
	MOVL	AX, SI
	CMPL	AX, $0
	JE	nocpuinfo
	......
needtls:
#ifdef GOOS_darwin
	// skip TLS setup on Darwin
	JMP ok
#endif
	// set up the TLS pseudo-register
	LEAQ	runtime·m0+m_tls(SB), DI // DI = m0.tls
	CALL	runtime·settls(SB) // set the TLS base to the address in DI
	// store through it to make sure it works
	get_tls(BX)
	MOVQ	$0x123, g(BX)
	MOVQ	runtime·m0+m_tls(SB), AX
	CMPQ	AX, $0x123
	JEQ 2(PC) // skip the abort call and continue at the get_tls instruction below
	CALL	runtime·abort(SB)
ok:
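	// the program has just started, so we are on the main thread
	// the current stack and resources are kept in g0
	// the thread itself is kept in m0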
	get_tls(BX)
	LEAQ	runtime·g0(SB), CX
	MOVQ	CX, g(BX)
	LEAQ	runtime·m0(SB), AX

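	// g0 and m0 are global variables that exist from the very start of the program. After the program arguments, the first step is to link m0 and g0 to each other through pointers.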
	// save m-&gt;g0 = g0
	MOVQ	CX, m_g0(AX)
	// save m0 to g0-&gt;m
	MOVQ	AX, g_m(CX)

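	// Before the runtime components are initialized, some checks and system-level initialization are needed: runtime type checks, reading system parameters, and initializing constants that affect memory management and scheduling.
	CLD				// convention is D is always left cleared
	CALL	runtime·check(SB) // Runtime type check; essentially a sanity check of the compiler's work. If the compiler got it wrong, the runtime cannot run correctly.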
	
	MOVL	16(SP), AX		// copy argc
	MOVL	AX, 0(SP)
	MOVQ	24(SP), AX		// copy argv
	MOVQ	AX, 8(SP)
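	// argc and argv from the operating system are passed to args, which handles program arguments.
	CALL	runtime·args(SB)
	// OS-level initialization
	CALL	runtime·osinit(SB)
	// initialize the runtime components, including the scheduler, memory allocator, and garbage collector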
	CALL	runtime·schedinit(SB)

	// create a new goroutine to start program
	// pass the entry function as the argument for the first G
	MOVQ	$runtime·mainPC(SB), AX		// entry
	PUSHQ	AX
	PUSHQ	$0			// argument size
	// create the goroutine, passing the arguments in
	CALL	runtime·newproc(SB)
	POPQ	AX
	POPQ	AX

	// start this M
	CALL	runtime·mstart(SB)

	CALL	runtime·abort(SB)	// mstart should never return
	......

// Global variable: runtime.mainPC holds the address of runtime.main; RODATA means read-only data
DATA	runtime·mainPC+0(SB)/8,$runtime·main(SB)
GLOBL	runtime·mainPC(SB),RODATA,$8
</code></pre>
<h4 id="初始化">Initialization</h4>
<p><strong>args</strong></p>
<img src="https://tva1.sinaimg.cn/large/008eGmZEly1gmeujrewwkj30u010yn4z.jpg" alt="img" style="zoom: 33%;" />
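<p><code>args</code> stores the argument pointers in the global variables <code>argc</code> and <code>argv</code> for other initialization functions to use, then calls the platform-specific <code>sysargs</code>. On Darwin, <code>sysargs</code> only fetches the program's <code>executable_path</code>, which is used to set the <code>executablePath</code> variable of the <code>os</code> package.</p>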
<pre><code class="language-go">func sysargs(argc int32, argv **byte) {
	// skip over argv, envv and the first string will be the path
	n := argc + 1
	for argv_index(argv, n) != nil {
		n++
	}
	executablePath = gostringnocopy(argv_index(argv, n+1))

	// strip &quot;executable_path=&quot; prefix if available, it's added after OS X 10.11.
	const prefix = &quot;executable_path=&quot;
	if len(executablePath) &gt; len(prefix) &amp;&amp; executablePath[:len(prefix)] == prefix {
		executablePath = executablePath[len(prefix):]
	}
}
</code></pre>
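<p>On Linux the process is more involved. Unlike Darwin, which uses <code>mach-o</code>, Linux uses the ELF format [Matz et al. 2014]. Besides argc, argv, and envp, ELF carries an auxiliary vector that passes kernel-level information to the user process, such as the <strong>physical memory page size</strong>. On Linux the physical page size can therefore be initialized right in <code>sysargs</code>.</p>
<h5 id="osinit">osinit</h5>
<p><code>osinit</code> obtains the number of CPU cores, which the scheduler depends on. On Darwin, because of the <code>mach-o</code> format, <code>sysargs</code> has not yet determined the memory page size, so this function additionally queries the physical page size with <code>sysctl</code>.</p>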
<pre><code class="language-go">var ncpu int32

// Linux
func osinit() {
	ncpu = getproccount()
}

// Darwin
func osinit() {
	ncpu = getncpu()
	physPageSize = getPageSize() // uses sysctl internally to get the physical page size
}
</code></pre>
<blockquote>
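<p><code>Darwin</code> descends from NeXTSTEP and FreeBSD 2.x. What is special about macOS system calls is that there are two sets of interfaces: Mach calls and POSIX calls. Mach is a legacy of NeXTSTEP, and the BSD layer is essentially a wrapper around the Mach kernel. Although user-space processes can make Mach calls directly, for generality the physical page size is obtained through the POSIX <code>sysctl</code> system call [Bacon, 2007].</p>
<p>How system calls take part in a Go program differs slightly between <code>Linux</code> and <code>Darwin</code>; that is left for a later, unified analysis.</p>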
</blockquote>
<p>In short, two system-level parameters matter most to the runtime: the number of CPU cores and the physical memory page size.</p>
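<p>Both values can be observed from ordinary Go code. The sketch below reads them through <code>runtime.NumCPU</code> and <code>os.Getpagesize</code>; to my understanding these return what the runtime recorded during startup, but treat that link to <code>ncpu</code> and <code>physPageSize</code> as an assumption rather than a guarantee. The helper name <code>cpuAndPageSize</code> is made up.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"os"
	"runtime"
)

// cpuAndPageSize reports the two values as user code can observe them.
func cpuAndPageSize() (cpus int, pageSize int) {
	return runtime.NumCPU(), os.Getpagesize()
}

func main() {
	cpus, pageSize := cpuAndPageSize()
	fmt.Printf("logical CPUs: %d, page size: %d bytes\n", cpus, pageSize)
}
</code></pre>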
<h5 id="schedinit">schedinit</h5>
<p>By name <code>schedinit</code> initializes the scheduler, but it actually initializes all the core components.</p>
<p>On execution stacks: [[Go 栈笔记]]</p>
<p>On the memory allocator: [[Go 内存分配器]]</p>
<pre><code class="language-go">func schedinit() {
  _g_ := getg()
	......
  // set the maximum number of OS threads (M)
	sched.maxmcount = 10000

  // initialize skipPC, used for traceback
	tracebackinit()
  // verify the module data produced by the linker
	moduledataverify()
  // initialize execution stacks: reset the stackpool and stackLarge doubly linked lists to nil
	stackinit()
  // initialize the memory allocator: set up the heap and allocate mcache
	mallocinit()
  // initialize the current OS thread M: take an id from schedt.mnext and set up m.gsignal (a G with a 32KB stack)
	mcommoninit(_g_.m)
  // CPU-related initialization
	cpuinit()       // must run before alginit
	alginit()       // maps must not be used before this call
  // module-loading initialization
	modulesinit()   // provides activeModules
	typelinksinit() // uses maps, activeModules
	itabsinit()     // uses activeModules

	msigsave(_g_.m)
	initSigmask = _g_.m.sigmask

  // process user arguments and environment variables
	goargs()
	goenvs()
  // process debug-related environment variables
	parsedebugvars()
  // initialize the garbage collector
	gcinit()
  
  // initialize the time of the last network poll
	sched.lastpoll = uint64(nanotime())
  // set the number of processors, honoring the GOMAXPROCS environment variable
	procs := ncpu
	if n, ok := atoi32(gogetenv(&quot;GOMAXPROCS&quot;)); ok &amp;&amp; n &gt; 0 {
		procs = n
	}
  // resize and initialize the Ps; this involves STW, and calling runtime.GOMAXPROCS() at run time ends up in the same function
	if procresize(procs) != nil {
		throw(&quot;unknown runnable goroutine during bootstrap&quot;)
	}
  ......
}
</code></pre>
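<p>The runtime components we care most about do much of their initialization in the following functions:</p>
<ul>
<li><code>stackinit()</code> initializes goroutine execution stacks</li>
<li><code>mallocinit()</code> initializes the memory allocator</li>
<li><code>mcommoninit()</code> does part of the initialization of an OS thread</li>
<li><code>gcinit()</code> initializes the garbage collector</li>
<li><code>procresize()</code> initializes the Ps, the local caches used by OS threads, based on the number of CPU cores</li>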
</ul>
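<p>From user code, the resize path is reached through <code>runtime.GOMAXPROCS</code>. The sketch below changes the number of Ps and restores it afterwards; <code>setProcs</code> and <code>currentProcs</code> are hypothetical helper names, and the value 2 is arbitrary.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"runtime"
)

// currentProcs queries the setting; an argument of 0 changes nothing.
func currentProcs() int {
	return runtime.GOMAXPROCS(0)
}

// setProcs changes the number of Ps and returns a function that restores
// the previous value.
func setProcs(n int) (restore func()) {
	previous := runtime.GOMAXPROCS(n)
	return func() { runtime.GOMAXPROCS(previous) }
}

func main() {
	fmt.Println("initial:", currentProcs())
	restore := setProcs(2)
	fmt.Println("resized:", currentProcs())
	restore()
	fmt.Println("restored:", currentProcs())
}
</code></pre>
<p>Because each change stops the world, this is something to set once at startup rather than adjust frequently.</p>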
<h5 id="main-goroutine">main goroutine</h5>
<p><code>runtime.main</code> was placed into a P as a G by <code>newproc</code>; it gets scheduled and run once <code>mstart</code> starts <code>schedule</code>.</p>
<pre><code class="language-go">// About go:linkname
//go:linkname localname [importpath.name] In short, this mechanism makes it possible to use identifiers that another package does not export.

//go:linkname runtime_inittask runtime..inittask
var runtime_inittask initTask

// This links to the main..inittask variable. Our own main package does not declare it; the compiler generates it.
// See cmd/compile/internal/gc.fninit for the implementation
//go:linkname main_inittask main..inittask
var main_inittask initTask

//go:linkname main_main main.main
func main_main()

func main() {
  ......
  // maximum stack size: 1GB on 64-bit systems, 250MB on 32-bit systems
	if sys.PtrSize == 8 {
		maxstacksize = 1000000000
	} else {
		maxstacksize = 250000000
	}

	// allow newly created Gs to start new Ms
	mainStarted = true

  // start the system monitor (periodic garbage collection, scheduling of concurrent tasks) except on wasm
	if GOARCH != &quot;wasm&quot; { // no threads on wasm yet, so no sysmon
		systemstack(func() {
			newm(sysmon, nil)
		})
	}

	// lock the main goroutine to the main OS thread; some programs need this
	lockOSThread()

	......

  // run the runtime's init tasks
	doInit(&amp;runtime_inittask) // must be before defer
  
  ......
  
  // enable the GC
  gcenable()
  
  ......

  // run the init functions of the main package and of imported packages
	doInit(&amp;main_inittask)
  
  ......
  
  // run main.main
	fn := main_main // make an indirect call, as the linker doesn't know the address of the main package when laying down the runtime
	fn()
  
  ......

  // exit as soon as main returns
	exit(0)
	for {
		var x *int32
		*x = 0
	}
}
</code></pre>
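<p>The order in which <code>runtime.main</code> runs the init tasks and then <code>main.main</code> is visible from user code. In this small sketch the <code>order</code> slice and <code>record</code> helper are invented for the demonstration; the ordering itself (package-level variables, then <code>init</code>, then <code>main</code>) is defined by the Go language specification.</p>
<pre><code class="language-go">package main

import "fmt"

// order collects the steps in the sequence they actually run.
var order []string

// record appends a step name and returns a dummy value so that it can be
// used in a package-level initializer.
func record(step string) int {
	order = append(order, step)
	return len(order)
}

// Package-level initializers run first, as part of the package's init task.
var _ = record("package-level variable")

// init functions run next, still before main.main.
func init() { record("init function") }

func main() {
	record("main.main")
	fmt.Println(order)
}
</code></pre>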
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/008eGmZEly1gmeznu7slnj32y70u0jwt.jpg" alt="img" loading="lazy"></figure>
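<h2 id="调度器">Scheduler</h2>
<h3 id="基本结构">Basic structure</h3>
<p>M: Machine, an abstraction of an OS thread.</p>
<p>P: Processor. It mainly provides a local queue of Gs, which reduces contention on the global lock and improves performance.</p>
<p>G: Goroutine, the unit of execution created with the <code>go</code> keyword. It is an abstraction of a function to run: the arguments are copied and the entry address of the function is saved for execution.</p>
<p><strong>The scheduler, sched</strong></p>
<ul>
<li>manages the list of idle Ms that can be bound to a G</li>
<li>manages the list of idle Ps (a linked list)</li>
<li>manages the global queue of runnable Gs</li>
<li>manages the queue of dead Gs that are about to become runnable again</li>
<li>manages the queue of blocked Gs</li>
<li>manages the defer pool</li>
<li>manages the signals for GC and the system monitor</li>
<li>manages the function to run at a safe point</li>
<li>records the time spent on the (rare) dynamic resizing of Ps</li>
</ul>
<h3 id="初始化-schedinit">Initialization: schedinit</h3>
<p>The scheduler is initialized in the order M (mcommoninit) --&gt; P (procresize) --&gt; G (newproc). These steps initialize the M pool (allm), the P pool (allp), and the G's execution context (g.sched) together with the run queue (p.runq), respectively.</p>
<h5 id="m-的初始化">Initializing M</h5>
<p>An M has only two states: spinning and not spinning. During scheduler initialization there is only one M, the main OS thread, so no state handling is involved; only the basic initialization of the M and its signal handling take place.</p>
<h5 id="p初始化">Initializing P</h5>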
<figure data-type="image" tabindex="2"><img src="https://tva1.sinaimg.cn/large/006tNbRwly1g9tnier7d9j30mx0fkdhu.jpg" alt="p-status.png" loading="lazy"></figure>
<p>Normally (when the number of Ps is not changed while the program runs), a P switches among only four states. When the program starts initializing, every P is in <code>_Pgcstop</code>; as <code>runtime.procresize</code> initializes them, they are set to <code>_Pidle</code>. When <code>runtime.procresize</code> is called outside of initialization, the current P is set to <code>_Prunning</code>.</p>
<p>When an M needs to run, it binds a P with <code>runtime.acquirep</code> and the state becomes <code>_Prunning</code>. <code>runtime.releasep</code> releases the P and the state becomes <code>_Pidle</code>.</p>
<p>On <code>runtime.entersyscall</code> the P's state becomes <code>_Psyscall</code>; after <code>runtime.exitsyscall</code> it becomes <code>_Pidle</code>.</p>
<p>During GC, <code>stopTheWorld</code> sets the state to <code>_Pgcstop</code>; after <code>startTheWorld</code>, <code>procresize</code> sets it to <code>_Prunning</code> (the current P) or <code>_Pidle</code> (the other Ps).</p>
<p>Calling <code>runtime.GOMAXPROCS()</code> at run time changes <code>gomaxprocs</code>. In <code>procresize</code>, if <code>nprocs</code> is greater than <code>old</code>, new Ps are created in state <code>_Pidle</code>. When shrinking (fewer Ps than before), the surplus Ps are set to <code>_Pdead</code>, an intermediate state; they are reused the next time <code>gomaxprocs</code> grows.</p>
<p>The main flow of P initialization is in <code>procresize</code>:</p>
<pre><code class="language-go">// The world must be stopped and sched locked before calling.
func procresize(nprocs int32) *p {
  // current number of Ps
	old := gomaxprocs
	......

	// update statistics and record the time of this change
	now := nanotime()
	if sched.procresizetime != 0 {
		sched.totaltime += int64(old) * (now - sched.procresizetime)
	}
	sched.procresizetime = now

	// entered at startup (allp is still empty) and when runtime.GOMAXPROCS is called with a value larger than the current number of Ps
	if nprocs &gt; int32(len(allp)) {
		// Synchronize with retake, which could be running
		// concurrently since it doesn't run on a P.
		lock(&amp;allpLock)
    // Ps are never freed; they stay in allp's backing array, and cap is the largest number of Ps so far
		if nprocs &lt;= int32(cap(allp)) {
      // nprocs still fits in the backing array, so reuse the existing Ps
			allp = allp[:nprocs]
		} else {
      // beyond the previous maximum: allocate a larger slice, whose cap becomes the new maximum number of Ps
			nallp := make([]*p, nprocs)
			// copy the existing Ps over for reuse
			copy(nallp, allp[:cap(allp)])
			allp = nallp
		}
		unlock(&amp;allpLock)
	}

	// initialize the new Ps; entered both when growing and at program start
	for i := old; i &lt; nprocs; i++ {
		pp := allp[i]
    // pp is non-nil when a _Pdead P is being reused, so no allocation is needed
		if pp == nil {
			pp = new(p)
		}
    // initialize pp, tying P.id to its index in allp; its state is _Pgcstop
		pp.init(i)
		atomicstorep(unsafe.Pointer(&amp;allp[i]), unsafe.Pointer(pp))
	}

	_g_ := getg()
	if _g_.m.p != 0 &amp;&amp; _g_.m.p.ptr().id &lt; nprocs {
  	// the current P survives the resize, so set its state to _Prunning
		_g_.m.p.ptr().status = _Prunning
		_g_.m.p.ptr().mcache.prepareForSweep()
	} else {
		// the current P is being removed: unbind it from the current M and bind allp[0] instead
		if _g_.m.p != 0 {
			if trace.enabled {
				traceGoSched()
				traceProcStop(_g_.m.p.ptr())
			}
			_g_.m.p.ptr().m = 0
		}
		_g_.m.p = 0
		_g_.m.mcache = nil
		p := allp[0]
		p.m = 0
		p.status = _Pidle
		acquirep(p)
		if trace.enabled {
			traceGoStart()
		}
	}

	// release the resources of the surplus Ps but keep the Ps themselves, in state _Pdead, for reuse
	for i := nprocs; i &lt; old; i++ {
		p := allp[i]
		p.destroy()
		// can't free P itself because it can be referenced by an M in syscall
	}

	// trim allp, keeping its cap and backing array
	if int32(len(allp)) != nprocs {
		lock(&amp;allpLock)
		allp = allp[:nprocs]
		unlock(&amp;allpLock)
	}
  
	var runnablePs *p
	for i := nprocs - 1; i &gt;= 0; i-- {
		p := allp[i]
    // the current P has already been handled
		if _g_.m.p.ptr() == p {
			continue
		}
		p.status = _Pidle
		if runqempty(p) {
      // put Ps with no local work on the idle list
			pidleput(p)
		} else {
      // Ps with local work get an M bound to them
			p.m.set(mget())
			p.link.set(runnablePs)
      // and are added to the list being returned
			runnablePs = p
		}
	}
	stealOrder.reset(uint32(nprocs))
  // set gomaxprocs to nprocs
	var int32p *int32 = &amp;gomaxprocs // make compiler check that gomaxprocs is an int32
	atomic.Store((*uint32)(unsafe.Pointer(int32p)), uint32(nprocs))
	// Return the list of Ps that have local work
	return runnablePs
}
</code></pre>
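<p>The way <code>procresize</code> keeps dead Ps in the backing array of <code>allp</code> can be tried out with an ordinary slice. The following is a minimal sketch, not runtime code: the <code>proc</code> type and the <code>resize</code> function are hypothetical stand-ins, and locking, <code>init</code> and <code>destroy</code> are reduced to a single flag.</p>

```go
package main

import "fmt"

// proc is a hypothetical stand-in for the runtime's p struct.
type proc struct {
	id   int
	dead bool
}

// resize grows or shrinks all to n entries the way procresize treats allp:
// shrinking only reslices, so removed entries stay in the backing array,
// and growing within cap(all) brings them back instead of allocating.
func resize(all []*proc, n int) []*proc {
	old := len(all)
	if n > old {
		if n > cap(all) {
			grown := make([]*proc, n)
			copy(grown, all[:cap(all)]) // carry over every old entry, dead ones included
			all = grown
		} else {
			all = all[:n]
		}
	}
	for i := old; i < n; i++ {
		if all[i] == nil {
			all[i] = &proc{id: i}
		}
		all[i].dead = false
	}
	for i := n; i < old; i++ {
		all[i].dead = true // "destroyed", but the struct itself is kept
	}
	return all[:n]
}

func main() {
	all := resize(nil, 4)
	last := all[3]
	all = resize(all, 2)
	fmt.Println(len(all), cap(all), last.dead)
	all = resize(all, 4)
	fmt.Println(all[3] == last, last.dead)
}
```

<p>After shrinking to 2 and growing back to 4, <code>all[3]</code> is the same pointer as before, which is the reuse that the <code>pp == nil</code> check in <code>procresize</code> relies on.</p>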
<h5 id="g-初始化">G initialization</h5>
<p>After <code>runtime.procresize</code> has run, <code>runtime.newproc</code> initializes the <code>main goroutine</code> and hands it to the scheduler to run.</p>
<figure data-type="image" tabindex="3"><img src="https://tva1.sinaimg.cn/large/006tNbRwly1g9srqlmo9qj30u90j90uz.jpg" alt="g-status.png" loading="lazy"></figure>
<pre><code class="language-go">// CALL	runtime·newproc(SB)
// The assembly above passes the main goroutine's entry function to newproc as fn
func newproc(siz int32, fn *funcval) {
	// Address of the arguments, which sit right after fn
	argp := add(unsafe.Pointer(&amp;fn), sys.PtrSize)
	gp := getg()
	pc := getcallerpc()
	systemstack(func() {
		newproc1(fn, (*uint8)(argp), siz, gp, pc)
	})
}
</code></pre>
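<pre><code class="language-go">// Create a G that runs fn with narg bytes of arguments starting at argp.
// callerpc is the address of the go statement that created the G. The new G is put on a run queue to wait for execution.
func newproc1(fn *funcval, argp *uint8, narg int32, callergp *g, callerpc uintptr) {
	// The current G; during initialization this is g0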
	_g_ := getg()

	......
  
	// Prevent the current M from being preempted
	acquirem() // disable preemption because it can be holding p in a local var
	siz := narg
	siz = (siz + 7) &amp;^ 7
	// The arguments must fit in a G's initial stack, which is 2KB
	if siz &gt;= _StackMin-4*sys.RegSize-sys.RegSize {
		throw(&quot;newproc: function arguments too large for new goroutine&quot;)
	}

	......

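	// The current P
	_p_ := _g_.m.p.ptr()
	// Try to get a reusable G (a G in _Gdead state): first from the current P's gFree list, then from the global gFree list.
	newg := gfget(_p_)
	// None exists during initialization, and the lists may be exhausted at run time
	if newg == nil {
		// Create a G with the minimum stack; in this version _StackMin = 2048 (2KB)
		newg = malg(_StackMin)
		// Move the new G from _Gidle to _Gdead
		casgstatus(newg, _Gidle, _Gdead)
		// allg lists every G in the runtime. Adding the G while it is _Gdead keeps the GC from scanning its uninitialized stack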
		allgadd(newg)
	}
	
  ......

	// Compute the size of the initial frame, with alignment
	totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize
	totalSize += -totalSize &amp; (sys.SpAlign - 1)
	// Determine sp and where the arguments go on the stack
	sp := newg.stack.hi - totalSize
	spArg := sp
  
  ......
  
	// If the G has arguments, copy them onto its own stack
	if narg &gt; 0 {
		// Copy narg bytes from argp to spArg
		memmove(unsafe.Pointer(spArg), unsafe.Pointer(argp), uintptr(narg))
		// This is a stack-to-stack copy, which may need write barriers (a GC topic to revisit later)
		if writeBarrier.needed &amp;&amp; !_g_.m.curg.gcscandone {
			f := findfunc(fn.fn)
			stkmap := (*stackmap)(funcdata(f, _FUNCDATA_ArgsPointerMaps))
			if stkmap.nbit &gt; 0 {
				// We're in the prologue, so it's always stack map index 0.
				bv := stackmapdata(stkmap, 0)
				bulkBarrierBitmap(spArg, spArg, uintptr(bv.n)*sys.PtrSize, 0, bv.bytedata)
			}
		}
	}

	// Clear and initialize the G's saved context, because the G may be a reused one
	// g.sched is a gobuf that stores the execution context
	memclrNoHeapPointers(unsafe.Pointer(&amp;newg.sched), unsafe.Sizeof(newg.sched))
	newg.sched.sp = sp
	newg.stktopsp = sp
	newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
	newg.sched.g = guintptr(unsafe.Pointer(newg))
	// This looks as if fn were run here, but it is not: fn runs only once the scheduler picks the G (explained below)
	gostartcallfn(&amp;newg.sched, fn)
	// Initialize the G's basic fields
	newg.gopc = callerpc
	newg.ancestors = saveAncestors(callergp)
	newg.startpc = fn.fn
  
	.....
  
	// gcscanvalid is false for a new G; it is true only when the G has not run since its stack was last scanned by the GC
	newg.gcscanvalid = false
	// Move the G from _Gdead to _Grunnable
	casgstatus(newg, _Gdead, _Grunnable)

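	// Each P keeps a cache of goroutine ids and fetches _GoidCacheBatch (16 in this version) ids at a time, presumably as a performance optimization. When the cache is used up, fetch another batch.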
	if _p_.goidcache == _p_.goidcacheend {
		// Sched.goidgen is the last allocated id,
		// this batch must be [sched.goidgen+1, sched.goidgen+GoidCacheBatch].
		// At startup sched.goidgen=0, so main goroutine receives goid=1.
		_p_.goidcache = atomic.Xadd64(&amp;sched.goidgen, _GoidCacheBatch)
		_p_.goidcache -= _GoidCacheBatch - 1
		_p_.goidcacheend = _p_.goidcache + _GoidCacheBatch
	}
	// Assign the id and advance the cache
	newg.goid = int64(_p_.goidcache)
	_p_.goidcache++
  
  ......
  
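	// Put the new G on the current P: the local queue first, the global queue if the local one is full.
	// true means put it in the runnext slot so that it runs next; false means append it to the tail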
	runqput(_p_, newg, true)

	// If there is an idle P and no spinning M, wake up a P
	// mainStarted is false during initialization, so this is skipped
	// Open question: when is there an idle P but no spinning M?
	if atomic.Load(&amp;sched.npidle) != 0 &amp;&amp; atomic.Load(&amp;sched.nmspinning) == 0 &amp;&amp; mainStarted {
		wakep()
	}
	releasem(_g_.m)
}
</code></pre>
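<p>The goroutine id cache near the end of <code>newproc1</code> is a general pattern: one atomic add on a shared counter reserves a whole batch of ids, and the owner then hands them out without any synchronization. A minimal sketch follows; <code>idSource</code> and <code>idCache</code> are hypothetical names standing in for <code>sched.goidgen</code> and the per-P <code>goidcache</code>/<code>goidcacheend</code> pair.</p>

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// batch mirrors _GoidCacheBatch (16 in the Go version quoted above).
const batch = 16

// idSource stands in for sched.goidgen: the last id reserved by any cache.
type idSource struct{ last uint64 }

// idCache stands in for the per-P pair goidcache/goidcacheend.
type idCache struct {
	src       *idSource
	next, end uint64
}

// nextID returns a fresh id, reserving a whole batch when the cache is empty.
func (c *idCache) nextID() uint64 {
	if c.next == c.end {
		// One atomic add reserves the range [last+1, last+batch] for this cache.
		c.next = atomic.AddUint64(&c.src.last, batch) - batch + 1
		c.end = c.next + batch
	}
	id := c.next
	c.next++
	return id
}

func main() {
	src := &idSource{}
	p0 := &idCache{src: src}
	p1 := &idCache{src: src}
	fmt.Println(p0.nextID(), p1.nextID(), p0.nextID()) // prints "1 17 2"
}
```

<p>The first cache receives ids 1 to 16 and the second 17 to 32, so ids are unique but not globally ordered by creation time.</p>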
<p>About <code>gostartcallfn</code>:</p>
<pre><code class="language-go">// Resolve the entry address of the fv that was passed in
func gostartcallfn(gobuf *gobuf, fv *funcval) {
	var fn unsafe.Pointer
	if fv != nil {
		fn = unsafe.Pointer(fv.fn)
	} else {
		fn = unsafe.Pointer(funcPC(nilfunc))
	}
	gostartcall(gobuf, fn, unsafe.Pointer(fv))
}
// Save fn and fv into the g.sched gobuf
func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) {
	sp := buf.sp
	if sys.RegSize &gt; sys.PtrSize {
		sp -= sys.PtrSize
		*(*uintptr)(unsafe.Pointer(sp)) = 0
	}
	sp -= sys.PtrSize
	*(*uintptr)(unsafe.Pointer(sp)) = buf.pc
	buf.sp = sp
	buf.pc = uintptr(fn)
	buf.ctxt = ctxt
}
</code></pre>
<p>About <code>runqput</code>:</p>
<pre><code class="language-go">func runqput(_p_ *p, gp *g, next bool) {
	......

	// Put gp in the runnext slot
	if next {
	retryNext:
		oldnext := _p_.runnext
		// Atomically replace _p_.runnext with gp
		if !_p_.runnext.cas(oldnext, guintptr(unsafe.Pointer(gp))) {
			goto retryNext
		}
		// If the slot was empty, we are done
		if oldnext == 0 {
			return
		}
		// Otherwise the G that was kicked out of runnext is the one to enqueue below
		gp = oldnext.ptr()
	}

retry:
	h := atomic.LoadAcq(&amp;_p_.runqhead) // load-acquire, synchronize with consumers
	t := _p_.runqtail
	// Enqueue if the local queue is not full
	if t-h &lt; uint32(len(_p_.runq)) {
		_p_.runq[t%uint32(len(_p_.runq))].set(gp)
		atomic.StoreRel(&amp;_p_.runqtail, t+1) // store-release, makes the item available for consumption
		return
	}
	// If it is full, move gp to the global queue together with half of the local queue
	if runqputslow(_p_, gp, h, t) {
		return
	}
	// the queue is not full, now the put above must succeed
	goto retry
}
</code></pre>
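<p>The behaviour of <code>runqput</code> described above (the runnext slot, the ring buffer, and the overflow to the global queue) can be modelled in a few lines. This is a single-threaded sketch under stated assumptions: there are no atomics, Gs are plain integers with 0 meaning "no G", and the queue holds 4 entries instead of the runtime's 256. The type and method names are made up for the example.</p>

```go
package main

import "fmt"

// localCap is the size of the local ring buffer (256 in the runtime, tiny here).
const localCap = 4

// runQueue is a hypothetical model of a P's run queue plus the global queue.
type runQueue struct {
	runnext int           // the G to run next; 0 means empty
	ring    [localCap]int // local run queue, used as a ring buffer
	head    uint32
	tail    uint32
	global  []int // stands in for the global run queue
}

// put mirrors the control flow of runqput, minus the atomic operations.
func (q *runQueue) put(g int, next bool) {
	if next {
		// The new G takes the runnext slot; whatever was there is enqueued normally.
		g, q.runnext = q.runnext, g
		if g == 0 {
			return
		}
	}
	if q.tail-q.head < localCap {
		q.ring[q.tail%localCap] = g
		q.tail++
		return
	}
	// Local queue is full: move half of it, followed by g, to the global queue.
	n := (q.tail - q.head) / 2
	for i := uint32(0); i < n; i++ {
		q.global = append(q.global, q.ring[(q.head+i)%localCap])
	}
	q.head += n
	q.global = append(q.global, g)
}

func main() {
	q := &runQueue{}
	for g := 1; g <= 5; g++ {
		q.put(g, false)
	}
	q.put(6, true)
	q.put(7, true)
	fmt.Println("runnext:", q.runnext, "local length:", q.tail-q.head, "global:", q.global)
}
```

<p>Putting Gs 1 to 5 at the tail overflows on the fifth put, so Gs 1, 2 and 5 end up in the global queue. G 7 then displaces G 6 from the runnext slot, and G 6 joins the tail of the local queue.</p>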
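<h3 id="调度循环">The scheduling loop</h3>
<h4 id="启动前">Before starting</h4>
<p>Before the scheduler starts, the stack bounds of the G, that is its high and low stack pointers, have to be determined.</p>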
<pre><code class="language-go">func mstart() {
	// During initialization this is g0, i.e. the system stack. Every M has one; it mainly runs the runtime's own logic, and its size is fixed and worked out at design time.
	_g_ := getg()

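	// Check whether the g0 stack is already initialized; this differs by platform.
	// m0's g0 was initialized in assembly, so it skips this. For Ms created later on an OS-allocated stack, the bounds are determined here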
	osStack := _g_.stack.lo == 0
	if osStack {
		size := _g_.stack.hi
		if size == 0 {
			size = 8192 * sys.StackGuardMultiplier
		}
		_g_.stack.hi = uintptr(noescape(unsafe.Pointer(&amp;size)))
		// Open question: why is 1KB held back?
		_g_.stack.lo = _g_.stack.hi - size + 1024
	}
	// Initialize the stack guards, which are used for stack overflow checks,
	// so that both Go and C functions can be called
	_g_.stackguard0 = _g_.stack.lo + _StackGuard
	_g_.stackguard1 = _g_.stackguard0
	// Start the M
	mstart1()

	// This is presumably where an OS-allocated stack for m0.g0 is handled
	if GOOS == &quot;windows&quot; || GOOS == &quot;solaris&quot; || GOOS == &quot;illumos&quot; || GOOS == &quot;plan9&quot; || GOOS == &quot;darwin&quot; || GOOS == &quot;aix&quot; {
		// windows, solaris, darwin, aix and plan9 always use system-allocated stacks, but the stack was put in _g_.stack before mstart,
		// so the logic above has not set osStack yet.
		osStack = true
	}
	mexit(osStack)
}
</code></pre>
<p>Starting the scheduler:</p>
<pre><code class="language-go">func mstart1() {
	_g_ := getg()
  ......
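	// Record the caller for use as the top of stack in mcall and for terminating the thread.
	// Once schedule is entered we never return to mstart1, so other calls can reuse the current frame.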
	save(getcallerpc(), getcallersp())
	asminit()
	minit()
  
	// Install signal handlers; this comes after minit so that minit can prepare the thread to handle signals
	if _g_.m == &amp;m0 {
		mstartm0()
	}

	// The M's start function; m0 has none
	if fn := _g_.m.mstartfn; fn != nil {
		fn()
	}

	// Any M other than m0 has to bind a P here
	if _g_.m != &amp;m0 {
		acquirep(_g_.m.nextp.ptr())
		_g_.m.nextp = 0
	}
	// The M enters the scheduling loop and never returns
	schedule()
}
</code></pre>
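<p><strong>Binding M and P</strong></p>
<p>This is simple: m.p is set to point at the P and p.m is set to point at the M. The P must be in state <code>_Pidle</code> before binding and becomes <code>_Prunning</code> afterwards.</p>
<p><strong>schedule never returns</strong></p>
<p>The scheduling loop <code>schedule</code> never returns, so the final <code>mexit</code> is not reached in normal operation, and the threads created by a Go program are currently never released. (There is one exception: when a G locked with <code>runtime.LockOSThread</code> exits, <code>gogo</code> is used to exit the M.)</p>
<h4 id="调度逻辑">Scheduling logic</h4>
<p>Scheduling proper begins with <code>schedule</code>. Its core logic is:</p>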
<pre><code class="language-go">func schedule() {
  // g0
	_g_ := getg()
  ......

	// m.lockedg becomes non-zero after runtime.LockOSThread
	if _g_.m.lockedg != 0 {
		stoplockedm()
		execute(_g_.m.lockedg.ptr(), false) // Never returns.
	}
  ......

top:
	if sched.gcwaiting != 0 {
		gcstopm()
		goto top
	}
	if _g_.m.p.ptr().runSafePointFn != 0 {
		runSafePointFn()
	}

	var gp *g
	var inheritTime bool

	......
  
	// If the GC is marking, look for a GC worker G to run
	if gp == nil &amp;&amp; gcBlackenEnabled != 0 {
		gp = gcController.findRunnableGCWorker(_g_.m.p.ptr())
		tryWakeP = tryWakeP || gp != nil
	}
	// Every 61st schedtick, take from the global queue first so that it is not starved
	if gp == nil {
		if _g_.m.p.ptr().schedtick%61 == 0 &amp;&amp; sched.runqsize &gt; 0 {
			lock(&amp;sched.lock)
			gp = globrunqget(_g_.m.p.ptr(), 1)
			unlock(&amp;sched.lock)
		}
	}
	// Try the local run queue; finding local work while the M is spinning is a bug
	if gp == nil {
		gp, inheritTime = runqget(_g_.m.p.ptr())
		if gp != nil &amp;&amp; _g_.m.spinning {
			throw(&quot;schedule: spinning with local work&quot;)
		}
	}
	// Otherwise search elsewhere: findrunnable may put the M into the spinning state and blocks until a G is available
	if gp == nil {
		gp, inheritTime = findrunnable()
	}

	// A G has definitely been found at this point
	if _g_.m.spinning {
		// Mark the M as non-spinning. If no spinning M is left afterwards and the idle-P list is not empty, a new M has to be started
		resetspinning()
	}

	......

	execute(gp, inheritTime)
}
</code></pre>
<p><strong>Running a G</strong></p>
<pre><code class="language-go">func execute(gp *g, inheritTime bool) {
	_g_ := getg()

	// Switch the G to _Grunning
	casgstatus(gp, _Grunnable, _Grunning)
	gp.waitsince = 0
	// Preemption flag; preemption is covered later
	gp.preempt = false
	gp.stackguard0 = gp.stack.lo + _StackGuard
	if !inheritTime {
		_g_.m.p.ptr().schedtick++
	}
	// Bind the G to the current M
	_g_.m.curg = gp
	gp.m = _g_.m

	......

	// Start executing the G's function
	gogo(&amp;gp.sched)
}
</code></pre>
<p>The implementation of <code>gogo</code>:</p>
<pre><code class="language-asm">TEXT runtime·gogo(SB), NOSPLIT, $16-8
	MOVQ	buf+0(FP), BX		// the saved context (gobuf)
	MOVQ	gobuf_g(BX), DX
	MOVQ	0(DX), CX		// make sure g != nil
	get_tls(CX)
	MOVQ	DX, g(CX)
	MOVQ	gobuf_sp(BX), SP	// restore SP
	MOVQ	gobuf_ret(BX), AX
	MOVQ	gobuf_ctxt(BX), DX
	MOVQ	gobuf_bp(BX), BP
	MOVQ	$0, gobuf_sp(BX)	// clear to help the garbage collector
	MOVQ	$0, gobuf_ret(BX)
	MOVQ	$0, gobuf_ctxt(BX)
	MOVQ	$0, gobuf_bp(BX)
	MOVQ	gobuf_pc(BX), BX // entry address of the function the G will run
	JMP	BX // start executing
</code></pre>
<p>It looks as if execution simply ends after <code>JMP BX</code>, with nothing to follow. In fact, the <code>PC</code> was cleverly prepared earlier.</p>
<pre><code class="language-go">func newproc1(fn *funcval, argp *uint8, narg int32, callergp *g, callerpc uintptr) {
  ......
	siz := narg
	siz = (siz + 7) &amp;^ 7
  ......
  	totalSize := 4*sys.RegSize + uintptr(siz) + sys.MinFrameSize // extra space in case of reads slightly beyond frame
	totalSize += -totalSize &amp; (sys.SpAlign - 1)                  // align to spAlign
	sp := newg.stack.hi - totalSize
	spArg := sp
  ......
  memclrNoHeapPointers(unsafe.Pointer(&amp;newg.sched), unsafe.Sizeof(newg.sched))
	newg.sched.sp = sp
	newg.stktopsp = sp
	// Store goexit as the PC in the gobuf
	newg.sched.pc = funcPC(goexit) + sys.PCQuantum // +PCQuantum so that previous instruction is in same function
	newg.sched.g = guintptr(unsafe.Pointer(newg))
	// The gobuf is adjusted here
	gostartcallfn(&amp;newg.sched, fn)
  ......
}
</code></pre>
<p>Now look at what <code>gostartcallfn</code> does:</p>
<pre><code class="language-go">func gostartcallfn(gobuf *gobuf, fv *funcval) {
	var fn unsafe.Pointer
	if fv != nil {
		fn = unsafe.Pointer(fv.fn)
	} else {
		fn = unsafe.Pointer(funcPC(nilfunc))
	}
	gostartcall(gobuf, fn, unsafe.Pointer(fv))
}
// x86
func gostartcall(buf *gobuf, fn, ctxt unsafe.Pointer) {
	// The original sp
	sp := buf.sp
	if sys.RegSize &gt; sys.PtrSize {
		sp -= sys.PtrSize
		*(*uintptr)(unsafe.Pointer(sp)) = 0
	}
	// Move sp down by one slot and store the old pc (goexit) there as the return address
	sp -= sys.PtrSize
	*(*uintptr)(unsafe.Pointer(sp)) = buf.pc
	buf.sp = sp
	// Set pc to fn, the goroutine's actual entry function
	buf.pc = uintptr(fn)
	buf.ctxt = ctxt
}
</code></pre>
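<p><code>gostartcall</code> is implemented differently on each architecture; this is the x86 version. The intended effect is the same everywhere: a <code>CALL</code> instruction is taken apart by hand. <code>goexit</code> is first pushed onto the stack manually as the return address, then execution jumps to fn with <code>JMP</code>. When fn finishes and executes <code>RET</code>, <code>goexit</code> is popped off the stack into the <code>PC</code> register.</p>
<p>This is also why <code>gogo</code> above uses <code>JMP</code> rather than <code>CALL</code>. With <code>CALL</code>, the CPU would push the <code>PC</code> (the address of the next instruction) onto the stack and then jump. With a plain <code>JMP</code>, the eventual <code>RET</code> restores <code>goexit</code> into the <code>PC</code>, which is how <code>goexit</code> gets to run.</p>
<p><code>goexit</code> is what runs next:</p>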
<pre><code class="language-asm">// The top-most function running on a goroutine
// returns to goexit+PCQuantum.
TEXT runtime·goexit(SB),NOSPLIT,$0-0
	BYTE	$0x90	// NOP
	CALL	runtime·goexit1(SB)	// does not return
	// traceback from goexit1 must hit code range of goexit
	BYTE	$0x90	// NOP
</code></pre>
<p>Then comes <code>goexit1</code>:</p>
<pre><code class="language-go">func goexit1() {
	......
	// Call goexit0 through mcall
	mcall(goexit0)
}
</code></pre>
<p><code>mcall</code> switches the execution stack to <code>m.g0</code> and makes the call on the system stack. Next, <code>goexit0</code>:</p>
<pre><code class="language-go">func goexit0(gp *g) {
	// We are on g0 now
	_g_ := getg()

	// Move the G to _Gdead
	casgstatus(gp, _Grunning, _Gdead)
	if isSystemGoroutine(gp, false) {
		atomic.Xadd(&amp;sched.ngsys, -1)
	}
  
	// Clean up
	gp.m = nil
	locked := gp.lockedm != 0
	gp.lockedm = 0
	_g_.m.lockedg = 0
	gp.paniconfault = false
	gp._defer = nil // should already be nil, but just in case
	gp._panic = nil // non-nil for Goexit during panic. points at stack-allocated data.
	gp.writebuf = nil
	gp.waitreason = 0
	gp.param = nil
	gp.labels = nil
	gp.timer = nil

	if gcBlackenEnabled != 0 &amp;&amp; gp.gcAssistBytes &gt; 0 {
		// Flush assist credit to the global pool. This gives
		// better information to pacing if the application is
		// rapidly creating an exiting goroutines.
		scanCredit := int64(gcController.assistWorkPerByte * float64(gp.gcAssistBytes))
		atomic.Xaddint64(&amp;gcController.bgScanCredit, scanCredit)
		gp.gcAssistBytes = 0
	}

	// gp's stack scan now counts as valid, because it no longer has a stack
	gp.gcscanvalid = true
	dropg()

	if GOARCH == &quot;wasm&quot; { // wasm has no threads yet
		// Put the G on the gFree list for reuse
		gfput(_g_.m.p.ptr(), gp)
		schedule() // schedule again
	}

  ......
  
	// Put the G on the gFree list for reuse
	gfput(_g_.m.p.ptr(), gp)
	if locked {
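		// The G may have been locked to this thread. In that case kill the thread instead of returning the M to the pool.
		// This returns to mstart, which releases the P and exits the thread
		if GOOS != &quot;plan9&quot; { // See golang.org/issue/22227.
			// Jump back to the M's saved context (g0.sched, saved in mstart1); execution resumes in mstart and goes on to mexit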
			gogo(&amp;_g_.m.g0.sched)
		} else {
			// Clear lockedExt on plan9 since we may end up re-using
			// this thread.
			_g_.m.lockedExt = 0
		}
	}
	// Schedule again
	schedule()
}
</code></pre>
<p><strong>How a G is found</strong></p>
<p>Back to the scheduling logic: this is how a runnable G is found:</p>
<pre><code class="language-go">func findrunnable() (gp *g, inheritTime bool) {
	_g_ := getg()
  
top:
	_p_ := _g_.m.p.ptr()
	// If the GC needs the world stopped, stop this M and start over once it resumes
	if sched.gcwaiting != 0 {
		gcstopm()
		goto top
	}
  ......

	// First look in the P's local queue
	if gp, inheritTime := runqget(_p_); gp != nil {
		return gp, inheritTime
	}

	// If that fails, look in the global queue
	if sched.runqsize != 0 {
		lock(&amp;sched.lock)
		gp := globrunqget(_p_, 0)
		unlock(&amp;sched.lock)
		if gp != nil {
			return gp, false
		}
	}

	// Poll the network; this takes priority over stealing from other Ps
	if netpollinited() &amp;&amp; atomic.Load(&amp;netpollWaiters) &gt; 0 &amp;&amp; atomic.Load64(&amp;sched.lastpoll) != 0 {
		if list := netpoll(false); !list.empty() { // non-blocking
			gp := list.pop()
			injectglist(&amp;list)
			casgstatus(gp, _Gwaiting, _Grunnable)
			if trace.enabled {
				traceGoUnpark(gp, 0)
			}
			return gp, false
		}
	}

	// Prepare to steal from other Ps
	procs := uint32(gomaxprocs)
	if atomic.Load(&amp;sched.npidle) == procs-1 {
		// Every other P is idle, so there is nothing to steal
		goto stop
	}
	// If the spinning Ms already number at least half of the busy Ps, block instead of spinning
	if !_g_.m.spinning &amp;&amp; 2*atomic.Load(&amp;sched.nmspinning) &gt;= procs-atomic.Load(&amp;sched.npidle) {
		goto stop
	}
	// The M enters the spinning state
	if !_g_.m.spinning {
		_g_.m.spinning = true
		atomic.Xadd(&amp;sched.nmspinning, 1)
	}
	for i := 0; i &lt; 4; i++ {
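		// Visit the other Ps in random order
		for enum := stealOrder.start(fastrand()); !enum.done(); enum.next() {
			// Check for GC again; if one is pending, go back to the top and stop the M
			if sched.gcwaiting != 0 {
				goto top
			}
			stealRunNextG := i &gt; 2 // only on the last of the four passes, also steal the victim's runnext G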
			if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {
				return gp, false
			}
		}
	}

stop:

	......

	// Take a snapshot of allp before giving up the current P,
	// because allp can change underneath us once we no longer block safe-points
	allpSnapshot := allp

	// About to return the P; lock the scheduler
	lock(&amp;sched.lock)
	// Check for GC again
	if sched.gcwaiting != 0 || _p_.runSafePointFn != 0 {
		unlock(&amp;sched.lock)
		goto top
	}
	// Check the global queue again
	if sched.runqsize != 0 {
		gp := globrunqget(_p_, 0)
		unlock(&amp;sched.lock)
		return gp, false
	}
	// Release the P
	if releasep() != _p_ {
		throw(&quot;findrunnable: wrong p&quot;)
	}
	// Put the P on the idle list
	pidleput(_p_)
	// Unlock the scheduler
	unlock(&amp;sched.lock)

	// Delicate dance: thread transitions from spinning to non-spinning state,
	// potentially concurrently with submission of new goroutines. We must
	// drop nmspinning first and then check all per-P queues again (with
	// #StoreLoad memory barrier in between). If we do it the other way around,
	// another thread can submit a goroutine after we've checked all run queues
	// but before we drop nmspinning; as the result nobody will unpark a thread
	// to run the goroutine.
	// If we discover new work below, we need to restore m.spinning as a signal
	// for resetspinning to unpark a new worker thread (because there can be more
	// than one starving goroutine). However, if after discovering new work
	// we also observe no idle Ps, it is OK to just park the current thread:
	// the system is fully loaded so no spinning threads are required.
	// Also see &quot;Worker thread parking/unparking&quot; comment at the top of the file.
	wasSpinning := _g_.m.spinning
	if _g_.m.spinning {
		_g_.m.spinning = false
		if int32(atomic.Xadd(&amp;sched.nmspinning, -1)) &lt; 0 {
			throw(&quot;findrunnable: negative nmspinning&quot;)
		}
	}

	// Check the local queues of all Ps again
	for _, _p_ := range allpSnapshot {
		if !runqempty(_p_) {
			lock(&amp;sched.lock)
			_p_ = pidleget()
			unlock(&amp;sched.lock)
			if _p_ != nil {
				acquirep(_p_)
				if wasSpinning {
					_g_.m.spinning = true
					atomic.Xadd(&amp;sched.nmspinning, 1)
				}
				goto top
			}
			break
		}
	}

	// Check for idle-priority GC work again
	if gcBlackenEnabled != 0 &amp;&amp; gcMarkWorkAvailable(nil) {
		lock(&amp;sched.lock)
		_p_ = pidleget()
		if _p_ != nil &amp;&amp; _p_.gcBgMarkWorker == 0 {
			pidleput(_p_)
			_p_ = nil
		}
		unlock(&amp;sched.lock)
		if _p_ != nil {
			acquirep(_p_)
			if wasSpinning {
				_g_.m.spinning = true
				atomic.Xadd(&amp;sched.nmspinning, 1)
			}
			// Go back to idle GC check.
			goto stop
		}
	}

	// Poll the network again, this time blocking
	if netpollinited() &amp;&amp; atomic.Load(&amp;netpollWaiters) &gt; 0 &amp;&amp; atomic.Xchg64(&amp;sched.lastpoll, 0) != 0 {
		if _g_.m.p != 0 {
			throw(&quot;findrunnable: netpoll with p&quot;)
		}
		if _g_.m.spinning {
			throw(&quot;findrunnable: netpoll with spinning&quot;)
		}
		list := netpoll(true) // block until new work is available
		atomic.Store64(&amp;sched.lastpoll, uint64(nanotime()))
		if !list.empty() {
			lock(&amp;sched.lock)
			_p_ = pidleget()
			unlock(&amp;sched.lock)
			if _p_ != nil {
				acquirep(_p_)
				gp := list.pop()
				injectglist(&amp;list)
				casgstatus(gp, _Gwaiting, _Grunnable)
				if trace.enabled {
					traceGoUnpark(gp, 0)
				}
				return gp, false
			}
			injectglist(&amp;list)
		}
	}
	// Nothing found anywhere: park the current M
	stopm()
	goto top
}
</code></pre>
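<p>Taken together, <code>schedule</code> and <code>findrunnable</code> look for work in a fixed order. The toy model below replays that order with plain slices. It is only a sketch: <code>toySched</code> and its fields are invented for illustration, everything is single-threaded, and the GC, spinning and parking steps are left out.</p>

```go
package main

import "fmt"

// toySched is a hypothetical model of one P and the places it can find work.
// Queues hold goroutine names; this is not the runtime's data layout.
type toySched struct {
	tick    int        // plays the role of p.schedtick
	local   []string   // this P's local run queue
	global  []string   // the global run queue
	netpoll []string   // goroutines made ready by the network poller
	victims [][]string // local run queues of the other Ps
}

// pop removes and returns the first element of q, if there is one.
func pop(q *[]string) (string, bool) {
	if len(*q) == 0 {
		return "", false
	}
	g := (*q)[0]
	*q = (*q)[1:]
	return g, true
}

// next returns the goroutine to run and where it came from ("" if there is no work).
func (s *toySched) next() (string, string) {
	fair := s.tick%61 == 0
	s.tick++
	if fair {
		// Every 61st tick the global queue is tried first so that it cannot starve.
		if g, ok := pop(&s.global); ok {
			return g, "global (fairness check)"
		}
	}
	if g, ok := pop(&s.local); ok {
		return g, "local"
	}
	if g, ok := pop(&s.global); ok {
		return g, "global"
	}
	if g, ok := pop(&s.netpoll); ok {
		return g, "netpoll"
	}
	// Steal half (rounded up) of the first non-empty victim queue.
	for i := range s.victims {
		n := (len(s.victims[i]) + 1) / 2
		if n == 0 {
			continue
		}
		s.local = append(s.local, s.victims[i][:n]...)
		s.victims[i] = s.victims[i][n:]
		g, _ := pop(&s.local)
		return g, "stolen"
	}
	return "", ""
}

func main() {
	s := &toySched{
		local:   []string{"L"},
		global:  []string{"G"},
		netpoll: []string{"N"},
		victims: [][]string{{"a", "b", "c"}},
	}
	for {
		g, from := s.next()
		if from == "" {
			break
		}
		fmt.Println(g, "from", from)
	}
}
```

<p>Because <code>tick</code> starts at 0, the very first call takes the fairness path and picks from the global queue; after that the order is local, global, netpoll, and finally stealing.</p>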
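<p>In summary, the lookup order is: local queue &gt; global queue &gt; network poll &gt; stealing.</p>
<p>How a G is stolen:</p>
<pre><code class="language-go">// Steal half of the elements from p2's local queue and put them in p's local queue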
func runqsteal(_p_, p2 *p, stealRunNextG bool) *g {
	t := _p_.runqtail
	n := runqgrab(p2, &amp;_p_.runq, t, stealRunNextG)
	if n == 0 {
		return nil
	}
	n--
	gp := _p_.runq[(t+n)%uint32(len(_p_.runq))].ptr()
	if n == 0 {
		return gp
	}
	h := atomic.LoadAcq(&amp;_p_.runqhead) // load-acquire, synchronize with consumers
	if t-h+n &gt;= uint32(len(_p_.runq)) {
		throw(&quot;runqsteal: runq overflow&quot;)
	}
	atomic.StoreRel(&amp;_p_.runqtail, t+n) // store-release, makes the item available for consumption
	return gp
}
</code></pre>
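<p><strong>Spinning Ms</strong></p>
<p>An M is spinning when it has run out of local work and keeps searching for a runnable G, as <code>findrunnable</code> does above.</p>
<p>While Gs are available, the runtime tries to keep some spinning Ms around, bounded by the number of running Ps (the check in <code>findrunnable</code> stops a new M from spinning once the spinning Ms reach half of the busy Ps). When no G is available, the M blocks and waits to be woken. This avoids waking Ms over and over while work keeps arriving, and avoids wasting CPU on searching when there is none.</p>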
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[A case of Traefik being unable to proxy MySQL]]></title>
        <id>https://cnbailian.github.io/post/traefik-cannot-proxy-mysql/</id>
        <link href="https://cnbailian.github.io/post/traefik-cannot-proxy-mysql/">
        </link>
        <updated>2020-04-13T11:00:00.000Z</updated>
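        <summary type="html"><![CDATA[<p>Traefik has supported TCP routes since version 2.0, and I use Traefik as the Ingress for my kubernetes cluster. While using it, I found that the TCP route Traefik created for MySQL did not work. After some troubleshooting and searching, I found an <a href="https://community.containo.us/t/v2-tcp-router-with-tls-example/2664">answer</a> to this question from a Traefik maintainer. An excerpt:</p>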
<blockquote>
<p>But be careful: not all protocols based on TCP and using TLS supports the SNI routing or the passthrough. It requires the protocol supporting SNI (for instance MySQL doesn't) and doing a TLS handshake (if it is a STARTTLS, then it does not work).</p>
</blockquote>
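<p>That pinned the problem on MySQL's lack of support, but it also made me curious. What is SNI? Why does Traefik use <code>HostSNI</code> to create TCP routes? Why does MySQL not support SNI? With these questions in mind, I went looking for answers.</p>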
]]></summary>
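        <content type="html"><![CDATA[<p>Traefik has supported TCP routes since version 2.0, and I use Traefik as the Ingress for my kubernetes cluster. While using it, I found that the TCP route Traefik created for MySQL did not work. After some troubleshooting and searching, I found an <a href="https://community.containo.us/t/v2-tcp-router-with-tls-example/2664">answer</a> to this question from a Traefik maintainer. An excerpt:</p>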
<blockquote>
<p>But be careful: not all protocols based on TCP and using TLS supports the SNI routing or the passthrough. It requires the protocol supporting SNI (for instance MySQL doesn't) and doing a TLS handshake (if it is a STARTTLS, then it does not work).</p>
</blockquote>
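<p>That pinned the problem on MySQL's lack of support, but it also made me curious. What is SNI? Why does Traefik use <code>HostSNI</code> to create TCP routes? Why does MySQL not support SNI? With these questions in mind, I went looking for answers.</p>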
<!--more-->  
<h2 id="tls-extensions-sni">TLS Extensions —— SNI</h2>
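<p>Let us start with SNI itself. SNI is an extension to the TLS protocol.</p>
<h3 id="什么是-tls-extensions">What are TLS Extensions?</h3>
<p>TLS extensions were first proposed in 2003 as a standalone specification (<a href="https://tools.ietf.org/html/rfc3546">RFC 3546</a>) and evolved through <a href="https://tools.ietf.org/html/rfc4366">RFC 4366</a>, <a href="https://tools.ietf.org/html/rfc6066">RFC 6066</a> and others, being incorporated into TLS 1.1, TLS 1.2 and TLS 1.3 in turn. They let the client and the server gain new features without revising TLS itself.</p>
<p>The client declares the extensions it supports in the ClientHello. After receiving the ClientHello, the server parses the extensions one by one. Extensions that need an immediate response are answered in the ServerHello; extensions that need no response, or that the server does not support, are simply ignored.</p>
<p>In the ClientHello, the Extensions field comes after the Compression Methods field. Viewed in Wireshark:</p>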
<figure data-type="image" tabindex="1"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge84vtby39j31nf0u0wt5.jpg" alt="github-wireshark" loading="lazy"></figure>
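<h3 id="什么是-sni-扩展">What is the SNI extension?</h3>
<p>As we know, Nginx can serve multiple sites by configuring different <code>server_name</code> values, and the <code>Host</code> header of an HTTP/1.1 request identifies which site the request belongs to. TLS itself, however, gave the client no way to tell the server the name of the server it is connecting to. So when several sites share one address and use different certificates, how does the server know which certificate to send?</p>
<p>SNI was created to solve this problem. SNI stands for Server Name Indication. It was <a href="https://tools.ietf.org/html/rfc3546#page-8">first standardized in 2003</a> and updated in <a href="https://tools.ietf.org/html/rfc6066#page5">RFC 6066</a>. It lets a server host multiple TLS-enabled services on the same network address by requiring the client to state, during the initial TLS handshake, which service it wants to connect to.</p>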
<pre><code class="language-c">struct {
  NameType name_type;
  select (name_type) {
  	case host_name: HostName;
  } name;
} ServerName;

enum {
	host_name(0), (255)
} NameType;

opaque HostName&lt;1..2^16-1&gt;;

struct {
	ServerName server_name_list&lt;1..2^16-1&gt;
} ServerNameList;
</code></pre>
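<p>To see the <code>server_name</code> value from the client and the server side without a packet capture, the following sketch runs the start of a TLS handshake over an in-memory pipe using Go's <code>crypto/tls</code>. The server side stops as soon as it has read the ClientHello, so no certificate is needed. The helper name <code>sniffSNI</code> and the host name are made up for the example; this is not how Traefik is implemented.</p>

```go
package main

import (
	"crypto/tls"
	"errors"
	"fmt"
	"net"
	"time"
)

// sniffSNI performs just enough of a TLS handshake over an in-memory pipe
// to read the server_name that the client put in its ClientHello.
func sniffSNI(serverName string) string {
	clientConn, serverConn := net.Pipe()
	defer clientConn.Close()
	defer serverConn.Close()
	deadline := time.Now().Add(2 * time.Second)
	clientConn.SetDeadline(deadline)
	serverConn.SetDeadline(deadline)

	go func() {
		// The client sets ServerName, which crypto/tls sends as the SNI extension.
		// The handshake is expected to fail; only the ClientHello matters here.
		client := tls.Client(clientConn, &tls.Config{ServerName: serverName, InsecureSkipVerify: true})
		_ = client.Handshake()
	}()

	var seen string
	server := tls.Server(serverConn, &tls.Config{
		// GetConfigForClient runs right after the ClientHello has been parsed.
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			seen = hello.ServerName
			return nil, errors.New("stop after reading the ClientHello")
		},
	})
	_ = server.Handshake() // returns the error above once the name is recorded
	return seen
}

func main() {
	fmt.Println(sniffSNI("mysql.example.com"))
}
```

<p>A proxy that routes on SNI needs exactly this piece of information, and it is only available if the very first bytes on the connection are a TLS ClientHello.</p>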
<p>The extension type is <code>server_name</code>. Expanding the <code>server_name</code> row in the Wireshark capture above shows more detail:</p>
<figure data-type="image" tabindex="2"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge85t4pu0uj31n80u019m.jpg" alt="server_name" loading="lazy"></figure>
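<p><code>ServerNameList</code> must not contain more than one name of the same <code>NameType</code>. Currently <code>host_name</code> is the only <code>NameType</code>, though more types may be added in the future. <code>host_name</code> holds a standard DNS hostname without a trailing dot. If a server supports the SNI extension but does not recognize the <code>server_name</code>, it should either abort the handshake by sending a fatal-level <code>unrecognized_name(112)</code> alert or continue the handshake.</p>
<p><em>See <a href="https://tools.ietf.org/html/rfc6066#page5">RFC 6066</a> for the full specification. A list of registered extensions is available <a href="https://www.iana.org/assignments/tls-extensiontype-values/tls-extensiontype-values.xhtml">here</a>.</em></p>
<h2 id="traefik-的-tcp-路由与-sni">Traefik's TCP routing and SNI</h2>
<p>Traefik has supported TCP routing since 2.0, and it can also define different TCP routes on the same <code>entryPoints</code> (the entry ports of traefik). But TCP is a transport-layer protocol and has nothing like SNI that would let a single address serve different services. So how does Traefik do it?</p>
<h3 id="部署基于-tls-的-tcp-路由">Deploying a TLS-based TCP route</h3>
<p>The answer is simple: Traefik routes per host by SNI, because that is the only standard way to route by name over TCP. TCP itself has no SNI, so TLS is required. The deployment configuration:</p>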
<pre><code class="language-yaml">apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: example
spec:
  entryPoints:
    - web
  routes:
  - match: HostSNI(`web.example.com`)
    services:
    - name: example-service-name
      port: 80
  tls: 
    secretName: traefik-tls-certs
</code></pre>
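<p>The value in <code>HostSNI</code> corresponds to the <code>server_name</code> value of the SNI extension; Traefik routes on it and uses it to find the matching certificate. Note also that the <code>entryPoints</code> section refers to the <code>entryPoints</code> arguments of the Traefik deployment. Here <code>web</code> is the name of an entry point we defined, listening on port 80:</p>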
<pre><code class="language-yaml">......
- image: traefik:2.1.1
  name: traefik
  ports:
  - name: web
    containerPort: 80
    hostPort: 80
  args:
  - --entryPoints.web.address=:80
......
</code></pre>
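<p><code>hostPort</code> is used to expose the entry point here so that the backend service can be reached directly through the entry point port on the node where Traefik is deployed.</p>
<h3 id="部署非-tls-的-tcp-路由">Deploying a non-TLS TCP route</h3>
<p>For clients that do not support SNI/TLS, Traefik can also deploy “plain TCP”, which is ordinary routing by port. <code>match</code> still uses <code>HostSNI</code>, but its value must be the wildcard <code>*</code>:</p>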
<pre><code class="language-yaml">apiVersion: traefik.containo.us/v1alpha1
kind: IngressRouteTCP
metadata:
  name: example
spec:
  entryPoints:
    - web
  routes:
  - match: HostSNI(`*`)
    services:
    - name: example-service-name
      port: 80
</code></pre>
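<h3 id="其他">Other notes</h3>
<p>When Traefik proxies a TLS service, the backend service does not need any TLS setup; Traefik handles all of it. If the backend service needs to receive the encrypted data, set the <code>passthrough</code> option and Traefik will forward the encrypted data to the backend service:</p>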
<pre><code class="language-yaml">......
  tls: 
    secretName: traefik-tls-certs
    passthrough: true
</code></pre>
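<h2 id="为什么不能为-mysql-代理">Why MySQL cannot be proxied</h2>
<p>Once I understood SNI and how Traefik uses SNI/TLS to route TCP, I started to look into why MySQL cannot use the SNI extension. Someone raised this as early as 2016, but unfortunately nobody followed up: https://bugs.mysql.com/bug.php?id=82872. This puzzled me. MySQL already implements TLS, so why not add the SNI extension when users are asking for it? It is not an overly complex feature.</p>
<p>Before getting to the answer, a quick review of the standard TLS flow: first the TCP three-way handshake, then the TLS handshake (two round trips for a full handshake in TLS 1.2 and earlier, one round trip in TLS 1.3), and finally the encrypted data.</p>
<p>Now the MySQL flow. Run <code>mysql -hmysql.example.com -P3306 -uroot -pmysql --ssl-mode=REQUIRED</code> and watch it in Wireshark:</p>
<p><em>MySQL already uses TLS by default for TCP connections; to turn it off, pass <code>--ssl-mode=DISABLED</code>. For localhost it connects over a socket by default; to force a TCP connection, add <code>--protocol tcp</code>.</em></p>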
<figure data-type="image" tabindex="3"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge97ghpl3ej31ja0u07t9.jpg" alt="mysql" loading="lazy"></figure>
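<p>As the capture shows, after the TCP handshake the server sends the MySQL protocol Handshake Packet, <code>Server Greeting proto=10 version=5.7.29</code>, which starts the MySQL protocol handshake. The client then sends its Auth Packet. The capture shows the flow with TLS enabled, so the <code>user</code> field is not visible. With the MySQL client option <code>--ssl-mode=DISABLED</code> the user name would be shown, and the server would then send an <code>Auth Switch Request</code> packet to continue authentication. I will not go into that here; capture the traffic yourself if you are interested.</p>
<p>At this point the picture is clear. MySQL runs its own protocol handshake before the TLS handshake, so Traefik cannot use the TLS SNI to find the backend service, and the MySQL Handshake Packet is therefore never sent. A MySQL client with a timeout reports <code>waiting for initial communication packet</code> or a similar error; one without a timeout waits forever.</p>
<p>Traefik can do little about this. The MySQL protocol has no SNI-like mechanism of its own, and TLS only starts after the MySQL handshake, so there is nothing to route on. One can only hope that MySQL changes this part of the flow soon. <a href="https://github.com/containous/traefik/issues/5155">Here</a> are some replies from the maintainers on the matter: https://github.com/containous/traefik/issues/5155</p>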
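<p>The deadlock can be made concrete by looking at what an SNI-routing proxy sees first on a new connection. A TLS client speaks first, and its first bytes are a handshake record (content type <code>0x16</code>, major version <code>0x03</code>) carrying a ClientHello (handshake type <code>0x01</code>). A MySQL client sends nothing until it has received the server greeting. The sketch below classifies those first bytes; the function names and the three labels are invented for this example and do not reflect Traefik's actual code.</p>

```go
package main

import "fmt"

// isClientHello reports whether the first bytes sent by a client look like
// a TLS handshake record that carries a ClientHello.
func isClientHello(first []byte) bool {
	return len(first) >= 6 &&
		first[0] == 0x16 && // record content type: handshake
		first[1] == 0x03 && // protocol major version
		first[5] == 0x01 // handshake message type: client_hello
}

// classify is a hypothetical helper that names the three situations a
// proxy routing on SNI can find itself in after accepting a connection.
func classify(first []byte) string {
	switch {
	case len(first) == 0:
		return "wait" // the client expects the server to speak first, as MySQL does
	case isClientHello(first):
		return "tls" // SNI can be read, so HostSNI(`name`) rules can match
	default:
		return "plain" // not TLS, so only HostSNI(`*`) can match
	}
}

func main() {
	fmt.Println(classify(nil))
	fmt.Println(classify([]byte{0x16, 0x03, 0x01, 0x00, 0x05, 0x01}))
	fmt.Println(classify([]byte("GET / ")))
}
```

<p>In the "wait" case neither side sends anything: the proxy is waiting for a ClientHello before it can choose a backend, and the MySQL client is waiting for a greeting that only the backend can send.</p>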
<h2 id="其他常见数据库">Other common databases</h2>
<p>Having understood the MySQL problem, I wondered whether other common databases have the same issue, so I looked at MongoDB and Redis.</p>
<h3 id="mongodb">MongoDB</h3>
<p>Connect with: <code>mongo --host mongo.example.com --port 27017 --ssl</code></p>
<figure data-type="image" tabindex="4"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge98i03x3ej31pq0u0wsp.jpg" alt="mongodb" loading="lazy"></figure>
<p>A completely standard flow with SNI support, so Traefik routes it without trouble.</p>
<h3 id="redis">Redis</h3>
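<p>Redis supports SSL/TLS starting with 6.0, which is still in the RC (Release Candidate) stage at the time of writing. To try it, download the source and build it yourself. TLS is an optional feature and has to be enabled at build time: <code>make BUILD_TLS=yes</code>.</p>
<p><em>Official documentation: https://redis.io/topics/encryption</em></p>
<p>After building, I tried connecting to the address proxied by Traefik with <code>./redis-cli --tls -h testtcp.ohuna.cloud -p 6379</code>, only to find that Traefik answered with a fatal-level error, <code>Unknown CA</code>:</p>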
<figure data-type="image" tabindex="5"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge99xitc2zj31j50u0qlj.jpg" alt="redis" loading="lazy"></figure>
<p>Evidently redis did not send the SNI extension, yet the documentation does not mention this, so I went to the redis source for an answer. <code>tls.h</code> shows that redis uses openssl:</p>
<pre><code class="language-c">......
#ifdef USE_OPENSSL

#include &lt;openssl/ssl.h&gt;
#include &lt;openssl/err.h&gt;
#include &lt;openssl/rand.h&gt;
</code></pre>
<p>So I searched for <code>SSL_set_tlsext_host_name</code>, the openssl function that sets SNI:</p>
<pre><code class="language-c">#redis-cli.c
    if (config.sni &amp;&amp; !SSL_set_tlsext_host_name(ssl, config.sni)) {
        *err = &quot;Failed to configure SNI&quot;;
        SSL_free(ssl);
        return REDIS_ERR;
    }
......
  #ifdef USE_OPENSSL
        } else if (!strcmp(argv[i],&quot;--tls&quot;)) {
            config.tls = 1;
        } else if (!strcmp(argv[i],&quot;--sni&quot;) &amp;&amp; !lastarg) {
            config.sni = argv[++i];
......
</code></pre>
<p>It turns out that SNI can be set with the <code>--sni</code> option, which <code>redis-cli --help</code> also describes:</p>
<pre><code class="language-bash">redis-cli 5.9.103

Usage: redis-cli [OPTIONS] [cmd [arg [arg ...]]]
......
	--tls              Establish a secure TLS connection.
  --sni &lt;host&gt;       Server name indication for TLS.
</code></pre>
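<p>My own carelessness cost me the time spent hunting for the SNI setting, though it is also odd that redis requires SNI to be set by hand. Connecting again with the <code>--sni</code> option, <code>./redis-cli --tls -h redis.example.com -p 6379 --sni redis.example.com</code>, succeeds this time, and the TLS ClientHello now carries <code>server_name</code>:</p>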
<figure data-type="image" tabindex="6"><img src="https://tva1.sinaimg.cn/large/007S8ZIlly1ge9amao5j7j31ji0u01a4.jpg" alt="redis-success" loading="lazy"></figure>
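<h2 id="扩展阅读esni">Further reading: ESNI</h2>
<p>That settles the Traefik and MySQL question, but there is more to learn about SNI itself.</p>
<h4 id="sni-的安全问题">Security issues with SNI</h4>
<p>The SNI extension is sent in the ClientHello during the TLS handshake, before the client and the server share any encryption key, so the ClientHello travels unencrypted. A man in the middle can therefore intercept the plaintext ClientHello and learn which site the client is about to visit.</p>
<h4 id="esni">ESNI</h4>
<p>A draft is currently trying to solve this problem: <a href="https://tools.ietf.org/html/draft-rescorla-tls-esni-00">ESNI (Encrypted Server Name Indication)</a>.</p>
<p>Encrypting SNI is a chicken-and-egg problem, which ESNI solves by bringing in DNS. The server publishes a public key in a well-known DNS record, and the client fetches it before connecting to the server. The client then replaces the SNI extension in the ClientHello with ESNI, in which the server name is encrypted with a symmetric key derived from that public key.</p>
<p>ESNI requires TLS 1.3. TLS 1.3 uses Diffie-Hellman key exchange, which lets the two parties agree on keys safely over an insecure channel, and it encrypts the server certificate. In earlier versions the certificate is sent in plaintext, so the site could still be identified from it even if the SNI were encrypted.</p>
<p>DNS alone is not enough either, because DNS is unencrypted by default. The DNS resolver in use needs to support DNS over TLS (DoT) or DNS over HTTPS (DoH).</p>
<p><em>This is only a brief look at ESNI. For details, see Cloudflare's <a href="https://blog.cloudflare.com/zh/encrypted-sni-zh/">article</a> or the <a href="https://tools.ietf.org/html/draft-rescorla-tls-esni-00">draft</a>.</em></p>
<h2 id="参考和致谢">References and acknowledgements</h2>
<p>I ran into many problems while learning this, and fortunately the internet has plenty of material. Thanks to the following documents and blog posts:</p>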
<p><a href="https://www.qikqiak.com/post/traefik-2.1-101/">一文搞懂 Traefik2.1 的使用</a></p>
<p><a href="https://harttle.land/2018/03/25/https-protocols.html">HTTPS 交互过程分析</a></p>
<p><a href="https://imququ.com/post/sth-about-switch-to-https-2.html">关于启用 HTTPS 的一些经验分享（二）</a></p>
<p><a href="https://halfrost.com/https-extensions/">HTTPS 温故知新（六） —— TLS 中的 Extensions</a></p>
<p><a href="https://tools.ietf.org/html/rfc6066">RFC 6066</a></p>
<p><a href="https://www.callmejiagu.com/2018/10/26/WireShark-%E5%88%86%E6%9E%90MySQL%E7%BD%91%E7%BB%9C%E5%8D%8F%E8%AE%AE%E4%B8%AD%E7%9A%84%E6%95%B0%E6%8D%AE%E5%8C%85%EF%BC%88%E4%BA%8C%EF%BC%89/">实现自己的数据库驱动——WireShark分析MySQL网络协议中的数据包（二）</a></p>
<p><a href="https://blog.cloudflare.com/zh/encrypted-sni-zh/">不加密，无隐私：加密SNI工作原理</a></p>
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Kubernetes Cluster Autoscaler]]></title>
        <id>https://cnbailian.github.io/post/kubernetes-cluster-autoscaler/</id>
        <link href="https://cnbailian.github.io/post/kubernetes-cluster-autoscaler/">
        </link>
        <updated>2020-03-31T11:00:00.000Z</updated>
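        <summary type="html"><![CDATA[<p>After deploying an application on Kubernetes, what do you do when users grow faster than expected and compute resources run short? Kubernetes' answer is auto-scaling: a set of cooperating autoscaling components watches your cluster around the clock and adjusts capacity as load changes to match user demand.</p>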
]]></summary>
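        <content type="html"><![CDATA[<p>After deploying an application on Kubernetes, what do you do when users grow faster than expected and compute resources run short? Kubernetes' answer is auto-scaling: a set of cooperating autoscaling components watches your cluster around the clock and adjusts capacity as load changes to match user demand.</p>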
<!--more-->	
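<h2 id="自动伸缩组件">Autoscaling Components</h2>
<p><strong>Horizontal Pod Autoscaler (HPA)</strong></p>
<p>HPA scales the number of Pods in a Replication Controller, Deployment, or Replica Set based on observed CPU utilization. Paired with Metrics Server and the metrics APIs, it can also scale on other metrics.</p>
<p><strong>Vertical Pod Autoscaler (VPA)</strong></p>
<p>VPA sets a Pod's resource requests automatically from its observed usage and can adjust them at runtime.</p>
<p><strong>Cluster Autoscaler (CA)</strong></p>
<p>CA automatically scales the cluster's Nodes. When Pods cannot be scheduled, it adds Nodes so they can run; when it finds Nodes whose resource utilization is too low, it removes them to save resources.</p>
<p><strong>Addon Resizer</strong></p>
<p>This small add-on runs as a sidecar and vertically scales another container in the same deployment. Its only strategy at present is linear scaling with the number of nodes in the cluster. It is usually paired with <a href="https://github.com/kubernetes/kubernetes/blob/master/cluster/addons/metrics-server/metrics-server-deployment.yaml#L66">Metrics Server</a> so that Metrics Server can keep serving the metrics API as the cluster grows.</p>
<p>HPA scales stateless applications, VPA scales stateful applications, and CA guarantees compute resources; together they form a complete autoscaling solution.</p>
<h2 id="cluster-autoscaler-详细介绍">Cluster Autoscaler in Detail</h2>
<p>Of the four components above, HPA lives in the kubernetes repository and is released with each kubernetes version, so it needs no deployment and can be used directly. The other three live in a <a href="https://github.com/kubernetes/autoscaler">repository</a> maintained by the official community. Cluster Autoscaler v1.0 (GA) was released alongside kubernetes 1.8; the other two are still in beta.</p>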
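<h3 id="部署">Deployment</h3>
<p>Cluster Autoscaler is normally used together with a cloud vendor. It exposes a <code>CloudProvider</code> interface for vendors to implement; the vendor adds and removes ECS-style nodes through its scaling group or node pool feature.</p>
<p>Cloud vendors integrated so far (v1.18.1):</p>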
<p><strong>Alicloud:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/alicloud/README.md</p>
<p><strong>AWS:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md</p>
<p><strong>Azure:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/README.md</p>
<p><strong>Baiducloud:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/baiducloud/README.md</p>
<p><strong>DigitalOcean:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/digitalocean/README.md</p>
<p><strong>GoogleCloud GCE:</strong> https://kubernetes.io/docs/tasks/administer-cluster/cluster-management/#upgrading-google-compute-engine-clusters</p>
<p><strong>GoogleCloud GKE:</strong> https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-autoscaler</p>
<p><strong>OpenStack Magnum:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/magnum/README.md</p>
<p><strong>Packet:</strong> https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/packet/README.md</p>
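<p>Startup parameter list: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca</p>
<h3 id="工作原理">How It Works</h3>
<p>Cluster Autoscaler defines a <code>NodeGroup</code> abstraction that corresponds to the cloud vendor's scaling group service. It uses the <code>NodeGroup</code>s returned by the <code>CloudProvider</code> to compute node resources in the cluster and scales accordingly.</p>
<p>Once started, Cluster Autoscaler periodically (every 10s by default) checks for unscheduled Pods and Node resource usage, and runs <code>Scale UP</code> or <code>Scale Down</code> as needed.</p>
<h4 id="scale-up">Scale UP</h4>
<p>When Cluster Autoscaler finds Pods that cannot be scheduled for lack of resources, it calls <code>Scale UP</code> to expand the cluster.</p>
<p><code>Scale UP</code> only counts Nodes that belong to a <code>NodeGroup</code>, so it makes sense to put all Worker Nodes under scaling-group management. Because scaling groups add nodes asynchronously, Upcoming Nodes are also taken into account.</p>
<p>A cluster may need Nodes of different sizes for business reasons, so several <code>NodeGroup</code>s can be created. On scale-up, the strategy set by the <code>--expander</code> option picks which node group to expand. <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-expanders">Five strategies</a> are supported:</p>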
<ul>
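<li><strong>random:</strong> Picks a <code>NodeGroup</code> at random. This is the default when no strategy is specified.</li>
<li><strong>most-pods:</strong> Picks the <code>NodeGroup</code> that can schedule the most Pods. For example, when some Pods are unscheduled because of a <code>nodeSelector</code>, this strategy prefers a <code>NodeGroup</code> that satisfies it, so that most Pods can be scheduled.</li>
<li><strong>least-waste:</strong> To avoid waste, picks the <code>NodeGroup</code> with the smallest node type that still satisfies the Pods' resource requests.</li>
<li><strong>price:</strong> Picks the cheapest <code>NodeGroup</code> according to the pricing model supplied by the <code>CloudProvider</code>.</li>
<li><strong>priority:</strong> Picks by configured priority. It takes extra configuration and is somewhat cumbersome to use; see the <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/expander/priority/readme.md">documentation</a>.</li>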
</ul>
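<p>Two of the strategies above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the <code>option</code> type and its fields are hypothetical stand-ins rather than Cluster Autoscaler's real expander types, and waste is measured in CPU alone.</p>
<pre><code class="language-go">package main

import "fmt"

// option is a hypothetical, simplified expansion option: one candidate
// node group plus what scaling it up would achieve.
type option struct {
	group       string // node group name
	podsHelped  int    // pending Pods that would become schedulable
	wastedMilli int64  // CPU millicores left idle after placing those Pods
}

// mostPods returns the option that schedules the most pending Pods.
func mostPods(opts []option) (option, bool) {
	if len(opts) == 0 {
		return option{}, false
	}
	best := opts[0]
	for _, o := range opts[1:] {
		if o.podsHelped > best.podsHelped {
			best = o
		}
	}
	return best, true
}

// leastWaste returns the option that leaves the least CPU idle.
func leastWaste(opts []option) (option, bool) {
	if len(opts) == 0 {
		return option{}, false
	}
	best := opts[0]
	for _, o := range opts[1:] {
		if best.wastedMilli > o.wastedMilli {
			best = o
		}
	}
	return best, true
}

func main() {
	opts := []option{
		{group: "small", podsHelped: 2, wastedMilli: 500},
		{group: "large", podsHelped: 5, wastedMilli: 3000},
	}
	if o, ok := mostPods(opts); ok {
		fmt.Println("most-pods picks:", o.group) // large
	}
	if o, ok := leastWaste(opts); ok {
		fmt.Println("least-waste picks:", o.group) // small
	}
}
</code></pre>
<p>The two strategies can disagree on the same input, which is why the choice of <code>--expander</code> matters once a cluster has node groups of different sizes.</p>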
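<p>If needed, the number of Nodes across similar <code>NodeGroup</code>s can also be balanced, so that no single <code>NodeGroup</code> hits <code>MaxSize</code> and blocks new Nodes from joining. This is set with the <code>--balance-similar-node-groups</code> option, which defaults to <code>false</code>.</p>
<p>After a few more steps, the number of Nodes to add and the target <code>NodeGroup</code> are settled, and <code>IncreaseSize</code> is called through the <code>CloudProvider</code> to grow the vendor's scaling group, which completes the scale-up.</p>
<p><em>If anything here is unclear, see the <a href="#scaleup-%E6%BA%90%E7%A0%81%E8%A7%A3%E6%9E%90">ScaleUP source walkthrough</a> below.</em></p>
<h4 id="scale-down">Scale Down</h4>
<p>Scale-down is optional. It is controlled by the <code>--scale-down-enabled</code> option, which defaults to <code>true</code>.</p>
<p>While monitoring Node resources, Cluster Autoscaler marks a Node as <code>unneeded</code> when it meets all three of the following conditions:</p>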
<ul>
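<li>The CPU and memory requests of all Pods running on the Node add up to less than 50% of the Node's allocatable capacity. The threshold can be changed with the <code>--scale-down-utilization-threshold</code> option.</li>
<li>Every Pod on the Node can be scheduled onto another node.</li>
<li>The Node has no annotation that marks it as not to be scaled down.</li>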
</ul>
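<p>The three conditions can be read as a single predicate. The sketch below is an assumption-laden illustration, not Cluster Autoscaler's code: <code>nodeUsage</code> and its fields are hypothetical, and it assumes the busier of CPU and memory decides the utilization.</p>
<pre><code class="language-go">package main

import "fmt"

// nodeUsage is a hypothetical summary of one Node. Requests and allocatable
// capacity use the same unit per resource (for example millicores, bytes).
type nodeUsage struct {
	cpuRequested, cpuAllocatable int64
	memRequested, memAllocatable int64
	podsMovable       bool // every Pod could be scheduled onto another Node
	scaleDownDisabled bool // the Node carries a "do not scale down" annotation
}

// ratio returns requested/allocatable, treating a Node without capacity as full.
func ratio(requested, allocatable int64) float64 {
	if allocatable == 0 {
		return 1
	}
	return float64(requested) / float64(allocatable)
}

// isUnneeded applies the three conditions from the list above.
func isUnneeded(n nodeUsage, threshold float64) bool {
	if n.scaleDownDisabled || !n.podsMovable {
		return false
	}
	// Assumption of this sketch: the busier of the two resources decides.
	utilization := ratio(n.cpuRequested, n.cpuAllocatable)
	if mem := ratio(n.memRequested, n.memAllocatable); mem > utilization {
		utilization = mem
	}
	return threshold > utilization
}

func main() {
	idle := nodeUsage{
		cpuRequested: 1000, cpuAllocatable: 4000,
		memRequested: 2, memAllocatable: 8,
		podsMovable: true,
	}
	fmt.Println("unneeded:", isUnneeded(idle, 0.5)) // unneeded: true
}
</code></pre>
<p>Here <code>0.5</code> plays the role of <code>--scale-down-utilization-threshold</code>.</p>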
<p>When a Node has been marked <code>unneeded</code> for more than 10 minutes (configurable with <code>--scale-down-unneeded-time</code>), <code>DeleteNodes</code> is called through the <code>CloudProvider</code> to remove it. At most one non-empty <code>unneeded</code> Node is removed at a time, but empty Nodes can be removed in bulk, up to 10 at once (configurable with <code>--max-empty-bulk-delete</code>).</p>
<p>These are not the only criteria. Other conditions can block the removal, for example the <code>NodeGroup</code> has already reached <code>MinSize</code>, or a <code>Scale UP</code> happened within the last 10 minutes (configurable with <code>--scale-down-delay-after-add</code>). See the <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-does-scale-down-work">documentation</a> for details.</p>
<p>Cluster Autoscaler's mechanics are complex, but most of them can be tuned through flags. If you need to, read the documentation carefully: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md</p>
<h2 id="如何实现-cloudprovider">How to Implement a CloudProvider</h2>
<p>If you use one of the cloud vendors already integrated above, you only need to name it with the <code>--cloud-provider</code> option. To integrate your own IaaS or add custom business logic, you must implement the <code>CloudProvider</code> and <code>NodeGroup</code> interfaces yourself and register the implementation in the <code>builder</code>, so that it can be selected with <code>--cloud-provider</code>.</p>
<p>Providers are registered in <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/builder/builder_all.go">builder_all.go</a> under <code>cloudprovider/builder</code>. You can also add a builder file of your own there and use Go's <code>+build</code> build constraints to select which <code>CloudProvider</code> gets compiled in.</p>
<p>The <code>CloudProvider</code> and <code>NodeGroup</code> interfaces are defined in <a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/cloud_provider.go">cloud_provider.go</a>. Pay particular attention to <code>Refresh</code>: it is called at the start of every loop (10 seconds by default), which makes it the place to query the vendor's API and refresh <code>NodeGroup</code> state. A common approach is to add a <code>manager</code> that holds this state. For anything unclear, look at how the other <code>CloudProvider</code>s are implemented.</p>
<pre><code>type CloudProvider interface {
	// Name returns name of the cloud provider.
	Name() string

	// NodeGroups returns all node groups configured for this cloud provider.
	// Called several times within a single loop, so avoid querying the cloud vendor each time; cache the state in Refresh instead.
	NodeGroups() []NodeGroup

	// NodeGroupForNode returns the node group for the given node, nil if the node
	// should not be processed by cluster autoscaler, or non-nil error if such
	// occurred. Must be implemented.
	// Same as above.
	NodeGroupForNode(*apiv1.Node) (NodeGroup, error)

	// Pricing returns pricing model for this cloud provider or error if not available.
	// Implementation optional.
	// Can be left unimplemented if the price expander is not used.
	Pricing() (PricingModel, errors.AutoscalerError)

	// GetAvailableMachineTypes get all machine types that can be requested from the cloud provider.
	// Implementation optional.
	// Not needed in practice; can be left unimplemented.
	GetAvailableMachineTypes() ([]string, error)

	// NewNodeGroup builds a theoretical node group based on the node definition provided. The node group is not automatically
	// created on the cloud provider side. The node group is not returned by NodeGroups() until it is created.
	// Implementation optional.
	// Normally not needed, though it can be implemented if ClusterAutoscaler should create a default NodeGroup.
	// A better approach is to define the default NodeGroup as a scaling group on the cloud side.
	NewNodeGroup(machineType string, labels map[string]string, systemLabels map[string]string,
		taints []apiv1.Taint, extraResources map[string]resource.Quantity) (NodeGroup, error)

	// GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.).
	// The resource limiter is passed in at build time and usually needs no change; overriding it without making that visible on the cloud side would confuse users.
	GetResourceLimiter() (*ResourceLimiter, error)

	// GPULabel returns the label added to nodes with GPU resource.
	// GPU related: return the matching label if the cluster uses GPU resources. hack: we assume anything which is not cpu/memory to be a gpu.
	GPULabel() string

	// GetAvailableGPUTypes return all available GPU types cloud provider supports.
	// Same as above.
	GetAvailableGPUTypes() map[string]struct{}

	// Cleanup cleans up open resources before the cloud provider is destroyed, i.e. go routines etc.
	// The CloudProvider is initialized only once, at startup; release whatever it holds open here.
	Cleanup() error

	// Refresh is called before every main loop and can be used to dynamically update cloud provider state.
	// In particular the list of node groups returned by NodeGroups can change as a result of CloudProvider.Refresh().
	// Called from StaticAutoscaler RunOnce.
	Refresh() error
}
// NodeGroup contains configuration info and functions to control a set
// of nodes that have the same capacity and set of labels.
type NodeGroup interface {
	// MaxSize returns maximum size of the node group.
	MaxSize() int

	// MinSize returns minimum size of the node group.
	MinSize() int

	// TargetSize returns the current target size of the node group. It is possible that the
	// number of nodes in Kubernetes is different at the moment but should be equal
	// to Size() once everything stabilizes (new nodes finish startup and registration or
	// removed nodes are deleted completely). Implementation required.
	// Returns the scaling group's node count, which may differ from the number of nodes in kubernetes.
	TargetSize() (int, error)

	// IncreaseSize increases the size of the node group. To delete a node you need
	// to explicitly name it and use DeleteNode. This function should wait until
	// node group size is updated. Implementation required.
	// The scale-up method: increases the scaling group's node count.
	IncreaseSize(delta int) error

	// DeleteNodes deletes nodes from this node group. Error is returned either on
	// failure or if the given node doesn't belong to this node group. This function
	// should wait until node group size is updated. Implementation required.
	// The nodes to delete must belong to this node group.
	DeleteNodes([]*apiv1.Node) error

	// DecreaseTargetSize decreases the target size of the node group. This function
	// doesn't permit to delete any existing node and can be used only to reduce the
	// request for new nodes that have not been yet fulfilled. Delta should be negative.
	// It is assumed that cloud provider will not delete the existing nodes when there
	// is an option to just decrease the target. Implementation required.
	// Called when ClusterAutoscaler finds that the kubernetes node count and the scaling group's node count have disagreed for a long time.
	DecreaseTargetSize(delta int) error

	// Id returns an unique identifier of the node group.
	Id() string

	// Debug returns a string containing all information regarding this node group.
	Debug() string

	// Nodes returns a list of all nodes that belong to this node group.
	// It is required that Instance objects returned by this method have Id field set.
	// Other fields are optional.
	// This list should include also instances that might have not become a kubernetes node yet.
	// Returns every node in the scaling group, even those that have not yet become kubernetes nodes.
	Nodes() ([]Instance, error)

	// TemplateNodeInfo returns a schedulernodeinfo.NodeInfo structure of an empty
	// (as if just started) node. This will be used in scale-up simulations to
	// predict what would a new node look like if a node group was expanded. The returned
	// NodeInfo is expected to have a fully populated Node object, with all of the labels,
	// capacity and allocatable information as well as all pods that are started on
	// the node by default, using manifest (most likely only kube-proxy). Implementation optional.
	// ClusterAutoscaler maps node info to node groups to evaluate resources; for an empty node group, this method supplies a virtual node info.
	TemplateNodeInfo() (*schedulernodeinfo.NodeInfo, error)

	// Exist checks if the node group really exists on the cloud provider side. Allows to tell the
	// theoretical node group from the real one. Implementation required.
	Exist() bool

	// Create creates the node group on the cloud provider side. Implementation optional.
	// Used together with CloudProvider.NewNodeGroup.
	Create() (NodeGroup, error)

	// Delete deletes the node group on the cloud provider side.
	// This will be executed only for autoprovisioned node groups, once their size drops to 0.
	// Implementation optional.
	Delete() error

	// Autoprovisioned returns true if the node group is autoprovisioned. An autoprovisioned group
	// was created by CA and can be deleted when scaled to 0.
	Autoprovisioned() bool
}
</code></pre>
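<p>The advice to fetch state in <code>Refresh</code> and serve <code>NodeGroups</code> from a cache can be sketched as follows. This is a minimal sketch with hypothetical names: <code>manager</code>, <code>groupState</code>, and the <code>fetch</code> callback stand in for a real provider and its cloud API client, and the types are deliberately simpler than the real interfaces above.</p>
<pre><code class="language-go">package main

import "fmt"

// groupState is a hypothetical snapshot of one scaling group.
type groupState struct {
	id         string
	targetSize int
}

// manager caches scaling-group state between refreshes. fetch stands in for
// a call to the cloud vendor's API and is an assumption of this sketch.
type manager struct {
	fetch  func() ([]groupState, error)
	groups []groupState
}

// Refresh is the only place that talks to the cloud API.
func (m *manager) Refresh() error {
	groups, err := m.fetch()
	if err != nil {
		return err // keep serving the previous snapshot on failure
	}
	m.groups = groups
	return nil
}

// NodeGroups serves the cached snapshot, so repeated calls within one
// loop cost nothing.
func (m *manager) NodeGroups() []groupState {
	return m.groups
}

func main() {
	calls := 0
	m := new(manager)
	m.fetch = func() ([]groupState, error) {
		calls++
		return []groupState{{id: "workers", targetSize: 3}}, nil
	}

	if err := m.Refresh(); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	for i := 0; i != 3; i++ {
		_ = m.NodeGroups() // served from the cache
	}
	fmt.Println("groups:", len(m.NodeGroups()), "API calls:", calls) // groups: 1 API calls: 1
}
</code></pre>
<p>A real provider would wrap each cached entry in a type that implements <code>NodeGroup</code> and would likely guard the cache with a mutex.</p>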
<h2 id="scaleup-源码解析">ScaleUP Source Walkthrough</h2>
<pre><code>func ScaleUp(context *context.AutoscalingContext, processors *ca_processors.AutoscalingProcessors, clusterStateRegistry *clusterstate.ClusterStateRegistry, unschedulablePods []*apiv1.Pod, nodes []*apiv1.Node, daemonSets []*appsv1.DaemonSet, nodeInfos map[string]*schedulernodeinfo.NodeInfo, ignoredTaints taints.TaintKeySet) (*status.ScaleUpStatus, errors.AutoscalerError) {
	
	......
	// Check whether every ready node in the cluster comes from a node group, and collect the nodes that belong to none.
	nodesFromNotAutoscaledGroups, err := utils.FilterOutNodesFromNotAutoscaledGroups(nodes, context.CloudProvider)
	if err != nil {
		return &amp;status.ScaleUpStatus{Result: status.ScaleUpError}, err.AddPrefix(&quot;failed to filter out nodes which are from not autoscaled groups: &quot;)
	}

	nodeGroups := context.CloudProvider.NodeGroups()
	gpuLabel := context.CloudProvider.GPULabel()
	availableGPUTypes := context.CloudProvider.GetAvailableGPUTypes()

	// The resource limiter is passed in when the cloud provider is built.
	// A CloudProvider may override it, but doing so is discouraged because it confuses users.
	resourceLimiter, errCP := context.CloudProvider.GetResourceLimiter()
	if errCP != nil {
		return &amp;status.ScaleUpStatus{Result: status.ScaleUpError}, errors.ToAutoscalerError(
			errors.CloudProviderError,
			errCP)
	}

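	// Compute the remaining resource limits.
	// nodeInfos maps node groups to their sample nodes.
	// A sample node prefers data from a real node; if the NodeGroup has no real node deployed yet, the Template node data is used.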
	scaleUpResourcesLeft, errLimits := computeScaleUpResourcesLeftLimits(context.CloudProvider, nodeGroups, nodeInfos, nodesFromNotAutoscaledGroups, resourceLimiter)
	if errLimits != nil {
		return &amp;status.ScaleUpStatus{Result: status.ScaleUpError}, errLimits.AddPrefix(&quot;Could not compute total resources: &quot;)
	}

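	// Work out from the current nodes and the nodes in NodeGroups how many nodes are about to join the cluster.
	// A cloud vendor's increase-size operation does not add nodes synchronously, so they are counted here for the resource calculations below.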
	upcomingNodes := make([]*schedulernodeinfo.NodeInfo, 0)
	for nodeGroup, numberOfNodes := range clusterStateRegistry.GetUpcomingNodes() {
		......
	}
	klog.V(4).Infof(&quot;Upcoming %d nodes&quot;, len(upcomingNodes))

	// Node groups that make it into the final selection.
	expansionOptions := make(map[string]expander.Option, 0)
	......
	// Node groups that cannot take new nodes because of a limit or error, for example having reached MaxSize.
	skippedNodeGroups := map[string]status.Reasons{}
	// Filter the node groups, taking all of the above into account.
	for _, nodeGroup := range nodeGroups {
	......
	}
	if len(expansionOptions) == 0 {
		klog.V(1).Info(&quot;No expansion options&quot;)
		return &amp;status.ScaleUpStatus{
			Result:					status.ScaleUpNoOptionsAvailable,
			PodsRemainUnschedulable: getRemainingPods(podEquivalenceGroups, skippedNodeGroups),
			ConsideredNodeGroups:	nodeGroups,
		}, nil
	}

	......
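	// Pick the best node group to expand. The expander makes the choice; the default is RandomExpander (flag: expander).
	// random: picks one at random; suits a cluster with a single node group
	// most-pods: picks the node group that can schedule the most pods; e.g. when unschedulable pods carry a nodeSelector, it prefers node groups that satisfy it so most pods are served
	// least-waste: prefers the node group with the smallest node type that still satisfies the pods' resource requests
	// price: picks the cheapest according to the pricing model
	// priority: picks by priority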
	bestOption := context.ExpanderStrategy.BestOption(options, nodeInfos)
	if bestOption != nil &amp;&amp; bestOption.NodeCount &gt; 0 {
	......
		newNodes := bestOption.NodeCount

		// Recalculate the number of nodes to add this round, taking upcomingNodes into account.
		if context.MaxNodesTotal &gt; 0 &amp;&amp; len(nodes)+newNodes+len(upcomingNodes) &gt; context.MaxNodesTotal {
			klog.V(1).Infof(&quot;Capping size to max cluster total size (%d)&quot;, context.MaxNodesTotal)
			newNodes = context.MaxNodesTotal - len(nodes) - len(upcomingNodes)
			if newNodes &lt; 1 {
				return &amp;status.ScaleUpStatus{Result: status.ScaleUpError}, errors.NewAutoscalerError(
					errors.TransientError,
					&quot;max node total count already reached&quot;)
			}
		}

		createNodeGroupResults := make([]nodegroups.CreateNodeGroupResult, 0)
	
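		// If the node group does not exist on the cloud side, try to create it from the information at hand.
		// No current CloudProvider implementation allows this, though, so the method looks redundant.
		// Cloud vendors neither want to nor should hand ClusterAutoscaler the right to create node groups.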
		if !bestOption.NodeGroup.Exist() {
			oldId := bestOption.NodeGroup.Id()
			createNodeGroupResult, err := processors.NodeGroupManager.CreateNodeGroup(context, bestOption.NodeGroup)
		......
		}

		// Get the sample node of the best node group.
		nodeInfo, found := nodeInfos[bestOption.NodeGroup.Id()]
		if !found {
			// This should never happen, as we already should have retrieved
			// nodeInfo for any considered nodegroup.
			klog.Errorf(&quot;No node info for: %s&quot;, bestOption.NodeGroup.Id())
			return &amp;status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, errors.NewAutoscalerError(
				errors.CloudProviderError,
				&quot;No node info for best expansion option!&quot;)
		}

		// Work out how many nodes are needed from CPU, memory, and any GPU resources (hack: we assume anything which is not cpu/memory to be a gpu.).
		newNodes, err = applyScaleUpResourcesLimits(context.CloudProvider, newNodes, scaleUpResourcesLeft, nodeInfo, bestOption.NodeGroup, resourceLimiter)
		if err != nil {
			return &amp;status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, err
		}

		// Node groups to balance.
		targetNodeGroups := []cloudprovider.NodeGroup{bestOption.NodeGroup}
		// Whether to balance node groups is set by the balance-similar-node-groups flag.
		// Find similar node groups and balance the node counts between them.
		if context.BalanceSimilarNodeGroups {
		......
		}
		// For the balancing strategy itself, see (b *BalancingNodeGroupSetProcessor) BalanceScaleUpBetweenGroups.
		scaleUpInfos, typedErr := processors.NodeGroupSetProcessor.BalanceScaleUpBetweenGroups(context, targetNodeGroups, newNodes)
		if typedErr != nil {
			return &amp;status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, typedErr
		}
		klog.V(1).Infof(&quot;Final scale-up plan: %v&quot;, scaleUpInfos)
		// Start the scale-up, via IncreaseSize.
		for _, info := range scaleUpInfos {
			typedErr := executeScaleUp(context, clusterStateRegistry, info, gpu.GetGpuTypeForMetrics(gpuLabel, availableGPUTypes, nodeInfo.Node(), nil), now)
			if typedErr != nil {
				return &amp;status.ScaleUpStatus{Result: status.ScaleUpError, CreateNodeGroupResults: createNodeGroupResults}, typedErr
			}
		}
		......
	}
	......
}


</code></pre>
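<p>The <code>MaxNodesTotal</code> check in the excerpt is easy to misread, so here it is isolated as a small function. <code>capNewNodes</code> is a hypothetical helper written for this note; it mirrors the arithmetic of the excerpt but is not part of Cluster Autoscaler.</p>
<pre><code class="language-go">package main

import (
	"errors"
	"fmt"
)

// capNewNodes limits a scale-up request so that existing nodes, upcoming
// nodes, and new nodes together stay within maxTotal. A maxTotal of 0
// means "no cluster-wide limit", as in the excerpt above.
func capNewNodes(current, upcoming, requested, maxTotal int) (int, error) {
	if maxTotal > 0 {
		if current+requested+upcoming > maxTotal {
			requested = maxTotal - current - upcoming
			if 1 > requested {
				return 0, errors.New("max node total count already reached")
			}
		}
	}
	return requested, nil
}

func main() {
	// 8 nodes running, 1 still joining, 3 requested, limit 10: only 1 fits.
	n, err := capNewNodes(8, 1, 3, 10)
	if err != nil {
		fmt.Println("scale-up refused:", err)
		return
	}
	fmt.Println("nodes to add:", n) // nodes to add: 1
}
</code></pre>
<p>Counting upcoming nodes is what keeps two consecutive loops from both spending the same remaining headroom.</p>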
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Implementing Dynamic Provisioning in Kubernetes]]></title>
        <id>https://cnbailian.github.io/post/dynamic-provisioning-of-kubernetes/</id>
        <link href="https://cnbailian.github.io/post/dynamic-provisioning-of-kubernetes/">
        </link>
        <updated>2020-03-11T11:00:00.000Z</updated>
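        <summary type="html"><![CDATA[<p>Storage has always been a key part of running containers, and Kubernetes has put a lot of work into it: from the early Pod Volumes, PV (Persistent Volumes), and PVC (Persistent Volume Claim), through StorageClass and Dynamic Provisioning, to today's “out-of-tree” CSI (Container Storage Interface), the Kubernetes community keeps evolving how storage is implemented.</p>
<p>We will skip the basics and start with StorageClass and Dynamic Provisioning.</p>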
]]></summary>
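        <content type="html"><![CDATA[<p>Storage has always been a key part of running containers, and Kubernetes has put a lot of work into it: from the early Pod Volumes, PV (Persistent Volumes), and PVC (Persistent Volume Claim), through StorageClass and Dynamic Provisioning, to today's “out-of-tree” CSI (Container Storage Interface), the Kubernetes community keeps evolving how storage is implemented.</p>
<p>We will skip the basics and start with StorageClass and Dynamic Provisioning.</p>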
<!--more-->  
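<h2 id="关于-storageclass-与-dynamic-provisioning">About StorageClass and Dynamic Provisioning</h2>
<p>StorageClass gives storage the notion of a “class”, so a PVC can request PVs of different classes to meet different quality and policy requirements. That alone is not enough, since we would still have to create the storage by hand, then create a PV and bind it. So StorageClass has a second job: <strong>Dynamic Provisioning</strong>, through which Kubernetes creates the storage a user needs automatically.</p>
<h3 id="如何使用">How to Use It</h3>
<p>Create a StorageClass object and set its <code>provisioner</code> field to the kind of dynamic provisioner to use:</p>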
<pre><code class="language-yaml">apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
</code></pre>
<p>Once it exists, every PVC that names this StorageClass gets a PV provisioned dynamically:</p>
<pre><code class="language-yaml">apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: standard
</code></pre>
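<p>Some additional setup is needed too. For example, aws-ebs requires <code>--cloud-provider=aws</code> among the startup flags, and Glusterfs requires the distributed storage to be installed on the cluster nodes beforehand. See the official docs or search online for the details, which are not repeated here.</p>
<h3 id="external-provisioner">External provisioner</h3>
<p>Kubernetes ships many Provisioner implementations, such as AWSElasticBlockStore, AzureFile, Glusterfs, <a href="https://kubernetes.io/docs/concepts/storage/storage-classes/#provisioner">and more</a>. These are all “in-tree”, so the project has also been experimenting with external provisioners. The <strong><a href="https://github.com/kubernetes-incubator/external-storage">kubernetes-incubator/external-storage</a></strong> repository holds several incubating projects, although with the arrival of CSI they have probably been abandoned. The “in-tree” storage implementations are also being migrated to CSI.</p>
<h2 id="如何实现">How to Implement One</h2>
<p>Using the projects in the external-storage repository as a guide, let's briefly look at how to write a custom Dynamic Provisioner.</p>
<p>The projects in this repository are all quite simple, with few files and not much code. That is because they build on a <a href="https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner#sig-storage-lib-external-provisioner">library</a> from the official community, which implements the whole <code>Provisioner Controller</code> flow, including watching and creating PV resources. All we have to do is implement the two methods of the <code>Provisioner</code> interface:</p>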
<pre><code>// Provisioner is an interface that creates templates for PersistentVolumes
// and can create the volume as a new resource in the infrastructure provider.
// It can also remove the volume it created from the underlying storage
// provider.
type Provisioner interface {
	// Provision creates a volume i.e. the storage asset and returns a PV object
	// for the volume
	Provision(ProvisionOptions) (*v1.PersistentVolume, error)
	// Delete removes the storage asset that was created by Provision backing the
	// given PV. Does not delete the PV object itself.
	//
	// May return IgnoredError to indicate that the call has been ignored and no
	// action taken.
	Delete(*v1.PersistentVolume) error
}
</code></pre>
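<p><code>Provision</code> allocates storage from the given data and returns a PV object. <code>Delete</code> removes the data in the backing storage when the PV is deleted.</p>
<p>We pick the nfs project from the repository for a closer look. Unlike the client-style projects, it also runs its own nfs server, so it does not depend on any external storage service. In <code>main</code>, the <code>runServer</code> flag decides whether to start the server; it defaults to <code>true</code>:</p>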
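<p>To make the division of labour concrete, here is a minimal sketch of the same two-method contract backed by plain directories. It deliberately avoids the Kubernetes libraries: <code>volume</code> and <code>dirProvisioner</code> are hypothetical, simplified stand-ins for <code>v1.PersistentVolume</code> and a real <code>Provisioner</code>, and the method signatures are simplified to match.</p>
<pre><code class="language-go">package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// volume is a hypothetical, simplified stand-in for a PersistentVolume.
type volume struct {
	name string
	path string
}

// dirProvisioner backs every volume with a directory under exportDir.
type dirProvisioner struct {
	exportDir string
}

// Provision creates the backing directory and returns the volume describing it.
func (p dirProvisioner) Provision(name string) (volume, error) {
	path := filepath.Join(p.exportDir, name)
	if err := os.MkdirAll(path, 0o750); err != nil {
		return volume{}, fmt.Errorf("creating directory for volume %s: %w", name, err)
	}
	return volume{name: name, path: path}, nil
}

// Delete removes the backing directory; the volume value itself is untouched.
func (p dirProvisioner) Delete(v volume) error {
	if err := os.RemoveAll(v.path); err != nil {
		return fmt.Errorf("deleting backing path of volume %s: %w", v.name, err)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "exports")
	if err != nil {
		fmt.Println("cannot create export dir:", err)
		return
	}
	defer os.RemoveAll(dir)

	p := dirProvisioner{exportDir: dir}
	v, err := p.Provision("pvc-demo")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("provisioned:", v.name)

	if err := p.Delete(v); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("deleted:", v.name)
}
</code></pre>
<p>The controller library owns everything else (watching claims, creating and deleting the PV objects), which is why a provisioner can stay this small.</p>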
<pre><code>	if *runServer {
		......
		go func() {
			for {
				// This blocks until server exits (presumably due to an error)
				err = server.Run(ganeshaLog, ganeshaPid, ganeshaConfig)
				if err != nil {
					glog.Errorf(&quot;NFS server Exited Unexpectedly with err: %v&quot;, err)
				}

				// take a moment before trying to restart
				time.Sleep(time.Second)
			}
		}()
		// Wait for NFS server to come up before continuing provisioner process
		time.Sleep(5 * time.Second)
	}
</code></pre>
<p>The Provisioner service is then started through the <code>Run</code> method of the <code>Provisioner Controller</code>:</p>
<pre><code>	// Create the provisioner: it implements the Provisioner interface expected by
	// the controller
	nfsProvisioner := vol.NewNFSProvisioner(exportDir, clientset, outOfCluster, *useGanesha, ganeshaConfig, *enableXfsQuota, *serverHostname, *maxExports, *exportSubnet)

	// Start the provision controller which will dynamically provision NFS PVs
	pc := controller.NewProvisionController(
		clientset,
		*provisioner,
		nfsProvisioner,
		serverVersion.GitVersion,
	)

	pc.Run(wait.NeverStop)
</code></pre>
<p><code>NewNFSProvisioner</code> returns a struct that implements the <code>Provisioner</code> interface:</p>
<pre><code>type nfsProvisioner struct {
  ......
}

var _ controller.Provisioner = &amp;nfsProvisioner{}
</code></pre>
<p>Next, here is how <code>Provision</code> is implemented:</p>
<pre><code>// options carries the data needed to create the pv: pvName, pvc, sc, selectedNode, and so on
func (p *nfsProvisioner) Provision(options controller.ProvisionOptions) (*v1.PersistentVolume, error) {
	// Validation, directory creation, and similar work happen here
	volume, err := p.createVolume(options)
	if err != nil {
		return nil, err
	}

	annotations := make(map[string]string)
  ......

	pv := &amp;v1.PersistentVolume{
		ObjectMeta: metav1.ObjectMeta{
			Name:        options.PVName,
			Labels:      map[string]string{},
			Annotations: annotations,
		},
		Spec: v1.PersistentVolumeSpec{
			PersistentVolumeReclaimPolicy: *options.StorageClass.ReclaimPolicy,
			AccessModes:                   options.PVC.Spec.AccessModes,
			Capacity: v1.ResourceList{
				v1.ResourceName(v1.ResourceStorage): options.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)],
			},
			PersistentVolumeSource: v1.PersistentVolumeSource{
				NFS: &amp;v1.NFSVolumeSource{
					Server:   volume.server,
					Path:     volume.path,
					ReadOnly: false,
				},
			},
			MountOptions: options.StorageClass.MountOptions,
		},
	}

	return pv, nil
}

func (p *nfsProvisioner) createVolume(options controller.ProvisionOptions) (volume, error) {
	// Validate here that the request does not exceed the free disk space; only the space free right now is considered
	gid, rootSquash, mountOptions, err := p.validateOptions(options)
	if err != nil {
		return volume{}, fmt.Errorf(&quot;error validating options for volume: %v&quot;, err)
	}
  ......
	// Create the directory for the volume
	path := path.Join(p.exportDir, options.PVName)

	err = p.createDirectory(options.PVName, gid)
	if err != nil {
		return volume{}, fmt.Errorf(&quot;error creating directory for volume: %v&quot;, err)
	}
  ......
}


func (p *nfsProvisioner) validateOptions(options controller.ProvisionOptions) (string, bool, string, error) {
  ......
	var stat syscall.Statfs_t
	if err := syscall.Statfs(p.exportDir, &amp;stat); err != nil {
		return &quot;&quot;, false, &quot;&quot;, fmt.Errorf(&quot;error calling statfs on %v: %v&quot;, p.exportDir, err)
	}
	capacity := options.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)]
	requestBytes := capacity.Value()
	available := int64(stat.Bavail) * int64(stat.Bsize)
	if requestBytes &gt; available {
		return &quot;&quot;, false, &quot;&quot;, fmt.Errorf(&quot;insufficient available space %v bytes to satisfy claim for %v bytes&quot;, available, requestBytes)
	}

	return gid, rootSquash, mountOptions, nil
}
</code></pre>
<p>Then the implementation of <code>Delete</code>:</p>
<pre><code>func (p *nfsProvisioner) Delete(volume *v1.PersistentVolume) error {
  ......
	// Once the pv is deleted, remove its directory
	err = p.deleteDirectory(volume)
	if err != nil {
		return fmt.Errorf(&quot;error deleting volume's backing path: %v&quot;, err)
	}
  ......
	return nil
}
</code></pre>
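<p>This is only a brief walkthrough of a <code>Provisioner</code> implementation; other operations such as <code>xfs quota</code> are left out, and the project itself is worth a look if you are interested. Note that although the project deploys an nfs server, it is not set up as distributed storage, which limits it considerably. It is an experimental project after all, so be careful about using it in production.</p>
<h2 id="后记">Afterword</h2>
<p>I happened to run into the nfs Provisioner in a project, and testing plus source analysis confirmed that it is not usable. This article came out of the reading that followed, and serves as preparation for learning CSI later.</p>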
]]></content>
    </entry>
    <entry>
        <title type="html"><![CDATA[Vue Learning Roadmap]]></title>
        <id>https://cnbailian.github.io/post/vue-learning-route/</id>
        <link href="https://cnbailian.github.io/post/vue-learning-route/">
        </link>
        <updated>2019-04-10T11:00:00.000Z</updated>
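        <summary type="html"><![CDATA[<p>This article lays out a learning path for the Vue framework: understand the framework through its basic concepts, get to know the ecosystem, and finally dig into the framework itself. It does not cover usage details, and it only touches on each topic briefly.</p>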
]]></summary>
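        <content type="html"><![CDATA[<p>This article lays out a learning path for the Vue framework: understand the framework through its basic concepts, get to know the ecosystem, and finally dig into the framework itself. It does not cover usage details, and it only touches on each topic briefly.</p>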
<!--more-->  
<h2 id="为什么选择vue">Why Choose Vue</h2>
<p>Many people have heard of Vue without knowing it in any detail, so here is a short list of its strengths.</p>
<ul>
<li>
<p>Vue has the most stars of any front-end framework, and its large developer base keeps the community thriving.</p>
</li>
<li>
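<p>A relatively gentle learning curve, mainly because Vue is a progressive framework and uses plain HTML template syntax, which makes it quick to pick up for anyone with HTML experience.</p>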
</li>
<li>
<p>A progressive framework also makes it easier to migrate an existing project step by step.</p>
</li>
<li>
<p>The team includes expert developers from around the world, and the Chinese community and documentation are of fairly good quality.</p>
</li>
</ul>
<h2 id="学习路线">Learning Path</h2>
<ol>
<li>
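<h3 id="javascript与web基础">JavaScript and Web Fundamentals</h3>
<p>Before learning Vue you need a basic grounding in JavaScript and web development, just as you need to know English before reading an English book.</p>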
</li>
<li>
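<h3 id="vue-基本概念">Vue Basic Concepts</h3>
<p>To build a project with Vue, start with a few basic concepts:</p>
<p><strong>Progressive framework</strong></p>
<p>Progressive means step by step: you do not have to adopt everything at the start.</p>
<p>In Vue this shows up as a core library that covers only the view layer; client-side routing, global state management, and the rest come from core plugins.</p>
<p>By design Vue addresses most of the problems of building large single-page applications, but you do not need all of it from day one. That gives it a gentler learning curve and makes it possible to refactor legacy projects incrementally.</p>
<p><strong>Declarative rendering</strong></p>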
<pre><code class="language-HTML">&lt;div id=&quot;app&quot;&gt;
  {{ message }}
&lt;/div&gt;
</code></pre>
<pre><code class="language-javascript">var app = new Vue({
  el: '#app',
  data: {
    message: 'Hello Vue!'
  }
})
</code></pre>
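<p>The sample above is declarative rendering: you write the result you want, and the framework carries out the rendering.</p>
<p><strong>Reactive data</strong></p>
<p>In the sample above, the data is now linked to the DOM and has become reactive. Change the value of <code>app.message</code> and the page updates to match.</p>
<p><strong>Components</strong></p>
<p>The core idea of componentization is to map the page structure onto a tree of components.</p>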
<figure data-type="image" tabindex="1"><img src="https://www.superbed.cn/pic/5c4555f25f3e509ed94b6480" alt="component-tree.png" loading="lazy"></figure>
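<p>Components own their resources, can be reused, and can be nested inside one another.</p>
<p><strong>Single-page applications and client-side routing</strong></p>
<p>A single-page application (SPA) delivers in one page what a traditional site does in many. Client-side routing loads new content without the browser navigating and reloading the page.</p>
<p>Vue Router is Vue's implementation. It is maintained officially and loaded as a plugin.</p>
<p><strong>State management</strong></p>
<p>In Vue each component manages its own state. When state has to be shared across components, it is pulled out and managed as global state so that any component can read it.</p>
<p>That is what Vuex does.</p>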
</li>
<li>
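<h3 id="使用-vue-构建单页面应用">Building a Single-Page Application with Vue</h3>
<p>The concepts above help in understanding Vue; applying it to a real project takes a bit more.</p>
<p><strong>Build tooling</strong></p>
<p>Vue provides an official CLI, Vue CLI, which sets up the otherwise tedious scaffolding of a single-page application.</p>
<p>The latest version, Vue CLI 3, adds a GUI, which makes it friendlier to use.</p>
<p><strong>Calling Web APIs with axios</strong></p>
<p>One of Vue's core ideas is being data-driven: the view is generated from data. Vue links data to the DOM to make it reactive, so to change the view we do not touch the DOM directly; we change the data and the view follows.</p>
<p>A single-page application gets its data from Web APIs, which may be exposed as a RESTful API or GraphQL, or over WebSocket.</p>
<p>For HTTP, the Vue Cookbook recommends axios, which is promise based.</p>
<p><strong>Testing</strong></p>
<p>Testing is essential if you want a stable, maintainable project.</p>
<p>The Vue team provides Vue Test Utils, which tests a component by mounting it in isolation, mocking the necessary inputs, and asserting on the outputs.</p>
<p><strong>Chrome developer tools</strong></p>
<p>Vue.js devtools is a developer tool for Chrome that shows the component tree, component state, and more. With Vuex it also shows the global state and can export a snapshot to send to someone else, who can import it in the console to help locate a problem.</p>
<p><strong>Multi-platform support</strong></p>
<p>Vue can be used in Weex. The Vue team is also working more closely with the Weex team, and the upcoming Vue 3 is expected to support it better.</p>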
</li>
<li>
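<h3 id="前端技术栈">The Front-End Stack</h3>
<p>Most of what has been covered so far is Vue or tools from its ecosystem. But Vue does not stand alone; it is only one part of the front-end stack.</p>
<p><strong>Modern JavaScript and Babel</strong></p>
<p>Vue applications can be written in ES5, the JavaScript standard that all modern browsers support.</p>
<p>For a better development experience you can move to ES2015 or later, but older browsers will not support it. Babel solves this by compiling the new syntax down to ES5.</p>
<p><strong>Webpack</strong></p>
<p>Webpack is a module bundler: it packs the code of your application's modules into one or more files that the browser can load. During bundling it can also transform code, for example with Babel, Sass, or TypeScript.</p>
<p>Vue CLI generates a basic webpack configuration for us, and the newer versions let you adjust it through the GUI, but that does not excuse you from learning webpack; sooner or later you will have to tune its configuration yourself.</p>
<p><strong>TypeScript and Flow</strong></p>
<p>Vue 2 uses Flow; Vue 3 is being rewritten in TypeScript.</p>
<p>Both exist mainly to give js a type system. They help you write more robust code and compile to plain ES syntax.</p>
<p>Vue 3 being written entirely in TypeScript does not mean you have to use it. If you want to read the Vue source, though, there is no way around it.</p>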
</li>
<li>
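<h3 id="vue-生态系统">The Vue Ecosystem</h3>
<p><strong>Official core plugins</strong></p>
<p>Vue Router and Vuex, mentioned above, along with Vue SSR, are all maintained by the core team. This differs from React; the reasoning is that community maintenance leads to frequent churn and a confusing spread of competing solutions.</p>
<p><strong>Official tools</strong></p>
<p>The same reasoning applies to Vue devtools and Vue CLI, also mentioned above, and to Vue Loader. That does not shut the community out: these are open source projects, so anyone can make suggestions and fix issues, while the core team sets an overall direction.</p>
<p><strong>UI component libraries</strong></p>
<p>Also called UI frameworks, these are collections of commonly used components, such as Form and Table, that speed up development.</p>
<p>There are many UI frameworks to choose from, including Element UI, iView, and Vux, each with its own style.</p>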
</li>
<li>
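<h3 id="深入理解-vue">Understanding Vue in Depth</h3>
<p><strong>Why a progressive framework</strong></p>
<p><em>Frameworks exist to help us deal with complexity - 《Vue 2.0——渐进式前端解决方案》</em></p>
<p>When building a front-end application we run into many problems, which together can be called application complexity. Front-end frameworks exist to reduce that complexity by solving recurring problems that already have good solutions.</p>
<p>However, a framework has a learning curve of its own and therefore introduces a different kind of complexity, called framework complexity. How to balance application complexity against framework complexity becomes a question.</p>
<p>React and Vue take the same approach: meet varying application complexity with elastically scalable framework complexity. The core library covers only the view layer, and every other problem is handled by optional add-on libraries or tools.</p>
<p>The Facebook team focuses only on React itself and leaves other problems to community solutions. The community is very active and has produced many excellent ideas, but that activity has side effects: versions change too quickly, one problem has so many solutions that choosing is hard, and libraries may not fit together smoothly.</p>
<p>The Vue team chose to be progressive: core plugins and tools are developed by the team, which keeps the overall direction unified, yet they remain modular and optional.</p>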
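<p><strong>Declarative rendering</strong></p>
<p>Vue, like modern JavaScript frameworks in general, holds that data state is the single source of truth and DOM state is only a mapping of it. All logic should operate on state; when state changes, the framework automatically updates the DOM to match.</p>
<p>So how does Vue do this? Mainly with a virtual DOM.</p>
<p>Put simply, a virtual DOM describes a DOM node with a JavaScript object. The motivation is that a DOM element in the browser is very heavy, carrying all kinds of properties and events because browser standards are designed that way. Compared with DOM objects, plain JavaScript objects are faster and simpler to work with.</p>
<p>Vue maps all the DOM it manages to a virtual DOM tree. The tree is lightweight, and its job is to describe the DOM state of the current page.</p>
<p>When data state changes, Vue's reactivity system detects the change, generates a new virtual DOM tree, compares it with the previous tree, and applies the differences to the real DOM.</p>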
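<p>The compare-and-apply step can be sketched in plain JavaScript. This is a deliberately naive illustration, not Vue's actual patch algorithm: the helpers <code>h</code> and <code>diff</code> and the patch types are assumptions made for this example, children are matched by index only, and patches are collected as data instead of being applied to a real DOM.</p>
<pre><code class="language-js">// A virtual node is a plain object describing one DOM element.
// Text children are represented as plain strings.
function h(tag, props, children) {
  return { tag: tag, props: props || {}, children: children || [] };
}

// Walk two trees in parallel and collect descriptions of the changes.
function diff(oldNode, newNode, path, patches) {
  path = path || 'root';
  patches = patches || [];
  if (oldNode === undefined) {
    patches.push({ type: 'CREATE', path: path, node: newNode });
  } else if (newNode === undefined) {
    patches.push({ type: 'REMOVE', path: path });
  } else if (typeof oldNode === 'string' || typeof newNode === 'string') {
    // At least one side is a text node: replace it unless the text is identical.
    if (oldNode !== newNode) {
      patches.push({ type: 'REPLACE', path: path, node: newNode });
    }
  } else if (oldNode.tag !== newNode.tag) {
    patches.push({ type: 'REPLACE', path: path, node: newNode });
  } else {
    // Same tag: compare props first, then recurse into children by index.
    var keys = Object.keys(Object.assign({}, oldNode.props, newNode.props));
    keys.forEach(function (key) {
      if (oldNode.props[key] !== newNode.props[key]) {
        patches.push({ type: 'SET_PROP', path: path, key: key, value: newNode.props[key] });
      }
    });
    var length = Math.max(oldNode.children.length, newNode.children.length);
    for (var i = 0; i !== length; i += 1) {
      diff(oldNode.children[i], newNode.children[i], path + '/' + i, patches);
    }
  }
  return patches;
}

var before = h('ul', { class: 'list' }, [h('li', null, ['a']), h('li', null, ['b'])]);
var after = h('ul', { class: 'list done' }, [
  h('li', null, ['a']),
  h('li', null, ['c']),
  h('li', null, ['d'])
]);
var patches = diff(before, after);
</code></pre>
<p>For these two trees the sketch yields three patches: a <code>SET_PROP</code> on the root, a <code>REPLACE</code> of the text at <code>root/1/0</code>, and a <code>CREATE</code> at <code>root/2</code>. The unchanged first list item produces no work at all, which is the point of diffing.</p>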
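<p>Unlike React, Vue can use HTML templates as well as JSX. This works because Vue compiles templates into render functions at compile time.</p>
<p><strong>State management</strong></p>
<p>State management essentially abstracts the whole application into the cycle shown below: State drives the rendering of the View; user interaction with the View produces Actions; Actions change the State, which causes the View to re-render. This is one-way data flow.</p>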
<figure data-type="image" tabindex="2"><img src="https://www.superbed.cn/pic/5c4554ae5f3e509ed94b5b8c" alt="state-单向数据流.png" loading="lazy"></figure>
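<p>In Vue, a single component already has this structure. But how should state be managed when several components share it, or when actions from different views change the same state? The answer is Vuex.</p>
<p>Vuex extracts the shared state of components into a Store. Components dispatch (<code>dispatch</code>) Actions, Actions commit (<code>commit</code>) Mutations that modify the State, and the change is then reflected in the components.</p>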
<figure data-type="image" tabindex="3"><img src="https://www.superbed.cn/pic/5c4554c75f3e509ed94b5c5d" alt="vuex.png" loading="lazy"></figure>
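<p>The dispatch, commit, and state cycle described above can be sketched without Vuex. The code below is a hypothetical miniature, not the Vuex API: <code>createMiniStore</code>, <code>getState</code>, and <code>subscribe</code> are names invented for this example, and a subscriber callback stands in for a component re-rendering.</p>
<pre><code class="language-js">// Hypothetical mini store: only mutations may change state,
// and actions reach state by committing mutations.
function createMiniStore(options) {
  var state = options.state;
  var subscribers = [];
  var store = {
    getState: function () { return state; },
    subscribe: function (fn) { subscribers.push(fn); },
    commit: function (type, payload) {
      options.mutations[type](state, payload);
      // Notify every subscriber (the stand-in for a view) after each mutation.
      subscribers.forEach(function (fn) { fn(state); });
    },
    dispatch: function (type, payload) {
      return options.actions[type]({ commit: store.commit, state: state }, payload);
    }
  };
  return store;
}

var store = createMiniStore({
  state: { count: 0 },
  mutations: {
    increment: function (state, amount) { state.count += amount; }
  },
  actions: {
    incrementTwice: function (context, amount) {
      context.commit('increment', amount);
      context.commit('increment', amount);
    }
  }
});

var rendered = [];
store.subscribe(function (state) { rendered.push('count is ' + state.count); });
store.dispatch('incrementTwice', 5);
</code></pre>
<p>After the dispatch, <code>rendered</code> holds one entry per committed mutation and the count is 10. Keeping every state change inside a named mutation is what makes the flow traceable; the real library adds reactivity, modules, and devtools integration on top of this idea.</p>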
</li>
<li>
<h3 id="实现原理">Implementation Principles</h3>
<p><strong>Lifecycle</strong></p>
<figure data-type="image" tabindex="4"><img src="https://www.superbed.cn/pic/5c45553b5f3e509ed94b5ec4" alt="lifecycle.png" loading="lazy"></figure>
<p><strong>Virtual DOM</strong></p>
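<p>How the Virtual DOM is implemented in Vue.</p>
<p><strong>How reactive data works</strong></p>
<p>Vue 2 builds its data observation system on ES5's <code>Object.defineProperty</code>, which is also why Vue 2 cannot support IE8 and below.</p>
<p>The upcoming Vue 3 will rebuild the observation system on <code>Proxy</code>. As a result, Vue 3 will not support IE11 and below; the Vue team plans to offer another way to address this.</p>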
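<p>The difference between the two mechanisms shows up even in a tiny sketch. The helpers <code>observeWithDefineProperty</code>, <code>observeWithProxy</code>, and <code>record</code> below are assumptions made for illustration and are far simpler than Vue's real observer, which also tracks dependencies and nested objects.</p>
<pre><code class="language-js">// ES5 approach: redefine each existing property with a getter and setter.
function observeWithDefineProperty(target, onChange) {
  Object.keys(target).forEach(function (key) {
    var value = target[key];
    Object.defineProperty(target, key, {
      enumerable: true,
      configurable: true,
      get: function () { return value; },
      set: function (next) {
        if (next === value) { return; }
        value = next;
        onChange(key, next);
      }
    });
  });
  return target;
}

// Proxy approach: intercept every write on the object, including new keys.
function observeWithProxy(target, onChange) {
  return new Proxy(target, {
    set: function (obj, key, next) {
      var changed = obj[key] !== next;
      obj[key] = next;
      if (changed) { onChange(key, next); }
      return true;
    }
  });
}

var log = [];
function record(prefix) {
  return function (key, value) { log.push(prefix + ':' + key + '=' + value); };
}

var a = observeWithDefineProperty({ title: 'old' }, record('es5'));
a.title = 'new'; // intercepted by the setter
a.extra = 1;     // not intercepted: the property did not exist when observed

var b = observeWithProxy({ title: 'old' }, record('proxy'));
b.title = 'new'; // intercepted
b.extra = 1;     // also intercepted, because the proxy sees every write
</code></pre>
<p>Here <code>log</code> ends up with one entry from the ES5 observer and two from the proxy. The missed <code>extra</code> write is the well-known Vue 2 caveat that properties added after observation are not reactive unless they are added through <code>Vue.set</code>.</p>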
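<p><strong>Compilation and render functions</strong></p>
<p>Vue compiles templates into render functions, and Vue 3 brings considerable optimizations here.</p>
<p><strong>Componentization</strong></p>
<p>Every component is a Vue instance. Topics here include how a component works internally and how nesting between components is implemented.</p>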
<p><strong>v-model</strong></p>
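<p>Vue provides the <code>v-model</code> directive for two-way binding between form inputs and data state. It does not break one-way data flow; it is only syntactic sugar. The two lines below are equivalent.</p>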
<pre><code class="language-html">&lt;input v-model=&quot;sth&quot; /&gt;
&lt;input v-bind:value=&quot;sth&quot; v-on:input=&quot;sth = $event.target.value&quot; /&gt;
</code></pre>
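<p><strong>Core plugins</strong></p>
<p>Vue Router: the various problems of client-side routing, such as nested routes, redirects and aliases, and lazy loading.</p>
<p>Vuex: the initialization process, how global state is managed, and so on.</p>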
</li>
</ol>
<h2 id="思维导图">Mind Map</h2>
<figure data-type="image" tabindex="5"><img src="https://www.superbed.cn/pic/5c4554da5f3e509ed94b5cd6" alt="Vue 学习路线.png" loading="lazy"></figure>
<h2 id="相关学习资料">Related Learning Resources</h2>
<p><a href="https://www.infoq.cn/article/vue-2-progressive-front-end-solution">《Vue 2.0——渐进式前端解决方案》</a> 尤雨溪</p>
<p><a href="https://cn.vuejs.org/v2/guide/">《Vue Guide》</a> Vue official team</p>
<p><a href="https://ustbhuangyi.github.io/vue-analysis/">《Vue.js 技术揭秘》</a> ustbhuangyi</p>
<p><a href="https://vuex.vuejs.org/zh/">《Vuex》</a> Vuex</p>
<p><a href="https://vue.w3ctech.com/">VueConf</a> VueConf</p>
<p><a href="https://vuejsdevelopers.com/">Vue.js developers</a> vuejsdevelopers.com</p>
<h2 id="参考文章">References</h2>
<p><a href="https://www.infoq.cn/article/9XymmTqu*4QwahqikMka">《2019 年 Vue 学习路线图》</a></p>
]]></content>
    </entry>
</feed>