The Scheduler is a core component on the cluster's master node. Its job is to place each Pod onto a suitable node in the Kubernetes cluster, based on the Pod's resource requests, affinity rules, and other scheduling criteria.
1. Entry Function
The entry function has the same structure as the Controller Manager's: it likewise uses the cobra package to register the kube-scheduler command on the command line.
cmd/kube-scheduler/scheduler.go
func main() {
    rand.Seed(time.Now().UnixNano())

    command := app.NewSchedulerCommand()

    // TODO: once we switch everything over to Cobra commands, we can go back to calling
    // utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
    // normalize func and add the go flag set by hand.
    pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
    // utilflag.InitFlags()
    logs.InitLogs()
    defer logs.FlushLogs()

    if err := command.Execute(); err != nil {
        fmt.Fprintf(os.Stderr, "%v\n", err)
        os.Exit(1)
    }
}
The core method here is again NewSchedulerCommand. It lives in app/server.go and is structured almost identically to the Controller Manager's equivalent, so it is not reproduced here. The essential part is once more the Run field of the cobra.Command struct, which calls the runCommand method.
runCommand builds the Config for the Scheduler and ultimately returns the result of the Run method, which starts the Scheduler; a simplified sketch of that flow follows.
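The sketch below is a minimal rendering of runCommand, not the verbatim source: flag validation, the real signal handler, and most error handling are omitted, and the opts.Config() / c.Complete() helpers reflect the options/config pattern used by this component.

func runCommand(cmd *cobra.Command, args []string, opts *options.Options) error {
    // Build the scheduler's Config from the command-line options.
    c, err := opts.Config()
    if err != nil {
        return err
    }

    // In the real code the stop channel comes from a signal handler.
    stopCh := make(chan struct{})

    // Apply feature gates to the registered scheduling algorithms (discussed below).
    algorithmprovider.ApplyFeatureGates()

    // Fill in the remaining defaults and hand the completed config to Run.
    cc := c.Complete()
    return Run(cc, stopCh)
}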
One line in the middle of runCommand is worth noting:
func runCommand(cmd *cobra.Command, args []string, opts *options.Options) error {
    ...
    algorithmprovider.ApplyFeatureGates()
    ...
}
This line calls ApplyFeatureGates, which adjusts the registered Pod scheduling algorithms according to the enabled feature gates. The detailed algorithm implementations live in pkg/scheduler/algorithm and pkg/scheduler/algorithmprovider and are not covered in depth here; a purely conceptual sketch of the idea follows.
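To make the idea concrete, here is an illustrative, self-contained sketch (not kube-scheduler source): predicates live in a registry keyed by name, and a feature gate decides which predicates end up registered. All names in it are hypothetical stand-ins.

package main

import "fmt"

// FitPredicate is a stand-in for a scheduling predicate: does this pod fit this node?
type FitPredicate func(pod, node string) bool

// predicateRegistry plays the role of the scheduler's algorithm registry.
var predicateRegistry = map[string]FitPredicate{}

// featureGates stands in for the component's feature-gate settings.
var featureGates = map[string]bool{"TaintNodesByCondition": true}

// registerDefaults registers a baseline set of predicates.
func registerDefaults() {
    predicateRegistry["PodFitsResources"] = func(pod, node string) bool { return true }
    predicateRegistry["CheckNodeCondition"] = func(pod, node string) bool { return true }
}

// applyFeatureGates mimics the spirit of ApplyFeatureGates: depending on which
// gates are enabled, predicates are swapped in or out of the registry.
func applyFeatureGates() {
    if featureGates["TaintNodesByCondition"] {
        delete(predicateRegistry, "CheckNodeCondition")
        predicateRegistry["PodToleratesNodeTaints"] = func(pod, node string) bool { return true }
    }
}

func main() {
    registerDefaults()
    applyFeatureGates()
    for name := range predicateRegistry {
        fmt.Println("registered predicate:", name)
    }
}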
2. Run
Now let's look at the Run method:
func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
    // Create the scheduler.
    sched, err := scheduler.New(cc.Client,
        cc.InformerFactory.Core().V1().Nodes(),
        cc.PodInformer,
        cc.InformerFactory.Core().V1().PersistentVolumes(),
        cc.InformerFactory.Core().V1().PersistentVolumeClaims(),
        cc.InformerFactory.Core().V1().ReplicationControllers(),
        cc.InformerFactory.Apps().V1().ReplicaSets(),
        cc.InformerFactory.Apps().V1().StatefulSets(),
        cc.InformerFactory.Core().V1().Services(),
        cc.InformerFactory.Policy().V1beta1().PodDisruptionBudgets(),
        cc.InformerFactory.Storage().V1().StorageClasses(),
        cc.Recorder,
        cc.ComponentConfig.AlgorithmSource,
        stopCh,
        scheduler.WithName(cc.ComponentConfig.SchedulerName),
        scheduler.WithHardPodAffinitySymmetricWeight(cc.ComponentConfig.HardPodAffinitySymmetricWeight),
        scheduler.WithPreemptionDisabled(cc.ComponentConfig.DisablePreemption),
        scheduler.WithPercentageOfNodesToScore(cc.ComponentConfig.PercentageOfNodesToScore),
        scheduler.WithBindTimeoutSeconds(*cc.ComponentConfig.BindTimeoutSeconds))
    if err != nil {
        return err
    }

    // Prepare the event broadcaster.
    if cc.Broadcaster != nil && cc.EventClient != nil {
        cc.Broadcaster.StartLogging(klog.V(6).Infof)
        cc.Broadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: cc.EventClient.Events("")})
    }

    // Setup healthz checks.
    var checks []healthz.HealthzChecker
    if cc.ComponentConfig.LeaderElection.LeaderElect {
        checks = append(checks, cc.LeaderElection.WatchDog)
    }

    // Start up the healthz server.
    if cc.InsecureServing != nil {
        separateMetrics := cc.InsecureMetricsServing != nil
        handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, separateMetrics, checks...), nil, nil)
        if err := cc.InsecureServing.Serve(handler, 0, stopCh); err != nil {
            return fmt.Errorf("failed to start healthz server: %v", err)
        }
    }
    if cc.InsecureMetricsServing != nil {
        handler := buildHandlerChain(newMetricsHandler(&cc.ComponentConfig), nil, nil)
        if err := cc.InsecureMetricsServing.Serve(handler, 0, stopCh); err != nil {
            return fmt.Errorf("failed to start metrics server: %v", err)
        }
    }
    if cc.SecureServing != nil {
        handler := buildHandlerChain(newHealthzHandler(&cc.ComponentConfig, false, checks...), cc.Authentication.Authenticator, cc.Authorization.Authorizer)
        if err := cc.SecureServing.Serve(handler, 0, stopCh); err != nil {
            // fail early for secure handlers, removing the old error loop from above
            return fmt.Errorf("failed to start healthz server: %v", err)
        }
    }

    // Start all informers.
    go cc.PodInformer.Informer().Run(stopCh)
    cc.InformerFactory.Start(stopCh)

    // Wait for all caches to sync before scheduling.
    cc.InformerFactory.WaitForCacheSync(stopCh)
    controller.WaitForCacheSync("scheduler", stopCh, cc.PodInformer.Informer().HasSynced)

    // Prepare a reusable runCommand function.
    run := func(ctx context.Context) {
        sched.Run()
        <-ctx.Done()
    }

    ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
    defer cancel()

    go func() {
        select {
        case <-stopCh:
            cancel()
        case <-ctx.Done():
        }
    }()

    // If leader election is enabled, runCommand via LeaderElector until done and exit.
    if cc.LeaderElection != nil {
        cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() {
                utilruntime.HandleError(fmt.Errorf("lost master"))
            },
        }
        leaderElector, err := leaderelection.NewLeaderElector(*cc.LeaderElection)
        if err != nil {
            return fmt.Errorf("couldn't create leader elector: %v", err)
        }

        leaderElector.Run(ctx)

        return fmt.Errorf("lost lease")
    }

    // Leader election is disabled, so runCommand inline until done.
    run(ctx)
    return fmt.Errorf("finished without leader elect")
}
The Run method covers quite a lot, so let's walk through it step by step:
(1) Creating the Scheduler.
The first few lines of Run call the New method to create a Scheduler object. This New method lives in pkg/scheduler/scheduler.go:
pkg/scheduler/scheduler.go
func New(client clientset.Interface,
    nodeInformer coreinformers.NodeInformer,
    podInformer coreinformers.PodInformer,
    pvInformer coreinformers.PersistentVolumeInformer,
    pvcInformer coreinformers.PersistentVolumeClaimInformer,
    replicationControllerInformer coreinformers.ReplicationControllerInformer,
    replicaSetInformer appsinformers.ReplicaSetInformer,
    statefulSetInformer appsinformers.StatefulSetInformer,
    serviceInformer coreinformers.ServiceInformer,
    pdbInformer policyinformers.PodDisruptionBudgetInformer,
    storageClassInformer storageinformers.StorageClassInformer,
    recorder record.EventRecorder,
    schedulerAlgorithmSource kubeschedulerconfig.SchedulerAlgorithmSource,
    stopCh <-chan struct{},
    opts ...func(o *schedulerOptions)) (*Scheduler, error) {

    options := defaultSchedulerOptions
    for _, opt := range opts {
        opt(&options)
    }

    // Set up the configurator which can create schedulers from configs.
    configurator := factory.NewConfigFactory(&factory.ConfigFactoryArgs{
        SchedulerName:                  options.schedulerName,
        Client:                         client,
        NodeInformer:                   nodeInformer,
        PodInformer:                    podInformer,
        PvInformer:                     pvInformer,
        PvcInformer:                    pvcInformer,
        ReplicationControllerInformer:  replicationControllerInformer,
        ReplicaSetInformer:             replicaSetInformer,
        StatefulSetInformer:            statefulSetInformer,
        ServiceInformer:                serviceInformer,
        PdbInformer:                    pdbInformer,
        StorageClassInformer:           storageClassInformer,
        HardPodAffinitySymmetricWeight: options.hardPodAffinitySymmetricWeight,
        DisablePreemption:              options.disablePreemption,
        PercentageOfNodesToScore:       options.percentageOfNodesToScore,
        BindTimeoutSeconds:             options.bindTimeoutSeconds,
    })

    var config *factory.Config
    source := schedulerAlgorithmSource
    switch {
    case source.Provider != nil:
        // Create the config from a named algorithm provider.
        sc, err := configurator.CreateFromProvider(*source.Provider)
        if err != nil {
            return nil, fmt.Errorf("couldn't create scheduler using provider %q: %v", *source.Provider, err)
        }
        config = sc
    case source.Policy != nil:
        // Create the config from a user specified policy source.
        policy := &schedulerapi.Policy{}
        switch {
        case source.Policy.File != nil:
            if err := initPolicyFromFile(source.Policy.File.Path, policy); err != nil {
                return nil, err
            }
        case source.Policy.ConfigMap != nil:
            if err := initPolicyFromConfigMap(client, source.Policy.ConfigMap, policy); err != nil {
                return nil, err
            }
        }
        sc, err := configurator.CreateFromConfig(*policy)
        if err != nil {
            return nil, fmt.Errorf("couldn't create scheduler from policy: %v", err)
        }
        config = sc
    default:
        return nil, fmt.Errorf("unsupported algorithm source: %v", source)
    }

    // Additional tweaks to the config produced by the configurator.
    config.Recorder = recorder
    config.DisablePreemption = options.disablePreemption
    config.StopEverything = stopCh

    // Create the scheduler.
    sched := NewFromConfig(config)
    return sched, nil
}
New is long, but its logic is fairly clear. Essentially, it instantiates a Config from the Informers, algorithm source, and other parameters passed in, then calls NewFromConfig to build a Scheduler instance from that Config and return it. Note that the Scheduler also relies on a large number of Informers, including nodeInformer and podInformer, because it too needs to track resource changes promptly in order to adjust its scheduling decisions. Also note the variadic opts parameter: the scheduler.WithName, WithPreemptionDisabled, and similar calls seen in Run are functional options, sketched below.
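For reference, here is a plausible shape of one such functional option; this is not the verbatim source, and only the schedulerName field is confirmed by the New code above.

// WithName returns a closure that New later applies to its schedulerOptions.
func WithName(schedulerName string) func(o *schedulerOptions) {
    return func(o *schedulerOptions) {
        o.schedulerName = schedulerName
    }
}

// New then copies the defaults and lets each option mutate them:
//     options := defaultSchedulerOptions
//     for _, opt := range opts {
//         opt(&options)
//     }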
The NewConfigFactory method that creates the config lives in pkg/scheduler/factory/factory.go. Stepping into it, you can see that it registers callback functions (event handlers) for a series of Informers.
The most important of these are the two handlers on the PodInformer: already-scheduled Pods are stored in the scheduler cache, while unscheduled Pods go into a queue, with add, update, and delete handlers defined for each of the two kinds of Pods.
In addition, the ConfigFactory maintains a podQueue that holds the Pods waiting to be scheduled. A sketch of how such filtered handlers are typically wired up follows.
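The scheduled/unscheduled split relies on client-go's filtering event handlers, keyed on whether pod.Spec.NodeName is set. Below is a simplified sketch of that wiring rather than the verbatim NewConfigFactory code; the cache and podQueue operations are left as comments.

// Pods that already have a node feed the scheduler's internal cache.
podInformer.Informer().AddEventHandler(
    cache.FilteringResourceEventHandler{
        FilterFunc: func(obj interface{}) bool {
            pod, ok := obj.(*v1.Pod)
            return ok && len(pod.Spec.NodeName) != 0
        },
        Handler: cache.ResourceEventHandlerFuncs{
            AddFunc:    func(obj interface{}) { /* add the pod to the scheduler cache */ },
            UpdateFunc: func(old, new interface{}) { /* update it in the cache */ },
            DeleteFunc: func(obj interface{}) { /* remove it from the cache */ },
        },
    },
)

// Pods without a node go into podQueue and wait to be scheduled.
podInformer.Informer().AddEventHandler(
    cache.FilteringResourceEventHandler{
        FilterFunc: func(obj interface{}) bool {
            pod, ok := obj.(*v1.Pod)
            return ok && len(pod.Spec.NodeName) == 0
        },
        Handler: cache.ResourceEventHandlerFuncs{
            AddFunc:    func(obj interface{}) { /* enqueue into podQueue */ },
            UpdateFunc: func(old, new interface{}) { /* re-enqueue / update in podQueue */ },
            DeleteFunc: func(obj interface{}) { /* drop from podQueue */ },
        },
    },
)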
(2) Event broadcasting and health checks.
A few lines in the middle set up the event broadcaster and the healthz/metrics servers for the Scheduler. This is similar to the Controller Manager, so it is not covered here.
(3) Starting the Informers.
Notably, the Scheduler starts the PodInformer separately from the other Informers, because scheduling Pods is the Scheduler's core responsibility.
(4) Running the Scheduler.
This is the heart of the whole method: calling the Scheduler's Run method starts the Scheduler.
Stepping into Run, we find that it is very concise and does only two things:
pkg/scheduler/scheduler.go

func (sched *Scheduler) Run() {
    if !sched.config.WaitForCacheSync() {
        return
    }

    go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
First, it waits for the caches to sync; second, it calls the scheduleOne method repeatedly to perform the actual Pod scheduling (see the small wait.Until example below). We will continue with the details of scheduleOne in the next article.
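As a side note, wait.Until (from k8s.io/apimachinery/pkg/util/wait) keeps calling the given function, sleeping for the given period between calls, until the stop channel is closed; with a period of 0, scheduleOne is invoked back to back. A small self-contained example of that behavior:

package main

import (
    "fmt"
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

func main() {
    stop := make(chan struct{})

    // Close the stop channel after one second; wait.Until then returns.
    go func() {
        time.Sleep(1 * time.Second)
        close(stop)
    }()

    count := 0
    // Like sched.scheduleOne: the function runs repeatedly until stop closes.
    // Here the period is 200ms; the scheduler uses 0, i.e. back to back.
    wait.Until(func() {
        count++
        fmt.Println("scheduling pass", count)
    }, 200*time.Millisecond, stop)

    fmt.Println("stopped after", count, "passes")
}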
3. Summary
To sum up the Scheduler's logic: it registers a kube-scheduler command via cobra and runs it. When the command runs, it first applies the configured scheduling algorithms (adjusted by feature gates), then creates a Scheduler instance based on the ConfigFactory, starts the relevant Informers, and finally begins scheduling.