Documentation/scheduler/sched-domains

丁善
2023-12-01

Chinese translated version of Documentation/scheduler/sched-domains

If you have any comment or update to the content, please contact the
original document maintainer directly.  However, if you have a problem
communicating in English you can also ask the Chinese maintainer for
help.  Contact the Chinese maintainer if this translation is outdated
or if there is a problem with the translation.

Chinese maintainer: 799942107@qq.com
---------------------------------------------------------------------
Documentation/scheduler/sched-domains的中文翻译


如果想评论或更新本文的内容,请直接联系原文档的维护者。如果你使用英文
交流有困难的话,也可以向中文版维护者求助。如果本翻译更新不及时或者翻
译存在问题,请联系中文版维护者。

中文版维护者: 黄佳露   799942107@qq.com
中文版翻译者: 黄佳露   799942107@qq.com
中文版校译者: 潘丽卡   774945605@qq.com


以下为正文
---------------------------------------------------------------------

                                  sched-domains
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
locklessly updated.

每一个CPU都有一个“基本”调度域(struct sched_domain)。域的层次结构从这些基本域出发,
通过->parent指针构建而成。->parent指针必须以NULL结尾,并且域结构应当是per-CPU的,
因为它们的更新是无锁的。
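
As an illustration of the NULL-terminated ->parent chain, the following is a
minimal user-space sketch in Python, not kernel code. The SchedDomain class,
the walk_domains() helper and the three level names are assumptions made for
this example; only the parent link mirrors the field described above.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class SchedDomain:
    """Toy stand-in for struct sched_domain (only the field used here)."""
    name: str
    parent: Optional["SchedDomain"] = None  # None plays the role of NULL


def walk_domains(base: SchedDomain) -> Iterator[SchedDomain]:
    """Yield the base domain, then each ancestor, until parent is None."""
    sd: Optional[SchedDomain] = base
    while sd is not None:
        yield sd
        sd = sd.parent


# Hypothetical three-level hierarchy for a single CPU.
numa = SchedDomain("NUMA")
smp = SchedDomain("SMP", parent=numa)
smt = SchedDomain("SMT", parent=smp)

levels = [sd.name for sd in walk_domains(smt)]
print(levels)
```

The walk only terminates because the top domain's parent is None; a chain
that looped back on itself would never end, which is why the termination rule
is a MUST.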

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system.
Strictly speaking it doesn't have to, but a smaller span could lead to a case
where some CPUs will never be given tasks to run unless the CPUs allowed mask
is explicitly set. A sched domain's span means "balance process load among
these CPUs".

每一个调度域都跨越若干个CPU(存于->span字段中)。一个域的跨度必须是其子域跨度的超
集(如有需要,这一限制可以放宽),而且CPU i的基本域至少要跨越CPU i。每个CPU的顶层
域通常会跨越系统中的所有CPU。严格来说这并非必须,但跨度较小时可能出现这样的情况:
除非显式设置了CPU的allowed mask,否则某些CPU永远得不到可运行的任务。调度域的跨度
的含义是:“在这些CPU之间平衡进程负载”。
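
The two span rules can be checked mechanically. The sketch below is a
user-space Python model under stated assumptions: spans are plain sets of CPU
numbers and check_spans() is a hypothetical helper, not a kernel function.

```python
from __future__ import annotations


def check_spans(cpu: int, spans: list[set[int]]) -> list[str]:
    """Check the span rules for one CPU.

    spans[0] is the base domain's span; spans[k] is the span of its k-th
    ancestor. Returns a list of violations (empty if none).
    """
    errors = []
    if cpu not in spans[0]:
        errors.append(f"base domain does not span CPU {cpu}")
    for level in range(1, len(spans)):
        if not spans[level] >= spans[level - 1]:  # >= is the superset test
            errors.append(f"level {level} is not a superset of level {level - 1}")
    return errors


good = check_spans(0, [{0, 1}, {0, 1, 2, 3}, set(range(8))])
bad = check_spans(0, [{0, 1}, {0, 2, 3}])
print(good)
print(bad)
```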

Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.

每一个调度域都必须有一个或多个CPU组(struct sched_group),这些组从->groups指针开始
组织成一个单向循环链表。这些组的cpumask的并集必须与域的跨度相同。任意两个组的
cpumask的交集必须为空集。->groups指针所指向的组必须包含该域所属的CPU。组在建立
之后只包含只读数据,因此可以被多个CPU共享。
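
The group rules (circular one-way list, union equal to the span, pairwise
disjoint cpumasks, first group containing the owning CPU) can be sketched as
follows. This is an illustrative Python model only; SchedGroup, link_ring()
and check_groups() are names invented for the example.

```python
class SchedGroup:
    """Toy stand-in for struct sched_group: a cpumask plus a next link."""

    def __init__(self, cpus):
        self.cpumask = frozenset(cpus)
        self.next = self  # a single group forms a ring of length one


def link_ring(groups):
    """Link groups into a circular one-way list and return the first one."""
    for group, successor in zip(groups, groups[1:] + groups[:1]):
        group.next = successor
    return groups[0]


def check_groups(cpu, span, first):
    """Return the violated group rules for a domain owned by `cpu`."""
    errors = []
    if cpu not in first.cpumask:
        errors.append("first group does not contain the owning CPU")
    seen = set()
    group = first
    while True:
        if seen & group.cpumask:
            errors.append("groups overlap")
        seen |= group.cpumask
        group = group.next
        if group is first:  # back at the start: the ring is complete
            break
    if seen != set(span):
        errors.append("union of groups differs from span")
    return errors


ring = link_ring([SchedGroup({0, 1}), SchedGroup({2, 3})])
ok = check_groups(0, {0, 1, 2, 3}, ring)
overlapping = check_groups(
    0, {0, 1, 2}, link_ring([SchedGroup({0, 1}), SchedGroup({1, 2})])
)
print(ok, overlapping)
```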

Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

调度域内的平衡操作发生在组与组之间,也就是说,每个组被视为一个整体。组的负载被定义
为其每个成员CPU的负载之和,只有当组的负载失去平衡时才会在组之间迁移任务。
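
A tiny worked example of the group load definition, written as a Python
sketch. The per-CPU load numbers and the group_load() helper are made up for
illustration; how the kernel measures per-CPU load is not modelled here.

```python
def group_load(group, cpu_load):
    """Load of a group = sum of the loads of its member CPUs."""
    return sum(cpu_load[cpu] for cpu in group)


cpu_load = {0: 3, 1: 1, 2: 0, 3: 1}  # hypothetical per-CPU loads
groups = [(0, 1), (2, 3)]            # two groups of two CPUs each

loads = {group: group_load(group, cpu_load) for group in groups}
print(loads)
```

Here the first group carries a load of 4 and the second a load of 1, so any
balancing decision compares 4 against 1 rather than looking at CPUs one by
one.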

In kernel/sched/core.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly scheduled
rebalancing event for the current runqueue has arrived. The actual load
balancing workhorse, run_rebalance_domains()->rebalance_domains(), is then run
in softirq context (SCHED_SOFTIRQ).

在kernel/sched/core.c中,trigger_load_balance()通过scheduler_tick()在每个CPU上周期
性地运行。当当前运行队列的下一个定期再平衡事件到来之后,它会触发一个软中断。
然后,实际负载平衡的主力run_rebalance_domains()->rebalance_domains()在软中断上下文
(SCHED_SOFTIRQ)中运行。

The latter function takes two arguments: the current CPU and whether it was idle
at the time the scheduler_tick() happened. It iterates over all sched domains
our CPU is on, starting from its base domain and going up the ->parent chain.
While doing that, it checks to see if the current domain has exhausted its
rebalance interval. If so, it runs load_balance() on that domain. It then checks
the parent sched_domain (if it exists), and the parent of the parent and so
forth.

后一个函数接受两个参数:当前CPU,以及scheduler_tick()发生时该CPU是否处于空闲状态。
它从该CPU的基本域开始,沿着->parent链向上,遍历该CPU所在的所有调度域。在此过程中,
它会检查当前域是否已用完其再平衡间隔。如果是,它会在该域上运行load_balance()。
然后检查父调度域(如果存在),再检查父调度域的父调度域,依此类推。
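
The per-domain interval check can be pictured with the following Python
sketch. It is a simplified model, not the kernel routine: the Domain class,
the rebalance_sketch() helper, the tick-based interval values and the callback
are all assumptions, and the idle argument mentioned above is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Domain:
    name: str
    interval: int                      # ticks between balance attempts
    last_balance: int = 0              # tick of the most recent attempt
    parent: Optional["Domain"] = None  # None terminates the chain


def rebalance_sketch(base: Domain, now: int,
                     balance: Callable[[Domain], None]) -> None:
    """Walk up from the base domain, balancing each domain that is due."""
    sd: Optional[Domain] = base
    while sd is not None:
        if now - sd.last_balance >= sd.interval:
            balance(sd)
            sd.last_balance = now
        sd = sd.parent


# Hypothetical hierarchy: higher levels are balanced less often.
numa = Domain("NUMA", interval=16)
smp = Domain("SMP", interval=4, parent=numa)
smt = Domain("SMT", interval=1, parent=smp)

calls = []
for tick in range(1, 17):
    rebalance_sketch(smt, tick, lambda sd: calls.append(sd.name))

counts = {name: calls.count(name) for name in ("SMT", "SMP", "NUMA")}
print(counts)
```

With these made-up intervals, 16 ticks balance the SMT level on every tick,
the SMP level on every fourth tick and the NUMA level once.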

Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.

最初,load_balance()在当前调度域中寻找最繁忙的组。如果成功,就在该组所有CPU的运行
队列中寻找最繁忙的运行队列。如果能够找到这样一个运行队列,它会锁定我们的初始CPU运行队
列和新找到的最繁忙运行队列,然后开始把任务从后者移动到我们的运行队列中。具体移动的
任务数量等于此前遍历该调度域的各个组时计算出的不平衡值。
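
The sequence described above (busiest group, then busiest runqueue, then task
migration) can be sketched in Python as below. This is a toy model under loud
assumptions: load is simply the number of queued tasks, the imbalance is taken
to be half the difference between the busiest and the local group, and locking
is omitted entirely. None of these choices are claimed to match the kernel.

```python
def load_balance_sketch(this_cpu, groups, runqueues):
    """Pull tasks towards this_cpu; return how many tasks were moved.

    groups:    list of tuples of CPU numbers (the domain's groups)
    runqueues: dict mapping CPU number -> list of task names
    """
    def load(group):
        return sum(len(runqueues[cpu]) for cpu in group)

    local = next(g for g in groups if this_cpu in g)
    busiest = max(groups, key=load)
    if busiest is local or load(busiest) <= load(local):
        return 0  # nothing busier than our own group

    # Assumed imbalance: half the load difference between the two groups.
    imbalance = (load(busiest) - load(local)) // 2
    src = max(busiest, key=lambda cpu: len(runqueues[cpu]))

    moved = 0
    while moved < imbalance and len(runqueues[src]) > 1:
        runqueues[this_cpu].append(runqueues[src].pop())
        moved += 1
    return moved


runqueues = {0: [], 1: [], 2: ["a", "b", "c"], 3: ["d"]}
groups = [(0, 1), (2, 3)]
moved = load_balance_sketch(0, groups, runqueues)
print(moved, runqueues)
```

In this example CPU 0 pulls two tasks from CPU 2, the busiest runqueue in the
busiest group, leaving that runqueue with one task.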

*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

***实现调度域***
“基本”域会“跨越”层次结构中的第一层。在SMT的情况下,它会跨越物理CPU的所有
sibling,每一个组是一个虚拟CPU。

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Alternatively, you could do multi-level NUMA; Opteron, for
example, might have just one domain covering its one NUMA level.

在SMP中,基本域的父域将跨越节点中的所有物理CPU,每一个组是一个单独的物理CPU。在NUMA
中,SMP域的父域会跨越整个机器,每个组拥有一个节点的cpumask。或者,你也可以做多层
NUMA;例如Opteron可能只有一个域来覆盖它唯一的NUMA层级。
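
To make the SMT->SMP->NUMA layering concrete, the Python sketch below computes
the span and groups of each level for one CPU. The machine shape (2 nodes, 2
cores per node, 2 threads per core), the contiguous CPU numbering and the
domains_for() helper are all assumptions chosen for the example.

```python
THREADS, CORES, NODES = 2, 2, 2  # hypothetical machine shape


def domains_for(cpu):
    """Return [(level, span, groups)] from the base domain upwards."""
    per_node = THREADS * CORES
    ncpus = per_node * NODES

    def chunks(span, size):
        ordered = sorted(span)
        parts = [frozenset(ordered[i:i + size])
                 for i in range(0, len(ordered), size)]
        # Rotate so that the group containing `cpu` comes first.
        k = next(i for i, part in enumerate(parts) if cpu in part)
        return parts[k:] + parts[:k]

    core_start = cpu - cpu % THREADS
    node_start = cpu - cpu % per_node
    smt = set(range(core_start, core_start + THREADS))
    smp = set(range(node_start, node_start + per_node))
    numa = set(range(ncpus))
    return [
        ("SMT", smt, chunks(smt, 1)),           # one group per virtual CPU
        ("SMP", smp, chunks(smp, THREADS)),     # one group per physical CPU
        ("NUMA", numa, chunks(numa, per_node)),  # one group per node
    ]


doms = domains_for(5)
for level, span, groups in doms:
    print(level, sorted(span), [sorted(g) for g in groups])
```

Each level's span is a superset of the one below it, and the first group at
every level contains CPU 5, in line with the rules stated earlier.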

The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.

实现者应当阅读include/linux/sched.h中的注释:struct sched_domain的各个字段、
SD_FLAG_*、SD_*_INIT,以了解具体细节以及可以调节的内容。

Architectures may override the default SD_*_INIT flags while still using the
generic domain builder in kernel/sched/core.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This can be done
by #define'ing ARCH_HASH_SCHED_TUNE.

如果体系结构希望保留传统的SMT->SMP->NUMA拓扑(或其子集),它可以在继续使用
kernel/sched/core.c中的通用域生成器的同时,覆盖默认的SD_*_INIT标志。这可以通过
#define ARCH_HASH_SCHED_TUNE来实现。

Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

或者,体系结构也可以通过#define ARCH_HASH_SCHED_DOMAIN来完全覆盖通用的域生成器,
并导出自己的arch_init_sched_domains函数。该函数会使用cpu_attach_domain把域连接到
所有的CPU上。

The sched-domains debugging infrastructure can be enabled by enabling
CONFIG_SCHED_DEBUG. This enables an error checking parse of the sched domains
which should catch most possible errors (described above). It also prints out
the domain structure in a visual format.

sched-domains的调试基础设施可以通过启用CONFIG_SCHED_DEBUG来打开。它会对调度域进行
带错误检查的解析,应当能够捕获(上文所述的)大部分可能的错误。它还会以可视化的格式
打印出域结构。
