
ww_mutex: a lock that wounds itself to guarantee resource reliability

伏砚
2023-12-01

ww_mutex

https://www.kernel.org/doc/html/latest/locking/ww-mutex-design.html

Why ww_mutex was created

A GPU uses a large number of buffer objects (BOs) while rendering. These BOs can be shared between contexts/processes, they can live in several memory domains (VRAM, GTT, RAM, ...), and some of them can be exported to other devices via PRIME (dma-buf). In most cases the driver therefore has to wait until a BO is fully prepared before it can be used.

If you simply protect each BO with an ordinary mutex, a problem appears: there is no guarantee that the BOs show up in the execution queues of all contexts in the same order. That order is controlled directly by userspace and is the result of the sequence of OpenGL calls the application issues, so deadlock becomes possible. For example:

BOs: A1 A2 B2 B3

context1:

        A2 -> B2 -> B3 -> A1

context2:

        B2 -> A2 -> A1 -> B3

In context1 the request for A2 succeeds and in context2 the request for B2 succeeds; next, context1 may request B2 while context2 simultaneously requests A2. Deadlock.

The problem gets more complex once you consider that the kernel may have to migrate BOs (for example into VRAM) before the GPU operates on them, which in turn may require evicting other buffers (and you do not want to evict BOs that are already queued up to the GPU); for a simplified understanding of the problem this can be ignored.

The algorithm the TTM graphics subsystem came up with to deal with this is quite simple. For every group of buffers (execbuf) that needs to be locked, the caller is assigned a unique reservation id/ticket from a global counter. If a deadlock occurs while locking all the buffers associated with an execbuf, the one with the lowest reservation ticket (i.e. the oldest task) wins, and the one with the higher reservation id (i.e. the younger task) unlocks all of the buffers it has already locked, and then tries again.
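
Purely to make this back-off rule concrete, here is a small illustrative sketch; it is not the actual TTM code, and the counter and helper names are invented for this example:

/* Illustrative sketch, not the real TTM implementation: every locking
 * attempt (execbuf) keeps one ticket, drawn once from a global counter. */
static atomic_long_t reservation_counter;   /* hypothetical global counter */

static unsigned long get_ticket(void)
{
      return atomic_long_inc_return(&reservation_counter);
}

/* On a conflict the lowest ticket (oldest task) wins; the younger task
 * unlocks everything it already holds and retries with the same ticket. */
static bool must_back_off(unsigned long my_ticket, unsigned long holder_ticket)
{
      return my_ticket > holder_ticket;
}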

Two waiting schemes that prevent deadlock

TIP 1: http://cn.voidcc.com/question/p-zrowubfq-mh.html

  1. Wait-Die: a non-preemptive deadlock-prevention technique. When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait only if Ti has a smaller timestamp than Tj (i.e. Ti is older than Tj); otherwise Ti is rolled back (it dies).

In this scheme, if a transaction requests to lock a resource (data item) that is already held by another transaction with a conflicting lock, one of two things can happen:

(1) If TS(Ti) < TS(Tj), i.e. Ti, which requests the conflicting lock, is older than Tj, then Ti is allowed to wait until the data item becomes available.
(2) If TS(Ti) > TS(Tj), i.e. Ti is younger than Tj, then Ti dies. Ti is restarted later with a random delay, but with the same timestamp.

This scheme lets older transactions wait but kills younger ones.

For example:

Suppose transactions T22, T23 and T24 have timestamps 5, 10 and 15 respectively. If T22 requests a data item held by T23, T22 will wait.
If T24 requests a data item held by T23, T24 will be rolled back.
  2. Wound-Wait: a preemptive deadlock-prevention technique and the counterpart of the Wait-Die scheme. When transaction Ti requests a data item currently held by Tj, Ti is allowed to wait only if Ti has a larger timestamp than Tj (i.e. Ti is younger); otherwise Tj is rolled back (Tj is wounded by Ti).

In this scheme, if a transaction requests to lock a resource (data item) that is already held by another transaction with a conflicting lock, one of two things can happen:

(1) If TS(Ti) < TS(Tj), i.e. Ti is older than Tj, then Ti forces Tj to be rolled back: Ti wounds Tj. Tj is restarted later with a random delay, but with the same timestamp.
(2) If TS(Ti) > TS(Tj), i.e. Ti is younger than Tj, then Ti waits until the resource becomes available.

This scheme lets younger transactions wait; but when an older transaction requests an item held by a younger one, the older transaction forces the younger one to abort and release the item.

For example:

Suppose transactions T22, T23 and T24 have timestamps 5, 10 and 15 respectively. If T22 requests a data item held by T23, the data item is preempted from T23 and T23 will be rolled back.
If T24 requests a data item held by T23, T24 will wait.

In both cases, it is the transaction that entered the system later that is aborted.
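
Purely as an illustration of the two policies (this is not kernel code; all names are invented for the sketch), the decision made on a lock conflict boils down to a timestamp comparison, where a smaller stamp means an older transaction:

/* Sketch only: what each scheme decides when the requester hits a lock
 * that is already held by another transaction. */
enum conflict_action { WAIT_FOR_LOCK, ROLLBACK_REQUESTER, ROLLBACK_HOLDER };

/* Wait-Die: older requesters wait, younger requesters die (back off). */
static enum conflict_action wait_die(unsigned long req_stamp, unsigned long holder_stamp)
{
      return req_stamp < holder_stamp ? WAIT_FOR_LOCK : ROLLBACK_REQUESTER;
}

/* Wound-Wait: older requesters wound (preempt) the holder, younger requesters wait. */
static enum conflict_action wound_wait(unsigned long req_stamp, unsigned long holder_stamp)
{
      return req_stamp < holder_stamp ? ROLLBACK_HOLDER : WAIT_FOR_LOCK;
}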

Concepts

Compared with a normal mutex, the ww_mutex locking interface introduces two additional concepts:

  • Acquire context (ctx): the reservation id/ticket requested when the lock acquisition starts is stored in the acquire context and reused for every lock taken within it, instead of a new id being requested each time.

    In addition, the acquire context tracks debugging state to catch misuse of the w/w mutex interface. An acquire context represents a transaction.

  • ww_class: in contrast to normal mutexes, the lock class of a w/w mutex has to be explicit, because it is needed to initialize the acquire context. The lock class also selects the algorithm to use, Wound-Wait or Wait-Die.
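
To make the two concepts concrete, here is a minimal, illustrative transaction over two objects. It assumes the struct obj layout used in the examples further below; the class and function names are placeholders, and the -EDEADLK back-off path is left out (the methods below show it in full):

static DEFINE_WW_CLASS(sketch_ww_class);        /* the explicit lock class */

static int touch_two_objs(struct obj *a, struct obj *b)
{
      struct ww_acquire_ctx ctx;
      int ret;

      ww_acquire_init(&ctx, &sketch_ww_class);  /* start of the transaction: get a ticket */

      ret = ww_mutex_lock(&a->lock, &ctx);
      if (ret)
              goto out_fini;
      ret = ww_mutex_lock(&b->lock, &ctx);      /* may return -EDEADLK; back-off not shown */
      if (ret)
              goto out_unlock_a;

      ww_acquire_done(&ctx);                    /* all locks of this transaction are held */

      /* ... operate on a and b ... */

      ww_mutex_unlock(&b->lock);
out_unlock_a:
      ww_mutex_unlock(&a->lock);
out_fini:
      ww_acquire_fini(&ctx);                    /* end of the transaction */
      return ret;
}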

There are three different classes of w/w lock acquisition functions:

  • Normal lock acquisition with a context, using ww_mutex_lock().

  • Slowpath lock acquisition on the contending lock, used by the task that just killed its transaction after having dropped all already acquired locks. These functions have the _slow postfix.

    From a simple semantics point-of-view the _slow functions are not strictly required, since simply calling the normal ww_mutex_lock functions on the contending lock (after having dropped all other already acquired locks) will work correctly. After all if no other ww mutex has been acquired yet there’s no deadlock potential and hence the ww_mutex_lock call will block and not prematurely return -EDEADLK. The advantage of the _slow functions is in interface safety:

    • ww_mutex_lock has a __must_check int return type, whereas ww_mutex_lock_slow has a void return type. Note that since ww mutex code needs loops/retries anyway the __must_check doesn’t result in spurious warnings, even though the very first lock operation can never fail.
    • When full debugging is enabled ww_mutex_lock_slow checks that all acquired ww mutex have been released (preventing deadlocks) and makes sure that we block on the contending lock (preventing spinning through the -EDEADLK slowpath until the contended lock can be acquired).
  • Functions to only acquire a single w/w mutex, which results in the exact same semantics as a normal mutex. This is done by calling ww_mutex_lock with a NULL context.

    Again this is not strictly required. But often you only want to acquire a single lock in which case it’s pointless to set up an acquire context (and so better to avoid grabbing a deadlock avoidance ticket).

Of course, all the usual variants for handling wake-ups due to signals are also provided.
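
For completeness, a tiny sketch of a signal-aware locking step (assuming the struct obj type defined in the examples below); on top of 0, -EDEADLK and -EALREADY it can also return -EINTR, and the slowpath counterpart is ww_mutex_lock_slow_interruptible():

/* Sketch only: one locking step using the interruptible variant. */
static int lock_obj_step(struct obj *obj, struct ww_acquire_ctx *ctx)
{
      int ret = ww_mutex_lock_interruptible(&obj->lock, ctx);

      if (ret == -EINTR)
              return ret;       /* a signal arrived while sleeping; caller unwinds */
      return ret;               /* otherwise handled exactly like ww_mutex_lock() */
}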

How to use it

The algorithm (Wait-Die vs Wound-Wait) is chosen by using either DEFINE_WW_CLASS() (Wound-Wait) or DEFINE_WD_CLASS() (Wait-Die). As a rough rule of thumb, use Wound-Wait if you expect the number of simultaneous competing transactions to be typically small, and you want to reduce the number of rollbacks.
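
Side by side, the two class-definition macros look like this (the class names are just placeholders):

static DEFINE_WW_CLASS(bo_ww_class);    /* objects in this class use Wound-Wait */
static DEFINE_WD_CLASS(bo_wd_class);    /* objects in this class use Wait-Die */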

Three different ways to acquire locks within the same w/w class. Common definitions for methods #1 and #2:

/* statically initialize a ww_class */
static DEFINE_WW_CLASS(ww_class);

/* the object to be locked; a struct ww_mutex is embedded in it */
struct obj {
      struct ww_mutex lock;
      /* obj data */
};

/* list entry for the objects that need to be acquired */
struct obj_entry {
      struct list_head head;
      struct obj *obj;
};

Method 1, using a list in execbuf->buffers that's not allowed to be reordered. This is useful if a list of required objects is already tracked somewhere. Furthermore the lock helper can propagate the -EALREADY return code back to the caller as a signal that an object is twice on the list. This is useful if the list is constructed from userspace input and the ABI requires userspace to not have duplicate entries (e.g. for a gpu commandbuffer submission ioctl):

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
      struct obj *res_obj = NULL;
      struct obj_entry *contended_entry = NULL;
      struct obj_entry *entry;
      int ret;

      /* initialize the ww_acquire_ctx before taking any locks */
      ww_acquire_init(ctx, &ww_class);

retry:
      /* walk the list and lock each object in turn */
      list_for_each_entry (entry, list, head) {
              if (entry->obj == res_obj) {
                      res_obj = NULL;
                      continue;
              }
              /* lock the object; on contention, if this transaction is the older one it simply waits inside ww_mutex_lock(), much like mutex_lock() */
              ret = ww_mutex_lock(&entry->obj->lock, ctx);
              if (ret < 0) {
                      /* locking failed and this transaction is the younger one: stop taking the remaining locks and record the contended object */
                      contended_entry = entry;
                      goto err;
              }
      }

      ww_acquire_done(ctx);
      return 0;

err:
      /* drop all locks already taken before the next attempt */
      list_for_each_entry_continue_reverse (entry, list, head)
              ww_mutex_unlock(&entry->obj->lock); /* similar to mutex_unlock() */

      if (res_obj)
              ww_mutex_unlock(&res_obj->lock);

      if (ret == -EDEADLK) {
              /* before the next attempt, take the lock we contended on with ww_mutex_lock_slow(), which sleeps until that lock becomes available */
              /* we lost out in a seqno race, lock and retry.. */
              ww_mutex_lock_slow(&contended_entry->obj->lock, ctx);
              res_obj = contended_entry->obj;
                /* go back and retry the whole locking sequence */
              goto retry;
      }
      ww_acquire_fini(ctx);

      return ret;
}
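
As a usage sketch (not part of the kernel document), a caller could drive lock_objs() together with the unlock_objs() helper shown further below roughly like this; the function name and the surrounding submission logic are made up:

static int submit_objs(struct list_head *obj_list)
{
      struct ww_acquire_ctx ctx;
      int ret;

      ret = lock_objs(obj_list, &ctx);    /* method #1 (or #2) above */
      if (ret)
              return ret;                 /* e.g. -EALREADY for duplicate entries */

      /* ... all objects are locked: operate on them, queue up work ... */

      unlock_objs(obj_list, &ctx);        /* drops the locks and finishes the ctx */
      return 0;
}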

Method 2, using a list in execbuf->buffers that can be reordered. Same semantics of duplicate entry detection using -EALREADY as method 1 above. But the list-reordering allows for a bit more idiomatic code:

int lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
      struct obj_entry *entry, *entry2;
      int ret;

      ww_acquire_init(ctx, &ww_class);

      list_for_each_entry (entry, list, head) {
              ret = ww_mutex_lock(&entry->obj->lock, ctx);
              if (ret < 0) {
                      entry2 = entry;

                      list_for_each_entry_continue_reverse (entry2, list, head)
                              ww_mutex_unlock(&entry2->obj->lock);

                      if (ret != -EDEADLK) {
                              ww_acquire_fini(ctx);
                              return ret;
                      }

                      /* we lost out in a seqno race, lock and retry.. */
                      ww_mutex_lock_slow(&entry->obj->lock, ctx);

                      /*
                       * Move buf to head of the list, this will point
                       * buf->next to the first unlocked entry,
                       * restarting the for loop.
                       */
                      list_del(&entry->head);
                      list_add(&entry->head, list);
              }
      }

      ww_acquire_done(ctx);
      return 0;
}

Unlocking works the same way for both methods #1 and #2:

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
      struct obj_entry *entry;

      /* release the locks on the list one by one */
      list_for_each_entry (entry, list, head)
              ww_mutex_unlock(&entry->obj->lock);

      ww_acquire_fini(ctx);
}

Method 3 is useful if the list of objects is constructed ad-hoc and not upfront, e.g. when adjusting edges in a graph where each node has its own ww_mutex lock, and edges can only be changed when holding the locks of all involved nodes. w/w mutexes are a natural fit for such a case for two reasons:

  • They can handle lock-acquisition in any order which allows us to start walking a graph from a starting point and then iteratively discovering new edges and locking down the nodes those edges connect to.
  • Due to the -EALREADY return code signalling that a given object is already held there's no need for additional book-keeping to break cycles in the graph or keep track of which locks are already held (when using more than one node as a starting point).

Note that this approach differs in two important ways from the above methods:

  • Since the list of objects is dynamically constructed (and might very well be different when retrying due to hitting the -EDEADLK die condition) there’s no need to keep any object on a persistent list when it’s not locked. We can therefore move the list_head into the object itself.
  • On the other hand the dynamic object list construction also means that the -EALREADY return code can’t be propagated.

Note also that methods #1 and #2 and method #3 can be combined, e.g. to first lock a list of starting nodes (passed in from userspace) using one of the above methods. And then lock any additional objects affected by the operations using method #3 below. The backoff/retry procedure will be a bit more involved, since when the dynamic locking step hits -EDEADLK we also need to unlock all the objects acquired with the fixed list. But the w/w mutex debug checks will catch any interface misuse for these cases.

Also, method 3 can’t fail the lock acquisition step since it doesn’t return -EALREADY. Of course this would be different when using the _interruptible variants, but that’s outside of the scope of these examples here:

struct obj {
      struct ww_mutex ww_mutex;
      struct list_head locked_list;
};

static DEFINE_WW_CLASS(ww_class);

void __unlock_objs(struct list_head *list)
{
      struct obj *entry, *temp;

      list_for_each_entry_safe (entry, temp, list, locked_list) {
              /* need to do that before unlocking, since only the current
               * lock holder is allowed to use the object */
              list_del(&entry->locked_list);
              ww_mutex_unlock(&entry->ww_mutex);
      }
}

void lock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
      struct obj *obj;
      int ret;

      ww_acquire_init(ctx, &ww_class);

retry:
      /* re-init loop start state */
      loop {
              /* magic code which walks over a graph and decides which objects
               * to lock */

              ret = ww_mutex_lock(&obj->ww_mutex, ctx);
              if (ret == -EALREADY) {
                      /* we have that one already, get to the next object */
                      continue;
              }
              if (ret == -EDEADLK) {
                      __unlock_objs(list);

                      ww_mutex_lock_slow(&obj->ww_mutex, ctx);
                      list_add(&obj->locked_list, list);
                      goto retry;
              }

              /* locked a new object, add it to the list */
              list_add_tail(&obj->locked_list, list);
      }

      ww_acquire_done(ctx);
}

void unlock_objs(struct list_head *list, struct ww_acquire_ctx *ctx)
{
      __unlock_objs(list);
      ww_acquire_fini(ctx);
}

Method 4: Only lock one single object. In that case deadlock detection and prevention is obviously overkill, since with grabbing just one lock you can't produce a deadlock within just one class. To simplify this case the w/w mutex api can be used with a NULL context.
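
A minimal sketch of this case, reusing the struct obj from the earlier examples (the function name is made up):

static int update_single_obj(struct obj *obj)
{
      int ret;

      /* With a NULL context this behaves like a plain mutex and cannot
       * return -EDEADLK or -EALREADY. */
      ret = ww_mutex_lock(&obj->lock, NULL);
      if (ret)        /* kept only to satisfy the __must_check annotation */
              return ret;

      /* ... touch the object's data ... */

      ww_mutex_unlock(&obj->lock);
      return 0;
}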

Implementation Details

Design:

ww_mutex currently encapsulates a struct mutex, this means no extra overhead for normal mutex locks, which are far more common. As such there is only a small increase in code size if wait/wound mutexes are not used.

We maintain the following invariants for the wait list:

  1. Waiters with an acquire context are sorted by stamp order; waiters without an acquire context are interspersed in FIFO order.
  2. For Wait-Die, among waiters with contexts, only the first one can have other locks acquired already (ctx->acquired > 0). Note that this waiter may come after other waiters without contexts in the list.

The Wound-Wait preemption is implemented with a lazy-preemption scheme: The wounded status of the transaction is checked only when there is contention for a new lock and hence a true chance of deadlock. In that situation, if the transaction is wounded, it backs off, clears the wounded status and retries. A great benefit of implementing preemption in this way is that the wounded transaction can identify a contending lock to wait for before restarting the transaction. Just blindly restarting the transaction would likely make the transaction end up in a situation where it would have to back off again.

In general, not much contention is expected. The locks are typically used to serialize access to resources for devices, and optimization focus should therefore be directed towards the uncontended cases.

Lockdep:

Special care has been taken to warn for as many cases of api abuse as possible. Some common api abuses will be caught with CONFIG_DEBUG_MUTEXES, but CONFIG_PROVE_LOCKING is recommended.

  • Some of the errors which will be warned about:

    • Forgetting to call ww_acquire_fini or ww_acquire_init.
    • Attempting to lock more mutexes after ww_acquire_done.
    • Attempting to lock the wrong mutex after -EDEADLK and unlocking all mutexes.
    • Attempting to lock the right mutex after -EDEADLK, before unlocking all mutexes.
    • Calling ww_mutex_lock_slow before -EDEADLK was returned.
    • Unlocking mutexes with the wrong unlock function.
    • Calling one of the ww_acquire_* twice on the same context.
    • Using a different ww_class for the mutex than for the ww_acquire_ctx.
    • Normal lockdep errors that can result in deadlocks.

  • Some of the lockdep errors that can result in deadlocks:

    • Calling ww_acquire_init to initialize a second ww_acquire_ctx before having called ww_acquire_fini on the first.
    • 'normal' deadlocks that can occur.

  • FIXME:

    Update this section once we have the TASK_DEADLOCK task state flag magic implemented.

