$$\boldsymbol{L}_{dice} = 1-\frac{2\boldsymbol{I}}{\boldsymbol{U}}\quad \text{where}\quad \boldsymbol{I} = \sum_{i=1}^N y_ip_i,\qquad \boldsymbol{U} = \sum_{i=1}^N (y_i+p_i)$$
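For reference, a minimal PyTorch sketch of this definition; the `eps` smoothing constant is an implementation convenience added here to keep the ratio finite when both masks are empty, not part of the formula itself:

```python
import torch

def dice_loss(p: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """L_dice = 1 - 2I/U with I = sum(y_i * p_i) and U = sum(y_i + p_i).

    p: predicted probabilities in [0, 1]; y: binary ground-truth mask.
    eps keeps the ratio finite when y and p are both all-zero
    (an implementation convenience, not part of the formula above).
    """
    i = (y * p).sum()
    u = (y + p).sum()
    return 1.0 - 2.0 * i / (u + eps)
```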
Here $y$ is the ground-truth distribution and $p$ is the predicted distribution. Assume there is only a single pixel; then
$$\frac{\partial{\boldsymbol{L}_{dice}}}{\partial p} = \frac{\partial}{\partial p}\left(1-\frac{2yp}{y+p}\right) = -\frac{2y^2}{(y+p)^2}$$
$$=\left\{\begin{aligned}&0, &y=0\\&-\frac{2}{(1+p)^2}, &y=1\end{aligned}\right.$$
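A quick autograd sanity check of this piecewise result, with an arbitrary illustrative value $p = 0.3$:

```python
import torch

# Compare autograd against the closed form -2y^2 / (y + p)^2 at p = 0.3.
for y_val in (0.0, 1.0):
    p = torch.tensor(0.3, requires_grad=True)
    y = torch.tensor(y_val)
    loss = 1.0 - 2.0 * y * p / (y + p)
    loss.backward()
    closed_form = -2.0 * y_val**2 / (y_val + 0.3) ** 2
    print(f"y={y_val:.0f}: autograd={p.grad.item():+.4f}  closed form={closed_form:+.4f}")
```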
This shows that dice loss pays more attention to the foreground region. The background gradient is not necessarily zero, however, because for dice loss over multiple pixels
$$\frac{\partial{\boldsymbol{L}_{dice}}}{\partial p_i} = \frac{\partial}{\partial p_i}\left(1-\frac{2\boldsymbol{I}}{\boldsymbol{U}}\right) = -2\,\frac{\frac{\partial \boldsymbol{I}}{\partial p_i}\boldsymbol{U}-\frac{\partial \boldsymbol{U}}{\partial p_i}\boldsymbol{I}}{\boldsymbol{U}^2}$$
Substituting $\frac{\partial \boldsymbol{I}}{\partial p_i} = y_i$ and $\frac{\partial \boldsymbol{U}}{\partial p_i} = 1$:

$$=\left\{\begin{aligned}&\frac{2\boldsymbol{I}}{\boldsymbol{U}^2}, &y_i=0\\&\frac{2(\boldsymbol{I}-\boldsymbol{U})}{\boldsymbol{U}^2}, &y_i=1\end{aligned}\right.$$
This shows that the dice-loss gradient at a pixel depends not only on that pixel but also on all the other pixels. At the start of training every prediction is $0.5$; writing $F = \sum_i y_i \le N$ for the number of foreground pixels, we get
$$\begin{aligned} &\boldsymbol{U}= F+0.5N\le 1.5N\\ &\boldsymbol{I}= 0.5F\le 0.5N\\&|\boldsymbol{I}-\boldsymbol{U}| = 0.5F+0.5N\ge 0.5N + \boldsymbol{I} - 0.5F = 0.5N + \boldsymbol{I} - \boldsymbol{I} + 0.5N - 0.5N\end{aligned}$$
Since $|\boldsymbol{I}-\boldsymbol{U}| = 0.5F+0.5N$ exceeds $\boldsymbol{I} = 0.5F$ by $0.5N$, the foreground gradient magnitude $\frac{2|\boldsymbol{I}-\boldsymbol{U}|}{\boldsymbol{U}^2}$ is clearly larger than the background gradient $\frac{2\boldsymbol{I}}{\boldsymbol{U}^2}$.
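A small numeric check on a hypothetical 4-pixel image (half foreground, every prediction at the 0.5 starting point) reproduces both cases:

```python
import torch

y = torch.tensor([1.0, 1.0, 0.0, 0.0])        # 2 foreground, 2 background pixels
p = torch.full((4,), 0.5, requires_grad=True)  # every prediction starts at 0.5

i = (y * p).sum()        # I = 0.5 * F = 1.0
u = (y + p).sum()        # U = F + 0.5 * N = 2 + 2 = 4.0
loss = 1.0 - 2.0 * i / u
loss.backward()

# Foreground pixels: 2(I - U)/U^2 = -0.375; background pixels: 2I/U^2 = 0.125.
print(p.grad)  # tensor([-0.3750, -0.3750,  0.1250,  0.1250])
```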
Drawbacks:
1. Training oscillation
Suppose the foreground contains only a single pixel: predicting that pixel correctly gives loss $=0$, while predicting it wrongly gives loss $=1$, so the loss can swing between the two extremes from step to step.
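A toy illustration, assuming four pixels with a single foreground pixel and two hypothetical hard predictions:

```python
import torch

y = torch.tensor([1.0, 0.0, 0.0, 0.0])   # one foreground pixel

def dice(p):
    return 1.0 - 2.0 * (y * p).sum() / (y + p).sum()

hit  = torch.tensor([1.0, 0.0, 0.0, 0.0])  # foreground pixel predicted correctly
miss = torch.tensor([0.0, 1.0, 0.0, 0.0])  # foreground pixel missed entirely

print(dice(hit))   # tensor(0.) -> loss 0
print(dice(miss))  # tensor(1.) -> loss 1
```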
2. Vanishing gradients in extreme cases
For $p = \sigma(x)$, dice loss cannot escape the vanishing gradient caused by $\frac{\partial p}{\partial x} = \sigma(x)(1-\sigma(x))$ in the extreme cases ($\sigma(x) \rightarrow 0$ or $\sigma(x) \rightarrow 1$). CE loss does solve this: its $\frac{\partial \boldsymbol{L}_{ce}}{\partial p} = -\frac{y}{p}+\frac{1-y}{1-p}$ exactly cancels the $\sigma(x)(1-\sigma(x))$ factor, leaving $\frac{\partial \boldsymbol{L}_{ce}}{\partial x} = p-y$.
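A sketch of this comparison, assuming a single foreground pixel with a badly saturated logit; `binary_cross_entropy_with_logits` is PyTorch's standard sigmoid-CE:

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0])                        # foreground pixel
x = torch.tensor([-8.0], requires_grad=True)   # badly wrong, saturated logit

p = torch.sigmoid(x)                           # ~0.0003: the sigma(x) -> 0 regime
dice = 1.0 - 2.0 * (y * p).sum() / (y + p).sum()
dice.backward()
print(x.grad)   # tiny: the sigma(x)(1 - sigma(x)) factor kills the gradient

x.grad = None
ce = F.binary_cross_entropy_with_logits(x, y)
ce.backward()
print(x.grad)   # ~= p - y = -0.9997: CE keeps a healthy gradient
```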