This post analyzes the difference between Sigmoid and Softmax for binary classification problems.

Sigmoid and Softmax have come up before mainly in connection with cross-entropy loss functions, e.g.:

[1] - Machine Learning - Cross Entropy

[2] - CaffeLoss - SigmoidCrossEntropyLoss derivation and Python implementation

[3] - Focal Loss paper notes and formula derivation

It is habitually assumed that SigmoidCrossEntropyLoss is for two-class problems and SoftmaxCrossEntropyLoss is for multi-class problems. However, in the binary case, SoftmaxCrossEntropyLoss and SigmoidCrossEntropyLoss are equivalent.
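A quick NumPy check of this claim (the logit and label values below are made up purely for illustration): binary cross entropy on a Sigmoid output and categorical cross entropy on a two-way Softmax give the same loss when the logit difference is the same.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# One example with an arbitrary logit and true label y = 1.
logit = 0.7
y = 1

# Sigmoid + binary cross entropy on a single logit.
p = sigmoid(logit)
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Softmax + categorical cross entropy on the two logits [0, logit]
# (class-0 logit pinned to 0, so the logit difference is the same).
cce = -np.log(softmax(np.array([0.0, logit]))[y])

print(bce, cce)  # both print the same value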

Here the equivalence of Sigmoid and Softmax in the binary case is analyzed from two angles.

Main reference - ypwhs/sigmoid_and_softmax.ipynb.

1. Theoretical analysis

In the binary classification setting:

[1] - Sigmoid:

$$
\begin{cases}
p(y=1|x) = \dfrac{1}{1 + e^{-\theta^T x}} \\
p(y=0|x) = 1 - p(y=1|x) = \dfrac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}
\end{cases}
$$

[2] - Softmax:

$$
\begin{cases}
p(y=0|x) = \dfrac{e^{\theta_0^T x}}{e^{\theta_0^T x} + e^{\theta_1^T x}} = \dfrac{e^{(\theta_0^T - \theta_1^T) x}}{1 + e^{(\theta_0^T - \theta_1^T) x}} \\
p(y=1|x) = 1 - p(y=0|x)
\end{cases}
$$

Let $\beta = -(\theta_0^T - \theta_1^T)$, i.e. $\beta = \theta_1 - \theta_0$; then:

$$
\begin{cases}
p(y=1|x) = \dfrac{1}{1 + e^{-\beta^T x}} \\
p(y=0|x) = \dfrac{e^{-\beta^T x}}{1 + e^{-\beta^T x}}
\end{cases}
$$

This shows that, in the binary case, the Softmax and Sigmoid formulations are theoretically equivalent.
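The identity can be checked numerically in a few lines of NumPy (random vectors only, no Keras involved):

import numpy as np

np.random.seed(0)
d = 5                        # feature dimension (arbitrary for this check)
x = np.random.randn(d)
theta0 = np.random.randn(d)  # Softmax weights for class 0
theta1 = np.random.randn(d)  # Softmax weights for class 1

# Softmax probability of class 1.
z = np.array([theta0 @ x, theta1 @ x])
p_softmax = np.exp(z[1]) / np.exp(z).sum()

# Equivalent Sigmoid with beta = theta1 - theta0.
beta = theta1 - theta0
p_sigmoid = 1.0 / (1.0 + np.exp(-(beta @ x)))

print(np.isclose(p_softmax, p_sigmoid))  # True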

2. Numerical comparison with Keras

The cats-vs-dogs dataset is used as an example, with features exported from an Xception model pre-trained on ImageNet. (The code below is copied from ypwhs/sigmoid_and_softmax.ipynb.)

gap_Xception.h5
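gap_Xception.h5 holds the global-average-pooled Xception features of the training images and is used as-is below. For reference, a minimal sketch of how such a file could be generated (the train/ directory layout, target size, and batch size here are assumptions, not taken from the original notebook):

import h5py
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# Xception without its classification head; pooling='avg' returns the
# global-average-pooled feature vector for each image.
base_model = Xception(weights='imagenet', include_top=False, pooling='avg')

# Assumed layout: train/cat/*.jpg and train/dog/*.jpg
gen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_flow = gen.flow_from_directory('train', target_size=(299, 299),
                                     batch_size=64, shuffle=False,
                                     class_mode='binary')

features = base_model.predict_generator(train_flow, verbose=1)
labels = train_flow.classes

with h5py.File('gap_Xception.h5', 'w') as h:
    h.create_dataset('train', data=features)
    h.create_dataset('label', data=labels)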

import h5py
import numpy as np
from sklearn.model_selection import train_test_split
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.utils import to_categorical
import matplotlib.pyplot as plt

np.random.seed(20180520)

#---------------------------------
# Load the cats-vs-dogs feature data
#---------------------------------
with h5py.File("gap_Xception.h5", 'r') as h:
    X = np.array(h['train'])
    y = np.array(h['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

y_train_softmax = to_categorical(y_train)  # one-hot
y_test_softmax = to_categorical(y_test)

#---------------------------------
# Softmax
#---------------------------------
# loss: Softmax Cross Entropy Loss
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
softmax = Dense(2, activation='softmax')
x = softmax(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(1e-3),  # lr = 1e-3
              loss='categorical_crossentropy',
              metrics=['accuracy'])

softmax_weights, softmax_bias = softmax.get_weights()

history_softmax = model.fit(X_train, y_train_softmax,
                            batch_size=128,
                            epochs=5,
                            validation_data=(X_test, y_test_softmax))

#---------------------------------
# Sigmoid
#---------------------------------
# loss: Binary(Sigmoid) Cross Entropy Loss
np.random.seed(20180520)
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
sigmoid = Dense(1, activation='sigmoid')
x = sigmoid(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(2e-3),  # lr = 2e-3
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Initialize the Sigmoid weights from the difference of the Softmax weights
# beta = sigmoid weights
beta = -(softmax_weights[:,0] - softmax_weights[:,1]).reshape((-1, 1))
sigmoid.set_weights([beta, np.zeros(1)])  # set beta as the Sigmoid weights

history_sigmoid = model.fit(X_train, y_train,
                            batch_size=128,
                            epochs=5,
                            validation_data=(X_test, y_test))

#---------------------------------
# Sigmoid (random init)
#---------------------------------
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
sigmoid = Dense(1, activation='sigmoid')
x = sigmoid(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(2e-3),  # lr = 2e-3
              loss='binary_crossentropy',
              metrics=['accuracy'])

# randomly initialized Sigmoid weights
history_sigmoid_2 = model.fit(X_train, y_train,
                              batch_size=128,
                              epochs=5,
                              validation_data=(X_test, y_test))
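Before looking at the training output, a small NumPy sanity check (not part of the original notebook) confirms that the converted weights describe the same classifier at initialization: with beta = -(softmax_weights[:,0] - softmax_weights[:,1]) and a zero bias, the Sigmoid output equals the Softmax probability of class 1.

logits = X_test.dot(softmax_weights) + softmax_bias             # (N, 2) Softmax logits
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
p_softmax = exp_logits[:, 1] / exp_logits.sum(axis=1)           # P(y=1|x) from Softmax

p_sigmoid = 1.0 / (1.0 + np.exp(-X_test.dot(beta).ravel()))     # P(y=1|x) from Sigmoid

print(np.abs(p_softmax - p_sigmoid).max())                      # ~0, up to float precision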

Training output:

Softmax:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 1s 68us/step - loss: 0.4711 - acc: 0.8535 - val_loss: 0.3765 - val_acc: 0.9642
Epoch 2/5
20000/20000 [==============================] - 0s 13us/step - loss: 0.3338 - acc: 0.9486 - val_loss: 0.2825 - val_acc: 0.9816
Epoch 3/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.2594 - acc: 0.9726 - val_loss: 0.2279 - val_acc: 0.9858
Epoch 4/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.2160 - acc: 0.9799 - val_loss: 0.1923 - val_acc: 0.9866
Epoch 5/5
20000/20000 [==============================] - 0s 13us/step - loss: 0.1860 - acc: 0.9825 - val_loss: 0.1677 - val_acc: 0.9868

Sigmoid With Softmax Weights:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 0s 18us/step - loss: 0.4706 - acc: 0.8544 - val_loss: 0.3766 - val_acc: 0.9644
Epoch 2/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.3346 - acc: 0.9476 - val_loss: 0.2824 - val_acc: 0.9816
Epoch 3/5
20000/20000 [==============================] - 0s 14us/step - loss: 0.2613 - acc: 0.9709 - val_loss: 0.2275 - val_acc: 0.9860
Epoch 4/5
20000/20000 [==============================] - 0s 14us/step - loss: 0.2151 - acc: 0.9789 - val_loss: 0.1923 - val_acc: 0.9868
Epoch 5/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.1857 - acc: 0.9825 - val_loss: 0.1676 - val_acc: 0.9872

Sigmoid With Random Init Weight:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 0s 18us/step - loss: 0.5690 - acc: 0.7607 - val_loss: 0.4415 - val_acc: 0.9718
Epoch 2/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.3753 - acc: 0.9576 - val_loss: 0.3151 - val_acc: 0.9852
Epoch 3/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.2819 - acc: 0.9814 - val_loss: 0.2464 - val_acc: 0.9878
Epoch 4/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.2267 - acc: 0.9858 - val_loss: 0.2042 - val_acc: 0.9882
Epoch 5/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.1921 - acc: 0.9872 - val_loss: 0.1759 - val_acc: 0.9882

2.1 Comparison of the training loss curves

plt.plot(history_sigmoid.history['loss'])
plt.plot(history_softmax.history['loss'])
plt.plot(history_sigmoid.history['val_loss'])
plt.plot(history_softmax.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['sigmoid_loss', 'softmax_loss', 'sigmoid_val_loss', 'softmax_val_loss'], loc='upper right')
plt.show()

As the figure shows, the Sigmoid and Softmax training curves almost completely overlap.

2.2 Visualizing the loss differences

plt.plot(np.array(history_sigmoid.history['val_loss']) - np.array(history_softmax.history['val_loss']))
plt.plot(np.array(history_sigmoid.history['val_loss']) - np.array(history_sigmoid_2.history['val_loss']))
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['sigmoid_softmax_beta_gap', 'sigmoid_random_weight_gap'], loc='upper right')
plt.show()

The blue curve in the figure stays almost exactly at 0, which means the loss difference between the Sigmoid- and Softmax-trained models is tiny. The yellow curve shows a much larger gap: it corresponds to the randomly initialized Sigmoid weights, which change how the loss evolves during training.

In other words, as long as the correct beta value is set, Sigmoid and Softmax can be regarded as equivalent.

2.3 Summary

For binary classification problems,

[1] - Sigmoid and Softmax are fully equivalent.

[2] - The weights of a Sigmoid classifier and a Softmax classifier can be converted into each other.

[3] - For the updates to match, the Sigmoid learning rate has to be twice the Softmax learning rate (e.g. 2e-3 vs. 1e-3 in the code above); see the sketch after this list.

[4] - Softmax wastes twice the weight space compared to Sigmoid (it has twice as many weight parameters).
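The factor of two in [3] comes from the parameterization: the cross-entropy gradient with respect to beta = theta1 - theta0 is applied with opposite signs to both theta0 and theta1 in the Softmax layer, so a single SGD step moves beta twice as far as the same step on the Sigmoid layer. A toy NumPy illustration with scalar weights and a made-up gradient value:

import numpy as np

g = 0.3                      # gradient of the loss w.r.t. beta (made-up value)
theta0, theta1 = 1.0, -0.5
beta = theta1 - theta0

# One SGD step on the Softmax parameters with learning rate lr_softmax:
# dL/dtheta1 = +g, dL/dtheta0 = -g
lr_softmax = 1e-3
theta1_new = theta1 - lr_softmax * g
theta0_new = theta0 - lr_softmax * (-g)
beta_after_softmax_step = theta1_new - theta0_new   # = beta - 2 * lr_softmax * g

# One SGD step directly on beta (the Sigmoid parameterization):
lr_sigmoid = 2e-3
beta_after_sigmoid_step = beta - lr_sigmoid * g

print(np.isclose(beta_after_softmax_step, beta_after_sigmoid_step))  # True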
