This post analyzes the difference between Sigmoid and Softmax for binary classification problems.

Sigmoid and Softmax have come up before mainly in connection with cross-entropy loss functions, e.g.:

[1] - Machine Learning - Cross Entropy

[2] - CaffeLoss - SigmoidCrossEntropyLoss derivation and Python implementation

[3] - Focal Loss paper notes and formula derivation

It is habitually assumed that SigmoidCrossEntropyLoss is for two-class problems and SoftmaxCrossEntropyLoss is for multi-class problems. However, in the binary case, SoftmaxCrossEntropyLoss and SigmoidCrossEntropyLoss are equivalent.
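A quick NumPy check of this claim (the logit and label values below are made up purely for illustration): binary cross entropy on a Sigmoid output and categorical cross entropy on a two-way Softmax give the same loss when the logit difference is the same.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# One example with an arbitrary logit and true label y = 1.
logit = 0.7
y = 1

# Sigmoid + binary cross entropy on a single logit.
p = sigmoid(logit)
bce = -(y * np.log(p) + (1 - y) * np.log(1 - p))

# Softmax + categorical cross entropy on the two logits [0, logit]
# (class-0 logit pinned to 0, so the logit difference is the same).
cce = -np.log(softmax(np.array([0.0, logit]))[y])

print(bce, cce)  # both print the same value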

Here the equivalence of Sigmoid and Softmax in the binary case is analyzed from two angles.

Main reference - ypwhs/sigmoid_and_softmax.ipynb.

1. Theoretical analysis

In the binary classification setting:

[1] - Sigmoid:

$$
\begin{cases}
p(y=1|x) = \dfrac{1}{1 + e^{-\theta^T x}} \\
p(y=0|x) = 1 - p(y=1|x) = \dfrac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}
\end{cases}
$$

[2] - Softmax:

$$
\begin{cases}
p(y=0|x) = \dfrac{e^{\theta_0^T x}}{e^{\theta_0^T x} + e^{\theta_1^T x}} = \dfrac{e^{(\theta_0^T - \theta_1^T) x}}{1 + e^{(\theta_0^T - \theta_1^T) x}} \\
p(y=1|x) = 1 - p(y=0|x)
\end{cases}
$$

Let $\beta = -(\theta_0^T - \theta_1^T)$, i.e. $\beta = \theta_1 - \theta_0$; then:

$$
\begin{cases}
p(y=1|x) = \dfrac{1}{1 + e^{-\beta^T x}} \\
p(y=0|x) = \dfrac{e^{-\beta^T x}}{1 + e^{-\beta^T x}}
\end{cases}
$$

This shows that, in the binary case, the Softmax and Sigmoid formulations are theoretically equivalent.
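The identity can be checked numerically in a few lines of NumPy (random vectors only, no Keras involved):

import numpy as np

np.random.seed(0)
d = 5                        # feature dimension (arbitrary for this check)
x = np.random.randn(d)
theta0 = np.random.randn(d)  # Softmax weights for class 0
theta1 = np.random.randn(d)  # Softmax weights for class 1

# Softmax probability of class 1.
z = np.array([theta0 @ x, theta1 @ x])
p_softmax = np.exp(z[1]) / np.exp(z).sum()

# Equivalent Sigmoid with beta = theta1 - theta0.
beta = theta1 - theta0
p_sigmoid = 1.0 / (1.0 + np.exp(-(beta @ x)))

print(np.isclose(p_softmax, p_sigmoid))  # True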

2. Numerical comparison with Keras

The cats-vs-dogs dataset is used as an example, with features exported from an Xception model pre-trained on ImageNet. (The code below is copied from ypwhs/sigmoid_and_softmax.ipynb.)

gap_Xception.h5
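gap_Xception.h5 holds the global-average-pooled Xception features of the training images and is used as-is below. For reference, a minimal sketch of how such a file could be generated (the train/ directory layout, target size, and batch size here are assumptions, not taken from the original notebook):

import h5py
from keras.applications.xception import Xception, preprocess_input
from keras.preprocessing.image import ImageDataGenerator

# Xception without its classification head; pooling='avg' returns the
# global-average-pooled feature vector for each image.
base_model = Xception(weights='imagenet', include_top=False, pooling='avg')

# Assumed layout: train/cat/*.jpg and train/dog/*.jpg
gen = ImageDataGenerator(preprocessing_function=preprocess_input)
train_flow = gen.flow_from_directory('train', target_size=(299, 299),
                                     batch_size=64, shuffle=False,
                                     class_mode='binary')

features = base_model.predict_generator(train_flow, verbose=1)
labels = train_flow.classes

with h5py.File('gap_Xception.h5', 'w') as h:
    h.create_dataset('train', data=features)
    h.create_dataset('label', data=labels)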

import h5py
import numpy as np
from sklearn.model_selection import train_test_split
from keras.layers import *
from keras.models import *
from keras.optimizers import *
from keras.utils import to_categorical
import matplotlib.pyplot as plt

np.random.seed(20180520)

#---------------------------------
# Load the cats-vs-dogs feature data
#---------------------------------
with h5py.File("gap_Xception.h5", 'r') as h:
    X = np.array(h['train'])
    y = np.array(h['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

y_train_softmax = to_categorical(y_train)  # one-hot
y_test_softmax = to_categorical(y_test)

#---------------------------------
# Softmax
#---------------------------------
# loss: Softmax Cross Entropy Loss
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
softmax = Dense(2, activation='softmax')
x = softmax(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(1e-3),  # lr = 1e-3
              loss='categorical_crossentropy',
              metrics=['accuracy'])

softmax_weights, softmax_bias = softmax.get_weights()

history_softmax = model.fit(X_train, y_train_softmax,
                            batch_size=128,
                            epochs=5,
                            validation_data=(X_test, y_test_softmax))

#---------------------------------
# Sigmoid
#---------------------------------
# loss: Binary(Sigmoid) Cross Entropy Loss
np.random.seed(20180520)
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
sigmoid = Dense(1, activation='sigmoid')
x = sigmoid(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(2e-3),  # lr = 2e-3
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Initialize the Sigmoid weights from the difference of the Softmax weights
# beta = sigmoid weights
beta = -(softmax_weights[:,0] - softmax_weights[:,1]).reshape((-1, 1))
sigmoid.set_weights([beta, np.zeros(1)])  # set beta as the Sigmoid weights

history_sigmoid = model.fit(X_train, y_train,
                            batch_size=128,
                            epochs=5,
                            validation_data=(X_test, y_test))

#---------------------------------
# Sigmoid (random init)
#---------------------------------
input_tensor = Input(X.shape[1:])
x = input_tensor
x = Dropout(0.5)(x)
sigmoid = Dense(1, activation='sigmoid')
x = sigmoid(x)
model = Model(input_tensor, x)
model.compile(optimizer=SGD(2e-3),  # lr = 2e-3
              loss='binary_crossentropy',
              metrics=['accuracy'])

# randomly initialized Sigmoid weights
history_sigmoid_2 = model.fit(X_train, y_train,
                              batch_size=128,
                              epochs=5,
                              validation_data=(X_test, y_test))
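Before looking at the training output, a small NumPy sanity check (not part of the original notebook) confirms that the converted weights describe the same classifier at initialization: with beta = -(softmax_weights[:,0] - softmax_weights[:,1]) and a zero bias, the Sigmoid output equals the Softmax probability of class 1.

logits = X_test.dot(softmax_weights) + softmax_bias             # (N, 2) Softmax logits
exp_logits = np.exp(logits - logits.max(axis=1, keepdims=True))
p_softmax = exp_logits[:, 1] / exp_logits.sum(axis=1)           # P(y=1|x) from Softmax

p_sigmoid = 1.0 / (1.0 + np.exp(-X_test.dot(beta).ravel()))     # P(y=1|x) from Sigmoid

print(np.abs(p_softmax - p_sigmoid).max())                      # ~0, up to float precision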

Training output:

Softmax:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 1s 68us/step - loss: 0.4711 - acc: 0.8535 - val_loss: 0.3765 - val_acc: 0.9642
Epoch 2/5
20000/20000 [==============================] - 0s 13us/step - loss: 0.3338 - acc: 0.9486 - val_loss: 0.2825 - val_acc: 0.9816
Epoch 3/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.2594 - acc: 0.9726 - val_loss: 0.2279 - val_acc: 0.9858
Epoch 4/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.2160 - acc: 0.9799 - val_loss: 0.1923 - val_acc: 0.9866
Epoch 5/5
20000/20000 [==============================] - 0s 13us/step - loss: 0.1860 - acc: 0.9825 - val_loss: 0.1677 - val_acc: 0.9868

Sigmoid With Softmax Weights:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 0s 18us/step - loss: 0.4706 - acc: 0.8544 - val_loss: 0.3766 - val_acc: 0.9644
Epoch 2/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.3346 - acc: 0.9476 - val_loss: 0.2824 - val_acc: 0.9816
Epoch 3/5
20000/20000 [==============================] - 0s 14us/step - loss: 0.2613 - acc: 0.9709 - val_loss: 0.2275 - val_acc: 0.9860
Epoch 4/5
20000/20000 [==============================] - 0s 14us/step - loss: 0.2151 - acc: 0.9789 - val_loss: 0.1923 - val_acc: 0.9868
Epoch 5/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.1857 - acc: 0.9825 - val_loss: 0.1676 - val_acc: 0.9872

Sigmoid With Random Init Weight:

Train on 20000 samples, validate on 5000 samples
Epoch 1/5
20000/20000 [==============================] - 0s 18us/step - loss: 0.5690 - acc: 0.7607 - val_loss: 0.4415 - val_acc: 0.9718
Epoch 2/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.3753 - acc: 0.9576 - val_loss: 0.3151 - val_acc: 0.9852
Epoch 3/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.2819 - acc: 0.9814 - val_loss: 0.2464 - val_acc: 0.9878
Epoch 4/5
20000/20000 [==============================] - 0s 12us/step - loss: 0.2267 - acc: 0.9858 - val_loss: 0.2042 - val_acc: 0.9882
Epoch 5/5
20000/20000 [==============================] - 0s 11us/step - loss: 0.1921 - acc: 0.9872 - val_loss: 0.1759 - val_acc: 0.9882

2.1 Comparison of the training loss curves

plt.plot(history_sigmoid.history['loss'])
plt.plot(history_softmax.history['loss'])
plt.plot(history_sigmoid.history['val_loss'])
plt.plot(history_softmax.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['sigmoid_loss', 'softmax_loss', 'sigmoid_val_loss', 'softmax_val_loss'], loc='upper right')
plt.show()

As the figure shows, the Sigmoid and Softmax training curves almost completely overlap.

2.2 Visualizing the loss differences

plt.plot(np.array(history_sigmoid.history['val_loss']) - np.array(history_softmax.history['val_loss']))
plt.plot(np.array(history_sigmoid.history['val_loss']) - np.array(history_sigmoid_2.history['val_loss']))
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['sigmoid_softmax_beta_gap', 'sigmoid_random_weight_gap'], loc='upper right')
plt.show()

The blue curve in the figure stays almost exactly at 0, which means the loss difference between the Sigmoid- and Softmax-trained models is tiny. The yellow curve shows a much larger gap: it corresponds to the randomly initialized Sigmoid weights, which change how the loss evolves during training.

In other words, as long as the correct beta value is set, Sigmoid and Softmax can be regarded as equivalent.

2.3 Summary

For binary classification problems,

[1] - Sigmoid and Softmax are fully equivalent.

[2] - The weights of a Sigmoid classifier and a Softmax classifier can be converted into each other.

[3] - For the updates to match, the Sigmoid learning rate has to be twice the Softmax learning rate (e.g. 2e-3 vs. 1e-3 in the code above); see the sketch after this list.

[4] - Softmax wastes twice the weight space compared to Sigmoid (it has twice as many weight parameters).
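The factor of two in [3] comes from the parameterization: the cross-entropy gradient with respect to beta = theta1 - theta0 is applied with opposite signs to both theta0 and theta1 in the Softmax layer, so a single SGD step moves beta twice as far as the same step on the Sigmoid layer. A toy NumPy illustration with scalar weights and a made-up gradient value:

import numpy as np

g = 0.3                      # gradient of the loss w.r.t. beta (made-up value)
theta0, theta1 = 1.0, -0.5
beta = theta1 - theta0

# One SGD step on the Softmax parameters with learning rate lr_softmax:
# dL/dtheta1 = +g, dL/dtheta0 = -g
lr_softmax = 1e-3
theta1_new = theta1 - lr_softmax * g
theta0_new = theta0 - lr_softmax * (-g)
beta_after_softmax_step = theta1_new - theta0_new   # = beta - 2 * lr_softmax * g

# One SGD step directly on beta (the Sigmoid parameterization):
lr_sigmoid = 2e-3
beta_after_sigmoid_step = beta - lr_sigmoid * g

print(np.isclose(beta_after_softmax_step, beta_after_sigmoid_step))  # True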
