Source: Image Segmentation: Tips and Tricks from 39 Kaggle Competitions - 2020.04.07
A partial translation and summary, kept here for study.
These are the author's takeaways from 39 Kaggle competitions, including:
- Data Science Bowl 2017 – $1,000,000
- Intel & MobileODT Cervical Cancer Screening – $100,000
- 2018 Data Science Bowl – $100,000
- Airbus Ship Detection Challenge – $60,000
- Planet: Understanding the Amazon from Space – $60,000
- APTOS 2019 Blindness Detection – $50,000
- Human Protein Atlas Image Classification – $37,000
- SIIM-ACR Pneumothorax Segmentation – $30,000
- Inclusive Images Challenge – $25,000
1. External Data
[1] - Use the LUng Nodule Analysis (LUNA) Grand Challenge data, which contains detailed annotations from radiologists.
[2] - Use the LIDC-IDRI data, which contains radiologist descriptions of each tumor.
[3] - Use the Flickr CC and Wikipedia Commons datasets.
[4] - Use the Human Protein Atlas Dataset.
[5] - Use the IDRiD dataset.
2. Data Exploration and Gaining Insights
[1] - Cluster 3D segmentations with a 0.5 threshold.
[2] - Identify whether there is a substantial difference between the train and test label distributions (a quick check is sketched below).
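A minimal pandas sketch of that distribution check. The file names and the `label` column are hypothetical placeholders; on Kaggle the test labels are usually hidden, so in practice you would compare against a labeled hold-out split or fall back on adversarial validation (section 6).

```python
import pandas as pd

# Hypothetical file and column names; substitute the competition's metadata.
train = pd.read_csv("train.csv")
holdout = pd.read_csv("holdout.csv")

# Normalized class frequencies make a distribution shift obvious at a glance.
print(train["label"].value_counts(normalize=True))
print(holdout["label"].value_counts(normalize=True))
```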
3. Preprocessing
[1] - Apply Difference of Gaussians (DoG) for blob detection, e.g. with the implementation in the skimage library.
[2] - Use patch-based inputs for training in order to reduce training time.
[3] - Load data with cudf instead of Pandas, since it reads much faster.
[4] - Ensure that all the images have the same orientation.
[5] - Apply contrast limited adaptive histogram equalization (CLAHE).
[6] - Use OpenCV for all general image processing.
[7] - Employ automatic active learning and add manual annotations.
[8] - Resize all images to the same resolution in order to apply the same model to scans of different thicknesses.
[9] - Convert scan images into normalized 3D numpy arrays.
[10] - Apply single-image haze removal using the dark channel prior.
[11] - Convert all data to Hounsfield units.
[12] - Find duplicate images using pair-wise correlation on RGBY.
[13] - Make labels more balanced by developing a sampler (see importance sampling: https://www.sebastiansylvan.com/post/importancesampling/).
[14] - Apply pseudo-labeling to the test data in order to improve the score.
[15] - Scale down images/masks to 320×480.
[16] - Apply histogram equalization (CLAHE) with a kernel size of 32×32.
[17] - Convert DCM to PNG.
[18] - Calculate the md5 hash of each image to detect duplicates (see the sketch after this list).
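A small sketch of the md5 de-duplication in tip [18]; the `train_images` directory and the `*.png` pattern are assumptions about the dataset layout.

```python
import hashlib
from pathlib import Path

seen = {}
# Hypothetical directory and extension; adjust to the competition's files.
for path in sorted(Path("train_images").glob("*.png")):
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    if digest in seen:
        print(f"duplicate: {path} == {seen[digest]}")  # drop it, or merge labels
    else:
        seen[digest] = path
```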
4. Data Augmentation
[1] - Use the albumentations library for image augmentation (a sample pipeline follows this list).
[2] - Apply random rotations by 90 degrees.
[3] - Use horizontal, vertical, or both flips.
[4] - Attempt heavy geometric transformations: Elastic Transform, PerspectiveTransform, Piecewise Affine transforms, pincushion distortion.
[5] - Apply random HSV shifts.
[6] - Use loss-less augmentation for generalization, to prevent loss of useful image information.
[7] - Apply channel shuffling.
[8] - Do data augmentation based on class frequency.
[9] - Apply Gaussian noise.
[10] - Use lossless permutations of 3D images for data augmentation.
[11] - Rotate by a random angle from 0 to 45 degrees.
[12] - Scale by a random factor from 0.8 to 1.2.
[13] - Change brightness.
[14] - Randomly change hue, saturation and value.
[15] - Apply D4 augmentations.
[16] - Apply contrast limited adaptive histogram equalization.
[17] - Use the AutoAugment augmentation strategy.
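To make tip [1] concrete, here is one possible albumentations pipeline combining several of the augmentations above; the probabilities and limits are illustrative choices, not values from any particular solution.

```python
import albumentations as A
import numpy as np

transform = A.Compose([
    A.RandomRotate90(p=0.5),                # tip [2]
    A.HorizontalFlip(p=0.5),                # tip [3]
    A.VerticalFlip(p=0.5),                  # tip [3]
    A.Rotate(limit=45, p=0.5),              # tip [11]: random angle up to 45 degrees
    A.RandomScale(scale_limit=0.2, p=0.5),  # tip [12]: factor in [0.8, 1.2]
    A.HueSaturationValue(p=0.3),            # tips [5]/[14]
    A.GaussNoise(p=0.3),                    # tip [9]
    A.CLAHE(p=0.3),                         # tip [16]
    A.ChannelShuffle(p=0.1),                # tip [7]
])

# Dummy image/mask just to show the call signature (HWC uint8 arrays).
image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256), dtype=np.uint8)
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```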
5. Models
5.1. Network Architectures
[1] - Use of a U-Net based architecture, adopting the concepts and applying them to 3D input tensors
[2] - The Inception-ResNet v2 architecture for training features with different receptive fields
[3] - Siamese networks with adversarial training
[4] - ResNet50, Xception, Inception ResNet v2 x 5 with Dense (FC) layer as the final layer
[5] - Use of a global max-pooling layer which returns a fixed-length output no matter the input size
[6] - Use of stacked dilated convolutions
[7] - VoxelNet
[8] - Replace plus sign in LinkNet skip connections with concat and conv1x1
[9] - Generalized mean pooling
[10] - Keras NASNetLarge to train the model from scratch using 224x224x3
[11] - Use of the 3D convnet to slide over the images
[12] - Imagenet-pre-trained ResNet152 as the feature extractor
[13] - Replace the final fully-connected layers of ResNet by 3 fully connected layers with dropout
[14] - Use ConvTranspose in the decoder
[15] - Applying the VGG baseline architecture
[16] - Implementing the C3D network with adjusted receptive fields and a 64 unit bottleneck layer on the end of the network
[17] - Use of UNet type architectures with pre-trained weights to improve convergence and performance of binary segmentation on 8-bit RGB input images (see the sketch after this list)
[18] - LinkNet since it’s fast and memory efficient
[19] - Mask R-CNN
[20] - BN-Inception
[21] - Fast Point R-CNN
[22] - SE-ResNeXt
[23] - Faster R-CNN
[24] - SENet154
[25] - ResNet152
[26] - NASNet-A-Large
[27] - EfficientNetB4
[28] - ResNet101
[29] - GAPNet
[30] - PNASNet-5-Large
[31] - DenseNet121
[32] - AC-GAN
[33] - XceptionNet (96), XceptionNet (299), Inception v3 (139), InceptionResNet v2 (299), DenseNet121 (224)
[34] - AlbuNet (resnet34) from ternausnets
[35] - SpaceNet
[36] - Resnet50 from selim_sef SpaceNet 4
[37] - SCSEUnet (seresnext50) from selim_sef SpaceNet 4
[38] - A custom Unet and Linknet architecture
[39] - FPNetResNet50 (5 folds)
[40] - FPNetResNet101 (5 folds)
[41] - FPNetResNet101 (7 folds with different seeds)
[42] - PANetDilatedResNet34 (4 folds)
[43] - PANetResNet50 (4 folds)
[44] - EMANetResNet101 (2 folds)
[45] - RetinaNet
[46] - Deformable R-FCN
[47] - Deformable Relation Networks
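Several of the entries above pair a U-Net style decoder with a pre-trained encoder (e.g. tip [17] and the SCSEUnet/seresnext entries). Below is a minimal sketch using the segmentation_models_pytorch library; the encoder choice is just one plausible option, not the one any particular team used.

```python
import segmentation_models_pytorch as smp

# U-Net decoder on an ImageNet-pretrained SE-ResNeXt50 encoder; any encoder
# supported by the library (resnet34, efficientnet-b4, ...) can be swapped in.
model = smp.Unet(
    encoder_name="se_resnext50_32x4d",
    encoder_weights="imagenet",
    in_channels=3,   # 8-bit RGB input
    classes=1,       # single-channel logits for binary segmentation
)
```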
5.2. Hardware
[1] - Use of the AWS GPU instance p2.xlarge with a NVIDIA K80 GPU
[2] - Pascal Titan-X GPU
[3] - Use of 8 TITAN X GPUs
[4] - 6 GPUs: 2×1080Ti + 4×1080
[5] - Server with 8×NVIDIA Tesla P40, 256 GB RAM and 28 CPU cores
[6] - Intel Core i7 5930k, 2×1080, 64 GB of RAM, 2x512GB SSD, 3TB HDD
[7] - GCP 1x P100, 8x CPU, 15 GB RAM, SSD or 2x P100, 16x CPU, 30 GB RAM
[8] - NVIDIA Tesla P100 GPU with 16GB of RAM
[9] - 980Ti GPU, 2600k CPU, and 14GB RAM
5.3. Loss Functions
[1] - Dice coefficient loss, because it works well with imbalanced data
[2] - Weighted boundary loss, whose purpose is to reduce the distance between the predicted segmentation and the ground truth
[3] - MultiLabelSoftMarginLoss that creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input and target
[4] - Balanced cross entropy (BCE) with logit loss that involves weighing the positive and negative examples by a certain coefficient
[5] - Lovasz that performs direct optimization of the mean intersection-over-union loss in neural networks based on the convex Lovasz extension of sub-modular losses
[6] - FocalLoss + Lovasz obtained by summing the Focal and Lovasz losses
[7] - Arc margin loss that incorporates margin in order to maximise face class separability
[8] - Npairs loss, computed between y_true and y_pred
[9] - A combination of the BCE and Dice loss functions (a PyTorch sketch follows this list)
[10] - LSEP – a pairwise ranking loss that is smooth everywhere and thus easier to optimize
[11] - Center loss that simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers
[12] - Ring Loss that augments standard loss functions such as Softmax
[13] - Hard triplet loss that trains a network to embed features of the same class at the same time maximizing the embedding distance of different classes
[14] - 1 + BCE – Dice, i.e. add 1 to the BCE loss and subtract the Dice loss
[15] - Binary cross-entropy – log(dice) that is the binary cross-entropy minus the log of the dice loss
[16] - Combinations of BCE, dice and focal
[17] - Lovasz loss, which performs direct optimization of the mean intersection-over-union loss
[18] - BCE + DICE, where the Dice loss is obtained by calculating a smoothed Dice coefficient
[19] - Focal loss with Gamma 2 that is an improvement to the standard cross-entropy criterion
[20] - BCE + DICE + Focal – this is basically a summation of the three loss functions
[21] - Active Contour Loss that incorporates the area and size information and integrates the information in a dense deep learning model
[22] - 1024 * BCE(results, masks) + BCE(cls, cls_target)
[23] - Focal + kappa – kappa is a loss function for multi-class classification of ordinal data in deep learning; here it is summed with the focal loss
[24] - ArcFaceLoss — Additive Angular Margin Loss for Deep Face Recognition
[25] - soft Dice trained on positives only – Soft Dice uses predicted probabilities
[26] - 2.7 * BCE(pred_mask, gt_mask) + 0.9 * DICE(pred_mask, gt_mask) + 0.1 * BCE(pred_empty, gt_empty), a custom loss used by the Kaggler
[27] - nn.SmoothL1Loss() that creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise
[28] - Use of the Mean Squared Error objective in scenarios where it seems to work better than the binary cross-entropy objective
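As an illustration of the combinations in [9], [18] and [20], here is a minimal PyTorch BCE + Dice loss on raw logits; the 50/50 weighting and the smoothing constant are arbitrary choices, not values from any particular solution.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, targets, bce_weight=0.5, smooth=1.0):
    """BCE + Dice; `smooth` keeps the Dice term finite on empty masks."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum()
    dice = (2.0 * intersection + smooth) / (probs.sum() + targets.sum() + smooth)
    return bce_weight * bce + (1.0 - bce_weight) * (1.0 - dice)

# Usage: logits from the model, targets as float 0/1 masks of the same shape.
loss = bce_dice_loss(torch.randn(2, 1, 64, 64), torch.ones(2, 1, 64, 64))
```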
5.4. Training Tips
[1] - Try different learning rates
[2] - Try different batch sizes
[3] - Use SGD with momentum and manual learning-rate scheduling
[4] - Too much augmentation will reduce accuracy
[5] - Train on image crops, but predict on full images
[6] - Use Keras's ReduceLROnPlateau() to adjust the learning rate (see the example after this list)
[7] - Train without augmentation until the loss plateaus, then apply soft and hard augmentation for some epochs
[8] - Freeze all layers except the last one and use 1000 images from Stage1 for fine-tuning
[9] - Make labels more balanced by developing a sampler
[10] - Use of class aware sampling
[11] - Use dropout and augmentation while tuning the last layer
[12] - Pseudo Labeling to improve score
[13] - Use Adam reducing LR on plateau with patience 2–4
[14] - Use Cyclic LR with SGD
[15] - Reduce the learning rate by a factor of two if validation loss does not improve for two consecutive epochs
[16] - Repeat the worst batch out of 10 batches
[17] - Train with default UNET
[18] - Overlap tiles so that each edge pixel is covered twice
[19] - Remove bounding boxes with low confidence scores
[20] - Train different convolutional neural networks, then build an ensemble
[21] - Stop training when the F1 score is decreasing
[22] - Use a differential learning rate with gradual reduction
[23] - Train ANNs in a stacking way using 5 folds and 30 repeats
[24] - Track your experiments using Neptune
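A sketch of tips [6], [13] and [15] in Keras; the factor/patience values mirror tip [15]'s "halve after two stalled epochs", and the toy model and random data exist only to make the snippet executable.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate after two epochs without val_loss improvement.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5,
                              patience=2, min_lr=1e-6, verbose=1)

# Toy model and data purely to make the snippet runnable end to end.
model = models.Sequential([layers.Dense(16, activation="relu", input_shape=(8,)),
                           layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")

x, y = np.random.rand(256, 8), np.random.randint(0, 2, (256, 1))
model.fit(x, y, validation_split=0.2, epochs=20, callbacks=[reduce_lr])
```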
6. Evaluation and Cross-Validation
[1] - Use a non-uniform split, stratified by class
[2] - Avoid overfitting by applying cross-validation while tuning the last layer
[3] - 10-fold CV ensemble for classification
[4] - Combination of 5 10-fold CV ensembles for detection
[5] - Sklearn's stratified K fold function (see the sketch after this list)
[6] - 5-fold cross-validation
[7] - Adversarial Validation & Weighting
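A minimal sketch of tip [5] with sklearn's StratifiedKFold; the X and y arrays below are stand-ins for your sample indices and class labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(20).reshape(-1, 1)   # stand-in for image indices/features
y = np.array([0] * 14 + [1] * 6)   # imbalanced labels, preserved per fold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Train one model per fold; ensemble the fold predictions afterwards.
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val")
```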
7. Ensembling
[1] - Use simple majority voting for the ensemble (see the sketch after this list)
[2] - XGBoost on the max malignancy at 3 zoom levels, the z-location and the amount of strange tissue
[3] - LightGBM for models with too many classes. This was done for raw data features only.
[4] - CatBoost for a second-layer model
[5] - Training with 7 features for the gradient boosting classifier
[6] - Use 'curriculum learning' to speed up model training. In this technique, models are first trained on simple samples, then progressively move on to harder ones.
[7] - Ensemble with ResNet50, InceptionV3, and InceptionResNetV2
[8] - Ensemble method for object detection
[9] - An ensemble of Mask R-CNN, YOLOv3, and Faster R-CNN architectures together with a classification network (DenseNet-121)
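A numpy sketch of the simple majority voting from tip [1], assuming each model emits hard class labels.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: (n_models, n_samples) array of integer class labels."""
    predictions = np.asarray(predictions)
    # Most frequent label across models, independently for each sample.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                               axis=0, arr=predictions)

preds = [[0, 1, 1],   # model A
         [0, 1, 0],   # model B
         [1, 1, 0]]   # model C
print(majority_vote(preds))  # -> [0 1 0]
```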
8. Post-Processing
[1] - Apply test-time augmentation (TTA): present an image to the model several times with different random transformations and average the predictions (a sketch follows this list)
[2] - Equalize test prediction probabilities instead of only using predicted classes
[3] - Apply geometric mean to the predictions
[4] - Overlap tiles during inference so that each edge pixel is covered at least three times, because UNET tends to have bad predictions around edge areas.
[5] - Non-maximum suppression and bounding box shrinkage
[6] - Watershed post processing to detach objects in instance segmentation problems.
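A framework-agnostic sketch of the test-time augmentation in tip [1], restricted to flips so that segmentation outputs can be mapped back exactly; `predict_fn` is a placeholder for your model's single-image inference call.

```python
import numpy as np

def tta_predict(predict_fn, image):
    """Average predictions over identity, horizontal and vertical flips.
    Each prediction is un-flipped before averaging so masks stay aligned."""
    flips = [
        (lambda x: x,          lambda y: y),
        (lambda x: x[:, ::-1], lambda y: y[:, ::-1]),  # horizontal flip
        (lambda x: x[::-1],    lambda y: y[::-1]),     # vertical flip
    ]
    preds = [inv(predict_fn(fwd(image))) for fwd, inv in flips]
    return np.mean(preds, axis=0)

# Toy check with an "identity model": the TTA output equals the input.
img = np.random.rand(4, 4)
print(np.allclose(tta_predict(lambda x: x, img), img))  # -> True
```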