Source code for the scene classification track of the global AI Challenger competition.
This README also summarizes the problems encountered and pitfalls hit during the competition.
Dataset download - official site: https://challenger.ai/dataset/scene
or:
Baidu Pan link: https://pan.baidu.com/s/1cjR-xhsCq8BD5nH7yQeiIA  password: xfcp
1. Quick Start
1.1 Configure dataset paths
Edit the config.py file:
# coding=utf-8
import os
import platform

os_name = platform.system().lower()


def is_mac():
    return os_name.startswith('darwin')


def is_windows():
    return os_name.startswith('windows')


def is_linux():
    return os_name.startswith('linux')


def parse_weights(weights):
    # Weight files are named 'params/<model>/<epoch>-<val_loss>-<val_acc>.h5'.
    if not weights \
            or not weights.endswith('.h5') \
            or '/' not in weights \
            or '-' not in weights:
        return None
    try:
        weights_info = weights.split(os.path.sep)[-1].replace('.h5', '').split('-')
        if len(weights_info) != 3:
            return None
        epoch = int(weights_info[0])
        val_loss = float(weights_info[1])
        val_acc = float(weights_info[2])
        return epoch, val_loss, val_acc
    except Exception as e:
        raise Exception('Parse weights failure: %s' % str(e))


def CONTEXT(name, **kwargs):
    return {
        'weights': 'params/%s/{epoch:05d}-{val_loss:.4f}-{val_acc:.4f}.h5' % name,
        'summary': 'log/%s' % name,
        'predictor_cache_dir': 'cache/%s' % name,
        'load_imagenet_weights': is_windows(),
        'path_json_dump': 'eval_json/%s/result%s.json' % (
            name, ('_' + kwargs['policy']) if 'policy' in kwargs else ''),
    }
# Dataset image paths
if is_windows():
    PATH_TRAIN_BASE = 'D:/path/to/ai_challenger_scene_train_20170904'
    PATH_VAL_BASE = 'D:/path/to/ai_challenger_scene_validation_20170908'
    PATH_TEST_B = 'D:/path/to/ai_challenger_scene_test_b_20170922/scene_test_b_images_20170922'
elif is_mac():
    PATH_TRAIN_BASE = '/path/to/ai_challenger_scene_train_20170904'
    PATH_VAL_BASE = '/path/to/ai_challenger_scene_validation_20170908'
    PATH_TEST_B = ''
elif is_linux():
    PATH_TRAIN_BASE = ''
    PATH_VAL_BASE = ''
    PATH_TEST_B = ''
else:
    raise Exception('No images configured on %s' % os_name)
PATH_TRAIN_IMAGES = os.path.join(PATH_TRAIN_BASE, 'classes')
PATH_TRAIN_JSON = os.path.join(PATH_TRAIN_BASE, 'scene_train_annotations_20170904.json')
PATH_VAL_IMAGES = os.path.join(PATH_VAL_BASE, 'classes')
PATH_VAL_JSON = os.path.join(PATH_VAL_BASE, 'scene_validation_annotations_20170908.json')
PATH_JSON_DUMP = 'eval_json/resnet.json'

# Training info
IM_SIZE_299 = 299
IM_SIZE_224 = 224
BATCH_SIZE = 32
CLASSES = len(os.listdir(PATH_TRAIN_IMAGES))
EPOCH = 100

if __name__ == '__main__':
    print(PATH_TRAIN_IMAGES)
    print(CONTEXT('test').values())
1.2. Splitting the dataset by class
Edit the split_by_class.py script to sort both the train set and the val set into per-class subfolders.
# coding=utf-8
import numpy as np
import config
import json
import csv
import os

# Source dataset path
PATH_BASE_DIR = config.PATH_TRAIN_BASE
# PATH_BASE_DIR = config.PATH_VAL_BASE
# Output path
PATH_SAVE_DIR = os.path.join(PATH_BASE_DIR, 'classes')
# Name subfolders by class name instead of zero-padded class id
SUB_DIR_WITH_NAME = False
PATH_IMAGES = os.path.join(PATH_BASE_DIR, 'scene_train_images_20170904')
PATH_JSON = os.path.join(PATH_BASE_DIR, 'scene_train_annotations_20170904.json')
# PATH_IMAGES = os.path.join(PATH_BASE_DIR, 'scene_validation_images_20170908')
# PATH_JSON = os.path.join(PATH_BASE_DIR, 'scene_validation_annotations_20170908.json')
PATH_CSV = os.path.join(PATH_BASE_DIR, 'scene_classes.csv')
PRINT = True
# Resample every class towards the mean class size to fight class imbalance
MEAN_HANDLE = False


def output(obj):
    if PRINT:
        if isinstance(obj, (list, tuple)):
            for i in obj:
                print(i)
        else:
            print(obj)


def parse_labels():
    with open(PATH_CSV, encoding='utf-8') as f:
        return [line[1] for line in csv.reader(f)]


def parse_mapping():
    with open(PATH_JSON) as f:
        mapping = json.load(f)
    image2label = {item['image_id']: int(item['label_id']) for item in mapping}
    label2image = {}
    for image, label in image2label.items():
        if label not in label2image:
            label2image[label] = []
        label2image[label].append(image)
    return image2label, label2image


if __name__ == '__main__':
    labels = parse_labels()
    output(labels[:5])
    image2label, label2image = parse_mapping()
    output(label2image[0][:5])
    for label, images in label2image.items():
        label_format = labels[label] if SUB_DIR_WITH_NAME else ('%02d' % label)
        sub_dir = os.path.join(PATH_SAVE_DIR, label_format)
        if not os.path.exists(sub_dir):
            os.makedirs(sub_dir)
        if MEAN_HANDLE:
            target_files_size = len(image2label) // len(label2image)
            if len(images) > target_files_size:
                # Too many: subsample down to the mean class size
                images = np.random.choice(images,
                                          target_files_size,
                                          replace=False).tolist()
            elif len(images) < target_files_size:
                # Too few: duplicate random images up to the mean class size
                added = []
                while len(images) + len(added) < target_files_size:
                    offset = target_files_size - len(images) - len(added)
                    if offset >= len(images):
                        added.extend(images)
                    else:
                        added.extend(np.random.choice(images,
                                                      offset,
                                                      replace=False).tolist())
                images.extend(added)
        for image in images:
            with open(os.path.join(PATH_IMAGES, image), 'rb') as old:
                target_file = os.path.join(sub_dir, image)
                # Duplicates from over-sampling get a mangled file name
                while os.path.exists(target_file):
                    target_file = target_file.replace('.', '_.')
                with open(target_file, 'wb') as new:
                    new.write(old.read())
                output('Write finish %s' % image)
    output('Completed.')
1.3. Model training
The training scripts are:
- classifier_10.py
- classifier_base.py
- classifier_inception_resnet_v2.py
- classifier_inception_v3.py
- classifier_resnet.py
- classifier_vgg16.py
- classifier_vgg19.py
- classifier_xception.py
- classifier_xception_trainable.py
Run any of the classifier_xxx.py training scripts (except classifier_base). They cover classic models such as VGG16/19, Xception, Inception-V3, and Inception-ResNet-V2.
2. Highlights
[1] - Ensembling of multiple single models, with several ensembling strategies to choose from
[2] - Arbitrary combinations of ensembling strategies, with automatic selection of the best one
[3] - Automatic selection of the best weight file when resuming interrupted training
[4] - Supports the VGG16, VGG19, ResNet50, Inception-V3, Xception and Inception-ResNet-V2 models
[5] - The imgaug image augmentation library replaces Keras's built-in image preprocessing
[6] - Multi-process image preprocessing
3. Pitfalls
3.1. Data augmentation matters a lot
Keras's built-in image augmentation is nowhere near enough, so the imgaug image augmentation library was used instead; the variety of transforms it offers is far beyond what Keras can currently match, squeezing as much as possible out of the limited dataset.
Gain: 1-3 percentage points
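As a rough illustration of the approach, a typical imgaug pipeline looks like the sketch below; the specific transforms and parameters here are assumptions, not the exact pipeline used in this repo.

# Minimal imgaug sketch; transforms and parameters are illustrative only.
from imgaug import augmenters as iaa
import numpy as np

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                                    # horizontal flip, p=0.5
    iaa.Crop(percent=(0, 0.1)),                         # random crop up to 10% per side
    iaa.Sometimes(0.5, iaa.GaussianBlur(sigma=(0, 1.0))),
    iaa.Affine(rotate=(-15, 15), scale=(0.9, 1.1)),
], random_order=True)

# images: a batch of uint8 arrays with shape (N, H, W, 3)
images = np.random.randint(0, 255, (8, 224, 224, 3), dtype=np.uint8)
augmented = seq.augment_images(images)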
3.2. Use the CPU as efficiently as possible
Leave training to the GPU. After switching to imgaug for image processing, one epoch took 90+ minutes on a 1050 Ti; profiling showed most of that time was spent on image augmentation, so that part was moved to multiple processes.
Epoch time dropped from ~90 minutes to ~30 minutes
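A minimal sketch of that idea using a multiprocessing pool; the helper name augment_batch and the worker layout are assumptions (workers must be module-level functions so they can be pickled), and the repo's actual pipeline may differ.

# Hypothetical sketch: run imgaug augmentation in worker processes so the
# GPU is not starved waiting for CPU-side preprocessing.
from multiprocessing import Pool
from imgaug import augmenters as iaa
import numpy as np

seq = iaa.Sequential([iaa.Fliplr(0.5), iaa.Affine(rotate=(-15, 15))])

def augment_batch(batch):
    # Each worker augments one batch of uint8 images (N, H, W, 3).
    return seq.augment_images(batch)

if __name__ == '__main__':
    batches = [np.random.randint(0, 255, (32, 224, 224, 3), dtype=np.uint8)
               for _ in range(8)]
    with Pool(processes=4) as pool:
        augmented_batches = pool.map(augment_batch, batches)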
3.3. Normalization matters a lot
First compute the mean and std over the whole training set, then standardize the training inputs with that mean and std (see mean_var_fetcher.py).
Gain: 0.5-1.0 percentage points.
mean_var_fetcher.py:
from PIL import Image
import numpy as np
import config


def get_files(dir):
    # Recursively collect all file paths under dir.
    import os
    if not os.path.exists(dir):
        return []
    if os.path.isfile(dir):
        return [dir]
    result = []
    for subdir in os.listdir(dir):
        sub_path = os.path.join(dir, subdir)
        result += get_files(sub_path)
    return result


r = 0  # sum of R channel
g = 0  # sum of G channel
b = 0  # sum of B channel
r_2 = 0  # sum of R^2
g_2 = 0  # sum of G^2
b_2 = 0  # sum of B^2
total = 0

files = get_files(config.PATH_TRAIN_IMAGES)
count = len(files)
for i, image_file in enumerate(files):
    print('Process: %d/%d' % (i, count))
    img = Image.open(image_file)
    # img = img.resize((299, 299))
    img = np.asarray(img)
    img = img.astype('float32') / 255.
    total += img.shape[0] * img.shape[1]
    r += img[:, :, 0].sum()
    g += img[:, :, 1].sum()
    b += img[:, :, 2].sum()
    r_2 += (img[:, :, 0] ** 2).sum()
    g_2 += (img[:, :, 1] ** 2).sum()
    b_2 += (img[:, :, 2] ** 2).sum()

# Per channel: E[X], and Var[X] = E[X^2] - E[X]^2
r_mean = r / total
g_mean = g / total
b_mean = b / total
r_var = r_2 / total - r_mean ** 2
g_var = g_2 / total - g_mean ** 2
b_var = b_2 / total - b_mean ** 2

print('Mean is %s' % ([r_mean, g_mean, b_mean]))
print('Var is %s' % ([r_var, g_var, b_var]))
# Mean is [0.4960301824223457,
#          0.47806493084428053,
#          0.44767167301470545]
# Var is [0.084966025569294362,
#         0.082005493489533315,
#         0.088877477602068156]
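To use these statistics at training time, each input image is standardized channel-wise with the computed mean and std (std = sqrt(var)). A minimal sketch; the function name is hypothetical, and the repo applies this inside its own data pipeline:

import numpy as np

# Per-channel statistics printed by mean_var_fetcher.py above.
MEAN = np.array([0.49603018, 0.47806493, 0.44767167], dtype='float32')
STD = np.sqrt(np.array([0.08496603, 0.08200549, 0.08887748], dtype='float32'))

def standardize(img):
    # img: float32 array in [0, 1], shape (H, W, 3)
    return (img - MEAN) / STD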
3.4. Don't tie fine-tuning down too tightly
This point is especially important!
Unfreeze too much when fine-tuning and training may take forever, or the machine may simply not be able to handle it;
freeze too much and the fixed weights strangle the model's ability to learn.
The advice: unfreeze as many layers as the machine can bear.
Gain: 2-5 percentage points
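A minimal Keras sketch of the idea, assuming an Xception base; the cutoff index below is an illustrative assumption to tune per machine, not the repo's actual setting.

# Hedged sketch: freeze only the early layers of a pretrained base and
# leave the rest trainable. FREEZE_UP_TO is a hypothetical cutoff.
from keras.applications.xception import Xception
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

base = Xception(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
x = GlobalAveragePooling2D()(base.output)
out = Dense(80, activation='softmax')(x)  # 80 scene classes in this challenge
model = Model(base.input, out)

FREEZE_UP_TO = 60  # assumed: freeze only the first 60 layers
for layer in model.layers[:FREEZE_UP_TO]:
    layer.trainable = False
for layer in model.layers[FREEZE_UP_TO:]:
    layer.trainable = True
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])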
3.5. Model choice matters a lot
A poor model trained for days and nights may still lose to a strong model trained for just a few epochs.
Switching VGG16 => Xception gained 5-8 percentage points
3.6. When the loss plateaus, try lowering the LR
If the loss won't go down, shrink the learning rate, typically by a factor of about 5 or 10.
Gain: 1-3 percentage points
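Keras can automate this with the ReduceLROnPlateau callback; a minimal sketch, where the monitored metric, factor, and patience are assumptions:

from keras.callbacks import ReduceLROnPlateau

# Divide the LR by 10 whenever val_loss stops improving for 3 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.1,
                              patience=3, min_lr=1e-6, verbose=1)
# model.fit(..., callbacks=[reduce_lr])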
3.7. Monitor training with TensorBoard
Use TensorFlow's TensorBoard visualization tool wherever possible; it makes it easy to keep a macro view of the training process.
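With Keras this is one callback. The log directory below follows the 'summary' entry from CONTEXT in config.py; the model name 'xception' is an illustrative assumption.

from keras.callbacks import TensorBoard

# Write training summaries under log/<model-name>, then inspect them with:
#   tensorboard --logdir=log
tensorboard = TensorBoard(log_dir='log/xception')
# model.fit(..., callbacks=[tensorboard])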
3.8. A moderate amount of overfitting is healthy
If the model never overfits during training, consider two causes:
- the model is too simple and lacks capacity to fit; increase network complexity
- the data augmentation is too aggressive, so some features can never be learned
3.9. Model ensembling
When a single model has no headroom left, try ensembling several single models.
Options include majority voting, averaging the predictions, weighting models by their accuracy, and so on, as sketched below.
Gain: 0.5-1.5 percentage points
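A minimal sketch of the averaging and accuracy-weighted variants; the model count and validation accuracies here are illustrative assumptions, not the repo's actual numbers.

import numpy as np

# preds: per-model softmax outputs, each of shape (num_samples, num_classes)
preds = [np.random.rand(4, 80) for _ in range(3)]  # 3 hypothetical models
accs = np.array([0.93, 0.95, 0.94])                # assumed validation accuracies

mean_ensemble = np.mean(preds, axis=0)                        # plain averaging
weighted_ensemble = np.tensordot(accs / accs.sum(), preds, axes=1)  # acc-weighted

final_labels = weighted_ensemble.argmax(axis=1)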
3.10. Test-time augmentation
To make predictions more robust, augment each test image (horizontal flip, random patch crops, etc.), predict on all of these sibling copies, and take the mean of the results as the final prediction.
Gain: 0.25-1.0 percentage points
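A minimal sketch of the horizontal-flip variant; the model and its preprocessing are assumed to exist, and the repo may average over more augmented copies.

import numpy as np

def predict_tta(model, batch):
    # batch: preprocessed images of shape (N, H, W, 3).
    # Average the predictions on the original and the horizontally flipped copy.
    p_orig = model.predict(batch)
    p_flip = model.predict(batch[:, :, ::-1, :])  # flip along the width axis
    return (p_orig + p_flip) / 2.0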