Object Detection Evaluation Metric mAP and How to Compute It [Translation]

Author: AIHGF
Published: January 19, 2019
Categories: Object Detection

Original: Measuring Object Detection models - mAP - What is Mean Average Precision? - 2018.01.27
Source: TARANG SHAH - BLOG

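For most problems tackled with machine learning, several candidate models are usually available; each has its own characteristics, and its behavior changes with a variety of factors.

Machine learning models are typically evaluated on a validation/test dataset using metrics such as accuracy, precision, and recall. The metric is generally chosen according to the application scenario and use case, and for each application it is important to pick a metric that evaluates model performance objectively.

This post introduces the most commonly used metric for object detection: mAP (Mean Average Precision).

Most evaluation metrics are easy to understand and compute. In binary classification, for example, precision and recall are simple, intuitive statistics.

Object detection, however, is a different and interesting problem.

Even if a detector recognizes that there is a cat in an image, it is of little use if it cannot find where the cat is.

Because object detection must predict both whether each object is present and where it is, computing mAP takes a little more care.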

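1. The Object Detection Problem

Given an image, object detection means finding the objects it contains, locating each of them, and assigning each a class.

Detection models are usually trained on a dataset with a fixed set of classes, so a model only locates and classifies objects of the classes in that dataset.

The location of an object is usually expressed as a rectangular bounding box.

Object detection therefore covers both localizing the objects in an image and classifying them, as illustrated below:

Figure 1 - Image processing problems, from Stanford CS231n course slides (lecture 8).

The mAP metric described below is aimed at algorithms that predict object locations and classes together. As Figure 1 suggests, mAP is useful for evaluating localization, object detection, and segmentation models.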

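2. Evaluating Object Detection Models

2.1. Why mAP?

In object detection, each image may contain multiple objects of different classes, and both the classification and the localization performance of the model must be evaluated.

The precision metric used in image classification therefore cannot be applied directly, which is why mAP is used.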

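2.2. About Ground Truth

For any algorithm, evaluation metrics are computed against ground truth (GT) data. Ground truth is generally known only for the training, validation, and test datasets.

For object detection, the GT consists of the image, the classes of the objects in it, and the bounding box of each object.

For example:

Here we have a real image (jpg, png, etc.) and the corresponding text annotations: bounding box coordinates (x, y, width, height) and classes, drawn on the image above as red boxes with blue class labels.

For this example, training requires the original image (below) and the corresponding GT (the coordinates and classes of the 3 objects, in the table below). Assume the image is 1000x800px and all coordinates are approximate values in pixels.

Figure - The actual image

Table - GT annotations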

Class	X coordinate	Y coordinate	Box Width	Box Height
Dog	100	600	150	100
Horse	700	300	200	250
Person	400	400	100	500
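As a rough illustration, the annotations in the table can be held in a small Python structure and converted to corner coordinates. The source does not say whether (x, y) is a corner or the center of the box; the sketch below assumes it is the center, because under that reading all three boxes fit inside the 1000x800px image. The class name `GroundTruthBox` and its `corners` method are made up for this example.

```python
from typing import NamedTuple, Tuple


class GroundTruthBox(NamedTuple):
    label: str
    x: float       # assumption: x of the box center, in pixels
    y: float       # assumption: y of the box center, in pixels
    width: float
    height: float

    def corners(self) -> Tuple[float, float, float, float]:
        """Return (xmin, ymin, xmax, ymax) under the center-point assumption."""
        half_w, half_h = self.width / 2, self.height / 2
        return (self.x - half_w, self.y - half_h,
                self.x + half_w, self.y + half_h)


# The three annotations from the GT table.
ground_truth = [
    GroundTruthBox("Dog", 100, 600, 150, 100),
    GroundTruthBox("Horse", 700, 300, 200, 250),
    GroundTruthBox("Person", 400, 400, 100, 500),
]

for box in ground_truth:
    print(box.label, box.corners())
```

If the annotations actually store the top-left corner, only `corners` needs to change.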

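We will use this example to work through the mAP computation.

3. Computing mAP

Assume a trained object detection model is already available and we only need to evaluate it on the validation dataset.

A detection model outputs a large number of predictions, but most of them have a very low confidence score, so only predictions whose confidence exceeds a chosen threshold are considered.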
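A minimal sketch of this filtering step is shown here. The detection records, their scores, and the helper name `filter_by_confidence` are all invented for illustration; real detectors return their own output formats.

```python
from typing import Dict, List

# Hypothetical raw detections for one image; labels, scores and boxes are made up.
raw_detections = [
    {"label": "dog", "score": 0.92, "box": (30, 545, 180, 655)},
    {"label": "horse", "score": 0.88, "box": (610, 170, 805, 430)},
    {"label": "person", "score": 0.81, "box": (345, 160, 455, 640)},
    {"label": "dog", "score": 0.12, "box": (500, 100, 560, 150)},
    {"label": "person", "score": 0.05, "box": (10, 10, 40, 90)},
]


def filter_by_confidence(detections: List[Dict], threshold: float) -> List[Dict]:
    """Keep only detections whose confidence score exceeds the threshold."""
    return [det for det in detections if det["score"] > threshold]


kept = filter_by_confidence(raw_detections, threshold=0.5)
print([(det["label"], det["score"]) for det in kept])
```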

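Running the model on the original image and filtering by the confidence threshold gives:

Figure 2 - Model predictions.

A human, as an expert object detector, can tell that the detections in Figure 2 are correct, but how do we quantify that?

First we need a way to judge whether each detection is correct. IoU (Intersection over Union) is a simple, intuitive measure of the correctness of a given bounding box.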

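3.1. IoU

IoU is the ratio of the intersection to the union of the predicted box and the GT box. It was first proposed by Paul Jaccard in the early 1900s.

To get the intersection and union, first overlay the predicted box on the GT box, as shown:

Figure 3 - Overlaying the predicted box and the GT box for the horse class.

For each class, the region where the predicted box and the GT box overlap is the intersection, and the total region they cover is the union. For the horse class, the intersection and union are shown below: the cyan region is the intersection, and the cyan plus orange region is the union.

Figure 4 - Intersection and union for the horse class; the intersection here is fairly large.

IoU is then computed as:

$$ IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}} $$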
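For axis-aligned boxes the ratio can be sketched in a few lines of Python. This version assumes boxes are given as (xmin, ymin, xmax, ymax) in continuous coordinates; the `voc_eval.py` listing later in this post instead adds 1 to widths and heights, following the VOC pixel convention. The function name `iou` is chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes offset by 5 in each direction: intersection 25, union 175.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```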

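3.2. Identifying Correct Detections and Computing Precision and Recall

As in any machine learning problem, computing precision and recall requires identifying True Positives, False Positives, True Negatives, and False Negatives.

IoU is used to decide whether a detection is correct (True) or wrong (False), which gives the True Positives and False Positives. The usual IoU threshold is 0.5: if IoU > 0.5 the detection is a True Positive; otherwise it is a False Positive. The COCO dataset and challenge recommend computing the metric over several IoU thresholds, but for simplicity we stick to an IoU threshold of 0.5 here, which is the PASCAL VOC metric.

To compute recall, we need the number of Negatives. Since every part of the image where no object was predicted counts as a Negative, counting True Negatives is tricky. We can, however, count only the False Negatives, i.e. the objects the model missed.

Another factor to consider is the confidence score the model reports for each detection. Changing the confidence threshold changes whether a predicted box counts as a Positive or a Negative: all predictions (box + class) above the threshold are Positives, and those below it are Negatives.

For each image, the GT annotations give the actual number of objects of each class in the image.

We can compute the IoU of each Positive detection box with the GT boxes. From these IoU values and the IoU threshold (0.5), we count the correct detections (True Positives, TP) and the wrong detections (False Positives, FP) for each class in the image, which are then used to compute the precision for each class.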

$$ Precision = \frac{TP}{TP + FP} $$

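For a given image, the precision of class $C$ is the ratio of the number of true positives for class C to the total number of objects predicted as class C in that image (TP + FP):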

$$ Precision_C = \frac{N(TP)_C}{N(Total)_C} $$

With the number of correctly predicted objects (True Positives, TP) and the number of missed objects (False Negatives, FN), recall follows:

$$ Recall = \frac{TP}{TP + FN} $$

where TP + FN equals the total number of GT objects.
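The counting can be sketched for one class in one image as follows. This is an illustrative greedy matcher, not the exact procedure of any benchmark: detections are visited in descending confidence, each GT box can be matched at most once, and a second detection of an already matched GT box counts as a False Positive. The names `iou` and `count_tp_fp_fn` and the sample boxes are assumptions of this example.

```python
def iou(box_a, box_b):
    """IoU of two (xmin, ymin, xmax, ymax) boxes in continuous coordinates."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_tp_fp_fn(detections, gt_boxes, iou_threshold=0.5):
    """detections: list of (score, box). Returns (TP, FP, FN) for one class."""
    matched = [False] * len(gt_boxes)
    tp = fp = 0
    # Visit the most confident detections first.
    for _score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_iou > iou_threshold and not matched[best_idx]:
            matched[best_idx] = True   # this GT box is now accounted for
            tp += 1
        else:
            fp += 1                    # poor overlap or duplicate detection
    fn = matched.count(False)          # GT boxes that nothing matched
    return tp, fp, fn


# Made-up data: two GT boxes, one good detection, one duplicate, one stray box.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (1, 1, 10, 10)), (0.8, (0, 0, 9, 9)), (0.7, (50, 50, 60, 60))]

tp, fp, fn = count_tp_fp_fn(dets, gt)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn, precision, recall)
```

Here the second GT box is never detected, so it contributes one False Negative and recall is 1/2, while the duplicate and the stray detection bring precision down to 1/3.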

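3.3. Computing mAP

mAP (Mean Average Precision) has several definitions. It is commonly used in information retrieval and in object detection, but it is computed differently in each field. Only the object detection version is covered here.

mAP for object detection was first formalized in the PASCAL Visual Object Classes (VOC) challenge.

From the way precision and recall are computed above, two other variables affect their values: the IoU threshold and the confidence threshold.

IoU is a simple geometric measure and is easy to standardize. PASCAL VOC, for example, uses an IoU threshold of 0.5, while the COCO challenge computes mAP over multiple IoU thresholds from 0.5 to 0.95 in steps of 0.05.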
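The COCO-style set of thresholds is easy to write down. The sketch below only shows the threshold grid and the averaging step; `ap_at_iou` is a hypothetical callable standing in for whatever routine computes AP at a single IoU threshold.

```python
import numpy as np

# Ten IoU thresholds: 0.50, 0.55, ..., 0.95.
coco_iou_thresholds = np.linspace(0.5, 0.95, 10)


def average_over_iou_thresholds(ap_at_iou, thresholds=coco_iou_thresholds):
    """Average a per-threshold AP function over the given IoU thresholds.

    ap_at_iou is a placeholder for a real evaluation routine.
    """
    return float(np.mean([ap_at_iou(t) for t in thresholds]))


print(np.round(coco_iou_thresholds, 2))
```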

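The confidence threshold, however, varies considerably between models: 0.5 for one model may be equivalent to 0.8 for another, and either choice changes the shape of the precision-recall (PR) curve. The PASCAL VOC organizers proposed a way to handle this. In their paper they recommend computing Average Precision (AP) as follows: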

For a given task and class, the precision/recall curve is computed from a method’s ranked output.
Recall is defined as the proportion of all positive examples ranked above a given rank.
Precision is the proportion of all examples above that rank which are from the positive class.
The AP summarises the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels [0,0.1,…,1]:

$$ AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{interp}(r) $$

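In other words, 11 confidence thresholds (which determine the "rank") are chosen such that the recall at those thresholds is 0, 0.1, 0.2, 0.3, ..., 0.9, 1.0. AP is then defined as the mean of the precision values at these 11 recall levels.

The paper goes on to describe how the precision is computed: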

The precision at each recall level r is interpolated by taking the maximum precision measured for a method for which the corresponding recall exceeds r:

$$ p_{interp}(r) = \max_{\tilde{r} : \tilde{r} \ge r} p(\tilde{r}) $$

That is, the precision at recall level r is taken to be the maximum precision observed at any recall greater than or equal to r.
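A small worked example may make the 11-point rule concrete. The sketch assumes the detections of one class have already been sorted by descending confidence and flagged as TP or FP; the flags and the GT count below are invented. It uses `recall >= r`, the same comparison as the `voc_ap` function in the listing in section 5.

```python
import numpy as np


def precision_recall_curve(is_tp, num_gt):
    """is_tp: TP flags (1/0) of detections sorted by descending confidence."""
    is_tp = np.asarray(is_tp, dtype=float)
    tp_cum = np.cumsum(is_tp)
    fp_cum = np.cumsum(1.0 - is_tp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)
    return recall, precision


def eleven_point_ap(recall, precision):
    """Mean of the interpolated precision at recall 0, 0.1, ..., 1."""
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        # Interpolated precision: best precision at any recall >= r.
        p_interp = precision[mask].max() if mask.any() else 0.0
        ap += p_interp / 11.0
    return ap


# Made-up ranking: TP, FP, TP, with 2 GT objects in total.
recall, precision = precision_recall_curve([1, 0, 1], num_gt=2)
ap = eleven_point_ap(recall, precision)
print(recall, precision, ap)
```

In this toy case the interpolated precision is 1 at the six recall levels up to 0.5 and 2/3 at the five levels above it, so AP is (6 + 5 * 2/3) / 11, about 0.848.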

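mAP is then the mean of the average precision over all classes.

For a particular class C, the average precision is the sum of all precision values for that class over the validation dataset divided by the number of images that contain an object of class C.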

$$ AP_C = \frac{\sum Precision_C}{N(TotalImages)_C} $$

$$ mAP = \frac{\sum AP_C}{N(Classes)} $$
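The final averaging step is a plain mean over classes. In this sketch the per-class AP values are placeholders, not results from any model, and `mean_average_precision` is a name chosen for the example.

```python
def mean_average_precision(ap_per_class):
    """Average the per-class AP values; ap_per_class maps class name -> AP."""
    if not ap_per_class:
        raise ValueError("need at least one class")
    return sum(ap_per_class.values()) / len(ap_per_class)


# Placeholder AP values for the three classes of the running example.
example_ap = {"dog": 0.80, "horse": 0.60, "person": 0.70}
print(mean_average_precision(example_ap))
```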

What is mAP? Understanding the statistic of choice for comparing object detection models.

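4. Notes on mAP

A few points matter when comparing mAP values:

[1] - mAP is generally computed over the entire dataset.

[2] - Although the absolute quality of a model's output is hard to quantify, mAP is a reasonably good relative metric. Computed on public datasets, it gives a simple and effective way to compare object detection algorithms.

[3] - Depending on how each class is represented in the training data, Average Precision can vary widely between classes: classes with good training data tend to have high AP, while classes with poor training data may have very low AP. A model with a mediocre mAP may therefore be very good at detecting some classes and very poor at others. When analyzing a model, it helps to look at the AP of each class separately; a low AP can indicate a class that needs more training samples.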
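Point [3] suggests looking at AP class by class. One simple way to do that, assuming the per-class AP values are already available in a dictionary (the numbers here are placeholders), is to sort the classes from weakest to strongest:

```python
def classes_by_ap(ap_per_class):
    """Return (class, AP) pairs sorted from lowest to highest AP."""
    return sorted(ap_per_class.items(), key=lambda item: item[1])


# Placeholder values; the weakest classes are candidates for more training data.
example_ap = {"dog": 0.80, "horse": 0.35, "person": 0.70}
for name, ap in classes_by_ap(example_ap):
    print(f"{name}: {ap:.2f}")
```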

5. voc_eval.py

mAP computation for the VOC dataset in Detectron - voc_eval.py

"""Python implementation of the PASCAL VOC devkit's AP evaluation code."""

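try:
    import cPickle  # Python 2
except ImportError:
    import pickle as cPickle  # Python 3: cPickle was merged into pickle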
import logging
import numpy as np
import os
import xml.etree.ElementTree as ET

logger = logging.getLogger(__name__)


def parse_rec(filename):
    """Parse a PASCAL VOC xml file."""
    tree = ET.parse(filename)
    objects = []
    for obj in tree.findall('object'):
        obj_struct = {}
        obj_struct['name'] = obj.find('name').text
        obj_struct['pose'] = obj.find('pose').text
        obj_struct['truncated'] = int(obj.find('truncated').text)
        obj_struct['difficult'] = int(obj.find('difficult').text)
        bbox = obj.find('bndbox')
        obj_struct['bbox'] = [int(bbox.find('xmin').text),
                              int(bbox.find('ymin').text),
                              int(bbox.find('xmax').text),
                              int(bbox.find('ymax').text)]
        objects.append(obj_struct)

    return objects


def voc_ap(rec, prec, use_07_metric=False):
    """
    Compute VOC AP given precision and recall.
    If use_07_metric is true, uses the VOC 07 11-point method (default:False).
    """
    if use_07_metric: # method used by VOC 2007
        # 11 point metric
        ap = 0.
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                p = np.max(prec[rec >= t])
            ap = ap + p / 11.
    else: # newer method: area under the PR curve
        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

        # to calculate area under PR curve, look for points
        # where X axis (recall) changes value
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap


def voc_eval(detpath,
             annopath,
             imagesetfile,
             classname,
             cachedir,
             ovthresh=0.5,
             use_07_metric=False):
    """rec, prec, ap = voc_eval(detpath,
                                annopath,
                                imagesetfile,
                                classname,
                                [ovthresh],
                                [use_07_metric])

    Top level function that does the PASCAL VOC evaluation.

    detpath: Path to detections
        detpath.format(classname) should produce the detection results file.
    annopath: Path to annotations
        annopath.format(imagename) should be the xml annotations file.
    imagesetfile: Text file containing the list of images, one image per line.
    classname: Category name (duh)
    cachedir: Directory for caching the annotations
    [ovthresh]: Overlap threshold (default = 0.5)
    [use_07_metric]: Whether to use VOC07's 11 point AP computation
        (default False)
    """
    # assumes detections are in detpath.format(classname)
    # assumes annotations are in annopath.format(imagename)
    # assumes imagesetfile is a text file with each line an image name
    # cachedir caches the annotations in a pickle file
    
    # ground truth (GT)
    # first load gt
    if not os.path.isdir(cachedir):
        os.mkdir(cachedir)
    imageset = os.path.splitext(os.path.basename(imagesetfile))[0]
    cachefile = os.path.join(cachedir, imageset + '_annots.pkl')
    # read list of images
    with open(imagesetfile, 'r') as f:
        lines = f.readlines()
    imagenames = [x.strip() for x in lines]

    if not os.path.isfile(cachefile):
        # load annots
        recs = {}
        for i, imagename in enumerate(imagenames):
            recs[imagename] = parse_rec(annopath.format(imagename))
            if i % 100 == 0:
                logger.info(
                    'Reading annotation for {:d}/{:d}'.format(
                        i + 1, len(imagenames)))
        # save (pickle files must be opened in binary mode on Python 3)
        logger.info('Saving cached annotations to {:s}'.format(cachefile))
        with open(cachefile, 'wb') as f:
            cPickle.dump(recs, f)
    else:
        # load
        with open(cachefile, 'rb') as f:
            recs = cPickle.load(f)

    # extract gt objects for this class
    class_recs = {}
    npos = 0
    for imagename in imagenames:
        R = [obj for obj in recs[imagename] if obj['name'] == classname]
        bbox = np.array([x['bbox'] for x in R])
        # np.bool was removed from recent NumPy releases; the builtin bool works everywhere
        difficult = np.array([x['difficult'] for x in R]).astype(bool)
        det = [False] * len(R)
        npos = npos + sum(~difficult)
        class_recs[imagename] = {'bbox': bbox,
                                 'difficult': difficult,
                                 'det': det}
    
    # detections
    # read dets
    detfile = detpath.format(classname)
    with open(detfile, 'r') as f:
        lines = f.readlines()

    splitlines = [x.strip().split(' ') for x in lines]
    image_ids = [x[0] for x in splitlines]
    confidence = np.array([float(x[1]) for x in splitlines])
    BB = np.array([[float(z) for z in x[2:]] for x in splitlines])
    
    ####################
    # sort detections by confidence, highest first
    sorted_ind = np.argsort(-confidence)
    BB = BB[sorted_ind, :] # predicted box coordinates
    image_ids = [image_ids[x] for x in sorted_ind] # image id of each predicted box

    # go down dets and mark TPs and FPs
    nd = len(image_ids)
    tp = np.zeros(nd)
    fp = np.zeros(nd)
    for d in range(nd):
        R = class_recs[image_ids[d]]
        bb = BB[d, :].astype(float)
        ovmax = -np.inf
        BBGT = R['bbox'].astype(float)
        
        # compute IoU against every GT box of this class in the image
        if BBGT.size > 0:
            # compute overlaps
            # intersection
            ixmin = np.maximum(BBGT[:, 0], bb[0])
            iymin = np.maximum(BBGT[:, 1], bb[1])
            ixmax = np.minimum(BBGT[:, 2], bb[2])
            iymax = np.minimum(BBGT[:, 3], bb[3])
            iw = np.maximum(ixmax - ixmin + 1., 0.)
            ih = np.maximum(iymax - iymin + 1., 0.)
            inters = iw * ih

            # union
            uni = ((bb[2] - bb[0] + 1.) * (bb[3] - bb[1] + 1.) +
                   (BBGT[:, 2] - BBGT[:, 0] + 1.) *
                   (BBGT[:, 3] - BBGT[:, 1] + 1.) - inters)

            overlaps = inters / uni
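            ovmax = np.max(overlaps) # largest IoU with any GT box
            jmax = np.argmax(overlaps) # index of the GT box with the largest IoU

        if ovmax > ovthresh: # IoU above the threshold
            if not R['difficult'][jmax]: # 'difficult' objects are ignored
                if not R['det'][jmax]:   # GT box not matched yet
                    tp[d] = 1.
                    R['det'][jmax] = 1   # mark the GT box as detected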
                else:
                    fp[d] = 1.
        else:
            fp[d] = 1.

    # compute precision and recall
    fp = np.cumsum(fp)
    tp = np.cumsum(tp)
    rec = tp / float(npos)
    # avoid divide by zero in case the first detection matches a difficult
    # ground truth
    prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
    ap = voc_ap(rec, prec, use_07_metric)

    return rec, prec, ap
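To see what the non-2007 branch of `voc_ap` computes, here is a compact, self-contained restatement applied to a toy curve. It is a sketch for illustration rather than a replacement for the listing: it builds the same monotone precision envelope with `np.maximum.accumulate` and sums the area under it. The recall and precision values are made up.

```python
import numpy as np


def area_under_pr_envelope(recall, precision):
    """Area under the monotonically decreasing envelope of a PR curve."""
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # Precision envelope: each point takes the best precision to its right.
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]

    # Sum (delta recall) * precision wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


# Toy curve from a ranking of TP, FP, TP with 2 GT objects.
toy_recall = np.array([0.5, 0.5, 1.0])
toy_precision = np.array([1.0, 0.5, 2 / 3])
print(area_under_pr_envelope(toy_recall, toy_precision))
```

For this curve the envelope is 1 over the recall range 0 to 0.5 and 2/3 from 0.5 to 1, so the area is 0.5 * 1 + 0.5 * 2/3, about 0.833.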

Last modification：May 10th, 2019 at 10:52 am

18 comments

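唯一
March 19th, 2020 at 02:18 pm

Yes. The test set contains this image, but it is not listed in test.txt or val.txt; it is listed in train.txt and trainval.txt. When I change the path to train.txt, the failing image name is missing from train.txt and val.txt but present in test.txt. I also deleted the pkl in the cache folder and still get a similar error. Evaluation works on the original 21-class VOC dataset, but after switching to my own dataset with 2 classes this problem appears.

float
February 8th, 2020 at 11:49 am

Hello! Many blog posts compute mAP with voc_eval.py. My dataset is in COCO format; how should I compute mAP?

1. AIHGF
  February 8th, 2020 at 04:15 pm

  @float

  The post on evaluation metrics for object detection and related tasks on the COCO dataset may help.

lala
January 4th, 2020 at 10:03 pm

"In other words, 11 different confidence thresholds are chosen (which determine the "rank"), and the recall is then the confidence threshold - [0, 0.1, 0.2, 0.3, ..., 0.9, 1.0]."
Why is the recall the confidence threshold here?

1. AIHGF
  January 8th, 2020 at 08:16 pm

  @lala

  The recall is, in effect, being used as the confidence threshold.

MasterLeo
November 11th, 2019 at 10:21 pm

Do you have the corresponding source code you could share? Many thanks.

1. AIHGF
  November 12th, 2019 at 11:11 am

  @MasterLeo

  The source code for computing mAP?

MasterLeo
November 11th, 2019 at 10:18 pm

This problem seems a bit strange to me; I have not found a solution yet.

MasterLeo
November 11th, 2019 at 09:53 pm

It is there, and the generated annots.pkl also has content.

MasterLeo
November 8th, 2019 at 07:58 pm

Hello, I get the following error when running the code. How can I fix it?
R = class_recs[image_ids[d]]
KeyError: '000044'

1. 唯一
  March 20th, 2020 at 10:13 am

  @MasterLeo

  Did you solve the problem? Could you reply and help me out? Thanks.

2. 唯一
  March 19th, 2020 at 02:23 pm

  @MasterLeo

  Did you solve it? My test set contains this image, but it is not listed in test.txt or val.txt; it is listed in train.txt and trainval.txt. When I change the path to train.txt, the failing image name is missing from train.txt and val.txt but present in test.txt. I also deleted the pkl in the cache folder and still get a similar error. Evaluation works on the original 21-class VOC dataset, but after switching to my own dataset with 2 classes this problem appears.

3. 唯一
  January 1st, 2020 at 08:35 pm

  @MasterLeo

  Has this problem been solved? I ran into the same one. Is it that the test set does not contain this image? In my case the image is in JPEGImages of the test set but is not listed in the files under Main in ImageSets. How should I fix this? Thanks. QQ 1497243382. Please reply soon, thanks.

4. AIHGF
  November 9th, 2019 at 05:05 pm

  @MasterLeo

  The dict has no key 000044; you need to check your data.

  1. 唯一
    January 1st, 2020 at 08:37 pm

    @AIHGF

    Which dict do you mean? I do not quite follow. Do you mean the test set does not contain this image? In my case the image is in JPEGImages of the test set but is not listed in the files under Main in ImageSets. How should I fix this? Thanks. QQ 1497243382

    AIHGF
    January 2nd, 2020 at 06:13 pm

    @唯一

    Is it the same error?

    唯一
    March 19th, 2020 at 02:30 pm

    @AIHGF

    Training works, but evaluation fails with keyerror:002296

    唯一
    March 19th, 2020 at 02:22 pm

    @AIHGF

    Yes. The test set contains this image, but it is not listed in test.txt or val.txt; it is listed in train.txt and trainval.txt. When I change the path to train.txt, the failing image name is missing from train.txt and val.txt but present in test.txt. I also deleted the pkl in the cache folder and still get a similar error. Evaluation works on the original 21-class VOC dataset, but after switching to my own dataset with 2 classes this problem appears.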