QA System Evaluation Code: Computing Precision, Recall, and F1 Score
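This code evaluates a question-answering system. It reads two JSON files: one with the gold answers and one with the predicted answers. Only questions that have at least one parse rated Good (question quality) and Complete (parse quality) are evaluated. For each such question, it compares the predictions with the gold answers of every parse and keeps the precision, recall, and F1 score of the parse with the highest F1. Finally, it prints the number of questions evaluated, the precision, recall, and F1 averaged over questions, the F1 of the averaged precision and recall, and the ratio of questions answered exactly correctly.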
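To make the expected inputs concrete, here is a minimal sketch of the two JSON structures, inferred only from the keys the script reads (`Questions`, `QuestionId`, `Parses`, `AnnotatorComment`, `Answers`). The question ID and answer values are invented for illustration, and the exact shape of each answer item is an assumption, since it depends on `CalculatePRF1`, which this excerpt does not show. The helper `is_evaluable` is a hypothetical name that mirrors the script's skip rule.

```python
import json

# Hypothetical gold data: key names come from the script, values are invented.
gold_data = {
    "Questions": [
        {
            "QuestionId": "Q1",
            "Parses": [
                {
                    "AnnotatorComment": {
                        "QuestionQuality": "Good",
                        "ParseQuality": "Complete",
                    },
                    # Assumption: answers shown as plain strings for simplicity.
                    "Answers": ["answer_a", "answer_b"],
                }
            ],
        }
    ]
}

# Hypothetical predictions: a list of records, one per question.
pred_answers = [{"QuestionId": "Q1", "Answers": ["answer_a"]}]


def is_evaluable(entry):
    """Return True if at least one parse is rated Good and Complete."""
    return any(
        parse["AnnotatorComment"]["QuestionQuality"] == "Good"
        and parse["AnnotatorComment"]["ParseQuality"] == "Complete"
        for parse in entry["Parses"]
    )


# Index predictions by question ID, as the script does.
pred_by_id = {item["QuestionId"]: item["Answers"] for item in pred_answers}

# Both structures survive a JSON round trip, so they can be written to files.
gold_roundtrip = json.loads(json.dumps(gold_data))
```

Writing `gold_data` and `pred_answers` to two files with `json.dump` would give inputs of the general shape the script expects on its command line.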
import json
import sys


def main():
    # CalculatePRF1(goldAnswers, predAnswers) is not shown in this excerpt; it must be
    # defined or imported elsewhere and return (precision, recall, f1).
    if len(sys.argv) != 3:
        print('Usage: python eval.py goldData predAnswers')
        sys.exit(-1)

    # Load the gold data and the predictions, closing each file after reading.
    with open(sys.argv[1]) as goldFile:
        goldData = json.load(goldFile)
    with open(sys.argv[2]) as predFile:
        predAnswers = json.load(predFile)

    # Index the predicted answers by question ID.
    PredAnswersById = {}
    for item in predAnswers:
        PredAnswersById[item['QuestionId']] = item['Answers']

    total = 0.0
    f1sum = 0.0
    recSum = 0.0
    precSum = 0.0
    numCorrect = 0
    for entry in goldData['Questions']:
        # Evaluate a question only if at least one parse is rated Good and Complete.
        # A question with no parses is skipped as well, so no separate empty check is needed.
        skip = True
        for parse in entry['Parses']:
            comment = parse['AnnotatorComment']
            if comment['QuestionQuality'] == 'Good' and comment['ParseQuality'] == 'Complete':
                skip = False
        if skip:
            continue

        total += 1

        # A question without a prediction still counts toward total, so it scores zero.
        qid = entry['QuestionId']
        if qid not in PredAnswersById:
            print('The problem ' + qid + ' is not in the prediction set')
            print('Continue to evaluate the other entries')
            continue

        questionPreds = PredAnswersById[qid]

        # Keep the scores of the parse whose gold answers give the highest F1.
        bestf1 = -9999
        bestf1Rec = -9999
        bestf1Prec = -9999
        for parse in entry['Parses']:
            prec, rec, f1 = CalculatePRF1(parse['Answers'], questionPreds)
            if f1 > bestf1:
                bestf1 = f1
                bestf1Rec = rec
                bestf1Prec = prec

        f1sum += bestf1
        recSum += bestf1Rec
        precSum += bestf1Prec
        if bestf1 == 1.0:
            numCorrect += 1

    print('Number of questions:', int(total))
    if total == 0:
        # No question qualified, so the averages are undefined.
        return

    avgPrec = precSum / total
    avgRec = recSum / total
    # Guard the harmonic mean against a zero denominator.
    f1OfAvg = 2 * avgRec * avgPrec / (avgRec + avgPrec) if avgRec + avgPrec > 0 else 0.0
    print('Average precision over questions: %.3f' % avgPrec)
    print('Average recall over questions: %.3f' % avgRec)
    print('Average f1 over questions (accuracy): %.3f' % (f1sum / total))
    print('F1 of average recall and average precision: %.3f' % f1OfAvg)
    print('True accuracy (ratio of questions answered exactly correctly): %.3f' % (numCorrect / total))


if __name__ == '__main__':
    main()
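The script calls `CalculatePRF1`, which is not defined in this excerpt. The following is a minimal sketch of what such a helper might look like, not the original implementation. It assumes each answer is a plain hashable value (for example a string) compared by equality, and it assumes particular conventions for empty answer lists; the real helper may use a different answer format or different edge-case rules. The names `calculate_prf1_sketch` and `best_over_parses` are hypothetical.

```python
def calculate_prf1_sketch(gold_answers, pred_answers):
    """Return (precision, recall, f1) for one question.

    Assumption: answers are hashable values compared by equality.
    """
    gold = set(gold_answers)
    pred = set(pred_answers)

    # Edge cases (assumed conventions, not taken from the original helper).
    if not gold and not pred:
        return 1.0, 1.0, 1.0
    if not gold or not pred:
        return 0.0, 0.0, 0.0

    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0, 0.0, 0.0

    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def best_over_parses(parse_answer_lists, pred_answers):
    """Return the (precision, recall, f1) triple with the highest F1 across parses.

    On ties, the first parse wins, matching the strict `>` comparison in the script.
    The list of parses must be non-empty.
    """
    return max(
        (calculate_prf1_sketch(gold, pred_answers) for gold in parse_answer_lists),
        key=lambda prf: prf[2],
    )


prec, rec, f1 = calculate_prf1_sketch(["a", "b"], ["a"])
print("P=%.3f R=%.3f F1=%.3f" % (prec, rec, f1))  # P=1.000 R=0.500 F1=0.667
```

Using sets means duplicate answers are counted once; whether that matches the intended scoring depends on the dataset, so treat it as a design choice of this sketch.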
Original source: https://www.cveoy.top/t/topic/n4mP. Copyright belongs to the author. Please do not repost or scrape.