Medical Report Generation: On the Automatic Generati

2023-03-28 richybai

Jing B, Xie P, Xing E. On the Automatic Generation of Medical Imaging Reports[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 2577-2586.

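Paper Overview

Medical images are widely used in diagnosis and treatment. Writing the accompanying report is error-prone for inexperienced physicians, and time-consuming and tedious for experienced ones. Automatically generating medical imaging reports can therefore assist physicians in diagnosis. Report generation faces the following challenges: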

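A report has several components, including findings and tags.
Abnormal regions in the image are hard to identify.
Reports are long and contain multiple sentences.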

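To address these challenges, the paper proposes the following methods and validates them on two datasets:

A multi-task learning framework that jointly predicts tags and generates findings.
A co-attention mechanism that localizes abnormal regions.
A hierarchical LSTM that generates long paragraphs.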

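Model Architecture

(Figure: an example medical report)

A medical report looks like the example in the figure above: the findings section describes the medical image, and the tags are the report's keywords. The task takes an image as input and outputs both the tag classification results and the report. (The impression section gives the diagnosis for the case.)
The model architecture is shown in the figure below:

(Figure: model architecture)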

1. Encoder

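Given an input image, a CNN extracts patch features $\{v_n\}_{n=1}^N$, which serve as the visual features. These follow two paths:

1. They enter the MLC (multi-label classification) network, which predicts the tags. The predicted tags are then word-embedded to obtain $\{a_m\}_{m=1}^M$, which serve as the semantic features.
2. The visual features $\{v_n\}_{n=1}^N$ and the semantic features $\{a_m\}_{m=1}^M$ are fed into the co-attention module, which completes the encoding stage.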

2. Decoder

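A report consists of multiple sentences, so the paper first generates a high-level topic vector for each sentence and then generates the sentence from that vector. The context vector output by the co-attention module is first fed into the sentence LSTM, which produces one topic vector per sentence, representing that sentence's semantics. Each topic vector is then fed into the word LSTM to generate the full sentence.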

Tag Prediction

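Tag prediction is a multi-label classification task. Once the visual features $\{v_n\}_{n=1}^N$ are extracted, they are fed into the MLC network, which produces a distribution over the $L$ tags: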

(Equation: tag distribution from multi-label classification)
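Since the equation image is missing here, the following is a hedged reconstruction based only on the surrounding description (one score per tag, followed by an exponential). The names $l_i$ for the indicator of the $i$-th tag and $\mathrm{MLC}_i$ for the $i$-th output of the MLC network are notational assumptions.

```latex
\[
p_{l,\mathrm{pred}}\bigl(l_i = 1 \mid \{v_n\}_{n=1}^{N}\bigr)
\;\propto\;
\exp\bigl(\mathrm{MLC}_i(\{v_n\}_{n=1}^{N})\bigr),
\qquad i = 1, \dots, L
\]
```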

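A prediction score is produced for each tag and then passed through an exponential function (?). My own impression is that this is meant to express a softmax; a threshold then decides between class 1 and class 0, meaning the tag is present or absent. This differs from multi-class classification, where the softmax is applied to the final output vector as a whole. The paper uses the convolutional layers of VGG19 to extract the visual features and the last two FC layers as the MLC. The predicted tags are then embedded as the semantic features $\{a_m\}_{m=1}^M$ for topic generation.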
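The tag-prediction step can be sketched in a few lines of plain Python. This is only an illustration under the softmax-plus-threshold reading given above; `predict_tags`, `embed_tags`, the toy scores, the threshold, and the embedding table are all hypothetical and not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_tags(mlc_scores, threshold):
    """Return the tag distribution and the indices of tags above the threshold."""
    probs = softmax(mlc_scores)
    selected = [i for i, p in enumerate(probs) if p > threshold]
    return probs, selected

def embed_tags(selected, embedding_table):
    """Look up one embedding vector a_m for each selected tag."""
    return [embedding_table[i] for i in selected]

# Hypothetical MLC outputs for L = 4 tags and a toy 2-d embedding table.
scores = [2.0, 0.1, 1.5, -1.0]
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
probs, selected = predict_tags(scores, threshold=0.2)
semantic_features = embed_tags(selected, table)
```

With these toy numbers, tags 0 and 2 clear the threshold, so two semantic feature vectors are passed on to co-attention.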

Co-Attention

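Visual attention can localize targets (object recognition) and help generate image descriptions (image captioning), but it may not provide high-level semantic information. Tags, by contrast, always provide high-level semantic information, so a co-attention mechanism is used to attend to the visual and semantic modalities simultaneously.
Here, co-attention uses $\{v_n\}_{n=1}^N$, $\{a_m\}_{m=1}^M$, and the hidden state $h_{sent}^{(s-1)}$ of the sentence LSTM at step $s-1$ to compute the joint context vector $ctx^{(s)}$ for the next time step $s$.
First, a single-layer feed-forward network computes the attention weights over the visual features and the semantic features: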

(Equation: visual and semantic attentions)
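One plausible form of these attention weights, reconstructed from the description (a single-layer feed-forward network over each feature and the previous hidden state), is given below. The weight names $W_v$, $W_{v,h}$, $W_{v_{att}}$, $W_a$, $W_{a,h}$, $W_{a_{att}}$ are assumptions; $\alpha_{n,s}$ and $\beta_{m,s}$ follow the $\alpha \in \mathbb{R}^{N \times S}$, $\beta \in \mathbb{R}^{M \times S}$ notation used for the regularizer.

```latex
\[
\alpha_{n,s} \;\propto\; \exp\Bigl(W_{v_{att}} \tanh\bigl(W_v v_n + W_{v,h}\, h_{sent}^{(s-1)}\bigr)\Bigr),
\qquad
\beta_{m,s} \;\propto\; \exp\Bigl(W_{a_{att}} \tanh\bigl(W_a a_m + W_{a,h}\, h_{sent}^{(s-1)}\bigr)\Bigr)
\]
```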

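The proportionality sign here amounts to a softmax, taken over the $N$ visual features and over the $M$ semantic features respectively. The final visual and semantic context vectors are the sums of the features weighted by these attentions: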

(Equation: visual and semantic context vectors)

Finally, the two vectors are concatenated and passed through a fully connected layer to obtain the joint context vector $ctx^{(s)}$ that is fed into the sentence LSTM:

(Equation: joint context vector at step $s$)
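Written out, the weighted sums and the concatenation step described above might look as follows. This is a reconstruction under assumptions: $\alpha_{n,s}$ and $\beta_{m,s}$ denote the visual and semantic attention weights at step $s$, and $v_{att}^{(s)}$, $a_{att}^{(s)}$, and $W_{fc}$ are names chosen for illustration.

```latex
\[
v_{att}^{(s)} = \sum_{n=1}^{N} \alpha_{n,s}\, v_n,
\qquad
a_{att}^{(s)} = \sum_{m=1}^{M} \beta_{m,s}\, a_m,
\qquad
ctx^{(s)} = W_{fc}\,\bigl[\,v_{att}^{(s)};\; a_{att}^{(s)}\,\bigr]
\]
```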

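At this point the final $ctx$ is joint in the sense that it combines both modalities. An alternative would be to feed both kinds of features in at the weight-computation stage, which amounts to joining them earlier.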
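The whole co-attention step can be sketched in plain Python to make the data flow concrete. This is a minimal sketch, not the authors' implementation: the parameter names in `params`, the toy dimensions, and all numeric values are assumptions made for illustration.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(features, h_prev, W_f, W_h, w_att):
    """Score each feature with w_att . tanh(W_f f + W_h h_prev), then normalize."""
    h_proj = matvec(W_h, h_prev)
    scores = []
    for f in features:
        hidden = [math.tanh(a + b) for a, b in zip(matvec(W_f, f), h_proj)]
        scores.append(sum(w * x for w, x in zip(w_att, hidden)))
    return softmax(scores)

def weighted_sum(weights, features):
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def co_attention(V, A, h_prev, params):
    """Return the joint context vector plus both sets of attention weights."""
    alpha = attention_weights(V, h_prev, params["W_v"], params["W_vh"], params["w_v_att"])
    beta = attention_weights(A, h_prev, params["W_a"], params["W_ah"], params["w_a_att"])
    v_att = weighted_sum(alpha, V)
    a_att = weighted_sum(beta, A)
    ctx = matvec(params["W_fc"], v_att + a_att)  # '+' concatenates the two lists
    return ctx, alpha, beta

# Toy inputs: N = 3 visual features, M = 2 semantic features, all 2-dimensional.
I2 = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.5, -0.5], [0.2, 0.3]]
h_prev = [0.1, -0.2]
params = {
    "W_v": I2, "W_vh": I2, "w_v_att": [1.0, 1.0],
    "W_a": I2, "W_ah": I2, "w_a_att": [1.0, -1.0],
    "W_fc": [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.5]],
}
ctx, alpha, beta = co_attention(V, A, h_prev, params)
```

The early-joining alternative mentioned above would instead pass both feature sets into a single scoring function before normalizing.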

Sentence LSTM

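This part comprises the sentence LSTM, the topic generator, and the stop control component. The sentence LSTM is a single-layer LSTM that takes $ctx$ as input; the topic generator produces the topic vector $t$, and the stop control component decides whether to continue generating.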

Topic generator

The topic generator takes the sentence LSTM hidden state $h_{sent}^{(s)}$ and the joint context vector $ctx^{(s)}$ and computes the topic vector $t^{(s)}$ for the current step $s$.

(Equation: topic vector)
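A plausible form of the topic vector, assuming a single nonlinear layer over the two inputs named above, is shown below. The weight names $W_{t,h}$ and $W_{t,ctx}$ are assumptions.

```latex
\[
t^{(s)} = \tanh\bigl(W_{t,h}\, h_{sent}^{(s)} + W_{t,ctx}\, ctx^{(s)}\bigr)
\]
```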

Stop control

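The stop control takes the hidden states of the previous and current steps, $h_{sent}^{(s-1)}$ and $h_{sent}^{(s)}$, as input and computes the probability of stopping generation: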

(Equation: probability of stop)
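Under the same pattern as the attention weights, the stop probability might be written as follows. This is a hedged reconstruction; the weight names $W_{stop}$, $W_{stop,s-1}$, and $W_{stop,s}$ are assumptions.

```latex
\[
p\bigl(\mathrm{STOP} \mid h_{sent}^{(s-1)}, h_{sent}^{(s)}\bigr)
\;\propto\;
\exp\Bigl\{ W_{stop} \tanh\bigl(W_{stop,s-1}\, h_{sent}^{(s-1)} + W_{stop,s}\, h_{sent}^{(s)}\bigr) \Bigr\}
\]
```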

If it exceeds a predefined threshold, generation stops; otherwise it continues.
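The topic generator and stop control can be sketched together in plain Python. This is a sketch under assumptions: the function names, the two-logit (continue, stop) output layout, the default threshold of 0.5, and the toy weights are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topic_vector(h_s, ctx, W_th, W_tctx):
    """t = tanh(W_th h_s + W_tctx ctx)."""
    return [math.tanh(a + b) for a, b in zip(matvec(W_th, h_s), matvec(W_tctx, ctx))]

def stop_probability(h_prev, h_s, W_prev, W_cur, W_stop):
    """Probability of STOP from the previous and current sentence-LSTM states.

    W_stop is assumed to produce two logits, ordered [continue, stop].
    """
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_prev, h_prev), matvec(W_cur, h_s))]
    logits = matvec(W_stop, hidden)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)

def should_stop(p_stop, threshold=0.5):
    """Stop generating sentences once p_stop exceeds the threshold."""
    return p_stop > threshold

# Toy 2-d states and weights.
I2 = [[1.0, 0.0], [0.0, 1.0]]
h_prev, h_s, ctx = [0.2, -0.1], [0.4, 0.3], [0.1, 0.5]
t = topic_vector(h_s, ctx, I2, I2)
p_stop = stop_probability(h_prev, h_s, I2, I2, [[1.0, 0.0], [0.0, 1.0]])
```

In a full decoder, this check would run once per sentence step, and each topic vector would be handed to the word LSTM before moving on.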

Word LSTM

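The topic vector $t$ and the START token are fed into the word LSTM as its first and second inputs, and the subsequent word sequence is then generated. The resulting hidden state $h_{word}$ is used directly to predict each word: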

(Equation: word prediction)
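The word-level decoding loop can be illustrated with a small greedy decoder. The sketch assumes the word distribution has the form $p(\mathrm{word} \mid h_{word}) \propto \exp(W_{out} h_{word})$; `toy_step` is a deliberately trivial stand-in for one LSTM step, not a real LSTM, and every name and value here is hypothetical.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def word_distribution(h_word, W_out):
    """Softmax over the vocabulary computed from the hidden state."""
    logits = matvec(W_out, h_word)
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(topic, start_emb, embeddings, step, W_out, end_id, max_len):
    """Feed the topic vector, then START, then each chosen word, until END."""
    _, state = step(topic, None)        # first input: topic vector
    h, state = step(start_emb, state)   # second input: START token
    words = []
    for _ in range(max_len):
        probs = word_distribution(h, W_out)
        word = max(range(len(probs)), key=probs.__getitem__)
        if word == end_id:
            break
        words.append(word)
        h, state = step(embeddings[word], state)
    return words

def toy_step(x, state):
    """Stand-in for an LSTM step: squashes the input and ignores the state."""
    return [math.tanh(v) for v in x], state

# Toy vocabulary of 3 words; word 2 acts as the END token.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_out = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
sentence = greedy_decode([0.3, 0.3], [1.0, 0.0], embeddings, toy_step, W_out,
                         end_id=2, max_len=5)
```

Swapping `toy_step` for a real recurrent cell would leave the loop itself unchanged.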

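Loss Function

For the multi-label classification task, the paper first normalizes the tag ground truth by its L1 norm and then computes the cross-entropy against the prediction vector. My own impression is that leaving the ground truth unnormalized, applying a sigmoid to the prediction vector, and then computing the binary cross-entropy would work somewhat better.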
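The two tag-loss options contrasted above can be put side by side in a short sketch. The function names are hypothetical, `normalized_ce` assumes at least one tag is present, and no claim is made here about which option performs better in practice.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_ce(scores, tags):
    """Cross-entropy between softmax(scores) and the L1-normalized 0/1 tag vector."""
    total = sum(tags)  # assumes at least one positive tag
    target = [t / total for t in tags]
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

def sigmoid_bce(scores, tags):
    """Mean binary cross-entropy with an independent sigmoid per tag."""
    loss = 0.0
    for s, t in zip(scores, tags):
        p = 1.0 / (1.0 + math.exp(-s))
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss / len(scores)

# Toy example: 4 tags, two of them present, uninformative scores.
scores = [0.0, 0.0, 0.0, 0.0]
tags = [1, 0, 1, 0]
loss_ce = normalized_ce(scores, tags)
loss_bce = sigmoid_bce(scores, tags)
```

With all-zero scores the first loss equals log 4 (a uniform softmax over four tags) and the second equals log 2 (each sigmoid outputs 0.5).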
In the report generation stage, the loss has two parts: the stop loss and the word loss.

The final loss function is:

(Equation: loss function)
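Combining the tag loss with the stop and word losses, the overall objective might take the following form. This is a reconstruction under assumptions: the weights $\lambda_{tag}$, $\lambda_{sent}$, $\lambda_{word}$, the sentence count $S$, the sentence lengths $T_s$, and the symbols $p_{stop}^{(s)}$, $p_{s,t}$, $w_{s,t}$ are notation introduced for illustration.

```latex
\[
\ell
= \lambda_{tag}\,\ell_{tag}
+ \lambda_{sent} \sum_{s=1}^{S} \ell_{sent}\bigl(p_{stop}^{(s)},\, \mathbb{1}\{s = S\}\bigr)
+ \lambda_{word} \sum_{s=1}^{S} \sum_{t=1}^{T_s} \ell_{word}\bigl(p_{s,t},\, w_{s,t}\bigr)
\]
```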

Finally, a regularization term is also added on the visual and semantic attention matrices $\alpha \in \mathbb{R}^{N \times S}, \beta\in \mathbb{R}^{M \times S}$:

(Equation: attention regularization)
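A regularizer consistent with the matrices $\alpha \in \mathbb{R}^{N \times S}$ and $\beta \in \mathbb{R}^{M \times S}$ defined above could penalize any region or tag whose attention, summed over all $S$ sentence steps, departs from 1. The exact form and the weight $\lambda_{reg}$ are assumptions.

```latex
\[
\ell_{reg}
= \lambda_{reg} \left[
\sum_{n=1}^{N} \Bigl(1 - \sum_{s=1}^{S} \alpha_{n,s}\Bigr)^{2}
+ \sum_{m=1}^{M} \Bigl(1 - \sum_{s=1}^{S} \beta_{m,s}\Bigr)^{2}
\right]
\]
```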

This regularization encourages the model to pay similar attention across different image regions and across different tags.
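A small sketch shows how such a penalty behaves. It assumes the regularizer sums, over regions and tags, the squared gap between 1 and the total attention each one receives across the $S$ sentence steps; `attention_regularizer`, `lam`, and the toy matrices are hypothetical.

```python
def attention_regularizer(alpha, beta, lam=1.0):
    """Penalize regions/tags whose attention summed over steps differs from 1.

    alpha: N rows, one per image region, each holding S per-step weights.
    beta:  M rows, one per tag, each holding S per-step weights.
    """
    visual = sum((1.0 - sum(row)) ** 2 for row in alpha)
    semantic = sum((1.0 - sum(row)) ** 2 for row in beta)
    return lam * (visual + semantic)

# Toy case with S = 2 steps: attention spread evenly over 2 regions,
# but concentrated on the first of 2 tags at both steps.
alpha = [[0.5, 0.5], [0.5, 0.5]]
beta = [[1.0, 1.0], [0.0, 0.0]]
penalty = attention_regularizer(alpha, beta)
```

The evenly spread visual attention contributes nothing, while the tag attention that never visits the second tag is penalized, which matches the stated goal of similar attention across regions and tags.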
