Paper | Palm: Predicting Actions

2023-09-28  与阳光共进早餐

1. Introduction

key idea:

Solution:

  1. input video --> image caption model, action recognition model --> captions and recognized actions;
  2. use {captions, actions} to create a prompt --> large language model --> future action anticipation;
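The two steps above can be sketched as a small pipeline. This is only an illustrative sketch, not the authors' implementation: the captioning model, the action recognition model, and the LLM are passed in as plain callables, and every name here (`build_prompt`, `anticipate`, the prompt line format) is an assumption made for the example.

```python
from typing import Callable, List, Tuple

Action = Tuple[str, str]  # (verb, noun)


def build_prompt(captions: List[str], actions: List[Action]) -> str:
    """Turn per-segment captions and recognized actions into one text prompt.

    The line format is hypothetical; the notes only say that captions and
    actions are combined into a prompt.
    """
    lines = []
    for caption, (verb, noun) in zip(captions, actions):
        lines.append(f"caption: {caption} | action: {verb} {noun}")
    lines.append("future actions:")  # left open for the LLM to complete
    return "\n".join(lines)


def anticipate(
    segments: List[str],
    captioner: Callable[[str], str],
    recognizer: Callable[[str], Action],
    llm: Callable[[str], str],
) -> str:
    """Step 1: video segments -> captions and actions. Step 2: prompt -> LLM."""
    captions = [captioner(seg) for seg in segments]
    actions = [recognizer(seg) for seg in segments]
    return llm(build_prompt(captions, actions))
```

In practice the three callables would wrap real models; with stub functions the pipeline can be exercised end to end without any model weights.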

2. Method

Use natural language to describe past events, then perform reasoning and prediction in the semantic space, leveraging the commonsense knowledge embedded in large language models.

2.1 Task

long-term action anticipation (LTA)

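Given a video clip of about 5 minutes, together with the temporal boundaries of each action in the video, predict a sequence of 20 future actions, each described by a (verb, noun) pair.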
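To make the input and output of the task concrete, here is a minimal sketch of how they could be represented. The class and function names (`Action`, `Segment`, `segments_are_ordered`, `is_valid_prediction`) are assumptions for illustration, not taken from the paper or any benchmark toolkit.

```python
from dataclasses import dataclass
from typing import List

NUM_FUTURE_ACTIONS = 20  # length of the predicted sequence in the task description


@dataclass(frozen=True)
class Action:
    verb: str
    noun: str


@dataclass(frozen=True)
class Segment:
    """One observed action with its temporal boundaries (in seconds)."""
    start_sec: float
    end_sec: float
    action: Action


def segments_are_ordered(segments: List[Segment]) -> bool:
    """Check that each segment is non-empty and segments do not go back in time."""
    each_ok = all(s.start_sec < s.end_sec for s in segments)
    order_ok = all(a.end_sec <= b.start_sec for a, b in zip(segments, segments[1:]))
    return each_ok and order_ok


def is_valid_prediction(pred: List[Action]) -> bool:
    """A prediction is a sequence of exactly 20 non-empty (verb, noun) pairs."""
    return len(pred) == NUM_FUTURE_ACTIONS and all(a.verb and a.noun for a in pred)
```

For example, a clip could be `[Segment(0.0, 2.5, Action("take", "knife")), Segment(2.5, 6.0, Action("cut", "onion"))]`, and the model output would be a list of 20 `Action` values.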

2.2 Prompt Design

The action anticipation task is formulated as a sentence completion task: the LLM is prompted to predict future actions from descriptions of the given past actions.

Prompt template:

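Here, 1) the instruction paragraph (red box) guides the LLM; 2) the blue box contains training examples, where N is the number of past actions; for each past action, both the caption and the action are given to the LLM, and Z denotes the ground-truth future actions; 3) the green box is the prediction query, for which N' past captions and actions are given, with N' > N.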
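The three-part template described above can be sketched as a simple string builder. This is a rough sketch under stated assumptions: the field labels ("Narrations", "Past actions", "Future actions"), the separators, and the function names `format_past` and `assemble_prompt` are hypothetical, and the exact wording used in the paper is not reproduced here.

```python
from typing import List, Sequence, Tuple

Observation = Tuple[str, str]  # (caption, action) for one past action
Example = Tuple[Sequence[Observation], Sequence[str]]  # (N past, Z future actions)


def format_past(past: Sequence[Observation]) -> str:
    """Render the captions and actions of the observed past."""
    captions = ", ".join(caption for caption, _ in past)
    actions = ", ".join(action for _, action in past)
    return f"Narrations: {captions}\nPast actions: {actions}"


def assemble_prompt(
    instruction: str,
    examples: List[Example],
    query: Sequence[Observation],
) -> str:
    """Instruction + training examples + a query left open for completion."""
    n = max((len(past) for past, _ in examples), default=0)
    if examples and len(query) <= n:
        # The notes state that the query uses N' past actions with N' > N.
        raise ValueError("query must contain more past actions than the examples")
    parts = [instruction]
    for past, future in examples:
        parts.append(format_past(past) + "\nFuture actions: " + ", ".join(future))
    # The final block ends without an answer so the LLM completes the sentence.
    parts.append(format_past(query) + "\nFuture actions:")
    return "\n\n".join(parts)
```

Leaving the last "Future actions:" field empty is what turns anticipation into sentence completion: the LLM continues the text in the same format as the examples.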

past actions:

narrations:

prompt selection:

2.3 LLM Inference

3. Result

The method achieved 1st place in the CVPR 2023 competition.
