# GroundVQA
Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos".
[[Project page]](https://dszdsz.cn/GroundVQA/index.html) [[Paper]](https://arxiv.org/abs/2312.06505)
The release is expected in two weeks.
## Abstract
Existing approaches to video understanding, mainly designed for short videos from a third-person perspective, are limited in their applicability in certain fields, such as robotics. In this paper, we delve into **open-ended question-answering (QA) in long, egocentric videos**, which allows individuals or robots to inquire about their own past visual experiences.
This task presents **unique challenges**, including the complexity of temporally grounding queries within extensive video content, the high resource demands for precise data annotation, and the inherent difficulty of evaluating open-ended answers due to their ambiguous nature.
Our proposed approach tackles these challenges by
- **GroundVQA**: integrating query grounding and answering within a unified model to reduce error propagation;
- **EgoTimeQA**: employing large language models for efficient and scalable data synthesis;
- **QaEgo4D**$`_\texttt{close}`$: introducing a close-ended QA task for evaluation, to manage answer ambiguity.
Extensive experiments demonstrate the effectiveness of our method, which also achieves state-of-the-art performance on the QaEgo4D and Ego4D-NLQ benchmarks.
## Directory Structure
```
.
|-- checkpoints          provided model checkpoints
|-- config               configs of models and datasets
|-- data                 processed dataset and video features
|-- eval.py              code for evaluating QaEgo4D performance
|-- eval_nlq.py          code for evaluating NLQ performance
|-- model                code for model, dataset, and training
|-- requirements.txt     list of packages for building the Python environment
|-- run.py               entry code
|-- scripts              scripts for training and evaluation
`-- utils                code for generating OpenQA and CloseQA data from Ego4D narrations
```
## Preparation
Our setup: Ubuntu 20.04, CUDA 12.2, 8x Nvidia A100 (80GB)
- Clone this repo: `git clone https://github.com/Becomebright/GroundVQA.git`
- Create the conda environment: `conda create -n groundvqa python=3.9 -y && conda activate groundvqa`
- Install packages: `pip install -r requirements.txt`
- Compile `nms_1d_cpu` following [here](https://github.com/happyharrycn/actionformer_release/blob/main/INSTALL.md)
- Download the data, video feature, and model checkpoints from [Huggingface](https://huggingface.co/Becomebright/GroundVQA)
  - **[TODO] data:** unzip `data.zip` under the project's root directory.
  - **video features:** merge the split files with `cat egovlp_internvideoa* > egovlp_internvideo.hdf5` and put the result under `data/unified/`.
  - **model checkpoints:** put them under `checkpoints/`.
| Model | Data | Task | NLQ$`_\texttt{v2}`$ | QaEgo4D | Cost$`^{*}`$ |
| ------------------------------- | ------------------------------------------------------------ | ---------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ----------------- |
| $`\text{GroundVQA}_\texttt{S}`$ | QaEgo4D | CloseQA + OpenQA + VLG | [[val_R1_03=11.0]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_S-QaEgo4D-COV-val_R1_03%3D11.0.ckpt) | [[test_ROUGE=29.0]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_S-QaEgo4D-COV-test_ROUGE%3D29.0.ckpt) | 7 |
| $`\text{GroundVQA}_\texttt{S}`$ | QaEgo4D + EgoTimeQA | CloseQA + OpenQA + VLG | [[val_R1_03=23.3]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_S-QaEgo4D_EgoTimeQA-COV-val_R1_03%3D23.3.ckpt) | [[test_ROUGE=30.2]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_S-QaEgo4D_EgoTimeQA-COV-test_ROUGE%3D30.2.ckpt) | 150 |
| $`\text{GroundVQA}_\texttt{B}`$ | QaEgo4D + EgoTimeQA | CloseQA + OpenQA + VLG | [[val_R1_03=25.6]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_B-QaEgo4D_EgoTimeQA-COV-val_R1_03%3D25.6.ckpt) | [[test_ROUGE=30.4]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_B-QaEgo4D_EgoTimeQA-COV-test_ROUGE%3D30.4.ckpt) | 350 |
| $`\text{GroundVQA}_\texttt{B}`$ | NLQ$`_\texttt{v2}`$ + NaQ $\rightarrow$ NLQ$`_\texttt{v2}`$$`^{**}`$ | VLG | [[val_R1_03=29.7]](https://huggingface.co/Becomebright/GroundVQA/blob/main/GroundVQA_B-NLQ_NaQ-finetune_NLQ-VLG-val_R1_03%3D29.7.ckpt) | - | 700 |
\* Training cost in GPU hours.
\*\* Pre-trained on NLQ$`_\texttt{v2}`$ and NaQ, then fine-tuned on NLQ$`_\texttt{v2}`$.
## Training
```bash
# train GroundVQA_S on QaEgo4D
bash scripts/train_groundvqa_small-qaego4d.sh
# train GroundVQA_S on QaEgo4D and EgoTimeQA
bash scripts/train_groundvqa_small-qaego4d_egotimeqa.sh
# train GroundVQA_B on QaEgo4D and EgoTimeQA
bash scripts/train_groundvqa_base-qaego4d_egotimeqa.sh
```
## Evaluation
```bash
# evaluate GroundVQA_S trained on QaEgo4D
bash scripts/evaluate_groundvqa_s-qaego4d.sh
# evaluate GroundVQA_S trained on QaEgo4D and EgoTimeQA
bash scripts/evaluate_groundvqa_s-qaego4d_egotimeqa.sh
# evaluate GroundVQA_B trained on QaEgo4D and EgoTimeQA
bash scripts/evaluate_groundvqa_b-qaego4d_egotimeqa.sh
# evaluate GroundVQA_B pre-trained on NLQv2 and NaQ, then fine-tuned on NLQv2
bash scripts/evaluate_groundvqa_b-nlq_naq.sh
```
## Generate OpenQA data
Download the processed narrations [[em_train_narrations.pkl]](https://huggingface.co/Becomebright/GroundVQA/blob/main/em_train_narrations.pkl) and put it under `utils/generate_open_qa`.
Generate in parallel on multiple GPUs (*e.g.*, 2):
```bash
cd utils/generate_open_qa
# GPU-0
CUDA_VISIBLE_DEVICES=0 python generate.py -start 0 -end 5000
# GPU-1
CUDA_VISIBLE_DEVICES=1 python generate.py -start 5000 -end 11000  # 10777 clips in total
```
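The `-start`/`-end` indices above partition the 10777 clips by hand; for more GPUs, a small helper can compute balanced shard ranges. This is only a sketch of the bookkeeping; the `-start`/`-end` flags of `generate.py` are the only interface it assumes:

```python
def shard_ranges(total, n_shards):
    """Split [0, total) into n_shards contiguous (start, end) ranges
    whose sizes differ by at most one clip."""
    base, rem = divmod(total, n_shards)
    ranges, start = [], 0
    for i in range(n_shards):
        end = start + base + (1 if i < rem else 0)
        ranges.append((start, end))
        start = end
    return ranges

if __name__ == "__main__":
    # Print one launch command per GPU, e.g. for 4 shards of 10777 clips.
    for gpu, (s, e) in enumerate(shard_ranges(10777, 4)):
        print(f"CUDA_VISIBLE_DEVICES={gpu} python generate.py -start {s} -end {e}")
```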
Merge the generation results and normalize the durations of the temporal windows:
```bash
python merge.py
```
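Conceptually, normalizing a temporal window amounts to clamping it to the clip and rescaling it to fractions of the clip duration. The function below is a sketch of that idea only, not the actual logic in `merge.py`:

```python
def normalize_window(start_sec, end_sec, clip_duration):
    """Clamp a [start, end] window (in seconds) to the clip's extent
    and rescale both endpoints to fractions of the clip duration."""
    start = max(0.0, min(start_sec, clip_duration))
    end = max(start, min(end_sec, clip_duration))
    return start / clip_duration, end / clip_duration
```

For example, a window of [-2s, 30s] in a 60-second clip becomes (0.0, 0.5).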
## Generate CloseQA data
```bash
cd utils/generate_close_qa
python generate.py
```
The above script produces wrong (distractor) answers for EgoTimeQA on a single GPU.
You can also run the generation on multiple GPUs, or generate wrong answers for QaEgo4D instead.
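As a rough illustration of the CloseQA format (the actual `generate.py` uses a language model to produce distractors; this sketch merely samples them from other questions' answers, and all names are hypothetical):

```python
import random

def make_close_qa(question, answer, answer_pool, n_wrong=3, seed=0):
    """Build a multiple-choice item: the correct answer plus n_wrong
    distractors sampled from answers belonging to other questions."""
    rng = random.Random(seed)
    distractors = rng.sample([a for a in answer_pool if a != answer], n_wrong)
    options = distractors + [answer]
    rng.shuffle(options)
    return {"question": question, "options": options,
            "label": options.index(answer)}
```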
## Citation
```bibtex
@article{di2023groundvqa,
  title={Grounded Question-Answering in Long Egocentric Videos},
  author={Di, Shangzhe and Xie, Weidi},
  journal={arXiv preprint arXiv:2312.06505},
  year={2023}
}
```
## Acknowledgements
Our code is based on [QaEgo4D](https://github.com/lbaermann/qaego4d), [GroundNLQ](https://github.com/houzhijian/GroundNLQ), and [ActionFormer](https://github.com/happyharrycn/actionformer_release).