
Upload date: 2024-02-25 06:35:23
Uploader: sh-1993
Description: Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos"


# GroundVQA

Official PyTorch code of "Grounded Question-Answering in Long Egocentric Videos".

[[Project page]]( [[Paper]](

The release is expected in two weeks.

## Abstract

Existing approaches to video understanding, mainly designed for short videos from a third-person perspective, are limited in their applicability in certain fields, such as robotics. In this paper, we delve into **open-ended question-answering (QA) in long, egocentric videos**, which allows individuals or robots to inquire about their own past visual experiences. This task presents **unique challenges**, including the complexity of temporally grounding queries within extensive video content, the high resource demands for precise data annotation, and the inherent difficulty of evaluating open-ended answers due to their ambiguous nature. Our proposed approach tackles these challenges by

- **GroundVQA**: integrating query grounding and answering within a unified model to reduce error propagation;
- **EgoTimeQA**: employing large language models for efficient and scalable data synthesis;
- **QaEgo4D**$`_\texttt{close}`$: introducing a close-ended QA task for evaluation, to manage answer ambiguity.

Extensive experiments demonstrate the effectiveness of our method, which also achieves state-of-the-art performance on the QaEgo4D and Ego4D-NLQ benchmarks.

## Directory Structure

```
.
|-- checkpoints       provided model checkpoints
|-- config            configs of models and datasets
|-- data              processed dataset and video features
|--                   code for evaluating QaEgo4D performance
|--                   code for evaluating NLQ performance
|-- model             code for model, dataset, and training
|-- requirements.txt  list of packages for building the Python environment
|--                   entry code
|-- scripts           scripts for training and evaluation
`-- utils             code for generating OpenQA and CloseQA data from Ego4D narrations
```

## Preparation

Our setup: Ubuntu 20.04, CUDA 12.2, 8x Nvidia A100 (80GB)

- Clone this repo: ``
- Create the conda environment: `conda create -n groundvqa python=3.9 -y && conda activate groundvqa`
- Install packages: `pip install -r requirements.txt`
- Compile `nms_1d_cpu` following [here](
- Download the data, video features, and model checkpoints from [Huggingface](
  - **[TODO] data:** unzip `` under the project's root directory.
  - **video feature:** merge the files with `cat egovlp_internvideoa* > egovlp_internvideo.hdf5` and put the result under `data/unified/`
  - **model checkpoints**: put them under `checkpoints/`

| Model | Data | Task | NLQ$`_\texttt{v2}`$ | QaEgo4D | Cost$`^{*}`$ |
| --- | --- | --- | --- | --- | --- |
| $`\text{GroundVQA}_\texttt{S}`$ | QaEgo4D | CloseQA + OpenQA + VLG | [[val_R1_03=11.0]]( | [[test_ROUGE=29.0]]( | 7 |
| $`\text{GroundVQA}_\texttt{S}`$ | QaEgo4D + EgoTimeQA | CloseQA + OpenQA + VLG | [[val_R1_03=23.3]]( | [[test_ROUGE=30.2]]( | 150 |
| $`\text{GroundVQA}_\texttt{B}`$ | QaEgo4D + EgoTimeQA | CloseQA + OpenQA + VLG | [[val_R1_03=25.6]]( | [[test_ROUGE=30.4]]( | 350 |
| $`\text{GroundVQA}_\texttt{B}`$ | NLQ$`_\texttt{v2}`$ + NaQ $\rightarrow$ NLQ$`_\texttt{v2}`$$`^{**}`$ | VLG | [[val_R1_03=29.7]]( | - | 700 |

\* Training cost, measured in GPU hours.

\*\* Pre-trained on NLQ$`_\texttt{v2}`$ and NaQ, then further fine-tuned on NLQ$`_\texttt{v2}`$.

## Training

```bash
# train GroundVQA_S on QaEgo4D
bash scripts/

# train GroundVQA_S on QaEgo4D and EgoTimeQA
bash scripts/

# train GroundVQA_B on QaEgo4D and EgoTimeQA
bash scripts/
```

## Evaluation

```bash
# evaluate GroundVQA_S trained on QaEgo4D
bash scripts/

# evaluate GroundVQA_S trained on QaEgo4D and EgoTimeQA
bash scripts/

# evaluate GroundVQA_B trained on QaEgo4D and EgoTimeQA
bash scripts/

# evaluate GroundVQA_B pre-trained on NLQv2 and NaQ and further fine-tuned on NLQv2
bash scripts/
```

## Generate OpenQA data

Download the processed narrations [[em_train_narrations.pkl]](

Put the file under `utils/generate_open_qa`.

Generate in parallel on multiple GPUs (*e.g.*, 2):

```bash
cd utils/generate_open_qa

# GPU-0
CUDA_VISIBLE_DEVICES=0 python -start 0 -end 5000

# GPU-1
CUDA_VISIBLE_DEVICES=1 python -start 5000 -end 11000  # 10777 clips in total
```

Merge the generation results and normalize the duration of the temporal windows:

```bash
python
```

## Generate CloseQA data

```bash
cd utils/generate_close_qa
python
```

The script above produces wrong answers for EgoTimeQA using a single GPU. You can also run the generation on multiple GPUs, or generate wrong answers for QaEgo4D.

## Citation

```latex
@article{di2023groundvqa,
  title={Grounded Question-Answering in Long Egocentric Videos},
  author={Di, Shangzhe and Xie, Weidi},
  journal={arXiv preprint arXiv:2312.06505},
  year={2023}
}
```

## Acknowledgements

Our code is based on [QaEgo4D](, [GroundNLQ](, and [ActionFormer](
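
Returning to the multi-GPU step under "Generate OpenQA data": the `-start`/`-end` ranges there are picked by hand for 2 GPUs. The sketch below shows one way to compute such ranges for any number of GPUs. It is an illustration only, not part of the repository: `shard_ranges` is a hypothetical helper, and it assumes `-end` is exclusive, which the README's `0`/`5000`/`11000` boundaries suggest but do not confirm. The generation script's file name is not given here, so the printed lines leave it out.

```python
def shard_ranges(total, num_shards):
    """Split [0, total) into num_shards contiguous, near-equal [start, end) ranges."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(total, num_shards)
    ranges = []
    start = 0
    for i in range(num_shards):
        # The first `extra` shards take one additional clip each.
        end = start + base + (1 if i < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


if __name__ == "__main__":
    TOTAL_CLIPS = 10777  # clip count quoted in the README
    NUM_GPUS = 2         # assumption: one generation process per GPU
    for gpu, (start, end) in enumerate(shard_ranges(TOTAL_CLIPS, NUM_GPUS)):
        # Prints the device selector and flags; the script name must be filled in by the user.
        print(f"GPU-{gpu}: CUDA_VISIBLE_DEVICES={gpu} -start {start} -end {end}")
```

Splitting evenly keeps the per-GPU workload balanced. The hand-picked ranges in the README (5000 and 5777 clips) work just as well as long as every clip index falls into exactly one range.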


