RevDet

Category: Big Data
Development tool: Python
File size: 157KB
Downloads: 0
Upload date: 2021-10-15 19:19:24
Uploader: sh-1993
Description: Robust and Memory Efficient Event Detection and Tracking in Large News Feeds

File list:
algorithm.py (6809, 2021-10-16)
evaluate_algorithm.py (2927, 2021-10-16)
images (0, 2021-10-16)
images\activeeventchains.png (73002, 2021-10-16)
images\dataset_formation.png (35893, 2021-10-16)
images\evaluation_procedure.png (19181, 2021-10-16)
in_memory_clustering.py (3560, 2021-10-16)
inmemory.dat (40282, 2021-10-16)
movefiles.py (528, 2021-10-16)
plot_results.py (2629, 2021-10-16)
prepare_data.py (1934, 2021-10-16)
remove_redundancy.py (8159, 2021-10-16)
revdet.dat (119735, 2021-10-16)
revdet_profiling.py (6983, 2021-10-16)
run_revdet.py (3943, 2021-10-16)

# RevDet

RevDet is an algorithm for robust and efficient event detection and tracking in large news feeds. It adopts an iterative clustering approach for tracking events. Even though many events continue to develop for days or even months, RevDet detects and tracks them while using only a constant amount of main memory.

It takes three inputs: news article data as per-day files (sorted by ascending event timestamp, with two required columns: a list of locations and a heading), a window size, and a threshold for the BIRCH clustering algorithm. It then forms event chains and writes each chain to a separate file.

The accompanying figure shows the per-day active event chains over a year formed by RevDet versus the ground-truth chains. To form these chains, RevDet used only the memory required to store eight days of data.
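The constant-memory claim follows from keeping only a fixed window of per-day data in memory. The sketch below is not the authors' implementation; it only illustrates, under that assumption, how a bounded window keeps the number of days held in memory from exceeding the window size. The function name `sliding_windows` and the placeholder articles are hypothetical.

```python
from collections import deque


def sliding_windows(per_day_batches, window_size=8):
    """Yield the day batches currently held in memory.

    At most `window_size` batches are retained; the oldest day is
    discarded automatically once the window is full.
    """
    window = deque(maxlen=window_size)
    for batch in per_day_batches:
        window.append(batch)
        # A clustering step over the windowed data would run here.
        yield list(window)


# Hypothetical input: 20 days, one placeholder article per day.
days = ([f"article-day-{d}"] for d in range(20))
held = [len(w) for w in sliding_windows(days, window_size=8)]
print(max(held))  # 8: never more than eight days are in memory
```

Because the input is consumed lazily from a generator, the full year of data never needs to be loaded at once; only the current window is materialized.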
## Dataset

The event chain algorithm has been run on the w2e_gkg dataset, prepared as shown in the accompanying dataset-formation figure.
Dataset Link: https://drive.google.com/file/d/1Xc_9FJkaYsCcNPMatlHvHmyGr7NJAPSN/view?usp=sharing

## Running RevDet
First, the w2e_gkg dataset must be pre-processed to remove redundant (duplicate) news articles. It must then be transformed into per-day files, which serve as the input to the algorithm. Both steps are performed by running `prepare_data.py`:

```bash
python3 prepare_data.py
```

You can now run `run_revdet.py` to run RevDet on the prepared dataset and evaluate the formed chains against the ground-truth chains. The plot of precision, recall, and F-measure for different window sizes can be generated with:

```bash
python3 run_revdet.py --plotgraph
```

A plot of the macro comparison between the ground-truth chains and the formed chains can be generated with:

```bash
python3 run_revdet.py --plotactivechains
```

## Other options for `run_revdet.py`

### Setting input and output directories

- `--inputchains`: Directory for redundancy-removed input event chains. Default is `redundancy_removed_chains/`.
- `--outputchains`: Directory for output event chains. Default is `output_chains/`.
- `--perdaydata`: Directory for per-day data. Default is `per_day_data/`.

### Algorithm Options

- `--birch_thresh`: Threshold for the BIRCH algorithm. Default is 2.3.
- `--window_size`: Window size for the RevDet algorithm. Default is 8.

## Reference

Azeemi, A. H., Sohail, M. H., Zubair, T., Maqbool, M., Younas, I., & Shafiq, O. (2021). RevDet: Robust and Memory Efficient Event Detection and Tracking in Large News Feeds. International Workshop on Advanced Analytics and Learning on Temporal Data @ ECML PKDD, 2021 (In-Press). Preprint: [arXiv:2103.04390](https://arxiv.org/abs/2103.04390).
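As an illustration of the `run_revdet.py` options documented earlier, the sketch below shows one way such flags could be declared with `argparse`. The flag names and default values come from this README; the argument types, the use of `store_true` for the plotting flags, and the function name `build_parser` are assumptions, and the actual script may be structured differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(description="Run RevDet (illustrative sketch)")
    # Input and output directories (defaults taken from the README).
    parser.add_argument("--inputchains", default="redundancy_removed_chains/")
    parser.add_argument("--outputchains", default="output_chains/")
    parser.add_argument("--perdaydata", default="per_day_data/")
    # Algorithm options; float/int types are an assumption.
    parser.add_argument("--birch_thresh", type=float, default=2.3)
    parser.add_argument("--window_size", type=int, default=8)
    # Plotting flags; treating them as boolean switches is an assumption.
    parser.add_argument("--plotgraph", action="store_true")
    parser.add_argument("--plotactivechains", action="store_true")
    return parser


# Example: override the window size and request the precision/recall plot.
args = build_parser().parse_args(["--window_size", "4", "--plotgraph"])
print(args.window_size, args.birch_thresh, args.plotgraph)  # 4 2.3 True
```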
