NewsCluster
Category: Data mining / data warehouse
Development tool: Python
File size: 20KB
Downloads: 0
Upload date: 2020-01-10 07:25:57
Uploader: sh-1993
Description: NewsCluster, news event mining. It aggregates publicly available news data, groups news articles that describe the same event, and generates information about each event.
File list:
Cluster.py (2820, 2020-01-09)
DataLoader.py (4115, 2020-01-09)
EventExtractor.py (2871, 2020-01-09)
Extractor (0, 2020-01-09)
Extractor\ToyExtractor.py (1188, 2020-01-09)
Extractor\__init__.py (72, 2020-01-09)
Extractor\__pycache__ (0, 2020-01-09)
Extractor\__pycache__\ToyExtractor.cpython-36.pyc (982, 2020-01-09)
Extractor\__pycache__\__init__.cpython-36.pyc (133, 2020-01-09)
Extractor\config.py (111, 2020-01-09)
Text2Vector.py (3055, 2020-01-09)
__init__.py (72, 2020-01-09)
__pycache__ (0, 2020-01-09)
__pycache__\Cluster.cpython-36.pyc (2574, 2020-01-09)
__pycache__\DataLoader.cpython-36.pyc (3284, 2020-01-09)
__pycache__\EventExtractor.cpython-36.pyc (2570, 2020-01-09)
__pycache__\Text2Vector.cpython-36.pyc (3075, 2020-01-09)
__pycache__\config.cpython-36.pyc (998, 2020-01-09)
config.py (944, 2020-01-09)
data (0, 2020-01-09)
model (0, 2020-01-09)
run.py (802, 2020-01-09)
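# Introduction
News event mining. NewsCluster aggregates publicly available news data, groups news articles that describe the same event, and generates information about each event.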
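
To make the idea of "grouping articles that describe the same event" concrete, here is a minimal sketch using only the standard library. It is not the algorithm used by Text2Vector.py or Cluster.py, whose contents are not shown here. The bag-of-words vectors, the cosine similarity measure, the greedy single-pass clustering, the `0.5` threshold, and all function names are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a bag-of-words count vector (hypothetical stand-in for Text2Vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_titles(titles, threshold=0.5):
    """Greedy single-pass clustering: a title joins the first cluster
    whose seed vector is at least `threshold` similar, else starts a new one."""
    clusters = []  # each entry: (seed_vector, [indices of member titles])
    for index, title in enumerate(titles):
        vector = to_vector(title)
        for seed, members in clusters:
            if cosine(seed, vector) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((vector, [index]))
    return [members for _, members in clusters]


# Made-up headlines, for illustration only.
titles = [
    "Senate votes on new budget bill",
    "Budget bill passes Senate vote",
    "Storm hits the coast overnight",
]
groups = cluster_titles(titles)  # the first two titles share a cluster
```

A real pipeline would likely use stronger text representations and a proper clustering algorithm; the sketch only shows the vectorize-then-group shape of the task.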
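# Collaboration guidelines
Please keep the four files DataLoader.py, Text2Vector.py, Cluster.py, and EventExtractor.py as concise as possible. Do not implement concrete algorithms in these files; implement them elsewhere, then import and call them from these files. For example, EventExtractor.py extracts event information from the clustering results. It currently has one implementation, ToyExtractor, whose code lives in the Extractor folder; EventExtractor.py only calls it.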
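
The delegation pattern described above could look roughly like the following sketch. The contents of Extractor/ToyExtractor.py are not shown here, so the `extract` method name, the toy rule, and the shape of the cluster dictionary are all assumptions. In the repository, ToyExtractor would be imported from the Extractor package rather than defined inline.

```python
class ToyExtractor:
    """Hypothetical stand-in for the class in Extractor/ToyExtractor.py."""

    def extract(self, news_items):
        # Toy rule (an assumption): the first article's title becomes the event title.
        return {
            "title": news_items[0]["title"],
            "newsid": [item["news_id"] for item in news_items],
        }


class EventExtractor:
    """Thin wrapper: contains no algorithm, only delegates to a pluggable extractor."""

    def __init__(self, extractor=None):
        self.extractor = extractor or ToyExtractor()

    def extract(self, clusters):
        # `clusters` maps a cluster label to the list of news items in that cluster.
        return {label: self.extractor.extract(items) for label, items in clusters.items()}


# Usage with made-up data.
clusters = {"0": [{"news_id": 1, "title": "A"}, {"news_id": 2, "title": "B"}]}
events = EventExtractor().extract(clusters)
```

Swapping in a better algorithm then means adding a new class under Extractor and passing it to the wrapper, without editing EventExtractor.py itself.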
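# Database
There are currently two tables: the raw news table (news) stores the original news records, and the event table (event) stores the event information produced by cluster analysis.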
### Raw news table (news)
Table structure:
| Field | Type | Null | Key | Default | Extra |
| --- | --- | --- | --- | --- | --- |
| news_id | int(10) unsigned | NO | PRI | | auto_increment |
| source | varchar(1000) | YES | | | |
| author | varchar(1000) | YES | | | |
| title | varchar(1000) | YES | | | |
| queryKeyWord | varchar(100) | YES | | | |
| description | varchar(2000) | YES | | | |
| url | varchar(1000) | YES | | | |
| urlToImage | varchar(1000) | YES | | | |
| publishedAt | datetime | YES | | | |
| content | text | YES | | | |
Field descriptions:
| Field | Description | Example |
| --- | --- | --- |
| news_id | Auto-increment primary key | 106511 |
| source | The display name of the source this article came from | "The New York Times" |
| author | The author of the article | "Michael Levenson" |
| title | The headline or title of the article | |
| queryKeyWord | Keywords or phrases to search for in the article title and body | "Donald Trump" |
| description | A description or snippet from the article | |
| url | The direct URL to the article | |
| urlToImage | The URL to a relevant image for the article | |
| publishedAt | The date and time the article was published | 2019-12-17 11:26:36 |
| content | The unformatted content of the article. This is truncated to 260 chars for Developer plan users | |
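
For local experiments without a MySQL server, the news table can be approximated in SQLite. This is a sketch under stated assumptions: the column names follow the table structure above, but the SQLite types are stand-ins for the MySQL types (for example, `publishedAt` is stored as text). The sample values come from the example column, and combining them into one row is for illustration only.

```python
import sqlite3

# SQLite stand-in for the MySQL `news` table (types are approximations).
NEWS_DDL = """
CREATE TABLE news (
    news_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    source       TEXT,
    author       TEXT,
    title        TEXT,
    queryKeyWord TEXT,
    description  TEXT,
    url          TEXT,
    urlToImage   TEXT,
    publishedAt  TEXT,  -- stored as 'YYYY-MM-DD HH:MM:SS'
    content      TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allow access to columns by name
conn.execute(NEWS_DDL)

# Parameterized insert; columns that are omitted default to NULL.
conn.execute(
    "INSERT INTO news (source, author, queryKeyWord, publishedAt) VALUES (?, ?, ?, ?)",
    ("The New York Times", "Michael Levenson", "Donald Trump", "2019-12-17 11:26:36"),
)
row = conn.execute("SELECT * FROM news").fetchone()
```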
### Event table (event)
Table structure:
| Field | Type | Null | Key | Default | Extra |
| --- | --- | --- | --- | --- | --- |
| label | varchar(20) | NO | PRI | | |
| newsid | varchar(2000) | NO | | | |
| title | varchar(1000) | YES | | | |
| keyWord | varchar(100) | YES | | | |
| time | datetime | YES | | | |
| abstract | varchar(2000) | YES | | | |
| content | text | YES | | | |
Field descriptions:
| Field | Description | Example |
| --- | --- | --- |
| label | Cluster label / event ID | |
| newsid | The news_id values belonging to this event, separated by spaces | "106511 106522" |
| title | Event title | |
| keyWord | Event keywords; multiple keywords are separated by \| | "keyword1 \| keyword2" |
| time | Time the event occurred | 2019-12-17 11:26:36 |
| abstract | Event summary | |
| content | Detailed description of the event | |
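
Because `newsid` and `keyWord` pack lists into single string columns, code that reads or writes the event table needs to encode and decode them. The helpers below are a minimal sketch: the `Event` class and both function names are hypothetical, and joining keywords with `" | "` (spaces around the bar) is an assumption based on the example value above.

```python
from dataclasses import dataclass
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


@dataclass
class Event:
    """Hypothetical in-memory form of one row of the event table."""
    label: str
    news_ids: list
    keywords: list
    time: datetime


def event_to_row(event):
    """Encode list fields into the string formats the event table expects."""
    return {
        "label": event.label,
        "newsid": " ".join(str(i) for i in event.news_ids),  # space-separated ids
        "keyWord": " | ".join(event.keywords),               # bar-separated keywords
        "time": event.time.strftime(TIME_FORMAT),
    }


def row_to_event(row):
    """Decode a row dictionary back into an Event."""
    return Event(
        label=row["label"],
        news_ids=[int(i) for i in row["newsid"].split()],
        keywords=[k.strip() for k in row["keyWord"].split("|") if k.strip()],
        time=datetime.strptime(row["time"], TIME_FORMAT),
    )


event = Event("0", [106511, 106522], ["keyword1", "keyword2"], datetime(2019, 12, 17, 11, 26, 36))
row = event_to_row(event)
restored = row_to_event(row)
```

Note that `newsid` is `varchar(2000)` and `keyWord` is `varchar(100)`, so a very large cluster or a long keyword list may not fit; this sketch does not check lengths.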