archive_news_cc

Category: Virtual/Augmented Reality - VR/AR
Development tool: HTML
File size: 16002KB
Downloads: 0
Upload date: 2023-01-18 19:40:55
Uploader: sh-1993
Description: Closed Caption Transcripts of News Videos from archive.org, 2014--2022

File list:
appveyor.yml (511, 2023-06-12)
data (0, 2023-06-12)
data\archive-out.csv (2180950, 2023-06-12)
data\html (0, 2023-06-12)
data\html\FOXNEWS_20130203_050000_Justice_With_Judge_Jeanine.html (209824, 2023-06-12)
data\html\FOXNEWS_20130204_220000_The_Five.html (203452, 2023-06-12)
data\html\FOXNEWS_20130206_140000_Americas_Newsroom.html (332106, 2023-06-12)
data\html\FOXNEWS_20130209_090000_The_OReilly_Factor.html (202241, 2023-06-12)
data\html\FOXNEWS_20130210_160000_Americas_News_Headquarters.html (205362, 2023-06-12)
data\html\FOXNEWS_20130212_210000_Your_World_With_Neil_Cavuto.html (213734, 2023-06-12)
data\html\FOXNEWS_20130213_100000_FOX_and_Friends_First.html (204064, 2023-06-12)
data\html\FOXNEWS_20130216_030000_Greta_Van_Susteren.html (209534, 2023-06-12)
data\html\FOXNEWS_20130222_230000_Special_Report_With_Bret_Baier.html (211968, 2023-06-12)
data\html\FOXNEWS_20130415_050000_Stossel.html (193474, 2023-06-12)
data\html\FOXNEWS_20130430_220000_Special_Report_With_Bret_Baier.html (213826, 2023-06-12)
data\html\FOXNEWS_20130519_200000_Americas_News_Headquarters.html (346896, 2023-06-12)
data\html\FOXNEWS_20130616_070000_Huckabee.html (196573, 2023-06-12)
data\html\FOXNEWS_20130709_030000_The_OReilly_Factor.html (205631, 2023-06-12)
data\html\FOXNEWS_20130715_130000_Americas_Newsroom.html (328724, 2023-06-12)
data\html\FOXNEWS_20130715_220000_Special_Report_With_Bret_Baier.html (208670, 2023-06-12)
data\html\FOXNEWS_20130716_090000_FOX_and_Friends_First.html (203338, 2023-06-12)
data\html\FOXNEWS_20130812_040000_Fox_News_ReportFood_Stamp_Binge.html (202283, 2023-06-12)
data\html\FOXNEWS_20130818_060000_Red_Eye.html (159130, 2023-06-12)
data\html\FOXNEWS_20130819_100000_FOX_and_Friends.html (372028, 2023-06-12)
data\html\FOXNEWS_20130822_030000_The_OReilly_Factor.html (205221, 2023-06-12)
data\html\FOXNEWS_20130823_020000_Greta_Van_Susteren.html (213615, 2023-06-12)
data\html\FOXNEWS_20130824_200000_Cavuto_on_Business.html (136510, 2023-06-12)
data\html\FOXNEWS_20130826_030000_Huckabee.html (202987, 2023-06-12)
data\html\FOXNEWS_20130914_153000_Cashin_In.html (131162, 2023-06-12)
data\meta (0, 2023-06-12)
data\meta\FOXNEWS_20130203_050000_Justice_With_Judge_Jeanine_meta.xml (10560, 2023-06-12)
data\meta\FOXNEWS_20130204_220000_The_Five_meta.xml (10022, 2023-06-12)
data\meta\FOXNEWS_20130206_140000_Americas_Newsroom_meta.xml (16080, 2023-06-12)
data\meta\FOXNEWS_20130209_090000_The_OReilly_Factor_meta.xml (9865, 2023-06-12)
data\meta\FOXNEWS_20130210_160000_Americas_News_Headquarters_meta.xml (9585, 2023-06-12)
data\meta\FOXNEWS_20130212_210000_Your_World_With_Neil_Cavuto_meta.xml (10820, 2023-06-12)
... ...

## Closed Captions of News Videos from Archive.org

The repository provides scripts for downloading the data, and links to two datasets that were built using the scripts:

* [Scripts](https://github.com/notnews/archive_news_cc#downloading-the-data-from-archiveorg)
* [Data](https://github.com/notnews/archive_news_cc#data)

-------------

### Downloading the Data from Archive.org

Download closed caption transcripts of nearly 1.3M news shows from [http://archive.org](http://archive.org). There are three steps to downloading the transcripts:

1. We start by searching [https://archive.org/advancedsearch.php](https://archive.org/advancedsearch.php) with the query `collection:"tvarchive"`. This gets us a unique identifier for each of the news shows. An identifier is a simple string that combines the channel name, show name, time, and date. The current final list of identifiers (2009--Nov. 2017) is posted [here](data/search.csv).

2. Next, we use the identifier to build a URL where the metadata file and the HTML file with the closed captions are posted. The general base URL is http://archive.org/download followed by the identifier. For instance, for the identifier `CSPAN_20090604_230000`, we go to http://archive.org/download/CSPAN_20090604_230000. From http://archive.org/download/CSPAN_20090604_230000/CSPAN_20090604_230000_meta.xml, we read the link http://archive.org/details/CSPAN_20090604_230000, from which we get the text of the HTML file. We also store the metadata from the meta XML file.

3. The third script parses the downloaded metadata and HTML closed caption files and creates a CSV with the transcripts along with the metadata.

#### Scripts

1. **Get Show Identifiers** - [Get Identifiers For Each Show (Channel, Show, Date, Time)](scripts/get_news_identifiers.py) - Produces [data/search.csv](data/search.csv)
2. **Download Metadata and HTML Files** - [Download the Metadata and HTML Files](scripts/scrape_archive_org.py) - Saves the metadata and HTML files to the two folders specified by `--meta` and `--html` respectively. The default folder names are `meta` and `html`.

3. **Parse Metadata and HTML Files** - [Parses Metadata and HTML Files and Saves Them to a CSV](scripts/parse_archive.py) - Produces a CSV. [Here's an example](data/archive-out.csv)

#### Running the Scripts

1. Get all TV Archive identifiers from archive.org:

   ```
   python get_news_identifiers.py -o ../data/search.csv
   ```

2. Download the metadata and HTML files for all the shows in the [sample input file](data/search-test.csv):

   ```
   python scrape_archive_org.py ../data/search-test.csv
   ```

   By default, this will create two directories, `meta` and `html`, in the same folder as the script. We have included the first [25 metadata files](data/meta/) and the first [25 HTML files](data/html/). You can change the directory for the metadata with the `--meta` flag and the directory for the HTML files with the `--html` flag. For instance,

   ```
   python scrape_archive_org.py --meta meta-foxnews --html html-foxnews ../data/search-test.csv
   ```

   Use the `-c/--compress` option to store and parse the downloaded files in compressed (gzip) format.

3. Parse and extract the metadata fields and text from the [sample metadata](data/meta) and [HTML files](data/html):

   ```
   python parse_archive.py ../data/search-test.csv
   ```

   This produces a [sample output file](data/archive-out.csv).

### Data

The data are hosted on [Harvard Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OAJJHI).

**Dataset Summary:**

1. **500k Dataset from 2014:**
   - CSV: `archive-cc-2014.csv.xza*` (2.7 GB, split into 2GB files)
   - HTML: `html-2014.7za*` (10.4 GB, split into 2GB files)

2. **860k Dataset from 2017:**
   - CSV: `archive-cc-2017.csv.gza*` (10.6 GB, split into 2GB files)
   - HTML: `html-2017.tar.gza*` (20.2 GB, split into 2GB files)
   - Meta: `meta-2017.tar.gza*` (2.6 GB, split into 2GB files)

3. **917k Dataset from 2022:**
   - CSV: `archive-cc-2022.csv.gza*` (12.6 GB, split into 2GB files)
   - HTML: `html-2022.tar.gza*` (41.1 GB, split into 2GB files)
   - Meta: `meta-2022.tar.gz` (2.1 GB)

4. **179k Dataset from 2023:**
   - CSV: `archive-cc-2023.csv.gz` (1.7 GB)
   - HTML: `html-2023.tar.gza*` (7.3 GB, split into 2GB files)
   - Meta: `meta-2023.tar.gz` (317 MB)

Please note that the file sizes and splitting information above are approximate.

### License

We are releasing the scripts under the [MIT License](https://opensource.org/licenses/MIT).

### Suggested Citation

Please credit the Internet Archive for the data. If you want to refer to this particular corpus so that the research is reproducible, you can cite it as:

```
archive.org TV News Closed Caption Corpus. Laohaprapanon, Suriyan and Gaurav Sood. 2017. https://github.com/notnews/archive_news_cc/
```
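The URL construction behind the first two download steps can be sketched in a few lines of Python. This is an illustration rather than the repository's actual scripts; the advanced-search parameters follow archive.org's public JSON endpoint, and the `rows`/`page` values are placeholders (the real crawl pages through all results).

```python
import urllib.parse

# Step 1: the advanced-search query that returns identifiers for shows
# in the TV archive collection.
params = urllib.parse.urlencode([
    ("q", 'collection:"tvarchive"'),
    ("fl[]", "identifier"),
    ("rows", "10"),   # illustrative page size
    ("page", "1"),
    ("output", "json"),
])
search_url = "https://archive.org/advancedsearch.php?" + params

# Step 2: from an identifier, derive the download page, the metadata XML,
# and the details page that serves the closed-caption HTML.
def archive_urls(identifier):
    base = "http://archive.org/download/" + identifier
    return {
        "download": base,
        "meta_xml": "{0}/{1}_meta.xml".format(base, identifier),
        "details": "http://archive.org/details/" + identifier,
    }

urls = archive_urls("CSPAN_20090604_230000")
print(search_url)
print(urls["meta_xml"])
```

For `CSPAN_20090604_230000`, this yields the same metadata URL quoted in the steps above.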
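The parsing step stores the fields from each `_meta.xml` file alongside the transcript text. A minimal sketch of the XML side, using a made-up record (the tag names and values below are illustrative; real archive.org metadata files carry many more fields):

```python
import xml.etree.ElementTree as ET

# A tiny, hypothetical stand-in for an archive.org *_meta.xml file.
meta_xml = """<?xml version="1.0"?>
<metadata>
  <identifier>CSPAN_20090604_230000</identifier>
  <title>Example Show Title</title>
  <mediatype>movies</mediatype>
</metadata>"""

# Flatten the child elements into a dict of field name -> value,
# which is the shape a CSV row would be built from.
root = ET.fromstring(meta_xml)
record = {child.tag: child.text for child in root}
print(record["identifier"])  # CSPAN_20090604_230000
```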
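The `a*` suffixes in the dataset summary indicate archives split into 2GB parts. Assuming the conventional `split` naming (`...gzaa`, `...gzab`, ...), the parts concatenate back into a single file; the stand-in filenames below are only for demonstrating the pattern.

```shell
# Real usage would look like:
#   cat html-2017.tar.gza* > html-2017.tar.gz
#   tar -xzf html-2017.tar.gz
# The same concatenation pattern, shown on stand-in part files:
printf 'first'  > demo.tar.gzaa
printf 'second' > demo.tar.gzab
cat demo.tar.gza* > demo.tar.gz
cat demo.tar.gz
```

The glob expands in lexicographic order, which matches the order in which `split` names its output parts, so the bytes are reassembled in sequence.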
