nrk-sapmi-crawler

所属分类:搜索引擎
开发工具:JavaScript
文件大小:0KB
下载次数:0
上传日期:2023-09-06 04:43:10
上 传 者sh-1993
说明:  Crawler for NRK Sapmi新闻公告,这将是Sami关键词列表的基础,也是Sami.内容的示例搜索引擎。,
(Crawler for NRK Sapmi news bulletins that will be the basis for Sami stopword lists and an example search engine for content in Sami.,)

文件列表:
nrk-sapmi-crawler-trunk/ (0, 2023-11-12)
nrk-sapmi-crawler-trunk/LICENSE (1067, 2023-11-12)
nrk-sapmi-crawler-trunk/index.js (5788, 2023-11-12)
nrk-sapmi-crawler-trunk/package-lock.json (346558, 2023-11-12)
nrk-sapmi-crawler-trunk/package.json (1098, 2023-11-12)
nrk-sapmi-crawler-trunk/test/ (0, 2023-11-12)
nrk-sapmi-crawler-trunk/test/lib/ (0, 2023-11-12)
nrk-sapmi-crawler-trunk/test/lib/content.southSami.json (1310, 2023-11-12)
nrk-sapmi-crawler-trunk/test/lib/content.southSami.json.backup (1306, 2023-11-12)
nrk-sapmi-crawler-trunk/test/lib/list.southSami.json (60216, 2023-11-12)
nrk-sapmi-crawler-trunk/test/lib/list.southSami.json.backup (60218, 2023-11-12)
nrk-sapmi-crawler-trunk/test/testContentCrawler.js (1899, 2023-11-12)
nrk-sapmi-crawler-trunk/test/testIdCrawler.js (1232, 2023-11-12)

# nrk-sapmi-crawler [![NPM version](http://img.shields.io/npm/v/nrk-sapmi-crawler.svg?style=flat)](https://npmjs.org/package/nrk-sapmi-crawler) [![NPM downloads](http://img.shields.io/npm/dm/nrk-sapmi-crawler.svg?style=flat)](https://npmjs.org/package/nrk-sapmi-crawler) [![tests](https://github.com/eklem/nrk-sapmi-crawler/actions/workflows/tests.yml/badge.svg)](https://github.com/eklem/nrk-sapmi-crawler/actions/workflows/tests.yml) [![MIT License](http://img.shields.io/badge/license-MIT-blue.svg?style=flat)](LICENSE) Crawler for [NRK Sapmi news bulletins](https://www.nrk.no/sapmi/samegillii/) that will be the basis for [stopword-sami](https://github.com/eklem/stopword-sami) and an example search engine for content in Sami. Crawl news bulletins in [Northern Sami](https://www.nrk.no/sapmi/o__asat---davvisamegillii-1.13572949), [Lule Sami](https://www.nrk.no/sapmi/adasa---julevsabmaj-1.13572946) and [South Sami](https://www.nrk.no/sapmi/saernie---aarjelsaemien-1.13572943). Code is not the cleanest one, but it works well enough, and hopefully will without too much maintenance for the next copule of years. If you just want the datasets, install [stopword-sami](https://github.com/eklem/stopword-sami) modul. ## Getting a list of article IDs to crawl ```javaScript import { getList, crawlHeader, readIfExists, calculateIdListAndWrite } from '../index.js' const southSami = { id: '1.13572943', languageName: 'arjelsaemien', url: 'https://www.nrk.no/serum/api/content/json/1.13572943?v=2&limit=1000&context=items', file: './lib/list.southSami.json' } // Bringing it all together, fetching URL and reading file, and if new content -> merging arrays and writing Promise.all([getList(southSami.url, crawlHeader), readIfExists(southSami.file).catch(e => e)]) .then((data) => { calculateListAndWrite(data, southSami.id, southSami.file, southSami.languageName) }) .catch(function (err) { console.log('Error: ' + err) }) ``` ## To change user-agent for the crawler ```javaScript crawlHeader['user-agent'] = 'name of crawler/version - comment (i.e. contact-info)' ``` ## Getting the content from a list of IDs ```javaScript import { crawlContentAndWrite } from 'nrk-sapmi-crawler' const appropriateTime = 2000 const southSami = { idFile: './datasets/list.southSami.json', contentFile: './datasets/content.southSami.json' } async function crawl () { await crawlContentAndWrite(southSami.idFile, southSami.contentFile, appropriateTime) } crawl() ```

近期下载者

相关文件


收藏者