cryptodata

Category: Graphics & Imaging
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-11-03 15:09:15
Uploader: sh-1993
Description: Your application must help the client with the following tasks: continuously collect data from a cryptocurrency news feed; continuously process the collected data and provide analytics; dynamically visualize the provided analytics with the appropriate graphs.

File list:
DEV_NOTES.md (1368, 2023-12-21)
archives/ (0, 2023-12-21)
archives/cryptofeed/ (0, 2023-12-21)
archives/cryptofeed/cryptofeed/ (0, 2023-12-21)
archives/cryptofeed/cryptofeed/__init__.py (0, 2023-12-21)
archives/cryptofeed/cryptofeed/items.py (343, 2023-12-21)
archives/cryptofeed/cryptofeed/middlewares.py (3656, 2023-12-21)
archives/cryptofeed/cryptofeed/pipelines.py (364, 2023-12-21)
archives/cryptofeed/cryptofeed/settings.py (3328, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/ (0, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/__init__.py (161, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/coinmarketcap.py (524, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/cryptonews.py (473, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/cryptopanic.py (486, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/googlenews.py (520, 2023-12-21)
archives/cryptofeed/cryptofeed/spiders/newsdataio.py (476, 2023-12-21)
archives/cryptofeed/scrapped_results/ (0, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptonews/ (0, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptonews/cryptonews-news.html (152037, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptopanic/ (0, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptopanic/cryptopanic-home.html (14452, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptopanic/cryptopanic-news-rss.html (10015, 2023-12-21)
archives/cryptofeed/scrapped_results/cryptopanic/cryptopanic-news.html (14452, 2023-12-21)
archives/cryptofeed/scrapy.cfg (263, 2023-12-21)
archives/scrapping-cluster/ (0, 2023-12-21)
archives/scrapping-cluster/.env (23, 2023-12-21)
archives/scrapping-cluster/crawler/ (0, 2023-12-21)
archives/scrapping-cluster/crawler/.coveragerc (77, 2023-12-21)
archives/scrapping-cluster/crawler/config/ (0, 2023-12-21)
archives/scrapping-cluster/crawler/config/example.yml (156, 2023-12-21)
archives/scrapping-cluster/crawler/config/file_pusher.py (1865, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/ (0, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/__init__.py (0, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/custom_cookies.py (774, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/distributed_scheduler.py (27350, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/items.py (551, 2023-12-21)
archives/scrapping-cluster/crawler/crawling/log_retry_middleware.py (6506, 2023-12-21)
... ...

# Cryptodata

## System Design

### Web Scraper

The following tools will be used to scrape data from the news feeds:

- [Scrapy](https://scrapy.org/)
- [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

Both are robust and widely used for web scraping (a minimal spider sketch appears under "Example sketches" below).

### Data Storage

Depending on the volume and nature of our data, we should consider using a combination of relational databases (like PostgreSQL) and NoSQL databases (like MongoDB or Elasticsearch); a MongoDB persistence sketch appears under "Example sketches" below:

- [PostgreSQL](https://www.postgresql.org/)
- [MongoDB](https://www.mongodb.com/)
- [Elasticsearch](https://www.elastic.co/)
- [InfluxDB](https://www.influxdata.com/)
- [TimescaleDB](https://www.timescale.com/)

### Data Builder (Processing)

The following tools will be used for real-time data processing, especially if we expect high volumes of data:

- [Apache Spark](https://spark.apache.org/)
- [Apache Kafka](https://kafka.apache.org/)

Spark can also handle batch processing, so it offers flexibility. Apache Kafka can be used in conjunction with Spark to handle real-time data ingestion and processing: Kafka acts as a buffer that stores the scraped data, and Spark then picks it up for processing (see the Kafka/Spark sketch below). Depending on the kind of analytics we run, a time-series database like InfluxDB or TimescaleDB might be beneficial.

### Monitoring & Error Handling

Since we are setting up a pipeline, we should have monitoring and alerting in place. Tools like Prometheus and Alertmanager can be integrated with Grafana to provide monitoring capabilities (see the metrics sketch below).

- [Prometheus](https://prometheus.io/)
- [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/)

### Dynamic Viewer with Analytics

The following tool will be used to visualize the data:

- [Grafana](https://grafana.com/)

Grafana is a solid choice for visualizing time-series data. It integrates well with many databases, including InfluxDB and TimescaleDB, and it has a rich library of plugins, so we should be able to find the right plugins or visualizations to represent the analytics as we envision them. For more interactive and custom analytics visualization, we could consider Tableau or Power BI.

### Infrastructure

We'll use a self-managed system, possibly with Kubernetes to manage the infrastructure:

- [Kubernetes](https://kubernetes.io/)
- [Docker](https://www.docker.com/)

### Automation

We should also consider setting up an automation tool or CI/CD pipeline, like Jenkins or GitHub Actions, to deploy updates and changes to our system seamlessly:

- [Jenkins](https://www.jenkins.io/)
- [GitHub Actions](https://github.com)

## Feeds

We'll use the following news feeds:

- [CryptoPanic](https://cryptopanic.com/news/)

## Data schema

### News object

A news item should have the following attributes (see the dataclass sketch below):

- `id`: unique identifier
- `title`: title of the news item
- `datetime`: date of the news item
- `description`: description of the news item
- `url`: URL of the news item
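
## Example sketches

The sketches below illustrate how some of the components above could fit together. They are illustrative, not the code shipped under `archives/`; hostnames, topic names, selectors, and similar identifiers are assumptions.

First, a minimal Scrapy spider for the CryptoPanic feed. The CSS selectors are placeholders and would need to match the real page markup (the archived `spiders/cryptopanic.py` may differ):

```python
import scrapy


class CryptoPanicSpider(scrapy.Spider):
    """Collects news items from the feed; run with
    `scrapy runspider cryptopanic_spider.py -o news.json`."""

    name = "cryptopanic"
    start_urls = ["https://cryptopanic.com/news/"]

    def parse(self, response):
        # "div.news-row" is a placeholder selector, not the site's real markup.
        for row in response.css("div.news-row"):
            yield {
                "title": row.css("a::text").get(),
                "url": response.urljoin(row.css("a::attr(href)").get() or ""),
            }
```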
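
Persisting items into MongoDB, one of the candidate stores from "Data Storage". The database and collection names are assumptions; upserting on the schema's unique `id` keeps re-scraped items from duplicating:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
news = client["cryptodata"]["news"]  # assumed db/collection names

# Upsert on the unique `id` from the News schema so a re-scraped item
# updates the stored document instead of duplicating it.
news.update_one(
    {"id": "abc123"},
    {"$set": {
        "title": "Bitcoin climbs",
        "datetime": "2023-11-03T15:09:15",
        "description": "...",
        "url": "https://example.com/article",
    }},
    upsert=True,
)
```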
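
The Kafka-as-buffer hand-off to Spark described in "Data Builder": the producer half would run inside the scraper process (e.g. a Scrapy item pipeline) and the Structured Streaming half as a separate Spark job. The broker address and the `crypto-news` topic are assumptions, and the Spark job needs the `spark-sql-kafka-0-10` package on its classpath:

```python
import json

from kafka import KafkaProducer  # kafka-python
from pyspark.sql import SparkSession

# --- Producer half (runs in the scraper process) ---
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda item: json.dumps(item).encode("utf-8"),
)
producer.send("crypto-news", {"title": "...", "url": "..."})  # assumed topic
producer.flush()

# --- Consumer half (runs as a separate Spark Structured Streaming job) ---
spark = SparkSession.builder.appName("cryptodata").getOrCreate()
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "crypto-news")
    .load()
)
# Decode the raw Kafka value; the real analytics would be computed here.
query = (
    stream.selectExpr("CAST(value AS STRING) AS news_json")
    .writeStream.format("console")
    .start()
)
query.awaitTermination()
```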
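
Exposing pipeline health metrics for Prometheus with `prometheus_client`, as suggested under "Monitoring & Error Handling". The metric names and port are assumptions; Alertmanager rules would then alert on these series:

```python
import time

from prometheus_client import Counter, start_http_server

ITEMS_SCRAPED = Counter(
    "cryptodata_items_scraped_total", "News items collected from the feeds"
)
SCRAPE_ERRORS = Counter(
    "cryptodata_scrape_errors_total", "Failed fetches from the feeds"
)

if __name__ == "__main__":
    # Prometheus scrapes http://<host>:8000/metrics
    start_http_server(8000)
    while True:
        ITEMS_SCRAPED.inc()  # in the real pipeline, the scraper calls this
        time.sleep(5)
```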
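
Finally, one way to encode the "News object" schema as a Python dataclass; the archived `items.py` files may define it differently (e.g. as a `scrapy.Item`):

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class News:
    id: str                # unique identifier
    title: str             # title of the news item
    datetime: dt.datetime  # date of the news item
    description: str       # description of the news item
    url: str               # url of the news item
```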
