spark-streaming-playground

Category: Cloud Computing
Development tool: HTML
File size: 54621 KB
Downloads: 0
Upload date: 2022-11-22 05:24:59
Uploader: sh-1993
Description: Full Stack Data Science projects centered around Apache Spark Streaming for educational purposes.

File list:
LICENSE (11355, 2020-08-05)
Makefile (121, 2020-08-05)
bin (0, 2020-08-05)
bin\analytics (0, 2020-08-05)
bin\analytics\online_lda_topic_modeling.sh (1048, 2020-08-05)
bin\analytics\streaming_sentiment_tweet_analysis.sh (1228, 2020-08-05)
bin\analytics\trending_tweet_hashtags.sh (1229, 2020-08-05)
bin\data (0, 2020-08-05)
bin\data\dump_raw_data_as_file.sh (1137, 2020-08-05)
bin\data\dump_raw_data_into_bronze_lake.sh (962, 2020-08-05)
bin\data\dump_raw_data_into_hdfs.sh (932, 2020-08-05)
bin\data\dump_raw_data_into_postgresql.sh (1159, 2020-08-05)
bin\data\prepare_ssp_dataset.sh (221, 2020-08-05)
bin\data\start_kafka_producer.sh (263, 2020-08-05)
bin\data\start_kafka_scrapy.sh (162, 2020-08-05)
bin\data\visulaize_raw_text.sh (950, 2020-08-05)
bin\download (0, 2020-08-05)
bin\download\stackoverflow.sh (327, 2020-08-05)
bin\emr_run.sh (1099, 2020-08-05)
bin\flask (0, 2020-08-05)
bin\flask\ai_tweets_dashboard.sh (90, 2020-08-05)
bin\flask\api_endpoint.sh (83, 2020-08-05)
bin\flask\tagger.sh (77, 2020-08-05)
bin\flask\trending_hashtags_dashboard.sh (88, 2020-08-05)
bin\models (0, 2020-08-05)
bin\models\build_naive_dl_text_classifier.sh (207, 2020-08-05)
bin\models\build_sentiment_spark_model_offline.sh (961, 2020-08-05)
bin\models\evalaute_snorkel_labeller.sh (195, 2020-08-05)
bin\models\online_lda_topic_modeling.sh (1039, 2020-08-05)
bin\nlp (0, 2020-08-05)
bin\nlp\ner_extraction_using_spacy.sh (1374, 2020-08-05)
bin\nlp\spark_dl_text_classification_main.sh (1387, 2020-08-05)
config (0, 2020-08-05)
config\ai_tweets_dashboard.gin (347, 2020-08-05)
config\api_endpoint.gin (56, 2020-08-05)
config\conf (0, 2020-08-05)
config\conf\hadoop (0, 2020-08-05)
... ...

# [Fullstack Data Science Examples with Structured Streaming](https://gyan42.github.io/spark-streaming-playground/)

The aim of this project is to create a zoo of Big Data frameworks on a single machine, where pipelines can be built and tested on top of the Twitter stream. This involves, but is not limited to: fetching the data and storing it in a data lake, playing around with Spark SQL and Structured Streaming for processing, creating datasets from the live stream for Machine Learning, and doing interactive visualization from the data lake.

![](docs/source/drawio/big_data_zoo.png)

## What is [Spark Streaming](https://techvidvan.com/tutorials/spark-streaming/)?

First of all, what is streaming? A data stream is an unbounded sequence of data arriving continuously. Streaming divides the continuously flowing input data into discrete units for processing. Stream processing is the low-latency processing and analysis of streaming data.

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data. Spark Streaming is for use cases that require a significant amount of data to be processed as soon as it arrives. Example real-time use cases are:

- Website monitoring, network monitoring
- Fraud detection
- Web clicks
- Advertising
- Internet of Things sensors

Spark Streaming supports data sources such as HDFS directories, TCP sockets, Kafka, Flume, Twitter, etc. Data streams can be processed with Spark's core APIs, DataFrames, SQL, or machine learning APIs, and can be persisted to a filesystem, HDFS, databases, or any data source offering a Hadoop OutputFormat. (A minimal Structured Streaming code sketch is included at the end of this README.)

- [Spark Streaming Playground Environment Setup](https://gyan42.github.io/spark-streaming-playground/build/html/setup/setup.html)
- [Learning Materials](https://gyan42.github.io/spark-streaming-playground/build/html/tutorials.html)
- [Localhost Port Numbers Used](https://gyan42.github.io/spark-streaming-playground/build/html/host_urls_n_ports.html)
- [How to Run?](https://gyan42.github.io/spark-streaming-playground/build/html/how_to_run.html)
- [Usecases](https://gyan42.github.io/spark-streaming-playground/build/html/usecases/usecases.html)

![](docs/source/drawio/usecase6.png)

**Sanity test**

Run pytest to check that everything works fine:

```
pytest -s
pytest -rP    # shows the captured output of passed tests
pytest -rx    # shows the captured output of failed tests (default behaviour)
```

**Build Documents**

```
cd docs
make ssp
```

## Medium Post

[https://medium.com/@mageswaran1***9/big-data-play-ground-for-engineers-intro-71d7c174dfd0](https://medium.com/@mageswaran1***9/big-data-play-ground-for-engineers-intro-71d7c174dfd0)

## Block Chain and Streaming

https://github.com/dhiraa/blockchain-streaming
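
**Structured Streaming sketch**

To make the streaming flow described above concrete, here is a minimal PySpark Structured Streaming sketch in the spirit of the trending-hashtags use case. The Kafka broker address (`localhost:9092`), the topic name (`twitter_data`), and the assumption that each Kafka message value is raw tweet text are illustrative only and are not taken from this repository; the actual pipelines are launched through the scripts under `bin/`.

```python
# Minimal sketch: read a tweet stream from Kafka, count hashtags over a sliding
# window, and print the running counts to the console.
# Broker, topic, and message layout below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("hashtag-count-sketch")
         .getOrCreate())

# Read the raw tweet stream from a Kafka topic (hypothetical topic/broker).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "twitter_data")
       .load())

# Kafka values arrive as bytes; treat them as plain tweet text for this sketch.
tweets = raw.selectExpr("CAST(value AS STRING) AS text", "timestamp")

# Split each tweet into words and keep only the hashtags.
hashtags = (tweets
            .select(F.explode(F.split("text", r"\s+")).alias("word"), "timestamp")
            .where(F.col("word").startswith("#")))

# Count hashtags over a 1-minute window that slides every 30 seconds.
counts = (hashtags
          .groupBy(F.window("timestamp", "1 minute", "30 seconds"), "word")
          .count()
          .orderBy(F.desc("count")))

# Write the running counts to the console; 'complete' mode re-emits the full result.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```

Note that the Kafka source requires the Spark/Kafka connector on the classpath, for example via `spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_<scala-version>:<spark-version>`.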
