# Apache Spark

## Download Spark

Local build, March 2021:

- Spark release 3.1.1; package type "prebuilt for Hadoop 2.7"
- OpenJDK 15.0.2
- Python 3.9.2 and pyspark

## PySpark Shell

To start PySpark on Linux/OS X:

```
$HOME/spark-3.1.1-bin-hadoop2.7/bin/pyspark
```

or install with pip and start pyspark:

```
pip install pyspark
pyspark
```

# theory

Reference: *Spark in Action*, Second Edition, Jean-Georges Perrin, Manning Publications.

## the big picture

- Apache Spark 3 is certified to run on Java 11 [ref: book 1.5.1].
- Supports APIs for Scala, R, Python, and Java.

### what is it and what is it used as

- Spark can be imagined as an analytics operating system.
- It automatically manages the distributed nodes under it.
- Provides standardized Spark SQL for RDBMS-style access.
- To use Spark, it is not necessary to understand the workings under the hood (refer: figure 1.3).
- Excels in big data scenarios: data engineering, data science, etc.
- Basic steps: ingestion (raw data), improvement of data quality (pure data), transformation (rich data), publication (fig 1.5).

### four pillars of spark

- Ref: fig 1.4.
- Spark SQL: data operations; offers an API and SQL.
- Spark Streaming (uses RDDs) and Structured Streaming (uses Spark SQL and dataframes): to analyze streaming data.
- Spark MLlib: for ML and DL.
- Spark GraphX: to manipulate graph-based data structures within Spark.

### storage and APIs; the dataframe

- The dataframe is essential to Spark: both a data container and an API (fig 1.7).
- The API is extensible through user-defined functions (UDFs) [ref: ch 16].

## architecture and flow

- The application is the driver.
- Data can be processed and manipulated within the master, even remotely.

### typical flow of data in an app

- Ref: fig 2.9.
- Connecting to a master: for every Spark app, the first operation is to connect to the Spark master and get a Spark "session."
- The master can be either local or a remote cluster.
- **Ingestion**: ask Spark to load the data.
- Spark relies on the distributed worker nodes to do the work.
- The allocated workers ingest at the same time.
- The filesystem must be either shared or a distributed filesystem.
- Each worker node assigns partitions in memory for ingestion and creates tasks that ingest the records from the file/filesystem.
- Data is ingested by all assigned worker nodes simultaneously.
- **Transformation**: tasks within each node perform the transformations.
- The entire processing takes place within the workers.
- **Saving or publishing**: the processed data is stored in the database or published from the partitions.
- If there are four partitions, saving to a database requires four connections to the database.
- If there are 200,000 tasks, this causes 200,000 simultaneous connections to the database.
- Chapter 17: solution to the load issue.

## role of the dataframe

- Dataframe: a data structure similar to a table in the relational database world.
- An **immutable** distributed collection of data.
- Organized into **columns**.
- A dataframe is an RDD with a schema.
- Use a dataframe over an RDD when **performance is critical**.
- Transformations: operations performed on data.
- Resilient distributed dataset (RDD): the dataframe is built on top of the RDD concept. The RDD was the first generation of storage in Spark.
- **APIs across Spark libraries are unified under the dataframe API.**

### using dataframe in python

[pyspark.sql API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html)

### essential role of dataframe

- Figure 3.1.

### data immutability

- **Spark stores the initial state of the data, in an immutable way, and then keeps the recipe (a list of transformations).**
- Access to the data is kept in sync across the nodes.
- Spark stores only the steps of the transformations.
- This becomes intuitive when the nodes are distributed.
- Fig 3.4.

### Catalyst

- Spark builds the list of transformations as a directed acyclic graph (DAG).
- The DAG is optimized by Catalyst, Spark's built-in optimizer.
# Ingestion

## from files

- A regex can be used to specify the path.
- To parse CSV files, Spark provides a rich set of options to tune the parser.
- Spark can ingest both single-line and multiline JSON.
- To ingest XML, a plugin provided by Databricks is required.
- Avro, ORC, and Parquet are well supported by Spark.
- The pattern for ingesting is fairly similar: specify the format and read.

## from databases

- Spark requires JDBC drivers.
- Spark comes with support for IBM Db2, Apache Derby, MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and Teradata Database.
- Joins can be performed at the database level prior to ingestion, but joins can also be done in Spark.
- Ingesting data from Elasticsearch follows the same principle as any other ingestion.
- Elasticsearch contains JSON documents, which are ingested as-is into Spark.

## advanced

- WIP

## from streaming data

# Transformation

## SQL

## data

## entire documents

## UDF

## Aggregating