PySpark is the Python API for Apache Spark. It not only lets you write Spark applications using Python APIs but also provides the PySpark shell, so you can interactively analyze your data in a distributed environment. PySpark supports most Apache Spark features.
General Execution: Spark Core
Spark Core is the underlying general execution engine for the Spark platform, on top of which all other functionality is built. It provides in-memory computing and exposes the resilient distributed dataset (RDD) as its core abstraction.
Structured Data: Spark SQL
Spark SQL is a Spark module for structured data processing. It provides a programming
abstraction called DataFrames and can also act as a distributed SQL query engine.
Streaming Analytics: Spark Streaming
Running on top of Spark, Spark Streaming enables powerful interactive and analytical
applications across both streaming and historical data, while inheriting Spark's ease of use
and fault-tolerance characteristics.
Machine Learning: MLlib
Machine learning has quickly emerged as a critical piece in mining big data for actionable insights.
Built on top of Spark, MLlib is a scalable machine learning library that delivers both high-quality
algorithms (e.g., iterative methods that refine results over multiple passes) and high performance
(reported as up to 100x faster than MapReduce for iterative workloads).