Compare SparkCore and SparkSQL and 4 ways we use SparkSQL in production

SparkCore && SparkSQL

Introduction

Apache Spark is a powerful framework for big data processing. It has multiple components, and two of the most important ones are SparkCore and SparkSQL. In modern data warehouses, many companies still use Hive on Spark to manage and model structured data. However, HiveQL is no longer the preferred choice for big data computation.

Why Not Hive?

Hive has several limitations that make it less suitable for modern big data processing:
  1. Performance Issues
      • HiveQL translates queries into MapReduce jobs, which write intermediate results to disk, making execution slow.
      • Disk-based MapReduce is inefficient compared to in-memory computation.
  2. Limited Functionality
      • Some data processing tasks are hard to express in HiveQL alone, so developers must write UDFs (User Defined Functions), which adds complexity.
  3. Structured Data Only
      • Hive only supports structured data with a defined schema; it cannot handle semi-structured or unstructured data.

Why Use Spark?

Spark overcomes Hive’s limitations and provides powerful advantages:
  1. High Performance
      • Spark processes data in memory, making it 10 times faster than disk-based MapReduce.
  2. Functional Programming with RDDs
      • SparkCore provides RDD (Resilient Distributed Dataset) operations. This functional programming model allows flexible data transformations.
  3. SQL Support with SparkSQL
      • SparkSQL provides a SQL API, similar to HiveQL, but with much better performance.
      • SparkSQL also supports DSL functions and RDD-based transformations, making it more flexible.

Understanding SparkCore and SparkSQL

SparkCore - Process data by calling API functions (object-oriented and functional style)

  • Works like MapReduce, using RDD transformations to process data.
  • Use case: Often used to clean semi-structured data and convert it into structured data before analysis.
  • Example: ETL (Extract, Transform, Load) processing.
from pyspark import SparkContext

# Initialize SparkContext
sc = SparkContext("local", "SparkCoreExample")

# Create an RDD from a list
data = ["Spark", "Hadoop", "Big Data", "Machine Learning"]
rdd = sc.parallelize(data)

# Transform the RDD by mapping each element to its length
length_rdd = rdd.map(lambda word: (word, len(word)))

# Collect and print the results
print(length_rdd.collect())
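To illustrate the ETL use case described above, here is a minimal sketch that cleans semi-structured text with RDD transformations and hands the result to SparkSQL as a DataFrame. The field names and sample lines are invented for illustration only.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDCleaningSketch").getOrCreate()
sc = spark.sparkContext

# Hypothetical semi-structured input: "user_id,event,timestamp" lines, some malformed
raw = sc.parallelize([
    "1,click,2024-01-01 10:00:00",
    "2,view,2024-01-01 10:05:00",
    "bad line without commas",
])

# Drop malformed records and split the rest into fields
parsed = (raw.map(lambda line: line.split(","))
             .filter(lambda fields: len(fields) == 3))

# Turn the cleaned RDD into a structured DataFrame so SparkSQL can query it
df = spark.createDataFrame(parsed, ["user_id", "event", "event_time"])
df.show()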

SparkSQL - Table-oriented; what we use most in the real world

  • Works like HiveQL, using SQL queries for large-scale data analysis and computation.
  • Use case: Performing analytics and batch processing on structured data.
  • Development methods (a DSL sketch follows the SQL example below):
      1. DSL (functional API with DataFrames/RDDs)
      2. SQL (SQL queries on tables)
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Create a DataFrame from sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Register DataFrame as a temporary SQL table
df.createOrReplaceTempView("people")

# Run SQL query
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 28")

# Show results
result.show()
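For comparison, the same filter can be written with the DSL (DataFrame API) instead of a SQL string. This is only a sketch reusing the same sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("SparkDSLExample").getOrCreate()

data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
df = spark.createDataFrame(data, ["Name", "Age"])

# Same filter as the SQL version, expressed with DataFrame (DSL) methods
result = df.select("Name", "Age").where(F.col("Age") > 28)
result.show()

Both forms go through the same Catalyst optimizer, so the choice is largely a matter of style.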

Conclusion

SparkCore and SparkSQL are essential for modern big data processing. While SparkCore helps process and transform raw data, SparkSQL provides an easy way to run SQL queries efficiently. By using Spark, companies can improve performance, reduce complexity, and handle more data types compared to traditional Hive-based solutions.

4 ways we use SparkSQL in production

1. Python/Jar: Using spark-submit

spark-submit can launch both SparkSQL and Spark Core applications, packaged either as Python scripts (.py) or as Java/Scala JAR files.

Usage Scenario: ETL (Extract, Transform, Load)

  • Python: Submit a Spark job developed in Python.
  • Java: Submit a compiled JAR file to Spark.
Example: Submitting a Python script
spark-submit --master yarn my_spark_job.py
Example: Submitting a Java JAR file
spark-submit --class com.example.MyApp --master yarn my_app.jar
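In practice the submit command usually also sets the deploy mode and executor resources; the values below are placeholders, not tuning advice:

spark-submit --master yarn --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  --executor-cores 2 \
  my_spark_job.py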

2. ThriftServer: Submitting SQL Only (Similar to Hive's HiveServer2)

ThriftServer is used to execute SQL queries only, without the need to write Spark Core transformations.
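Before any of the clients below can connect, the Spark Thrift Server must be running. Assuming a standard Spark distribution on YARN, it is typically started with the bundled script (the port shown is the default):

$SPARK_HOME/sbin/start-thriftserver.sh --master yarn --hiveconf hive.server2.thrift.port=10000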

Common Ways to Submit SQL

a) PyHive (Most Common in Production)

PyHive allows Python applications to connect to SparkSQL and execute SQL queries.
Example: Connecting to SparkSQL from Python
from pyhive import hive

# Connect to the Spark Thrift Server
conn = hive.Connection(host='spark-thriftserver-host', port=10000, database='default')
cursor = conn.cursor()

# Execute a query and print the results
cursor.execute("SELECT * FROM my_table LIMIT 10")
for result in cursor.fetchall():
    print(result)
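A common follow-up in production scripts is to pull the result set straight into pandas, which accepts the PyHive connection as a DBAPI source (a sketch, assuming pandas is installed):

import pandas as pd

# Reuse the PyHive connection from the example above
df = pd.read_sql("SELECT * FROM my_table LIMIT 10", conn)
print(df.head())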

b) spark-sql -f (For Automated SQL Execution)

Used for executing SQL scripts, similar to hive -f.
Example: Running an SQL file
spark-sql -f my_query.sql
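The file itself contains ordinary SparkSQL/HiveQL statements; a hypothetical my_query.sql for a daily batch load might look like this (all table and column names are made up):

-- my_query.sql: hypothetical daily aggregation
INSERT OVERWRITE TABLE dw.daily_orders
SELECT order_date,
       COUNT(*)    AS order_cnt,
       SUM(amount) AS total_amount
FROM   ods.orders
WHERE  order_date = '2024-01-01'
GROUP  BY order_date;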

c) Beeline (Typically Used for Testing)

Beeline provides an interactive command-line interface to connect to SparkSQL.
Example: Connecting to SparkSQL
beeline -u jdbc:hive2://spark-thriftserver-host:10000/default
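Beeline can also run non-interactively, which is convenient for quick checks in scripts; -n supplies a placeholder username, and -e / -f take an inline statement or a file:

beeline -u jdbc:hive2://spark-thriftserver-host:10000/default -n hadoop -e "SELECT COUNT(*) FROM my_table"
beeline -u jdbc:hive2://spark-thriftserver-host:10000/default -n hadoop -f my_query.sql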

d) JDBC (Rarely Used)

JDBC allows Java applications to connect to SparkSQL and execute SQL queries.
Example: Connecting via JDBC in Java
import java.sql.*;

// Connect to the Spark Thrift Server through the Hive JDBC driver
Connection conn = DriverManager.getConnection(
        "jdbc:hive2://spark-thriftserver-host:10000/default", "user", "password");
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery("SELECT * FROM my_table");
while (rs.next()) {
    System.out.println(rs.getString(1));
}
