【Spark】Spark Web UI - SQL

At work it is common for Spark SQL queries to run slowly or fail outright, and troubleshooting them requires knowing how to read the Spark Web UI. The official documentation is a good reference: https://spark.apache.org/docs/3.2.1/web-ui.html#content. The Spark Web UI has quite a few tabs, and we will work through them one by one.

[Figure: the tab bar of the Spark Web UI]
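Before diving into the SQL tab, it helps to know where the UI lives. A minimal sketch (the application name below is illustrative): while an application is running, its Web UI is served on the driver at port 4040 by default, and spark.ui.port selects a different port if 4040 is already taken.

import org.apache.spark.sql.SparkSession

// While this application runs, its Web UI (Jobs, Stages, Storage, SQL, ... tabs)
// is available at http://<driver-host>:4040 by default.
val spark = SparkSession.builder()
  .appName("web-ui-demo")            // hypothetical application name
  .master("local[*]")
  .config("spark.ui.port", "4041")   // optional: use a port other than the default 4040
  .getOrCreate()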

Today we look at one of the most commonly used tabs: SQL.

SQL Tab

If the application executes Spark SQL queries, the SQL tab displays information, such as the duration, jobs, and physical and logical plans for the queries. Here we include a basic example to illustrate this tab:


scala> val df = Seq((1, "andy"), (2, "bob"), (2, "andy")).toDF("count", "name")
df: org.apache.spark.sql.DataFrame = [count: int, name: string]

scala> df.count
res0: Long = 3                                                                  

scala> df.createGlobalTempView("df")

scala> spark.sql("select name,sum(count) from global_temp.df group by name").show
+----+----------+
|name|sum(count)|
+----+----------+
|andy|         3|
| bob|         2|
+----+----------+

[Figure: the SQL tab listing the completed queries]

Now the above three dataframe/SQL operators are shown in the list. If we click the 'show at &lt;console&gt;: 24' link of the last query, we will see the DAG and details of the query execution.


[Figure: the DAG of the query execution]

The query details page displays information about the query execution time, its duration, the list of associated jobs, and the query execution DAG. The first block ‘WholeStageCodegen (1)’ compiles multiple operators (‘LocalTableScan’ and ‘HashAggregate’) together into a single Java function to improve performance, and metrics like number of rows and spill size are listed in the block. The annotation ‘(1)’ in the block name is the code generation id. The second block ‘Exchange’ shows the metrics on the shuffle exchange, including number of written shuffle records, total data size, etc.



[Figure: the query details page with the execution DAG and per-block metrics]
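The generated Java code behind a 'WholeStageCodegen' block can also be printed from the shell using Spark's built-in debug helpers. A minimal sketch, continuing the session above; the helper prints each whole-stage-codegen subtree together with the source compiled for it:

import org.apache.spark.sql.execution.debug._

val q = spark.sql("select name,sum(count) from global_temp.df group by name")
q.debugCodegen()  // prints each WholeStageCodegen subtree and its generated Java code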

Clicking the 'Details' link at the bottom displays the logical plans and the physical plan, which illustrate how Spark parses, analyzes, optimizes and performs the query. Steps in the physical plan that are subject to whole-stage code generation are prefixed by a star followed by the code generation id, for example: '*(1) LocalTableScan'.

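The same plans that the Details link shows can be printed directly in the shell; a short sketch, again continuing the session above:

val q = spark.sql("select name,sum(count) from global_temp.df group by name")
q.explain(true)          // parsed, analyzed and optimized logical plans plus the physical plan
q.explain("formatted")   // Spark 3.0+: operator tree followed by per-operator details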

SQL metrics

The metrics of SQL operators are shown in the block of physical operators. The SQL metrics can be useful when we want to dive into the execution details of each operator. For example, “number of output rows” can answer how many rows are output after a Filter operator, “shuffle bytes written total” in an Exchange operator shows the number of bytes written by a shuffle.


Here is the list of SQL metrics:

| SQL metric | Meaning | Operators |
| --- | --- | --- |
| number of output rows | the number of output rows of the operator | Aggregate operators, Join operators, Sample, Range, Scan operators, Filter, etc. |
| data size | the size of broadcast/shuffled/collected data of the operator | BroadcastExchange, ShuffleExchange, Subquery |
| time to collect | the time spent on collecting data | BroadcastExchange, Subquery |
| scan time | the time spent on scanning data | ColumnarBatchScan, FileSourceScan |
| metadata time | the time spent on getting metadata like number of partitions, number of files | FileSourceScan |
| shuffle bytes written | the number of bytes written | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| shuffle records written | the number of records written | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| shuffle write time | the time spent on shuffle writing | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote blocks read | the number of blocks read remotely | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote bytes read | the number of bytes read remotely | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| remote bytes read to disk | the number of bytes read from remote to local disk | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| local blocks read | the number of blocks read locally | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| local bytes read | the number of bytes read locally | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| fetch wait time | the time spent on fetching data (local and remote) | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| records read | the number of read records | CollectLimit, TakeOrderedAndProject, ShuffleExchange |
| sort time | the time spent on sorting | Sort |
| peak memory | the peak memory usage in the operator | Sort, HashAggregate |
| spill size | the number of bytes spilled to disk from memory in the operator | Sort, HashAggregate |
| time in aggregation build | the time spent on aggregation | HashAggregate, ObjectHashAggregate |
| avg hash probe bucket list iters | the average bucket list iterations per lookup during aggregation | HashAggregate |
| data size of build side | the size of the built hash map | ShuffledHashJoin |
| time to build hash map | the time spent on building the hash map | ShuffledHashJoin |
| task commit time | the time spent on committing the output of a task after the writes succeed | any write operation on a file-based table |
| job commit time | the time spent on committing the output of a job after the writes succeed | any write operation on a file-based table |
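These metrics can also be read programmatically after a query has run. The sketch below walks the executed physical plan and prints every operator's metrics. Note the assumptions: SparkPlan.metrics is an internal API that may change between versions, and adaptive query execution (on by default since Spark 3.2) wraps the plan in an AdaptiveSparkPlanExec node, so it is disabled here purely to keep the traversal simple.

spark.conf.set("spark.sql.adaptive.enabled", "false")  // simplify the plan tree for this sketch

val q = spark.sql("select name,sum(count) from global_temp.df group by name")
q.collect()  // run an action so the metrics get populated

// Walk the physical plan and print each operator's SQL metrics (internal API).
q.queryExecution.executedPlan.foreach { node =>
  node.metrics.foreach { case (name, metric) =>
    println(s"${node.nodeName}: $name = ${metric.value}")
  }
}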



Reprinted from blog.csdn.net/weixin_45545090/article/details/125268940