Gluten, a vectorized execution engine framework, officially open sourced and unveiled at the Spark Technology Summit

Published by MissD on 2022-11-24
"Kyligence's enterprise-level products are derived from Apache Kylin, and today both deeply integrate Spark's capabilities in offline data processing, real-time query analysis, and more. Through the open source project Gluten, Kylin and Kyligence's enterprise-level products will effectively improve OLAP query performance and execution efficiency. In the cloud-native Kyligence Cloud in particular, this will further reduce total cost of ownership (TCO), improve the cost-efficiency of cloud-based data analysis, and accelerate large customers' transition from traditional data analysis architectures to cloud-native data lake architectures." — Li Yang, Co-founder and CTO of Kyligence

At the recently held Databricks Data & AI Summit 2022, Chen Weiting from Intel and Zhang Zhichao from Kyligence presented "Gluten", a new open source project that Intel and Kyligence have jointly built since 2021. This was also Gluten's first appearance on a global stage, and this article takes a closer look at the project.

The Gluten project aims to inject Native Vectorized Execution capabilities into Apache Spark, significantly improving Spark's execution efficiency and reducing its cost. The main participants in the Gluten community currently include Intel and Kyligence.

1. Why is Gluten needed?

In recent years, with advances in I/O technology, especially the widespread adoption of SSDs and 10 GbE NICs, Apache Spark workloads increasingly run into CPU compute bottlenecks rather than the I/O bottlenecks of conventional wisdom. As is well known, CPU instruction-level optimization is difficult on the JVM, because the JVM exposes far fewer instruction-level optimizations (such as SIMD) than native languages like C++.

At the same time, the open source community already has relatively mature Native Engines (such as ClickHouse and Velox) with excellent vectorized execution capabilities that have been proven to bring significant performance advantages. However, these engines often live outside the Spark ecosystem, which is unfriendly to users who already rely heavily on the Spark computing framework and cannot accept substantial operations and migration costs. The Gluten community hopes to let Spark users enjoy the performance benefits of these mature Native Engines without migrating.

Coincidentally, Databricks recently published a paper on the Photon project, "Photon: A Fast Query Engine for Lakehouse Systems", at SIGMOD 2022. It describes in detail how Databricks integrated Photon, a native subsystem, into Apache Spark, and how vectorized execution and other optimizations brought Spark a significant improvement in execution performance. The Gluten project was independently established and launched before Photon was published, but the two show a clear similarity in approach and acceleration effect.

The following figure is from Databricks' public presentation material. It shows that the performance gain from introducing the native vectorized engine (Photon) exceeds the sum of all performance optimizations over the past five years. This performance improvement also translates into a better Spark experience and lower IT costs, which is very attractive for enterprise users running Spark jobs on hundreds or thousands of servers. Photon is not currently open source, so the Gluten project could very well fill that gap in the industry.

2. What is the Gluten project?

The word "gluten" means glue in Latin, and the Gluten project acts like glue: it is mainly used to "glue" Apache Spark to a Native Vectorized Engine serving as a Backend. There are many options for the Backend; Velox, ClickHouse, and Apache Arrow are currently explicitly supported in the Gluten project.

Starting from this positioning and combined with the following figure, we can roughly see what capabilities the Gluten project needs to provide:

2.1 Plan Conversion & Fallback

This is the core capability of Gluten. In short, it uses the Spark Plugin mechanism to intercept Spark query plans and send them to the Native Engine for execution, skipping the inefficient execution path of vanilla Spark. The overall execution framework still reuses Spark's existing implementation, including its consumer-facing interfaces, resource and execution scheduling, query plan optimization, and upstream/downstream integration.

Generally speaking, a Native Engine cannot cover 100% of the operators in a Spark query execution plan, so Gluten must analyze which operators in the plan can be pushed down to the Native Engine, wrap adjacent pushdown-able operators into a Pipeline, serialize it, and send it to the Native Engine for execution, returning the results. For serialization Gluten relies on a separate open source project, Substrait, which uses protobuf to serialize engine-neutral query plans.

For operators that the Native Engine cannot handle, Gluten arranges a fallback to the normal Spark execution path. Databricks' Photon currently supports only a subset of Spark operators and presumably adopts a similar approach.
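As a toy illustration of this segmentation-plus-fallback idea, the following sketch splits a linear plan into alternating native and Spark pipelines. The operator names and the `NATIVE_SUPPORTED` set are hypothetical; real Gluten works on Spark's physical plan tree, not a flat list:

```python
# Hypothetical sketch of Gluten-style plan segmentation: adjacent operators
# that the native backend supports are grouped into one pipeline; unsupported
# operators fall back to vanilla Spark execution.

NATIVE_SUPPORTED = {"Scan", "Filter", "Project", "HashAggregate"}

def segment_plan(operators):
    """Split a linear query plan into (engine, [operators]) pipelines."""
    pipelines = []
    for op in operators:
        engine = "native" if op in NATIVE_SUPPORTED else "spark"
        if pipelines and pipelines[-1][0] == engine:
            pipelines[-1][1].append(op)       # extend the current pipeline
        else:
            pipelines.append((engine, [op]))  # engine boundary: start a new one
    return pipelines

plan = ["Scan", "Filter", "Project", "SortMergeJoin", "HashAggregate"]
assert segment_plan(plan) == [
    ("native", ["Scan", "Filter", "Project"]),  # pushed down as one pipeline
    ("spark",  ["SortMergeJoin"]),              # fallback to vanilla Spark
    ("native", ["HashAggregate"]),
]
```

Each `native` pipeline would then be serialized (e.g. to a Substrait protobuf) and handed to the backend, while `spark` pipelines run on the vanilla path.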

From a threading-model perspective, Gluten calls native code directly in the Spark Executor task thread via JNI library calls, and strictly controls the number of JNI calls. Gluten therefore introduces no complex threading model of its own. For details, refer to the following figure:

2.2 Memory Management

Since native code and Spark's Java code run in the same process, Gluten is well positioned to manage native-space and JVM-space memory in a unified way. In Gluten, when code in the native space requests memory, it first asks the local memory pool; if that is insufficient, it requests a memory quota from the Task Memory Manager in the JVM, and only after obtaining the quota does the native-space allocation succeed. In this way, native-space allocations are also managed uniformly by the Task Memory Manager. When memory runs short, the Task Memory Manager triggers a spill, and operators in both the native space and the JVM release memory when they receive the spill notification.
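The quota-and-spill flow described above can be sketched as follows. This is a simplified model with invented class names (`TaskMemoryManager`, `NativeMemoryPool`), not Gluten's actual API:

```python
# Illustrative sketch of the unified memory flow: a native-side memory pool
# must obtain a quota from the JVM-side Task Memory Manager before allocating,
# and the manager asks its consumers to spill when it cannot grant more.

class TaskMemoryManager:
    def __init__(self, limit):
        self.limit = limit
        self.granted = 0
        self.consumers = []

    def acquire(self, size):
        if self.granted + size > self.limit:
            for c in self.consumers:      # over budget: ask consumers to spill
                self.granted -= c.spill()
            if self.granted + size > self.limit:
                return False              # still over budget: deny the quota
        self.granted += size
        return True

class NativeMemoryPool:
    def __init__(self, manager):
        self.manager = manager
        self.used = 0
        manager.consumers.append(self)

    def allocate(self, size):
        if not self.manager.acquire(size):
            raise MemoryError("quota denied by Task Memory Manager")
        self.used += size

    def spill(self):
        freed, self.used = self.used, 0   # pretend everything spills to disk
        return freed

mgr = TaskMemoryManager(limit=100)
pool = NativeMemoryPool(mgr)
pool.allocate(80)
pool.allocate(50)  # exceeds the limit, triggers a spill, then succeeds
assert pool.used == 50 and mgr.granted == 50
```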

2.3 Columnar Shuffle

Shuffle itself is a major performance factor. Since Native Engines mostly stage data in columnar structures, naively reusing Spark's row-based shuffle would introduce a column-to-row conversion in the Shuffle Write stage and a row-to-column conversion in the Shuffle Read stage just to keep data flowing, and neither conversion is cheap. Gluten must therefore provide a complete Columnar Shuffle mechanism to avoid this conversion overhead.

Like native Spark, Columnar Shuffle also needs to support spilling when memory is insufficient, prioritizing query robustness.
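To make the conversion-avoidance point concrete, here is a minimal sketch of a columnar shuffle write that hash-partitions column batches directly, never pivoting to rows. The dict-of-lists batch layout is purely illustrative; real Gluten backends operate on native Arrow/Velox/ClickHouse batches:

```python
# Toy columnar shuffle write: each row's cells are copied straight into the
# target partition's columns, preserving the columnar layout end to end.

def columnar_shuffle_write(batch, key_col, num_partitions):
    """batch: dict mapping column name -> list of values (equal lengths)."""
    out = [{col: [] for col in batch} for _ in range(num_partitions)]
    for i in range(len(batch[key_col])):
        p = hash(batch[key_col][i]) % num_partitions
        for col, values in batch.items():
            out[p][col].append(values[i])  # append cell to that partition's column
    return out

batch = {"id": [1, 2, 3, 4], "amount": [10.0, 20.0, 30.0, 40.0]}
parts = columnar_shuffle_write(batch, "id", 2)
assert sum(len(p["id"]) for p in parts) == 4  # every row lands in one partition
```

A real implementation would additionally spill partition buffers to disk when the memory quota is exhausted, as noted above.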

2.4 Compatibility

Users may prefer different Native Engines that fit their company's technology stack. Gluten therefore needs to define a clear JNI interface as the bridge between the Spark framework and the underlying Backend, covering request transmission, data transfer, capability detection, and so on. Developers only need to implement these interfaces and satisfy the corresponding semantic guarantees to let Gluten complete the "glue" work between Spark and a Native Engine.

On the Spark side, the Shim Layer reserved in the current architecture design is used to adapt to different Spark versions.
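The shape of such a backend contract might look like the following sketch: capability detection plus execution of a serialized plan fragment. All names here are invented for illustration; the real boundary is a set of JNI interfaces, not Python classes:

```python
# Hedged sketch of a pluggable backend contract: a backend declares which
# operators it can handle (capability detection) and executes a serialized
# (e.g. Substrait) plan fragment, returning result batches.

from abc import ABC, abstractmethod

class GlutenBackend(ABC):
    @abstractmethod
    def supports(self, operator: str) -> bool:
        """Capability detection: can this backend run the given operator?"""

    @abstractmethod
    def execute(self, plan_bytes: bytes) -> list:
        """Run a serialized plan fragment and return result batches."""

class ToyVeloxBackend(GlutenBackend):
    SUPPORTED = {"Scan", "Filter", "Project"}

    def supports(self, operator):
        return operator in self.SUPPORTED

    def execute(self, plan_bytes):
        # A real backend would deserialize and run the plan natively.
        return [f"executed {plan_bytes.decode()}"]

backend = ToyVeloxBackend()
assert backend.supports("Filter") and not backend.supports("SortMergeJoin")
```

Under this kind of contract, the plan-conversion layer can query `supports` to decide pushdown versus fallback, without knowing which engine sits behind the interface.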

2.5 Other optimizations

In addition to mining the performance benefits of vectorized execution with native code, Photon's gains also come from other optimizations (mainly in the query optimizer). Many of these optimizations are not open source, and the Gluten project is continuously absorbing open source counterparts of this class of optimizations.

3. Status & Roadmap

At present, the Gluten community has completed verification of the Velox Backend and the ClickHouse Backend on the TPC-H dataset. The performance of the two backends on the TPC-H 1000 dataset is shown in the figure below. Both backends achieve significant performance improvements: across the TPC-H queries, gains generally exceed 2x through straightforward integration, without deep customization of the backends, which is very encouraging.


Next, the work of the Gluten community will revolve around the following:

  • Complete validation and performance testing on the TPC-DS dataset
  • Improve support for data types and functions
  • Improve data source connectivity and data source format support
  • Improve the CI/CD process and test coverage
  • Explore integration with Remote Shuffle Services
  • Explore other hardware acceleration work

Gluten community project address: https://github.com/oap-project/gluten

To learn more about Gluten, watch the Data & AI Meetup on July 21st.

About Kyligence

Shanghai Kyligence Information Technology Co., Ltd. (Kyligence) was founded in 2016 by the founding team of Apache Kylin. It is committed to building the next generation of enterprise-level intelligent multidimensional databases and simplifying multidimensional data analysis (OLAP) on data lakes for enterprises. Through an AI-enhanced high-performance analysis engine, a unified SQL service interface, a business semantic layer, and other capabilities, Kyligence provides cost-optimized multidimensional data analysis that supports enterprise business intelligence (BI) analysis, ad-hoc query, Internet-scale data services, and other application scenarios, helping enterprises build a more reliable metrics system and unlock the potential of self-service business analysis.

Kyligence has served many customers in banking, securities, insurance, manufacturing, retail, and other industries across China, the United States, Europe, and Asia-Pacific, including China Construction Bank, Shanghai Pudong Development Bank, China Merchants Bank, Ping An Bank, Bank of Ningbo, Pacific Insurance, China UnionPay, SAIC, Costa, UBS, MetLife, and other well-known companies, and maintains global partnerships with technology leaders such as Microsoft, Amazon, Huawei, and Tableau. The company currently has branches or offices in Shanghai, Beijing, Shenzhen, Xiamen, and Wuhan, as well as Silicon Valley, New York, and Seattle in the United States.

Original link: https://mp.weixin.qq.com/s/0045Rc1mQPIYNhY1mDwsxw
