Overhauling Apache Kylin for the cloud

Recently, the Apache Kylin community released a major update with the general availability of Kylin 4. Kylin 4 continues the mission to provide a unified, high-performance, cloud-friendly, open source OLAP (online analytical processing) platform. Kylin 4 upgrades the Kylin architecture to make it easy to deploy and scale in the cloud. The new release features three major platform updates and myriad other improvements.

First, Kylin 4 replaces its previous HBase storage engine with Apache Parquet, making it possible to decouple compute and storage so that each can scale independently. Second, Kylin 4 unifies the compute engine and removes any previous dependencies on the Hadoop ecosystem. This makes resource allocation much more flexible, resulting in a significant reduction in total cloud resource usage and associated costs. Third, by introducing a brand new, fully distributed query engine, Kylin 4 improves cubing duration and query latency compared to previous releases.

In this article, we’ll dive into the details of these innovations and the new capabilities they enable.

What is Apache Kylin?

Apache Kylin is an open source distributed analytics engine that provides SQL query interfaces on top of Hadoop and Spark, along with OLAP capabilities to support extremely large data sets. It was originally developed at eBay and contributed to the Apache Software Foundation. Kylin can query massive relational tables with sub-second response times.

Kylin’s core idea is the precomputation of result sets, meaning it calculates all possible query results in advance according to the specified dimensions and measures. Kylin basically trades space for time to speed up OLAP queries with fixed query patterns.

Apache Kylin lets you query billions of rows at sub-second latency in three steps:

  1. Identify a star or snowflake schema on Hadoop/Spark.
  2. Build a cube from the identified tables.
  3. Query using ANSI SQL and get results via ODBC, JDBC, or RESTful API.
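
The third step can be sketched in Python. This is a minimal sketch that only builds the HTTP request and does not send it. It assumes the query endpoint described in Kylin's REST API documentation (POST to /kylin/api/query with HTTP basic authentication); the host, project, credentials, and table name are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password):
    """Build (but do not send) a POST request for Kylin's REST query endpoint."""
    # Assumption: endpoint path and JSON field names follow Kylin's REST API docs.
    payload = json.dumps({"sql": sql, "project": project}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/kylin/api/query",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Hypothetical host, project, credentials, and table.
request = build_query_request(
    "http://kylin.example.com:7070",
    "demo_project",
    "SELECT region, SUM(price) FROM sales GROUP BY region",
    "analyst",
    "secret",
)
# To actually run the query: urllib.request.urlopen(request).read()
```

Keeping request construction separate from sending makes the sketch easy to check without a running Kylin instance.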

How Kylin works

Concepts

Kylin’s concepts of a cube and a cuboid can be understood from the following figure:

[Figure 01. Image: Kyligence]

Each combination of dimensions is called a cuboid, and the set of all cuboids is a cube. The cuboid composed of all dimensions is called the base cuboid. All cuboids can be calculated from the base cuboid. A cuboid can be understood as a wide table after precomputation. At query time, Kylin automatically selects the most suitable cuboid that meets the query requirements.
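
The relationship between the base cuboid and the other cuboids can be illustrated with a toy Python sketch. It is not Kylin's build algorithm; the rows, dimension names, and the single SUM measure are hypothetical, and every cuboid is simply rolled up from the base cuboid in memory.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (year, region, product, sales).
ROWS = [
    (2020, "EU", "book", 10),
    (2020, "EU", "pen", 5),
    (2020, "US", "book", 7),
    (2021, "EU", "book", 3),
]
DIMENSIONS = ("year", "region", "product")


def aggregate(records, keep):
    """Sum the measure of (key, measure) records, keeping only the key positions in `keep`."""
    out = defaultdict(int)
    for key, measure in records:
        out[tuple(key[i] for i in keep)] += measure
    return dict(out)


def build_cube(rows, dimensions):
    """Return {cuboid dimensions: {dimension values: SUM(sales)}} for every subset."""
    n = len(dimensions)
    # The base cuboid groups by all dimensions.
    base = aggregate(((row[:n], row[n]) for row in rows), range(n))
    cube = {}
    for size in range(n, -1, -1):
        for keep in combinations(range(n), size):
            dims = tuple(dimensions[i] for i in keep)
            # Every other cuboid is rolled up from the base cuboid, not from raw rows.
            cube[dims] = aggregate(base.items(), keep)
    return cube


cube = build_cube(ROWS, DIMENSIONS)
print(len(cube))          # 2**3 = 8 cuboids
print(cube[("region",)])  # {('EU',): 18, ('US',): 7}
```

With three dimensions the cube holds eight cuboids, which shows how the number of cuboids grows exponentially with the number of dimensions and why precomputation trades space for time.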

Basic query process

[Figure 02. Image: Kyligence]

The figure above shows a scenario without precomputation, in which everything must be calculated at query time. Agg and Join both involve a shuffle, so on large amounts of data performance is poor and more resources are consumed, which in turn limits query concurrency.

[Figure 03. Image: Kyligence]

After precomputation, the two operations that were previously the most time-consuming (Agg and Join) disappear from the rewritten execution plan, which now shows an exact cuboid match. Additionally, when defining the cube we can choose to order by column, so the Sort operation does not need to be performed at query time. The whole calculation is a single stage without the expense of a shuffle, and it can be completed with only a few tasks, thereby improving query concurrency.
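
Cuboid matching can be sketched as follows. This is a simplified illustration, not Kylin's actual selection logic: it assumes the precomputed row count is the only cost measure, and the cuboids and their sizes are hypothetical.

```python
# Hypothetical cuboids: the dimensions each one keeps and its precomputed row count.
CUBOIDS = {
    frozenset({"year", "region", "product"}): 1_000_000,  # base cuboid
    frozenset({"year", "region"}): 5_000,
    frozenset({"region", "product"}): 40_000,
    frozenset({"region"}): 50,
}


def match_cuboid(query_dims, cuboids):
    """Pick the smallest cuboid that contains every dimension the query uses."""
    needed = frozenset(query_dims)
    candidates = [
        (rows, sorted(dims)) for dims, rows in cuboids.items() if needed <= dims
    ]
    if not candidates:
        raise LookupError(f"no cuboid covers {sorted(needed)}")
    rows, dims = min(candidates)
    return dims, rows


print(match_cuboid(["region"], CUBOIDS))           # (['region'], 50)
print(match_cuboid(["year", "region"], CUBOIDS))   # (['region', 'year'], 5000)
print(match_cuboid(["product", "year"], CUBOIDS))  # falls back to the base cuboid
```

The first two queries find an exact match and read only a few rows. The third has no exact match, so it falls back to a larger cuboid and must aggregate further at query time.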

Cloud-friendly architecture

New storage engine

When Apache Kylin was born, it relied on Hadoop. In Kylin 3.x and earlier, Kylin used HBase as the storage engine to save the precomputed results generated by cube builds; supported MapReduce, Spark, and Flink as build engines; and used a query engine based on Apache Calcite.

Time in production use and continued development have gradually exposed a variety of problems with this architecture, such as the high maintenance cost of HBase and the performance limitations of the Calcite query engine, which is difficult to scale horizontally. And while HBase, as a database built on top of HDFS, has been excellent in terms of query performance, it still has the following disadvantages:

  • HBase is not real columnar storage.
  • HBase has no secondary index; Rowkey is the only index.
  • HBase has no encoding; Kylin has to do the encoding by itself.
  • HBase is not well suited to cloud deployment and auto-scaling.
  • HBase has different API versions and compatibility issues between them (e.g., 0.98, 1.0, 1.1, 2.0).
  • HBase has different vendor releases and compatibility issues between them (e.g., Cloudera’s is not compatible with others).

To address these problems, the Apache Kylin community proposed replacing HBase with Apache Parquet and Apache Spark, for the following reasons:

  • Parquet is a mature and stable open source columnar storage format.
  • Parquet is more cloud-friendly, able to work with most cloud file systems (HDFS, Amazon S3, Azure Blob Storage, Alibaba Cloud Object Storage Service, etc.).
  • Parquet can be tightly integrated with Hadoop, Hive, Spark, Impala, etc.
  • Parquet supports custom indexes.
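
To illustrate why a columnar format with built-in encoding suits OLAP workloads, here is a toy Python sketch. It does not reproduce Parquet's file format or API; the rows and column names are hypothetical, and dictionary encoding stands in for the kind of encoding a columnar format can apply per column.

```python
# Hypothetical rows of a precomputed cuboid: (region, product, total_sales).
ROWS = [
    ("EU", "book", 13),
    ("EU", "pen", 5),
    ("US", "book", 7),
    ("US", "pen", 2),
]


def to_columns(rows, names):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def dictionary_encode(values):
    """Replace repeated values with small integer ids plus a lookup table."""
    lookup = {}
    ids = [lookup.setdefault(value, len(lookup)) for value in values]
    return list(lookup), ids


columns = to_columns(ROWS, ("region", "product", "total_sales"))
region_dict, region_ids = dictionary_encode(columns["region"])

# A query such as SUM(total_sales) WHERE region = 'US' reads two columns only.
us_id = region_dict.index("US")
us_total = sum(
    sales for rid, sales in zip(region_ids, columns["total_sales"]) if rid == us_id
)
print(region_dict, region_ids, us_total)  # ['EU', 'US'] [0, 0, 1, 1] 9
```

Because the product column is never touched, a columnar layout avoids reading data the query does not need, and repeated dimension values compress well once encoded.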

New Spark build engine

In Kylin 4, the Spark engine is the only build engine. Compared with the build engine in previous versions, the Spark engine has the following characteristics:

  • Kylin 4 simplifies many build steps. For example, Kylin 4 only needs two steps to build a cube: resource detection and cubing.
  • Because Parquet encodes the stored data, an encoding process for dimension dictionaries and dimension columns is no longer needed in Kylin 4.
  • Kylin 4 implements a new global dictionary. For more details, please refer to this Kylin Wiki article.
  • Kylin 4 will automatically adjust the parameters of Spark according to available cluster resources and the build job.
  • Kylin 4 improves build performance.

New distributed query engine

[Figure 04. Image: Kyligence]

Sparder, the new query engine of Kylin 4, is a distributed query engine built on a Spark back end. Compared with the original query engine, Sparder has the following advantages:

  • Distributed query engine eliminates a single point of failure.
  • Unified computation framework (Spark) for building and querying.
  • Substantial increase in performance of complex queries.
  • Can benefit from new features in Spark and the Spark ecosystem.

[Figure 05. Image: Kyligence]

Kylin 4 in the cloud

Cloud computing has many compelling features (unlimited storage capacity, easy maintenance, paying only for what you use) that are drawing more enterprises into the public cloud. We see many companies benefiting from moving their on-premises infrastructure to the cloud, achieving lower TCO (total cost of ownership), greater scalability and reliability, and stronger data protection.

On the engineering side, cloud computing also brings changes to the way enterprises design and deploy their software. Modular software architecture makes applications user-friendly and flexible to develop and use.

Kylin 3 relies on Hadoop. Before deploying a Kylin 3 instance, users must prepare a Hadoop cluster that includes heavyweight services such as HDFS and HBase. Kylin 3 users must acquire a lot of knowledge about how to maintain and optimize these Hadoop components. Because Kylin 3 has a complex architecture and suffers from reliability and scalability problems, it is not generally suitable for cloud deployment.

All of this changes with Kylin 4. Kylin 4 removes Kylin’s dependency on Hadoop components such as YARN and HBase. The “Kylin plus Spark plus object storage” architecture is less complex, making deployment in the cloud easier and more manageable. In this new architecture, Parquet replaces HBase and Spark replaces YARN and MapReduce.

[Figure 06. Image: Kyligence]

This figure shows how Kylin 4 could be deployed on a public cloud. First, the new architecture is lightweight and requires fewer components than before. Deployment is easier and faster, and most components are stateless, in contrast to stateful services such as HDFS and HBase. Statelessness means we can delete these resources when we do not need them. Second, scaling is much easier than before: you simply add resources to or remove them from your Spark cluster.

Kylin 4 performance on AWS

Preparation

In order to help readers understand the performance differences between Kylin 3 and Kylin 4, we have provided a performance benchmark report in a standard software and hardware environment. Amazon EMR was chosen as our benchmark platform.

Additionally, we chose TPC-H and SSB as our benchmark standards. The scale factor used in this test is 10 (meaning the fact table has 60 million rows).

The following table shows the aspects compared between different versions in this benchmark report.

Metrics/Aspect     Description
Cubing duration    Duration of precalculation (cube building) process (load source table into Kylin)
Cube size          Disk space occupied by cube/index
Response time      Serial query test lasting 15 minutes, taking the 95th percentile of the overall response time as the result.
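
As an illustration of the response time metric, the following Python sketch computes a 95th percentile from a list of measured latencies. The benchmark report does not say which percentile method was used, so the nearest-rank method here is an assumption, and the sample latencies are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds from a serial query run.
latencies_ms = [120, 95, 430, 150, 110, 105, 980, 130, 125, 140,
                115, 100, 135, 160, 145, 155, 170, 90, 210, 300]
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95)  # 430
```

Reporting P95 rather than the mean keeps a single slow outlier (980 ms here) from dominating the result while still reflecting tail latency.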

The following table shows information about software and hardware used in this performance benchmark.

Item                       Value
Instance type              m5.4xlarge
Node memory                64 GB
Node vCPU                  16
Node disk                  400 * 2; SSD
Network bandwidth          Up to 10 Gbps
Node count                 A master node and four worker nodes
Allocated memory on YARN   202 GB
Allocated cores on YARN    52
Kylin version              3.1.2 & 4.0.0
EMR version                5.31
Hadoop version             2.10.0
HBase version              1.4.13

Benchmark results

[Figure 07: Cubing duration of TPC-H (sf = 10). Image: Kyligence]

[Figure 08: Storage size of TPC-H (sf = 10). Image: Kyligence]

[Figure 09: Average response time of SSB query (sf = 10). Image: Kyligence]

[Figure 10: Average response time of TPC-H query (sf = 10). Image: Kyligence]

Conclusions

Cubing duration and cube size

Compared with Kylin 3’s MapReduce cube engine, Kylin 4 reduces cubing duration by 62.6%, thanks to higher resource utilization and no longer having to convert cuboids to a specific data format (HFile).

In Kylin 3, the cuboid files are stored in two different formats, whereas Kylin 4 uses only Parquet. Because Parquet has better encoding efficiency and a higher compression ratio, the disk space consumed by the same cubes was reduced by 72.56%.
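
The two percentages above can be restated as improvement factors with a little arithmetic. The Python sketch below only converts the reported reductions; the function names are illustrative.

```python
def remaining_fraction(reduction_pct):
    """Fraction of the original value left after a percentage reduction."""
    return 1 - reduction_pct / 100


def improvement_factor(reduction_pct):
    """How many times smaller the new value is than the original."""
    return 1 / remaining_fraction(reduction_pct)


cubing_factor = improvement_factor(62.6)    # cubing duration
storage_factor = improvement_factor(72.56)  # cube size on disk
print(round(cubing_factor, 2), round(storage_factor, 2))  # 2.67 3.64
```

In other words, a 62.6% shorter cubing duration means builds finish roughly 2.7 times faster, and a 72.56% smaller footprint means the same cubes take a little over a quarter of the original disk space.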

[Figure 11: Kylin 3 (MapReduce engine) has lower resource utilization. Image: Kyligence]

[Figure 12: Kylin 4 (new Spark engine) has higher and more stable resource utilization. Image: Kyligence]

Query performance

In big query scenarios (queries that scan large numbers of partitions/files and perform complex calculations on them at query time), Kylin 3 query optimization is difficult, requiring repeated tuning of HBase RegionServers and Kylin query servers. In stress test scenarios, query nodes become unstable because they need to do post-calculation on large data sets, and performance (query latency) degrades over time. Kylin 4 removes the single bottleneck of the Kylin query server, significantly improving both response time and QPS (queries per second). Further, performance is stable during the stress test. In the TPC-H query set, the response time of Kylin 4 is improved by 5x to 7x, and its concurrency is improved by 4x.

[Figure 13: P95 response time of TPC-H query under different concurrency levels. Image: Kyligence]

In point query scenarios (queries that scan small numbers of partitions/files and do little calculation at query time), Kylin 4 can meet the sub-second query latency requirement after some simple parameter adjustments, and its performance is close to that of Kylin 3 (to be specific, only slightly worse).

Cost of learning and difficulty of performance optimization (parameter adjustment)

Kylin 3 has many build steps including steps that depend on different components, such as Hive, MapReduce, and HBase. Operating Kylin 3 requires learning and understanding many architectures and technical details, and being familiar with many parameters related to these components.

Kylin 4 removes these burdens. Both cubing and querying in Kylin 4 run on the popular Spark engine, so new users only need to master Spark to learn and adjust parameters. Learning materials for Spark are easy to find, and the commonly used parameters are far fewer than in Kylin 3.

Xiaoxiang Yu is a software engineer at Kyligence and member of Apache Kylin PMC (Project Management Committee). He became an active Kylin project maintainer in 2018 and has been the release manager of several of the most recent versions. Yaqian Zhang is a software engineer at Kyligence, committer and maintainer of Apache Kylin.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to [email protected]

Copyright © 2021 IDG Communications, Inc.
