
Moving Hadoop beyond batch processing and MapReduce

Author: 大数据观察 (Big Data Observer) | Source: 大数据观察 | 2017-08-22 12:24:43

The Apache Tez framework opens the door to a new generation of high-performance, interactive, distributed data-processing applications.


Data is the new currency of the modern world. Businesses that successfully maximize its value will have a decisive impact on their own value and on their customers’ success. As the de-facto platform for big data, Apache Hadoop allows businesses to create highly scalable and cost-efficient data stores. Organizations can then run massively parallel and high-performance analytical workloads on that data, unlocking new insight previously hidden by technical or economic limitations. Hadoop offers data value at unprecedented scale and efficiency — in part thanks to Apache Tez and YARN.

Analytic applications perform data processing in purpose-driven ways that are unique to specific business problems or vendor products. There are two prerequisites to creating purpose-built applications for Hadoop data access. The first is an “operating system” (somewhat akin to Windows or Linux) that can host, manage, and execute these applications in a shared Hadoop environment. Apache YARN is that data operating system for Hadoop. The second prerequisite is an application-building framework and a common standard that developers can use to write data access applications that run on YARN.

Apache Tez meets this second need. Tez is an embeddable and extensible framework that enables easy integration with YARN and allows developers to write native YARN applications that bridge the spectrum of interactive and batch workloads. Tez leverages Hadoop’s unparalleled ability to process petabyte-scale datasets, allowing projects in the Apache Hadoop ecosystem to express fit-to-purpose data processing logic, yielding fast response times and extreme throughput. Tez brings unprecedented speed and scalability to Apache projects like Hive and Pig, as well as to a growing field of third-party software applications designed for high-speed interaction with data stored in Hadoop.

Hadoop in a post-MapReduce world

Those familiar with MapReduce will wonder how Tez is different. Tez is a broader, more powerful framework that maintains MapReduce’s strengths while overcoming some of its limitations. Tez retains the following strengths from MapReduce:

- Horizontal scalability with increasing data size and compute capacity
- Resource elasticity to work both when capacity is abundant and when it’s limited
- Fault tolerance and recovery from inevitable and common failures in distributed systems
- Secure data processing using built-in Hadoop security mechanisms

But Tez is not an engine by itself. Rather, Tez provides common primitives for building applications and engines — thus, its flexibility and customizability. Developers can write MapReduce jobs using the Tez library, and Tez comes with a built-in implementation of MapReduce, which can be used to run any existing MapReduce job with Tez efficiency.
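To make the relationship concrete, a classic MapReduce job can be read as a small dataflow of stages (map, shuffle, reduce). The sketch below models that pipeline in plain Python; the function names are illustrative and are not the Tez or Hadoop API.

```python
from collections import defaultdict

# Conceptual sketch: a MapReduce word count viewed as three chained
# stages, the shape that Tez generalizes into arbitrary dataflow graphs.
# Not the actual Tez library API.

def map_stage(lines):
    """Emit (word, 1) pairs, as a MapReduce mapper would."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as the framework's shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_stage(grouped):
    """Sum the counts per word, as a MapReduce reducer would."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(lines):
    # map -> shuffle -> reduce, a fixed two-vertex graph in Tez terms
    return reduce_stage(shuffle(map_stage(lines)))

counts = word_count(["tez on yarn", "tez on hadoop"])
```

Because the stages are plain, composable functions, it is easy to see why Tez can host a MapReduce job as just one special case of a dataflow graph.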

MapReduce was (and is) ideal for Hadoop users that simply want to start using Hadoop with minimal effort. Now that enterprise Hadoop is a viable, widely accepted platform, organizations are investing to extract the maximum value from data stored in their clusters. As a result, customized applications are replacing general-purpose engines such as MapReduce, bringing about greater resource utilization and improved performance.

The Tez design philosophy

Apache Tez is optimized for such customized data-processing applications running in Hadoop. It models data processing as a data flow graph, so projects in the Apache Hadoop ecosystem can meet requirements for human-interactive response times and extreme throughput at petabyte scale. Each node in the data flow graph represents a bit of business logic that transforms or analyzes data. The connections between nodes represent movement of data between different transformations.

Once the application logic has been defined via this graph, Tez parallelizes the logic and executes it in Hadoop. If a data-processing application can be modeled in this manner, it can likely be built with Tez. Extract-Transform-Load (ETL) jobs are a common form of Hadoop data processing, and any custom ETL application is a perfect fit for Tez. Other good matches are query-processing engines like Apache Hive, scripting languages like Apache Pig, and language-integrated, data processing APIs like Cascading for Java and Scalding for Scala.
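The node/edge model described above can be sketched in a few lines: vertices carry business logic, edges say whose output feeds whom, and the framework runs vertices in dependency order. This is a minimal, hypothetical model in Python, not the Tez Java API, using an ETL-shaped flow as the example.

```python
from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical sketch of a data flow graph: each vertex holds a piece
# of business logic; edges map a vertex to its upstream vertices.

class Vertex:
    def __init__(self, name, logic):
        self.name, self.logic = name, logic

def run_dag(vertices, edges, source_data):
    """Execute vertices in dependency order. edges: {dst: [src, ...]}."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        upstream = [results[src] for src in edges.get(name, [])]
        # A source vertex (no upstream edges) reads the raw input.
        results[name] = vertices[name].logic(upstream or [source_data])
    return results

# An Extract-Transform-Load flow: extract -> transform -> load.
vertices = {
    "extract":   Vertex("extract",   lambda ins: [r.strip() for r in ins[0]]),
    "transform": Vertex("transform", lambda ins: [r.upper() for r in ins[0]]),
    "load":      Vertex("load",      lambda ins: list(ins[0])),
}
edges = {"transform": ["extract"], "load": ["transform"]}

results = run_dag(vertices, edges, [" tez ", " yarn "])
```

Tez does vastly more (parallel tasks per vertex, data movement, fault tolerance), but the shape is the same: once an application is expressed as such a graph, the framework decides how to run it.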

When used in conjunction with other Apache projects, Tez allows for more expressive processing tasks. Apache Hive with Tez brings high-performance SQL execution to Hadoop. Apache Pig with Tez is optimized for large-scale, complex ETL in Hadoop. Cascading and Scalding can use Tez to run the most efficient translations of Java and Scala code.

Tez includes intuitive Java APIs that offer developers avenues for creating unique data-processing graphs for the most efficient representation of their applications’ data-processing flows. After a flow has been defined, Tez provides additional APIs to inject custom business logic that will run in that flow. These APIs combine Inputs (that read data), Outputs (that write data), and Processors (that process data) in a modular environment. Think of these as build-your-own Lego blocks for data analysis.
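The Input/Output/Processor pattern can be illustrated with a toy composition; the class names below are illustrative stand-ins, not the real `org.apache.tez` interfaces, but they show how the three modular pieces snap together.

```python
# Conceptual sketch of Tez's modular task pattern: an Input reads data,
# a Processor holds the business logic, an Output writes data.
# Illustrative names only, not the Tez Java interfaces.

class ListInput:
    """An Input: reads records from somewhere (here, a list)."""
    def __init__(self, records):
        self.records = records
    def read(self):
        return iter(self.records)

class ListOutput:
    """An Output: writes records to somewhere (here, a list)."""
    def __init__(self):
        self.records = []
    def write(self, record):
        self.records.append(record)

class UpperCaseProcessor:
    """A Processor: the custom business logic plugged between the two."""
    def run(self, task_input, task_output):
        for record in task_input.read():
            task_output.write(record.upper())

# Compose the three pieces like Lego blocks.
source = ListInput(["hive", "pig"])
sink = ListOutput()
UpperCaseProcessor().run(source, sink)
```

Swapping any one block, a different source, a different sink, or different logic, leaves the other two untouched, which is the modularity the article describes.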

Applications built with these APIs can run efficiently in Hadoop while the Tez framework handles the complexities of interacting with the other stack components. The result is a custom-optimized, natively integrated YARN application that’s efficient, scalable, fault-tolerant, and secure in multitenant Hadoop environments.

Applying Tez

Thus, businesses can use Tez to create purpose-built analytics applications in Hadoop. When doing so, they can draw on two types of application customizations in Tez: They can define the data flow, and they can customize the business logic.

The first step is to define the data flow that solves the problem. Multiple data flow graphs can solve the same problem, but choosing the right one has a large impact on the application’s performance. For example, Apache Hive’s performance is vastly improved by being able to define optimal joining graphs using Tez APIs.
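Why graph choice matters can be shown with a toy example: the same join-plus-filter result computed two ways, where pushing the filter below the join (as a planner building a Tez graph might) shrinks the data every downstream step must touch. The data and plan shapes here are invented for illustration.

```python
# Illustrative only: two "graphs" producing the same logical result.
# Plan A joins everything then filters; plan B filters first, so the
# join vertex processes far fewer rows.

orders = [("o%d" % i, "c%d" % (i % 100), i) for i in range(1000)]
customers = [("c%d" % i, "EU" if i < 10 else "US") for i in range(100)]

def join_then_filter():
    region_of = dict(customers)
    joined = [(o, c, amt, region_of[c]) for o, c, amt in orders]
    kept = [row for row in joined if row[3] == "EU"]
    return kept, len(joined)          # work ~ rows the join touched

def filter_then_join():
    eu_ids = {cid for cid, region in customers if region == "EU"}
    kept = [(o, c, amt, "EU") for o, c, amt in orders if c in eu_ids]
    return kept, len(kept)            # join only ever sees EU rows

result_a, work_a = join_then_filter()
result_b, work_b = filter_then_join()
```

Both plans return identical rows, but plan B's join vertex handles an order of magnitude less data, the same kind of win Hive gets from choosing better join graphs with Tez.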

Then, for the same data flow, businesses can customize the business logic using the Inputs, Outputs, and Processors that execute the task.

Note that in the same way businesses can customize their data processing applications, ISVs and other vendors can draw on Tez to showcase their unique value propositions. For example, a storage provider can swap inputs and outputs with custom implementations for its storage service. If a vendor has advanced hardware — say, with RDMA or InfiniBand — then it is easy to plug in an optimized implementation.
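The pluggability described above can be sketched as follows: processing logic written against an Input interface keeps working when a vendor swaps in an optimized implementation. The classes are hypothetical, and the "optimization" is simulated with batching rather than real RDMA transport.

```python
# Sketch of pluggable I/O: the processing logic never changes,
# only the Input implementation behind it. Hypothetical interfaces,
# not the real Tez plugin API.

class StandardInput:
    """Baseline reader: yields records one at a time."""
    def __init__(self, records):
        self.records = records
    def read(self):
        for record in self.records:
            yield record

class VendorOptimizedInput(StandardInput):
    """Stand-in for an RDMA/InfiniBand-backed reader: fetches in
    large batches (the fast transport is only simulated here)."""
    def read(self):
        batch_size = 4
        for i in range(0, len(self.records), batch_size):
            yield from self.records[i:i + batch_size]

def count_records(task_input):
    """Processing logic: identical whichever Input is plugged in."""
    return sum(1 for _ in task_input.read())

records = ["r%d" % i for i in range(10)]
n_standard = count_records(StandardInput(records))
n_vendor = count_records(VendorOptimizedInput(records))
```

Because both readers honor the same `read()` contract, the swap is invisible to the application, which is exactly the seam a storage vendor would exploit.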

The big data landscape is exploding with possibilities, with large volumes of new types of data captured, stored, and processed by Apache Hadoop. Because it reduces the cost, complexity, and risk of managing big data, Hadoop has taken its rightful place in the modern data architecture — as a mainstream component in the enterprise data warehouse.

Apache Tez makes Hadoop even more applicable, with opportunities to solve existing use cases and discover new ones with purpose-built applications. Tez unlocks the potential of big data by enabling the next generation of high-performance, interactive applications in Hadoop, without requiring the elimination of any process or application that already works well.
