
My Big Data Has Schizophrenia

Author: 大数据观察 | Source: 大数据观察 | Date: 2017-04-14 12:03


In my previous blog, we discussed the Hype vs Reality of the Internet of Things (IoT) and introduced the TaaS (Internet of Things as a Service) framework. On the industrial side of the IoT, there are some business models that lend themselves to the monetization of IoT. Recently, Kaggle partnered with a major industrial conglomerate. The objective was to run a public quest for developers and data scientists to create the best new algorithms to reduce air travel delays.

As a frequent flier, I am always amazed by how many times flights have to go into holding patterns and just circle the airports. Many factors cause these delays, including weather patterns, traffic congestion, and gate availability. One of the interesting statistics that came out of this quest was that shaving even 10 miles off an average flight can save airlines millions of dollars in fuel costs. This is a great example of where IoT and big data converge. The algorithms involved holistic analysis of flight history events, flight plans, flight tracks (actual GPS information), weather, and FAA programs. The real benefit of IoT is that the course correction of a flight can be done in real time based on the sensor data and the availability of prior patterns unearthed by big data analytics.
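A quick back-of-envelope calculation shows why a 10-mile reduction adds up; all figures below are illustrative assumptions, not data from the quest:

```python
# Back-of-envelope check of the "10 fewer miles saves millions" claim.
# Every figure here is an illustrative assumption, not a quest result.

MILES_SAVED_PER_FLIGHT = 10
GALLONS_PER_MILE = 5          # assumed burn rate for a mid-size jet
PRICE_PER_GALLON = 3.0        # assumed USD price of jet fuel
FLIGHTS_PER_YEAR = 500_000    # assumed annual flights for a large carrier

annual_savings = (MILES_SAVED_PER_FLIGHT * GALLONS_PER_MILE
                  * PRICE_PER_GALLON * FLIGHTS_PER_YEAR)
print(f"Estimated annual savings: ${annual_savings:,.0f}")
```

Even with conservative inputs, the savings land in the tens of millions of dollars per year, which is consistent with the quest's headline statistic.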

This brings up an interesting paradox – the schizophrenic nature of big data. To provide an actionable insight, analysis of real-time streaming data is a must. However, this point-in-time streaming sensor data is of little use unless we know the historical data inter-dependencies and patterns of interactions.
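The paradox can be made concrete with a minimal sketch: a single streaming reading only becomes meaningful when scored against a historical baseline. The sensor values and the 3-sigma threshold below are hypothetical:

```python
import statistics

# Historical readings supply the context; a lone streaming value has none.
history = [12.1, 11.8, 12.3, 12.0, 11.9, 12.2, 12.1, 11.7]  # hypothetical sensor values
mean = statistics.mean(history)
stdev = statistics.stdev(history)

def is_anomalous(reading: float, threshold: float = 3.0) -> bool:
    """Flag a point-in-time reading only relative to the historical pattern."""
    return abs(reading - mean) / stdev > threshold

print(is_anomalous(12.0))  # well within the historical band
print(is_anomalous(19.5))  # far outside it
```

The same value of 12.0 would be impossible to judge without `history`; that is the whole point of marrying the batch and streaming views.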

Traditionally, big data solutions such as Hadoop rely on batch processing using a variety of MapReduce architectures built on HDFS. Recently, many stream processing systems, such as Apache Spark, have been getting a lot of attention. We need a unified solution that relies on both batch and stream processing. As an IT leader, the last thing I want is an architecture where I have to maintain multiple code bases to solve a single business problem. One approach could be to build stream processing applications on top of MapReduce and Storm or similar systems.
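The "single code base" goal can be sketched as one pure scoring function that both the batch replay and the streaming path call. The function, field names, and weights below are hypothetical:

```python
# Sketch: one scoring function shared by the batch path and the streaming
# path, so a single code base serves both. Names and weights are hypothetical.

def delay_risk(event: dict) -> float:
    """Toy delay-risk score; the identical logic runs in batch and streaming."""
    score = 0.0
    if event.get("weather") == "storm":
        score += 0.5
    if event.get("congestion", 0) > 0.7:
        score += 0.3
    return score

# Batch path: replay historical events (a MapReduce-style map step).
historical = [{"weather": "storm", "congestion": 0.9},
              {"weather": "clear", "congestion": 0.2}]
batch_scores = [delay_risk(e) for e in historical]

# Streaming path: score each event as it arrives.
def on_stream_event(event: dict) -> float:
    return delay_risk(event)

print(batch_scores, on_stream_event({"weather": "storm", "congestion": 0.1}))
```

Because both paths delegate to `delay_risk`, a rule change is made once and takes effect in batch and streaming alike, which is exactly the maintenance property the paragraph asks for.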

In the IoT world, one of the most critical success factors is how we can generate context-sensitive, truly actionable alerts. It is an age-old problem: people have stopped paying attention to car alarms going off in parking lots. IoT solutions need to crunch millions and millions of sensor data elements and find actionable patterns. For example, what really constitutes a fraudulent credit card transaction? What weather patterns, along with crew skills and airport equipment, would actually cause a delay? The architecture proposed below addresses this dilemma.
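The credit card example can be sketched as a context-sensitive rule: no single signal fires an alert, only an actionable combination does. The signal definitions and thresholds are hypothetical:

```python
# Sketch of a context-sensitive alert: a lone signal is the "car alarm in
# the parking lot"; an alert needs corroborating context. Thresholds are
# hypothetical.

def fraud_alert(txn: dict) -> bool:
    signals = [
        txn["amount"] > 1000,                   # unusually large purchase
        txn["country"] != txn["home_country"],  # away from the home region
        txn["minutes_since_last"] < 5,          # rapid-fire transactions
    ]
    return sum(signals) >= 2  # require at least two co-occurring signals

normal = {"amount": 1500, "country": "US", "home_country": "US",
          "minutes_since_last": 600}
suspect = {"amount": 1500, "country": "RO", "home_country": "US",
           "minutes_since_last": 2}
print(fraud_alert(normal), fraud_alert(suspect))
```

A large purchase alone stays quiet; the same purchase abroad, minutes after the previous one, raises the alert.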

Big data decision making pattern

There is a dashboard that provides the real-time actionable alerts. The dashboard gets its calibrated feeds from a rule engine. Operating in batch mode, the rule engine constantly updates itself through data mining algorithms and machine learning. The real-time streaming data is continuously checked against the dynamic rules that the batch system generates. This ensures that alerts are raised only when real action is required.
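The feedback loop above can be sketched in a few lines: a batch step periodically re-derives the rules from history, and the stream is evaluated only against those dynamic rules. The rule shape (a simple 3-sigma upper bound) and all values are hypothetical:

```python
# Minimal sketch of the decision pattern: batch derives the rules,
# streaming applies them. Rule shape and data are hypothetical.

from statistics import mean, stdev

def batch_update_rules(history: list) -> dict:
    """Batch/ML step: (re)derive an alert threshold from historical data."""
    return {"upper": mean(history) + 3 * stdev(history)}

def stream_check(reading: float, rules: dict) -> bool:
    """Streaming step: raise an alert only when the current rules say so."""
    return reading > rules["upper"]

rules = batch_update_rules([10.0, 10.5, 9.8, 10.2, 10.1])
alerts = [r for r in [10.3, 10.4, 15.0] if stream_check(r, rules)]
print(alerts)
```

Rerunning `batch_update_rules` on fresher history recalibrates the threshold without touching the streaming code, which is what keeps the alerts actionable over time.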

A number of open source big data technologies come together to realize this architecture. One of the highlights is the use of Apache Kafka (a high-throughput distributed messaging system). Kafka allows listening on multiple sensor topics and feeds the streaming data to Apache Storm. Apache Flume plays the role of the data transport channel that feeds both the batch and streaming data repositories.
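The fan-out described above can be illustrated with an in-memory stand-in; a real deployment would use the Kafka and Flume clients against a live broker, and the topic names here are hypothetical:

```python
import queue

# In-memory stand-in for the Kafka -> Storm / Flume -> repositories flow.
# A real deployment would use Kafka and Flume; topic names are hypothetical.

SENSOR_TOPICS = ["gps", "weather", "gate"]

stream_path = queue.Queue()   # plays the role of the Storm topology input
batch_repo = []               # plays the role of the batch (HDFS) repository

def publish(topic: str, message: dict) -> None:
    """Fan each sensor message out to both the streaming and batch paths."""
    if topic not in SENSOR_TOPICS:
        raise ValueError(f"unknown topic: {topic}")
    record = {"topic": topic, **message}
    stream_path.put(record)    # real-time consumer path
    batch_repo.append(record)  # durable store for batch analysis

publish("gps", {"flight": "UA123", "lat": 40.6, "lon": -73.8})
publish("weather", {"station": "JFK", "wind_kts": 25})
print(stream_path.qsize(), len(batch_repo))
```

The key property mirrored here is that every sensor message reaches both repositories, so the batch rule-mining and the real-time evaluation always see the same data.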
