
The success of big data log analysis depends on machine learning

Author: 大数据观察 | Source: 大数据观察 | Date: 2017-01-02 15:44:44

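The enormous volume of log data generated by all kinds of devices creates great potential for insight, but extracting that insight thoroughly requires machine learning.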

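Machine-generated log data is like the dark matter of the big data universe. It is produced at every layer and every node of distributed information technology ecosystems, including smartphones and Internet-of-things endpoints. It is collected, processed, analyzed, and used widely, but most of this happens behind the scenes.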

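Log data underpins many of the least glamorous enterprise applications, such as troubleshooting, debugging, monitoring, security, antifraud, regulatory compliance, and e-discovery. However, it can also be a powerful tool for analyzing clickstream, geospatial, social media, and other logged behavioral data relevant to many customer-centric use cases.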

Machine learning lifts all boats on the big data ocean.

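Humans can barely keep up with machine-logged data, most of which was never designed for direct human analysis. Unless it is filtered with exceptional efficiency, the sheer volume, velocity, and variety of log data can quickly overwhelm human cognition. A recent Accenture article explains this succinctly: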

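As the volume and variety of log files rise, it becomes increasingly difficult to manage and parse them, trace potential issues, and find errors, particularly when cross-log correlations come into play. Even in the best case, it takes an experienced operator to follow event chains, filter out noise, and eventually diagnose the root cause of a complex problem.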

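Clearly, automation is the key to finding insights in log data, especially as that data scales into big data territory. Automation can ensure that data collection, analytical processing, and rule- and event-driven responses to what the data reveals all execute as fast as the data flows. The main enablers of log-analysis automation include machine-data integration middleware, business rules management systems, semantic analysis, stream computing platforms, and machine-learning algorithms.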

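Among these, machine learning matters most for automating and scaling the distillation of insights from log data. But machine learning is not a one-size-fits-all method for analyzing log data. Different machine-learning techniques suit different types of log data and different analytical challenges. When the correlations or other patterns being sought can be specified a priori, supervised learning is the best choice. However, supervised learning requires a human expert to prepare a training data set from the logs in order to refine the algorithm's ability to discern the most relevant patterns.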

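But when log-data patterns cannot be precisely defined in advance, unsupervised and reinforcement learning may be more appropriate. These machine-learning-driven log-analysis scenarios are the most amenable to full automation, because they can pick out and prioritize the patterns most relevant to the task at hand without requiring human-prepared training data sets.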

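Multilog correlation is a core log-data analysis use case for unsupervised and reinforcement learning. As heterogeneous log-data sets are combined and become more heterogeneous, complex, and inscrutable, the most interesting data variables and relationships cannot be clearly identified before the analysis. Consequently, hidden patterns may remain invisible if we merely look through simple queries, pre-existing reports and dashboards, and other standard analytic views. In these cases, machine learning can surface the most noteworthy patterns for further exploration, using quantitative approaches such as clustering, Markov models, and self-organizing maps.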

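Another key use of unsupervised and reinforcement learning is identifying significant patterns that either never occurred before or were never flagged as anything other than noise. The article's authors discuss a hypothetical security-log analysis application of machine learning that can "immediately spot an atypical access pattern for a user, even if that specific access pattern had never been seen before, and prevent particularly high-risk losses of private information."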

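Many of the most disruptive insights from massive log data will share these traits: complex, buried, and unprecedented. Learning from the log data itself, rather than from any a priori knowledge, is how many data scientists will spend much of their time. They will increasingly tune their machine-learning algorithms to listen for "signals" in the logs that even the most advanced human subject-matter experts had previously overlooked.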

English original:

Big data log analysis thrives on machine learning

Machine-generated log data is the dark matter of the big data cosmos. It is generated at every layer, node, and component within distributed information technology ecosystems, including smartphones and Internet-of-things endpoints. It is collected, processed, analyzed, and used everywhere, but mostly behind the scenes.

Log data is fundamental to many of the least glamorous enterprise applications, such as troubleshooting, debugging, monitoring, security, antifraud, compliance, and e-discovery. However, it can also be a powerful tool for analyzing clickstream, geospatial, social media, and other logged behavioral data relevant to many customer-centric use cases.

Mortals can barely keep up with machine-logged data. Most of it is not designed or intended for direct human analysis. Unless filtered with brutal efficiency, the extreme volumes, velocities, and varieties of log data can quickly overwhelm human cognition. The authors of this recent Accenture article explain it succinctly:

[A]s the volume and variety of log files rises, it becomes increasingly difficult for log management solutions to parse log files, trace potential issues, and actually find errors — particularly when cross-log correlations come into play. Even in the best-case scenarios, it requires an experienced operator to follow event chains, filter noise, and eventually diagnose the root cause to a complex problem.

Clearly, automation is key to finding insights within log data, especially as it all scales into big data territory. Automation can ensure that data collection, analytical processing, and rule- and event-driven responses to what the data reveals are executed as rapidly as the data flows. Key enablers for scalable log-analysis automation include machine-data integration middleware, business rules management systems, semantic analysis, stream computing platforms, and machine-learning algorithms.

Among these, machine learning is the key for automating and scaling distillation of insights from log data. But machine learning is not a one-size-fits-all approach to log-data analysis. Different machine-learning techniques are suited to different types of log data and to different analytical challenges. When the correlations and other patterns sought through machine learning can be specified a priori, supervised learning is the way to proceed. However, supervised learning requires a human expert to prepare a reference “training data” set from the log in order to refine a machine-learning algorithm’s ability to discern the most relevant patterns.
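To make the supervised case concrete, here is a minimal sketch in Python of a tiny naive Bayes classifier trained on a handful of hand-labeled log lines. The log lines, the labels "error" and "normal", and the function names are all hypothetical illustrations, not taken from the article or from any particular product; a real deployment would need far more training data and a tested library.

```python
import math
from collections import Counter, defaultdict


def tokenize(line):
    """Split a log line into lowercase whitespace-separated tokens."""
    return line.lower().split()


def train(labeled_lines):
    """Count tokens per label from (line, label) pairs prepared by a human expert."""
    token_counts = defaultdict(Counter)
    label_counts = Counter()
    for line, label in labeled_lines:
        label_counts[label] += 1
        token_counts[label].update(tokenize(line))
    return token_counts, label_counts


def classify(line, token_counts, label_counts):
    """Return the label with the highest naive Bayes log-score (add-one smoothing)."""
    vocab = {tok for counts in token_counts.values() for tok in counts}
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokenize(line):
            score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training set: the "reference training data" a human would label.
training = [
    ("disk write failed on /dev/sda1", "error"),
    ("connection timeout while contacting db", "error"),
    ("failed to allocate memory", "error"),
    ("user login succeeded", "normal"),
    ("scheduled backup completed", "normal"),
    ("health check succeeded", "normal"),
]
model = train(training)
print(classify("backup write failed", *model))
```

The point of the sketch is the dependency the paragraph describes: the classifier can only recognize the categories a person has already labeled.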

But when the log-data patterns cannot be precisely defined in advance, unsupervised and reinforcement learning may be more appropriate. Those are the machine-learning-powered, log-data-analysis scenarios most amenable to full automation, because they can pick out and prioritize the most relevant patterns to the task at hand without need of human-supplied training-data sets. (For links to further details on these machine-learning approaches, see my recent post.)

Multilog correlation is a core log-data analysis use case for unsupervised and reinforcement learning. As heterogeneous log-data sets are combined and grow more heterogeneous, complex, and inscrutable, the most interesting data variables and relationships are not at all clear in advance of the analysis. Consequently, the hidden patterns may remain invisible if we merely try to view them using simple queries, pre-existing reports and dashboards, and other standard analytic views. In these cases, machine learning can pull out the most noteworthy patterns for further exploration by using various quantitative approaches such as clustering, Markov models, self-organizing maps, and so forth.
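As a rough illustration of the clustering approach mentioned above, the following Python sketch groups lines from several hypothetical logs without any labels. It masks numbers so that lines differing only in IDs or timings look alike, then applies a greedy single-pass clustering on Jaccard similarity. The log lines, the 0.5 threshold, and the function names are assumptions made for the example; this is one simple technique among many, not the method the article's author had in mind.

```python
import re


def template_tokens(line):
    """Mask digit runs so lines differing only in IDs or counts share a template."""
    masked = re.sub(r"\d+", "<n>", line.lower())
    return frozenset(masked.split())


def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def cluster(lines, threshold=0.5):
    """Greedy single-pass clustering: join the first cluster whose leader is similar enough."""
    clusters = []  # list of (leader_tokens, member_lines)
    for line in lines:
        toks = template_tokens(line)
        for leader, members in clusters:
            if jaccard(toks, leader) >= threshold:
                members.append(line)
                break
        else:
            # No existing cluster was close enough, so this line starts a new one.
            clusters.append((toks, [line]))
    return [members for _, members in clusters]


# Hypothetical lines merged from web, database, and authentication logs.
lines = [
    "web: GET /cart 200 in 12 ms",
    "web: GET /cart 200 in 340 ms",
    "db: slow query took 900 ms",
    "db: slow query took 1200 ms",
    "auth: login failed for user 42",
]
groups = cluster(lines)
for group in groups:
    print(len(group), group[0])
```

Grouping lines into templates like this is typically only a first step; correlating the resulting groups across logs (for example by time window) would come afterward.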

Another key use of unsupervised and reinforcement learning is to identify significant patterns that either never occurred before or, if they had, had never been flagged by human analysts as anything other than “noise.” The article’s authors discuss a hypothetical security-log analysis application of machine learning that can “immediately spot an atypical access pattern for a user, even if that specific access pattern had never been seen before, and prevent particularly high-risk losses of private information.”
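One way such an atypical-access detector might work is sketched below in Python, using a first-order Markov model, one of the techniques named earlier. The model learns transition counts from a user's past sessions and scores a new session by its average "surprise" (negative log-probability per transition), so a sequence never seen before still gets a score. The action names, the session data, and the function names are hypothetical; the quoted article does not describe its implementation.

```python
import math
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Count action-to-action transitions across a user's historical sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1
    return counts


def surprise(session, counts, n_states):
    """Average negative log-probability per transition, with add-one smoothing."""
    pairs = list(zip(session, session[1:]))
    if not pairs:
        return 0.0
    total = 0.0
    for prev, nxt in pairs:
        row = counts.get(prev, Counter())
        # Smoothing gives unseen transitions a small but nonzero probability.
        p = (row[nxt] + 1) / (sum(row.values()) + n_states)
        total -= math.log(p)
    return total / len(pairs)


# Hypothetical access history for one user; no session is labeled good or bad.
history = [
    ["login", "view_profile", "logout"],
    ["login", "view_profile", "edit_profile", "logout"],
    ["login", "view_profile", "logout"],
]
states = {"login", "view_profile", "edit_profile", "export_all", "logout"}
model = fit_transitions(history)

typical = surprise(["login", "view_profile", "logout"], model, len(states))
atypical = surprise(["login", "export_all", "export_all"], model, len(states))
print(f"typical={typical:.3f} atypical={atypical:.3f}")
```

A session whose score exceeds some threshold could then be held for review; choosing that threshold is a separate tuning problem that this sketch does not address.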

Many of the most disruptive insights from massive log data will be of this nature: complex, buried, and unprecedented. Learning from the log data itself, rather than from any a priori knowledge, will be how many data scientists spend much of their time. They will increasingly tune their machine-learning algorithms to listen for “signals” in the log that even the most advanced human subject-matter experts had previously overlooked.

This article was translated from InfoWorld by 北理大数据教育, a partner of 36大数据.
