文档章节

On the use of machine learning to predict the time and resources consumed by applications

猪迪
 猪迪
发布于 2017/08/13 12:12
字数 1436
阅读 10
收藏 1

现代机器学习技术可以使用大量的特征,考虑application- and system-specific attributes,例如CPU 微架构, size and speed of memory and storage, input data characteristics and input parameters。本文扩展了一种已有的分类树算法Predicting Query Runtime (PQR预测查询运行时间),选择最好的回归算法。PQR2取得了最好的平均误差百分比(对于 predicting execution time, memory and disk consumption for two bioinformatics applications, BLAST and RAxML)

使用非线性函数处理系统和应用的特征,配置两种机器学习算法:Support Vector Machine and k-nearest neighbors。

机器学习相关的工作

预测(CPU, memory, disk and network)资源消耗是一个有监督学习问题,

假定有n个之前运行的任务作为历史数据,特征集合包含m个特征。

几个参数学习 (e.g.,linear regression, polynomial regression) and 非参数学习
(e.g., k-nn, locally weighted linear regression, decision trees)监督学习算法

参数学习方法定义了假设空间和损失函数。

训练数据用于提取模型参数以最小化损失函数(老生常谈)。

KNN:一个挑战就是找到理想的近邻数k,依赖于训练数据。

High value for k can reduce the influence of noise

可以用遗传算法确定k

Locally weighted polynomial regression (LWPR) is similar to k-nn algorithm,见文献[6]:W. Smith, ““Prediction Services for Distributed Computing,”” in Proc. 21st Int. Parallel Distributed Processing Symp., 2007.

线性回归是简单的模型,应用于工作流时间预测和LWPR,SVM,参见:R. Albers, E. Suijs, and P. H. N. de With, ““Triple-C: Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks,”” in Proc. 23rd Int. Parallel Distributed Processing Symp., 2009.

Decision table/tree (DT) algorithm 决策表决策树:给予分治策略

使用树形结构按照他们的特征分离数据

C4.5 classification tree was applied in I. Rodero, F. Guim, J. Corbalan et al., "The Grid Backfilling: a Multi-Site Scheduling Architecture with Data Mining Prediction Techniques," Grid Middleware and Services, pp.137-152, 2008.

使用模板而不是树形图的论文很多([1][4][5][6][12]),从所有任务特征中选择一个子集合。选择统计函数,如平均,平均加1.96*标准差,平均加1.5标准差,根据单个特征的线性回归,根据单个特征的逆回归和对数回归。

人工神经网络:Radial Basis Function network (RBFn) is a feedforward ANN

径向基函数网络 Typically with three layers: input, hidden and output

RBFn is very similar to k-nn, except for the fact that RBFn is a parametric method.神经网络是参数学习

SVM:kernel method for solving classification [19] and regression [20]
problems, especially for scenarios with non-linear learning pattern. 非线性场景

时间序列方法:Network Weather Service (NWS) [22] or a probabilistic approach as a Markov-chain.马尔科夫时间序列

Triple-C :Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks

 

PQR回归

Generates a binary tree that can combine a variety of classifiers

树的每个节点可以表示成二分类器(从分类算法池中依据精度选取的)

算法将数值特征离散化为二分类,m个训练特征值(ai,i=1,...,m)

找到第k大的gaps gaps_i=(a_i+1-ai)/ai,i=1,..,m-1成为潜在的分割点

The best combination of classifier and split location determines the two attribute
ranges.

Instead of outputting classes (a broad range) or a static value (e.g., range median), PQR2 selects the best regression model for the availab le data (LR and
SVM in the case of the leaves shown in the figure).

生物信息学应用-序列比对

Basic Local Alignment Search Tool (BLAST) [1] and Randomized Axelerated
Maximu m Likelihood (RAxML)

BLAST----the non-redundant (NR) protein sequence database from NCBI split into 1 frag ment (total of 3.5 GB of data)

 

树生成 by PQR and PQR2 algorithms

The nodes of the tree (circles) are common to both methods, while leaves (方形叶子) of PQR2 (加粗) yield lower errors than PQR (正常字体).

The improvement comes from the ability of selecting the best regression method
from a pool, whereas leaf range median is used in PQR.

The number between square brackets represent the range of values of the attribute to be predicted, which is followed by the number of historical data points in each node/leaf.

The percentage value indicates the accuracy of each classifier (nodes 分类) or the percentage error of each regressior (leaves 回归).

The last value indicates the name of each cassification (PQR and PQR2) or regression (PQR2) algorithm selected.

BLAST运行时间如下图,运行时间与输入序列的长度成线性关系:

需要尝试的机器学习预测算法:

问题

Question 1: Which ML algorithm offers the best accuracy?

Question 2: Which attributes should be included in the training dataset?

Question 3: Which ML algorith m provides better accuracy when dealing with training datasets with low coverage?

Regression Error Characteristic Curves

Comparing accuracy of PQR and PQR2 for predicting BLAST output and execution time (2 graphs on the left) as well as RAxML memory consumption and execution time (graphs on the right).

Question 4: Does PQR2 offer better accuracy than PQR?

总结与讨论

To better adapt to scenarios with different characteristics (linear and non-linear relationships, high and low density of training data points) by choosing different models for its nodes and leaves.

更一般的,using the largest dataset (BLAST), PQR2 required a few minutes to create the model and a few milliseconds to produce a single prediction, indicat ing practicality of PQR2 for production deployments.

PQR2 是最佳的方案 for BLAST and RAxML and should 作为备选的方案 for other applications.

2. Attributes can have high impact on the performance of the learning algorith ms. 特征对性能的影响

The use of system performance attributes showed to be relevant for execution time prediction whereas application specific attributes were pertinent for all scenarios.

This work makes the case for including as many attributes as available, while letting the algorithms analyze the relevance of the attributes when necessary. For cloud and grid computing scenarios, where resources are outsourced, the provision of this informat ion to its users (or services acting on behalf of the users) through the use of benchmarks and runtime mon itoring,especially of shared resources, can bring several benefits.

* Amazon CloudWatch is one such example limited to a virtual mach ine instance. AWS云端资源监控
改进预测能更好的利用系统资源,避免系统应用中断,量入为出的使用资源

© 著作权归作者所有

共有 人打赏支持
猪迪
粉丝 6
博文 134
码字总数 180528
作品 0
海淀
程序员
私信 提问
Technology Predictions for 2018 and Beyond

Every year about this time, we gaze into crystal balls to divine the future of our industry – or at least where it’s headed over the next 365 days. The result is often a triu......

Otto Berkes
2017/12/22
0
0
Choosing Between Baked and Fried Provisioning

Provisioning always requires resources from somewhere. The resources are packages in remote repositories, compressed files from Internet addresses, and they have all kinds of si......

Gustavo Carmo
2017/12/20
0
0
随机森林入门

Random forest is a highly versatile machine learning method with numerous applications ranging from marketing to healthcare and insurance. It can be used to model the impact of ......

AC-carrot
2016/06/03
44
0
How Do You Ask Questions of Data Using APIs?

I’m preparing to publish a bunch of transit-related data as APIs, for us across a number of applications from visualizations to conversation interfaces like bots and voice-ena......

Kin Lane
2017/12/22
0
0
Machine Learning Project Walkthrough: Making Predictions

1: Recap We spent the last 2 missions cleaning and preparing a dataset that contains data on loans made to members of Lending Club. Our eventual goal is to generate features fro......

Betty__
2016/09/29
32
0

没有更多内容

加载失败,请刷新页面

加载更多

崛起于Springboot2.X之集成工作流Activiti5.22(42)

声明:该博客主要是Springboot1.X和Springboot2.X集成Activiti5.22版本,并说一下两个版本的搭建不同的地方 技术:Springboot2.0.3+mysql+jpa(自动生成25张表)+Activiti5.22 /然后Springboo...

木九天
5分钟前
0
1
windows环境下搭建rabbitMQ开发环境

windows环境下搭建rabbitMQ开发环境 下载与安装 erlang rabbitmq 是使用erlang语言开发的,所以需要erlang环境; 下载地址 rabbitmq 下载地址 rabbitmq与erlang版本关系 下载之后直接安装即可...

晨猫
16分钟前
0
0
JVM 中的守护线程

特点 通常由JVM启动 运行在后台处理任务,比如垃圾回收等 用户启动线程执行结束或者JVM结束时,会等待所有的非守护线程执行结束,但是不会因为守护线程的存在而影响关闭。 判断线程是否为守护...

小刀爱编程
20分钟前
1
0

参考 极客时间《数据结构与算法之美》

grace_233
33分钟前
2
0
谈谈KMP算法

KMP算法的资料网上已经一大把了,主要用来解决某个文本片段是否包含另一个子串问题。这里假设文本片段的长度n大于子串长度m,如: 文本串为ABCDABGHIJK 子串为 ABCDABE 在传统的暴力解法中当...

FAT_mt
35分钟前
3
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部