
On the use of machine learning to predict the time and resources consumed by applications

猪迪
Published 2017/08/13 12:12

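Modern machine learning techniques can use a large number of features, taking into account application- and system-specific attributes such as CPU microarchitecture, size and speed of memory and storage, input data characteristics, and input parameters. The paper extends an existing classification tree algorithm, Predicting Query Runtime (PQR), into PQR2, which selects the best regression algorithm. PQR2 achieved the best mean percentage error when predicting execution time, memory and disk consumption for two bioinformatics applications, BLAST and RAxML.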

Two machine learning algorithms, Support Vector Machine and k-nearest neighbors, are configured to handle system and application features with non-linear functions.

Related work on machine learning

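Predicting resource consumption (CPU, memory, disk and network) is a supervised learning problem.

Assume n previously run tasks serve as historical data, and the feature set contains m features.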

Several parametric (e.g., linear regression, polynomial regression) and non-parametric (e.g., k-nn, locally weighted linear regression, decision trees) supervised learning algorithms can be applied.

Parametric learning methods define a hypothesis space and a loss function.

The training data is used to fit the model parameters so as to minimize the loss function (nothing new here).
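
As a minimal sketch of the parametric case, the snippet below fits a one-feature linear model by least squares, where the hypothesis space is the set of straight lines and the loss is the sum of squared errors. The data values and the names `fit_linear` and `squared_loss` are hypothetical and only serve as an illustration; they are not taken from the paper.

```python
def fit_linear(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def squared_loss(xs, ys, slope, intercept):
    """Loss function: sum of squared prediction errors."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical history: input size (arbitrary units) vs. runtime in seconds.
sizes = [1.0, 2.0, 3.0, 4.0]
runtimes = [3.1, 4.9, 7.2, 8.8]

slope, intercept = fit_linear(sizes, runtimes)
predicted_runtime = slope * 5.0 + intercept  # estimate for an unseen input size
```

Any other choice of slope and intercept gives a squared loss at least as large as the fitted pair, which is what "minimizing the loss function" means here.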

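KNN: one challenge is finding the ideal number of neighbors k, which depends on the training data.

A high value of k can reduce the influence of noise.

A genetic algorithm can be used to determine k.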
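
The dependence of k on the training data can be illustrated with a small sketch. Note that it swaps the genetic algorithm for a plain leave-one-out search over a few candidate values, which is simpler to show but is not the technique named above. The history values and function names are hypothetical.

```python
import math


def knn_predict(train, query, k):
    """Average the targets of the k training points nearest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def loo_error(train, k):
    """Mean absolute error when each point is predicted from all the others."""
    total = 0.0
    for i, (features, target) in enumerate(train):
        rest = train[:i] + train[i + 1:]
        total += abs(knn_predict(rest, features, k) - target)
    return total / len(train)


def choose_k(train, candidates):
    """Pick the candidate k with the lowest leave-one-out error."""
    return min(candidates, key=lambda k: loo_error(train, k))


# Hypothetical history: (input size, CPU GHz) -> runtime in seconds.
history = [
    ((1.0, 2.0), 10.0),
    ((2.0, 2.0), 19.0),
    ((3.0, 2.0), 31.0),
    ((4.0, 2.0), 40.0),
    ((5.0, 2.0), 52.0),
]

best_k = choose_k(history, candidates=[1, 2, 3])
estimate = knn_predict(history, (3.5, 2.0), best_k)
```

A different history can lead to a different `best_k`, which is why k has to be tuned per dataset.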

Locally weighted polynomial regression (LWPR) is similar to the k-nn algorithm; see reference [6]: W. Smith, "Prediction Services for Distributed Computing," in Proc. 21st Int. Parallel Distributed Processing Symp., 2007.
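
The similarity to k-nn is that nearby training points dominate the prediction. A rough sketch of the degree-1 case (locally weighted linear regression) follows. The Gaussian kernel, the `bandwidth` parameter and the function name are assumptions made for this example, not details taken from reference [6].

```python
import math


def lwr_predict(xs, ys, query, bandwidth):
    """Fit a weighted straight line around `query` and evaluate it there."""
    # Gaussian kernel: points close to the query get weights near 1.
    weights = [math.exp(-((x - query) ** 2) / (2 * bandwidth ** 2)) for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (query - mean_x)


# Hypothetical non-linear history: y = x ** 2.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 2 for x in xs]

local_estimate = lwr_predict(xs, ys, query=3.0, bandwidth=0.5)
global_estimate = lwr_predict(xs, ys, query=3.0, bandwidth=100.0)
```

With a small bandwidth the fit follows the local shape of the data; with a very large bandwidth all weights are nearly equal and the method degrades to ordinary linear regression.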

Linear regression is a simple model that has been applied to workflow execution time prediction, along with LWPR and SVM; see: R. Albers, E. Suijs, and P. H. N. de With, "Triple-C: Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks," in Proc. 23rd Int. Parallel Distributed Processing Symp., 2009.

Decision table/tree (DT) algorithm: based on a divide-and-conquer strategy.

It uses a tree structure to separate the data according to their features.

The C4.5 classification tree was applied in I. Rodero, F. Guim, J. Corbalan et al., "The Grid Backfilling: a Multi-Site Scheduling Architecture with Data Mining Prediction Techniques," Grid Middleware and Services, pp. 137-152, 2008.

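Many papers use templates rather than trees ([1][4][5][6][12]), selecting a subset of all task features. A statistical function is then chosen, such as the mean, the mean plus 1.96 standard deviations, the mean plus 1.5 standard deviations, linear regression on a single feature, or inverse and logarithmic regression on a single feature.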
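
A template-based predictor can be sketched as follows: jobs are grouped by the feature subset that forms the template, and the estimate is the group mean plus an optional multiple of the standard deviation. The feature names (`user`, `app`), the runtimes and the function names are hypothetical, and `statistics.stdev` is the sample standard deviation, which may differ from what the cited papers use.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_by_template(history, template):
    """Group past runtimes by the values of the template's features."""
    groups = defaultdict(list)
    for job in history:
        groups[tuple(job[name] for name in template)].append(job["runtime"])
    return groups


def template_estimate(groups, job, template, margin=0.0):
    """Return mean + margin * stdev of matching jobs, or None if too few."""
    values = groups.get(tuple(job[name] for name in template), [])
    if len(values) < 2:
        return None
    return mean(values) + margin * stdev(values)


history = [
    {"user": "a", "app": "blast", "runtime": 10.0},
    {"user": "a", "app": "blast", "runtime": 12.0},
    {"user": "a", "app": "blast", "runtime": 14.0},
    {"user": "b", "app": "raxml", "runtime": 90.0},
]
template = ("user", "app")
groups = group_by_template(history, template)
new_job = {"user": "a", "app": "blast"}

plain_mean = template_estimate(groups, new_job, template)
upper_196 = template_estimate(groups, new_job, template, margin=1.96)
upper_150 = template_estimate(groups, new_job, template, margin=1.5)
```

The margins 1.96 and 1.5 turn the mean into a conservative upper estimate, which is useful when underestimating a runtime is more costly than overestimating it.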

Artificial neural networks: a Radial Basis Function network (RBFn) is a feedforward ANN, typically with three layers: input, hidden and output.

RBFn is very similar to k-nn, except that RBFn is a parametric method, as neural networks are in general.

SVM: a kernel method for solving classification [19] and regression [20] problems, especially in scenarios with non-linear learning patterns.

Time series methods: Network Weather Service (NWS) [22], or a probabilistic approach such as a Markov chain.

Triple-C: Resource-usage prediction for semi-automatic parallelization of groups of dynamic image-processing tasks

 

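PQR regression

PQR generates a binary tree that can combine a variety of classifiers.

Each node of the tree can be represented as a binary classifier, chosen from a pool of classification algorithms according to accuracy.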

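The algorithm discretizes a numeric feature into two classes, given m training feature values sorted in increasing order (a_i, i = 1, ..., m).

The k largest relative gaps, gap_i = (a_{i+1} - a_i) / a_i, i = 1, ..., m-1, become the potential split points.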
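
The gap computation can be sketched in a few lines. Two details are assumptions made for this example rather than facts from the paper: the values are taken to be strictly positive (so the division is defined), and each split is placed at the midpoint of the gap.

```python
def candidate_splits(values, k):
    """Return split points for the k largest relative gaps, largest first."""
    a = sorted(values)
    # gap_i = (a_{i+1} - a_i) / a_i, stored together with its position i.
    gaps = [((a[i + 1] - a[i]) / a[i], i) for i in range(len(a) - 1)]
    top = sorted(gaps, reverse=True)[:k]
    # Assumption: place each split halfway between the two neighbours.
    return [(a[i] + a[i + 1]) / 2 for _, i in top]


# Hypothetical runtimes in seconds.
runtimes = [1.0, 2.0, 10.0, 11.0, 100.0]
splits = candidate_splits(runtimes, k=2)
```

Because the gaps are relative, a jump from 2 to 10 counts for more than a jump from 10 to 11, even though both sit far below the largest value.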

The best combination of classifier and split location determines the two attribute ranges.

Instead of outputting classes (a broad range) or a static value (e.g., the range median), PQR2 selects the best regression model for the available data (LR and SVM in the case of the leaves shown in the figure).
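
The idea of picking a regression model per leaf can be sketched as below. This is only an illustration: the pool here holds a median predictor (the static value mentioned above), a one-feature linear regression and a 1-nearest-neighbor model as simple stand-ins, and leave-one-out percentage error is an assumed selection criterion. The paper's actual pool (which includes SVM) and selection procedure may differ. Targets are assumed to be non-zero.

```python
from statistics import median


def fit_median(xs, ys):
    m = median(ys)
    return lambda x: m


def fit_linear(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)


def fit_nearest(xs, ys):
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def loo_percentage_error(fit, xs, ys):
    """Mean percentage error when each point is predicted from the others."""
    errors = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append(abs(model(xs[i]) - ys[i]) / abs(ys[i]))
    return 100.0 * sum(errors) / len(errors)


def select_leaf_model(pool, xs, ys):
    """Return (name, fitted model) of the pool entry with the lowest error."""
    name = min(pool, key=lambda n: loo_percentage_error(pool[n], xs, ys))
    return name, pool[name](xs, ys)


pool = {"median": fit_median, "linear": fit_linear, "nearest": fit_nearest}

# Hypothetical leaf data with a linear relationship.
leaf_x = [1.0, 2.0, 3.0, 4.0, 5.0]
leaf_y = [3.0, 5.0, 7.0, 9.0, 11.0]
chosen_name, chosen_model = select_leaf_model(pool, leaf_x, leaf_y)
```

On data like this the median predictor is clearly worse than the linear model, which mirrors the reported advantage of a per-leaf regression over a static leaf value.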

Bioinformatics applications: sequence alignment

Basic Local Alignment Search Tool (BLAST) [1] and Randomized Axelerated Maximum Likelihood (RAxML)

BLAST: the non-redundant (NR) protein sequence database from NCBI, split into 1 fragment (3.5 GB of data in total)

 

Trees generated by the PQR and PQR2 algorithms

The nodes of the tree (circles) are common to both methods, while the leaves (squares) of PQR2 (bold) yield lower errors than those of PQR (regular font).

The improvement comes from the ability to select the best regression method from a pool, whereas PQR uses the leaf range median.

The numbers between square brackets represent the range of values of the attribute to be predicted, followed by the number of historical data points in each node/leaf.

The percentage value indicates the accuracy of each classifier (nodes) or the percentage error of each regressor (leaves).

The last value indicates the name of the classification (PQR and PQR2) or regression (PQR2) algorithm selected.

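BLAST execution time, shown in the figure, grows linearly with the length of the input sequence.

Machine learning prediction algorithms to try:

Questions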

Question 1: Which ML algorithm offers the best accuracy?

Question 2: Which attributes should be included in the training dataset?

Question 3: Which ML algorithm provides better accuracy when dealing with training datasets with low coverage?

Regression Error Characteristic Curves

Comparing accuracy of PQR and PQR2 for predicting BLAST output and execution time (2 graphs on the left) as well as RAxML memory consumption and execution time (graphs on the right).
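
A Regression Error Characteristic curve plots, for each error tolerance, the fraction of predictions whose error stays within that tolerance; a curve that rises faster indicates a more accurate model. The sketch below computes the points of such a curve. The use of relative error and all data values are assumptions made for this example, not results from the paper.

```python
def rec_curve(actual, predicted, tolerances):
    """Fraction of predictions whose relative error is within each tolerance."""
    errors = [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]
    return [sum(e <= t for e in errors) / len(errors) for t in tolerances]


# Hypothetical measurements and predictions from two models.
actual = [100.0, 100.0, 100.0, 100.0]
model_a = [100.0, 110.0, 150.0, 90.0]
model_b = [100.0, 140.0, 180.0, 60.0]
tolerances = [0.05, 0.2, 0.6]

curve_a = rec_curve(actual, model_a, tolerances)
curve_b = rec_curve(actual, model_b, tolerances)
```

Comparing the two lists point by point shows which model covers more predictions at each tolerance, which is how the PQR and PQR2 curves are read.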

Question 4: Does PQR2 offer better accuracy than PQR?

Summary and discussion

PQR2 adapts better to scenarios with different characteristics (linear and non-linear relationships, high and low density of training data points) by choosing different models for its nodes and leaves.

More generally, using the largest dataset (BLAST), PQR2 required a few minutes to create the model and a few milliseconds to produce a single prediction, indicating the practicality of PQR2 for production deployments.

PQR2 is the best solution for BLAST and RAxML and should be considered as a candidate for other applications.

2. Attributes can have a high impact on the performance of the learning algorithms.

System performance attributes proved relevant for execution time prediction, whereas application-specific attributes were pertinent in all scenarios.

This work makes the case for including as many attributes as are available, while letting the algorithms analyze the relevance of the attributes when necessary. For cloud and grid computing scenarios, where resources are outsourced, providing this information to users (or to services acting on their behalf) through benchmarks and runtime monitoring, especially of shared resources, can bring several benefits.

* Amazon CloudWatch (AWS cloud resource monitoring) is one such example, limited to a virtual machine instance.
Better predictions make it possible to use system resources more effectively, avoid application interruptions, and keep resource usage within budget.
