文档章节

Hive Outline - Part I (UDF, JOIN, SELECT, Map-Red)

woodo
 woodo
发布于 2014/08/15 16:54
字数 1142
阅读 115
收藏 1

The Outline of Hive

User-Defined Functions

- UDF

UDF is one-one row function,for example, Substring,trim. A customized class extended from org.apache.hadoop.hive.ql.exec.UDF, And implement method 'evaluate' with multiple signature body. This is not declared as interface method, just because it might need multiple parameters,

this is uncertain on application context. for Example:

public class strip extends UDF {
  public Text evaluate(...){}
}


- UDAF

UDAF is many-one rows (aggregation) function, example, Count,Max. The class extends org.apache.hadoop.hive.ql.exec.UDAF. And The class need multiple nested class that extends from org.apache.hadoop.hive.ql.exec.UDADEvaluator, like:

 class maxint extends UDAF {
  public static class maxintEvaluator extends UDAFEvaluator{
   //very-first
   void init(){}
   
   // for each row in partial
   bool iterator(){}
   
   // finish a partial
   terminatePartial(){}
   
   //merge partial
   bool merge(partialresult);
   
   //final result
   doubleWritable terminate(){}
  }
  public static class maxlongEvaluator extends UDAFEvaluator{
  }
 }

 

- UDTF 

UDTF is one-many (table-generated) function, for example explode(column),
 The class extends org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
 The class implement initialize, process, close() method.
 using with lateral view,like:

 select src.id, mytable.col1, mytable.col2 from src lateral view explode_map
(properties) mytable as col1, col2;


Query Hive

Questions
1. What's LATERAL VIEW? Lateral View is using with split/explode together, which are able to explode Array/Map/Structure to multiple rows. but not able to use join, sort, order and etc. Very limited.

  • View

generate by CVAS (Create View AS Select), it's just a saved query, no cached View in hive. it can show by 'Show tables;', and using 'describe extended view' for details.

  • SubQuery, must have alias, how many levels not sure, need verify on practise.

Select a,b
 From (
   select...
 ) aliashere
 Group by a,b

  • Joins,
    Hive support Inner Join, Left/Right/Full Outer join, On syntax only support eaqual condition, other condition not supported.

    The algorithm will use Reduce Only, which will try to cache left table (FROM table) , and transfer right table (JOIN table). So better set FROM table smaller, this will get best in practise.


       'Explain/Explain Extended' can use rule-based query optimizer, and tell more details about       the query. cost-based optimizer might be adedd in futrue.

    -- Semi Join, which is to repolace IN syntax, example, the right table only be referenced in On condition, and it can't be referenced in SELECT and other key words, for example.
 

Select * from TA left SEMI Join TB on (TA.id = TB.id);

    -- Map-side Join

    If the table is a small data set, then Hive able to optimize the table by In-Memory in mapper, to enhance the speed, that's so called Map size, since it's different with Reduce side jon. Using C style comment, this looks wierd, I guess some unkown reason the developer using this syntax, such as this is a hint, but not a rule. because some condition hive need to justice if it's ready to use map-side join.

SELECT /*+ MAPJOIN(tb2) */ tb1.field1,tb2.filed1
FROM student tb1
JOIN teacher tb2
ON(tb1.teacherId = tb2.id)

In above sample, Table teacher is smaller data set than student.


Without MAPJOIN, hive might use auto convert to optimize common JOIN, some Configurations:

hive.auto.convert.join = true; this enable Auto convert in Hive, if the table is taken as small table, the value is as below setting shows.

hive.smalltable.filesize=10M; small table size standard.

hive.hashtable.max.memory.usage =50; if Map task need memory exceed this percentage, the map task will terminate.

Normally, LEFT OUTER JOIN tb1, and RIGHT OUTER JOIN tb1, try to chck if tb1 is ready to be hashtable.

However, FULL OUTER JOIN never try this, it will use reduce only algorithm. see Joins for details.

- Index

Hive actually create a table, ordered by index fields, and remember the offset to acclerate speed.

in my case the Index Table named by Database__Table_Index__.

CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX table02_index ON table2 REBUILD;
SHOW FORMATTED INDEX ON table02;
DROP INDEX table02_index ON table02;

Note: In previous version of Hive, I still see bucket which is a hash map to enhance map-side join, but now it looks deprecated because Index is a more friendly feature to cover.


- Order By, order by is bad performance without 'limt N' clause.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. 

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value as double) AS value, cast(count as int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever


Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.


SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
 
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC


Below table run Map-Reduce job and have DISTRIBUTE BY and SORT BY in different fields, if it's same, you may only specify 'CLUSTER BY'.

FROM (
  FROM pv_users
  MAP ( pv_users.userid, pv_users.date )
  USING 'map_script.py'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
  USING 'reduce_script.py'
  AS date, count;





© 著作权归作者所有

woodo
粉丝 5
博文 57
码字总数 32118
作品 0
朝阳
高级程序员
私信 提问
Hive 在多维统计分析中的应用 & 技巧总结

本文原地址:https://my.oschina.net/leejun2005/blog/121945 多维统计一般分两种,我们看看 Hive 中如何解决: 1、同属性的多维组合统计 (1)问题: 有如下数据,字段内容分别为:url, ca...

SimplePoint
2017/04/18
0
0
从 MapReduce 到 Hive —— 一次迁移过程小记

1、背景介绍 早先的工作中,有很多比较复杂的分析工作,当时对hive还不熟悉,但是java比较熟悉,所以在进行处理的时候,优先选择了MR. 但是随着工作的数据内容越来越多,越来越复杂,对应的调整也越...

大数据之路
2014/01/10
4.6K
0
Mapreduce与Hive比较

1、背景介绍 早先的工作中,有很多比较复杂的分析工作,当时对hive还不熟悉,但是java比较熟悉,所以在进行处理的时候,优先选择了MR. 但是随着工作的数据内容越来越多,越来越复杂,对应的调整也越...

恶魔苏醒ing
2017/03/16
0
0
Hive基本操作

hive的基本操作 1. Hive基本操作 1.1 DDL操作 1.1.1 创建表 建表语法 CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name [(col_name data_type [COMMENT col_comment], ...)] [COMMENT ta......

yushiwh
2017/05/12
0
0
hive之自定义函数

create table tabarray(a array<int>,b array<string>) row format delimited fields terminated by 't' colloction items terminated by ','; abc,helloworl,itcats 213123214,432312321 se......

泡海椒
2016/01/17
63
0

没有更多内容

加载失败,请刷新页面

加载更多

ZhaoWei-2020-01-18

Redis官方文档 简介 Redis是开源的(BSD许可)内存数据结构存储,用作数据库,缓存和消息代理。它支持数据结构,例如 字符串,哈希,列表,集合,带范围查询的排序集合,位图,超日志,带有半...

SuSheePark
21分钟前
11
0
替换字符串C#中的换行符

如何在C#中替换字符串中的换行符? #1楼 由于新行可以用\\n , \\r和\\r\\n分隔,因此我们首先将\\r和\\r\\n替换为\\n ,然后才拆分数据字符串。 以下几行应转到parseCSV方法: function p...

javail
23分钟前
6
0
快递物流上门取件api接口对接指南(中通圆通申通韵达百世)

1.常用快递API 支持顺丰、EMS、申通、圆通、韵达、汇通、中通、天天、德邦、全峰等主流快递公司。 上门取件,是电商平台为寄件用户提供的通过一键下单到快递员,并在2小时上门取件的寄件服务...

程序的小猿
27分钟前
6
0
WebFlux系列(十一)WebClient 日志

#Java#Spring#WebClient#WebFlux#log#日志# WebClient 日志 视频讲解 : https://www.bilibili.com/video/av83627944/ WebfluxConsumerApplication.java package com.example.webfluxconsumer......

潘文海
29分钟前
6
0
使用LINQ来获取一个List <>中的项目,而不是另一个List <>中的项目

我会假设有一个简单的LINQ查询可以做到这一点,但我不确定该如何做。 给出这段代码: class Program{ static void Main(string[] args) { List<Person> peopleList1 = new...

技术盛宴
38分钟前
7
0

没有更多内容

加载失败,请刷新页面

加载更多

返回顶部
顶部