Hadoop Flume & Sqoop

manonline · published 2017/07/23 23:01

Flume

Overview

Flume collects log files from a bank of web servers and moves the log events from those files into new aggregated files in HDFS for processing. Flume is also flexible enough to write to other systems, such as HBase or Solr. Using Flume is mainly a configuration exercise in wiring different agents together.

A Flume agent is a long-lived Java process that runs sources and sinks, connected by channels. A source produces events and delivers them to a channel, which stores the events until they are forwarded to a sink. Source-channel-sink is the basic building block of Flume.

Agents on the edge systems collect data and forward it to agents that are responsible for aggregating and storing the data in its final destination.

  • Running Flume Agent
%flume-ng agent \
 --conf-file agent_config.properties \
 --name agent_name \
 --conf $FLUME_HOME/conf \
 -Dflume.root.logger=INFO,console
  • Agent Configuration
# source, channel and sink declaration
agent_name.sources=source1 source2 ...
agent_name.sinks=sink1 sink2 ...
agent_name.channels=channel1 channel2 ...

# chaining source-channel-sink
agent_name.sources.source1.channels=channel1 channel2
agent_name.sinks.sink1.channel=channel1
agent_name.sinks.sink2.channel=channel2

# config particular source
agent_name.sources.source1.type=spooldir
agent_name.sources.source1.spoolDir=path

# config particular channel
agent_name.channels.channel1.type=memory
# the file channel persists events and removes them only after they are consumed
agent_name.channels.channel2.type=file

# config particular sink
agent_name.sinks.sink1.type=logger

agent_name.sinks.sink2.type=hdfs
agent_name.sinks.sink2.hdfs.path=/tmp/flume
agent_name.sinks.sink2.hdfs.filePrefix=events
agent_name.sinks.sink2.hdfs.fileSuffix=.avro
agent_name.sinks.sink2.hdfs.fileType=DataStream
agent_name.sinks.sink2.serializer=avro_event
agent_name.sinks.sink2.serializer.compressionCodec=snappy

  • Event format: { headers: { ... } body: [ bytes ] } (a small construction sketch follows this list)
    • optional headers: a map of string key/value pairs
    • body: an opaque byte array; the logger sink prints it in both binary (hex) and string form
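
As a rough illustration of this structure, a sketch using the Flume SDK's EventBuilder (the header value and log line below are made up for the example):

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventSketch {
    public static void main(String[] args) {
        // Optional headers: a map of string key/value pairs
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "web01");

        // The body is an opaque byte array; a text log line is stored as its bytes
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println(event.getHeaders());
        System.out.println(new String(event.getBody(), StandardCharsets.UTF_8));
    }
}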

Transaction and Reliability

Flume uses separate transactions to guarantee delivery from the source to the channel, and from the channel to the sink. If the file channel is used, once an event has been written to the channel it will never be lost, even if the agent restarts. Using the memory channel, by contrast, can lose events if the agent restarts, but it gives much higher throughput.

The overall effect is that every event produced by the source will reach the sink AT LEAST ONCE; that is, duplicates are possible. The stronger EXACTLY ONCE semantics would require a two-phase commit protocol, which is expensive. Flume chooses the AT LEAST ONCE approach to gain high throughput, and duplicates can be removed by downstream processing anyway.
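
Because delivery is at least once, a downstream consumer can discard repeats, for example by keying on a unique event ID. A minimal sketch, assuming a hypothetical eventId header is attached upstream:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.flume.Event;

public class Deduplicator {
    private final Set<String> seenIds = new HashSet<>();

    // Keeps only events whose "eventId" header has not been seen before
    public void process(List<Event> batch) {
        for (Event event : batch) {
            String id = event.getHeaders().get("eventId"); // hypothetical header
            if (id == null || seenIds.add(id)) {
                handle(event);   // first delivery: process normally
            }                    // otherwise: duplicate delivery, drop it
        }
    }

    private void handle(Event event) {
        // downstream processing would go here
    }
}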

HDFS Sink

The HDFS sink writes events to files in HDFS; the sink2 configuration above shows a typical setup (path, file prefix and suffix, DataStream file type, and Avro serialization with Snappy compression).

Chaining

  • Fan out: delivering events from one source to multiple channels, so that they reach multiple sinks.
  • Agent tiers: aggregating Flume events from different agents is achieved by having tiers of Flume agents. The first tier collects events from the original sources (say, web servers) and sends them to a smaller set of second-tier agents, which aggregate events from the first-tier agents before writing to HDFS. Tiers are connected by a special SINK that sends events over the NETWORK and a corresponding SOURCE that receives them.
    • An Avro SINK sends events to an Avro SOURCE over Avro RPC (nothing to do with Avro data files).
    • A Thrift SINK sends events to a Thrift SOURCE over Thrift RPC.
# 1st Tier Avro SINK : sending events
agent_name.sinks.sink1.type=avro
agent_name.sinks.sink1.hostname=ip_address
agent_name.sinks.sink1.port=10000

# 2nd Tier Avro SOURCE : receiving events
agent_name.sources.source1.type=avro
agent_name.sources.source1.bind=ip_address
agent_name.sources.source1.port=10000
  • Sink group: allows multiple sinks to be treated as one, for failover or load-balancing purposes.
# declare a group
agent_name.sinkgroups=sinkgroup1

# configure particular group
agent_name.sinkgroups.sinkgroup1.sinks=sink1 sink2
agent_name.sinkgroups.sinkgroup1.processor.type=load_balance
agent_name.sinkgroups.sinkgroup1.processor.backoff=true

Application Integration

An Avro source is an RPC endpoint that accepts Flume events, making it possible to write an RPC client to send events to the endpoint.

  • The Flume SDK is a module that provides a Java RpcClient class for sending Event objects to an Avro endpoint (a usage sketch follows this list).
  • The Flume embedded agent is a cut-down Flume agent that runs inside a Java application.
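
A minimal sketch of using the Flume SDK, assuming a Flume agent with an Avro source is listening on the given host and port (the host name and port below are placeholders):

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeSdkExample {
    public static void main(String[] args) throws EventDeliveryException {
        // Connect to the agent's Avro source (placeholder host/port)
        RpcClient client = RpcClientFactory.getDefaultInstance("ip_address", 10000);
        try {
            // Build an event with a string body and no headers
            Event event = EventBuilder.withBody(
                    "hello flume".getBytes(StandardCharsets.UTF_8));
            // Deliver it over Avro RPC; throws EventDeliveryException on failure
            client.append(event);
        } finally {
            client.close();
        }
    }
}

The embedded agent is used in a similar way, except that the channel and an Avro sink run inside the application process rather than in a standalone client.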

Sqoop

Connectors

Sqoop ships with built-in connectors for MySQL, PostgreSQL, Oracle, DB2, SQL Server, and Netezza. There is also a generic JDBC connector for connecting to any database that supports the JDBC protocol.

Various third-party connectors are also available for other data stores, ranging from enterprise data warehouses (such as Teradata) to NoSQL stores (such as Couchbase).

Import Commands

  • By default, the imported files are comma-delimited text files.
  • The file format, delimiter, compression, and other parameters can be configured as well:
    • Sequence files
    • Avro files
    • Parquet files
# -------------------------
# Sqoop import
#   --split-by / -m control the MapReduce tasks (default is 4)
#   --incremental, --check-column and --last-value enable incremental imports
%sqoop import \
 --connect jdbc:mysql://host/database \
 --table tablename \
 --split-by column_name \
 -m numberOfMapReduceTasks \
 --incremental append \
 --check-column columnname \
 --last-value lastValue


# ------------------------
# To view the imported files
%hadoop fs -cat tablename/part-m-00000

Process

  • Sqoop examines the table to be imported and retrieves a list of all its columns and their SQL types.
  • Sqoop's code generator uses this information to generate a table-specific class, which holds a record extracted from the table during MapReduce processing (a simplified sketch follows this list).
    • JDBC executes the query and returns a ResultSet.
    • DBInputFormat populates the table-specific class with the data from the ResultSet through the class's
      • readFields() and
      • write() methods.
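
A rough, hand-simplified sketch of what such a generated class looks like, for a hypothetical table with an id and a name column (the class Sqoop actually generates is much larger):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Simplified stand-in for a Sqoop-generated record class
public class TableRecord implements Writable, DBWritable {
    private int id;
    private String name;

    // Called by DBInputFormat to populate the record from a JDBC ResultSet
    public void readFields(ResultSet results) throws SQLException {
        id = results.getInt(1);
        name = results.getString(2);
    }

    // Used on export: writes the fields back into a JDBC statement
    public void write(PreparedStatement statement) throws SQLException {
        statement.setInt(1, id);
        statement.setString(2, name);
    }

    // Hadoop serialization of the record between tasks and output files
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }

    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }
}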
