
Hadoop Flume & Sqoop


Flume

Overview

A typical use of Flume is to collect log files from a bank of web servers and move the log events from those files into new, aggregated files in HDFS for processing. Flume is flexible enough to write to other systems as well, such as HBase or Solr. Using Flume is mainly a configuration exercise: you wire different agents together.

A Flume agent is a long-lived Java process that runs sources and sinks, connected by channels. A source produces events and delivers them to channels, which store the events until they are forwarded to sinks. source-channel-sink is the basic building block of Flume.

Agents on the edge systems collect data and forward it to agents that are responsible for aggregating and storing the data in its final destination.

  • Running Flume Agent
%flume-ng agent \
 --conf-file agent_config.properties \
 --name agent_name \
 --conf $FLUME_HOME/conf \
 -Dflume.root.logger=INFO,console
  • Agent Configuration
# source, channel and sink declaration
agent_name.sources=source1 source2 ...
agent_name.sinks=sink1 sink2 ...
agent_name.channels=channel1 channel2 ...

# chaining source-channel-sink
agent_name.sources.source1.channels=channel1 channel2
agent_name.sinks.sink1.channel=channel1
agent_name.sinks.sink2.channel=channel2

# config particular source
agent_name.sources.source1.type=spooldir
agent_name.sources.source1.spoolDir=path

# config particular channel
agent_name.channels.channel1.type=memory
# file persist the message and remove it only after it's consumed
agent_name.channels.channel2.type=file

# config particular sink
agent_name.sinks.sink1.type=logger

agent_name.sinks.sink2.type=hdfs
agent_name.sinks.sink2.hdfs.path=/tmp/flume
agent_name.sinks.sink2.hdfs.filePrefix=events
agent_name.sinks.sink2.hdfs.fileSuffix=.avro
agent_name.sinks.sink2.hdfs.fileType=DataStream
agent_name.sinks.sink2.serializer=avro_event
agent_name.sinks.sink2.serializer.compressionCodec=snappy

  • Event Format: { headers: {...}, body: [bytes] }
    • headers are an optional set of key-value pairs
    • the body is an opaque byte array, which may carry binary data or an encoded string

Transaction and Reliability

Flume uses separate transactions to guarantee delivery from the source to the channel, and from the channel to the sink. If the file channel is used, once an event has been written to the channel it will never be lost, even if the agent restarts. Using the memory channel instead risks losing events if the agent restarts, but it gives much higher throughput.

The overall effect is that every event produced by the source reaches the sink AT LEAST ONCE; duplicates are possible. The stronger EXACTLY ONCE semantics would require a two-phase commit protocol, which is expensive. Flume chooses the AT LEAST ONCE approach to gain high throughput, and duplicates can be removed by downstream processing anyway.
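As a rough sketch of the trade-off, the two channel types might be tuned as below; the directories and capacity figures are placeholder assumptions, not recommendations.

# durable file channel: events survive an agent restart
agent_name.channels.channel2.type=file
agent_name.channels.channel2.checkpointDir=/var/flume/checkpoint
agent_name.channels.channel2.dataDirs=/var/flume/data

# in-memory channel: higher throughput, events lost if the agent restarts
agent_name.channels.channel1.type=memory
agent_name.channels.channel1.capacity=10000
agent_name.channels.channel1.transactionCapacity=1000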

HDFS Sink
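The HDFS sink configuration shown earlier can be extended with roll policies, which control when Flume closes the current file and starts a new one, and with time escape sequences in the path for partitioning. A minimal sketch, assuming placeholder roll values:

agent_name.sinks.sink2.type=hdfs
# bucket files by day using time escape sequences
agent_name.sinks.sink2.hdfs.path=/tmp/flume/%Y-%m-%d
agent_name.sinks.sink2.hdfs.filePrefix=events
# roll a new file every 300 seconds or 128 MB, never by event count
agent_name.sinks.sink2.hdfs.rollInterval=300
agent_name.sinks.sink2.hdfs.rollSize=134217728
agent_name.sinks.sink2.hdfs.rollCount=0
# use the agent's clock for the time escapes (no timestamp header required)
agent_name.sinks.sink2.hdfs.useLocalTimeStamp=true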

Chaining

  • Fan Out: delivering events from one source to multiple channels, so they reach multiple sinks.
  • Agent Tiers: aggregating Flume events from different agents is achieved by having tiers of Flume agents. The first tier collects events from the original sources (say, web servers) and sends them to a smaller set of second-tier agents, which aggregate events from different first-tier agents before writing to HDFS. Tiers are constructed by using a special SINK that sends events over the NETWORK, and a corresponding SOURCE that receives them.
    • Avro SINK sends events to an Avro SOURCE over Avro RPC (nothing to do with Avro files).
    • Thrift SINK sends events to Thrift SOURCE over Thrift RPC.
# 1st Tier Avro SINK : sending events
agent_name.sinks.sink1.type=avro
agent_name.sinks.sink1.hostname=ip_address
agent_name.sinks.sink1.port=10000

# 2nd Tier Avro SOURCE : receiving events
agent_name.sources.source1.type=avro
agent_name.sources.source1.bind=ip_address
agent_name.sources.source1.port=10000
  • Sink Group: allows multiple sinks to be treated as one, for failover or load-balancing purposes (a failover variant is sketched after the example below).
# declare a group
agent_name.sinkgroups=sinkgroup1

# configure particular group
agent_name.sinkgroups.sinkgroup1.sinks=sink1 sink2
agent_name.sinkgroups.sinkgroup1.processor.type=load_balance
agent_name.sinkgroups.sinkgroup1.processor.backoff=true
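For failover rather than load balancing, the same group could use the failover sink processor, which sends all events to the highest-priority live sink and only falls back when it fails. A minimal sketch, with assumed priority values:

# failover variant: the highest-priority sink is used until it fails
agent_name.sinkgroups.sinkgroup1.processor.type=failover
agent_name.sinkgroups.sinkgroup1.processor.priority.sink1=10
agent_name.sinkgroups.sinkgroup1.processor.priority.sink2=5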

Application Integration

An Avro source is an RPC endpoint that accepts Flume events, making it possible to write an RPC client to send events to the endpoint.

  • Flume SDK is a module that provides a Java RpcClient class for sending Event objects to an Avro endpoint (a minimal sketch follows this list).
  • Flume Embedded Agent is a cut-down Flume agent that runs inside a Java application.
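A minimal sketch of such a client, assuming the second-tier Avro source above is listening on the placeholder host and port from the configuration:

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class AvroSourceClient {
  public static void main(String[] args) throws EventDeliveryException {
    // connect to the Avro source declared in the agent configuration
    RpcClient client = RpcClientFactory.getDefaultInstance("ip_address", 10000);
    try {
      // build a Flume event: optional headers plus a byte-array body
      Event event = EventBuilder.withBody("hello flume", StandardCharsets.UTF_8);
      client.append(event);   // delivered into the source's channel(s)
    } finally {
      client.close();
    }
  }
}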

Sqoop

Connectors

Built-in connectors support MySQL, PostgreSQL, Oracle, DB2, SQL Server and Netezza. There is also a generic JDBC connector for connecting to any database that supports the JDBC protocol.

There are also various third-party connectors available for data stores, ranging from enterprise data warehouses (such as Teradata) to NoSQL stores (such as Couchbase).

Import Commands

  • By default, the imported files are comma-delimited text files.
  • File format, delimiter, compression and other parameters can be configured as well (see the sketch after the import example below):
    • Sequence files
    • Avro files
    • Parquet files
# -------------------------
# Sqoop Import
#   --connect        JDBC connection string for the source database
#   --table          source table
#   --split-by       column used to split the import across map tasks
#   -m               number of map tasks (default 4)
#   --incremental    incremental import mode (append or lastmodified)
#   --check-column   column examined to find new rows
#   --last-value     import only rows whose check column exceeds this value
%sqoop import \
 --connect jdbc:mysql://host/database \
 --table tablename \
 --split-by column_name \
 -m numberOfMapReduceTasks \
 --incremental append \
 --check-column columnname \
 --last-value lastValue


# ------------------------
# To view the imported files
%hadoop fs -cat tablename/part-m-00000
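As a sketch of the file-format, delimiter and compression options mentioned in the list above (standard Sqoop import arguments; the delimiter and codec values here are illustrative):

# alternative file formats (pick one): --as-sequencefile, --as-avrodatafile, --as-parquetfile
# for the default text output, the field delimiter and compression can be changed:
%sqoop import \
 --connect jdbc:mysql://host/database \
 --table tablename \
 --fields-terminated-by '\t' \
 --compress \
 --compression-codec org.apache.hadoop.io.compress.SnappyCodec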

Process

  • Sqoop examines the target table and retrieves a list of all its columns and their SQL types.
  • Sqoop's code generator uses this information to generate a table-specific class, which will
    • hold a record extracted from the table during MapReduce processing;
    • be populated by DBInputFormat from the ResultSet that JDBC returns for the generated query;
    • expose readFields() and write() methods that move data between JDBC objects and the record (see the sketch after this list).
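A rough sketch of the shape of the generated class, assuming a hypothetical table with columns (id INT, name VARCHAR); the real generated code also handles nulls, getters and delimited-text output:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// hypothetical shape of a Sqoop-generated record class
public class TablenameRecord implements Writable, DBWritable {
  private Integer id;
  private String name;

  // DBInputFormat populates the record from the JDBC ResultSet
  public void readFields(ResultSet rs) throws SQLException {
    this.id = rs.getInt(1);
    this.name = rs.getString(2);
  }

  // used on export: bind the record's fields to an INSERT statement
  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setInt(1, this.id);
    stmt.setString(2, this.name);
  }

  // Hadoop serialization, used when the record moves through MapReduce
  public void readFields(DataInput in) throws IOException {
    this.id = in.readInt();
    this.name = in.readUTF();
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(this.id);
    out.writeUTF(this.name);
  }
}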
