
On Windows: Developing a Spark Program in Java with Eclipse [1]

那年的暖风
Published 09/20 15:14

Prerequisites: set up the local environment with JDK 1.8 and Hadoop 2.8.1 (matching the Hadoop environment on the server).

Step 1:

    Download the Windows builds of the files under the bin directory and use them to replace the files in the bin directory of your Hadoop installation. Download URL: https://github.com/srccodes/hadoop-common-2.2.0-bin
Also note: the downloaded native libraries are 64-bit, so they must be run on a 64-bit Windows system.

Overwrite the bin folder of your local Hadoop installation with the bin directory from that repository.
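
If Spark later complains that it cannot find winutils.exe when running locally, a common workaround is to point hadoop.home.dir at the directory whose bin folder holds winutils.exe before the SparkContext is created. This is only a sketch; C:\hadoop stands in for your actual local Hadoop directory and is not part of the original setup:

		// at the top of main(), before creating the JavaSparkContext (see Step 4)
		// assumed local Hadoop directory that contains bin\winutils.exe
		System.setProperty("hadoop.home.dir", "C:\\hadoop");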

Step 2:

    Create a new Maven project, then copy the configuration files core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml from the server's Hadoop etc/hadoop directory into the project's resources directory (they are loaded from the classpath; see the quick check below).
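
A quick way to verify that the copied files are actually on the classpath is to load a plain Hadoop Configuration and print fs.defaultFS. This is just a minimal sketch; the printed value should match whatever your core-site.xml declares (hdfs://master:9000 in the run shown later):

import org.apache.hadoop.conf.Configuration;

public class ConfigCheck {
	public static void main(String[] args) {
		// core-site.xml and hdfs-site.xml found on the classpath are loaded automatically
		Configuration conf = new Configuration();
		// prints the HDFS address declared in core-site.xml
		System.out.println(conf.get("fs.defaultFS"));
	}
}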

Step 3:

Add the following Maven dependencies to pom.xml:

		<dependency>
			<groupId>org.apache.hadoop</groupId>
			<artifactId>hadoop-client</artifactId>
			<version>2.8.1</version>
			<exclusions>
				<exclusion>
					<groupId>javax.servlet</groupId>
					<artifactId>servlet-api</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-core_2.11</artifactId>
			<version>2.3.1</version>
		</dependency>
		<dependency>
			<groupId>org.apache.spark</groupId>
			<artifactId>spark-streaming_2.11</artifactId>
			<version>2.3.1</version>
		</dependency>

Note:

    1. Local debugging starts the Jetty container embedded in the Spark jars, which requires servlet-api 3.0 or higher; Hadoop pulls in the 2.5 servlet-api, so it must be excluded (as in the exclusion above).

    2. The Hadoop and Spark versions must match the versions you actually run. For the exact compatibility mapping, search for it or check the guidance on the Spark website.

        My server runs Spark 2.2.0, and the website recommends the Scala 2.11 artifacts for it.

Step 4:

    Create the test class:


import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkTest {
	public static void main(String[] args) throws IOException {
		// run locally inside the IDE with 2 worker threads
		SparkConf sparkConf = new SparkConf().setAppName("SparkTest1").setMaster("local[2]");
		JavaSparkContext ctx = new JavaSparkContext(sparkConf);
		// the path is resolved against fs.defaultFS from core-site.xml, i.e. read from HDFS
		JavaRDD<String> lines = ctx.textFile("/README.md");
		// split each line into words
		JavaRDD<String> words =
				lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
		// count occurrences of each word
		JavaPairRDD<String, Integer> counts =
			    words.mapToPair(w -> new Tuple2<String, Integer>(w, 1))
			         .reduceByKey((x, y) -> x + y);
		// bring the results back to the driver and print them
		List<Tuple2<String, Integer>> output = counts.collect();
		for (Tuple2<?, ?> tuple : output) {
		    System.out.println(tuple._1() + " : " + tuple._2());
		}
		ctx.close();
	}
}
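
Two settings are worth calling out: setMaster("local[2]") runs the job inside Eclipse with two worker threads, so nothing has to be submitted to the cluster, and textFile("/README.md") is resolved against fs.defaultFS from the core-site.xml copied in Step 2, which is why the log below shows the input split coming from hdfs://master:9000. As a variation (not part of the original demo), the counts could be written back to HDFS instead of collected to the driver; a one-line sketch, assuming the output directory /output/wordcount does not already exist:

		// writes one part-* file per partition to HDFS instead of collecting to the driver
		counts.saveAsTextFile("/output/wordcount");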

    

Console output:

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/09/20 16:45:53 INFO SparkContext: Running Spark version 2.3.1
18/09/20 16:45:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/20 16:45:53 INFO SparkContext: Submitted application: SparkTest1
18/09/20 16:45:53 INFO SecurityManager: Changing view acls to: yaotang.zhang
18/09/20 16:45:53 INFO SecurityManager: Changing modify acls to: yaotang.zhang
18/09/20 16:45:53 INFO SecurityManager: Changing view acls groups to: 
18/09/20 16:45:53 INFO SecurityManager: Changing modify acls groups to: 
18/09/20 16:45:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yaotang.zhang); groups with view permissions: Set(); users  with modify permissions: Set(yaotang.zhang); groups with modify permissions: Set()
18/09/20 16:45:54 INFO Utils: Successfully started service 'sparkDriver' on port 55765.
18/09/20 16:45:54 INFO SparkEnv: Registering MapOutputTracker
18/09/20 16:45:54 INFO SparkEnv: Registering BlockManagerMaster
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/09/20 16:45:54 INFO DiskBlockManager: Created local directory at C:\Users\yaotang.zhang\AppData\Local\Temp\blockmgr-af3da18a-9562-4aa7-92ee-e8c46139b185
18/09/20 16:45:54 INFO MemoryStore: MemoryStore started with capacity 898.5 MB
18/09/20 16:45:54 INFO SparkEnv: Registering OutputCommitCoordinator
18/09/20 16:45:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/09/20 16:45:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://zhangyaotang:4040
18/09/20 16:45:54 INFO Executor: Starting executor ID driver on host localhost
18/09/20 16:45:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55782.
18/09/20 16:45:54 INFO NettyBlockTransferService: Server created on zhangyaotang:55782
18/09/20 16:45:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/09/20 16:45:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManagerMasterEndpoint: Registering block manager zhangyaotang:55782 with 898.5 MB RAM, BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, zhangyaotang, 55782, None)
18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 317.7 KB, free 898.2 MB)
18/09/20 16:45:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.5 KB, free 898.2 MB)
18/09/20 16:45:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on zhangyaotang:55782 (size: 27.5 KB, free: 898.5 MB)
18/09/20 16:45:55 INFO SparkContext: Created broadcast 0 from textFile at SparkTest.java:18
18/09/20 16:45:56 INFO FileInputFormat: Total input files to process : 1
18/09/20 16:45:56 INFO SparkContext: Starting job: collect at SparkTest.java:24
18/09/20 16:45:56 INFO DAGScheduler: Registering RDD 3 (mapToPair at SparkTest.java:22)
18/09/20 16:45:56 INFO DAGScheduler: Got job 0 (collect at SparkTest.java:24) with 2 output partitions
18/09/20 16:45:56 INFO DAGScheduler: Final stage: ResultStage 1 (collect at SparkTest.java:24)
18/09/20 16:45:56 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
18/09/20 16:45:56 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
18/09/20 16:45:56 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22), which has no missing parents
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.9 KB, free 898.2 MB)
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.2 KB, free 898.2 MB)
18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on zhangyaotang:55782 (size: 3.2 KB, free: 898.5 MB)
18/09/20 16:45:56 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1039
18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at mapToPair at SparkTest.java:22) (first 15 tasks are for partitions Vector(0, 1))
18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7864 bytes)
18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, ANY, 7864 bytes)
18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:0+1904
18/09/20 16:45:56 INFO HadoopRDD: Input split: hdfs://master:9000/README.md:1904+1905
18/09/20 16:45:56 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1197 bytes result sent to driver
18/09/20 16:45:56 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1197 bytes result sent to driver
18/09/20 16:45:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 509 ms on localhost (executor driver) (1/2)
18/09/20 16:45:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 528 ms on localhost (executor driver) (2/2)
18/09/20 16:45:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
18/09/20 16:45:56 INFO DAGScheduler: ShuffleMapStage 0 (mapToPair at SparkTest.java:22) finished in 0.592 s
18/09/20 16:45:56 INFO DAGScheduler: looking for newly runnable stages
18/09/20 16:45:56 INFO DAGScheduler: running: Set()
18/09/20 16:45:56 INFO DAGScheduler: waiting: Set(ResultStage 1)
18/09/20 16:45:56 INFO DAGScheduler: failed: Set()
18/09/20 16:45:56 INFO DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23), which has no missing parents
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 3.7 KB, free 898.2 MB)
18/09/20 16:45:56 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 2.1 KB, free 898.1 MB)
18/09/20 16:45:56 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on zhangyaotang:55782 (size: 2.1 KB, free: 898.5 MB)
18/09/20 16:45:56 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039
18/09/20 16:45:56 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at SparkTest.java:23) (first 15 tasks are for partitions Vector(0, 1))
18/09/20 16:45:56 INFO TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
18/09/20 16:45:56 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 2, localhost, executor driver, partition 0, ANY, 7649 bytes)
18/09/20 16:45:56 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 3, localhost, executor driver, partition 1, ANY, 7649 bytes)
18/09/20 16:45:56 INFO Executor: Running task 0.0 in stage 1.0 (TID 2)
18/09/20 16:45:56 INFO Executor: Running task 1.0 in stage 1.0 (TID 3)
18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/09/20 16:45:56 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks
18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
18/09/20 16:45:57 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 7 ms
18/09/20 16:45:57 INFO Executor: Finished task 1.0 in stage 1.0 (TID 3). 4579 bytes result sent to driver
18/09/20 16:45:57 INFO Executor: Finished task 0.0 in stage 1.0 (TID 2). 4723 bytes result sent to driver
18/09/20 16:45:57 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 3) in 53 ms on localhost (executor driver) (1/2)
18/09/20 16:45:57 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 53 ms on localhost (executor driver) (2/2)
18/09/20 16:45:57 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
18/09/20 16:45:57 INFO DAGScheduler: ResultStage 1 (collect at SparkTest.java:24) finished in 0.065 s
18/09/20 16:45:57 INFO DAGScheduler: Job 0 finished: collect at SparkTest.java:24, took 0.718616 s

package : 1
this : 1
Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version) : 1
Because : 1
Python : 2
page](http://spark.apache.org/documentation.html). : 1
cluster. : 1
its : 1
[run : 1
general : 3
have : 1
pre-built : 1
YARN, : 1
locally : 2
changed : 1
locally. : 1
sc.parallelize(1 : 1
only : 1
several : 1
This : 2
basic : 1
Configuration : 1
learning, : 1
documentation : 3
first : 1
graph : 1
Hive : 2
info : 1
["Specifying : 1
"yarn" : 1
[params]`. : 1
[project : 1
prefer : 1
SparkPi : 2
<http://spark.apache.org/> : 1
engine : 1
version : 1
file : 1
documentation, : 1
MASTER : 1
example : 3
["Parallel : 1
are : 1
params : 1
scala> : 1
DataFrames, : 1
provides : 1
refer : 2
configure : 1
Interactive : 2
R, : 1
can : 7
build : 4
when : 1
easiest : 1
Apache : 1
systems. : 1
thread : 1
how : 3
package. : 1
1000).count() : 1
Note : 1
Data. : 1
>>> : 1
Scala : 2
Alternatively, : 1
tips, : 1
variable : 1
submit : 1
Testing : 1
Streaming : 1
module, : 1
Developer : 1
thread, : 1
rich : 1
them, : 1
detailed : 2
stream : 1
GraphX : 1
distribution : 1
review : 1
Please : 4
return : 2
is : 6
Thriftserver : 1
same : 1
start : 1
built : 1
one : 3
with : 4
Spark](#building-spark). : 1
Spark"](http://spark.apache.org/docs/latest/building-spark.html). : 1
data : 1
Contributing : 1
using : 5
talk : 1
Shell : 2
class : 2
Tools"](http://spark.apache.org/developer-tools.html). : 1
README : 1
computing : 1
Python, : 2
example: : 1
## : 9
from : 1
set : 2
building : 2
N : 1
Hadoop-supported : 1
other : 1
Example : 1
analysis. : 1
runs. : 1
Building : 1
higher-level : 1
need : 1
Big : 1
fast : 1
guide, : 1
Java, : 1
<class> : 1
uses : 1
SQL : 2
will : 1
information : 1
IDE, : 1
requires : 1
get : 1
 : 71
guidance : 2
Documentation : 1
web : 1
cluster : 2
using: : 1
MLlib : 1
contributing : 1
shell: : 2
Scala, : 1
supports : 2
built, : 1
tests](http://spark.apache.org/developer-tools.html#individual-tests). : 1
./dev/run-tests : 1
build/mvn : 1
sample : 1
For : 3
Programs : 1
Spark : 16
particular : 2
The : 1
than : 1
processing. : 1
APIs : 1
computation : 1
Try : 1
[Configuration : 1
./bin/pyspark : 1
A : 1
through : 1
# : 1
library : 1
following : 2
More : 1
which : 2
also : 4
storage : 1
should : 2
To : 2
for : 12
Once : 1
["Useful : 1
setup : 1
mesos:// : 1
Maven](http://maven.apache.org/). : 1
latest : 1
processing, : 1
the : 24
your : 1
not : 1
different : 1
distributions. : 1
given. : 1
About : 1
if : 4
instructions. : 1
be : 2
do : 2
Tests : 1
no : 1
project. : 1
./bin/run-example : 2
programs, : 1
including : 4
`./bin/run-example : 1
Spark. : 1
Versions : 1
started : 1
HDFS : 1
by : 1
individual : 1
spark:// : 1
It : 2
Maven : 1
an : 4
programming : 1
-T : 1
machine : 1
run: : 1
environment : 1
clean : 1
1000: : 2
And : 1
guide](http://spark.apache.org/contributing.html) : 1
developing : 1
run : 7
./bin/spark-shell : 1
URL, : 1
"local" : 1
MASTER=spark://host:7077 : 1
on : 7
You : 4
threads. : 1
against : 1
[Apache : 1
help : 1
print : 1
tests : 2
examples : 2
at : 2
in : 6
-DskipTests : 1
3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3). : 1
development : 1
Maven, : 1
graphs : 1
downloaded : 1
versions : 1
usage : 1
builds : 1
online : 1
Guide](http://spark.apache.org/docs/latest/configuration.html) : 1
abbreviated : 1
comes : 1
directory. : 1
overview : 1
[building : 1
`examples` : 2
optimized : 1
Many : 1
Running : 1
way : 1
use : 3
Online : 1
site, : 1
running : 1
[Contribution : 1
find : 1
sc.parallelize(range(1000)).count() : 1
contains : 1
project : 1
you : 4
Pi : 1
that : 2
protocols : 1
a : 8
or : 3
high-level : 1
name : 1
Hadoop, : 2
to : 17
available : 1
(You : 1
core : 1
more : 1
see : 3
of : 5
tools : 1
"local[N]" : 1
programs : 2
option : 1
package.) : 1
["Building : 1
instance: : 1
must : 1
and : 9
command, : 2
system : 1
Hadoop : 3
18/09/20 16:45:57 INFO SparkUI: Stopped Spark web UI at http://zhangyaotang:4040
18/09/20 16:45:57 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/09/20 16:45:57 INFO MemoryStore: MemoryStore cleared
18/09/20 16:45:57 INFO BlockManager: BlockManager stopped
18/09/20 16:45:57 INFO BlockManagerMaster: BlockManagerMaster stopped
18/09/20 16:45:57 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/09/20 16:45:57 INFO SparkContext: Successfully stopped SparkContext
18/09/20 16:45:57 INFO ShutdownHookManager: Shutdown hook called
18/09/20 16:45:57 INFO ShutdownHookManager: Deleting directory C:\Users\yaotang.zhang\AppData\Local\Temp\spark-2053c47d-4005-4ee5-9335-b8b618a54a7f

 

The README.md file had been uploaded to the HDFS cluster beforehand. With that, this simple demo of driving Spark from Java is complete.
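
(For reference, such a file can be uploaded with a command along the lines of hdfs dfs -put README.md /, run from a machine with the Hadoop client configured for the cluster.)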
