
Talking about Flink's Parallel Execution


This article takes a look at Flink's parallel execution.

Examples

Operator Level

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> text = [...]
DataStream<Tuple2<String, Integer>> wordCounts = text
    .flatMap(new LineSplitter())
    .keyBy(0)
    .timeWindow(Time.seconds(5))
    .sum(1).setParallelism(5);

wordCounts.print();

env.execute("Word Count Example");
  • operators, data sources, and data sinks can all call setParallelism() to set their own parallelism (see the sketch below)
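
A minimal, hedged sketch (not from the original article, assuming Flink 1.7's DataStream API) showing parallelism set on a source, an operator, and a sink independently; NumberSource is a hypothetical source that implements ParallelSourceFunction so that a parallelism greater than 1 is allowed:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;

public class PerOperatorParallelism {

	// hypothetical source; implementing ParallelSourceFunction makes it a parallel source
	public static class NumberSource implements ParallelSourceFunction<Long> {
		private volatile boolean running = true;

		@Override
		public void run(SourceContext<Long> ctx) throws Exception {
			for (long i = 0; running && i < 100; i++) {
				ctx.collect(i);
			}
		}

		@Override
		public void cancel() {
			running = false;
		}
	}

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		DataStream<Long> doubled = env
				.addSource(new NumberSource()).setParallelism(2)   // data source level
				.map(new MapFunction<Long, Long>() {
					@Override
					public Long map(Long value) {
						return value * 2;
					}
				}).setParallelism(4);                              // operator level

		doubled.print().setParallelism(1);                         // data sink level

		env.execute("Per-operator parallelism");
	}
}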

Execution Environment Level

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(3);

DataStream<String> text = [...]
DataStream<Tuple2<String, Integer>> wordCounts = [...]
wordCounts.print();

env.execute("Word Count Example");
  • setParallelism on the ExecutionEnvironment sets a default parallelism for all operators, data sources, and data sinks; if an operator, data source, or data sink sets its own parallelism, that setting overrides the environment default (see the sketch below)
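
A minimal sketch (assuming the 1.7 DataStream API; not from the original article) of the override behavior; getParallelism is read back only to illustrate the effective values:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnvironmentDefaultOverride {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		env.setParallelism(3);                           // environment-level default

		SingleOutputStreamOperator<String> upper = env
				.fromElements("a", "b", "c")             // fromElements is a non-parallel source, fixed at 1
				.map(new MapFunction<String, String>() {
					@Override
					public String map(String s) {
						return s.toUpperCase();
					}
				});

		System.out.println(upper.getParallelism());      // 3: inherited from the environment

		upper.setParallelism(5);                         // operator-level setting overrides the default
		System.out.println(upper.getParallelism());      // 5

		upper.print();
		env.execute("Environment default override");
	}
}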

Client Level

./bin/flink run -p 10 ../examples/*WordCount-java*.jar

or

try {
    PackagedProgram program = new PackagedProgram(file, args);
    InetSocketAddress jobManagerAddress = RemoteExecutor.getInetFromHostport("localhost:6123");
    Configuration config = new Configuration();

    Client client = new Client(jobManagerAddress, config, program.getUserCodeClassLoader());

    // set the parallelism to 10 here
    client.run(program, 10, true);

} catch (ProgramInvocationException e) {
    e.printStackTrace();
}
  • With the CLI client, parallelism can be specified on the command line with -p; when invoking from Java/Scala, it can be passed as a parameter to Client.run

System Level

# The parallelism used for programs that did not specify any other parallelism.

parallelism.default: 1
  • A system-wide default parallelism for all execution environments can be configured in flink-conf.yaml via the parallelism.default option

ExecutionEnvironment

flink-java-1.7.1-sources.jar!/org/apache/flink/api/java/ExecutionEnvironment.java

@Public
public abstract class ExecutionEnvironment {
	//......

	private final ExecutionConfig config = new ExecutionConfig();

	/**
	 * Sets the parallelism for operations executed through this environment.
	 * Setting a parallelism of x here will cause all operators (such as join, map, reduce) to run with
	 * x parallel instances.
	 *
	 * <p>This method overrides the default parallelism for this environment.
	 * The {@link LocalEnvironment} uses by default a value equal to the number of hardware
	 * contexts (CPU cores / threads). When executing the program via the command line client
	 * from a JAR file, the default parallelism is the one configured for that setup.
	 *
	 * @param parallelism The parallelism
	 */
	public void setParallelism(int parallelism) {
		config.setParallelism(parallelism);
	}

	@Internal
	public Plan createProgramPlan(String jobName, boolean clearSinks) {
		if (this.sinks.isEmpty()) {
			if (wasExecuted) {
				throw new RuntimeException("No new data sinks have been defined since the " +
						"last execution. The last execution refers to the latest call to " +
						"'execute()', 'count()', 'collect()', or 'print()'.");
			} else {
				throw new RuntimeException("No data sinks have been created yet. " +
						"A program needs at least one sink that consumes data. " +
						"Examples are writing the data set or printing it.");
			}
		}

		if (jobName == null) {
			jobName = getDefaultName();
		}

		OperatorTranslation translator = new OperatorTranslation();
		Plan plan = translator.translateToPlan(this.sinks, jobName);

		if (getParallelism() > 0) {
			plan.setDefaultParallelism(getParallelism());
		}
		plan.setExecutionConfig(getConfig());

		// Check plan for GenericTypeInfo's and register the types at the serializers.
		if (!config.isAutoTypeRegistrationDisabled()) {
			plan.accept(new Visitor<org.apache.flink.api.common.operators.Operator<?>>() {

				private final Set<Class<?>> registeredTypes = new HashSet<>();
				private final Set<org.apache.flink.api.common.operators.Operator<?>> visitedOperators = new HashSet<>();

				@Override
				public boolean preVisit(org.apache.flink.api.common.operators.Operator<?> visitable) {
					if (!visitedOperators.add(visitable)) {
						return false;
					}
					OperatorInformation<?> opInfo = visitable.getOperatorInfo();
					Serializers.recursivelyRegisterType(opInfo.getOutputType(), config, registeredTypes);
					return true;
				}

				@Override
				public void postVisit(org.apache.flink.api.common.operators.Operator<?> visitable) {}
			});
		}

		try {
			registerCachedFilesWithPlan(plan);
		} catch (Exception e) {
			throw new RuntimeException("Error while registering cached files: " + e.getMessage(), e);
		}

		// clear all the sinks such that the next execution does not redo everything
		if (clearSinks) {
			this.sinks.clear();
			wasExecuted = true;
		}

		// All types are registered now. Print information.
		int registeredTypes = config.getRegisteredKryoTypes().size() +
				config.getRegisteredPojoTypes().size() +
				config.getRegisteredTypesWithKryoSerializerClasses().size() +
				config.getRegisteredTypesWithKryoSerializers().size();
		int defaultKryoSerializers = config.getDefaultKryoSerializers().size() +
				config.getDefaultKryoSerializerClasses().size();
		LOG.info("The job has {} registered types and {} default Kryo serializers", registeredTypes, defaultKryoSerializers);

		if (config.isForceKryoEnabled() && config.isForceAvroEnabled()) {
			LOG.warn("In the ExecutionConfig, both Avro and Kryo are enforced. Using Kryo serializer");
		}
		if (config.isForceKryoEnabled()) {
			LOG.info("Using KryoSerializer for serializing POJOs");
		}
		if (config.isForceAvroEnabled()) {
			LOG.info("Using AvroSerializer for serializing POJOs");
		}

		if (LOG.isDebugEnabled()) {
			LOG.debug("Registered Kryo types: {}", config.getRegisteredKryoTypes().toString());
			LOG.debug("Registered Kryo with Serializers types: {}", config.getRegisteredTypesWithKryoSerializers().entrySet().toString());
			LOG.debug("Registered Kryo with Serializer Classes types: {}", config.getRegisteredTypesWithKryoSerializerClasses().entrySet().toString());
			LOG.debug("Registered Kryo default Serializers: {}", config.getDefaultKryoSerializers().entrySet().toString());
			LOG.debug("Registered Kryo default Serializers Classes {}", config.getDefaultKryoSerializerClasses().entrySet().toString());
			LOG.debug("Registered POJO types: {}", config.getRegisteredPojoTypes().toString());

			// print information about static code analysis
			LOG.debug("Static code analysis mode: {}", config.getCodeAnalysisMode());
		}

		return plan;
	}

	//......
}
  • ExecutionEnvironment provides the setParallelism method, which assigns the parallelism to the ExecutionConfig; when createProgramPlan later builds the Plan, it reads the parallelism back from the ExecutionConfig and sets it as the Plan's defaultParallelism (see the sketch below)
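
As a small, hedged illustration of the first half of that flow (not from the original article), the value handed to setParallelism is immediately visible on the ExecutionConfig:

import org.apache.flink.api.java.ExecutionEnvironment;

public class ConfigParallelism {

	public static void main(String[] args) {
		ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
		env.setParallelism(8);

		// setParallelism delegates to the ExecutionConfig, so both reads agree
		System.out.println(env.getParallelism());             // 8
		System.out.println(env.getConfig().getParallelism()); // 8
	}
}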

LocalEnvironment

flink-java-1.7.1-sources.jar!/org/apache/flink/api/java/LocalEnvironment.java

@Public
public class LocalEnvironment extends ExecutionEnvironment {

	//......

	public JobExecutionResult execute(String jobName) throws Exception {
		if (executor == null) {
			startNewSession();
		}

		Plan p = createProgramPlan(jobName);

		// Session management is disabled, revert this commit to enable
		//p.setJobId(jobID);
		//p.setSessionTimeout(sessionTimeout);

		JobExecutionResult result = executor.executePlan(p);

		this.lastJobExecutionResult = result;
		return result;
	}

	//......
}
  • LocalEnvironment's execute delegates to LocalExecutor's executePlan

LocalExecutor

flink-clients_2.11-1.7.1-sources.jar!/org/apache/flink/client/LocalExecutor.java

public class LocalExecutor extends PlanExecutor {
	
	//......

	@Override
	public JobExecutionResult executePlan(Plan plan) throws Exception {
		if (plan == null) {
			throw new IllegalArgumentException("The plan may not be null.");
		}

		synchronized (this.lock) {

			// check if we start a session dedicated for this execution
			final boolean shutDownAtEnd;

			if (jobExecutorService == null) {
				shutDownAtEnd = true;

				// configure the number of local slots equal to the parallelism of the local plan
				if (this.taskManagerNumSlots == DEFAULT_TASK_MANAGER_NUM_SLOTS) {
					int maxParallelism = plan.getMaximumParallelism();
					if (maxParallelism > 0) {
						this.taskManagerNumSlots = maxParallelism;
					}
				}

				// start the cluster for us
				start();
			}
			else {
				// we use the existing session
				shutDownAtEnd = false;
			}

			try {
				// TODO: Set job's default parallelism to max number of slots
				final int slotsPerTaskManager = jobExecutorServiceConfiguration.getInteger(TaskManagerOptions.NUM_TASK_SLOTS, taskManagerNumSlots);
				final int numTaskManagers = jobExecutorServiceConfiguration.getInteger(ConfigConstants.LOCAL_NUMBER_TASK_MANAGER, 1);
				plan.setDefaultParallelism(slotsPerTaskManager * numTaskManagers);

				Optimizer pc = new Optimizer(new DataStatistics(), jobExecutorServiceConfiguration);
				OptimizedPlan op = pc.compile(plan);

				JobGraphGenerator jgg = new JobGraphGenerator(jobExecutorServiceConfiguration);
				JobGraph jobGraph = jgg.compileJobGraph(op, plan.getJobId());

				return jobExecutorService.executeJobBlocking(jobGraph);
			}
			finally {
				if (shutDownAtEnd) {
					stop();
				}
			}
		}
	}

	//......
}
  • LocalExecutor's executePlan also sets the plan's defaultParallelism based on slotsPerTaskManager and numTaskManagers (see the sketch below)
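
A hedged sketch (class name illustrative, not from the original article) of how those two knobs could be supplied for a local run; per the executePlan code above, the executor would set the plan's default parallelism to slotsPerTaskManager * numTaskManagers = 4 * 2 = 8:

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.ConfigConstants;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.TaskManagerOptions;

public class LocalSlotsConfig {

	public static void main(String[] args) throws Exception {
		Configuration conf = new Configuration();
		conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 4);         // slotsPerTaskManager
		conf.setInteger(ConfigConstants.LOCAL_NUMBER_TASK_MANAGER, 2); // numTaskManagers

		// a local environment backed by this configuration; the local executor
		// derives the plan's default parallelism (4 * 2 = 8) from these settings
		ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(conf);
		env.fromElements(1, 2, 3).print();
	}
}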

RemoteEnvironment

flink-java-1.7.1-sources.jar!/org/apache/flink/api/java/RemoteEnvironment.java

@Public
public class RemoteEnvironment extends ExecutionEnvironment {

	//......

	public JobExecutionResult execute(String jobName) throws Exception {
		PlanExecutor executor = getExecutor();

		Plan p = createProgramPlan(jobName);

		// Session management is disabled, revert this commit to enable
		//p.setJobId(jobID);
		//p.setSessionTimeout(sessionTimeout);

		JobExecutionResult result = executor.executePlan(p);

		this.lastJobExecutionResult = result;
		return result;
	}

	//......
}
  • RemoteEnvironment's execute delegates to RemoteExecutor's executePlan

RemoteExecutor

flink-clients_2.11-1.7.1-sources.jar!/org/apache/flink/client/RemoteExecutor.java

public class RemoteExecutor extends PlanExecutor {

	private final Object lock = new Object();

	private final List<URL> jarFiles;

	private final List<URL> globalClasspaths;

	private final Configuration clientConfiguration;

	private ClusterClient<?> client;

	//......

	@Override
	public JobExecutionResult executePlan(Plan plan) throws Exception {
		if (plan == null) {
			throw new IllegalArgumentException("The plan may not be null.");
		}

		JobWithJars p = new JobWithJars(plan, this.jarFiles, this.globalClasspaths);
		return executePlanWithJars(p);
	}

	public JobExecutionResult executePlanWithJars(JobWithJars program) throws Exception {
		if (program == null) {
			throw new IllegalArgumentException("The job may not be null.");
		}

		synchronized (this.lock) {
			// check if we start a session dedicated for this execution
			final boolean shutDownAtEnd;

			if (client == null) {
				shutDownAtEnd = true;
				// start the executor for us
				start();
			}
			else {
				// we use the existing session
				shutDownAtEnd = false;
			}

			try {
				return client.run(program, defaultParallelism).getJobExecutionResult();
			}
			finally {
				if (shutDownAtEnd) {
					stop();
				}
			}
		}
	}

	//......
}
  • RemoteExecutor's executePlan calls executePlanWithJars, which in turn calls ClusterClient's run with defaultParallelism as a parameter (see the sketch below)
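
For completeness, a hedged sketch of one way the remote path picks up a parallelism: it can be supplied when the RemoteEnvironment is created (host, port, and jar path below are placeholders):

import org.apache.flink.api.java.ExecutionEnvironment;

public class RemoteParallelism {

	public static void main(String[] args) throws Exception {
		// hypothetical host/port/jar; the third argument is the default parallelism
		ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
				"jobmanager-host", 6123, 8, "/path/to/job.jar");

		env.fromElements(1, 2, 3).print();
	}
}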

ClusterClient

flink-clients_2.11-1.7.1-sources.jar!/org/apache/flink/client/program/ClusterClient.java

public abstract class ClusterClient<T> {
	//......

	public JobSubmissionResult run(JobWithJars program, int parallelism) throws ProgramInvocationException {
		return run(program, parallelism, SavepointRestoreSettings.none());
	}

	public JobSubmissionResult run(JobWithJars jobWithJars, int parallelism, SavepointRestoreSettings savepointSettings)
			throws CompilerException, ProgramInvocationException {
		ClassLoader classLoader = jobWithJars.getUserCodeClassLoader();
		if (classLoader == null) {
			throw new IllegalArgumentException("The given JobWithJars does not provide a usercode class loader.");
		}

		OptimizedPlan optPlan = getOptimizedPlan(compiler, jobWithJars, parallelism);
		return run(optPlan, jobWithJars.getJarFiles(), jobWithJars.getClasspaths(), classLoader, savepointSettings);
	}

	private static OptimizedPlan getOptimizedPlan(Optimizer compiler, JobWithJars prog, int parallelism)
			throws CompilerException, ProgramInvocationException {
		return getOptimizedPlan(compiler, prog.getPlan(), parallelism);
	}

	public static OptimizedPlan getOptimizedPlan(Optimizer compiler, Plan p, int parallelism) throws CompilerException {
		Logger log = LoggerFactory.getLogger(ClusterClient.class);

		if (parallelism > 0 && p.getDefaultParallelism() <= 0) {
			log.debug("Changing plan default parallelism from {} to {}", p.getDefaultParallelism(), parallelism);
			p.setDefaultParallelism(parallelism);
		}
		log.debug("Set parallelism {}, plan default parallelism {}", parallelism, p.getDefaultParallelism());

		return compiler.compile(p);
	}

	//......
}
  • In ClusterClient's run, the parallelism parameter is applied to the Plan only when parallelism > 0 and p.getDefaultParallelism() <= 0 (this rule is mirrored in the sketch below)
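
The precedence rule in getOptimizedPlan can be mirrored in a few lines (illustrative only, not Flink API):

public class ParallelismPrecedence {

	/** Mirrors ClusterClient.getOptimizedPlan: the client value only fills a gap. */
	static int effectiveDefaultParallelism(int planDefault, int clientParallelism) {
		if (clientParallelism > 0 && planDefault <= 0) {
			return clientParallelism;
		}
		return planDefault;
	}

	public static void main(String[] args) {
		System.out.println(effectiveDefaultParallelism(-1, 10)); // 10: client -p fills the gap
		System.out.println(effectiveDefaultParallelism(4, 10));  // 4: the plan's own default wins
	}
}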

DataStreamSource

flink-streaming-java_2.11-1.7.1-sources.jar!/org/apache/flink/streaming/api/datastream/DataStreamSource.java

@Public
public class DataStreamSource<T> extends SingleOutputStreamOperator<T> {

	boolean isParallel;

	public DataStreamSource(StreamExecutionEnvironment environment,
			TypeInformation<T> outTypeInfo, StreamSource<T, ?> operator,
			boolean isParallel, String sourceName) {
		super(environment, new SourceTransformation<>(sourceName, operator, outTypeInfo, environment.getParallelism()));

		this.isParallel = isParallel;
		if (!isParallel) {
			setParallelism(1);
		}
	}

	public DataStreamSource(SingleOutputStreamOperator<T> operator) {
		super(operator.environment, operator.getTransformation());
		this.isParallel = true;
	}

	@Override
	public DataStreamSource<T> setParallelism(int parallelism) {
		if (parallelism != 1 && !isParallel) {
			throw new IllegalArgumentException("Source: " + transformation.getId() + " is not a parallel source");
		} else {
			super.setParallelism(parallelism);
			return this;
		}
	}
}
  • DataStreamSource extends SingleOutputStreamOperator; its setParallelism override rejects any parallelism other than 1 for non-parallel sources and otherwise delegates to the parent SingleOutputStreamOperator's setParallelism (see the sketch below)
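
A hedged sketch of that guard in action: fromElements is backed by a non-parallel source function, so asking for a parallelism other than 1 is rejected:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NonParallelSource {

	public static void main(String[] args) {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		try {
			// fromElements creates a DataStreamSource with isParallel = false
			env.fromElements(1, 2, 3).setParallelism(4);
		} catch (IllegalArgumentException e) {
			System.out.println(e.getMessage()); // "Source: ... is not a parallel source"
		}
	}
}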

SingleOutputStreamOperator

flink-streaming-java_2.11-1.7.1-sources.jar!/org/apache/flink/streaming/api/datastream/SingleOutputStreamOperator.java

@Public
public class SingleOutputStreamOperator<T> extends DataStream<T> {
	//......

	/**
	 * Sets the parallelism for this operator.
	 *
	 * @param parallelism
	 *            The parallelism for this operator.
	 * @return The operator with set parallelism.
	 */
	public SingleOutputStreamOperator<T> setParallelism(int parallelism) {
		Preconditions.checkArgument(canBeParallel() || parallelism == 1,
				"The parallelism of non parallel operator must be 1.");

		transformation.setParallelism(parallelism);

		return this;
	}

	//......
}
  • SingleOutputStreamOperator's setParallelism ultimately applies to the underlying StreamTransformation

DataStreamSink

flink-streaming-java_2.11-1.7.1-sources.jar!/org/apache/flink/streaming/api/datastream/DataStreamSink.java

@Public
public class DataStreamSink<T> {

	private final SinkTransformation<T> transformation;

	//......

	/**
	 * Sets the parallelism for this sink. The degree must be higher than zero.
	 *
	 * @param parallelism The parallelism for this sink.
	 * @return The sink with set parallelism.
	 */
	public DataStreamSink<T> setParallelism(int parallelism) {
		transformation.setParallelism(parallelism);
		return this;
	}

	//......
}
  • DataStreamSink provides a setParallelism method, which ultimately applies to the SinkTransformation (see the sketch below)
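
A common use of sink-level parallelism (a hedged sketch; the output path is a placeholder) is forcing a single output file by running the sink at parallelism 1:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SingleFileSink {

	public static void main(String[] args) throws Exception {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		env.fromElements("a", "b", "c")
				.writeAsText("/tmp/out")   // hypothetical path; writeAsText returns a DataStreamSink
				.setParallelism(1);        // one parallel sink instance -> one output file

		env.execute("Single-file sink");
	}
}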

Summary

  • Flink supports setting parallelism at several levels: Operator Level, Execution Environment Level, Client Level, and System Level
  • In flink-conf.yaml, the parallelism.default option sets a system-wide default parallelism for all execution environments; within an ExecutionEnvironment, setParallelism sets the default parallelism for operators, data sources, and data sinks; a parallelism set explicitly on an operator, data source, or data sink overrides the environment default
  • ExecutionEnvironment's setParallelism assigns the parallelism to the ExecutionConfig (with the CLI client, parallelism can be given via -p on the command line, or passed to Client.run when invoking from Java/Scala; the parallelism set through LocalEnvironment and RemoteEnvironment ultimately ends up on the Plan); DataStreamSource extends SingleOutputStreamOperator, and its setParallelism ultimately delegates to the parent SingleOutputStreamOperator's setParallelism, which applies to the StreamTransformation; DataStreamSink's setParallelism applies to the SinkTransformation

