Notebook Workflows: The Easiest Way to Implement Apache Spark Pipelines

发布于 2016/09/04 19:11
字数 1128
阅读 38
收藏 0

Notebook Workflows: The Easiest Way to Implement Apache Spark Pipelines

Dave WangEric LiangMaddie Schults by Dave Wang, Eric Liang and Maddie Schults Posted in Company Blog August 30, 2016




Today we are excited to announce Notebook Workflows in Databricks. Notebook Workflows is a set of APIs that allow users to chain notebooks together using the standard control structures of the source programming language — Python, Scala, or R — to build production pipelines. This functionality makes Databricks the first and only product to support building Apache Spark workflows directly from notebooks, offering data science and engineering teams a new paradigm to build production data pipelines.

Traditionally, teams need to integrate many complicated tools (notebooks, Spark infrastructure, external workflow manager just to name a few) to analyze data, prototype applications, and then deploy them into production. With Databricks, everything can be done in a single environment, making the entire process much easier, faster, and more reliable.

Simplifying Pipelines with Notebooks

Notebooks are very helpful in building a pipeline even with compiled artifacts. Being able to visualize data and interactively experiment with transformations makes it much easier to write code in small, testable chunks. More importantly, the development of most data pipelines begins with exploration, which is the perfect use case for notebooks. As an example, Yesware regularly uses Databricks Notebooks to prototype new features for their ETL pipeline.

On the flip side, teams also run into problems as they use notebooks to take on more complex data processing tasks:

  • Logic within notebooks becomes harder to organize. Exploratory notebooks start off as simple sequences of Spark commands that run in order. However, it is common to make decisions based on the result of prior steps in a production pipeline – which is often at odds with how notebooks are written during the initial exploration.
  • Notebooks are not modular enough. Teams need the ability to retry only a subset of a data pipeline so that a failure does not require re-running the entire pipeline.

These are the common reasons that teams often re-implement notebook code for production. The re-implementation process is time-consuming, tedious, and negates the interactive properties of notebooks.

Databricks Notebook Workflows

We took a fresh look at the problem and decided that a new approach is needed. Our goal is to provide a unified platform that eliminates the friction between data exploration and production applications. We started out by providing a fully managed notebook environment for ad hoc experimentation, as well as a Job Scheduler that allows users to deploy notebooks directly to production via a simple UI. By adding Notebook Workflows on top of these existing functionalities, we are providing users the fastest, easiest way to create complex workflows out of their data processing code.

Databricks Notebook Workflows are a set of APIs to chain together Notebooks and run them in the Job Scheduler. Users create their workflows directly inside notebooks, using the control structures of the source programming language (Python, Scala, or R). For example, you can use if statements to check the status of a workflow step, use loops to repeat work, or even take decisions based on the value returned by a step. This approach is much simpler than external workflow tools such as Apache Airflow, Oozie, Pinball, or Luigi because users can transition from exploration to production in the same environment instead of operating another system.

Notebook Workflows are supervised by the Databricks Jobs Scheduler. This means that every workflow gets the production functionality provided by Jobs, such as fault recovery and timeout mechanisms. It also takes advantage of Databricks’ version control and security features — helping teams manage the evolution of complex workflows through GitHub, and securing access to production infrastructure through role-based access control.

Databricks Notebook Workflows diagram

Figure: Databricks Notebook Workflows is a set of APIs to chain together Databricks Notebooks and run them in the Job Scheduler. Highlighted cells in the diagram show the API calling other notebooks.

How to Use Notebook Workflows

Running a notebook as a workflow with parameters

The most basic action of a Notebook Workflow is to simply run a notebook with the dbutils.notebook.run() command. The command runs the notebook on the cluster the caller notebook is attached to, provided that you have the right permissions (see our ACLs documentation to learn more about notebook and cluster level permissions).

The dbutils.notebook.run() command also allows you to pass in arguments to the notebook, like this:

  1. dbutils.notebook.run(
  2. "../path/to/my/notebook",
  3. timeout_seconds = 60,
  4. arguments = {"x": "value1", "y": "value2", ...})

Example: Running a notebook in Databricks

Example: Running a notebook in Databricks

Getting return values

To create more flexible workflows, the dbutils.notebook.run() command can pass back a return value, like this:

  1. status = dbutils.notebook.run("../path/to/my/notebook", timeout_seconds = 60)

The dbutils.notebook.exit() command in the callee notebook needs to be invoked with a string as the argument, like this:

  1. dbutils.notebook.exit(str(resultValue))

It is also possible to return structured data by referencing data stored in a temporary table or write the results to DBFS (Databricks’ caching layer over Amazon S3) and then return the path of the stored data.

Control flow and exception handling

You can control the execution flow of your workflow and handle exceptions using the standard if/then statements and exception processing statements in either Scala or Python. For example:

  1. try:
  2. nextStep = dbutils.notebook.run(
  3. "DataImportNotebook", 250, {"input_path": importStagingDir})
  4. if nextStep == "Clean":
  5. dbutils.notebook.run("CleaningNotebook", 500)
  6. else:
  7. dbutils.notebook.run("ETLNotebook", 3600)
  8. except Exception as e:
  9. print "Error importing data."
  10. dbutils.notebook.run("ErrorNotebook", 1500)

You can also use workflows to perform retries and pass more complex data between notebooks. See the documentation for more details.

Running concurrent notebook workflows

Using built-in libraries in Python and Scala, you can launch multiple workflows in parallel. Here we show a simple example of running three ETL tasks in parallel from a Python notebook. Since workflows are integrated with the native language, it is possible to express arbitrary concurrency and retry behaviors in the user’s preferred language, in contrast to other workflow engines.

Example of running concurrent Notebook workflows


The run command returns a link to a job, which you can use to deep-dive on performance and debug the workflow. Simply open the caller notebook and click on the callee notebook link as shown below and you can start drilling down with the built-in Spark History UI.

Debugging spark jobs in Databricks

What’s Next

Have questions? Got tips you want to share with others? Visit the Databricks forum and participate in our user community.

We are just getting started with helping Databricks users build workflows. Stay tuned for more functionality in the near future. Try to build workflows by signing up for a trial of Databricks today. You can also find more detailed documentation here, and a demo here.


粉丝 302
博文 1110
码字总数 636346
作品 1
私信 提问
A Vision for Making Deep Learning Simple

A Vision for Making Deep Learning Simple When MapReduce was introduced 15 years ago, it showed the world a glimpse into the future. For the first time, engineers at Silicon Vall......

Spark+AI Summit Europe 2018 PPT下载[共95个]

为期三天的 Spark+AI Summit Europe 于 2018-10-02 ~ 04 在伦敦举行,一如往前,本次会议包含大量 AI 相关的议题,某种意义上也代表着 Spark 未来的发展方向。作为大数据领域的顶级会议,Spa...

Apache Spark 1.5.0 正式发布

Spark 1.5.0 是 1.x 系列的第六个版本,收到 230+ 位贡献者和 80+ 机构的努力,总共 1400+ patches。值得关注的改进如下: APIs:RDD, DataFrame 和 SQL 后端执行:DataFrame 和 SQL 集成:数...

你不能错过的 spark 学习资源

1. 书籍,在线文档 2. 网站 3. Databricks Blog 4. 文章,博客 5. 视频

Apache Spark 1.4 发布,开源集群计算系统

Apache Spark 1.4 发布,该版本将 R API 引入 Spark,同时提升了 Spark 的核心引擎和 MLlib ,以及 Spark Streaming 的可用性。部分重要更新如下: Spark Core Spark core 有多各方面的改进,...






官方 https://jwt.io 英文原版 https://www.ietf.org/rfc/rfc7519.txt 或 https://tools.ietf.org/html/rfc7519 中文翻译 https://www.jianshu.com/p/10f5161dd9df 1. 概述 JSON Web Token(......


AOP 理解AOP编程思想(面向方法、面向切面) spring AOP的概念 方面 -- 功能 目标 -- 原有方法 通知 -- 对原有方法增强的方法 连接点 -- 可以用来连接通知的地方(方法) 切入点 -- 将用来插入...


亮度、明度、光亮度,Luminance和Brightness、lightness其实都是一个意思,只是起名字太难了。 提出一个颜色模型后,由于明度的取值与别人的不同,为了表示区别所以就另想一个词而已。 因此在...


前言 python链表应用源码示例,需要用到python os模块方法、函数和类的应用。 首先,先简单的来了解下什么是链表?链表是一种物理存储单元上非连续、非顺序的存储结构,数据元素的逻辑顺序是...

Source Insight加载源码

Source Insight是一个图形化的源代码查看工具(当然也可以作为编译工具)。如果一个项目的源代码较多,此工具可以很方便地查找到源代码自建的依赖关系。 1.创建工程 下图为Snort源代码的文件...