
Schema-on-Write vs. Schema-on-Read

Zootopia
Published 2016/04/27 15:21

1. Question:

What do these two statements, which I encountered during a lecture, mean, and how do they differ?

1. Traditional databases enforce schema during load time.

and

2. Hive enforces schema during read time.

2. Answer:

You touch on one of the reasons why Hadoop and other NoSQL strategies have been so successful, so I'm not sure if you were expecting to get a dissertation or not, but here it is! The extra flexibility and agility in data analysis has probably contributed to the explosion of "data science", just because it makes large-scale data analysis easier in general.

A traditional relational database stores the data with schema in mind. It knows that the second column is an integer, it knows that it has 40 columns, etc. Therefore, you need to specify your schema ahead of time and have it well planned out. This is "schema on write" -- that is, the schema is applied when the data is being written to the data store.
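As a rough illustration of schema on write, here is a minimal sketch using Python's built-in sqlite3 module. The table name `events`, its columns, and the helper `try_insert` are hypothetical and chosen only for this example. SQLite is loosely typed by default, so the sketch assumes a CHECK constraint is an acceptable stand-in for the strict column typing a traditional database applies at write time.

```python
import sqlite3

# In-memory database; the schema is declared before any data is written.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        name TEXT NOT NULL,
        hits INTEGER NOT NULL CHECK (typeof(hits) = 'integer')
    )
    """
)

def try_insert(row):
    """Attempt to write one row; return True if the schema accepted it."""
    try:
        conn.execute("INSERT INTO events (name, hits) VALUES (?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        # The row violated the declared schema, so it is rejected at write time.
        return False

candidates = [("a", 1), ("b", "not a number"), ("c", None)]
accepted = [try_insert(row) for row in candidates]
stored = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(accepted, stored)
```

Rows that do not fit the declared schema never reach the store, which is the "type safety for data at rest" idea in practice.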

Hive (in some cases), Hadoop, and many other NoSQL systems in general are about "schema on read" -- the schema is applied as the data is being read off of the data store. Consider the following line of raw text:

A:B:C~E:F~G:H~~I::J~K~L

There are a couple of ways to interpret this. ~ could be the delimiter, or maybe : could be the delimiter. Who knows? With schema on read, it doesn't matter. You decide what the schema is when you analyze the data, not when you write the data. This example is a bit ridiculous in that you probably won't ever encounter this case, but hopefully it gets the point across.
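To make the two readings concrete, here is a small Python sketch that stores the raw line untouched and applies a delimiter only at read time. The function name `read_fields` is hypothetical and exists only for this illustration.

```python
# The raw line is stored exactly as it arrived; no schema was applied on write.
raw = "A:B:C~E:F~G:H~~I::J~K~L"

def read_fields(line, delimiter):
    """Apply a 'schema' (here, just a delimiter choice) at read time."""
    return line.split(delimiter)

tilde_view = read_fields(raw, "~")
colon_view = read_fields(raw, ":")

print(tilde_view)  # ['A:B:C', 'E:F', 'G:H', '', 'I::J', 'K', 'L']
print(colon_view)  # ['A', 'B', 'C~E', 'F~G', 'H~~I', '', 'J~K~L']
```

Both views come from the same stored bytes; only the interpretation chosen by the reader differs.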

With schema on read, you just load your data into the data store and think about how to parse and interpret later. At the core of this explanation, schema on read means write your data first, figure out what it is later. Schema on write means figure out what your data is first, then write it after.


There are tradeoffs here. Some of the points below are subjective and reflect my own opinion.

Benefits of schema on write:

  • Better type safety and data cleansing done for the data at rest

  • Typically more efficient (storage size and computationally) since the data is already parsed

Downsides of schema on write:

  • You have to plan ahead of time what your schema is before you store the data (i.e., you have to do ETL)

  • Typically you throw away the original data, which could be bad if you have a bug in your ingest process

  • It's harder to have different views of the same data

Benefits of schema on read:

  • Flexibility in defining how your data is interpreted at read time

    • This gives you the ability to evolve your "schema" as time goes on

    • This allows you to have different versions of your "schema"

    • This allows the original source data format to change without having to consolidate to one data format

  • You get to keep your original data

  • You can load your data before you know what to do with it (so you don't drop it on the ground)

  • Gives you flexibility in being able to store unstructured, unclean, and/or unorganized data

Downsides of schema on read:

  • Generally it is less efficient because you have to reparse and reinterpret the data every time (this can be expensive with formats like XML)

  • The data is not self-documenting (i.e., you can't look at a schema to figure out what the data is)

  • More error-prone: your analytics have to account for dirty data
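The "dirty data" downside can be sketched in a few lines of Python. This is an illustration under assumptions, not how any particular system works: the sample lines, the name/age layout, and the function `read_with_schema` are all made up for the example, and turning unparseable fields into None is just one possible policy.

```python
# Raw lines were loaded without validation, so some do not fit the schema
# the reader later decides to apply (name: text, age: integer).
raw_lines = ["alice,30", "bob,unknown", "carol"]

def read_with_schema(line):
    """Parse one line at read time; fields that cannot be parsed become None."""
    parts = line.split(",")
    name = parts[0] if parts[0] else None
    try:
        age = int(parts[1])
    except (IndexError, ValueError):
        # Missing or non-numeric age: the reader, not the writer, must cope.
        age = None
    return {"name": name, "age": age}

rows = [read_with_schema(line) for line in raw_lines]
clean = [row for row in rows if row["age"] is not None]
print(rows)
print(clean)
```

Every query has to repeat this parsing and decide what to do with the rows that fail, which is where both the extra cost and the extra room for error come from.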



Reposted from: http://stackoverflow.com/questions/11764237/hive-enforces-schema-during-read-time
