大数据平台测试数据生产-使用TPC-H生成测试数据

原创
2017/06/19 16:04
阅读数 1.5K

 1 TPC-H介绍

         TPC-H(商业智能计算测试)是TPC的重要测试标准之一,主要用来模拟真实商业的应用环境。 
        TPC-H 用 3NF 实现了一个数据仓库,共包含 8 个基本关系/表,其中表REGION和表NATION的记录数是固定的(分别为5和25),其它6个表的记录数,则随所设定的参数SF而有所不同,其数据量可以设定从 1GB~3TB 不等。有8个级别供用户选择。

2 TPC-H工具下载

http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp

如果链接不可用,请到官网找到TPC-H下载:http://www.tpc.org

下载2.17.1

3 生产测试数据

 在linux系统中解压

unzip tpc-h-tool-2.17.1.zip
tpch_2_17_0/dbgen/

进入dbgen目录后,

复制 makefile.suite文件(此步骤的目的是为了备份原有的makefile.suite)

编辑makefile

cp makefile.suite makefile
vim makefile

 修改以下四处地方

执行

make -f makefile

生产数据:

./dbgen -s 1

-s 1 表示生成1G的数据(如果你之前曾经尝试过生成数据,最好先make clean,再重新make,接着到这步加上-f覆盖掉)

生成之后可以用head命令检查一下tbl们,会看到每一行都有一些用“|”隔开的字段.

./dbgen -h   帮助命令

执行命令后会生成8个数据文件

4 复制数据到hdfs

 hdfs dfs -mkdir /testdata
 hdfs dfs -mkdir /testdata/region
 hdfs dfs -mkdir /testdata/supplier
 hdfs dfs -mkdir /testdata/part
 hdfs dfs -mkdir /testdata/customer
 hdfs dfs -mkdir /testdata/lineitem
 hdfs dfs -mkdir /testdata/nation
 hdfs dfs -mkdir /testdata/orders
 hdfs dfs -mkdir /testdata/partsupp


 hdfs dfs -put /tmp/bigdata/region.tbl /testdata/region
 hdfs dfs -put /tmp/bigdata/supplier.tbl /testdata/supplier
 hdfs dfs -put /tmp/bigdata/part.tbl /testdata/part
 hdfs dfs -put /tmp/bigdata/customer.tbl /testdata/customer
 hdfs dfs -put /tmp/bigdata/lineitem.tbl /testdata/lineitem
 hdfs dfs -put /tmp/bigdata/nation.tbl /testdata/nation
 hdfs dfs -put /tmp/bigdata/orders.tbl /testdata/orders
 hdfs dfs -put /tmp/bigdata/partsupp.tbl /testdata/partsupp

将那8个文件复制到对应的hdfs目录下面

5 创建impala外部关联表

CREATE EXTERNAL TABLE IF NOT EXISTS region
(
	R_REGIONKEY INT,
	R_NAME STRING,
	R_COMMENT STRING
) COMMENT 'The file of TPCH  is region.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/region';

CREATE EXTERNAL TABLE IF NOT EXISTS nation
(
	N_NATIONKEY INT,
	N_NAME STRING,
	N_REGIONKEY INT,
	N_COMMENT STRING
) COMMENT 'The file of TPCH  is nation.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/nation';

CREATE EXTERNAL TABLE IF NOT EXISTS part
(
  P_PARTKEY INT,
  P_NAME STRING,
  P_MFGR STRING,
  P_BRAND STRING,
  P_TYPE STRING,
  P_SIZE INT,
  P_CONTAINER STRING,
  P_RETAILPRICE DECIMAL(15,2),
  P_COMMENT STRING
) COMMENT 'The file of TPCH  is part.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/part';

CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
(
	L_ORDERKEY INT,
  L_PARTKEY INT,
  L_SUPPKEY INT,
  L_LINENUMBER INT,
  L_QUANTITY DECIMAL(15,2),
  L_EXTENDEDPRICE  DECIMAL(15,2),
  L_DISCOUNT DECIMAL(15,2),
  L_TAX DECIMAL(15,2),
  L_RETURNFLAG STRING,
  L_LINESTATUS STRING,
  L_SHIPDATE timestamp,
  L_COMMITDATE timestamp,
  L_RECEIPTDATE timestamp,
  L_SHIPINSTRUCT STRING,
  L_SHIPMODE STRING,
  L_COMMENT STRING
) COMMENT 'The file of TPCH  is lineitem.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/lineitem';

CREATE EXTERNAL TABLE IF NOT EXISTS supplier ( 
  S_SUPPKEY  INT,
  S_NAME  STRING,
  S_ADDRESS  STRING,
  S_NATIONKEY INTEGER,
  S_PHONE STRING,
  S_ACCTBAL  DECIMAL(15,2),
  S_COMMENT  STRING
  )COMMENT 'The file of TPCH  is supplier.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/supplier';

CREATE EXTERNAL TABLE IF NOT EXISTS partsupp ( 
  PS_PARTKEY  INT,
  PS_SUPPKEY  INT,
  PS_AVAILQTY INT,
  PS_SUPPLYCOST  DECIMAL(15,2),
  PS_COMMENT STRING
  )COMMENT 'The file of TPCH  is partsupp.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/partsupp';

CREATE EXTERNAL TABLE IF NOT EXISTS customer ( 
  C_CUSTKEY INT,
  C_NAME  STRING,
  C_ADDRESS  STRING,
  C_NATIONKEY INT,
  C_PHONE STRING,
  C_ACCTBAL  DECIMAL(15,2),
  C_MKTSEGMENT  STRING,
  C_COMMENT  STRING
  )COMMENT 'The file of TPCH  is customer.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/customer';

CREATE EXTERNAL TABLE IF NOT EXISTS orders ( 
  O_ORDERKEY INT,
  O_CUSTKEY  INT,
  O_ORDERSTATUS STRING,
  O_TOTALPRICE  DECIMAL(15,2),
  O_ORDERDATE timestamp,
  O_ORDERPRIORITY  STRING,
  O_CLERK STRING,
  O_SHIPPRIORITY INT,
  O_COMMENT STRING
  )COMMENT 'The file of TPCH  is orders.tbl' 
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'  
STORED AS TEXTFILE
LOCATION '/testdata/orders';

现在就可以使用sql查询这些表了。

展开阅读全文
打赏
0
1 收藏
分享

作者的其它热门文章

加载中
更多评论
打赏
0 评论
1 收藏
0
分享
返回顶部
顶部