ElasticSearch-Hadoop: Indexing product views count

allantaylor81
Published 2015/06/17 18:55

This post covers how to use ElasticSearch-Hadoop to read data from Hadoop and index it in ElasticSearch. The functionality covered is indexing product view counts and the top search queries per customer over the last n days. The analyzed data can then be used on the website to display a customer's recently viewed items, product view counts and top search query strings.

This post continues the previous posts in this series. There, we gathered customer search click data using Flume, stored it in Hadoop HDFS and ElasticSearch, and analyzed the same data using Hive to generate statistical data. Here we will see how to use the analyzed data to enhance the customer experience on the website and make it relevant for end customers.

Recently Viewed Items

The first part already covered how to use the Flume ElasticSearch sink to index recently viewed items directly into an ElasticSearch instance, so that the data can be used to display the items a customer has clicked in real time.

ElasticSearch-Hadoop

Elasticsearch for Apache Hadoop allows Hadoop jobs to interact with ElasticSearch through a small library that is easy to set up.

The elasticsearch-hadoop Hive integration allows ElasticSearch to be accessed from Hive. As shared in the previous post, the product view counts and the customers' top search queries have already been extracted into Hive tables. We will read that data and index it into ElasticSearch so that it can be displayed on the website.

elasticsearch-hadoop-hive

Product views count functionality

Take a scenario where each product's total views by customers over the last n days must be displayed. For a better user experience, the same functionality can show the end customer how other customers perceive the product.

Hive Data for product views

Sample data selected from the Hive table:

# search.search_productviews : id, productid, viewcount
61, 61, 15
48, 48, 8
16, 16, 40
85, 85, 7

Product Views Count Indexing

Create the Hive external table “search_productviews_to_es” to index data into the ElasticSearch instance.

USE search;
DROP TABLE IF EXISTS search_productviews_to_es;
CREATE EXTERNAL TABLE search_productviews_to_es (id STRING, productid BIGINT, viewcount INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'productviews/productview', 'es.nodes' = 'localhost', 'es.port' = '9210',
  'es.input.json' = 'false', 'es.write.operation' = 'index', 'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
INSERT OVERWRITE TABLE search_productviews_to_es
SELECT qcust.id, qcust.productid, qcust.viewcount FROM search_productviews qcust;

  •  The external table search_productviews_to_es is created and points to the ES instance.

  •  The ElasticSearch instance configuration used is localhost:9210.

  •  The index “productviews” and document type “productview” are used to index the data.

  •  The index and mappings are created automatically if they do not exist.

  •  INSERT OVERWRITE replaces a document that already exists, matched on the id field.

  •  The data is inserted by selecting from another Hive table, “search_productviews”, which stores the analytic/statistical data.
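
The effect of 'es.write.operation' = 'index' combined with 'es.mapping.id' = 'id' can be pictured with a small in-memory model. This is only a sketch: the class IndexByIdSketch and its methods are hypothetical names invented for illustration, and a plain map stands in for the ElasticSearch index. It is not the connector's code. It shows that writing a row whose id already exists replaces the stored document instead of adding a second one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy in-memory stand-in for an ElasticSearch index (hypothetical, NOT the es-hadoop connector).
final class IndexByIdSketch {
    // Maps document id -> viewcount, mimicking documents keyed by the 'es.mapping.id' field.
    private final Map<String, Integer> docs = new LinkedHashMap<>();

    // Mimics the 'index' write operation: add the document, or replace it if the id exists.
    void index(String id, int viewCount) {
        docs.put(id, viewCount);
    }

    // Returns the stored view count, or null when no document has this id.
    Integer viewCount(String id) {
        return docs.get(id);
    }

    int size() {
        return docs.size();
    }

    public static void main(String[] args) {
        IndexByIdSketch index = new IndexByIdSketch();
        index.index("48", 8);   // first run of the Hive script
        index.index("61", 15);
        index.index("48", 10);  // later run: same id, so the new count replaces the old one
        System.out.println("id 48 -> " + index.viewCount("48") + ", documents: " + index.size());
    }
}
```

Keying documents on id is what keeps repeated runs of the script from piling up duplicate entries for the same product.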

Execute the Hive script from Java to index the product views data; see HiveSearchClicksServiceImpl.java:

Collection<HiveScript> scripts = new ArrayList<>();
// Load the Hive script shown above from the classpath.
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_productviews_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
// Run the script, which indexes the product views data into ElasticSearch.
hiveRunner.call();

productviews index sample data

Sample data as stored in the ElasticSearch index:

{id=48, productid=48, viewcount=10}
{id=49, productid=49, viewcount=20}
{id=5, productid=5, viewcount=18}
{id=6, productid=6, viewcount=9}
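
To display a product's view count on the website, one option is to fetch the document by id through ElasticSearch's REST interface, whose document path is /{index}/{type}/{id}. The sketch below only builds that URL. The class and method names are hypothetical, the host and port are taken from the table properties above, and it assumes the id needs no URL encoding.

```java
// Hypothetical helper for illustration; only builds the URL, it does not call ElasticSearch.
final class ProductViewUrlSketch {
    // Builds the REST URL for fetching one document by id: /{index}/{type}/{id}.
    // Assumes the id contains only URL-safe characters.
    static String documentUrl(String host, int port, String index, String type, String id) {
        return "http://" + host + ":" + port + "/" + index + "/" + type + "/" + id;
    }

    public static void main(String[] args) {
        // Host, port, index and type match the Hive table properties used in this post.
        System.out.println(documentUrl("localhost", 9210, "productviews", "productview", "48"));
    }
}
```

An HTTP GET on the resulting URL would return the stored document, including its viewcount field, provided the instance is reachable at that address.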

Customer top search query string functionality

Take a scenario where you want to display the top search query strings of a single customer, or of all customers, on the website. The same data can be used to display a cloud of top search queries on the website.

Hive Data for customer top search queries

Sample data selected from the Hive table:

# search.search_customerquery : id, querystring, count, customerid
61_queryString59, queryString59, 5, 61
298_queryString48, queryString48, 3, 298
440_queryString16, queryString16, 1, 440
47_queryString85, queryString85, 1, 47
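
Judging from the sample rows, the id column appears to be the customer id and the query string joined by an underscore. That is an inference from the data, not something the source states. Under that assumption, a minimal sketch of the key construction (class and method names are hypothetical) looks like this:

```java
// Hypothetical helper; the id format is inferred from the sample rows, not confirmed by the source.
final class CustomerQueryIdSketch {
    // Joins customer id and query string so each (customer, query) pair gets one stable id.
    static String documentId(long customerId, String queryString) {
        return customerId + "_" + queryString;
    }

    public static void main(String[] args) {
        System.out.println(documentId(61L, "queryString59"));
    }
}
```

A composite id of this kind gives each customer and query pair exactly one row, which matters later when the id is used as the document id in ElasticSearch.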

Customer Top search queries Indexing

Create the Hive external table “search_customerquery_to_es” to index data into the ElasticSearch instance.

USE search;
DROP TABLE IF EXISTS search_customerquery_to_es;
CREATE EXTERNAL TABLE search_customerquery_to_es (id STRING, customerid BIGINT, querystring STRING, querycount INT)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'topqueries/custquery', 'es.nodes' = 'localhost', 'es.port' = '9210',
  'es.input.json' = 'false', 'es.write.operation' = 'index', 'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
INSERT OVERWRITE TABLE search_customerquery_to_es
SELECT qcust.id, qcust.customerid, qcust.queryString, qcust.querycount FROM search_customerquery qcust;

  •  The external table search_customerquery_to_es is created and points to the ES instance.

  •  The ElasticSearch instance configuration used is localhost:9210.

  •  The index “topqueries” and document type “custquery” are used to index the data.

  •  The index and mappings are created automatically if they do not exist.

  •  INSERT OVERWRITE replaces a document that already exists, matched on the id field.

  •  The data is inserted by selecting from another Hive table, “search_customerquery”, which stores the analytic/statistical data.

Execute the Hive script from Java to index the data; see HiveSearchClicksServiceImpl.java:

Collection<HiveScript> scripts = new ArrayList<>();
// Load the Hive script shown above from the classpath.
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_customerquery_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
// Run the script, which indexes the customer top queries into ElasticSearch.
hiveRunner.call();

topqueries index sample data

Sample data from the topqueries index on the ElasticSearch instance:

{id=474_queryString95, querystring=queryString95, querycount=10, customerid=474}
{id=482_queryString43, querystring=queryString43, querycount=5, customerid=482}
{id=482_queryString64, querystring=queryString64, querycount=7, customerid=482}
{id=483_queryString6, querystring=queryString6, querycount=2, customerid=483}
{id=487_queryString86, querystring=queryString86, querycount=111, customerid=487}
{id=494_queryString67, querystring=queryString67, querycount=1, customerid=494}
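
To show one customer's top queries, the website could send a search request to the topqueries index that filters on customerid and sorts by querycount in descending order. The sketch below only builds the JSON request body in ElasticSearch's query DSL. The class and method names are hypothetical, and it assumes the automatically created mapping treats customerid and querycount as numeric fields.

```java
// Hypothetical helper for illustration; builds a request body but does not send it.
final class TopQueriesRequestSketch {
    // Builds a search body: match one customer, highest querycount first, at most 'size' hits.
    // Field names come from the sample documents in this post.
    static String searchBody(long customerId, int size) {
        return "{\"query\":{\"term\":{\"customerid\":" + customerId + "}},"
             + "\"sort\":[{\"querycount\":{\"order\":\"desc\"}}],"
             + "\"size\":" + size + "}";
    }

    public static void main(String[] args) {
        // The body would be posted to /topqueries/custquery/_search on the ES instance.
        System.out.println(searchBody(482L, 5));
    }
}
```

With the sample data above, such a request for customer 482 would be expected to rank queryString64 (count 7) ahead of queryString43 (count 5).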

The functionality described above is only a sample and of course needs to be extended to fit a specific business scenario. It may cover displaying a search query cloud to customers on the website, or feeding further Business Intelligence analytics.

Spring Data

Spring Data ElasticSearch has also been included for testing purposes, to create ES repositories that count the total records and delete all of them.
See the service ElasticSearchRepoServiceImpl.java for details.

Total product views:

@Document(indexName = "productviews", type = "productview", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class ProductView {
    @Id
    private String id;
    @Version
    private Long version;
    private Long productId;
    private int viewCount;
    ...
    ...
}

public interface ProductViewElasticsearchRepository extends ElasticsearchCrudRepository<ProductView, String> { }

long count = productViewElasticsearchRepository.count();

Customer top search queries:

@Document(indexName = "topqueries", type = "custquery", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class CustomerTopQuery {
    @Id
    private String id;
    @Version
    private Long version;
    private Long customerId;
    private String queryString;
    private int count;
    ...
    ...
}

public interface TopQueryElasticsearchRepository extends ElasticsearchCrudRepository<CustomerTopQuery, String> { }

long count = topQueryElasticsearchRepository.count();

Later posts will cover analyzing the data further using scheduled jobs:

  • Using Oozie to schedule coordinated jobs for Hive partitions and a bundle job to index data into ElasticSearch.

  • Using Pig to count the total number of unique customers, etc.

This article is reposted from: http://www.javacodegeeks.com/2014/05/elasticsearch-hadoop-indexing-product-views-count-and-custom...
