Web page segmentation by visual clustering

wilesun
Published 2017/05/05 22:25

Our job, at Mapado, is collecting all “things to do”, all around the planet.

To gather this huge amount of information, we crawl the web, like Google does, searching for content related to concerts, shows, visits, attractions and so on. When we find an interesting page, we try to extract the “good” data from it.

One of our major challenges is to separate the content we are interested in (title, description, photos, dates, …) from all the crap around it (advertising, navigation bar, footer content, related content, …).

Within that challenge, one task is to group together content that is visually close. Usually, the elements composing the main content of a web page are close to each other.

When we began working on this task, we innocently thought that we could rely on the HTML DOM alone. In the DOM, elements are stored as a hierarchy, so elements with the same parent have a good chance of being related.
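As a rough illustration of that DOM-only idea, the sketch below groups elements by the XPath of their direct parent. The `xpath` key matches the element records used later in this article; the helper name `group_by_parent` and the sample paths are made up for the example.

```python
from collections import defaultdict


def group_by_parent(elements):
    """Group element XPaths by the XPath of their direct parent."""
    groups = defaultdict(list)
    for element in elements:
        # Everything before the last "/" is the parent's XPath.
        parent_xpath = element["xpath"].rsplit("/", 1)[0]
        groups[parent_xpath].append(element["xpath"])
    return dict(groups)


# Hypothetical sample paths, for illustration only.
sample = [
    {"xpath": "/html/body/div[1]/h1"},
    {"xpath": "/html/body/div[1]/p"},
    {"xpath": "/html/body/div[2]/a"},
]
groups = group_by_parent(sample)
```

Siblings end up in the same group, which is the "same parent, probably related" heuristic in its simplest form.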

A very interesting paper covering page segmentation is “Page Segmentation by Web Content Clustering”.

Using the DOM hierarchy is a good starting point, but in many cases things get a lot more complicated:

  • CSS stylesheets can move elements: elements that are close in the DOM hierarchy can be moved anywhere, including outside the browser window
  • CSS stylesheets can hide or show elements: several pieces of content can share the same visual position and simply be moved (or removed) by CSS and JavaScript
  • JavaScript code can display things that are not even in the DOM

So we started considering WebKit as a visual renderer in order to get visual features. There are a bunch of headless browser packages, such as PhantomJS, zombie.js or CasperJS. With such a tool, you can render a web page and read the computed properties of each element on the page.

Some of the following features are useful for clustering things together visually:

  • position of the element in page (from top and left)
  • width and height of the element

Below is a screenshot of the Quai Branly Museum page whose elements we want to cluster.

[Image: quai-branly]

When building the clustering model, we found that one of the main features is the position of the leftmost and rightmost pixels of each block. Indeed, if you look at web pages, different content blocks are very often separated by a vertical gap.
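To make the vertical-gap observation concrete, here is a minimal sketch that splits blocks into columns using only their leftmost and rightmost pixel positions. It is not the clustering used in this article; the function name `split_on_vertical_gaps`, the 20-pixel `min_gap` threshold and the sample blocks are all assumptions made for the example.

```python
def split_on_vertical_gaps(blocks, min_gap=20):
    """Split (left, right) blocks into columns separated by vertical gaps.

    min_gap is an assumed threshold in pixels, not a value from the article.
    """
    columns = []
    current = []
    current_right = None  # rightmost pixel reached by the current column
    for left, right in sorted(blocks):
        # Start a new column when this block begins far enough to the right.
        if current and left - current_right >= min_gap:
            columns.append(current)
            current = []
            current_right = None
        current.append((left, right))
        current_right = right if current_right is None else max(current_right, right)
    if current:
        columns.append(current)
    return columns


# Hypothetical blocks: a sidebar, a main column and a right-hand rail.
blocks = [(0, 200), (10, 190), (260, 900), (300, 880), (960, 1180)]
columns = split_on_vertical_gaps(blocks)
```

With these sample values the blocks fall into three columns, because a gap of 60 pixels separates each group from the next.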

Adding the position of the center of each block and its DOM depth improves the clustering.
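A small sketch of how these features could be derived from one element record follows. The keys `left`, `top`, `width`, `height` and `xpath` are the ones used in this article's implementation; the function name `element_features` is hypothetical, and encoding DOM depth as a single number (the count of `/` in the XPath) is a simplification of what the implementation does with ancestor paths.

```python
def element_features(item):
    """Build a feature vector: edges, center and DOM depth of one element."""
    left = item["left"]
    right = left + item["width"]
    top = item["top"]
    bottom = top + item["height"]
    # Assumed simplification: depth is the number of "/" separators in the XPath.
    depth = item["xpath"].count("/")
    return [left, right, top, bottom, (left + right) / 2, (top + bottom) / 2, depth]


# Hypothetical element record, for illustration only.
item = {"left": 100, "top": 50, "width": 300, "height": 40,
        "xpath": "/html/body/div/h1"}
vector = element_features(item)
```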

Below is a first implementation of these concepts in Python, using scikit-learn to perform the clustering.

Python

 

import numpy as np
from sklearn.cluster import DBSCAN  # Use scikit-learn to perform the clustering


def find(text, char):
    """Return the index of every occurrence of char in text."""
    return [i for i, c in enumerate(text) if c == char]


# ElementList contains one entry for each element we want to cluster,
# with its top and left position, width, height and xpath

xpath_dict = set()  # Collect every ancestor XPath prefix seen across elements
for item in ElementList:
    path_split_idx = find(item["xpath"], "/")
    for idx in path_split_idx:
        xpath_dict.add(item["xpath"][:idx])
xpath_dict = list(xpath_dict)  # Fix an order so each prefix gets a column index

# Build the feature matrix, one row per element
features = []  # Will store the feature vector of each element to cluster
for item in ElementList:
    # Keep only elements inside the browser visual boundary
    if (item["left"] > 0 and item["top"] > 0
            and item["left"] + item["width"] < 1200):
        visual_features = [
            item["left"],                                        # leftmost pixel
            item["left"] + item["width"],                        # rightmost pixel
            item["top"],                                         # top edge
            item["top"] + item["height"],                        # bottom edge
            (item["left"] + item["width"] + item["left"]) / 2,   # horizontal center
            (item["top"] + item["top"] + item["height"]) / 2,    # vertical center
        ]
        # Use DOM ancestor presence as a feature. Defaults to 0
        dom_features = [0] * len(xpath_dict)
        path_split_idx = find(item["xpath"], "/")

        for i, idx in enumerate(path_split_idx):
            # Weight each DOM level by an empirical pixel distance that
            # shrinks with depth (far from a perfect implementation)
            dom_features[xpath_dict.index(item["xpath"][:idx])] = 800 / (i + 1)

        # Create the feature vector combining visual and DOM features
        features.append(visual_features + dom_features)

features = np.asarray(features)  # Convert to a numpy array to make DBSCAN work

# DBSCAN is a good general clustering algorithm
eps_value = 900  # maximum distance between two samples for them to be neighbours
db = DBSCAN(eps=eps_value, min_samples=1, metric='cityblock').fit(features)

# DBSCAN returns a label for each vector of the input array
labels = db.labels_

This algorithm is far from perfect but is a good starting point when trying to cluster things visually.
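Once labels are available, a natural next step is to collect the elements of each cluster. The sketch below assumes the labels line up with the elements that passed the boundary filter, in the same order as the rows of the feature matrix. The names `group_by_label`, `kept_xpaths` and `example_labels` are hypothetical, and the label values are invented rather than real DBSCAN output.

```python
from collections import defaultdict


def group_by_label(items, labels):
    """Collect items into clusters keyed by their cluster label."""
    clusters = defaultdict(list)
    for item, label in zip(items, labels):
        # int() turns numpy integer labels into plain Python ints.
        clusters[int(label)].append(item)
    return dict(clusters)


# Hypothetical data: XPaths of the kept elements and made-up labels.
kept_xpaths = ["/html/body/div[1]/h1", "/html/body/div[1]/p", "/html/body/div[2]/a"]
example_labels = [0, 0, 1]
clusters = group_by_label(kept_xpaths, example_labels)
```

With `min_samples=1`, every element is a core point in DBSCAN, so each one belongs to some cluster and none is labelled as noise.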

Below is the result of the clustering on one page of the Quai Branly Museum site, corresponding to the screenshot above.

[Image: branly-clustering]

Reposted from: http://blog.mapado.com/web-page-segmentation-by-visual-clustering/
