
Web page segmentation by visual clustering

wilesun
Published 2017/05/05 22:25

Our job, at Mapado, is collecting all “things to do”, all around the planet.

To gather this huge amount of information, we crawl the web, like Google does, searching for content related to concerts, shows, visits, attractions and so on. When we find an interesting page, we try to extract the “good” data from it.

One of our major challenges is to separate the content we are interested in (title, description, photos, dates, …) from all the crap around it (advertising, navigation bar, footer content, related content, …).

Within that challenge, one task is to group pieces of content that are visually close to each other. Usually, the elements that make up the main content of a web page are close to one another.

When we began working on this task, we naively thought that we could rely on the HTML DOM alone. In the DOM, elements are stored as a hierarchy, so elements with the same parent have a good chance of being related.
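As a minimal sketch of this DOM-only starting point, the snippet below groups elements by the XPath of their direct parent. The helpers `parent_xpath` and `group_by_parent` and the sample XPaths are hypothetical, written for illustration only.

```python
from collections import defaultdict


def parent_xpath(xpath):
    """Return the XPath of the parent node (everything before the last '/')."""
    return xpath.rsplit("/", 1)[0]


def group_by_parent(xpaths):
    """Group element XPaths that share the same direct parent."""
    groups = defaultdict(list)
    for xpath in xpaths:
        groups[parent_xpath(xpath)].append(xpath)
    return dict(groups)


# Hypothetical XPaths, for illustration only
sample = [
    "/html/body/div[1]/h1",
    "/html/body/div[1]/p",
    "/html/body/div[2]/a",
]
groups = group_by_parent(sample)
```

Here the title and the paragraph end up in the same group because they share `/html/body/div[1]` as a parent, which is exactly the assumption that breaks down once CSS and javascript come into play.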

A very interesting paper covering page segmentation is “Page Segmentation by Web Content Clustering”.

Using the DOM hierarchy is a good starting point, but in many cases things get a lot more complicated:

  • CSS stylesheets can move elements: elements that are close in the DOM hierarchy can be moved anywhere, including outside the browser window
  • CSS stylesheets can hide or show elements: several pieces of content can share the same visual position, simply being moved (or removed) by CSS and javascript
  • javascript code can display things that are not even in the original DOM

So we started considering using webkit as a visual renderer in order to get visual features. There is a bunch of headless browser packages, such as phantomjs, zombie.js or casperjs. Each of them can render a web page and return the computed properties of each element on the page.

Some of the following features are useful for clustering things visually:

  • position of the element in the page (from top and left)
  • width and height of the element
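To make these features concrete, here is a small sketch of what the exported data could look like. It assumes the headless browser script dumps one JSON object per element; the field names (`xpath`, `left`, `top`, `width`, `height`) and the `bounding_box` helper are assumptions chosen to match the features listed above, not the output format of any particular tool.

```python
import json

# Hypothetical JSON dump, as a headless browser script could export it
# (field names are assumptions, values are made up for illustration)
raw = """
[
  {"xpath": "/html/body/div[1]/h1", "left": 40, "top": 120, "width": 600, "height": 48},
  {"xpath": "/html/body/div[2]/a", "left": 980, "top": 120, "width": 180, "height": 20}
]
"""


def bounding_box(item):
    """Return (left, top, right, bottom) in page pixels for one element."""
    right = item["left"] + item["width"]
    bottom = item["top"] + item["height"]
    return item["left"], item["top"], right, bottom


element_list = json.loads(raw)
boxes = [bounding_box(item) for item in element_list]
```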

Below is a screenshot of the Quai Branly Museum page whose elements we want to cluster.

[Image: quai-branly]

When building the clustering model, we found that one of the main features is the position of the leftmost and rightmost pixels of each block. Indeed, if you look at web pages, different content blocks are very often separated by a vertical gap.
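The intuition behind the vertical gap can be illustrated without any clustering library. The sketch below is a plain interval merge, not the DBSCAN model used later: it merges the (left, right) extents of blocks into bands and starts a new band whenever a horizontal gap appears. The `column_bands` helper, the `min_gap` threshold and the sample extents are all hypothetical.

```python
def column_bands(extents, min_gap=20):
    """Merge (left, right) pixel extents into bands separated by vertical gaps.

    min_gap is a hypothetical threshold in pixels, chosen for illustration.
    """
    bands = []
    for left, right in sorted(extents):
        if bands and left - bands[-1][1] < min_gap:
            # Overlaps or nearly touches the current band: extend it
            bands[-1][1] = max(bands[-1][1], right)
        else:
            # A gap of at least min_gap pixels starts a new band
            bands.append([left, right])
    return [tuple(band) for band in bands]


# Hypothetical extents: three blocks in a main column, two in a sidebar
extents = [(40, 640), (40, 500), (60, 640), (980, 1160), (1000, 1160)]
bands = column_bands(extents)
```

With these sample values the five blocks collapse into two bands, a main column and a sidebar, separated by the empty strip between pixel 640 and pixel 980.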

Adding the position of the center of each element block and the DOM depth improves the effectiveness of the clustering.
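Both extra features are cheap to compute. The sketch below assumes an element is a dict with `left`, `top`, `width`, `height` and `xpath` keys, and takes the DOM depth to be the number of `/` separators in an absolute XPath; the `center` and `dom_depth` helpers and the sample element are hypothetical.

```python
def center(item):
    """Return the (x, y) position of the center of an element, in pixels."""
    cx = item["left"] + item["width"] / 2
    cy = item["top"] + item["height"] / 2
    return cx, cy


def dom_depth(xpath):
    """Return the depth of an element, assuming an absolute XPath like /html/body/div."""
    return xpath.count("/")


# Hypothetical element, for illustration only
item = {"xpath": "/html/body/div[1]/h1", "left": 40, "top": 120, "width": 600, "height": 48}
cx, cy = center(item)
depth = dom_depth(item["xpath"])
```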

Below is a first implementation of these concepts in Python, using scikit-learn to perform the clustering.

Python

 

import numpy as np
from sklearn.cluster import DBSCAN  # Use scikit-learn to perform the clustering


def find(s, ch):
    # Not defined in the original listing: assumed to return the index of
    # every occurrence of ch in s
    return [i for i, c in enumerate(s) if c == ch]


# ElementList contains one entry for each element we want to cluster, with its
# top and left position, width, height and xpath

# Build the list of every XPath prefix (DOM ancestor) seen across all elements
xpath_dict = set()
for item in ElementList:
    for idx in find(item["xpath"], "/"):
        xpath_dict.add(item["xpath"][:idx])
xpath_dict = list(xpath_dict)

# Build the feature matrix, one row per element
features = []  # Will store the feature vector of each element to cluster
for item in ElementList:
    # Keep only elements inside the browser's visual boundary
    if item["left"] > 0 and item["top"] > 0 and item["left"] + item["width"] < 1200:
        visual_features = [
            item["left"],
            item["left"] + item["width"],
            item["top"],
            item["top"] + item["height"],
            (item["left"] + item["width"] + item["left"]) / 2,
            (item["top"] + item["top"] + item["height"]) / 2,
        ]
        # Use the presence of each DOM ancestor as a feature (0 by default)
        dom_features = [0] * len(xpath_dict)
        for i, idx in enumerate(find(item["xpath"], "/")):
            # Give each level of the DOM an empirical distance weight
            # (far from a perfect implementation)
            dom_features[xpath_dict.index(item["xpath"][:idx])] = 800 / (i + 1)

        # Create the feature vector combining visual and DOM features
        features.append(visual_features + dom_features)

features = np.asarray(features)  # Convert to a numpy array to make DBSCAN work

# DBSCAN is a good general clustering algorithm
eps_value = 900  # maximum distance between two elements for them to be neighbours
db = DBSCAN(eps=eps_value, min_samples=1, metric='cityblock').fit(features)

# DBSCAN returns a label for each row of the input array
labels = db.labels_

This algorithm is far from perfect but is a good starting point when trying to cluster things visually.
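Once `labels` is available, the elements still have to be regrouped by cluster. The sketch below shows one way to do it with the standard library. Note that `labels` follows the order of the rows in `features`, so it lines up with the elements that passed the visual boundary filter, not with the full element list. The `group_by_label` helper and the sample values are hypothetical.

```python
from collections import defaultdict


def group_by_label(items, labels):
    """Group items by the cluster label assigned to each of them."""
    clusters = defaultdict(list)
    for item, label in zip(items, labels):
        # int() turns a numpy integer label into a plain Python int
        clusters[int(label)].append(item)
    return dict(clusters)


# Hypothetical values: XPaths of the kept elements and the labels DBSCAN could return
kept_xpaths = ["/html/body/div[1]/h1", "/html/body/div[1]/p", "/html/body/div[2]/a"]
sample_labels = [0, 0, 1]
clusters = group_by_label(kept_xpaths, sample_labels)
```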

Below is the result of the clustering for one page of the Quai Branly Museum, corresponding to the screenshot above.

[Image: branly-clustering]
