pyspider爬虫学习-文档翻译-About-Projects.md
Posted by sijinge 3 months ago
Abstract: A first attempt to translate the documentation while reading the source code; corrections are welcome.
About Projects 关于项目
==============
在大多数情况下,一个项目就是为一个网站编写的脚本
In most cases, a project is one script you write for one website.

项目是独立的,但是您可以将另一个项目用`from projects import other_project`的方式导入为模块
* Projects are independent, but you can import another project as a module with `from projects import other_project`
一个项目有五种状态:`TODO`、`STOP`、`CHECKING`、`DEBUG` 和 `RUNNING`
* A project has five statuses: `TODO`, `STOP`, `CHECKING`, `DEBUG` and `RUNNING`
    `TODO` - 脚本刚刚创建,等待编写
    - `TODO` - a script is just created to be written
    `STOP` - 如果您希望项目停止,可以将其标记为`STOP`
    - `STOP` - you can mark a project as `STOP` if you want it to STOP (= =).
    `CHECKING` - 当正在运行的项目被修改时,为了防止不完整的修改生效,项目状态会被自动设置为`CHECKING`
    - `CHECKING` - when a running project is modified, its status will be set to `CHECKING` automatically to prevent an incomplete modification from taking effect.
    `DEBUG`/`RUNNING` - 这两种状态对爬虫没有区别,但最好在项目第一次运行时将其标记为`DEBUG`,检查无误后再改为`RUNNING`
    - `DEBUG`/`RUNNING` - these two statuses make no difference to the spider, but it's good to mark a project as `DEBUG` when it runs for the first time and change it to `RUNNING` after it has been checked.
爬取速度由`rate`和`burst`两个参数通过[令牌桶(token-bucket)](http://en.wikipedia.org/wiki/Token_bucket)算法控制
* The crawl rate is controlled by `rate` and `burst` with the [token-bucket](http://en.wikipedia.org/wiki/Token_bucket) algorithm.
     `rate` - 每秒发出多少个请求
    - `rate` - how many requests in one second
     `burst` - 考虑这种情况:`rate/burst = 0.1/3`,意思是爬虫每10秒爬取1个页面。当所有任务都完成后,项目每分钟检查一次最近更新的条目。假设发现了3个新条目,pyspider将“突发”地爬取这3个任务而无需等待3×10秒;然而,第四个任务需要再等待10秒
    - `burst` - consider this situation: `rate/burst = 0.1/3`, which means the spider crawls 1 page every 10 seconds. All tasks are finished, and the project checks for the last updated items every minute. Assume 3 new items are found; pyspider will "burst" and crawl the 3 tasks without waiting 3×10 seconds. However, the fourth task needs to wait 10 seconds.
若要删除一个项目,请将`group`设置为`delete`、状态设置为`STOP`,然后等待24小时。
* To delete a project, set `group` to `delete` and status to `STOP`, then wait 24 hours.
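下面是一个令牌桶限速的示意实现(仅为说明原理的草图,并非pyspider的内部代码),`rate`为每秒补充的令牌数,`burst`为桶的容量。
The rate/burst behavior above can be sketched with a minimal token-bucket limiter (an illustrative sketch, not pyspider's internal code): tokens refill at `rate` per second, capped at `burst`.

```python
import time

class TokenBucket:
    """Illustrative token-bucket rate limiter (not pyspider's implementation)."""

    def __init__(self, rate, burst):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum tokens the bucket can hold
        self.tokens = burst     # start full, so an initial "burst" is possible
        self.last = time.time()

    def get(self, now=None):
        """Try to take one token; return True if a request may proceed."""
        now = time.time() if now is None else now
        # refill according to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# rate/burst = 0.1/3: after idling, 3 tasks run at once, the 4th is throttled
bucket = TokenBucket(rate=0.1, burst=3)
t0 = bucket.last
print([bucket.get(now=t0) for _ in range(4)])  # [True, True, True, False]
print(bucket.get(now=t0 + 10))                 # True: one token refilled after 10s
```

With `rate=0.1` a new token appears every 10 seconds, which reproduces the "3 immediate tasks, then wait 10 seconds" behavior described above.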

`on_finished` callback
--------------------
你可以在项目中重写`on_finished`方法,当task_queue变为0时该方法会被触发。
You can override the `on_finished` method in a project; the method will be triggered when the task_queue goes to 0.
例子1:当你启动一个项目去爬取一个有100个页面的网站时,这100个页面全部爬取成功或重试后仍失败时,`on_finished`回调将被执行。
Example 1: When you start a project to crawl a website with 100 pages, the `on_finished` callback will be fired when all 100 pages have been successfully crawled or have failed after retries.
例子2:带有`auto_recrawl`任务的项目将**永远不会**触发`on_finished`回调,因为只要队列中存在auto_recrawl任务,时间队列就不可能变为0。
Example 2: A project with `auto_recrawl` tasks will **NEVER** trigger the `on_finished` callback, because the time queue will never become 0 while there are auto_recrawl tasks in it.
例子3:使用`@every`装饰的方法的项目会在每次新提交的任务完成时触发`on_finished`回调。
Example 3: A project with an `@every` decorated method will trigger the `on_finished` callback every time the newly submitted tasks are finished.
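下面用一个极简的调度器模型演示这种行为(仅为示意草图,`MiniScheduler`是假设的类名,并非pyspider的真实调度器):队列清空时回调触发,再次提交新任务并完成后回调会再次触发。
The behavior above can be modeled with a tiny scheduler sketch (`MiniScheduler` is a hypothetical name for illustration, not pyspider's actual scheduler): the callback fires whenever the queue drains to zero, and fires again after newly submitted tasks finish.

```python
class MiniScheduler:
    """Toy model of the on_finished hook, not pyspider's real scheduler."""

    def __init__(self):
        self.task_queue = []
        self.finished_calls = 0

    def on_finished(self):
        # in a real pyspider project you would override this method
        self.finished_calls += 1

    def submit(self, *tasks):
        self.task_queue.extend(tasks)

    def run(self):
        while self.task_queue:
            self.task_queue.pop(0)   # "crawl" the task
        self.on_finished()           # queue reached 0 -> callback fires

sched = MiniScheduler()
sched.submit("page1", "page2", "page3")
sched.run()
print(sched.finished_calls)  # -> 1

# newly submitted tasks (e.g. from an @every method) trigger it again
sched.submit("page4")
sched.run()
print(sched.finished_calls)  # -> 2
```

Note how an `auto_recrawl`-style task that re-enqueues itself would keep the queue non-empty forever, so `run()` would never reach the callback, matching Example 2.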
Tags: pyspider, crawler