pyspider crawler notes - API: self.crawl
self.crawl
===========

self.crawl(url, **kwargs)
-------------------------
`self.crawl` is the main interface to tell pyspider which url(s) should be crawled.
### Parameters:

##### url
the url or url list to be crawled.

##### callback
the method to parse the response. _default: `__call__`_

```python
def on_start(self):
    self.crawl('http://scrapy.org/', callback=self.index_page)
```
The following parameters are optional.

##### age
the period of validity of the task. The page would be regarded as not modified during the period. _default: -1 (never recrawl)_

```python
@config(age=10 * 24 * 60 * 60)
def index_page(self, response):
    ...
```
> Every page parsed by the callback `index_page` would be regarded as not changed within 10 days. If you submit the task within 10 days since it was last crawled, it would be discarded.

##### priority
the priority of the task to be scheduled; higher is better. _default: 0_

```python
def index_page(self):
    self.crawl('http://www.example.org/page2.html', callback=self.index_page)
    self.crawl('http://www.example.org/233.html', callback=self.detail_page,
               priority=1)
```
> The page `233.html` would be crawled before `page2.html`. Using this parameter, you can do a [BFS](http://en.wikipedia.org/wiki/Breadth-first_search) and reduce the number of tasks in the queue (which may cost more memory resources).

##### exetime
the executed time of the task in unix timestamp. _default: 0 (immediately)_

```python
import time
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               exetime=time.time()+30*60)
```
> The page would be crawled 30 minutes later.

##### retries
retry times when the fetch fails. _default: 3_
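
For example, a minimal sketch that gives a flaky page a few extra attempts (the URL and the value `5` are illustrative):

```python
def on_start(self):
    # retry up to 5 times instead of the default 3
    self.crawl('http://www.example.org/flaky-page.html', callback=self.index_page,
               retries=5)
```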

##### itag
a marker from the frontier page to reveal potential modification of the task. It will be compared to its last value, and the page will be recrawled when it changes. _default: None_

```python
def index_page(self, response):
    for item in response.doc('.item').items():
        self.crawl(item.find('a').attr.url, callback=self.detail_page,
                   itag=item.find('.update-time').text())
```
> In the sample, `.update-time` is used as the itag. If it hasn't changed, the request would be discarded.

Or you can use `itag` with `Handler.crawl_config` to specify the script version if you want to restart all of the tasks.

```python
class Handler(BaseHandler):
    crawl_config = {
        'itag': 'v223'
    }
```
> Change the value of itag after you modify the script and click the run button again. It doesn't matter if it wasn't set before.

##### auto_recrawl
when enabled, the task would be recrawled every `age` time. _default: False_

```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               age=5*60*60, auto_recrawl=True)
```
> The page would be recrawled every 5 hours (every `age`).

##### method
HTTP method to use. _default: GET_

##### params
dictionary of URL parameters to append to the URL.

```python
def on_start(self):
    self.crawl('http://httpbin.org/get', callback=self.callback,
               params={'a': 123, 'b': 'c'})
    self.crawl('http://httpbin.org/get?a=123&b=c', callback=self.callback)
```
> The two requests are the same.

##### data
the body to attach to the request. If a dictionary is provided, form-encoding will take place.

```python
def on_start(self):
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST', data={'a': 123, 'b': 'c'})
```

##### files
dictionary of `{field: {filename: 'content'}}` files to multipart upload.
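
A minimal sketch of a multipart upload following the `{field: {filename: 'content'}}` format above (the field name, filename and content are illustrative):

```python
def on_start(self):
    # POST one file under the form field 'upload'
    self.crawl('http://httpbin.org/post', callback=self.callback,
               method='POST',
               files={'upload': {'report.txt': 'file content here'}})
```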

##### user_agent
the User-Agent of the request.

##### headers
dictionary of headers to send.

##### cookies
dictionary of cookies to attach to this request.
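
A sketch combining the three request-header options above; the agent string, header and cookie values are only illustrative:

```python
def on_start(self):
    # send a custom User-Agent, an extra header and a cookie with the request
    self.crawl('http://httpbin.org/get', callback=self.callback,
               user_agent='MyCrawler/1.0',
               headers={'Accept-Language': 'en-US'},
               cookies={'session_id': 'abc123'})
```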

##### connect_timeout
timeout for initial connection in seconds. _default: 20_

##### timeout
maximum time in seconds to fetch the page. _default: 120_
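
For example, a sketch that fails fast if the server cannot be reached but allows a slow page more time to download (the URL and values are illustrative):

```python
def on_start(self):
    # give up after 10s if the connection cannot be established,
    # but allow up to 5 minutes for the page itself
    self.crawl('http://www.example.org/large-page.html', callback=self.callback,
               connect_timeout=10, timeout=300)
```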

##### allow_redirects
follow `30x` redirects. _default: True_

##### validate_cert
For HTTPS requests, validate the server’s certificate? _default: True_
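
For example, a sketch for crawling an HTTPS site with a self-signed certificate, while also keeping the fetcher on the original URL (the URL is illustrative):

```python
def on_start(self):
    # skip certificate validation and do not follow 30x redirects
    self.crawl('https://self-signed.example.org/', callback=self.callback,
               validate_cert=False, allow_redirects=False)
```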

##### proxy
proxy server in `username:password@hostname:port` format to use; only HTTP proxy is supported currently.

```python
class Handler(BaseHandler):
    crawl_config = {
        'proxy': 'localhost:8080'
    }
```
> `Handler.crawl_config` can be used with `proxy` to set a proxy for the whole project.

##### etag
use the HTTP Etag mechanism to skip processing the page if its content has not changed. _default: True_

##### last_modified
use the HTTP Last-Modified header mechanism to skip processing the page if its content has not changed. _default: True_
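
A sketch that disables both cache-validation mechanisms so the page is always fetched and processed in full, even if the server reports it as unchanged (the URL is illustrative):

```python
def on_start(self):
    # ignore Etag and Last-Modified so the feed is always re-processed
    self.crawl('http://www.example.org/feed', callback=self.callback,
               etag=False, last_modified=False)
```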

##### fetch_type
set to `js` to enable the JavaScript fetcher. _default: None_

##### js_script
JavaScript to run before or after the page is loaded; it should be wrapped in a function like `function() { document.write("binux"); }`.


```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_script='''
               function() {
                   window.scrollTo(0,document.body.scrollHeight);
                   return 123;
               }
               ''')
```
> The script would scroll the page to the bottom. The value returned by the function can be captured via `Response.js_script_result`.

##### js_run_at
run the JavaScript specified via `js_script` at `document-start` or `document-end`. _default: `document-end`_

##### js_viewport_width/js_viewport_height
set the size of the viewport used by the JavaScript fetcher for the layout process.

##### load_images
load images when the JavaScript fetcher is enabled. _default: False_
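
A sketch pulling the JavaScript-fetcher options above together; the viewport size and URL are only illustrative:

```python
def on_start(self):
    # render with the JavaScript fetcher, run js_script at document-start,
    # lay the page out in a 1024x768 viewport and load images as well
    self.crawl('http://www.example.org/', callback=self.callback,
               fetch_type='js', js_run_at='document-start',
               js_viewport_width=1024, js_viewport_height=768,
               load_images=True)
```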

##### save
an object passed to the callback method; it can be accessed via `response.save`.


```python
def on_start(self):
    self.crawl('http://www.example.org/', callback=self.callback,
               save={'a': 123})

def callback(self, response):
    return response.save['a']
```
> `123` would be returned in `callback`.

##### taskid
unique id to identify the task. By default it is the MD5 checksum of the URL, and it can be overridden by the method `def get_taskid(self, task)`.

```python
import json
from pyspider.libs.utils import md5string
def get_taskid(self, task):
    return md5string(task['url']+json.dumps(task['fetch'].get('data', '')))
```
> Only the url is MD5-ed as the taskid by default; the code above adds the `data` of a POST request as part of the taskid.

##### force_update
force update of the task params even if the task is in `ACTIVE` status.

##### cancel
cancel a task; it should be used with `force_update` to cancel an active task. To cancel an `auto_recrawl` task, you should set `auto_recrawl=False` as well.
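
For example, a sketch that stops a previously submitted `auto_recrawl` task (assuming the same URL was scheduled earlier with `auto_recrawl=True`):

```python
def on_start(self):
    # turn off auto_recrawl and cancel the active task for this URL
    self.crawl('http://www.example.org/', callback=self.callback,
               auto_recrawl=False, cancel=True, force_update=True)
```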

cURL command
------------

`self.crawl(curl_command)`
cURL is a command line tool for making HTTP requests. A cURL command can easily be obtained from the Chrome Devtools > Network panel: right-click the request and choose "Copy as cURL".

You can use a cURL command as the first argument of `self.crawl`. It will parse the command and make the HTTP request just like curl does.
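
A minimal sketch; the command string here is only illustrative, and would normally be pasted straight from Chrome Devtools:

```python
def on_start(self):
    # pass a cURL command line instead of a plain URL
    self.crawl("curl 'http://httpbin.org/get' -H 'Accept: text/html' --compressed",
               callback=self.callback)
```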

@config(**kwargs)
-----------------
default parameters of `self.crawl` when the decorated method is used as the callback. For example:

```python
@config(age=15*60)
def index_page(self, response):
    self.crawl('http://www.example.org/list-1.html', callback=self.index_page)
    self.crawl('http://www.example.org/product-233', callback=self.detail_page)
    
@config(age=10*24*60*60)
def detail_page(self, response):
    return {...}
```
The `age` of `list-1.html` is 15 minutes while the `age` of `product-233.html` is 10 days. Because the callback of `product-233.html` is `detail_page`, it shares the config of `detail_page`.

Handler.crawl_config = {}
-------------------------
default parameters of `self.crawl` for the whole project. The parameters in `crawl_config` for the scheduler (priority, retries, exetime, age, itag, force_update, auto_recrawl, cancel) are joined when the task is created, while the parameters for the fetcher and processor are joined when the task is executed. You can use this mechanism to change the fetch config (e.g. cookies) afterwards.

```python
class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'User-Agent': 'GoogleBot',
        }
    }
    
    ...
```
> `crawl_config` sets a project-level user-agent.