Level 2: AJAX and More HTTP
===========================
In the last article, we discussed how to extract links and information from HTML documents. However, web content is becoming more complicated through technologies like AJAX. You may find that the page looks different from what you see in the browser, and the information you want to extract is not in the HTML of the page.
In this article, we will not write a complete scraping script. Instead, we will look at some snippets for web page cases that use technologies like AJAX or need HTTP parameters besides the URL.

AJAX
----
[AJAX] is short for asynchronous JavaScript + XML. AJAX uses existing standards to update parts of a web page without reloading the whole page. A common use of AJAX is to load [JSON] data and render it to HTML on the client side.
You may find elements missing from the HTML fetched by pyspider or [wget](https://www.gnu.org/software/wget/). When you open the page in a browser, some elements appear after the page has loaded, with (or without) a 'loading' animation or message. For example, we want to scrape all channels of Dota 2 from [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202)

![twitch](../imgs/twitch.png)
But you may find nothing in the page.

### Finding the request
As [AJAX] data is transferred over [HTTP], we can find the real request with the help of [Chrome Developer Tools](https://developer.chrome.com/devtools).

1. Open a new tab.
2. Use `Ctrl`+`Shift`+`I` (or `Cmd`+`Opt`+`I` on Mac) to open the DevTools.
3. Switch to the Network panel.
4. Open the URL [http://www.twitch.tv/directory/game/Dota%202](http://www.twitch.tv/directory/game/Dota%202) in this tab.
While resources are being loaded, you will find a table of requested resources.

![developer tools network](../imgs/developer-tools-network.png)
AJAX uses the [XMLHttpRequest](https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequest) object to send and retrieve data, which is generally shortened to "XHR". Use the Filter (funnel icon) to filter out the XHR requests, then glance over each request using the Preview tab:

![find request](../imgs/search-for-request.png)
To determine which one is the key request, you can use the filter to reduce the number of requests, guess the purpose of each request from its path and parameters, then view the response contents for confirmation. Here we found the request: [http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1](http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1)
Now, open the URL in a new tab and you will see the [JSON] data containing the channel list. You can use an extension such as [JSONView](https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc) ([for Firefox](http://jsonview.com/)) to get a pretty-printed view of the JSON. The sample code below extracts the name, current title, and number of viewers of each channel.

```
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    @every(minutes=10)
    def on_start(self):
        # crawl the API endpoint found above instead of the page URL
        self.crawl('http://api.twitch.tv/kraken/streams?limit=20&offset=0&game=Dota+2&broadcaster_language=&on_site=1', callback=self.index_page)

    @config(age=10*60)
    def index_page(self, response):
        # response.json parses the response body as JSON
        return [{
                "name": x['channel']['display_name'],
                "viewers": x['viewers'],
                "status": x['channel'].get('status'),
            } for x in response.json['streams']]
```
> * You can use `response.json` to convert the content to a python `dict` object.
> * As the channel list changes frequently, we update it every 10 minutes and use [`@config(age=10*60)`](/apis/self.crawl/#configkwargs) to set the age. Otherwise, the request would be ignored, as the scheduler thinks the result is new enough and refuses to update the content.
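For reference, the parsed `response.json` consumed by `index_page` is shaped roughly like this (a trimmed sketch; the values are made up for illustration):

```
{
    "streams": [
        {
            "viewers": 12345,
            "channel": {
                "display_name": "some_channel",
                "status": "Some stream title",
            },
        },
        # ... more entries, up to the `limit` query parameter
    ]
}
```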
Here is an online demo for twitch, as well as an approach using [PhantomJS] that will be discussed in the next level: [http://demo.pyspider.org/debug/tutorial_twitch](http://demo.pyspider.org/debug/tutorial_twitch)

HTTP
----
[HTTP] is the protocol for exchanging or transferring hypertext. We used it in the last article: we called `self.crawl` with a URL to fetch the HTML content transferred over [HTTP].
When you get `403 Forbidden` or need to log in, you need the right parameters for the HTTP request.
A typical HTTP request message to [http://example.com/](http://example.com/) looks like:

```
GET / HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.45 Safari/537.36
Referer: http://en.wikipedia.org/wiki/Example.com
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-CN,zh;q=0.8
If-None-Match: "359670651"
If-Modified-Since: Fri, 09 Aug 2013 23:54:35 GMT
```
> * The first line contains the [HTTP method](http://www.w3schools.com/tags/ref_httpmethods.asp), the path, and the HTTP version.
> * It is followed by several lines of request header fields in `key: value` format.
> * If there is a message body (say, for a POST request), an empty line and the message body are appended to the end of the request message.
You can see this in the [Chrome Developer Tools](https://developer.chrome.com/devtools) Network panel that we used in the section above:

![request header](../imgs/request-headers.png)
In most cases, the last thing you need to do is copy the right URL + method + headers + body from the Network panel.

cURL command
------------
`self.crawl` supports a `cURL` command as its argument for making an HTTP request. pyspider will parse the arguments in the command and use them as fetch parameters.
With `Copy as cURL` on a request in the Network panel, you can get a `cURL` command and paste it into `self.crawl(command)` to make crawling easy.
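For instance, a copied command can be passed straight through. This is only a sketch: the endpoint and header values below are hypothetical, not a real capture:

```
from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    def on_start(self):
        # the cURL string is parsed into URL, headers, and other
        # fetch parameters by pyspider
        self.crawl("curl 'http://example.com/api/list' "
                   "-H 'User-Agent: Mozilla/5.0' "
                   "-H 'Referer: http://example.com/'",
                   callback=self.index_page)

    def index_page(self, response):
        return response.json
```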

HTTP Method
-----------
[HTTP] defines methods to indicate the desired action to be performed on the identified resource. The two most commonly used methods are GET and POST. GET, which is used when you open a URL, requests the content of the specified resource. POST is used to submit data to the server.

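Inside a handler method, a sketch of both could look like this (the URLs, form fields, and the `after_login` callback name are hypothetical):

```
# GET: query parameters can be passed via `params`,
# which are encoded into the URL
self.crawl('http://example.com/search', params={'q': 'dota 2'},
           callback=self.index_page)

# POST: set `method` and put the form data in `data`
self.crawl('http://example.com/login', method='POST',
           data={'username': 'user', 'password': 'pass'},
           callback=self.after_login)
```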

HTTP Headers
------------
[HTTP Headers](http://en.wikipedia.org/wiki/List_of_HTTP_header_fields) are the parameters of a request. Some headers need your attention while scraping:

### User-Agent
A [user agent string](http://en.wikipedia.org/wiki/User_agent_string) tells the server the application type, operating system, and software revision of whoever sent the HTTP request.

pyspider's default user agent string is: `pyspider/VERSION (+http://pyspider.org/)`
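If a site rejects the default string, you can override it per request via the `headers` argument of `self.crawl`. A sketch, reusing the browser string from the request message above:

```
self.crawl('http://example.com/', callback=self.index_page,
           headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                                  'Chrome/40.0.2214.45 Safari/537.36'})
```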

### Referer
[Referer](http://en.wikipedia.org/wiki/HTTP_referer) is the address of the previous webpage from which a link to the currently requested page was followed. Some websites check it on image resources to prevent deep linking.

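A sketch of working around such a deep-linking check (the URLs and the `save_img` callback name are hypothetical):

```
# pretend the image was reached by following a link from the site's
# own gallery page, so the server does not reject the request
self.crawl('http://example.com/imgs/photo.jpg', callback=self.save_img,
           headers={'Referer': 'http://example.com/gallery.html'})
```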

HTTP Cookie
-----------
[HTTP Cookie](http://en.wikipedia.org/wiki/HTTP_cookie) is a field in the HTTP headers used for tracking which user is making the request. It is generally used for user login and for preventing unauthorized requests.
You can use [`self.crawl(cookies={"key": value})`](/apis/self.crawl/#fetch) to set cookies via a dict-like API.

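A sketch of sending a session cookie obtained after login (the cookie name and value here are hypothetical):

```
# requests carrying a valid session cookie are treated as logged in
self.crawl('http://example.com/members', callback=self.index_page,
           cookies={'sessionid': 'abcdef123456'})
```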

[PhantomJS]:     http://phantomjs.org/
[AJAX]:          http://en.wikipedia.org/wiki/Ajax_%28programming%29
[JSON]:          http://en.wikipedia.org/wiki/JSON
[HTTP]:          http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol