Learning the pyspider crawler - Tutorial 1 - HTML-and-CSS-Selector.md
Posted by sijinge 3 months ago

Level 1: HTML and CSS Selector
==============================
In this tutorial, we will scrape information about movies and TV shows from [IMDb].

An online demo with the completed code is available at: [http://demo.pyspider.org/debug/tutorial_imdb](http://demo.pyspider.org/debug/tutorial_imdb).


Before You Start
----------------
You should have pyspider installed. You can refer to the [QuickStart](Quickstart) documentation, or test your code on [demo.pyspider.org](http://demo.pyspider.org).

Some basics you should know before scraping:

* The [Web][WWW] is a system of interlinked hypertext pages.
* Pages are identified on the Web via a uniform resource locator ([URL]).
* Pages are transferred via the Hypertext Transfer Protocol ([HTTP]).
* Web pages are structured using HyperText Markup Language ([HTML]).

Scraping information from the web consists of:

1. Finding the URLs of the pages that contain the information we want.
2. Fetching the pages via HTTP.
3. Extracting the information from the HTML.
4. Finding more URLs that contain what we want, then going back to step 2.
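
The four steps above can be sketched as a small loop. This is only an illustration, not how pyspider is implemented: the `FAKE_WEB` dictionary, the `fetch` stand-in, and the `scrape` function are all hypothetical, and pages are plain dictionaries instead of HTML fetched over HTTP so that the sketch runs offline.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to a page with some
# information and a list of outgoing links. A real crawler would fetch
# HTML over HTTP instead.
FAKE_WEB = {
    "/list?page=1": {"info": None, "links": ["/movie/1", "/movie/2", "/list?page=2"]},
    "/list?page=2": {"info": None, "links": ["/movie/2", "/movie/3"]},
    "/movie/1": {"info": "Movie One", "links": []},
    "/movie/2": {"info": "Movie Two", "links": []},
    "/movie/3": {"info": "Movie Three", "links": []},
}


def fetch(url):
    """Step 2: stand-in for an HTTP request."""
    return FAKE_WEB[url]


def scrape(start_url):
    """Run the find / fetch / extract / follow loop from a start URL."""
    queue = deque([start_url])  # step 1: URLs known to be interesting
    seen = {start_url}          # avoid fetching the same URL twice
    results = {}
    while queue:
        url = queue.popleft()
        page = fetch(url)                 # step 2: fetch the page
        if page["info"] is not None:      # step 3: extract information
            results[url] = page["info"]
        for link in page["links"]:        # step 4: find more URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results


results = scrape("/list?page=1")
print(results)
```

The `seen` set matters because list pages often link to the same movie more than once; without it the loop would fetch duplicates.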


Pick a start URL
----------------
As we want to get all of the movies on [IMDb], the first step is to find a list. A good list page:

* contains links to as many [movies](http://www.imdb.com/title/tt0167260/) as possible.
* lets you traverse all of the movies by following the next page.
* is sorted by last updated time, which helps a lot in getting the latest movies.

By looking around the index page of [IMDb], I found this:

![IMDb front page](../imgs/tutorial_imdb_front.png)

[http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1](http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1)

### Creating a project
You can find the "Create" button at the bottom right of the dashboard. Click it and name your project.

![Creating a project](../imgs/creating_a_project.png)

Change the crawl URL in the `on_start` callback:

```
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1', callback=self.index_page)
```

> * `self.crawl` fetches the page and calls the `callback` method to parse the response.
> * The [`@every` decorator](http://docs.pyspider.org/en/latest/apis/@every/) means `on_start` is executed every day, to make sure no new movies are missed.
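
To get a feel for the two notes above, here is a rough sketch of how a scheduling decorator and a callback-based `crawl` could fit together. It is an assumption for illustration only, not pyspider's actual code: `every_sketch`, the `interval_seconds` attribute, and this toy `Handler` class are all hypothetical.

```python
def every_sketch(minutes=0, seconds=0):
    """Hypothetical stand-in for a scheduling decorator (not pyspider's code)."""
    def wrapper(func):
        # Record how often a scheduler should call the function.
        func.interval_seconds = minutes * 60 + seconds
        return func
    return wrapper


class Handler:
    """Toy handler: records crawl requests instead of fetching anything."""

    def __init__(self):
        self.crawled = []

    def crawl(self, url, callback):
        # A real framework would fetch `url`, then pass the response to `callback`.
        self.crawled.append((url, callback.__name__))

    @every_sketch(minutes=24 * 60)
    def on_start(self):
        self.crawl("http://www.imdb.com/search/title", callback=self.index_page)

    def index_page(self, response):
        pass


handler = Handler()
handler.on_start()
print(Handler.on_start.interval_seconds)  # 24 * 60 minutes expressed in seconds
print(handler.crawled)
```

The point of the sketch is the division of labour: the decorator only attaches scheduling metadata, while `crawl` pairs each URL with the method that will later parse its response.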

Click the green `run` button. You should see a red 1 above `follows`. Switch to the `follows` panel and click the green play button:

![Run one step](../imgs/run_one_step.png)

Index Page
----------
From the [index page](http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1), we need to extract two things:

* links to the movies, like `http://www.imdb.com/title/tt0167260/`
* the link to the [Next](http://www.imdb.com/search/title?count=100&ref_=nv_ch_mm_1&start=101&title_type=feature,tv_series,tv_movie) page

### Find Movies
As you can see, the sample handler has already extracted 1900+ links from the page. One way to extract the movie pages is to filter the links with a regular expression:

```
import re
...

    def index_page(self, response):
        # iterate over every absolute link on the page
        for each in response.doc('a[href^="http"]').items():
            # keep only movie detail pages (raw string, dots escaped)
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
```
> * `callback` is `self.detail_page` here, so that another callback method parses the detail pages.

Remember that you can always use the power of Python, or anything else you are familiar with, to extract information. However, using tools like CSS selectors is recommended.
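
The regular-expression filter used in `index_page` can be tried outside pyspider. The sketch below applies the same pattern to a short list of sample links; apart from the movie URL taken from this tutorial, the links are made up for illustration.

```python
import re

# Movie detail pages look like http://www.imdb.com/title/tt<digits>/
MOVIE_URL = re.compile(r"http://www\.imdb\.com/title/tt\d+/$")

# Sample links (made up for illustration, apart from the first one)
links = [
    "http://www.imdb.com/title/tt0167260/",
    "http://www.imdb.com/title/tt0167260/fullcredits",
    "http://www.imdb.com/name/nm0000001/",
    "http://www.imdb.com/search/title?count=100",
]

movie_links = [url for url in links if MOVIE_URL.match(url)]
print(movie_links)
```

The trailing `$` is what rejects sub-pages such as `.../fullcredits`, so only the detail page itself is passed on.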

### Next page

#### CSS Selectors
CSS selectors are patterns used by [CSS] to select the HTML elements to be styled. As elements containing information may be styled differently within a document, CSS selectors are a good fit for selecting the elements we want. More information about CSS selectors can be found at the following links:

* [CSS Selectors](http://www.w3schools.com/css/css_selectors.asp)
* [CSS Selector Reference](http://www.w3schools.com/cssref/css_selectors.asp)

You can use CSS selectors with the built-in `response.doc` object, which is provided by [PyQuery]; you can find the full reference there.
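
To make concrete what an attribute selector such as `a[href^="http"]` (used in `index_page`) matches, the following sketch reproduces it by hand with Python's standard library instead of PyQuery. The HTML snippet and the `HttpLinkCollector` class are made up for illustration; in a pyspider script you would simply pass the selector to `response.doc`.

```python
from html.parser import HTMLParser

# Made-up HTML snippet for illustration
HTML = """
<div id="main">
  <a href="http://www.imdb.com/title/tt0167260/">A movie</a>
  <a href="/relative/link">Relative</a>
  <a name="anchor-without-href">No href</a>
  <span href="http://example.com/">Not an anchor</span>
</div>
"""


class HttpLinkCollector(HTMLParser):
    """Collect href values of <a> elements whose href starts with "http"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        # Same condition as the selector a[href^="http"]:
        # an <a> element with an href attribute beginning with "http"
        if tag == "a" and href is not None and href.startswith("http"):
            self.links.append(href)


collector = HttpLinkCollector()
collector.feed(HTML)
print(collector.links)
```

A selector expresses in one short pattern what takes a dozen lines by hand, which is why the tutorial recommends selectors for extraction.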

#### CSS Selector Helper
pyspider provides a tool called `CSS selector helper` that makes it easier to generate a selector pattern for the element you click. Enable the CSS selector helper by clicking the button, then switch to the `web` panel.

![CSS Selector helper](../imgs/css_selector_helper.png)

An element is highlighted in yellow while the mouse is over it. When you click it, a pre-selected CSS selector pattern is shown on the bar above. You can edit the features to locate the element and add the pattern to your source code.

Click "Next »" on the page and add the selector pattern to your code:

```
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            # keep only movie detail pages (raw string, dots escaped)
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
        # follow the pagination link found with the CSS selector helper
        self.crawl(response.doc('#right a').attr.href, callback=self.index_page)
```
Click `run` again and move to the next page. There we find that "« Prev" has the same selector pattern as "Next »". With the code above, pyspider selects the link of "« Prev", not "Next »", since `.attr.href` only reads the first matched element. One solution is to select both of them:

```
        self.crawl([x.attr.href for x in response.doc('#right a').items()], callback=self.index_page)
```

Extracting Information
----------------------
Click `run` again and follow a link to a detail page.

Add the keys you need to the result dict and collect the values by using the `CSS selector helper` repeatedly:

```
    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.header > [itemprop="name"]').text(),
            "rating": response.doc('.star-box-giga-star').text(),
            "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
        }
```
Note that the `CSS selector helper` may not always work. You can write the selector pattern manually with tools like [Chrome Dev Tools](https://developer.chrome.com/devtools):

![inspect element](../imgs/inspect_element.png)

You don't need to write every ancestor element in the selector pattern; the elements that distinguish the target from unwanted elements are enough. However, it takes experience in scraping or web development to know which attributes are important and can be used as locators. You can also test a CSS selector in the JavaScript console by using `$$`, like `$$('[itemprop="director"] span')`.

Running
-------

1. After testing your code, don't forget to save it.
2. Go back to the dashboard and find your project.
3. Change the `status` to `DEBUG` or `RUNNING`.
4. Press the `run` button.

![index demo](../imgs/index_page.png)

Notes
-----
This script is just a simple example; you may find more issues when scraping IMDb:

* The `ref_` parameter in the list page URL is for tracking users; it's better to remove it.
* IMDb does not serve more than 100000 results for any query, so you need to find more lists with fewer results, like [this](http://www.imdb.com/search/title?genres=action&title_type=feature&sort=moviemeter,asc)
* You may need a list sorted by last updated time, and to update it at a shorter interval.
* Some attributes are hard to extract; you may need to write the selector pattern by hand, or use [XPATH](http://www.w3schools.com/xpath/xpath_syntax.asp) and/or some Python code to extract the information.
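
As an example of the first note, the tracking parameter can be stripped from a URL before it is crawled. This is a minimal sketch using only the standard library; `strip_ref` is a hypothetical helper name, and it assumes the parameter is always called `ref_`, as in the list page URL used in this tutorial.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_ref(url):
    """Return `url` without its `ref_` query parameter (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep every query parameter except the tracking one.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "ref_"
    ]
    # safe="," keeps the commas in title_type readable instead of encoding them.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


url = "http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1"
clean_url = strip_ref(url)
print(clean_url)
```

Removing the parameter also means that two links to the same list page that differ only in `ref_` end up as the same URL.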

[IMDb]:          http://www.imdb.com/
[WWW]:           http://en.wikipedia.org/wiki/World_Wide_Web
[HTTP]:          http://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol
[HTML]:          http://en.wikipedia.org/wiki/HTML
[URL]:           http://en.wikipedia.org/wiki/Uniform_resource_locator
[CSS]:           https://developer.mozilla.org/en-US/docs/Web/Guide/CSS/Getting_Started/What_is_CSS
[PyQuery]:       https://pythonhosted.org/pyquery/