Posted by sijinge, 7 months ago

Summary: a pyspider tutorial
# Level 1: HTML and CSS Selector
In this tutorial, we will scrape information about movies and TV shows from [IMDb].

An online demo with the completed code is available.

## Before Start

You should have pyspider installed; you can refer to the [QuickStart](Quickstart) documentation. You can also test your code on the online demo.

Some basic knowledge you should have before scraping:

* The [Web][WWW] is a system of interlinked hypertext pages.
* Pages are identified on the Web via uniform resource locators ([URL]s).
* Pages are transferred via the Hypertext Transfer Protocol ([HTTP]).
* Web pages are structured using HyperText Markup Language ([HTML]).

Scraping information from the Web thus involves:

1. Finding URLs of the pages that contain the information we want.
2. Fetching the pages via HTTP.
3. Extracting the information from the HTML.
4. Finding more URLs that contain what we want, and going back to step 2.
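The fetch–extract–follow loop above can be sketched in a few lines of plain Python. This is only an illustration: the `PAGES` dict is a hypothetical in-memory "web" standing in for real HTTP fetches, and the regexes are deliberately naive.

```python
import re

# Hypothetical in-memory "web": URL -> HTML, standing in for real HTTP fetches.
PAGES = {
    "/list": '<a href="/movie/1">A</a> <a href="/movie/2">B</a> <a href="/list2">next</a>',
    "/list2": '<a href="/movie/3">C</a>',
    "/movie/1": "<h1>A</h1>",
    "/movie/2": "<h1>B</h1>",
    "/movie/3": "<h1>C</h1>",
}

def crawl(start_url):
    """Follow links breadth-first: fetch a page, extract info, queue new URLs."""
    queue, seen, titles = [start_url], set(), []
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = PAGES[url]                                 # step 2: fetch the page
        titles += re.findall(r"<h1>(.*?)</h1>", html)     # step 3: extract the information
        queue += re.findall(r'href="(.*?)"', html)        # step 4: find more URLs, repeat
    return titles

print(crawl("/list"))  # ['A', 'B', 'C']
```

A real crawler replaces the dict lookup with an HTTP request and the regexes with a proper HTML parser; pyspider handles both parts for you.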

## Pick a start URL

As we want to get all of the movies on [IMDb], the first thing is to find a list. A good list page may:

* contain links to as many [movies] as possible.
* let you traverse all of the movies by following the "next page" link.
* be sorted by last-updated time, which is a great help for getting the latest movies.

By looking around the index page of [IMDb], I found this:

![IMDb front page](../imgs/tutorial_imdb_front.png)


### Creating a project

You can find "Create" on the bottom right of the dashboard. Click it and name your project.

![Creating a project](../imgs/creating_a_project.png)

Change the crawl URL in the `on_start` callback:

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('http://www.imdb.com/search/title?count=100&title_type=feature,tv_series,tv_movie&ref_=nv_ch_mm_1', callback=self.index_page)

> * `self.crawl` fetches the page and calls the `callback` method to parse the response.
> * The `@every` decorator makes `on_start` execute once every day, to make sure we don't miss any new movies.
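If you are curious how a decorator like `@every` can carry scheduling information, here is a minimal sketch. This is not pyspider's actual implementation, only an illustration of the pattern: the decorator attaches an interval attribute that a scheduler could read later.

```python
def every(minutes=1):
    """Attach a scheduling interval (in minutes) to the decorated function.

    Illustration only -- pyspider's real @every also integrates with its scheduler.
    """
    def wrapper(func):
        func.every = minutes  # a scheduler would read this attribute later
        return func
    return wrapper

@every(minutes=24 * 60)
def on_start():
    pass

print(on_start.every)  # 1440
```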

Click the green `run` button; you should find a red 1 above the follows tab. Switch to the follows panel and click the green play button:
![Run one step](../imgs/run_one_step.png)

## Index Page

From the index page, we need to extract two things:

* links to the individual movie pages
* the link to the [Next] page

### Find Movies

As you can see, the sample handler has already extracted 1900+ links from the page. One way of extracting the movie pages is to filter the links with a regular expression:

    import re

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
> * `callback` is `self.detail_page` here, so another callback method is used to parse the movie pages.

Remember, you can always use the power of Python or anything else you are familiar with to extract information, but using tools like CSS selectors is recommended.
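To see what a filter along these lines keeps and drops, you can try it on a few sample hrefs. The URLs below are illustrative; IMDb title pages have the form `http://www.imdb.com/title/tt<digits>/`.

```python
import re

# Illustrative sample of links a list page might contain.
hrefs = [
    "http://www.imdb.com/title/tt0167260/",   # a movie page: matches
    "http://www.imdb.com/chart/top",          # a chart page: filtered out
    "http://www.imdb.com/title/tt0068646/",   # a movie page: matches
]

movie_links = [h for h in hrefs
               if re.match(r"http://www\.imdb\.com/title/tt\d+/$", h)]
print(movie_links)  # the two /title/tt.../ URLs
```

Note that `re.match` anchors at the start of the string, so the trailing `$` is what rejects URLs with anything after the final slash.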

### Next page

#### CSS Selectors

CSS selectors are patterns used by [CSS] to select the HTML elements to be styled. Since elements containing different information usually have different styles in the document, it's appropriate to use CSS selectors to select the elements we want. More information about CSS selectors can be found in these references:

* [CSS Selectors]
* [CSS Selector Reference]

You can use CSS selectors with the built-in `response.doc` object, which is provided by [PyQuery]; you can find the full selector reference there.

#### CSS Selector Helper

pyspider provides a tool called `CSS selector helper` that makes it easier to generate a selector pattern for the element you click. Enable the CSS selector helper by clicking its button, then switch to the `web` panel.

![CSS Selector helper](../imgs/css_selector_helper.png)
The element will be highlighted in yellow when you hover over it. When you click it, a pre-selected CSS selector pattern is shown on the bar above. You can edit the features to locate the element, then add the pattern to your source code.

Click "Next »" in the page and add the selector pattern to your code:

    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            if re.match(r"http://www\.imdb\.com/title/tt\d+/$", each.attr.href):
                self.crawl(each.attr.href, callback=self.detail_page)
        self.crawl(response.doc('#right a').attr.href, callback=self.index_page)
Click `run` again and move to the next page. We find that "« Prev" has the same selector pattern as "Next »", so when using the code above you may find that pyspider selects the link of "« Prev", not "Next »". A solution for this is to select both of them:

        self.crawl([x.attr.href for x in response.doc('#right a').items()], callback=self.index_page)

## Extracting Information

Click `run` again and follow a link to a detail page.

Add the keys you need to the result dict and collect their values with the `CSS selector helper`, one by one:

    def detail_page(self, response):
        return {
            "url": response.url,
            "title": response.doc('.header > [itemprop="name"]').text(),
            "rating": response.doc('.star-box-giga-star').text(),
            "director": [x.text() for x in response.doc('[itemprop="director"] span').items()],
        }
Note that the `CSS selector helper` may not always work. In those cases you can write the selector pattern manually with tools like [Chrome Dev Tools]:
![inspect element](../imgs/inspect_element.png)
You don't need to write every ancestral element in the selector pattern; the elements that differentiate the target from unneeded elements are enough. However, it takes some experience with scraping or Web development to know which attributes are distinctive enough to be used as locators. You can also test a CSS selector in the JavaScript console by using `$$`, e.g. `$$('[itemprop="director"] span')`.

## Running

1. After testing your code, don't forget to save it.
2. Go back to the dashboard and find your project.
3. Change the `status` to `DEBUG` or `RUNNING`.
4. Press the `run` button.

![index demo](../imgs/index_page.png)

## Notes

The script here is just a simple one; you may find more issues when scraping IMDb:

* The `ref` parameter in list page URLs is used for tracing users; it's better to remove it.
* IMDb does not serve more than 100,000 results for any query, so you need to find more lists with fewer results each, like [this one].
* You may need a list sorted by last-updated time, refreshed on a shorter interval.
* Some attributes are hard to extract; you may need to write selector patterns by hand, or use [XPATH] and/or some Python code to extract the information.
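For the first note, stripping the tracking parameter can be done with the standard library. The URL below is illustrative, and the parameter name `ref_` is taken from the list URL used earlier in this tutorial:

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def strip_ref(url):
    """Remove the user-tracing 'ref_' parameter from a URL's query string."""
    parts = urlparse(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "ref_"]
    return urlunparse(parts._replace(query=urlencode(query)))

print(strip_ref("http://www.imdb.com/search/title?count=100&ref_=nv_ch_mm_1"))
# http://www.imdb.com/search/title?count=100
```

Normalizing URLs like this also prevents pyspider from treating the same page with different `ref_` values as distinct tasks.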

Tags: python, pysp, tutorial