Posted by sijinge 5 months ago


Abstract: pyspider crawler tutorial 3: Render with PhantomJS
Level 3: Render with PhantomJS
Sometimes a web page is too complex to find the underlying API request. It's time to meet the power of [PhantomJS].
To use PhantomJS, you should have PhantomJS [installed](. If you are running pyspider in `all` mode, PhantomJS is enabled automatically when the executable is found in your `PATH`.
Make sure PhantomJS is working by running:

$ pyspider phantomjs

Continue with the rest of the tutorial if the output is:

Web server running on port 25555
Use PhantomJS
When pyspider is connected with PhantomJS, you can enable this feature by adding the parameter `fetch_type='js'` to `self.crawl`. We use PhantomJS to scrape the channel list of [](, which is loaded with AJAX as discussed in [Level 2](tutorial/AJAX-and-more-HTTP#ajax):

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.twitch.tv/directory/game/Dota%202',
                   fetch_type='js', callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "channels": [{
                "title": x('.title').text(),
                "viewers": x('.info').contents()[2],
                "name": x('.info a').text(),
            } for x in response.doc('.stream.item').items()]
        }
> I used some PyQuery API calls to handle the list of streams. You can find the complete API reference at [PyQuery complete API](.

Running JavaScript on Page
We will try to scrape images from []( in this section. Only 25 images are shown at first; more images are loaded when you scroll to the bottom of the page.
To scrape as many images as possible, we can use the [`js_script` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher) to set a function-wrapped piece of JavaScript that simulates the scroll action:

class Handler(BaseHandler):
    def on_start(self):
        self.crawl('http://www.pinterest.com/categories/popular/',
                   fetch_type='js', js_script="""
                   function() {
                       window.scrollTo(0, document.body.scrollHeight);
                   }
                   """, callback=self.index_page)

    def index_page(self, response):
        return {
            "url": response.url,
            "images": [{
                "title": x('.richPinGridTitle').text(),
                "img": x('.pinImg').attr('src'),
                "author": x('.creditName').text(),
            } for x in response.doc('.item').items() if x('.pinImg')]
        }
> * The script is executed after the page is loaded (this can be changed via the [`js_run_at` parameter](/apis/self.crawl/#enable-javascript-fetcher-need-support-by-fetcher)).
> * We scroll once after the page loads; you can scroll multiple times using [`setTimeout`](. PhantomJS will fetch as many items as possible before the timeout arrives.
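One possible shape for such a multi-scroll script is sketched below; the scroll count (10) and the 500 ms delay are illustrative assumptions, not values from the tutorial:

```python
# Hypothetical js_script for self.crawl: scrolls repeatedly via setTimeout
# so PhantomJS can trigger several rounds of lazy loading before the
# fetch timeout arrives.
js_script = """
function() {
    var scrolls = 0;
    function scrollDown() {
        window.scrollTo(0, document.body.scrollHeight);
        scrolls += 1;
        if (scrolls < 10) {
            setTimeout(scrollDown, 500);  // scroll again after 500 ms
        }
    }
    scrollDown();
}
"""

# It would be passed exactly like the single-scroll version:
# self.crawl(url, fetch_type='js', js_script=js_script,
#            callback=self.index_page)
```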
Online demo: [](

Tags: PySpider Python ajax