博客专区 > sijinge 的博客 > 博客详情
sijinge 发表于4个月前
  • 发表于 4个月前
  • 阅读 31
  • 收藏 0
  • 点赞 0
  • 评论 0

标题:腾讯云 新注册用户域名抢购1元起>>>   

摘要: pyspider 结果处理 介绍
Working with Results 结果处理
Downloading and viewing your data from WebUI is convenient, but may not suitable for computer.

Working with ResultDB 结果数据处理
Although resultdb is only designed for result preview, not suitable for large scale storage. But if you want to grab data from resultdb, there are some simple snippets using database API that can help you to connect and select the data.

from pyspider.database import connect_database
resultdb = connect_database("<your resutldb connection url>")
for project in resultdb.projects:
    for result in
        assert result['taskid']
        assert result['url']
        assert result['result']
The `result['result']` is the object submitted by `return` statement from your script.

Working with ResultWorker 使用 ResultWorker
In product environment, you may want to connect pyspider to your system / post-processing pipeline, rather than store it into resultdb. It's highly recommended to override ResultWorker.

from pyspider.result import ResultWorker

class MyResultWorker(ResultWorker):
    def on_result(self, task, result):
        assert task['taskid']
        assert task['project']
        assert task['url']
        assert result
        # your processing code goes here
`result` is the object submitted by `return` statement from your script.
You can put this script (e.g., ``) at the folder where you launch pyspider. Add argument for `result_worker` subcommand:
`pyspider result_worker --result-cls=my_result_worker.MyResultWorker`

  "result_worker": {
    "result_cls": "my_result_worker.MyResultWorker"
if you are using config file. [Please refer to Deployment](/Deployment)
Design Your Own Database Schema
The results stored in database is encoded as JSON for compatibility. It's highly recommended to design your own database, and override the ResultWorker described above.

TIPS about Results #关于结果的建议
#### Want to return more than one result in callback?#想要在回调中返回多个单一结果吗?
As resultdb de-duplicate results by taskid(url), the latest will overwrite previous results.
One workaround is using `send_message` API to make a `fake` taskid for each result.

def detail_page(self, response):
    for li in response.doc('li').items():
        self.send_message(self.project_name, {
        }, url=response.url+"#"+li('a.product-sku').text())

def on_message(self, project, msg):
    return msg

See Also: [apis/self.send_message](/apis/self.send_message)
标签: python PySpider
共有 人打赏支持
粉丝 0
博文 39
码字总数 36811
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
* 支付类型