pyspider爬虫学习-文档翻译-Frequently-Asked-Questions.md
博客专区 > sijinge 的博客 > 博客详情
pyspider爬虫学习-文档翻译-Frequently-Asked-Questions.md
sijinge 发表于3个月前
pyspider爬虫学习-文档翻译-Frequently-Asked-Questions.md
  • 发表于 3个月前
  • 阅读 17
  • 收藏 0
  • 点赞 0
  • 评论 0

腾讯云 新注册用户 域名抢购1元起>>>   

摘要: pyspider爬虫常见问题(首次尝试读取源码的同时翻译文档,大家多多指正,勿喷)
Frequently Asked Questions #常见问题
==========================

Does pyspider Work with Windows? #pyspider是否与Windows兼容?
--------------------------------
是的,他应该可以,一些用户已经在Windows使用了。但是我没有windows开发环境,没有办法测试,只能给一些提示给在windows上使用pyspider的朋友:
Yes, it should, some users have made it work on Windows. But as I don't have windows development environment, I cannot test. Only some tips for users who want to use pyspider on Windows:
  #有些包需要二进制libs(例如pycurl,lxml),可能你无法通过pip安装它,Windowns二进制包可以在[http://www.lfd.uci.edu/ ~ gohlke / pythonlibs /]中找到。
- Some package needs binary libs (e.g. pycurl, lxml), that maybe you cannot install it from pip, Windowns binaries packages could be found in [http://www.lfd.uci.edu/~gohlke/pythonlibs/](http://www.lfd.uci.edu/~gohlke/pythonlibs/).
  #准备一个干净的环境与 [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
- Make a clean environment with [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
  #在面临崩溃的时候,试试使用32位版本的Python
- Try 32bit version of Python, especially your are facing crash issue.
  #不要使用Python 3.4.1版本
- Avoid using Python 3.4.1 ([#194](https://github.com/binux/pyspider/issues/194), [#217](https://github.com/binux/pyspider/issues/217))

Unreadable Code (乱码) Returned from Phantomjs #Phantomjs返回的结果乱码
---------------------------------------------
#Phantomjs不支持gzip,不要用“gzip”来设置“Accept-Encoding”的头文件。
Phantomjs doesn't support gzip, don't set `Accept-Encoding` header with `gzip`.


How to Delete a Project? #怎么样删除一个项目?
------------------------
设置'group'为'delete'且'status'为'STOP'并等待24小时,你可以在一个项目被删除之前维护'scheduler.DELETE_TIME'改变删除时间。
set `group` to `delete` and `status` to `STOP` then wait 24 hours. You can change the time before a project deleted via `scheduler.DELETE_TIME`.

How to Restart a Project?#怎么样重启一个项目?
-------------------------
#### Why 为什么重启?
它发生在你修改脚本和你想要用新的策略重新抓取所有内容的时候,但因为urls的[age]没有过期。调度器会放弃所有的新请求。
It happens after you modified a script, and wants to crawl everything again with new strategy. But as the [age](/apis/self.crawl/#age) of urls are not expired. Scheduler will discard all of the new requests.

#### Solution 解决方案
1. Create a new project. #创建一个新的项目
2. Using a [itag](/apis/self.crawl/#itag) within `Handler.crawl_config` to specify the version of your script.#在`Handler.crawl_config`中使用一个标签指定脚本的版本。

How to Use WebDAV Mode? #怎么样使用WebDAV模式?
-----------------------
增加`http://hostname/dav/`到你的文件系统,用你喜欢的编辑器编辑或创建脚本。
Mount `http://hostname/dav/` to your filesystem, edit or create scripts with your favourite editor.

> OSX: `mount_webdav http://hostname/dav/ /Volumes/dav`  
> Linux: Install davfs2, `mount.davfs http://hostname/dav/ /mnt/dav`  
> VIM: `vim http://hostname/dav/script_name.py`

#当您没有WebUI编辑脚本时,您需要在调试时将其更改为“WebDAV模式”。在编辑器中保存脚本后,WebUI可以加载并使用最新的脚本来调试代码。
When you are editing script without WebUI, you need to change it to `WebDAV Mode` while debugging. After you saved script in editor, WebUI can load and use latest script to debug your code.

What does the progress bar mean on the dashboard? #仪表板上进度条是什么意思?
-------------------------------------------------
#当鼠标移动到进度条上时,您可以看到注释。
When mouse move onto the progress bar, you can see the explaintions.
#对于5m,1h,1d,数字是在5m,1h,1d中触发的事件。对于所有进度条,它们是对应状态的总任务数。
For 5m, 1h, 1d the number are the events triggered in 5m, 1h, 1d. For all progress bar, they are the number of total tasks in correspond status.

只有任务在DEBUG/RUNNING状态时才显示进度条
Only the tasks in DEBUG/RUNNING status will show the progress.

我需要多少个scheduler/fetcher/processor/result_worker?或者pyspider停止工作
How many scheduler/fetcher/processor/result_worker do I need? or pyspider stop working
--------------------------------------------------------------------------------------
#您只能有一个调度器,有多少个fetcher /processor/ result_worker依赖于系统瓶颈。您可以使用仪表板上的队列状态来查看系统的瓶颈
You can have only have one scheduler, and multiple fetcher/processor/result_worker depends on the bottleneck. You can use the queue status on dashboard to view the bottleneck of the system:

![run one step](imgs/queue_status.png)
#例如,scheduler和fetcher之间的数字表示队列大小,当它达到100(默认最大队列大小)时,fetcher可能会崩溃,或者你应该考虑添加更多的fetcher。
For example, the number between scheduler and fetcher indicate the queue size of scheduler to fetchers, when it's hitting 100 (default maximum queue size), fetcher might crashed, or you should considered adding more fetchers.
#在fetcher下的数字'0+0'表示processors和schduler之间的新任务和状态包的队列大小。你可以把你的鼠标放在数字上看提示。
The number `0+0` below fetcher indicate the queue size of new tasks and status packs between processors and schduler. You can put your mouse over the numbers to see the tips.
标签: Python PySpider 爬虫
共有 人打赏支持
粉丝 0
博文 39
码字总数 36811
×
sijinge
如果觉得我的文章对您有用,请随意打赏。您的支持将鼓励我继续创作!
* 金额(元)
¥1 ¥5 ¥10 ¥20 其他金额
打赏人
留言
* 支付类型
微信扫码支付
打赏金额:
已支付成功
打赏金额: