
Fixing ValueError('Missing scheme in request url: %s' % self._url)

sjfgod
Published 2017/09/01 14:38

Copyright notice: this is an original article; questions and discussion are welcome!

While using Scrapy's ImagesPipeline to crawl images, running the spider raised this error:

Traceback (most recent call last):
  File "/home/lcy/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/pipelines/media.py", line 62, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/pipelines/images.py", line 147, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h


After consulting the documentation, I learned that the field ImagesPipeline reads image URLs from must be a list. The pipeline iterates over that field and builds one request per element, so when the field holds a plain string it iterates character by character, and the first "URL" it tries is just the letter 'h', which has no scheme. Since I had stored a single string, the fix is simply to store a list of URLs instead.
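The cause is easy to reproduce outside Scrapy. get_media_requests builds one request per element of the URL field ([Request(x) for x in item.get(self.images_urls_field, [])]), and iterating a plain string yields single characters. A minimal sketch (the sample URL is made up):

```python
# What ImagesPipeline effectively does with the field:
#     [Request(x) for x in item.get(self.images_urls_field, [])]
url_field = 'http://pic.qiushibaike.com/demo.jpg'    # a bare string -- the bug
print([x for x in url_field][0])     # 'h': each character becomes a "URL"

url_field = ['http://pic.qiushibaike.com/demo.jpg']  # a list -- the fix
print([x for x in url_field][0])     # the full URL, scheme intact
```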


The source code is attached below.

Before the fix:

# -*- coding: utf-8 -*-
import scrapy
from imgspider.items import QiubaiPicItem
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
class QiubaipicSpider(scrapy.Spider):
    name = "qiubaiPic"
    allowed_domains = ["qiushibaike.com"]
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        # page_value=response.xpath('//*[@id="content-left"]/ul/li[8]/a/span/text()').extract()[0]
        # for page in range(1,int(page_value)):
        #     url='http://www.qiushibaike.com/pic/page/'+str(page)
        #     yield scrapy.Request(url,callback=self.parse_detail)

        url='http://www.qiushibaike.com/pic/page/3'
        yield scrapy.Request(url,callback=self.parse_detail)

    def parse_detail(self,response):
        item=[]  
        divs=response.xpath('//*[@id="content-left"]/div[@class="article block untagged mb15"]')
        for div in divs:
            QiubaiPic=QiubaiPicItem()
            src = div.xpath('div[@class="thumb"]/a/img/@src').extract()[0]
            img_path = 'http://' + src[2:]   # src is protocol-relative ("//..."), so add the scheme
            QiubaiPic['img'] = img_path      # BUG: a single string, not a list -- triggers the ValueError
            item.append(QiubaiPic)
        return item


After the fix:

# -*- coding: utf-8 -*-
import scrapy
from imgspider.items import QiubaiPicItem
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
class QiubaipicSpider(scrapy.Spider):
    name = "qiubaiPic"
    allowed_domains = ["qiushibaike.com"]
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        # page_value=response.xpath('//*[@id="content-left"]/ul/li[8]/a/span/text()').extract()[0]
        # for page in range(1,int(page_value)):
        #     url='http://www.qiushibaike.com/pic/page/'+str(page)
        #     yield scrapy.Request(url,callback=self.parse_detail)

        url='http://www.qiushibaike.com/pic/page/3'
        yield scrapy.Request(url,callback=self.parse_detail)

    def parse_detail(self, response):
        items = []
        img_paths = []
        divs = response.xpath('//*[@id="content-left"]/div[@class="article block untagged mb15"]')
        for div in divs:
            src = div.xpath('div[@class="thumb"]/a/img/@src').extract()[0]
            img_paths.append('http://' + src[2:])  # src is protocol-relative, so add the scheme
        # One item whose 'img' field is a list of URLs, as ImagesPipeline expects
        QiubaiPic = QiubaiPicItem()
        QiubaiPic['img'] = img_paths
        items.append(QiubaiPic)
        return items
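The img src values on the page are protocol-relative (they start with "//"), which is why the spider strips the first two characters and prepends "http://". A sketch of that normalization (the sample src value is hypothetical):

```python
# Protocol-relative src, as scraped from the page (sample value is made up)
src = '//pic.qiushibaike.com/system/pictures/sample.jpg'
img_path = 'http://' + src[2:]    # drop the leading "//", add an explicit scheme
print(img_path)  # http://pic.qiushibaike.com/system/pictures/sample.jpg
```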


The settings.py file:

# -*- coding: utf-8 -*-

import random

BOT_NAME = 'imgspider'

SPIDER_MODULES = ['imgspider.spiders']
NEWSPIDER_MODULE = 'imgspider.spiders'
# Browser request headers; a User-Agent is required
USER_AGENT_LIST=[
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

]
# random.choice on a non-empty list always returns an element,
# so no fallback branch is needed
USER_AGENT = random.choice(USER_AGENT_LIST)
print USER_AGENT

# Whether to obey robots.txt
ROBOTSTXT_OBEY = False
# Maximum number of concurrent requests
CONCURRENT_REQUESTS = 32
# Download delay, in seconds
DOWNLOAD_DELAY = 3
# Cookies switch; disabling is recommended
COOKIES_ENABLED = False

# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Item field that holds the list of image URLs
IMAGES_URLS_FIELD = 'img'
# Directory where downloaded images are stored
IMAGES_STORE = r'/home/lcy/pics'
LOG_FILE = "scrapy.log"
