
Fixing ValueError('Missing scheme in request url: %s' % self._url)

sjfgod
Published 2017/09/01 14:38

Copyright notice: this is an original article. Feel free to learn from it and discuss!

While scraping images with Scrapy's ImagesPipeline, the run failed with the following error:

Traceback (most recent call last):
  File "/home/lcy/.local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/pipelines/media.py", line 62, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/pipelines/images.py", line 147, in get_media_requests
    return [Request(x) for x in item.get(self.images_urls_field, [])]
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/home/lcy/.local/lib/python2.7/site-packages/scrapy/http/request/__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h


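After checking the documentation, I found that the value the ImagesPipeline reads from the image URL field must be a list of URLs. I had stored a single string there. As the traceback shows, the pipeline iterates over the field and builds a Request from each element, so with a string it iterates character by character and tries to request the first character, 'h'. That is why the message ends with "Missing scheme in request url: h". The fix is simply to change the format of the value we pass in: store the URLs as a list.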
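The cause can be reproduced without Scrapy. The snippet below is a minimal sketch: `as_url_list` is a hypothetical helper (not part of Scrapy) and the URL is a placeholder. It shows that iterating over a string yields single characters, and one way to normalize a value into a list before storing it in the item.

```python
# Hypothetical helper, not part of Scrapy. On Python 2 the check would
# need to use basestring instead of str.
def as_url_list(value):
    """Return value as a list of URLs, wrapping a single string in a list."""
    if isinstance(value, str):
        return [value]
    return list(value)


single = 'http://example.com/a.jpg'  # placeholder URL

# Iterating over a string yields its characters one at a time, which is
# what the list comprehension in the traceback does with a string field.
chars = [x for x in single]
first_char = chars[0]

# Wrapping the string in a list gives the pipeline one complete URL.
urls = as_url_list(single)
```

With this helper, a spider could write something like `item['img'] = as_url_list(img_path)` whether it holds one URL or several.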

 

 

Source code:

Before the fix:

# -*- coding: utf-8 -*-
import scrapy
from imgspider.items import QiubaiPicItem
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
class QiubaipicSpider(scrapy.Spider):
    name = "qiubaiPic"
    allowed_domains = ["qiushibaike.com"]
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        # page_value=response.xpath('//*[@id="content-left"]/ul/li[8]/a/span/text()').extract()[0]
        # for page in range(1,int(page_value)):
        #     url='http://www.qiushibaike.com/pic/page/'+str(page)
        #     yield scrapy.Request(url,callback=self.parse_detail)

        url='http://www.qiushibaike.com/pic/page/3'
        yield scrapy.Request(url,callback=self.parse_detail)

    def parse_detail(self,response):
        item=[]  
        divs=response.xpath('//*[@id="content-left"]/div[@class="article block untagged mb15"]')
        for div in divs:
            QiubaiPic=QiubaiPicItem()
            src=div.xpath('div[@class="thumb"]/a/img/@src').extract()[0]
            img_path='http://'+src[2:]   
            QiubaiPic['img']=img_path  # bug: a plain string, but ImagesPipeline expects a list of URLs
            item.append(QiubaiPic)
        return item


 

 

 

 

After the fix:

# -*- coding: utf-8 -*-
import scrapy
from imgspider.items import QiubaiPicItem
import sys
reload(sys)
sys.setdefaultencoding( "utf-8" )
class QiubaipicSpider(scrapy.Spider):
    name = "qiubaiPic"
    allowed_domains = ["qiushibaike.com"]
    start_urls = ['http://qiushibaike.com/']

    def parse(self, response):
        # page_value=response.xpath('//*[@id="content-left"]/ul/li[8]/a/span/text()').extract()[0]
        # for page in range(1,int(page_value)):
        #     url='http://www.qiushibaike.com/pic/page/'+str(page)
        #     yield scrapy.Request(url,callback=self.parse_detail)

        url='http://www.qiushibaike.com/pic/page/3'
        yield scrapy.Request(url,callback=self.parse_detail)

    def parse_detail(self,response):
        item=[]
        img_paths=[]
        divs=response.xpath('//*[@id="content-left"]/div[@class="article block untagged mb15"]')
        for div in divs:
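            src=div.xpath('div[@class="thumb"]/a/img/@src').extract()[0]
            img_path='http://'+src[2:]
            img_paths.append(img_path)
        # Create the item once, after the loop, so it exists even when no divs match;
        # its 'img' field holds the whole list of URLs
        QiubaiPic=QiubaiPicItem()
        QiubaiPic['img']=img_paths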
        item.append(QiubaiPic)
        return item
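The spider builds each address with `'http://'+src[2:]`, which assumes every `src` starts with `//`. The following is a minimal sketch of a more tolerant alternative using the standard library's `urljoin`. The function name `absolute_image_url` and the `pic.example.com` host are assumptions made for illustration; only the page URL comes from the spider above.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2


def absolute_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from."""
    # urljoin handles protocol-relative ('//host/path'), relative and
    # already-absolute values, taking the missing parts from page_url.
    return urljoin(page_url, src)


page_url = 'http://www.qiushibaike.com/pic/page/3'

# Protocol-relative src (placeholder host): the scheme comes from page_url.
from_relative = absolute_image_url(page_url, '//pic.example.com/a.jpg')

# An already absolute src is returned unchanged.
from_absolute = absolute_image_url(page_url, 'https://pic.example.com/b.jpg')

# The pipeline still needs a list, even for a single image.
img_field = [from_relative]
```

Inside a Scrapy callback, `response.url` would serve as `page_url`.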

 

The settings.py file:

# -*- coding: utf-8 -*-

import random

BOT_NAME = 'imgspider'

SPIDER_MODULES = ['imgspider.spiders']
NEWSPIDER_MODULE = 'imgspider.spiders'
# Browser User-Agent request headers; these are required
USER_AGENT_LIST=[
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1", \
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11", \
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6", \
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6", \
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1", \
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5", \
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5", \
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3", \
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3", \
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24", \
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

]
ua= random.choice(USER_AGENT_LIST)
if ua:
    USER_AGENT =ua
    print ua
else:
    USER_AGENT="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"

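# Whether to obey robots.txt
ROBOTSTXT_OBEY = False
# Maximum number of concurrent requests (not a thread count)
CONCURRENT_REQUESTS = 32
# Download delay, in seconds
DOWNLOAD_DELAY = 3
# Cookies switch; disabling is recommended
COOKIES_ENABLED = False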

# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
# Item field that holds the list of image URLs
IMAGES_URLS_FIELD = 'img'
# Directory where downloaded images are stored
IMAGES_STORE = r'/home/lcy/pics'
LOG_FILE = "scrapy.log"
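Note that the settings above call `random.choice` once, when the settings module is loaded, so the whole crawl uses a single User-Agent. If a different value per request is wanted, one possible approach is a small downloader middleware. The sketch below is an assumption, not part of the original project: the class name `RandomUserAgentMiddleware` is hypothetical, and the list is shortened to two entries taken from the settings file.

```python
import random

# Shortened list for illustration; the project would reuse its full list.
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
]


class RandomUserAgentMiddleware(object):
    """Hypothetical downloader middleware: random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENT_LIST):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Called for each outgoing request; returning None lets the
        # request continue through the remaining middlewares.
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` under whatever module path it lives at in the project, for example `'imgspider.middlewares.RandomUserAgentMiddleware'` (a hypothetical path).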
 
