Python urllib: A Summary of Practical Methods, Attributes, and Workflow

Original
2019/11/26 19:10

1. urllib, urllib2, urllib3, requests

urllib2 exists only in Python 2. Python 3 merged urllib and urllib2 into the single urllib package, so in Python 3 you use urllib directly.

[Figure: py2-urllib]

[Figure: py3-urllib]

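urllib3 is a third-party library. It provides connection pooling, client-side SSL/TLS verification, file uploads with multipart encoding, HTTP redirect handling, gzip and deflate content encoding, and HTTP and SOCKS proxy support.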

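requests is also a third-party library. It depends on urllib3 and wraps it in a higher-level API, which is why requests is the one most people use in practice.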

2. urlopen

from urllib import request, parse

# timeout is given in seconds
response = request.urlopen(r'http://www.baidu.com', timeout=3)
# <class 'http.client.HTTPResponse'>
print(type(response))

content = response.read()
# <class 'bytes'>
print(type(content))

print(content.decode('utf-8'))

# Pass parameters: supplying data turns the request into a POST
param = parse.urlencode({'id': '2'})
data = bytes(param, encoding='utf8')
response = request.urlopen(r'http://www.baidu.com', data=data, timeout=3)

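The timeout argument of urlopen sets the timeout in seconds, and data sets the request body. The body must be bytes, and when data is supplied the request is sent as a POST instead of a GET.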
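The difference between the two request styles can be checked without sending anything. The sketch below builds both requests and inspects them; BASE_URL is a hypothetical local endpoint used only for illustration.

```python
from urllib import parse, request

# Hypothetical endpoint, used only for illustration; nothing is sent.
BASE_URL = 'http://127.0.0.1:8080/test/user'
params = parse.urlencode({'id': '2'})

# GET: the parameters travel in the query string.
get_req = request.Request(BASE_URL + '?' + params)

# POST: the parameters travel in the body, which must be bytes.
post_req = request.Request(BASE_URL, data=params.encode('utf-8'))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

Passing either object to urlopen would send the request with the method shown by get_method().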

urlencode encodes a dict of parameters into a URL query string:

param = parse.urlencode({'id': '2', 'name': '中文'}, encoding='utf-8')
# id=2&name=%E4%B8%AD%E6%96%87
print(param)
# %E4%B8%AD%E6%96%87
print(parse.quote("中文"))
# 中文
print(parse.unquote("%E4%B8%AD%E6%96%87"))
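Going the other way is often useful too. As a small sketch, parse_qs decodes a query string back into a dict, and urlsplit extracts the query part from a full URL. The tag parameter is made up here to show how repeated keys behave with doseq=True.

```python
from urllib import parse

# doseq=True expands a list value into repeated keys (tag=a&tag=b).
query = parse.urlencode(
    {'id': '2', 'tag': ['a', 'b'], 'name': '中文'},
    doseq=True,
)
print(query)

# parse_qs reverses the encoding; every value comes back as a list.
decoded = parse.parse_qs(query)
print(decoded)

# urlsplit separates a full URL into its components.
parts = parse.urlsplit('http://www.baidu.com/s?' + query)
print(parts.netloc, parts.path)
```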

3. Response

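| Method or attribute | Description |
| --- | --- |
| read() | Returns the response body as bytes |
| status | HTTP status code; 200 means success |
| getcode() | HTTP status code, same as status |
| reason | Reason phrase; OK on success |
| msg | Same as reason for responses returned by urlopen; OK on success |
| getheader('header_name') | Returns the value of the named header |
| getheaders() | Returns all headers as a list of tuples |
| version | HTTP protocol version (10 for HTTP/1.0, 11 for HTTP/1.1) |
| debuglevel | Debug level |
| closed | Boolean indicating whether the response object is closed |
| geturl() | URL of the resource that was retrieved |
| info() | Response headers and other meta-information |
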
import urllib.request

# timeout is given in seconds
response = urllib.request.urlopen('http://www.baidu.com', timeout=3)

# Page content
print(response.read().decode('utf-8'))
# A specific header
print(response.getheader('Content-Type'))
# All headers as a list of tuples
print(response.getheaders())
# HTTP protocol version
print(response.version)
# Status code
print(response.status)
# Debug level
print(response.debuglevel)
# Whether the object is closed
print(response.closed)
# URL
print(response.geturl())

# HTTP status code
print(response.getcode())
# msg
print(response.msg)
# Reason phrase
print(response.reason)

# Headers and other meta-information
print(response.info())
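The examples above decode every body as UTF-8. A slightly more careful pattern, sketched below, closes the response with a with block and takes the charset from the Content-Type header when the server declares one. read_text is a hypothetical helper name, and the data: URL is used only so the example runs without network access.

```python
from urllib import request


def read_text(url, timeout=3, default_charset='utf-8'):
    """Fetch url and decode the body using the declared charset, if any.

    Hypothetical helper for illustration; timeout is in seconds.
    """
    with request.urlopen(url, timeout=timeout) as response:
        # headers is an email.message.Message; get_content_charset()
        # returns None when Content-Type carries no charset parameter.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# A data: URL keeps the example offline; an http:// URL works the same way.
text = read_text('data:text/plain;charset=utf-8,%E4%B8%AD%E6%96%87')
print(text)
```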

4. Request

from urllib import request, parse

url = 'http://127.0.0.1:8080/test/user'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
}

data = {'id': '1', 'name': 'tim'}
params = parse.urlencode(data)
byte_params = bytes(params, encoding='utf-8')
rst = request.Request(url=url, data=byte_params, headers=headers, method='POST')
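rst.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')
rst.add_header('Accept-Language', 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2')
# urllib does not decompress responses automatically; if the server honors
# this header, decompress the body before decoding it.
rst.add_header('Accept-Encoding', 'gzip, deflate, br')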

response = request.urlopen(rst)
print(response.read().decode('utf-8'))
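A Request object can be inspected before it is sent, which helps when debugging headers. One detail worth knowing is that Request stores header names in capitalized form, so 'User-Agent' is kept as 'User-agent'. The following is a minimal sketch; the URL and the User-Agent value are placeholders, and nothing is sent.

```python
from urllib import parse, request

# Placeholder endpoint and client name, used only for illustration.
url = 'http://127.0.0.1:8080/test/user'
body = parse.urlencode({'id': '1', 'name': 'tim'}).encode('utf-8')

req = request.Request(
    url,
    data=body,
    headers={'User-Agent': 'demo-client/0.1'},
    method='POST',
)
req.add_header('Accept-Language', 'zh-CN,zh;q=0.8')

# Header names are stored via str.capitalize(), so look them up that way.
print(req.get_method())
print(req.get_header('User-agent'))
print(req.has_header('User-Agent'))
print(sorted(req.header_items()))
```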

5. Exceptions

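URLError is defined in the urllib.error module and is a subclass of OSError. Exceptions raised by the request module can be handled by catching this class. It has one attribute, reason, which describes the cause of the error.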

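HTTPError is a subclass of URLError and has three attributes: code is the HTTP status code, reason is the cause of the error, and headers holds the response headers. Because it is a subclass, its except clause must come before the one for URLError.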

from urllib import request,error

url = 'http://127.0.0.1:8080/test/user'

try:
    response = request.urlopen(url, timeout=1)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
    print("HTTPError:" + str(type(e)))
except error.URLError as e:
    print(e.reason)
    print("URLError:" + str(type(e)))
else:
    print('success')
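The same distinction can be captured in a small function. In the sketch below, describe_error is a hypothetical helper and the exceptions are constructed by hand so that it runs without a server; in real code they would come from urlopen.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe_error(exc):
    """Turn a urllib exception into a short message (hypothetical helper)."""
    # Check the subclass first: every HTTPError is also a URLError.
    if isinstance(exc, HTTPError):
        return 'HTTP {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, URLError):
        return 'URL error: {}'.format(exc.reason)
    raise TypeError('not a urllib error: {!r}'.format(exc))


# Hand-built exceptions, for illustration only.
not_found = HTTPError('http://127.0.0.1:8080/test/user', 404, 'Not Found',
                      Message(), io.BytesIO(b''))
refused = URLError('connection refused')

print(describe_error(not_found))
print(describe_error(refused))
```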

6. urllib handler processing flow

[Figure: handler flow]
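The handler flow can also be observed from code. An opener calls each handler's <protocol>_request method before the request is opened, then a <protocol>_open method to produce the response, then each <protocol>_response method afterwards. The sketch below hooks the first and last stage; TraceHandler is a hypothetical class, and it targets the data: scheme only so that it runs offline. For HTTP the methods would be named http_request and http_response.

```python
from urllib.request import BaseHandler, build_opener


class TraceHandler(BaseHandler):
    """Record when the opener runs the pre- and post-processing hooks."""

    def __init__(self):
        self.events = []

    def data_request(self, req):
        # Called before the request is opened; must return a Request.
        self.events.append('request')
        return req

    def data_response(self, req, response):
        # Called after a response is produced; must return a response.
        self.events.append('response')
        return response


tracer = TraceHandler()
# build_opener adds the default handlers alongside the custom one.
opener = build_opener(tracer)

with opener.open('data:text/plain,hello') as response:
    body = response.read()

print(tracer.events, body)
```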

7. Cookies

7.1 Getting cookies

from http import cookiejar
from urllib import request

url = 'http://127.0.0.1:8080/test/cookie'
cookie = cookiejar.CookieJar()
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))

for ck in cookie:
    print(ck.name + ":" + ck.value)

7.2 Saving and reusing cookies

from http import cookiejar
from urllib import request

url = 'http://127.0.0.1:8080/test/cookie'
filename = r'F:\tmp\cookies.txt'

# First request: receive the cookie and save it to a file
# cookie = cookiejar.MozillaCookieJar(filename=filename)
cookie = cookiejar.LWPCookieJar(filename=filename)
handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
cookie.save(ignore_discard=True, ignore_expires=True)


# Second request: load the saved cookie and send it back
# cookie = cookiejar.MozillaCookieJar()
cookie = cookiejar.LWPCookieJar()
cookie.load(filename, ignore_discard=True, ignore_expires=True)

handler = request.HTTPCookieProcessor(cookie)
opener = request.build_opener(handler)
response = opener.open(url)
print(response.read().decode('utf-8'))
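The ignore_discard=True argument matters for cookies like the one in this example. A cookie without an expiry time is a session cookie, which the jar marks as discard and skips on save() and load() unless ignore_discard is set. The sketch below shows this offline with a hand-built cookie and a temporary file; make_session_cookie and round_trip are hypothetical helpers.

```python
import os
import tempfile
from http import cookiejar


def make_session_cookie(name, value, domain='127.0.0.1'):
    """Build a session cookie by hand (hypothetical helper)."""
    return cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def round_trip(ignore_discard):
    """Save one session cookie, load it back, and return the loaded names."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'cookies.txt')
        jar = cookiejar.LWPCookieJar(filename=path)
        jar.set_cookie(make_session_cookie('pyck', 'happy'))
        jar.save(ignore_discard=ignore_discard)

        loaded = cookiejar.LWPCookieJar()
        loaded.load(path, ignore_discard=ignore_discard)
        return [c.name for c in loaded]


print(round_trip(ignore_discard=False))
print(round_trip(ignore_discard=True))
```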

7.3 Server-side code

@RequestMapping("/cookie")
public String cookie(HttpServletRequest request,
                     HttpServletResponse response,
                     @CookieValue(value = "pyck", required = false, defaultValue = "dfck") String pyck) {
    Cookie[] cookies = request.getCookies();
    if (cookies != null) {
        for (Cookie cookie : cookies) {
            System.out.println(cookie.getName() + " " + cookie.getValue());
        }
    }
    Cookie cookie = new Cookie("pyck", "happy");
    response.addCookie(cookie);
    System.out.println("pyck:" + pyck);
    return pyck;
}

8. Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy = ProxyHandler({
    'http': 'http://127.0.0.1:7777',
    'https': 'http://127.0.0.1:8888'
})
opener = build_opener(proxy)
try:
    response=opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

9. Auth

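Auth here means HTTP Basic authentication. Basic auth is usually provided by the server itself, with users, passwords, and permissions set directly in its configuration. It is not the kind of login we usually see, because applications generally implement their own login.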

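It is still worth understanding, because many monitoring components do not implement their own login and registration and simply rely on the Basic auth the server provides. Tomcat's monitoring pages are one example.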

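The following shows how to use HTTP Basic auth from Python. First download Tomcat, then open tomcat-users.xml in the conf directory under the Tomcat root and add the following inside the tomcat-users element: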

<role rolename="admin-gui"/>
<role rolename="manager-gui"/>
<role rolename="manager-jmx"/>
<role rolename="manager-script"/>
<role rolename="manager-status"/>
<user username="tim" password="123456" roles="admin-gui,manager-gui,manager-jmx,manager-script,manager-status"/>

Run the startup script in Tomcat's bin directory to start the server.

[Figure: tomcat-auth]

from urllib.request import HTTPPasswordMgrWithDefaultRealm
from urllib.request import HTTPBasicAuthHandler
from urllib.request import build_opener
from urllib import request, error

username = 'tim'
password = '123456'
url = 'http://localhost:8080/manager/status'

pwdMg = HTTPPasswordMgrWithDefaultRealm()
pwdMg.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(pwdMg)

opener = build_opener(auth_handler)

try:
    response = opener.open(url)
    html = response.read().decode('utf8')
    print(html)
except error.URLError as e:
    print(e.reason)

# Without auth the server responds with 401
try:
    response = request.urlopen(url)
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
else:
    print('success')
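HTTPBasicAuthHandler sends the credentials only after the server has answered with 401. What it sends is an Authorization header containing the word Basic and the Base64 encoding of username:password. As a sketch of the same mechanism done by hand, the hypothetical helper below attaches that header up front, which avoids the extra round trip. Nothing is sent here; the request is only built and inspected.

```python
import base64
from urllib import request


def basic_auth_request(url, username, password):
    """Build a Request carrying a Basic Authorization header.

    Hypothetical helper for illustration; it does not send anything.
    """
    credentials = '{}:{}'.format(username, password).encode('utf-8')
    token = base64.b64encode(credentials).decode('ascii')
    req = request.Request(url)
    req.add_header('Authorization', 'Basic ' + token)
    return req


req = basic_auth_request('http://localhost:8080/manager/status', 'tim', '123456')
print(req.get_header('Authorization'))
```

Base64 is an encoding, not encryption, so Basic auth should only be used over HTTPS outside of local testing.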

10. Summary

[Figure: urllib summary]
