Python's urllib and http modules

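1. Python's urllib package is used to fetch web pages by URL and to process their content.

The urllib package contains the following modules:

urllib.request: opens and reads URLs

urllib.error: contains the exceptions raised by urllib.request

urllib.parse: parses URLs

urllib.robotparser: parses robots.txt files

The urllib request module

urllib.request defines functions and classes for opening URLs, with support for authentication, redirects, browser cookies, and more.

urllib.request can simulate the way a browser issues a request.

We can open a URL with the urlopen function of urllib.request. Its syntax is as follows: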

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

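url: the URL to open

data: additional data to send to the server; defaults to None

timeout: the timeout for the request, in seconds

cafile and capath: cafile is a CA certificate file and capath is a directory of CA certificates; they are used for HTTPS requests

cadefault: deprecated; the value is ignored (cafile, capath, and cadefault were deprecated in Python 3.6 in favor of context)

context: an ssl.SSLContext instance that specifies the SSL settings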
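
As a rough illustration of the timeout and context parameters, the sketch below wraps urlopen in a small helper. The helper name open_url and the data: URL in the demo call are assumptions made for illustration: the data: URL only lets the snippet run without network access, and in practice the argument would be an https:// address.

```python
import ssl
import urllib.request


def open_url(url, timeout=5.0):
    """Hypothetical helper: open a URL with a timeout and default SSL settings."""
    # create_default_context() enables certificate and host name verification
    context = ssl.create_default_context()
    return urllib.request.urlopen(url, timeout=timeout, context=context)


# Demo call with an inline data: URL (assumption: stands in for a real https URL)
with open_url("data:text/plain,hello") as resp:
    body = resp.read()

print(body)
```

Passing an explicit context is the replacement for the older cafile and capath arguments; the SSL settings only take effect for https URLs.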

Use read to read the entire content

import urllib.request
myurl=urllib.request.urlopen('http://www.baidu.com')
print(myurl.read()) # read returns the whole body of the page as bytes

A length can be given; here the first 300 bytes are read

import urllib.request
myurl=urllib.request.urlopen('http://www.baidu.com')
print(myurl.read(300))

Use readline to read a single line

import urllib.request
myurl=urllib.request.urlopen('http://www.baidu.com')
print(myurl.readline())

readlines reads all of the content and returns it as a list of lines

import urllib.request
myurl=urllib.request.urlopen('http://www.baidu.com')
print(myurl.readlines())
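
To make the difference between read, readline, and readlines easier to see, here is a small sketch that runs all three on one response. It assumes a three-line data: URL as a stand-in for a real page, so it works offline; the variable names are illustrative only.

```python
import urllib.request

# Assumption: an inline data: URL with three lines replaces a real web page
URL = "data:text/plain,first%0Asecond%0Athird%0A"

with urllib.request.urlopen(URL) as resp:
    head = resp.read(5)             # the first 5 bytes
    rest_of_line = resp.readline()  # continues from the current position
    remaining = resp.readlines()    # every line that is left, as a list

print(head)
print(rest_of_line)
print(remaining)
```

All three methods return bytes and share one read position, so each call continues where the previous one stopped.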

Checking whether a page can be accessed normally

import urllib.request
import urllib.error
myurl=urllib.request.urlopen('http://www.baidu.com')
print(myurl.getcode())  # the HTTP status code of the response
try:
    myurl2=urllib.request.urlopen('http://www.baidu.com/no.html')
except urllib.error.HTTPError as e:
    if e.code==404:
        print(404)

Fetching a page and saving it locally

import urllib.request
myurl=urllib.request.urlopen('http://www.baidu.com')
f=open('1.html','wb')
content=myurl.read()
f.write(content)
f.close()
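
A variation on the snippet above, sketched under a few assumptions: the helper name save_url is made up, the demo writes into a temporary directory, and a data: URL replaces the real site so that the code runs offline. It uses with blocks so the response and the file are closed even if an error occurs, and it copies the body in chunks instead of loading it into memory at once.

```python
import os
import shutil
import tempfile
import urllib.request


def save_url(url, path, timeout=10):
    """Hypothetical helper: stream the response body of url into a local file."""
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(path, "wb") as f:
        shutil.copyfileobj(resp, f)  # copies chunk by chunk


# Assumption: a data: URL stands in for a real page
target = os.path.join(tempfile.mkdtemp(), "1.html")
save_url("data:text/html,%3Ch1%3Ehello%3C/h1%3E", target)

with open(target, "rb") as f:
    saved = f.read()

print(saved)
```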

URLs can be encoded and decoded with the quote and unquote functions of urllib.parse

import urllib.parse
encode=urllib.parse.quote('https://www.runoob.com/')  # percent-encode the string
print(encode)
decode=urllib.parse.unquote(encode)  # decode the value of encode
print(decode)

Result:
https%3A//www.runoob.com/
https://www.runoob.com/
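
The result above keeps the slashes because quote treats "/" as safe by default. The following sketch shows how the safe argument and the related quote_plus function change the output; the sample strings are chosen only for illustration.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

path = quote("a b/c")             # "/" is left untouched by default
strict = quote("a b/c", safe="")  # safe="" encodes "/" as well
form = quote_plus("python 教程")   # spaces become "+", as in query strings

print(path)    # a%20b/c
print(strict)  # a%20b%2Fc
print(form)    # python+%E6%95%99%E7%A8%8B

# unquote and unquote_plus reverse the two encodings
print(unquote(strict))
print(unquote_plus(form))
```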

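Simulating request headers:

When fetching pages we usually need to simulate the request headers a browser would send, which is done with the urllib.request.Request class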

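class urllib.request.Request(url,data=None,headers={},origin_req_host=None,unverifiable=False,method=None)

url: the URL
data: additional data to send to the server; defaults to None
headers: the HTTP request headers, as a dictionary
origin_req_host: the host of the original request, as an IP address or domain name
unverifiable: rarely used; indicates whether the request is unverifiable, that is, one whose URL the user had no option to approve; defaults to False
method: the request method, such as GET, POST, DELETE, or PUT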
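
A Request object can be inspected before it is sent, which is a convenient way to see how the parameters above interact. The sketch below sends nothing over the network; the example.com URL and the User-Agent value are placeholders, not values from this tutorial.

```python
import urllib.parse
import urllib.request

payload = urllib.parse.urlencode({"name": "RUNOOB"}).encode()

req = urllib.request.Request(
    "https://www.example.com/form",             # placeholder URL
    data=payload,
    headers={"User-Agent": "demo-client/0.1"},  # made-up header value
)

print(req.full_url)
print(req.get_method())              # POST, because data is not None
print(req.get_header("User-agent"))  # header names are stored capitalized
print(req.data)

# An explicit method overrides the GET/POST default
put_req = urllib.request.Request(
    "https://www.example.com/form", data=payload, method="PUT"
)
print(put_req.get_method())
```

Without data and without method, get_method() returns GET.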

Example 1-1

import urllib.request
import urllib.parse
url='https://www.runoob.com/?s=' # the search URL of runoob.com
keyword='java教程'  # the search keyword ("Java tutorial")
keycode=urllib.parse.quote(keyword)  # percent-encode the keyword
urlall=url+keycode
header={'User-Agent':'Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
# build the HTTP request as a Request object
request=urllib.request.Request(urlall,headers=header)
# send the request and store the result in response
response=urllib.request.urlopen(request)
print(response.read())

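Running the code above prints the source of the runoob.com results page for the search "java教程"

Example 1-2: sending data with POST

1. First define an HTML page with a POST form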

<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"><title>Document</title>
</head>
<body>
<form action="" method="post" name="myform">Name:<input type="text" name="name"><br>Pass:<input type="text" name="pass"><br><input type="submit" value="Submit"></form>
<hr>
<?php if(isset($_POST['name']) && $_POST['pass']){echo 'hello world!';} ?>
</body>
</html>

Submit the data with urllib and look at the source of the page that comes back

import urllib.request
import urllib.parse
url='https://www.runoob.com/try/py3/py3_urllib_test.php'  # the form page to submit to
data={'name':'RUNOOB','tag':'菜鸟教程'}   # the data to submit
header={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0'
}
data=urllib.parse.urlencode(data).encode() # encode the parameters; urllib.parse.parse_qs() decodes them again
request=urllib.request.Request(url,data,header)
response=urllib.request.urlopen(request).read()
print(response.decode())

The returned source is shown below; save it as 1.html and open that file to see the page echo the submitted values

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>菜鸟教程(runoob.com) urllib POST  测试</title>
</head>
<body>
<form action="" method="post" name="myForm">Name: <input type="text" name="name"><br>Tag: <input type="text" name="tag"><br><input type="submit" value="提交">
</form>
<hr>
RUNOOB, 菜鸟教程</body>
</html>
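
The encoding step in the POST example can be looked at on its own. The sketch below reuses the same form fields and shows that urlencode produces a percent-encoded string, which is turned into bytes for the request body and can be decoded again with parse_qs or parse_qsl.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

fields = {"name": "RUNOOB", "tag": "菜鸟教程"}

encoded = urlencode(fields)  # str: key=value pairs joined by "&"
body = encoded.encode()      # bytes, the form expected for POST data

decoded = parse_qs(body.decode())  # dict mapping each key to a list of values
pairs = parse_qsl(encoded)         # list of (key, value) tuples

print(encoded)
print(decoded)
print(pairs)
```

parse_qs returns lists because a key may appear more than once in a query string.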

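The urllib error module

1. The urllib.error module defines the exception classes for exceptions raised by urllib.request. The base exception class is URLError.

urllib.error contains two exception classes, URLError and HTTPError.

URLError is a subclass of OSError and is raised when a handler runs into a problem. Its reason attribute holds the cause of the exception.

HTTPError is a subclass of URLError that covers special HTTP errors, such as authentication requests. Its code attribute is the HTTP status code, reason is the cause of the exception, and headers holds the HTTP response headers for the request that caused the HTTPError.

Example 1-1: fetching a page that does not exist and handling the exception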

import urllib.request
import urllib.error
myURL1 = urllib.request.urlopen("https://www.runoob.com/")
print(myURL1.getcode())   # 200
try:
    myURL2 = urllib.request.urlopen("https://www.runoob.com/no.html")
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(404)   # 404
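
Because HTTPError is a subclass of URLError, the order of the except clauses matters. The sketch below is one way to handle both, under stated assumptions: DemoHandler is a throwaway local server written for this demo (so the code runs without internet access), probe is a hypothetical helper name, and proxies are switched off so that 127.0.0.1 is reached directly.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: 200 for "/", 404 for every other path."""

    def do_GET(self):
        if self.path == "/":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# ProxyHandler({}) disables proxy use for this opener
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url, timeout=5):
    """Return (status, reason); status is None if no HTTP response arrived."""
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status, resp.reason
    except urllib.error.HTTPError as e:  # the subclass must be caught first
        return e.code, e.reason
    except urllib.error.URLError as e:   # for example, connection refused
        return None, e.reason


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

ok = probe(base + "/")
missing = probe(base + "/no.html")

server.shutdown()
server.server_close()
refused = probe(base + "/")  # nothing is listening any more

print(ok)
print(missing)
print(refused)
```

If the URLError clause came first, it would also swallow every HTTPError and the status code would be lost.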

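The urllib parse module

The urllib.parse module is used to parse URLs. The syntax is as follows:

urllib.parse.urlparse(urlstring,scheme='',allow_fragments=True)

urlstring: the URL as a string; scheme is the protocol to assume when the URL does not specify one
allow_fragments: if false, fragment identifiers are not recognized; they are parsed as part of the path, parameters, or query component, and fragment is set to the empty string in the return value

Example 1-1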

from urllib.parse import urlparse
o=urlparse("https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B")
print(o)

Result:
ParseResult(scheme='https', netloc='www.runoob.com', path='/', params='', query='s=python+%E6%95%99%E7%A8%8B', fragment='')

As shown above, the result is a tuple of 6 strings: scheme, network location, path, parameters, query, and fragment.

We can read the scheme directly:

from urllib.parse import urlparse
o=urlparse('https://www.runoob.com/?s=python+%E6%95%99%E7%A8%8B')
print(o.scheme)  # scheme is the protocol

Result:
https
This shows that the URL uses the https protocol.

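Attributes of the urlparse result

Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme (protocol)                 scheme parameter
netloc      1       Network location                      empty string
path        2       Hierarchical path                     empty string
params      3       Parameters of the last path element   empty string
query       4       Query component                       empty string
fragment    5       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as an integer, if present None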
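
The attributes in the table can be checked on a URL that contains every component. The URL below is invented for this sketch; the second call shows the effect of allow_fragments=False.

```python
from urllib.parse import parse_qs, urlparse

# Assumption: a made-up URL that exercises every attribute in the table
o = urlparse(
    "https://user:pw@Example.com:8443/docs/index.html;v=1?s=python&page=2#top"
)

print(o.scheme, o.netloc)
print(o.path, o.params, o.query, o.fragment)
print(o.username, o.password, o.hostname, o.port)
print(o[0] == o.scheme)   # index access and attribute access agree
print(parse_qs(o.query))  # split the query into a dict of lists

# With allow_fragments=False the "#top" part stays in the path
no_frag = urlparse("https://example.com/a#top", allow_fragments=False)
print(no_frag.path, repr(no_frag.fragment))
```

Note that hostname is lowercased while netloc keeps the original spelling, and port is an int rather than a string.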

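Introduction to the http package:

The http package provides tools for working with the HTTP protocol. Its main modules are:

http.client    low-level HTTP protocol client; it is used by the urllib.request module
http.server    basic HTTP server classes
http.cookies   utilities for managing cookies
http.cookiejar  provides persistence for cookies

The classes used on the client side in the http.client module are:
HTTPConnection    a client for access over HTTP
HTTPSConnection   a client for access over HTTPS
HTTPResponse      the server's response over HTTP

The HTTPConnection constructor has the following prototype:
HTTPConnection(host,port=None,[timeout,]source_address=None)

The parameters are:
host    the address of the server
port    the server port; if it is omitted, the port is taken from host when host has the form host:port, otherwise port 80 is used
timeout    the timeout in seconds

The main method of an HTTPConnection object is:
request(method,url,body,headers)
method    the request method, usually GET or POST
url       the URL to operate on
body      the data to send
headers   the HTTP headers to send

After a request has been sent to the server, the getresponse() method of the HTTPConnection object returns an HTTPResponse object, and the close() method closes the connection to the server. Besides request, the following methods can also be used to send a request to the server:
putrequest(request,selector,skip_host,skip_accept_encoding)
putheader(header,argument,...)
endheaders()
send(data)

The parameters of putrequest are:
request    the operation to send, such as POST, GET, or PUT
selector   the URL to operate on
skip_host  optional; if true, the Host: header is not sent automatically
skip_accept_encoding    optional; if true, the Accept-Encoding: header is not sent automatically

The parameters of putheader are:
header    the name of the HTTP header to send
argument    the value of the header

The parameter of send is:
data    the data to send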
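
The putrequest, putheader, endheaders, and send sequence described above can be sketched as follows. To keep the example runnable without internet access, it assumes a throwaway local server (EchoHandler, written for this demo) that sends the POST body back; the /echo path and the payload are made up as well.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: replies to POST with the received body."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

payload = b"name=RUNOOB"
conn = HTTPConnection(host, port, timeout=5)
conn.putrequest("POST", "/echo")  # request line; Host is added automatically
conn.putheader("Content-Type", "application/x-www-form-urlencoded")
conn.putheader("Content-Length", str(len(payload)))
conn.endheaders()                 # finishes the header section
conn.send(payload)                # the request body
res = conn.getresponse()
status, echoed = res.status, res.read()
conn.close()

server.shutdown()
server.server_close()

print(status, echoed)
```

For most requests the single call conn.request("POST", "/echo", body=payload, headers={...}) does the same work; the step-by-step form is mainly useful when the headers or body have to be produced incrementally.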

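Example 1-1: accessing a website with an http.client.HTTPConnection object

from http.client import HTTPConnection
mc=HTTPConnection('www.baidu.com')  # create an HTTP client connection object
mc.request('GET','/')  # GET request for the root path
res=mc.getresponse()  # get the server's response
print(res.status,res.reason)  # status is the status code; reason is the matching phrase, such as OK
print(res.read().decode())  # read the whole body of the response from line 4 and decode the bytes to a string

Explanation: this is a basic access example. It instantiates an http.client.HTTPConnection object, sends a request with the GET method, calls getresponse() to obtain the page, and prints the response status and the page content.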