Python Web Scraping: A Quick Start

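A web crawler is an automated program that fetches data from the internet. Python's concise syntax and rich library ecosystem make it a good fit for writing one. This article is a quick introduction to web scraping with Python: installing the required libraries, writing a simple crawler, and then handling more complex situations.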

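1. Environment Setup

1.1 Installing Python

Make sure Python is installed. You can download the latest version from the official Python website. The examples in this article use f-strings and asyncio.run, so they need Python 3.7 or later.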

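1.2 Installing the Required Libraries

We will use the following libraries to write the crawler:

requests: sends HTTP requests.
BeautifulSoup: parses HTML.
lxml: a fast parser backend that BeautifulSoup can use.

Install them with pip: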

pip install requests beautifulsoup4 lxml

2. Writing Your First Crawler

2.1 Sending an HTTP Request

Use the requests library to send an HTTP request and fetch the page content.

import requests

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    print(response.text)
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")
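
The snippet above sets no timeout, so a stalled server could block it indefinitely, and network errors would surface as uncaught exceptions. A minimal sketch of a more defensive version follows. The helper name fetch_html and the 10-second timeout are illustrative assumptions, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    fetch_html is a hypothetical helper name used for illustration.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as exc:
        # Covers connection errors, timeouts, invalid URLs and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

A call such as fetch_html('https://www.example.com') would then return either the HTML text or None, which keeps the calling code free of status-code checks.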

2.2 Parsing HTML

Use the BeautifulSoup library to parse the HTML and extract the data you need.

import requests
from bs4 import BeautifulSoup

url = 'https://www.example.com'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'lxml')
    title = soup.find('title').text
    print(f"Title: {title}")

    # Extract all links on the page
    links = soup.find_all('a')
    for link in links:
        print(link.get('href'))
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

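3. Handling Requests and Responses

3.1 Setting Request Headers

Some websites inspect request headers to block crawlers. You can set headers so that the request looks like it comes from a browser.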

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
response = requests.get(url, headers=headers)
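
To check which headers a request would carry without contacting the server, requests can build the request object first. The sketch below is one way to do that with requests.Request and Session.prepare_request. The shortened User-Agent value and the Accept-Language header are assumptions chosen for illustration.

```python
import requests

# Illustrative header values; substitute the ones your target site expects.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Accept-Language': 'en-US,en;q=0.9',
}

session = requests.Session()
# Headers set on the session are sent with every request made through it.
session.headers.update(headers)

# Build the request without sending it, so the final headers can be inspected.
prepared = session.prepare_request(
    requests.Request('GET', 'https://www.example.com')
)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Once the headers look right, session.send(prepared, timeout=10) would actually send the request.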

3.2 Handling Cookies

Some websites require cookies to work properly. The requests library lets you send cookies with a request.

cookies = {
    'session_id': 'abc123'
}
response = requests.get(url, headers=headers, cookies=cookies)
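
When several requests need to share cookies, for example after a login, a requests.Session keeps them between calls. A minimal sketch, reusing the placeholder session_id value from above. No request is sent here.

```python
import requests

session = requests.Session()
# Store a cookie on the session; it is attached to later requests automatically.
session.cookies.set('session_id', 'abc123')

# Cookies that a server sets in a response are added to session.cookies too,
# so a login request followed by a data request can share state.
print(session.cookies.get('session_id'))
```

Later calls made through the same object, such as session.get(url, headers=headers), would then send the stored cookie.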

3.3 Handling Redirects

By default, requests follows redirects automatically. You can disable this and handle redirects yourself.

response = requests.get(url, headers=headers, allow_redirects=False)

if response.status_code == 302:
    redirect_url = response.headers['location']
    print(f"Redirected to: {redirect_url}")
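
The check above only covers status 302, while servers also redirect with 301, 303, 307 and 308, and the Location header may hold a relative URL. The following sketch handles both cases with the standard library only. The helper name redirect_target is a hypothetical choice for this example.

```python
from urllib.parse import urljoin

# Status codes that signal a redirect with a Location header.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url, status_code, headers):
    """Return the absolute redirect URL, or None if this is not a redirect.

    redirect_target is a hypothetical helper; headers is any mapping that
    holds the response headers, such as response.headers from requests.
    """
    if status_code not in REDIRECT_CODES:
        return None
    # A plain dict is case-sensitive; response.headers from requests is not.
    location = headers.get('Location')
    if location is None:
        return None
    # Location may be relative, so resolve it against the current URL.
    return urljoin(current_url, location)


print(redirect_target('https://www.example.com/a/', 302, {'Location': '../login'}))
```

With a real response, the call would look like redirect_target(url, response.status_code, response.headers).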

4. Parsing HTML

4.1 Extracting Text

Use BeautifulSoup to extract text content.

soup = BeautifulSoup(response.text, 'lxml')
title = soup.find('title').text
print(f"Title: {title}")
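
Note that soup.find('title') returns None when a page has no title element, in which case accessing .text raises AttributeError. A small sketch of a safer variant follows. The function name page_title is an assumption, and the built-in 'html.parser' is used here only so the snippet runs without lxml; passing 'lxml' should work the same way.

```python
from bs4 import BeautifulSoup


def page_title(html):
    """Return the stripped <title> text, or None when the page has no title."""
    # 'html.parser' ships with Python; 'lxml' could be passed instead.
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)


print(page_title('<html><head><title> Example Domain </title></head></html>'))
print(page_title('<html><body><p>No title here</p></body></html>'))
```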

4.2 Extracting Attributes

Extract attribute values from HTML elements.

links = soup.find_all('a')
for link in links:
    href = link.get('href')
    print(href)
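
Many href values are relative paths, and link.get('href') returns None for anchors without an href. A crawler that follows links usually needs absolute URLs, which urllib.parse.urljoin can produce. The sketch below assumes a hypothetical helper named absolute_links and an inline HTML sample.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return every href in the page as an absolute URL."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    # href=True skips <a> tags that have no href attribute.
    for link in soup.find_all('a', href=True):
        urls.append(urljoin(base_url, link['href']))
    return urls


sample = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="https://www.example.org/">Other</a>'
)
print(absolute_links(sample, 'https://www.example.com/index.html'))
```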

4.3 Extracting Multiple Elements

Extract every element of the same type.

paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.text)

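5. Advanced Topics

5.1 Asynchronous Crawling

Use aiohttp together with asyncio to write an asynchronous crawler. It fetches several pages concurrently instead of one after another, which improves throughput. aiohttp is a separate package; install it with pip install aiohttp.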

import aiohttp
import asyncio
from bs4 import BeautifulSoup


async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()


async def main():
    urls = [
        'https://www.example.com',
        'https://www.example2.com',
        'https://www.example3.com',
    ]
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        responses = await asyncio.gather(*tasks)
        for response in responses:
            soup = BeautifulSoup(response, 'lxml')
            title = soup.find('title').text
            print(f"Title: {title}")


if __name__ == '__main__':
    asyncio.run(main())
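
asyncio.gather as used above starts every request at once, which can overload a site when the URL list is long. One common way to cap concurrency is asyncio.Semaphore. The sketch below shows the pattern with a simulated fetch (asyncio.sleep stands in for session.get) so that it runs without network access. The names fake_fetch, bounded_fetch and crawl, and the limit of 2, are illustrative assumptions.

```python
import asyncio


async def fake_fetch(url):
    # Stand-in for the aiohttp request; sleeps instead of using the network.
    await asyncio.sleep(0.01)
    return f"<title>{url}</title>"


async def bounded_fetch(semaphore, url):
    # At most `limit` coroutines can be inside this block at the same time.
    async with semaphore:
        return await fake_fetch(url)


async def crawl(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    tasks = [bounded_fetch(semaphore, url) for url in urls]
    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*tasks)


pages = asyncio.run(crawl([
    'https://www.example.com/1',
    'https://www.example.com/2',
    'https://www.example.com/3',
]))
print(len(pages))
```

In a real crawler, fake_fetch would be replaced by a coroutine like fetch(session, url), with the semaphore wrapped around it in the same way.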

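5.2 Dealing with Anti-Crawling Measures

Some websites use anti-crawling measures such as CAPTCHAs and IP bans. You can respond in the following ways:

Use proxy IPs: send requests from different IP addresses.
Set a reasonable request interval: avoid sending a large number of requests in a short time.
Handle CAPTCHAs: recognize them with OCR or a third-party service.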
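
The second point, spacing out requests, can be sketched with a small helper that waits a randomized interval before each request. The names polite_delay, min_seconds and max_seconds are illustrative assumptions, and so is the default range of 1 to 3 seconds; a suitable interval depends on the site.

```python
import random
import time


def polite_delay(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random interval and return the number of seconds waited.

    The sleep argument exists so the pause can be replaced in tests.
    """
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


urls = ['https://www.example.com/page1', 'https://www.example.com/page2']
for url in urls:
    # A real crawler would call requests.get(url, timeout=10) after the pause.
    waited = polite_delay(0.01, 0.02)
    print(f"Waited {waited:.3f}s before fetching {url}")
```

For the first point, requests accepts a proxies argument, for example requests.get(url, proxies={'https': 'http://127.0.0.1:8080'}), where the proxy address is only a placeholder.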

5.3 Storing Data

Store the scraped data in a file or a database.

Storing to a File

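# Set the encoding explicitly so non-ASCII pages are written correctly on every platform
with open('data.txt', 'w', encoding='utf-8') as f:
    f.write(response.text)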
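
Raw HTML is rarely what you want to keep. When the crawler extracts structured fields, one simple option is to write one JSON object per line. The sketch below assumes hypothetical helpers save_records and load_records, a file named pages.jsonl, and made-up sample records.

```python
import json


def save_records(path, records):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII text readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + '\n')


def load_records(path):
    """Read the records back into a list of dictionaries."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Sample data for illustration only.
records = [
    {'url': 'https://www.example.com', 'title': 'Example Domain'},
    {'url': 'https://www.example.com/about', 'title': 'About'},
]
save_records('pages.jsonl', records)
print(load_records('pages.jsonl'))
```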

Storing to a Database

Use the sqlite3 library to store the data in a SQLite database.

import sqlite3

conn = sqlite3.connect('data.db')
cursor = conn.cursor()

cursor.execute('''CREATE TABLE IF NOT EXISTS pages (url TEXT, content TEXT)''')

url = 'https://www.example.com'
content = response.text

cursor.execute('INSERT INTO pages (url, content) VALUES (?, ?)', (url, content))
conn.commit()
conn.close()
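
Running the code above twice inserts the same URL twice, because the table has no uniqueness constraint. A possible variation, shown here as a sketch, makes url the primary key and uses INSERT OR REPLACE so that a re-crawl overwrites the old row. The changed schema, the helper name save_pages and the sample HTML strings are assumptions made for this example.

```python
import sqlite3


def save_pages(conn, pages):
    """Insert (url, content) pairs, replacing rows with the same URL."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, content TEXT)'
    )
    # "with conn" commits on success and rolls back if an exception is raised.
    with conn:
        conn.executemany(
            'INSERT OR REPLACE INTO pages (url, content) VALUES (?, ?)',
            pages,
        )


# ':memory:' keeps the example self-contained; use a file path such as
# 'data.db' to persist the data.
conn = sqlite3.connect(':memory:')
save_pages(conn, [('https://www.example.com', '<html>v1</html>')])
save_pages(conn, [('https://www.example.com', '<html>v2</html>')])
rows = conn.execute('SELECT url, content FROM pages').fetchall()
print(rows)
conn.close()
```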