Python Web Scraping in Practice: Scraping Douban Movies

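Introduction

1. Crawler Basics

1.1 What Is a Web Crawler?

1.2 Common Python Libraries for Crawling

2. Hands-On: Scraping the Douban Movie Top 250

2.1 Installing Dependencies

2.2 Sending an HTTP Request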

2.3 Parsing HTML

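2.4 Storing the Data

2.5 Complete Code

3. Going Further: Pagination and Dynamic Content

3.1 Scraping Multiple Pages

3.2 Handling Dynamic Content

4. Anti-Scraping Measures and Countermeasures

4.1 Common Anti-Scraping Measures

4.2 Countermeasures

5. Summary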

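Introduction

In the era of big data, web crawlers have become an important tool for collecting data from the internet. Whether for data analysis, machine learning, or market research, crawling lets us gather the data we need quickly. This article starts from scratch, builds a simple web crawler in Python, and then extends it to more complex scenarios.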

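1. Crawler Basics

1.1 What Is a Web Crawler?

A web crawler is an automated program that fetches data from the internet. It imitates a browser's requests, visits web pages, and extracts the information we want. Its core tasks are:

Sending HTTP requests: request the target site and receive the page content.
Parsing HTML: extract the useful data from the page.
Storing data: save the extracted data locally or in a database.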
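
The three tasks can be sketched end to end with only the standard library. This is a minimal illustration, not the approach used later in this article: the request step is replaced by an inline string (SAMPLE_HTML, a made-up snippet rather than real Douban markup) so the example runs offline, and the names SpanParser and run_pipeline are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser

# Made-up sample page standing in for the body returned by an HTTP request (step 1).
SAMPLE_HTML = """
<ul>
  <li><span class="title">Movie A</span><span class="rating">9.1</span></li>
  <li><span class="title">Movie B</span><span class="rating">8.7</span></li>
</ul>
"""


class SpanParser(HTMLParser):
    """Collect the text of <span class="title"> and <span class="rating">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished [title, rating] pairs
        self._field = None   # class of the span currently open, if any
        self._current = {}   # fields gathered for the current <li>

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("title", "rating"):
            self._field = cls

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current:
            self.rows.append(
                [self._current.get("title", ""), self._current.get("rating", "")]
            )
            self._current = {}


def run_pipeline(html):
    """Parse the HTML (step 2) and store the rows as CSV text (step 3)."""
    parser = SpanParser()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()  # in-memory stand-in for a file on disk
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "rating"])
    writer.writerows(parser.rows)
    return parser.rows, buffer.getvalue()


rows, csv_text = run_pipeline(SAMPLE_HTML)
```

Here rows holds one [title, rating] pair per list item and csv_text holds the same data with a header line. The sections below perform the same steps with requests and BeautifulSoup, which take far less code.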

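1.2 Common Python Libraries for Crawling

Python has a rich set of libraries for building crawlers. The most commonly used are:

Requests: sends HTTP requests and returns the page content.
BeautifulSoup: parses HTML and extracts data.
Scrapy: a full-featured crawling framework suited to large-scale scraping.
Selenium: drives a real browser, which makes it useful for dynamic pages.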

2. Hands-On: Scraping the Douban Movie Top 250

We will use the Douban Movie Top 250 list as an example to show how to write a simple crawler in Python.

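2.1 Installing Dependencies

First, make sure the requests and BeautifulSoup libraries are installed. If they are not, install them with the following command (the BeautifulSoup package is published as beautifulsoup4):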

pip install requests beautifulsoup4

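2.2 Sending an HTTP Request

We use the requests library to send a request to the Douban Movie Top 250 page and retrieve its content.

import requests

url = "https://movie.douban.com/top250"
# Send a browser-like User-Agent, since some sites reject the default one used by requests
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}
# Without a timeout, requests can wait indefinitely on an unresponsive server
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    print("Request succeeded!")
    html_content = response.text
else:
    print("Request failed, status code:", response.status_code)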
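
Checking the status code inline works for a single request; for reuse it can help to wrap the check in a function. The sketch below is one possible shape, not part of the original script: fetch_html and its get parameter are hypothetical names, and get is assumed to behave like requests.get (accepting headers and timeout keyword arguments and returning an object with status_code and text). Passing the function in keeps the sketch runnable without network access.

```python
def fetch_html(url, get, headers=None, timeout=10):
    """Return the page body, or raise if the server did not answer 200.

    `get` is any callable with the same calling convention as requests.get.
    """
    response = get(url, headers=headers, timeout=timeout)
    if response.status_code != 200:
        raise RuntimeError(
            f"Request to {url} failed with status {response.status_code}"
        )
    return response.text


# Offline stand-ins for requests.get and its response (assumptions for the demo).
class FakeResponse:
    def __init__(self, status_code, text):
        self.status_code = status_code
        self.text = text


def fake_get(url, headers=None, timeout=None):
    return FakeResponse(200, "<html>ok</html>")


html_content = fetch_html("https://movie.douban.com/top250", fake_get)
```

In the real script the call would be fetch_html(url, requests.get, headers=headers). Raising on failure means later steps never run with a missing html_content.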

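2.3 Parsing HTML

We use BeautifulSoup to parse the HTML and extract each movie's title and rating.

from bs4 import BeautifulSoup

# Parse the page fetched in section 2.2
soup = BeautifulSoup(html_content, "html.parser")
# Each movie entry is selected by its <div class="info"> block
movies = soup.find_all("div", class_="info")
for movie in movies:
    # find() returns the first matching element inside the entry
    title = movie.find("span", class_="title").text
    rating = movie.find("span", class_="rating_num").text
    print(f"Title: {title}, rating: {rating}")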

2.4 Storing the Data

We save the extracted data to a CSV file.

import csv

# newline="" lets the csv module control line endings itself
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])  # header row
    for movie in movies:
        title = movie.find("span", class_="title").text
        rating = movie.find("span", class_="rating_num").text
        writer.writerow([title, rating])
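
If each movie is kept as a dictionary rather than a list, csv.DictWriter can map keys to columns, which makes it harder to swap columns by accident once more fields are added. This is a minimal sketch of that alternative; the save_movies helper, the sample rows, and the output file name are hypothetical.

```python
import csv
import os
import tempfile


def save_movies(movies, path):
    """Write a list of {"title": ..., "rating": ...} dicts to a CSV file."""
    with open(path, mode="w", newline="", encoding="utf-8") as file:
        writer = csv.DictWriter(file, fieldnames=["title", "rating"])
        writer.writeheader()
        writer.writerows(movies)


# Made-up rows, used only to show the call.
sample = [
    {"title": "Movie A", "rating": "9.1"},
    {"title": "Movie B", "rating": "8.7"},
]
sample_path = os.path.join(tempfile.gettempdir(), "sample_movies.csv")
save_movies(sample, sample_path)
```

Reading the file back with csv.DictReader yields the same dictionaries, so the format round-trips without any column bookkeeping.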

2.5 Complete Code

import csv

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/top250"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

# Step 1: fetch the page
response = requests.get(url, headers=headers, timeout=10)
if response.status_code == 200:
    html_content = response.text
else:
    # Stop here: there is nothing to parse
    raise SystemExit(f"Request failed, status code: {response.status_code}")

# Step 2: parse the HTML
soup = BeautifulSoup(html_content, "html.parser")
movies = soup.find_all("div", class_="info")

# Step 3: write the results to CSV
with open("douban_top250.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])
    for movie in movies:
        title = movie.find("span", class_="title").text
        rating = movie.find("span", class_="rating_num").text
        writer.writerow([title, rating])

print("Data saved to douban_top250.csv")

3. Going Further: Pagination and Dynamic Content

3.1 Scraping Multiple Pages

The Top 250 list is spread over 10 pages of 25 movies each, selected by the start query parameter (0, 25, ..., 225), so we loop over every offset.

# Reuses the imports and the headers dict from section 2.5
base_url = "https://movie.douban.com/top250"
all_movies = []

for page in range(0, 250, 25):
    url = f"{base_url}?start={page}"
    response = requests.get(url, headers=headers, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        movies = soup.find_all("div", class_="info")
        for movie in movies:
            title = movie.find("span", class_="title").text
            rating = movie.find("span", class_="rating_num").text
            all_movies.append([title, rating])
    else:
        # Report the failed page and continue with the next one
        print(f"Request for page {page // 25 + 1} failed, status code:", response.status_code)

with open("douban_top250_all.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Rating"])
    writer.writerows(all_movies)

print("All data saved to douban_top250_all.csv")
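
The start offsets in the loop above follow directly from 250 entries at 25 per page. The sketch below isolates that arithmetic and adds a pause between requests, which is a common courtesy toward the server and an addition not present in the loop above. page_urls and crawl are hypothetical names; fetch and sleep are passed in as parameters so the example runs offline.

```python
import time


def page_urls(base_url, total=250, per_page=25):
    """Build one URL per page using the ?start= offset."""
    return [f"{base_url}?start={offset}" for offset in range(0, total, per_page)]


def crawl(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        pages.append(fetch(url))
    return pages


urls = page_urls("https://movie.douban.com/top250")
```

With the defaults, urls contains 10 entries running from start=0 to start=225. In a real run, fetch would wrap requests.get and the parsing step, and delay would be tuned to what the target site tolerates.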

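3.2 Handling Dynamic Content

If a page loads its content dynamically through JavaScript, Selenium can drive a real browser to render it. The Top 250 page can already be scraped with requests, as shown above, so it serves here only to illustrate the Selenium calls.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://movie.douban.com/top250")
    # Locate each movie entry by its CSS class
    movies = driver.find_elements(By.CLASS_NAME, "info")
    for movie in movies:
        title = movie.find_element(By.CLASS_NAME, "title").text
        rating = movie.find_element(By.CLASS_NAME, "rating_num").text
        print(f"Title: {title}, rating: {rating}")
finally:
    # Close the browser even if an element lookup fails
    driver.quit()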