Python Web Scraping: Crawling, Processing, and Storing Douban Book Data

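Table of contents

Preface
I. Versions used
II. Requirements analysis
- 1. What to scrape
  - 1.1 Information to scrape for a single book
  - 1.2 Scraping steps
    - 1.2.1 Scraping the Douban book tag index page
    - 1.2.2 Scraping the tag listing pages
    - 1.2.3 Scraping individual book pages
  - 1.3 Locating the elements that hold the content
- 2. Uses of the data
  - 2.1 Basic analysis
  - 2.2 Advanced analysis
- 3. Strategies for handling anti-scraping measures
  - 3.1 Using `User-Agent` to mimic real browser requests
  - 3.2 Applying random delays
  - 3.3 Building and using a proxy pool
  - 3.4 Other
III. Writing the scraper
- 1. Scraping the tag index HTML
- 2. Scraping all pages of a single tag
- 3. Scraping the HTML of individual books
IV. Data processing and storage
- 1. Parsing the HTML and saving the data to a CSV file
  - 1.1 Field descriptions
  - 1.2 Implementation
- 2. Data cleaning and storage
  - 2.1 Data cleaning
  - 2.2 Data storage
    - 2.2.1 Table design
    - 2.2.2 Table creation
  - 2.3 Implementation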

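Preface

Web scraping makes it efficient to collect data from websites at scale. Douban Books is a popular book rating and recommendation platform that holds a large amount of book information, including titles, authors, publishers, and ratings. This information helps readers choose books and is also a useful data source for publishers and researchers.

This project uses Python to systematically scrape book information from Douban Books and store it in a structured format for later analysis. We use requests and BeautifulSoup to fetch and parse pages, pandas to process the data, and finally store the cleaned data in a MySQL database.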

I. Versions used

Package	Version
python	3.8.5
requests	2.31.0
bs4	0.0.2
beautifulsoup4	4.12.3
soupsieve	2.6
lxml	4.9.3
pandas	2.0.3
sqlalchemy	2.0.36
mysql-connector-python	9.0.0
selenium	4.15.2

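II. Requirements analysis

1. What to scrape

1.1 Information to scrape for a single book

Open the Douban Books home page: https://book.douban.com/

Open any book.

[Figure: opening a book from the Douban Books home page]

The book's detail page shows the title, author, publisher, publication date, page count, price, rating, and more. The goal is to scrape this information and save it to a CSV file as the raw data.

[Figure: a book detail page]

Right-click the page and choose Inspect to find the HTML source for this information.

[Figure: the browser developer tools showing the page source]

Hovering over the element marked with the red arrow in the screenshot highlights, on the left, the region that contains the information we want. This shows that the single-book information lives inside that element.

[Figure: the element highlighted in the developer tools]

The content of that element, copied from the developer tools, is shown below. It contains all the information we need.

The plan is to download the HTML file of each book, parse it with BeautifulSoup, and save the data to a CSV file.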

<div class="subjectwrap clearfix">
<div class="subject clearfix">
<div id="mainpic" class=""><a class="nbg" href="https://img1.doubanio.com/view/subject/l/public/s34971089.jpg" title="再造乡土"><img src="https://img1.doubanio.com/view/subject/s/public/s34971089.jpg" title="点击看大图" alt="再造乡土" rel="v:photo" style="max-width: 135px;max-height: 200px;"></a>
</div>
<div id="info" class=""><span><span class="pl"> 作者</span>:<a class="" href="/author/4639586">（美）萨拉·法默</a></span><br><span class="pl">出版社:</span><a href="https://book.douban.com/press/2476">广西师范大学出版社</a><br><span class="pl">出品方:</span><a href="https://book.douban.com/producers/795">望mountain</a><br><span class="pl">副标题:</span> 1945年后法国农村社会的衰落与重生<br><span class="pl">原作名:</span> Rural Inventions: The French Countryside after 1945<br><span><span class="pl"> 译者</span>:<a class="" href="/search/%E5%8F%B6%E8%97%8F">叶藏</a></span><br><span class="pl">出版年:</span> 2024-11<br><span class="pl">页数:</span> 288<br><span class="pl">定价:</span> 79.20元<br><span class="pl">装帧:</span> 精装<br><span class="pl">ISBN:</span> 9787559874597<br>
</div>
</div>
<div id="interest_sectl" class=""><div class="rating_wrap clearbox" rel="v:rating"><div class="rating_logo">豆瓣评分</div><div class="rating_self clearfix" typeof="v:Rating"><strong class="ll rating_num " property="v:average"> 8.5 </strong><span property="v:best" content="10.0"></span><div class="rating_right "><div class="ll bigstar45"></div><div class="rating_sum"><span class=""><a href="comments" class="rating_people"><span property="v:votes">55</span>人评价</a></span></div></div></div>
<span class="stars5 starstop" title="力荐">5星
</span>
<div class="power" style="width:37px"></div><span class="rating_per">29.1%</span><br>
<span class="stars4 starstop" title="推荐">4星
</span>
<div class="power" style="width:64px"></div><span class="rating_per">49.1%</span><br>
<span class="stars3 starstop" title="还行">3星
</span>
<div class="power" style="width:26px"></div><span class="rating_per">20.0%</span><br>
<span class="stars2 starstop" title="较差">2星
</span>
<div class="power" style="width:2px"></div><span class="rating_per">1.8%</span><br>
<span class="stars1 starstop" title="很差">1星
</span>
<div class="power" style="width:0px"></div><span class="rating_per">0.0%</span><br></div>
</div>
</div>

1.2 Scraping steps

1.2.1 Scraping the Douban book tag index page

Douban book tag index URL: https://book.douban.com/tag/?view=type&icn=index-sorttags-all

Download the tag index page and save it as ../douban/douban_book/douban_book_tag/douban_book_all_tag.html. Then parse that file with BeautifulSoup to get the name and link of every tag.

[Figure: the Douban book tag index page]
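To make the "name and link of every tag" step concrete, here is a minimal sketch that uses only the standard library's html.parser rather than BeautifulSoup, so it runs without extra packages. The SAMPLE_HTML snippet is a hypothetical, heavily simplified stand-in for the saved tag page, and TagLinkParser and extract_tag_links are illustrative names, not part of the original code.

```python
from html.parser import HTMLParser

BASE_URL = 'https://book.douban.com'

# Hypothetical, simplified stand-in for the saved tag index page
SAMPLE_HTML = '''
<table class="tagCol"><tbody>
<tr><td><a href="/tag/小说">小说</a><b>(100)</b></td>
<td><a href="/tag/历史">历史</a><b>(50)</b></td></tr>
</tbody></table>
'''


class TagLinkParser(HTMLParser):
    """Collect {tag name: absolute URL} for every <a href="/tag/..."> link."""

    def __init__(self):
        super().__init__()
        self.links = {}
        self._href = None  # href of the <a> currently being read, if any

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href') or ''
        if tag == 'a' and href.startswith('/tag/'):
            self._href = href

    def handle_data(self, data):
        # Only text inside a matching <a> is treated as a tag name
        if self._href and data.strip():
            self.links[data.strip()] = BASE_URL + self._href

    def handle_endtag(self, tag):
        if tag == 'a':
            self._href = None


def extract_tag_links(html):
    parser = TagLinkParser()
    parser.feed(html)
    return parser.links


print(extract_tag_links(SAMPLE_HTML))
```

The real page has more nesting, which is why the article's code narrows the search with CSS selectors first; the idea of mapping each link text to base URL plus href is the same.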

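1.2.2 Scraping the tag listing pages

For example, clicking the 小说 (fiction) tag opens a page whose URL is https://book.douban.com/tag/小说. From this we can infer that the first page of any tag is https://book.douban.com/tag/<tag name>.

[Figure: the listing page for the 小说 tag]

Each listing page shows only summary information for a number of books, which is not enough for our purposes. We therefore need the URL of every listing page so that we can save the HTML of each one.

Inspecting the pagination shows that each page holds 20 books and that the URL carries two parameters, start and type, where start is the offset of the first book on the page. Each page can therefore be reached at https://book.douban.com/tag/<tag name>?start=<multiple of 20>&type=T. The link to each book is then parsed out of every listing page.

[Figure: the pagination links and their URL parameters]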
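The URL pattern inferred above can be sketched as a small helper. This is an illustration only: build_tag_page_url and PAGE_SIZE are hypothetical names, and the example percent-encodes the tag name explicitly, which requests would otherwise do when sending the request.

```python
from urllib.parse import quote, urlencode

PAGE_SIZE = 20  # books per listing page, as observed above


def build_tag_page_url(tag_name, page_index):
    """Return the URL of the page_index-th (0-based) listing page of a tag."""
    query = urlencode({'start': page_index * PAGE_SIZE, 'type': 'T'})
    return f'https://book.douban.com/tag/{quote(tag_name)}?{query}'


print(build_tag_page_url('小说', 0))
print(build_tag_page_url('小说', 2))
```

The second call produces start=40, that is, the third page of the tag.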

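1.2.3 Scraping individual book pages

Once we have the link to each book, we can save the HTML file of each book and then use BeautifulSoup to parse the book information out of it.

[Figure: a single book's detail page]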

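1.3 Locating the elements that hold the content

CSS selectors can be used to locate the elements that hold the content to scrape.
Example: locating the title element.
Right-click the title and choose Inspect. Once the source for the title is shown, right-click that line of source, choose Copy, and then Copy selector.

[Figure: copying the selector in the developer tools]

The copied selector is: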

#wrapper > h1 > span

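2. Uses of the data

2.1 Basic analysis

Descriptive statistics:
- Compute the mean, median, maximum, minimum, and standard deviation of numeric fields such as price and page count.
- Count the number of books per binding type (binding) or publisher (publisher).
Frequency distributions:
- Plot the frequency distribution of publication year (publication_year) to see the publishing trend by year.
- Analyze the share of each star rating (stars5_starstop to stars1_starstop) to understand the overall rating distribution.
Simple relationships:
- Explore the correlation between book price and rating.
- Check whether page count and rating are noticeably related.
Grouped summaries:
- Group books by author (author), publisher (publisher), or series (series) and compute per-group metrics such as the average rating (and total sales, if sales data is obtained from another source; it is not among the scraped fields).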
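As a rough illustration of the basic analyses listed above, the sketch below computes descriptive statistics, a frequency count, and a Pearson correlation with the standard library only. The sample values are hypothetical, not scraped data, and summarize and pearson are illustrative helper names; in practice the same figures would more likely come from pandas.

```python
import statistics
from collections import Counter

# Hypothetical sample values, not real scraped data
prices = [79.2, 45.0, 68.0, 39.5, 128.0]
ratings = [8.5, 7.9, 8.8, 7.2, 9.1]
bindings = ['精装', '平装', '精装']


def summarize(values):
    """Descriptive statistics for one numeric field."""
    return {
        'mean': statistics.mean(values),
        'median': statistics.median(values),
        'min': min(values),
        'max': max(values),
        'stdev': statistics.stdev(values),
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


print(summarize(prices))
print(Counter(bindings).most_common())  # books per binding type
print(pearson(prices, ratings))         # price vs. rating
```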

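2.2 Advanced analysis

Several of the ideas below rely on sales figures, which are not among the scraped fields and would have to come from another source.

Predictive modeling:
- Use machine learning to predict a book's likely rating from factors such as author, publisher, price, and publication year.
- Build a model that predicts book sales to help publishers or bookstores manage inventory.
Cluster analysis:
- Cluster books to find groups with similar characteristics, such as similar topics, readership, or market performance.
- Run topic modeling on the review text reachable through the comment links to identify common reader concerns or types of feedback.
Causal analysis:
- Study the effect of specific factors (such as cover design or translation quality) on ratings or sales while controlling for other variables.
- Use experimental or quasi-experimental designs to evaluate the effect of marketing campaigns on sales.
Time series analysis:
- With several consecutive years of data, analyze publication year and sales as time series and forecast future trends.
- Study how specific events (such as an author winning an award) affect sales over time.
Network analysis:
- Build an author collaboration network or a book citation network to explore collaboration patterns and the spread of influence in academic or literary fields.
Sentiment analysis:
- Run sentiment analysis on the content behind the comment links to understand how readers feel about a book.
Multivariate regression:
- Study how several variables (such as price, page count, and publication year) jointly affect a book's rating or sales.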

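3. Strategies for handling anti-scraping measures

3.1 Using `User-Agent` to mimic real browser requests

Many websites inspect the User-Agent field of the HTTP request headers to decide whether a request comes from a real browser. By default, requests sent by Python libraries carry an obvious identifier and are easily recognized as automated. Setting User-Agent to values used by different browsers and operating systems helps the requests pass this basic check.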
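A minimal sketch of picking a random User-Agent per request is shown below. It uses urllib.request from the standard library only so that it runs without extra packages and without touching the network; with requests, the same headers dictionary is passed as the headers argument, which is what the article's code does. build_request is a hypothetical helper name, and the two strings are taken from the list used later in the article.

```python
import random
from urllib.request import Request

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) '
    'Gecko/20100101 Firefox/117.0',
]


def build_request(url):
    """Build (but do not send) a request that carries a random User-Agent."""
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    return Request(url, headers=headers)


req = build_request('https://book.douban.com/')
# urllib stores header names capitalized, hence 'User-agent'
print(req.get_header('User-agent'))
```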

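3.2 Applying random delays

Frequent requests at regular intervals are a typical sign of a crawler. Adding a random delay between requests makes the access pattern look more like a human user's, reduces the load on the server, and lowers the risk of being blocked.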
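The delay itself needs very little code. The sketch below assumes the same 0.1 to 2 second range that the article's get_request uses; polite_delay is a hypothetical helper name, and the demonstration call uses a much shorter range only so that it finishes quickly.

```python
import random
import time


def polite_delay(min_seconds=0.1, max_seconds=2.0):
    """Sleep for a random time in [min_seconds, max_seconds] and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Short bounds for demonstration; real crawling would use the defaults or longer
waited = polite_delay(0.01, 0.05)
print(f'waited {waited:.3f} s before the next request')
```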

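3.3 Building and using a proxy pool

Sending a large number of requests from a single IP address easily leads to a ban. With a proxy pool, requests rotate through different IP addresses, which spreads the risk, makes the crawler less conspicuous, and reduces the load on any single address.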
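One simple way to rotate addresses is round-robin selection, sketched below. The addresses are placeholders from the 203.0.113.0/24 documentation range, not working proxies, and ProxyPool is a hypothetical class; the returned dictionary has the shape that requests accepts for its proxies argument, matching the proxies dictionary in the article's code.

```python
import itertools

# Placeholder addresses from the documentation range 203.0.113.0/24 (not real proxies)
PROXY_ADDRESSES = ['203.0.113.10:8080', '203.0.113.11:8080', '203.0.113.12:8080']


class ProxyPool:
    """Hand out proxies in round-robin order."""

    def __init__(self, addresses):
        if not addresses:
            raise ValueError('the proxy pool needs at least one address')
        self._cycle = itertools.cycle(addresses)

    def next_proxies(self, username='', password=''):
        """Return a proxies dict for the next address, with optional credentials."""
        address = next(self._cycle)
        auth = f'{username}:{password}@' if username else ''
        proxy_url = f'http://{auth}{address}/'
        return {'http': proxy_url, 'https': proxy_url}


pool = ProxyPool(PROXY_ADDRESSES)
for _ in range(4):
    print(pool.next_proxies())  # the fourth call wraps around to the first address
```

A production pool would also drop addresses that keep failing; that is left out here to keep the sketch short.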

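3.4 Other

CAPTCHA handling: some websites also use CAPTCHAs to verify users. In that case, consider a third-party OCR service or a dedicated CAPTCHA-recognition API.
JavaScript-rendered pages: some modern websites load content dynamically with JavaScript, so plain HTML parsing may not see the complete data. A tool such as Selenium, which drives a real browser instance that executes the JavaScript, can be used instead.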

III. Writing the scraper

1. Scraping the tag index HTML

[Figure: the tag index page to be downloaded]

Implementation:

import random
import time
from pathlib import Path

import requests


def get_request(url, **kwargs):
    """Send a GET request with a random delay, a random User-Agent and up to 3 attempts."""
    time.sleep(random.uniform(0.1, 2))
    print(f'===== URL: {url} =====')
    # A set of User-Agent strings to choose from
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]
    # Request headers
    headers = {'User-Agent': random.choice(user_agents)}
    # Username/password authentication (private or dedicated proxy); fill in your own
    username = ""
    password = ""
    proxy = '36.25.243.5:11768'
    proxies = {
        "http": f"http://{username}:{password}@{proxy}/",
        "https": f"http://{username}:{password}@{proxy}/",
    }
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To go through the proxy, use this call instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            print(f"Request failed with status code {response.status_code}, retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Request raised an exception: {e}, retrying (attempt {attempt + 1}/{max_retries})")
        # Wait a moment before retrying, unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('===== All attempts failed, please check what went wrong =====')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    file_path = dir_path / file_name
    # The 'with' statement guarantees the file is closed properly
    with open(file_path, 'w', encoding='utf-8') as fp:
        fp.write(str(content))
    print(f"===== {file_path} saved =====")


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'===== File {tag_file_path} already exists =====')
        return
    print(f'===== File {tag_file_path} does not exist, downloading... =====')
    response = get_request(book_tag_url)
    # get_request returns None when every attempt fails
    if response is None:
        return
    save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


if __name__ == '__main__':
    download_book_tag()

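[Figure: console output of the first run]

The script can be run repeatedly. Each run first checks whether the file has already been downloaded and skips the request if it has.

[Figure: console output of a repeated run, reporting that the file already exists]

[Figure: the saved douban_book_all_tag.html file]

2. Scraping all pages of a single tag

This script builds on the previous one. After parsing the tag index HTML with BeautifulSoup, it loops over the tag names and links and downloads every listing page under each tag.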

import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# Kuaidaili proxy trial: https://www.kuaidaili.com/freetest/
def get_request(url, **kwargs):
    """Send a GET request with a random delay, a random User-Agent and up to 3 attempts."""
    time.sleep(random.uniform(0.1, 2))
    print(f'===== URL: {url} =====')
    # A set of User-Agent strings to choose from
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]
    # Request headers
    headers = {'User-Agent': random.choice(user_agents)}
    # Username/password authentication (private or dedicated proxy).
    # Fill in your own credentials; never publish real ones.
    username = ""
    password = ""
    proxy = '36.25.243.5:11768'
    proxies = {
        "http": f"http://{username}:{password}@{proxy}/",
        "https": f"http://{username}:{password}@{proxy}/",
    }
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To go through the proxy, use this call instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            print(f"Request failed with status code {response.status_code}, retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Request raised an exception: {e}, retrying (attempt {attempt + 1}/{max_retries})")
        # Wait a moment before retrying, unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('===== All attempts failed, please check what went wrong =====')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    file_path = dir_path / file_name
    # The 'with' statement guarantees the file is closed properly
    with open(file_path, 'w', encoding='utf-8') as fp:
        fp.write(str(content))
    print(f"===== {file_path} saved =====")


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'===== File {tag_file_path} already exists =====')
        return
    print(f'===== File {tag_file_path} does not exist, downloading... =====')
    response = get_request(book_tag_url)
    # get_request returns None when every attempt fails
    if response is None:
        return
    save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # Path of the saved tag index page
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # Maps each tag name to its link
    name_href_result = {}
    # Base URL used to build absolute links
    book_base_url = 'https://book.douban.com'
    # Read and parse the HTML file
    with open(file=file, mode='r', encoding='utf-8') as fp:
        soup = get_soup(fp)
    # Main container that holds all tag links
    tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
    # One table of tags per category
    tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
    for table in tables:
        for tr_tag in table.select('tr'):
            for td_tag in tr_tag.select('td'):
                # First <a> in the cell, if any
                a_tag = td_tag.select_one('a')
                if a_tag:
                    # Tag name with surrounding whitespace removed
                    tag_text = a_tag.get_text(strip=True)
                    # Absolute link built from the base URL and the href attribute
                    tag_href = book_base_url + a_tag.attrs.get('href', '')
                    name_href_result[tag_text] = tag_href
    return name_href_result


# "dagai" (大概) means "overview": these are the tag listing pages
def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {'start': start, 'type': 'T'}
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    """Return True on the last page, False if the page is on disk, None if the request failed."""
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'===== File {dagai_file_path} already exists =====')
        return False
    print(f'===== File {dagai_file_path} does not exist, downloading... =====')
    content = get_book_data_dagai(name, start)
    if content is None:
        return None
    # The page past the last one contains a <p> directly under #subject_list
    soup = get_soup(content)
    p_tag = soup.select_one('#subject_list > p')
    if p_tag is not None:
        print(f"===== All pages of tag {name} have been scraped =====")
        return True
    save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)
    return False


if __name__ == '__main__':
    download_book_tag()
    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    # Give up on a tag after this many consecutive failed pages,
    # so that a ban does not turn the loop below into an endless one
    max_failures = 5
    for type_name in book_type_name:
        print(f'===== Book tag: {type_name} =====')
        start_ = 0
        failures = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            start_ = start_ + 20
            if flag is None:
                failures += 1
                if failures >= max_failures:
                    print(f'===== Too many failed requests for tag {type_name}, moving on =====')
                    break
                continue
            failures = 0
            if flag:
                print(f'===== Listing pages of tag {type_name} downloaded =====')
                break

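[Figure: part of the console output during execution]

[Figure: some of the saved listing-page HTML files]

3. Scraping the HTML of individual books

This script builds on the previous one. After parsing each listing page with BeautifulSoup, it follows the link to each book and saves the book's HTML page.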

import random
import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup


# Kuaidaili proxy trial: https://www.kuaidaili.com/freetest/
def get_request(url, **kwargs):
    """Send a GET request with a random delay, a random User-Agent and up to 3 attempts."""
    time.sleep(random.uniform(0.1, 2))
    print(f'===== URL: {url} =====')
    # A set of User-Agent strings to choose from
    user_agents = [
        # Chrome
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
        # Firefox
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/117.0',
        'Mozilla/5.0 (X11; Linux i686; rv:109.0) Gecko/20100101 Firefox/117.0',
        # Edge
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36 Edg/117.0.2040.0',
        # Safari
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15',
    ]
    # Request headers
    headers = {'User-Agent': random.choice(user_agents)}
    # Username/password authentication (private or dedicated proxy); fill in your own
    username = ""
    password = ""
    proxy = '36.25.243.5:11768'
    proxies = {
        "http": f"http://{username}:{password}@{proxy}/",
        "https": f"http://{username}:{password}@{proxy}/",
    }
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = requests.get(url=url, timeout=10, headers=headers, **kwargs)
            # To go through the proxy, use this call instead:
            # response = requests.get(url=url, timeout=10, headers=headers, proxies=proxies, **kwargs)
            if response.status_code == 200:
                return response
            print(f"Request failed with status code {response.status_code}, retrying (attempt {attempt + 1}/{max_retries})")
        except requests.exceptions.RequestException as e:
            print(f"Request raised an exception: {e}, retrying (attempt {attempt + 1}/{max_retries})")
        # Wait a moment before retrying, unless this was the last attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 2))
    print('===== All attempts failed, please check what went wrong =====')
    return None  # or return the last response, depending on your needs


def save_book_html_file(save_dir, file_name, content):
    dir_path = Path(save_dir)
    # Make sure the target directory exists, creating parent directories as needed
    dir_path.mkdir(parents=True, exist_ok=True)
    file_path = dir_path / file_name
    # The 'with' statement guarantees the file is closed properly
    with open(file_path, 'w', encoding='utf-8') as fp:
        fp.write(str(content))
    print(f"===== {file_path} saved =====")


def download_book_tag():
    save_dir = '../douban/douban_book/douban_book_tag/'
    file_name = 'douban_book_all_tag.html'
    book_tag_url = 'https://book.douban.com/tag/?view=type&icn=index-sorttags-all'
    tag_file_path = Path(save_dir + file_name)
    if tag_file_path.exists() and tag_file_path.is_file():
        print(f'===== File {tag_file_path} already exists =====')
        return
    print(f'===== File {tag_file_path} does not exist, downloading... =====')
    response = get_request(book_tag_url)
    # get_request returns None when every attempt fails
    if response is None:
        return
    save_book_html_file(save_dir=save_dir, file_name=file_name, content=response.text)


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def get_book_type_and_href():
    # Path of the saved tag index page
    file = '../douban/douban_book/douban_book_tag/douban_book_all_tag.html'
    # Maps each tag name to its link
    name_href_result = {}
    # Base URL used to build absolute links
    book_base_url = 'https://book.douban.com'
    # Read and parse the HTML file
    with open(file=file, mode='r', encoding='utf-8') as fp:
        soup = get_soup(fp)
    # Main container that holds all tag links
    tag = soup.select_one('#content > div > div.article > div:nth-child(2)')
    # One table of tags per category
    tables = tag.select('div > a.tag-title-wrapper + table.tagCol')
    for table in tables:
        for tr_tag in table.select('tr'):
            for td_tag in tr_tag.select('td'):
                # First <a> in the cell, if any
                a_tag = td_tag.select_one('a')
                if a_tag:
                    # Tag name with surrounding whitespace removed
                    tag_text = a_tag.get_text(strip=True)
                    # Absolute link built from the base URL and the href attribute
                    tag_href = book_base_url + a_tag.attrs.get('href', '')
                    name_href_result[tag_text] = tag_href
    return name_href_result


# "dagai" (大概) means "overview": these are the tag listing pages
def get_book_data_dagai(name, start):
    book_tag_base_url = 'https://book.douban.com/tag/' + name
    payload = {'start': start, 'type': 'T'}
    response = get_request(book_tag_base_url, params=payload)
    if response is None:
        return None
    return response.text


def download_book_data_dagai(name, start):
    """Return True on the last page, False if the page is on disk, None if the request failed."""
    save_dir = '../douban/douban_book/douban_book_data_dagai/'
    file_name = f'douban_book_data_dagai_{name}_{start}.html'
    dagai_file_path = Path(save_dir + file_name)
    if dagai_file_path.exists() and dagai_file_path.is_file():
        print(f'===== File {dagai_file_path} already exists =====')
        return False
    print(f'===== File {dagai_file_path} does not exist, downloading... =====')
    content = get_book_data_dagai(name, start)
    if content is None:
        return None
    # The page past the last one contains a <p> directly under #subject_list
    soup = get_soup(content)
    p_tag = soup.select_one('#subject_list > p')
    if p_tag is not None:
        print(f"===== All pages of tag {name} have been scraped =====")
        return True
    save_book_html_file(save_dir=save_dir, file_name=file_name, content=content)
    return False


def download_book_data_detail():
    save_dir = '../douban/douban_book/douban_book_data_detail/'
    dagai_dir = Path('../douban/douban_book/douban_book_data_dagai/')
    for dagai_file in dagai_dir.rglob('*.html'):
        with open(file=dagai_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(markup=fp)
        # Each book title on a listing page links to the book's detail page
        a_tag_list = soup.select('#subject_list > ul > li  h2 > a')
        for a_tag in a_tag_list:
            href = a_tag.attrs.get('href')
            if not href:
                continue
            # Links look like https://book.douban.com/subject/<book_id>/
            book_id = href.rstrip('/').split('/')[-1]
            file_name = f'douban_book_data_detail_{book_id}.html'
            detail_file_path = Path(save_dir + file_name)
            if detail_file_path.exists() and detail_file_path.is_file():
                print(f'===== File {detail_file_path} already exists =====')
            else:
                print(f'===== File {detail_file_path} does not exist, downloading... =====')
                response = get_request(href)
                if response is None:
                    continue
                save_book_html_file(save_dir, file_name, response.text)


def print_in_rows(items, items_per_row=20):
    """Print items separated by spaces, items_per_row per line."""
    for index, name in enumerate(items, start=1):
        print(f'{name}', end=' ')
        if index % items_per_row == 0:
            print()


if __name__ == '__main__':
    download_book_tag()
    book_type = get_book_type_and_href()
    book_type_name = book_type.keys()
    print(book_type_name)
    # Give up on a tag after this many consecutive failed pages,
    # so that a ban does not turn the loop below into an endless one
    max_failures = 5
    for type_name in book_type_name:
        print(f'===== Book tag: {type_name} =====')
        start_ = 0
        failures = 0
        while True:
            flag = download_book_data_dagai(type_name, start_)
            start_ = start_ + 20
            if flag is None:
                failures += 1
                if failures >= max_failures:
                    print(f'===== Too many failed requests for tag {type_name}, moving on =====')
                    break
                continue
            failures = 0
            if flag:
                print(f'===== Listing pages of tag {type_name} downloaded =====')
                break
    download_book_data_detail()

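[Figure: part of the console output during execution]

[Figure: some of the saved book-detail HTML files]

IV. Data processing and storage

1. Parsing the HTML and saving the data to a CSV file

BeautifulSoup parses the information of a single book out of each HTML document. After looping over the documents, the parsed data is saved to a CSV file.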

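1.1 Field descriptions

Field	Description
book_id	Unique identifier of the book.
title	Book title.
img_src	URL of the cover image.
author	Author name.
publisher	Publisher name.
producer	Producer or imprint (出品方), if any.
original_title	Original title (for translated works, the title in the original language).
translator	Translator name, if any.
publication_year	Publication date (year, optionally with month and day).
page_count	Number of pages.
price	List price.
binding	Binding type (such as paperback or hardcover).
series	Series name, if any.
isbn	International Standard Book Number.
rating	Average rating.
rating_sum	Number of people who rated the book.
comment_link	Link to the user comments.
stars5_starstop	Share of five-star ratings.
stars4_starstop	Share of four-star ratings.
stars3_starstop	Share of three-star ratings.
stars2_starstop	Share of two-star ratings.
stars1_starstop	Share of one-star ratings.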

1.2 Implementation

The parsed records are written to the CSV file in batches of 100.

from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def get_soup(markup):
    return BeautifulSoup(markup=markup, features='lxml')


def find_label(tag_info, label):
    """Return the <span class="pl"> whose text is exactly `label`, or None."""
    return tag_info.find(name='span', attrs={'class': 'pl'}, string=label)


def link_field(tag_info, label):
    """Value held in the element two siblings after the label (normally an <a>)."""
    tag = find_label(tag_info, label)
    if tag is None:
        return ''
    return tag.next_sibling.next_sibling.text.strip()


def text_field(tag_info, label):
    """Value held in the text node right after the label."""
    tag = find_label(tag_info, label)
    if tag is None:
        return ''
    return tag.next_sibling.strip()


def star_percentage(tag_rating_wrap, level):
    """Share of `level`-star ratings, e.g. '29.1%', or '' if the page has none."""
    tag = tag_rating_wrap.select_one(f'#interest_sectl > div > span.stars{level}.starstop')
    if tag is None:
        return ''
    # span.starsN -> text -> div.power -> text -> span.rating_per
    return tag.next_sibling.next_sibling.next_sibling.next_sibling.text.strip()


def append_to_csv(rows, csv_file_path):
    """Append rows to the CSV file, writing the header only when the file is new."""
    df = pd.DataFrame(rows)
    if not csv_file_path.exists():
        df.to_csv(csv_file_path, index=False, encoding='utf-8-sig')
    else:
        df.to_csv(csv_file_path, index=False, encoding='utf-8-sig', mode='a', header=False)


def parse_detail_html_to_csv():
    # CSV file location
    csv_file_dir = Path('../douban/douban_book/data_csv/')
    csv_file_dir.mkdir(parents=True, exist_ok=True)
    csv_file_path = csv_file_dir / 'douban_books.csv'
    detail_dir = Path('../douban/douban_book/douban_book_data_detail/')
    book_data = []
    for detail_file in detail_dir.rglob('*.html'):
        # File names look like douban_book_data_detail_<book_id>.html
        book_id = detail_file.stem.split('_')[-1]
        with open(file=detail_file, mode='r', encoding='utf-8') as fp:
            soup = get_soup(fp)
        tag_title = soup.select_one('#wrapper > h1 > span')
        tag_subjectwrap = soup.select_one('#content > div > div.article > div.indent > div.subjectwrap.clearfix')
        # Skip files that are not regular book pages (for example, error pages)
        if tag_title is None or tag_subjectwrap is None:
            print(f'===== Skipping {detail_file}: no book information found =====')
            continue
        title = tag_title.get_text(strip=True)
        img_src = tag_subjectwrap.select_one('#mainpic > a > img').attrs.get('src')
        tag_info = tag_subjectwrap.select_one('div.subject.clearfix > #info')
        # Rating block
        tag_rating_wrap = tag_subjectwrap.select_one('#interest_sectl > div')
        # Average rating
        tag_rating = tag_rating_wrap.select_one('#interest_sectl > div > div.rating_self.clearfix > strong')
        rating = '' if tag_rating is None else tag_rating.get_text(strip=True)
        # Number of ratings
        tag_rating_sum = tag_rating_wrap.select_one('#interest_sectl > div > div.rating_self.clearfix > div > div.rating_sum > span > a > span')
        rating_sum = '' if tag_rating_sum is None else tag_rating_sum.get_text(strip=True)
        # The label strings must match the page text exactly, including the leading space
        data_dict = {
            'book_id': book_id,
            'title': title,
            'img_src': img_src,
            'author': link_field(tag_info, ' 作者'),
            'publisher': link_field(tag_info, '出版社:'),
            'producer': link_field(tag_info, '出品方:'),
            'original_title': text_field(tag_info, '原作名:'),
            'translator': link_field(tag_info, ' 译者'),
            'publication_year': text_field(tag_info, '出版年:'),
            'page_count': text_field(tag_info, '页数:'),
            'price': text_field(tag_info, '定价:'),
            'binding': text_field(tag_info, '装帧:'),
            'series': link_field(tag_info, '丛书:'),
            'isbn': text_field(tag_info, 'ISBN:'),
            'rating': rating,
            'rating_sum': rating_sum,
            'comment_link': f'https://book.douban.com/subject/{book_id}/comments/',
            'stars5_starstop': star_percentage(tag_rating_wrap, 5),
            'stars4_starstop': star_percentage(tag_rating_wrap, 4),
            'stars3_starstop': star_percentage(tag_rating_wrap, 3),
            'stars2_starstop': star_percentage(tag_rating_wrap, 2),
            'stars1_starstop': star_percentage(tag_rating_wrap, 1),
        }
        print(f'===== File: {detail_file}, parsed data: =====')
        print(data_dict)
        book_data.append(data_dict)
        # Write to the CSV file every 100 records
        if len(book_data) == 100:
            append_to_csv(book_data, csv_file_path)
            book_data = []
    # Write the remaining records (fewer than 100) after the loop
    if book_data:
        append_to_csv(book_data, csv_file_path)


if __name__ == '__main__':
    parse_detail_html_to_csv()

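[Figure: part of the console output during execution]

[Figure: location and content of the CSV file]

2. Data cleaning and storage

2.1 Data cleaning

The data is cleaned with pandas.
Null values: unless stated otherwise below, nulls are filled with 未知 ("unknown").
Dates: nulls are filled with 1970-01-01; a missing month or day is filled with 01.
Page count: nulls are filled with 0.
Price: nulls are filled with 0.
Rating: nulls are filled with 0.
Number of ratings: nulls are filled with 0.
Star-rating shares: nulls are filled with 0.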
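As a worked example of the date rule, the sketch below normalizes a few raw values with the standard library only. normalize_date is a hypothetical helper written for illustration (the article's own implementation uses pandas), and it only range-checks month and day rather than validating real calendar dates.

```python
import re

DEFAULT_DATE = '1970-01-01'


def normalize_date(raw):
    """Return YYYY-MM-DD, filling a missing month or day with 01."""
    if raw is None or not str(raw).strip():
        return DEFAULT_DATE
    # Pull out the numeric parts: '2024年11月5日' -> ['2024', '11', '5']
    parts = re.findall(r'\d+', str(raw))
    if not parts or len(parts[0]) != 4:
        return DEFAULT_DATE
    year = parts[0]
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return DEFAULT_DATE
    return f'{year}-{month:02d}-{day:02d}'


for raw in ['2024-11', '2024年11月5日', '2019', '', None]:
    print(repr(raw), '->', normalize_date(raw))
```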

2.2 Data storage

The cleaned data is saved to MySQL.

2.2.1 Table design

The MySQL table structure below mirrors the CSV fields and is defined with standard SQL types.

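Field	Data type	Description
book_id	INT	Unique identifier of the book.
title	VARCHAR(255)	Book title.
img_src	VARCHAR(255)	URL of the cover image.
author	VARCHAR(255)	Author name.
publisher	VARCHAR(255)	Publisher name.
producer	VARCHAR(255)	Producer or imprint (出品方), if any.
original_title	VARCHAR(255)	Original title (for translated works, the title in the original language).
translator	VARCHAR(255)	Translator name, if any.
publication_year	DATE	Publication date.
page_count	INT	Number of pages.
price	DECIMAL(10, 2)	List price.
binding	VARCHAR(255)	Binding type (such as paperback or hardcover).
series	VARCHAR(255)	Series name, if any.
isbn	VARCHAR(20)	International Standard Book Number.
rating	DECIMAL(3, 1)	Average rating.
rating_sum	INT	Number of people who rated the book.
comment_link	VARCHAR(255)	Link to the user comments.
stars5_starstop	DECIMAL(5, 2)	Share of five-star ratings.
stars4_starstop	DECIMAL(5, 2)	Share of four-star ratings.
stars3_starstop	DECIMAL(5, 2)	Share of three-star ratings.
stars2_starstop	DECIMAL(5, 2)	Share of two-star ratings.
stars1_starstop	DECIMAL(5, 2)	Share of one-star ratings.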

2.2.2 Table creation

Create the database douban.

create database douban;

Switch to the database douban.

use douban;

Create the table cleaned_douban_books to store the cleaned data.

CREATE TABLE cleaned_douban_books (
    book_id INT PRIMARY KEY,
    title VARCHAR(255),
    img_src VARCHAR(255),
    author VARCHAR(255),
    publisher VARCHAR(255),
    producer VARCHAR(255),
    original_title VARCHAR(255),
    translator VARCHAR(255),
    publication_year DATE,
    page_count INT,
    price DECIMAL(10, 2),
    binding VARCHAR(255),
    series VARCHAR(255),
    isbn VARCHAR(20),
    rating DECIMAL(3, 1),
    rating_sum INT,
    comment_link VARCHAR(255),
    stars5_starstop DECIMAL(5, 2),
    stars4_starstop DECIMAL(5, 2),
    stars3_starstop DECIMAL(5, 2),
    stars2_starstop DECIMAL(5, 2),
    stars1_starstop DECIMAL(5, 2)
);

2.3 Implementation

import re

import pandas as pd
from sqlalchemy import create_engine

DEFAULT_DATE = '1970-01-01'


def read_csv_to_df(file_path):
    # The CSV was written with utf-8-sig. Read every column as text so that
    # ISBNs, prices and dates are not silently converted to numbers.
    return pd.read_csv(file_path, encoding='utf-8-sig', dtype=str)


def unify_date_format(date_str):
    # Nulls are filled with the default date
    if pd.isna(date_str):
        return DEFAULT_DATE
    # Convert Chinese-style dates such as 2024年11月5日 to 2024-11-5
    processed_date = str(date_str).strip()
    processed_date = processed_date.replace('年', '-').replace('月', '-').replace('日', '').strip('-')
    try:
        date_obj = pd.to_datetime(processed_date, errors='coerce')
        if pd.isna(date_obj):
            return DEFAULT_DATE
        # If only the year is present, use January 1st
        if len(processed_date.split('-')) == 1:
            date_obj = date_obj.replace(month=1, day=1)
        return date_obj.strftime('%Y-%m-%d')
    except Exception as e:
        print(f"Error parsing date '{date_str}': {e}")
        return DEFAULT_DATE


def clean_price(price_str):
    if pd.isna(price_str) or not isinstance(price_str, str):
        return 0
    # Remove everything except digits, dots and slashes
    cleaned = re.sub(r'[^\d./]+', '', price_str)
    # A value may hold several prices separated by '/'; use their average
    prices = []
    for part in cleaned.split('/'):
        sub_parts = re.findall(r'\d+\.\d+|\d+', part)
        if sub_parts:
            # Take the first price found in each part
            prices.append(float(sub_parts[0]))
    if not prices:
        return 0
    return round(sum(prices) / len(prices), 2)


def clean_percentage(percentage_str):
    if pd.isna(percentage_str) or not isinstance(percentage_str, str):
        return 0
    # Remove the percent sign and convert to float
    cleaned = re.sub(r'[^\d.]+', '', percentage_str)
    try:
        return round(float(cleaned), 2)
    except ValueError:
        return 0


def clean_page_count(page_str):
    if not isinstance(page_str, str) or not page_str.strip():
        return 0
    # Keep digits and semicolons (ASCII or full-width) only
    cleaned = re.sub(r'[^\d;；]+', '', page_str)
    # A value may hold several page counts; use the largest
    pages = [int(p) for p in re.split(r'[;；]', cleaned) if p]
    return max(pages) if pages else 0


# Clean the data and convert it to the formats of the MySQL table
def clean_and_transform(df):
    # Drop rows with a duplicate book_id (the result must be assigned back)
    df = df.drop_duplicates(subset=['book_id']).copy()
    df['book_id'] = df['book_id'].astype(int)
    # Text fields: fill nulls with 未知 ("unknown")
    text_columns = ['title', 'img_src', 'author', 'publisher', 'producer',
                    'original_title', 'translator', 'binding', 'series', 'isbn']
    for column in text_columns:
        df[column] = df[column].fillna('未知')
    # Dates: nulls become 1970-01-01, a missing month or day becomes 01
    df['publication_year'] = df['publication_year'].apply(unify_date_format)
    df['page_count'] = df['page_count'].apply(clean_page_count).astype(int)
    df['price'] = df['price'].apply(clean_price)
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce').fillna(0)
    df['rating_sum'] = pd.to_numeric(df['rating_sum'], errors='coerce').fillna(0).astype(int)
    for level in range(5, 0, -1):
        column = f'stars{level}_starstop'
        df[column] = df[column].apply(clean_percentage)
    return df


def save_df_to_db(df):
    # Database connection settings; replace with your own
    db_user = 'root'
    db_password = 'your_password'
    db_host = '127.0.0.1'  # or your database host
    db_port = '3306'  # default MySQL port
    db_name = 'douban'
    # Create the database engine
    engine = create_engine(f'mysql+mysqlconnector://{db_user}:{db_password}@{db_host}:{db_port}/{db_name}')
    # Write the DataFrame to the MySQL table
    df.to_sql(name='cleaned_douban_books', con=engine, if_exists='append', index=False)
    print("The CSV data has been cleaned and written to MySQL")


if __name__ == '__main__':
    csv_file = '../douban/douban_book/data_csv/douban_books.csv'
    df = read_csv_to_df(csv_file)
    df = clean_and_transform(df)
    save_df_to_db(df)

Query the book data in the cleaned_douban_books table:

select * from cleaned_douban_books limit 10;

[Figure: the first 10 rows of the cleaned_douban_books table]