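Preface

🌸 If you have a favorite poet and want all of their poems as data, a web crawler can do the job: crawl every poem, save the results to a txt file, and print it out to memorize. What could be better? 🐟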

I. Basic Goal

We want to crawl every poem by the poet Zhang Ruoxu (张若虚), along with his biography.

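II. Steps

1. Analysis

🐽 Start by extracting the poet's information from the author page. That page does not expose the full text of every poem, so we first collect the URL of each poem's detail page from it, then crawl one level deeper into each detail page to get the poem's content.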
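The two-level idea can be tried offline before touching the real site. The following is a minimal sketch, not the script used in this article: it relies only on the standard library, the HTML fragment and its `/chaxun/list/...` paths are made up for illustration, and `DetailLinkParser` and `collect_detail_urls` are hypothetical helper names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical list-page fragment; the real page's markup may differ.
SAMPLE_LIST_HTML = """
<div id="main_left">
  <div class="card"><h3><a href="/chaxun/list/1.html">Poem A</a></h3></div>
  <div class="card"><h3><a href="/chaxun/list/2.html">Poem B</a></h3></div>
</div>
"""


class DetailLinkParser(HTMLParser):
    """Collects the href of every <a> nested inside an <h3>."""

    def __init__(self):
        super().__init__()
        self.in_h3 = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_h3:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False


def collect_detail_urls(list_html, base="https://www.shicimingju.com"):
    """Level 1: turn the relative links on a list page into absolute detail URLs."""
    parser = DetailLinkParser()
    parser.feed(list_html)
    return [urljoin(base, href) for href in parser.links]


detail_urls = collect_detail_urls(SAMPLE_LIST_HTML)
for url in detail_urls:
    # Level 2 would fetch each URL here and parse the poem out of the response.
    print(url)
```

Keeping link collection separate from fetching makes the first level easy to check against saved HTML before any request is sent.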

2. Full code

The code is as follows (example):

import re
import time

import requests
from lxml import etree

# URL of the poet's page to crawl
base_url = "https://www.shicimingju.com/chaxun/zuozhe/04.html"
# Browser-like request headers, to get past basic anti-crawler checks.
# "br" is left out of Accept-Encoding because requests can only decode
# Brotli responses when the brotli package is installed.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Referer": "https://www.shicimingju.com",
}
# Fetch the page source with requests
resp = requests.get(url=base_url, headers=headers)
# Parse the page for XPath queries
html = etree.HTML(resp.text)
# Locate the author's name with XPath
author_name = html.xpath('//*[@id="main_right"]/div[1]/div[2]/div[1]/h4/a/text()')[0]
# Regex that captures the page element holding the author's biography
obj_introduction = re.compile(r'<div class="des">(?P<introduction>.*?)</div>', re.S)
# Regex that matches any HTML tag, used to strip leftover markup
pattern = re.compile(r'<[^>]+>', re.S)
# Regex that captures the poem's content on a detail page
obj_content = re.compile(r'<div class="item_content" id="zs_content">(?P<poetry_content>.*?)</div>', re.S)

# Extract the biography and strip its HTML tags, keeping only the text
author_introduction = ""
for it in obj_introduction.finditer(resp.text):
    author_introduction = pattern.sub('', it.group("introduction")).strip()

# Locate each poem's block with XPath to get its detail-page link for the next level of requests
poet_list = html.xpath('//*[@id="main_left"]/div[1]/div')
poet_list = poet_list[1::2]
for poet in poet_list:
    url = poet.xpath('./div[2]/h3/a/@href')[0]
    url = "https://www.shicimingju.com" + url
    # Fetch the poem's detail page
    resp_poet = requests.get(url=url, headers=headers)
    resp_poet.encoding = 'utf-8'
    html_child = etree.HTML(resp_poet.text)
    # Locate the poem's title with XPath
    poet_name = html_child.xpath('//*[@id="zs_title"]/text()')[0]
    # Extract the poem's content and strip its HTML tags
    poetry_content = ""
    for it in obj_content.finditer(resp_poet.text):
        poetry_content = pattern.sub('', it.group("poetry_content")).strip()
    record = ("Author: " + author_name + "\nBiography: " + author_introduction
              + "\nTitle: " + poet_name + "\nContent: " + poetry_content + "\n")
    # Append the record to the output file and echo it to the console
    with open('poet.txt', 'a', encoding='utf-8') as file:
        file.write(record)
    print(record)
    # Pause between requests to avoid hammering the site
    time.sleep(1)
print("Done!")
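
The content-extraction step can also be checked in isolation. The sketch below reuses the script's content regex on a made-up detail-page fragment. It is one possible variation, not the article's method: turning `<br>` into a newline and decoding entities with `html.unescape` are additions that the original script does not make, and `extract_poem` is a hypothetical helper name.

```python
import html
import re

# Same content pattern as the script above.
CONTENT_RE = re.compile(
    r'<div class="item_content" id="zs_content">(?P<poetry_content>.*?)</div>', re.S
)
BR_RE = re.compile(r'<br\s*/?>', re.I)   # assumption: lines are separated by <br>
TAG_RE = re.compile(r'<[^>]+>')


def extract_poem(page_source):
    """Return the poem text from a detail page, or "" if the block is missing."""
    match = CONTENT_RE.search(page_source)
    if match is None:
        return ""
    text = BR_RE.sub("\n", match.group("poetry_content"))  # keep line breaks
    text = TAG_RE.sub("", text)                             # drop remaining tags
    return html.unescape(text).strip()                      # decode entities such as &amp;


# Made-up fragment for illustration; the real page's markup may differ.
SAMPLE_DETAIL_HTML = (
    '<h1 id="zs_title">Sample title</h1>'
    '<div class="item_content" id="zs_content">\n'
    '  first line<br/>second &amp; last line\n'
    '</div>'
)

print(extract_poem(SAMPLE_DETAIL_HTML))
```

Returning an empty string when the block is missing mirrors the script's behavior of leaving `poetry_content` empty when the regex finds nothing.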