Handling Dynamic Pagination: Auto-Pagination and Incremental Crawling Strategies - 数据议事厅

I. Case Scenario

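Lily (waving a data report): "Users report that our stock sentiment analysis keeps missing the latest replies! These dynamic pages are as sly as foxes; every crawl misses key data!"

Xiao Wang (debugging the crawler code): "Traditional pagination parameters no longer work. Look! (points at the screen) This 'Load more' button mutates: every click generates a new encrypted parameter!"

Dynamic pagination takes the form of a black-clad assassin wielding timestamped poison darts: "Want new data? Crack my identity token first!" User-Agent detection stands like a guard at the city gate, turning away any crawler that arrives without a disguise.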

import json
import time

import requests
from bs4 import BeautifulSoup


class GubaCrawler:
    def __init__(self):
        # 16YUN (亿牛云) proxy configuration (www.16yun.cn); the credentials are placeholders
        proxy_url = "http://16YUN:16IP@yn-proxy.16yun.cn:3111"
        self.proxy = {"http": proxy_url, "https": proxy_url}
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Cookie": "em_hq_fls=js; sid=6d5b20...",  # dynamic cookie; must be refreshed periodically
        }
        self.visited_ids = set()  # IDs already seen, used for incremental crawling

    def parse_page(self, url):
        try:
            # Two safeguards: a proxy IP plus browser-like request headers
            response = requests.get(url, proxies=self.proxy, headers=self.headers, timeout=10)
            soup = BeautifulSoup(response.text, 'html.parser')
            # Parse posts from the Eastmoney Guba (东方财富股吧) list page
            posts = []
            for item in soup.select('.articleh'):
                post_id = item.get('data-postid')  # unique identifier
                # Skip items without an ID and posts already collected
                if not post_id or post_id in self.visited_ids:
                    continue
                title = item.select_one('.l3 a').text.strip()
                # Named post_time so it does not shadow the time module
                post_time = item.select_one('.l5').text
                # Parse more fields here...
                posts.append({"id": post_id, "title": title, "time": post_time})
                self.visited_ids.add(post_id)
            return posts
        except Exception as e:
            print(f"Crawl error: {e}")
            return []

    def auto_pagination(self):
        base_url = "https://guba.eastmoney.com/list,002291_{}.html"
        page = 1
        while True:
            current_url = base_url.format(page)
            print(f"Turning page: {current_url}")
            data = self.parse_page(current_url)
            # Stop condition: the page yielded no new posts (a failed request also ends the loop)
            if not data:
                print("Reached the last page!")
                break
            # Storage: one JSON object per line, so repeated appends keep the file parseable
            with open('guba_data.json', 'a', encoding='utf-8') as f:
                for post in data:
                    f.write(json.dumps(post, ensure_ascii=False) + "\n")
            page += 1
            time.sleep(3)  # throttle the request rate


if __name__ == '__main__':
    crawler = GubaCrawler()
    crawler.auto_pagination()
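The crawler above keeps `visited_ids` only in memory, so a restart would fetch every post again. One possible way to make the incremental crawl survive restarts is to save the ID set to disk between runs. The sketch below is a minimal illustration rather than part of the original crawler: the helper names (`load_visited_ids`, `save_visited_ids`, `filter_new_posts`) and the file name `visited_ids.json` are hypothetical, and post IDs are assumed to be strings.

```python
import json
import os
import tempfile


def load_visited_ids(path):
    """Return the set of post IDs saved by an earlier run (empty if no file exists)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_visited_ids(path, visited_ids):
    """Write the ID set to a temp file, then swap it in, so a crash cannot leave a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(visited_ids), f)  # assumes IDs are strings
    os.replace(tmp_path, path)


def filter_new_posts(posts, visited_ids):
    """Keep only posts whose "id" has not been seen, and record them in visited_ids."""
    fresh = []
    for post in posts:
        if post["id"] in visited_ids:
            continue
        visited_ids.add(post["id"])
        fresh.append(post)
    return fresh


if __name__ == "__main__":
    # Hypothetical file name; any writable path works
    seen = load_visited_ids("visited_ids.json")
    page_posts = [{"id": "1001", "title": "example post"}]
    new_posts = filter_new_posts(page_posts, seen)
    save_visited_ids("visited_ids.json", seen)
    print(f"{len(new_posts)} new post(s) on this run")
```

In a crawler like `GubaCrawler`, `load_visited_ids` could be called once in `__init__` and `save_visited_ids` after each page is stored, so an interrupted run loses at most one page of progress.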