Python Web Scraping for Beginners: A Step-by-Step Tutorial with 7 Small Scraper Examples (Source Code Included)!

The following is a step-by-step beginner's tutorial on web scraping with Python:

1. Preparation

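Install Python
- Download the Python release for your operating system from the official Python website (https://www.python.org/) and install it. On Windows, tick "Add Python to PATH" in the installer so that Python can be run directly from the command line.
Choose a development environment
- Beginners can use an integrated development environment (IDE) such as PyCharm, which offers code completion, debugging and similar features. A lightweight editor such as Visual Studio Code with the Python extension installed also works.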
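Understand the basic concepts
- HTML: the markup language that defines a web page's structure. Knowing the basic tags and how they nest makes page content easier to understand.
- HTTP/HTTPS: the network protocols a scraper uses to communicate with servers.
- Legality: when scraping, comply with laws and regulations and with the site's terms of use, and do not collect data illegally.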
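
One concrete way to act on the legality point above is to check a site's robots.txt before fetching a page. The sketch below uses the standard-library urllib.robotparser. The rules text, the `my-crawler` agent name and the example.com URLs are made up for illustration, and robots.txt is only one signal: the site's terms of use and the law still apply.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from
# the site (for example with set_url() followed by read()).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-crawler"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
```

With these sample rules, the first URL is allowed and the second, under /private/, is not.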

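2. Install the required libraries

Requests
- Sends HTTP requests and retrieves page content. Install it from the command line with "pip install requests".
BeautifulSoup
- Parses HTML and XML documents and extracts the data you need. Install it with "pip install beautifulsoup4".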

3. Start scraping

Send a request

Use the Requests library to send a GET request and fetch the page content. For example:

import requests

url = 'https://example.com'
response = requests.get(url)
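
In practice it usually helps to add a timeout and to check the status code before using the response. The following is a minimal sketch of that idea; the helper name `fetch_html` and the 10-second default are illustrative choices, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its text, raising on HTTP error codes."""
    # Without a timeout, a stalled server can block the program indefinitely.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html('https://example.com')` would return the page's HTML as a string, or raise an exception if the request fails.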

Parse the page

Use BeautifulSoup to parse the page content. For example:

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, 'html.parser')
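
To see what the parsed object offers, here is a small sketch that runs on a hand-written HTML string instead of a live page. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# A small hand-written page so the example runs without network access.
html = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 id="main">Hello</h1>
    <p class="intro">First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                           # text inside <title>
heading = soup.find("h1", id="main").get_text()          # first matching tag
intro = soup.find("p", class_="intro").get_text()        # filter by CSS class
paragraphs = [p.get_text() for p in soup.find_all("p")]  # every <p> tag

print(page_title, heading, intro, paragraphs)
```

`find` returns the first match (or None when nothing matches), while `find_all` returns a list of every match.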

Extract data
- Use BeautifulSoup's methods to pull out the data you need according to the page structure. For example, to extract all the links:
```
links = [a['href'] for a in soup.find_all('a', href=True)]
```
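
Links extracted this way are often relative (for example "/about"). A possible follow-up step, sketched below with a made-up base URL and HTML fragment, is to convert them to absolute URLs with the standard-library `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://example.com/articles/"  # hypothetical page address
html = '<a href="/about">About</a> <a href="post-1.html">Post 1</a> <a>No link</a>'

soup = BeautifulSoup(html, "html.parser")
# href=True skips <a> tags that have no href attribute.
absolute_links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
print(absolute_links)
```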

Store data

The extracted data can be saved to a file in a format such as CSV or JSON, or stored in a database. For example, to write the data to a CSV file:

import csv

# Write one row per link: the anchor text and its href.
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Link'])
    for link in links:
        title = soup.find('a', href=link).text
        writer.writerow([title, link])
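
JSON, the other file format mentioned above, can be handled with the standard-library `json` module. This is a minimal sketch; the `records` list and the `data.json` file name are placeholders for whatever the scraper actually produces.

```python
import json

# Hypothetical scraped records; in practice these would come from the parser.
records = [
    {"title": "Example Domain", "link": "https://example.com"},
    {"title": "About", "link": "https://example.com/about"},
]

# ensure_ascii=False keeps non-ASCII text (such as Chinese titles) readable.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read the file back to confirm the round trip.
with open("data.json", encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```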

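4. Advanced techniques

Handle dynamic pages
- Some pages are generated dynamically by JavaScript. For these, a tool such as Selenium can drive a real browser and retrieve the rendered page content.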
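
As a rough sketch of the Selenium approach, the function below opens a page in headless Chrome and returns the HTML after the browser has processed it. It assumes Selenium 4 is installed ("pip install selenium") and that Chrome and a matching driver are available on the machine. The function name `render_page` is illustrative.

```python
def render_page(url):
    """Load url in headless Chrome and return the HTML after scripts have run."""
    # Imported inside the function so the rest of a script still works
    # on machines where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors
```

The returned string can be passed to BeautifulSoup like any other HTML. For content that appears some time after the initial load, Selenium's WebDriverWait can be used to wait for a specific element first.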

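Set request headers

To reduce the chance of the site identifying the request as a crawler, set request headers that mimic a browser. For example: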

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers)
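
When many requests go to the same site, one option is to set the headers once on a `requests.Session` rather than passing them on every call. The sketch below shows this; the `make_session` helper and the Accept-Language value are illustrative assumptions.

```python
import requests


def make_session():
    """Create a Session whose headers are sent with every request."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/92.0.4515.131 Safari/537.36",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    })
    return session


session = make_session()
print(session.headers["User-Agent"])
```

A session also reuses connections and keeps cookies between requests, so `session.get(url)` can replace `requests.get(url, headers=headers)`.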

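Handle exceptions
- Various exceptions can occur during scraping, such as network connection errors and page parsing errors. Wrap these operations in try-except statements to keep the program stable.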
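
The try-except advice above might look like the following in practice. This is a sketch under assumed defaults: the function name `fetch_with_retry`, three attempts and a one-second pause are arbitrary choices for illustration.

```python
import time

import requests


def fetch_with_retry(url, retries=3, delay=1.0, timeout=10):
    """Return the page text, or None if every attempt fails."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as exc:
            # RequestException is the base class for connection errors,
            # timeouts and the HTTP errors raised by raise_for_status().
            print(f"Attempt {attempt} failed: {exc}")
            if attempt < retries:
                time.sleep(delay)
    return None
```

Parsing errors need a separate check: `soup.find(...)` returns None when nothing matches, so reading `.text` from the result raises AttributeError unless the result is tested first.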

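5. Notes

Comply with laws and regulations and with the site's terms of use. Do not scrape sensitive information or violate anyone's privacy.
Limit the crawl rate so that you do not put excessive load on the target site.
Mind the copyright of the data, and do not use scraped data without authorization.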
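
For the note about crawl rate, a simple approach is to pause between consecutive requests. The sketch below is one way to do that. `crawl_politely` and its `fetch` argument (any function that takes a URL) are hypothetical names, and the one-to-two-second range is only an example.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # A randomized pause avoids hitting the site at a fixed rhythm.
            time.sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results
```

For example, `crawl_politely(page_urls, fetch_page)` would call a user-supplied `fetch_page` on each URL with a pause of one to two seconds in between.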

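With the steps above you can get a first grasp of the basics of Python web scraping. As you go further, you can explore more advanced topics such as distributed crawlers and anti-scraping strategies.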

Below are seven small Python scraper examples with source code:

Example 1: Scrape the Douban Movie Top 250 list

import requests
from bs4 import BeautifulSoup

def douban_movie_top250():
    url = "https://movie.douban.com/top250"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # The code looks for one <div class="item"> block per film.
    movies = soup.find_all("div", class_="item")
    for movie in movies:
        title = movie.find("span", class_="title").text
        rating = movie.find("span", class_="rating_num").text
        print(f"Title: {title}, rating: {rating}")

douban_movie_top250()
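
The example above requests a single URL. As a sketch of how further pages of the list could be covered, the snippet below builds one URL per page. It assumes, without confirmation from the original code, that the site paginates with a `start` offset in steps of 25. The helper name `top250_page_urls` is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def top250_page_urls(total=250, page_size=25):
    """Build one URL per page, assuming a `start` offset query parameter."""
    return [
        f"{BASE_URL}?{urlencode({'start': start})}"
        for start in range(0, total, page_size)
    ]


urls = top250_page_urls()
print(len(urls), urls[0], urls[-1])
```

Each URL could then be fetched and parsed in the same way as the single page above, ideally with a pause between requests.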

Example 2: Scrape the Zhihu hot list questions

import requests
from bs4 import BeautifulSoup

def zhihu_hot():
    url = "https://www.zhihu.com/hot"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # The code looks for one <div class="HotItem-content"> block per entry.
    hot_questions = soup.find_all("div", class_="HotItem-content")
    for question in hot_questions:
        title = question.find("a").text
        print(f"Zhihu hot question: {title}")

zhihu_hot()

Example 3: Fetch a weather forecast

import requests

def weather_report(city):
    url = f"http://wthrcdn.etouch.cn/weather_mini?city={city}"
    response = requests.get(url)
    data = response.json()
    # A status value of 1000 is treated as success.
    if data["status"] == 1000:
        weather_info = data["data"]
        city_name = weather_info["city"]
        forecast = weather_info["forecast"][0]  # first forecast entry
        date = forecast["date"]
        high_temp = forecast["high"]
        low_temp = forecast["low"]
        weather_type = forecast["type"]
        print(f"Forecast for {city_name}: {date}, {weather_type}, high {high_temp}, low {low_temp}")
    else:
        print("Could not get the weather forecast for this city.")

weather_report("北京")  # city name in Chinese (Beijing)

Example 4: Scrape Baidu News headlines

import requests
from bs4 import BeautifulSoup

def baidu_news():
    url = "https://news.baidu.com/"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # Headlines are looked up as <a class="news-title"> links.
    news_titles = soup.find_all("a", class_="news-title")
    for title in news_titles:
        print(title.text)

baidu_news()

Example 5: Scrape JD product information

import requests
from bs4 import BeautifulSoup

def jd_product_info(keyword):
    url = f"https://search.jd.com/Search?keyword={keyword}"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # The code looks for one <div class="gl-i-wrap"> block per product.
    products = soup.find_all("div", class_="gl-i-wrap")
    for product in products:
        title = product.find("div", class_="p-name").a.em.text
        price = product.find("div", class_="p-price").strong.i.text
        print(f"Product: {title}, price: {price}")

jd_product_info("手机")  # search keyword in Chinese ("mobile phone")

Example 6: Scrape the Weibo hot search list

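import requests

def weibo_hot():
    url = "https://s.weibo.com/top/summary?cate=realtimehot"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    data = response.text
    marker = '"hotData": ['
    start_index = data.find(marker)
    if start_index == -1:
        # find() returns -1 when the marker is missing; stop instead of
        # slicing from a meaningless position.
        print("hotData was not found in the response.")
        return
    start_index += len(marker)
    end_index = data.find(']', start_index)
    hot_data = data[start_index:end_index]
    hot_items = hot_data.split('},{')
    for item in hot_items:
        if '"word":"' not in item:
            continue  # skip fragments that have no "word" field
        title = item.split('"word":"')[1].split('"')[0]
        print(f"Weibo hot search: {title}")

weibo_hot()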

Example 7: Scrape poems from the Gushiwen classical poetry site

import requests
from bs4 import BeautifulSoup

def ancient_poetry():
    url = "https://www.gushiwen.cn/"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    # Poems are looked up inside <div class="left"> blocks.
    poems = soup.find_all("div", class_="left")
    for poem in poems:
        title = poem.find("h1").text
        author = poem.find("p", class_="source").a.text
        content = poem.find("div", class_="contson").text
        print(f"Poem: {title}, author: {author}, text: {content}")

ancient_poetry()