How to Extract the src Attribute of img Tags from HTML in Python

Prerequisites

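When processing web page data, we often need to extract specific information from HTML, such as image URLs.
This is usually done by reading the src attribute of each img tag.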

Before you begin, make sure BeautifulSoup is installed:

pip install beautifulsoup4

Steps

1. Parse the HTML content

from bs4 import BeautifulSoup  # Import the BeautifulSoup class
html_content = """
<html>
<head><title>Test Page</title></head>
<body>
<img src="image1.jpg" alt="Image 1">
<img src="image2.png" alt="Image 2">
</body>
</html>
"""
soup = BeautifulSoup(html_content, 'html.parser')  # Parse the HTML with Python's built-in parser

2. Find all `img` tags

Use the find_all method to find every img tag in the document.

img_tags = soup.find_all('img')
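
To make the return value of find_all more concrete, here is a minimal self-contained sketch. It assumes a shortened version of the sample page used in this article; the variable name sample_html is illustrative only.

```python
from bs4 import BeautifulSoup

# Illustrative markup that mirrors the two-image test page in this article
sample_html = (
    '<body>'
    '<img src="image1.jpg" alt="Image 1">'
    '<img src="image2.png" alt="Image 2">'
    '</body>'
)

soup = BeautifulSoup(sample_html, "html.parser")
img_tags = soup.find_all("img")

# find_all returns a list-like ResultSet of Tag objects
print(len(img_tags))

for img in img_tags:
    # Tag.get returns None instead of raising KeyError when an attribute is missing
    print(img.get("src"), img.get("alt"))
```

Because the result behaves like a list, it can be indexed, sliced, or iterated directly.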

3. Extract the `src` attribute

Iterate over all the img tags and extract the src attribute of each one.

src_urls = [img['src'] for img in img_tags if img.has_attr('src')]

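The list comprehension builds a list containing every src attribute value.
The img.has_attr('src') check ensures that we only process img tags that actually have a src attribute; without it, img['src'] would raise a KeyError on a tag that has none.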
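
As a possible alternative, the same filtering can be pushed into find_all itself by passing src=True, which matches only tags that carry a src attribute. The sketch below compares both approaches; the html_with_gap sample, which includes one img tag without src, is an assumption added for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the middle tag deliberately has no src attribute
html_with_gap = """
<img src="image1.jpg" alt="Image 1">
<img alt="No source here">
<img src="image2.png" alt="Image 2">
"""

soup = BeautifulSoup(html_with_gap, "html.parser")

# Approach used in this article: filter with has_attr inside the comprehension
with_has_attr = [img["src"] for img in soup.find_all("img") if img.has_attr("src")]

# Alternative: let find_all skip tags that lack a src attribute
with_filter = [img["src"] for img in soup.find_all("img", src=True)]

print(with_has_attr)
print(with_filter)
```

Both lists should contain the same two values; which form to use is mostly a matter of taste.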

Complete code

from bs4 import BeautifulSoup

html_content = """
<html>
<head><title>Test Page</title></head>
<body>
<img src="image1.jpg" alt="Image 1">
<img src="image2.png" alt="Image 2">
</body>
</html>
"""

# Parse the HTML content
soup = BeautifulSoup(html_content, 'html.parser')

# Find all img tags
img_tags = soup.find_all('img')

# Extract the src attributes
src_urls = [img['src'] for img in img_tags if img.has_attr('src')]

# Print the result
print(src_urls)
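
The src values on real pages are often relative paths. One way to handle that, sketched here under assumptions, is to wrap the steps in a function and resolve each value against the page URL with urljoin from the standard library. The function name extract_img_urls, its base_url parameter, and the example.com address are all hypothetical and not part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_img_urls(html, base_url=None):
    """Return the src values of all img tags in html.

    Hypothetical helper: if base_url is given, relative src values
    are resolved against it; otherwise they are returned unchanged.
    """
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img", src=True):
        src = img["src"].strip()
        if not src:
            continue  # skip empty src attributes
        urls.append(urljoin(base_url, src) if base_url else src)
    return urls


# Illustrative input; the base URL is a placeholder, not a real page
page = '<img src="image1.jpg"><img src="/static/image2.png"><img alt="no src">'
print(extract_img_urls(page))
print(extract_img_urls(page, "https://example.com/gallery/"))
```

Passing no base_url keeps the behavior of the complete code above, so the helper can be used as a drop-in replacement for the list comprehension.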