HtmlAgilityPack Operations in Detail

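HtmlAgilityPack:

It is lightweight and efficient, and well suited to routine HTML parsing.
Thanks to its lightweight design, HtmlAgilityPack tends to be fast when you only need to extract or modify element content.
It also handles deeply nested or large HTML documents reasonably smoothly.
The library is small and narrowly focused: it parses HTML and lets you query it with XPath.
It has no built-in support for CSS selectors; that requires an extension library such as Fizzler.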

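1. Install HtmlAgilityPack

Install HtmlAgilityPack through the NuGet package manager, for example by running `Install-Package HtmlAgilityPack` in the Package Manager Console or `dotnet add package HtmlAgilityPack` from the command line.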

2. Sample HTML

Suppose we have the following HTML content that we need to parse and manipulate:

<!DOCTYPE html>
<html>
<head>
<title>HtmlAgilityPack Example</title>
<style>.highlight { color: yellow; } #main { background-color: #f0f0f0; }</style>
</head>
<body>
<h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
<p>This is a <span class='highlight'>simple</span> example.</p>
<a href='https://example.com' target='_blank'>Visit Example.com</a>
<ul id='items'>
<li class='item'>Item 1</li>
<li class='item'>Item 2</li>
<li class='item'>Item 3</li>
</ul>
<input type='text' id='username' value='JohnDoe' />
<input type='password' id='password' />
</body>
</html>

3. Parsing and Manipulating HTML with HtmlAgilityPack

The following C# example walks through the most common operations with HtmlAgilityPack:

using HtmlAgilityPack;
using System;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Sample HTML content
        string html = @"<!DOCTYPE html>
<html>
<head>
<title>HtmlAgilityPack Example</title>
<style>.highlight { color: yellow; } #main { background-color: #f0f0f0; }</style>
</head>
<body>
<h1 id='main-heading' class='highlight'>Welcome to HtmlAgilityPack</h1>
<p>This is a <span class='highlight'>simple</span> example.</p>
<a href='https://example.com' target='_blank'>Visit Example.com</a>
<ul id='items'>
<li class='item'>Item 1</li>
<li class='item'>Item 2</li>
<li class='item'>Item 3</li>
</ul>
<input type='text' id='username' value='JohnDoe' />
<input type='password' id='password' />
</body>
</html>";

        // 1. Load the HTML document
        HtmlDocument document = new HtmlDocument();
        document.LoadHtml(html);

        // 2. Select elements
        // Use XPath to select every element whose class attribute is exactly 'highlight'
        var highlights = document.DocumentNode.SelectNodes("//*[@class='highlight']");
        Console.WriteLine("Elements with class 'highlight':");
        foreach (var elem in highlights)
        {
            Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
        }

        // Select a specific element by its ID
        var mainHeading = document.GetElementbyId("main-heading");
        if (mainHeading != null)
        {
            Console.WriteLine($"\nElement with ID 'main-heading': {mainHeading.InnerText}");
        }

        // Select all <a> tags
        var links = document.DocumentNode.SelectNodes("//a");
        Console.WriteLine("\nAll <a> elements:");
        foreach (var link in links)
        {
            Console.WriteLine($"- Text: {link.InnerText}, Href: {link.GetAttributeValue("href", "")}, Target: {link.GetAttributeValue("target", "")}");
        }

        // Select all <li> elements whose class attribute is exactly 'item'
        var items = document.DocumentNode.SelectNodes("//li[@class='item']");
        Console.WriteLine("\nList items with class 'item':");
        foreach (var item in items)
        {
            Console.WriteLine($"- {item.InnerText}");
        }

        // Select input elements of a specific type
        var textInput = document.DocumentNode.SelectSingleNode("//input[@type='text']");
        var passwordInput = document.DocumentNode.SelectSingleNode("//input[@type='password']");
        Console.WriteLine($"\nText Input Value: {textInput.GetAttributeValue("value", "")}");
        Console.WriteLine($"Password Input Value: {passwordInput.GetAttributeValue("value", "")}");

        // 3. Read and modify attributes
        // Read the href attribute of the first link
        string firstLinkHref = links.First().GetAttributeValue("href", "");
        Console.WriteLine($"\nFirst link href: {firstLinkHref}");

        // Modify the href attribute of the first link
        links.First().SetAttributeValue("href", "https://newexample.com");
        Console.WriteLine($"Modified first link href: {links.First().GetAttributeValue("href", "")}");

        // 4. Read and modify text content
        // Read the text of the first paragraph
        var firstParagraph = document.DocumentNode.SelectSingleNode("//p");
        Console.WriteLine($"\nFirst paragraph text: {firstParagraph.InnerText}");

        // Replace the content of the first paragraph
        firstParagraph.InnerHtml = "This is an <strong>updated</strong> example.";
        Console.WriteLine($"Modified first paragraph HTML: {firstParagraph.InnerHtml}");

        // 5. Work with styles (classes)
        // Read the element's class attribute
        string h1Classes = mainHeading.GetAttributeValue("class", "");
        Console.WriteLine($"\nMain heading classes: {h1Classes}");

        // Add a new class
        mainHeading.SetAttributeValue("class", h1Classes + " new-class");
        Console.WriteLine($"Main heading classes after adding 'new-class': {mainHeading.GetAttributeValue("class", "")}");

        // Remove a class by hand with a plain string replacement. This is a substring
        // match, so it would also alter class names such as 'highlight-strong'.
        h1Classes = mainHeading.GetAttributeValue("class", "").Replace("highlight", "").Trim();
        mainHeading.SetAttributeValue("class", h1Classes);
        Console.WriteLine($"Main heading classes after removing 'highlight': {mainHeading.GetAttributeValue("class", "")}");

        // 6. Traverse and query the DOM
        // Print the tag names of the direct child elements of <body>
        Console.WriteLine("\nChild elements of <body>:");
        var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
        foreach (var child in bodyChildren)
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                Console.WriteLine($"- <{child.Name}>");
            }
        }

        // Find elements whose text contains a given string. SelectNodes returns null
        // when nothing matches, which happens here because step 4 replaced the
        // paragraph that contained 'simple', so check before iterating.
        var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");
        Console.WriteLine("\nElements containing the text 'simple':");
        if (elementsWithText != null)
        {
            foreach (var elem in elementsWithText)
            {
                Console.WriteLine($"- <{elem.Name}>: {elem.InnerText}");
            }
        }
        else
        {
            Console.WriteLine("- (no matches)");
        }

        // 7. Generate and print the modified HTML
        string modifiedHtml = document.DocumentNode.OuterHtml;
        Console.WriteLine("\nModified HTML:");
        Console.WriteLine(modifiedHtml);
    }
}

4. Code Walkthrough

1. Load the HTML document

HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
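
Because the parser is tolerant of malformed markup, it can help to look at what it complained about after loading. The sketch below is a minimal illustration, not part of the original example: the class name `LoadExample` and the method `GetTitle` are hypothetical, and it assumes the `ParseErrors` collection on `HtmlDocument` as found in current HtmlAgilityPack releases.

```csharp
using System;
using HtmlAgilityPack;

public static class LoadExample
{
    // Parses an HTML string and returns the text of its <title>,
    // or an empty string when the document has no <title>.
    public static string GetTitle(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html); // parse from an in-memory string

        // Malformed markup does not throw; the problems the parser noticed
        // are collected in ParseErrors instead.
        foreach (var error in document.ParseErrors)
        {
            Console.WriteLine($"Line {error.Line}: {error.Reason}");
        }

        // SelectSingleNode returns null when nothing matches.
        var title = document.DocumentNode.SelectSingleNode("//title");
        return title?.InnerText ?? "";
    }
}
```

The same `HtmlDocument` can also be filled from a file with `document.Load(path)`, or obtained from a URL with `new HtmlWeb().Load(url)`; those variants are left out here to keep the sketch free of I/O.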

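2. Select elements

Use .SelectNodes("XPath") to select the collection of all elements that match an XPath expression. It returns null, not an empty collection, when nothing matches.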
```
var elements = document.DocumentNode.SelectNodes("//*[@class='class']");
```

Use .SelectSingleNode("XPath") to select a single element. It returns the first match, or null when nothing matches.

var div = document.DocumentNode.SelectSingleNode("//div[@id='title-content']");
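
Since `SelectNodes` yields null for an empty result, a `foreach` over its return value can throw a `NullReferenceException`. One possible way to guard against that is a small wrapper like the one below; `SafeSelect` and `SelectNodesOrEmpty` are hypothetical names introduced for illustration, not part of HtmlAgilityPack.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SafeSelect
{
    // Returns the nodes matching the XPath expression, or an empty
    // sequence when nothing matches, so callers can iterate safely.
    public static IEnumerable<HtmlNode> SelectNodesOrEmpty(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        if (nodes == null)
        {
            return Enumerable.Empty<HtmlNode>();
        }
        return nodes;
    }
}
```

With this helper, `foreach (var li in SafeSelect.SelectNodesOrEmpty(document.DocumentNode, "//li"))` simply runs zero times when the page has no list items.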

Use .GetElementbyId("id") to select a specific element by its ID.
```
var element = document.GetElementbyId("id");
```
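Use .ChildNodes to get an element's child nodes. Note that this is the collection of direct, first-level children only; it does not include deeper descendants.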
```
var bodyChildren = document.DocumentNode.SelectSingleNode("//body").ChildNodes;
```
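Use .FirstChild to get an element's first child node (HtmlNode itself has no First() method; First() is a LINQ extension that applies to collections such as ChildNodes).
```
var firstChildNode = element.FirstChild;
```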
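
To make the difference between direct children and deeper descendants concrete, here is a small sketch. The class `ChildExample` and its method names are hypothetical; it relies on `ChildNodes`, `NodeType` and `Descendants(name)`, which to my knowledge are all part of the `HtmlNode` API.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ChildExample
{
    // Tag names of the direct child elements only.
    // ChildNodes also contains text and comment nodes, so filter by node type.
    public static List<string> DirectChildElementNames(HtmlNode parent)
    {
        return parent.ChildNodes
                     .Where(n => n.NodeType == HtmlNodeType.Element)
                     .Select(n => n.Name)
                     .ToList();
    }

    // Number of elements with the given tag name at any depth below parent.
    public static int CountDescendants(HtmlNode parent, string tagName)
    {
        return parent.Descendants(tagName).Count();
    }
}
```

For `<div><p>a <span>b</span></p><ul><li>x</li></ul></div>`, the direct child elements of the `div` are `p` and `ul`, while the `span` only shows up when searching descendants.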

3. Extract properties

Suppose we want to work with the following element:

var element = document.GetElementbyId("id");

Extract the element's inner HTML:
```
string innerHtml = element.InnerHtml;
```
Extract the HTML including the element itself:
```
string outerHtml = element.OuterHtml;
```
Extract the text:
```
string text = element.InnerText;
```

Extract an attribute value (the second argument is the default returned when the attribute is missing):

string _value = element.GetAttributeValue("value", "");

Extract the href:

string href = element.GetAttributeValue("href", "");
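
Two details are easy to miss when extracting values: `InnerText` returns the text as written in the source, so entities such as `&amp;` are not decoded, and `GetAttributeValue` with a default cannot tell a missing attribute from an empty one. The following sketch shows one way to handle both; `ExtractExample` and its methods are hypothetical helpers, and the calls they rely on (`HtmlEntity.DeEntitize` and the `Attributes` indexer) are used here as I understand them from the library.

```csharp
using HtmlAgilityPack;

public static class ExtractExample
{
    // Returns the node's text with HTML entities decoded and outer whitespace trimmed.
    public static string DecodedText(HtmlNode node)
    {
        return HtmlEntity.DeEntitize(node.InnerText).Trim();
    }

    // Returns the attribute's value, or null when the attribute is not present,
    // which distinguishes "missing" from "present but empty".
    public static string AttributeOrNull(HtmlNode node, string name)
    {
        HtmlAttribute attribute = node.Attributes[name];
        return attribute == null ? null : attribute.Value;
    }
}
```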

4. Modify attributes

Modify the href:

element.SetAttributeValue("href", "https://newexample.com");

Add a class:

string oldClasses = element.GetAttributeValue("class", "");
element.SetAttributeValue("class", oldClasses + " new-class");

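Remove a class:

// Remove a class by hand with a plain string replacement. This is a substring
// match, so it would also alter class names such as "highlight-strong".
string newClasses = element.GetAttributeValue("class", "").Replace("highlight", "").Trim();
element.SetAttributeValue("class", newClasses);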
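
A string `Replace` on the class attribute works on substrings, so removing `highlight` would also mangle a class like `highlight-strong`. A more careful variant splits the attribute into whitespace-separated tokens first. The helper below is a sketch under that assumption; `ClassHelper` and its method names are hypothetical. Recent HtmlAgilityPack releases also appear to ship `AddClass`, `RemoveClass` and `HasClass` on `HtmlNode`, which may be preferable when your installed version has them.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ClassHelper
{
    private static readonly char[] Whitespace = { ' ', '\t', '\r', '\n' };

    // Splits the class attribute into its individual class names.
    private static string[] Tokens(HtmlNode node)
    {
        return node.GetAttributeValue("class", "")
                   .Split(Whitespace, StringSplitOptions.RemoveEmptyEntries);
    }

    // Adds className unless the element already has it.
    public static void AddClassToken(HtmlNode node, string className)
    {
        var tokens = Tokens(node);
        if (!tokens.Contains(className))
        {
            node.SetAttributeValue("class", string.Join(" ", tokens.Concat(new[] { className })));
        }
    }

    // Removes exactly className, leaving similar names such as "highlight-strong" intact.
    public static void RemoveClassToken(HtmlNode node, string className)
    {
        var remaining = Tokens(node).Where(t => t != className);
        node.SetAttributeValue("class", string.Join(" ", remaining));
    }
}
```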

5. Common XPath patterns for selecting elements

Select by id:

var element = document.DocumentNode.SelectSingleNode("//*[@id='id']");

Select by class (this matches only when the class attribute is exactly 'class'):

var elements = document.DocumentNode.SelectNodes("//*[@class='class']");

Select by matching text:

var elementsWithText = document.DocumentNode.SelectNodes("//*[contains(text(), 'simple')]");

Select by class combined with matching text:

var elements = document.DocumentNode.SelectNodes("//span[@class='title-content-title' and contains(text(), 'text to match')]");
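
The `@class='...'` comparison only matches when the whole attribute equals that string, so an element with `class='item active'` is skipped. A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded class name. The sketch below builds such an expression; `XPathExamples` and `ClassTokenXPath` are hypothetical names, and the helper assumes the tag and class names contain no quote characters.

```csharp
using HtmlAgilityPack;

public static class XPathExamples
{
    // Builds an XPath that matches elements whose class attribute contains
    // className as a whole whitespace-separated token.
    // Assumption: tagName and className contain no quotes or other XPath syntax.
    public static string ClassTokenXPath(string tagName, string className)
    {
        return "//" + tagName +
               "[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";
    }
}
```

For example, `document.DocumentNode.SelectNodes(XPathExamples.ClassTokenXPath("li", "item"))` matches both `<li class='item'>` and `<li class='item active'>`, but not `<li class='items'>`.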
