GitHub project -- crawl4ai
- Output HTML
- Output Markdown
- Output structured data
- Comparison with BeautifulSoup
crawl4ai
If I remember correctly, this GitHub project gained over 3,000 stars yesterday and another 2,000 today. It is a web crawling and parsing tool; let's write a quick demo to get a feel for it.
Here we use crawl4ai to scrape GitHub's daily trending page and email the results to ourselves every day.
Output HTML
```python
from crawl4ai import AsyncWebCrawler


async def github_trend_html():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://github.com/trending")
        assert result.success, "GitHub trending fetch failed"
        return result.cleaned_html
```
The output is still HTML, but the original page has been processed: irrelevant and dynamic elements are removed, and the HTML structure is simplified.
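As a rough illustration of the kind of cleanup `cleaned_html` performs (this is not crawl4ai's actual implementation, just a stdlib sketch of the idea), here is a minimal parser that drops `<script>`/`<style>` subtrees and collapses whitespace; attributes are discarded for brevity:

```python
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Toy 'cleaner': drop <script>/<style> subtrees, keep everything else.

    Attributes are dropped for brevity; a real cleaner would keep them.
    """

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.depth = 0  # > 0 while we are inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        elif self.depth == 0:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.out.append(data.strip())


def strip_noise(html: str) -> str:
    parser = ScriptStripper()
    parser.feed(html)
    return "".join(parser.out)


print(strip_noise("<div><script>var x = 1;</script>hello</div>"))
# <div>hello</div>
```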
Output Markdown
```python
from crawl4ai import AsyncWebCrawler


async def github_trend_md():
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(url="https://github.com/trending")
        assert result.success, "GitHub trending fetch failed"
        return result.markdown
```
Open it in a Markdown viewer to see the result:
Output structured data
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy


async def github_trend_json():
    # Each .Box-row on the trending page is one repository entry.
    schema = {
        "name": "Github trending",
        "baseSelector": ".Box-row",
        "fields": [
            {"name": "repository", "selector": ".lh-condensed a[href]", "type": "text"},
            {"name": "description", "selector": "p", "type": "text"},
            {"name": "lang", "selector": "span[itemprop='programmingLanguage']", "type": "text"},
            {"name": "stars", "selector": "a[href*='/stargazers']", "type": "text"},
            {"name": "today_star", "selector": "span.float-sm-right", "type": "text"},
        ],
    }
    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)
    async with AsyncWebCrawler(verbose=True) as crawler:
        result = await crawler.arun(
            url="https://github.com/trending",
            extraction_strategy=extraction_strategy,
            bypass_cache=True,
        )
        assert result.success, "GitHub trending fetch failed"
        github_trending_json = json.loads(result.extracted_content)
        # The anchor text is "owner / repo" with stray whitespace; collapse it
        # into a full repository URL.
        for ele in github_trending_json:
            ele['repository'] = 'https://github.com/' + ''.join(ele['repository'].split())
        return github_trending_json
```
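The repository cleanup at the end of the function can be shown in isolation; the raw string below is a hypothetical example of the whitespace GitHub renders inside the repository anchor:

```python
def normalize_repo(text: str) -> str:
    """Collapse the 'owner / repo' whitespace into a full repository URL."""
    return "https://github.com/" + "".join(text.split())


# Hypothetical raw anchor text as it appears on the trending page
print(normalize_repo("unclecode /\n\n      crawl4ai"))
# https://github.com/unclecode/crawl4ai
```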
Unlike the first two modes, structured output requires a custom schema that defines the data structure to extract. The console prints standard JSON in exactly the shape our schema defines. We then drop the data into an HTML template and send it by email every day. Here is how the email looks:
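The template-and-email step can be sketched with the standard library alone. Everything here is an assumption about my setup, not part of crawl4ai: the field names match the schema above, and the SMTP host and credentials are placeholders you would replace with your own:

```python
import smtplib
from email.mime.text import MIMEText


def build_trending_email(trending: list) -> MIMEText:
    """Render the trending JSON into a simple HTML-table email."""
    rows = "".join(
        f"<tr><td><a href='{e['repository']}'>{e['repository']}</a></td>"
        f"<td>{e.get('description', '')}</td>"
        f"<td>{e.get('lang', '')}</td>"
        f"<td>{e.get('stars', '')}</td>"
        f"<td>{e.get('today_star', '')}</td></tr>"
        for e in trending
    )
    html = (
        "<table border='1'>"
        "<tr><th>Repository</th><th>Description</th><th>Language</th>"
        "<th>Stars</th><th>Stars today</th></tr>"
        f"{rows}</table>"
    )
    msg = MIMEText(html, "html", "utf-8")
    msg["Subject"] = "GitHub Trending"
    return msg


# Sending (placeholder SMTP host and credentials -- replace with your own):
# msg = build_trending_email(github_trending_json)
# with smtplib.SMTP_SSL("smtp.example.com", 465) as server:
#     server.login("user@example.com", "password")
#     server.sendmail("user@example.com", ["me@example.com"], msg.as_string())
```

Scheduling it daily is then just a cron job (or any scheduler) around this script.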
Comparison with BeautifulSoup
I remember the first time I used soup: for someone who had only parsed XML with Java's SAX, BeautifulSoup felt incredibly convenient. After a quick test of crawl4ai today, compared with soup:
- crawl4ai makes data collection and parsing more convenient in a single tool
- soup has to be paired with requests for fetching; BeautifulSoup itself only handles HTML parsing
- HTML parsing is similar in both, driven by CSS selectors, but crawl4ai's schema-based extraction is more convenient
- on the output side, besides Markdown and simplified HTML, crawl4ai can also extract structured data through an OpenAI integration (which I have not tried yet)