Scraping E-commerce Platform Data with Python / Collecting Product Reviews / Visualizing Them as a Word Cloud

Preface

Good morning, good afternoon, or good evening, everyone ❤ ~


Highlights

Using the selenium tool
Parsing structured data
Saving data as CSV

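Environment:

Python 3.8
PyCharm
Google Chrome and a matching ChromeDriver

selenium drives ChromeDriver, which in turn controls the browser, so the script operates the browser the way a person would.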

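Modules used:

selenium (third-party): simulates a person operating the browser

pip install selenium==3.141.0 (installs this specific version; the find_element_by_* methods used below were removed in later Selenium 4 releases)

If installation is slow, switch to a different package mirror.

csv (built-in, no installation needed): saves the data to a CSV file

time (built-in, no installation needed): adds delays and waits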
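
As a small illustration of the csv module, here is a minimal sketch that writes one row with csv.DictWriter and checks what the utf-8-sig encoding adds to the file. The helper name save_rows, the sample row, and the temporary file path are made up for this example; only the column names come from the script further down.

```python
import csv
import os
import tempfile

# Column names used by the scraper later in this article
FIELDS = ['title', 'price', 'comment', 'shop_name', 'href']


def save_rows(path, rows, encoding='utf-8-sig'):
    """Write a list of dicts to path, preceded by a header row (hypothetical helper)."""
    with open(path, mode='w', encoding=encoding, newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# Made-up sample row, for illustration only
sample = [{'title': '示例商品', 'price': '9.90', 'comment': '100+',
           'shop_name': 'demo shop', 'href': 'https://example.com/item/1'}]

path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
save_rows(path, sample)

with open(path, 'rb') as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf', the UTF-8 byte order mark
```

The three leading bytes are the byte order mark that utf-8-sig writes. Spreadsheet programs such as Excel use it to recognize the file as UTF-8, which is why switching to utf-8-sig usually cures garbled Chinese text in the CSV.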

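Installing third-party Python modules:

Press Win + R, type cmd, and click OK; then type the install command pip install <module name> (for example pip install requests) and press Enter
Or, in PyCharm, click Terminal and type the install command there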

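selenium simulates the way a person operates the browser:

Open the browser
Enter the URL
Type the name of the product you want
Click search to view the product data
Extract the data we want
Save the data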

Code

"""
Scrape product data
"""

# Import modules
import pprint
from selenium import webdriver  # import webdriver from selenium
# Import the time module
import time
import csv

word = input('Enter the product you want to search for: ')

# Create a file to save the data; if the CSV shows garbled characters with utf-8, switch to utf-8-sig
f = open(f'{word}.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    'title',
    'price',
    'comment',
    'shop_name',
    'href',
])

# Write the header row
csv_writer.writeheader()

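# If chromedriver is placed in the Python installation directory, the driver path does not need to be specified
# executable_path=r'C:\01-Software-installation\Miniconda3\chromedriver.exe'

# 1. Open the browser
# Instantiate a browser object; selenium needs ChromeDriver in order to open the browser
driver = webdriver.Chrome()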

# 2. Enter the URL
# driver.get('...')  # the address of the shopping site's home page goes here

# 3. Type the name of the product you want
driver.find_element_by_css_selector('#key').send_keys(word)

# 4. Click search to view the product data
driver.find_element_by_css_selector('#search > div > div.form > button > i').click()  # click action

# 5. Scroll down the page so that all product data gets loaded
def drop_down():
    """Scroll the page step by step (done with JavaScript)."""
    for x in range(1, 12, 2):  # 1 3 5 7 9 11; the page height keeps changing as you scroll
        time.sleep(1)  # fixed delay between steps
        j = x / 9  # 1/9  3/9  5/9  7/9  9/9  11/9
        # document.documentElement.scrollTop sets the scrollbar position
        # document.documentElement.scrollHeight gets the full height of the page
        js = 'document.documentElement.scrollTop = document.documentElement.scrollHeight * %f' % j
        driver.execute_script(js)


def get_shop_info():
    driver.implicitly_wait(10)  # implicit wait: wait up to 10 seconds for elements to appear, continuing as soon as they do
    drop_down()

    # 6. Get the li tag of every product
    # In CSS syntax a dot stands for class; a dot plus the class name selects the tag directly
    lis = driver.find_elements_by_css_selector('.gl-item')
    # find_elements returns multiple tags; find_element returns a single tag
    # Take the elements out of the list one by one with a for loop
    for li in lis:
        try:
            title = li.find_element_by_css_selector('a em').text.replace('\n', '')  # title
            price = li.find_element_by_css_selector('.p-price strong i').text  # price
            comment = li.find_element_by_css_selector('.p-commit strong a').text  # number of reviews
            shop_name = li.find_element_by_css_selector('.p-shop span a').text  # shop name
            href = li.find_element_by_css_selector('.p-name a').get_attribute('href')  # detail page
            # 7. Save the data
            dit = {
                'title': title,
                'price': price,
                'comment': comment,
                'shop_name': shop_name,
                'href': href,
            }
            csv_writer.writerow(dit)
            print(title, price, comment, shop_name, href)
            # pprint.pprint(title)  pretty-printing module
        except Exception:
            pass  # skip any item that is missing one of the fields
    driver.find_element_by_css_selector('.pn-next').click()  # click "next page"


for page in range(1, 11):
    print(f'===========================Collecting data from page {page}===========================')
    get_shop_info()

driver.quit()  # close the browser automatically once collection is finished
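
To make the scrolling logic in drop_down() easier to follow, here is a minimal browser-free sketch that only computes the scroll positions and the JavaScript strings. The helper names scroll_fractions and scroll_script are made up for this example, and capping the fraction at 1.0 is my own addition: the script above passes 11/9 as the last value, which points past the end of the page.

```python
def scroll_fractions(start=1, stop=12, step=2, divisor=9):
    """Return the fractions of the page height visited by the scroll loop, capped at 1.0."""
    return [min(x / divisor, 1.0) for x in range(start, stop, step)]


def scroll_script(fraction):
    """Build the JavaScript statement that moves the scrollbar to the given fraction."""
    return ('document.documentElement.scrollTop = '
            'document.documentElement.scrollHeight * %f' % fraction)


fractions = scroll_fractions()
print([round(f, 2) for f in fractions])  # [0.11, 0.33, 0.56, 0.78, 1.0, 1.0]
print(scroll_script(fractions[0]))
```

Each string returned by scroll_script could be handed to driver.execute_script() inside the loop. Scrolling in several small steps rather than jumping straight to the bottom gives lazily loaded product entries time to appear.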

"""
Scrape product review data
"""

import requests
import time

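for page in range(10):
    time.sleep(2)  # pause between requests
    # url and headers (the review API address and the request headers) must be defined before this loop
    response = requests.get(url=url, headers=headers)
    comments = '\n'.join([index['content'] for index in response.json()['comments']])
    # The line above is equivalent to:
    # comments = []                               create an empty list
    # for index in response.json()['comments']:   loop over the elements of the list
    #     a = index['content']                    look up the review text by its key
    #     comments.append(a)                      add the review text to the list
    # comments = '\n'.join(comments)              join the list elements into one string separated by \n
    print(comments)
    with open('评论.txt', mode='a', encoding='utf-8') as f:
        f.write(comments)
        f.write('\n')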
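
The parsing step can be tried out without sending any request. The sketch below assumes the response JSON has the shape implied by the code above, a dict whose 'comments' key holds a list of dicts with a 'content' field. The function name extract_comments and the sample payload are made up for illustration.

```python
def extract_comments(payload):
    """Join the 'content' field of every entry in payload['comments'] with newlines."""
    return '\n'.join(item['content'] for item in payload.get('comments', []))


# Made-up payload standing in for response.json()
sample_payload = {
    'comments': [
        {'content': 'Fast delivery'},
        {'content': 'Good quality'},
    ]
}

text = extract_comments(sample_payload)
print(text)
```

Using payload.get('comments', []) means a response without any reviews yields an empty string instead of raising a KeyError, which is a small safety margin over indexing the key directly.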

"""
Build a word cloud from the reviews
"""

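# Import modules
# jieba word segmentation
import jieba
# Word cloud module
import wordcloud

# Open the file; returns a file object
f = open('评论.txt', encoding='utf-8')
# Read the text content; returns a string
text = f.read()
f.close()
# Split the text into words with jieba; returns a list
text_list = jieba.lcut(text)
print(text_list)
# Merge the list of words into a single string with join
string = ' '.join(text_list)
# Word cloud configuration
wc = wordcloud.WordCloud(
    width=800,
    height=800,
    background_color='white',
    scale=15,
    font_path='msyh.ttc'  # Microsoft YaHei, a font that contains Chinese glyphs
)
# Feed in the text
wc.generate(string)
# Output the word cloud image
wc.to_file('1.png')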
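
Segmented text usually contains punctuation and single-character filler words that add little to a word cloud. Here is a minimal sketch of an optional cleanup step between segmentation and the join, using only the standard library. The token list is a made-up stand-in for the output of jieba.lcut, and clean_tokens is a hypothetical helper, not part of jieba or wordcloud.

```python
from collections import Counter


def clean_tokens(tokens, min_length=2):
    """Keep tokens of at least min_length characters made up only of letters or digits."""
    return [t for t in tokens if len(t) >= min_length and t.isalnum()]


# Made-up tokens standing in for jieba.lcut(text)
tokens = ['手机', '很', '好用', '，', '物流', '很', '快', '。', '手机', '不错', '\n']

kept = clean_tokens(tokens)
string = ' '.join(kept)
print(string)                        # 手机 好用 物流 手机 不错
print(Counter(kept).most_common(1))  # [('手机', 2)]
```

The resulting string can be passed to wc.generate() in the same way as before, and the Counter gives a quick check of which words are likely to appear largest in the image.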

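Closing words 💝

If you have more suggestions or questions, leave a comment or send me a message. Let's keep working hard together (ง •_•)ง

If you enjoyed this, follow the blogger, or like, bookmark, and comment on the article!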
