Project introduction: use Python to scrape JD.com search results for price, shop, shop link, and sales volume, then visualize the data
...........................................................................................................................................................
Project introduction |
Results |
Full code |
...........................................................................................................................................................
Results:
...........................................................................................................................................................
..........................................................................................................................................................
Price by shop visualization:
...........................................................................................................................................................
..........................................................................................................................................................
Sales volume by shop visualization:
..........................................................................................................................................................
..........................................................................................................................................................
Main scraping module:
..........................................................................................................................................................
import selenium.webdriver as driver
from selenium.webdriver.common.by import By
import time
from lxml import etree
import pandas


class GetData:
    """Fetch the raw data: the front-end page source."""

    def __init__(self):
        # Target site: JD search results page for the keyword 唐卡吊坠 (Thangka pendant)
        self.url = 'https://search.jd.com/Search?keyword=%E5%94%90%E5%8D%A1%E5%90%8A%E5%9D%A0&enc=utf-8&wq=%E5%94%90%E5%8D%A1%E5%90%8A%E5%9D%A0&pvid=31f3e974663949f39b95db6bb05ad3f8'
        # Create the browser
        self.edge = driver.Edge()
        # Open the target page
        self.edge.get(self.url)

    def take(self):
        """Interact with the page."""
        button = self.edge.find_element(By.CLASS_NAME, 'weixin-icon')
        button.click()
        # Wait for the user to log in
        time.sleep(10)
        # Final data: source code of the target page
        self.over_data = self.edge.page_source


class Sift:
    """Filter the information."""

    def __init__(self):
        # Create a GetData instance to fetch the front-end source
        geter = GetData()  # Create
        geter.take()  # Interact
        # Final front-end page data
        self.over_data = geter.over_data

    def take(self):
        # Create the XPath parser
        html = etree.HTML(self.over_data)
        # Extract the data
        self.prices = html.xpath('//*[@id="J_goodsList"]/ul/li[*]/div/div[*]/strong/i/text()')
        self.shop = html.xpath('//*[@id="J_goodsList"]/ul/li[*]/div/div[*]/span/a/text()')
        self.shopping = html.xpath('//*[@id="J_goodsList"]/ul/li[*]/div/div[*]/span/a/@href')
        self.ping = html.xpath('/html/body/div[*]/div[*]/div[*]/div[*]/div/div[*]/ul/li[*]/div/div[*]/strong/a/text()')
        # Prepend https: to each shop link by hand
        for i in range(len(self.shopping)):
            self.shopping[i] = 'https:' + self.shopping[i]
        print('Data fetched successfully')

    def sava(self):
        """Save the data."""
        print('Saving...')
        # Build the dataset
        data = {'Price': self.prices, 'Shop': self.shop, 'Shop link': self.shopping, 'Review count / sales': self.ping}
        df = pandas.DataFrame(data)
        # Write to file
        df.to_excel('JD data.xlsx', index=False)
        time.sleep(2)
        print('Saved successfully')
..........................................................................................................................................................
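The scraping code above prepends `https:` to every shop link, which works only if each href is protocol-relative (starts with `//`). A minimal sketch of a more tolerant alternative uses the standard library's `urllib.parse.urljoin`. The helper name `absolutize` and the sample hrefs are hypothetical and are not taken from JD's actual pages.

```python
from urllib.parse import urljoin

# Base URL of the page the links were scraped from (the scraper's search URL, without the query).
BASE_URL = 'https://search.jd.com/Search'


def absolutize(hrefs, base=BASE_URL):
    """Return absolute URLs for a list of scraped href values.

    urljoin fills in whatever the href is missing (scheme, host)
    and leaves already-absolute URLs unchanged.
    """
    return [urljoin(base, href) for href in hrefs]


# Hypothetical sample hrefs, for illustration only.
sample = [
    '//mall.jd.com/index-1000.html',        # protocol-relative
    'https://mall.jd.com/index-2000.html',  # already absolute
    '/shop/3000.html',                      # host-relative
]
links = absolutize(sample)
print(links)
```

Unlike plain string concatenation, this does not produce a broken `https:https://...` value when a link already carries its scheme.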
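flowchart LR
A[Start] --> B[Create a GetData instance]
B --> C[Open the JD search page for the keyword 唐卡吊坠]
C --> D[Click the WeChat login button]
D --> E[Wait 10 seconds for login]
E --> F[Get the page source]
F --> G[Create a Sift instance]
G --> H[Parse the front-end page data]
H --> I[Extract prices]
I --> J[Extract shop names]
J --> K[Extract shop links]
K --> L[Extract review count / sales]
L --> M[Save the data to Excel]
M --> N[End]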
..........................................................................................................................................................
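The last step of the flow above builds one table from four lists that were scraped independently. If one XPath expression matches fewer nodes than the others, the lists differ in length and `pandas.DataFrame` raises a `ValueError` for a dict of unequal-length lists. One possible workaround, sketched here with the standard library only, is to pad the shorter columns first. The helper `pad_columns`, the column names, and the sample values are hypothetical.

```python
from itertools import zip_longest


def pad_columns(columns, fill=''):
    """Pad every list in `columns` (a dict of name -> list) to the same length."""
    names = list(columns)
    # zip_longest walks all lists in parallel and fills the gaps in the shorter ones.
    rows = list(zip_longest(*(columns[name] for name in names), fillvalue=fill))
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


# Hypothetical scraped values, for illustration only.
scraped = {
    'price': ['19.90', '35.00', '128.00'],
    'shop': ['Shop A', 'Shop B'],
    'link': ['https://mall.jd.com/index-1.html'],
}
padded = pad_columns(scraped)
print({name: len(values) for name, values in padded.items()})
```

Padding only keeps the save step from failing. It does not guarantee that the values in a row belong to the same product, so extracting all fields per `li` element would be the more robust design.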
Main visualization module:
..........................................................................................................................................................
import re
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from collections import Counter


class MakePlot:
    """Word cloud and bar chart; the run script calls look.MakePlot().make()."""

    def make(self):
        # Read the TXT file content
        with open("唐卡吊坠2.txt", "r") as f:
            txt_data = f.read()

        # Clean the data: keep only runs of Chinese characters (each run counts as one word)
        words = re.findall(r'[\u4e00-\u9fa5]+', txt_data)

        # Count word frequencies
        word_counts = Counter(words)

        # Build the word cloud
        font_path = '方正仿宋简体.ttf'  # Font path; adjust to your environment
        wordcloud = WordCloud(font_path=font_path, width=800, height=400,
                              background_color='white').generate_from_frequencies(word_counts)

        # Show the word cloud
        plt.figure(figsize=(10, 5))
        plt.imshow(wordcloud, interpolation='bilinear')
        plt.axis('off')
        plt.show()

        # Set a Chinese font before drawing the bar chart
        plt.rcParams['font.sans-serif'] = ['SimHei']  # Default font
        plt.rcParams['axes.unicode_minus'] = False  # Keep the minus sign '-' from rendering as a box

        # Draw the bar chart
        common_words = word_counts.most_common(10)
        labels, values = zip(*common_words)
        plt.figure(figsize=(10, 5))
        plt.bar(labels, values)
        plt.xlabel('Words')
        plt.ylabel('Count')
        plt.title('Top 10 Most Common Words')
        plt.xticks(rotation=45)  # Rotate the x-axis labels so they are easier to read
        plt.show()
..........................................................................................................................................................
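Note that the regular expression in the code above does not perform real word segmentation. `re.findall(r'[\u4e00-\u9fa5]+', ...)` returns each unbroken run of Chinese characters as a single token, so text is split only at punctuation, spaces, digits, and Latin letters. The short sketch below shows this on a made-up string; the sample text is hypothetical and does not come from the scraped file.

```python
import re
from collections import Counter

# Made-up sample text, for illustration only.
sample_text = '唐卡吊坠,唐卡吊坠 项链abc唐卡 123吊坠!项链'

# Each unbroken run of characters in the range U+4E00..U+9FA5 becomes one token.
tokens = re.findall(r'[\u4e00-\u9fa5]+', sample_text)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(2))
```

Here `唐卡吊坠` is counted as one word and is not split into `唐卡` and `吊坠`. If finer-grained words are wanted, a dedicated Chinese segmentation library would be needed, which is outside what the original code does.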
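flowchart LR
A[Start] --> B[Read the TXT file content]
B --> C[Clean the data: keep only runs of Chinese characters]
C --> D[Count word frequencies]
D --> E[Build the word cloud]
E --> F[Show the word cloud]
F --> G[Get the 10 most common words]
G --> H[Draw the bar chart]
H --> I[Show the bar chart]
I --> J[End]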
..........................................................................................................................................................
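One edge case in the flow above: if the text file contains no Chinese characters, `most_common(10)` returns an empty list, and `labels, values = zip(*common_words)` then fails with a `ValueError` because there is nothing to unpack. A minimal sketch of a guard follows, assuming a hypothetical helper named `top_words`.

```python
from collections import Counter


def top_words(word_counts, n=10):
    """Return (labels, values) for the n most common words, or two empty lists."""
    common = word_counts.most_common(n)
    if not common:
        # Nothing to plot: avoid unpacking an empty zip().
        return [], []
    labels, values = zip(*common)
    return list(labels), list(values)


labels, values = top_words(Counter(['a', 'b', 'a']), n=2)
print(labels, values)
print(top_words(Counter()))
```

The caller can then skip the bar chart when `labels` is empty.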
Run script:
..........................................................................................................................................................
import get
import look


class Main:
    def __init__(self):
        # Fetch the target data
        geter = get.Sift()  # Create the Sift class defined in get.py
        geter.take()
        geter.sava()
        # Run the visualization
        layout = look.MakePlot()
        layout.make()


if __name__ == '__main__':
    Main()
..........................................................................................................................................................
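flowchart LR
A[Start] --> B[Create a Main instance]
B --> C[Create the Sift class from get.py]
C --> D["Call take() to fetch the data"]
D --> E["Call sava() to save the data"]
E --> F[Create the MakePlot class from look.py]
F --> G["Call make() to run the visualization"]
G --> H[End]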
...........................................................................................................................................................
Guff_hys - CSDN blog
...........................................................................................................................................................