目录
1.项目架构
1.1介绍
1.2系统架构
1.3所需依赖包
2.项目的使用
3.项目效果图
3.1登录界面(login_ui.py):
3.2点击注册,跳转去注册界面(register_ui.py )
3.3 movie_util.py文件主要功能是抓取和读取电影数据
3.4 点击历史电影搜索:
3.5 在映电影搜索
3.6 统计图
4.总结
1.项目架构
1.1介绍
一个基于爬虫技术的海量电影数据分析+登录、注册界面连接Mysql系统
1.2系统架构
dao
|-- __init__.py
|-- movie_dao.py # 电影dao层接口类
|-- login_dao.py # 用户dao层接口类
ui
|-- __init__.py
|-- register_ui.py # 注册界面
|-- login_ui.py # 登录界面
|-- movie_ui.py # 电影主界面
data
|-- __init__.py
|-- movies_tablets.csv # 电影表格数据
|-- moviesBoxOffice.csv # 历史电影数据
|-- recentlyMovies.csv # 在映电影数据
|-- top10_data.csv # 电影前十数据
|-- top_movie.csv # 电影排名数据
utils
|-- __init__.py
|-- db_helper.py # dbhelper帮助类
|-- movie_util.py # 电影排行榜和关键字查询电影接口定义
|-- pyec.py # 电影排行榜和关键字查询电影接口定义
main.py # 运行程序入口
1.3所需依赖包
numpy:Python中基于数组对象的科学计算库。
pandas:Python的一个数据分析包,该工具为解决数据分析任务而创建。
requests:Python中常用的HTTP库,它提供了方便的HTTP请求和响应处理功能。
json:json模块提供了处理JSON数据的强大工具。
sklearn:Scikit-learn(sklearn)是机器学习中常用的第三方模块。
webbrowser:使用默认浏览器。
tkinter:Python 的标准 GUI 库。Python 使用 Tkinter 可以快速的创建 GUI 应用程序。
collections:Python内建的一个集合模块,提供了许多有用的集合类和方法。
pyecharts: 一个基于 ECharts 的 Python 数据可视化库,它允许用户使用 Python 语言生成各种类型的交互式图表和数据可视化。
re:re模块进行正则表达式匹配的方法和语法。
2.项目的使用
运行ui中的login_ui.py即可
3.项目效果图
3.1登录界面(login_ui.py):
调用了login_dao.py中的函数(user_login:登录验证)
def user_login(self, username: str, password: str):# 创建db数据库db = db_helper()# sql语句sql = "select * from t_user where luser = %s and lpwd = %s"# 占位符values = [username, password]# 调用db中的添加函数inserts = db.execute_query(str(sql), values)# 检查结果if inserts:print("登录成功!")return 1else:print("登录失败,密码或者账号错误")return 0
3.2点击注册,跳转去注册界面(register_ui.py )
def do_search_kw(self):# 销毁当前窗口self.root.destroy()# 打开新窗口from ui.register_ui import register_uire = register_ui()re.show()
3.3 movie_util.py文件主要功能是抓取和读取电影数据
(1)recently()
这一函数主要是抓取最近上映票房排名前十名的电影信息。
url = "https://ys.endata.cn/enlib-api/api/movie/getMovie_BoxOffice_Day_Chart.do"header = {
"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',
"Cookie": 'JSESSIONID=b2685bfa-aa4f-4359-ae96-57befaf8d1ec; route=4e39643a15b7003e568cadd862137cf3; Hm_lvt_82932fc4fc199c08b9a83c4c9d02af11=1649834963,1649852471,1649859039,1649900037; Hm_lpvt_82932fc4fc199c08b9a83c4c9d02af11=1649917933'
}post_BoxOffice_Day_data = {'r': 0.7572955414768414,'datetype': 'Day','date': datetime.now().strftime('%Y-%m-%d'),'sdate': datetime.now().strftime('%Y-%m-%d'),'edate': datetime.now().strftime('%Y-%m-%d'),'bserviceprice': 1
} ```
以上代码块是运行爬虫前的准备工作,包含抓取的网址url、爬虫所需的请求头、请求时需要附带的数据。
python
res = requests.post(url, headers=header, data=post_BoxOffice_Day_data).text
json_data = json.loads(res)
data0 = json_data['data']['table0']
data1 = json_data['data']['table1']
以上代码块是运行爬虫并将其解析为json形式,方便后面对数据进行取出。
movie_rank = []
movie_details_MovieName = []
movie_details_BoxOffice = []
movie_details_ShowCount = []
movie_details_AudienceCount = []
movie_details_Attendance = []movie_percent_BoxOfficePercent = []
movie_percent_ShowCountPercent = []
movie_percent_AudienceCountPercent = []
以上代码是部分定义的所需的数据字段。
for i in range(10):movie_rank.append(data0[i]['Irank'])movie_details_MovieName.append(data0[i]['MovieName'])movie_details_BoxOffice.append(data0[i]['BoxOffice'])movie_details_ShowCount.append(data0[i]['ShowCount'])movie_details_AudienceCount.append(data0[i]['AudienceCount'])movie_details_Attendance.append(data0[i]['Attendance'])
以上是从json数据中取数据的过程。
top10_data = pd.DataFrame({"影片排名": movie_rank,"影片名称": movie_details_MovieName,"影片票房": movie_details_BoxOffice,"影片场次": movie_details_ShowCount,"影片人次": movie_details_AudienceCount,"上座率": movie_details_Attendance,"影片票房占比": movie_percent_BoxOfficePercent,"影片场次占比": movie_percent_ShowCountPercent,"影片人次占比": movie_percent_AudienceCountPercent,"一线城市票房": movie_city1_BoxOffice,"一线城市场次": movie_city1_ShowCount,"一线城市人次": movie_city1_AudienceCount,"二线城市票房": movie_city2_BoxOffice,"二线城市场次": movie_city2_ShowCount,"二线城市人次": movie_city2_AudienceCount,"三线城市票房": movie_city3_BoxOffice,"三线城市场次": movie_city3_ShowCount,"三线城市人次": movie_city3_AudienceCount,"四线城市票房": movie_city4_BoxOffice,"四线城市场次": movie_city4_ShowCount,"四线城市人次": movie_city4_AudienceCount,"其它票房": movie_others_BoxOffice,"其它场次": movie_others_ShowCount,"其它人次": movie_others_AudienceCount
})
print(top10_data)
top10_data.to_csv("data/top10_data.csv", encoding='gbk', index=False)
以上是定义数据表并将数据表填满,打印数据表,保存数据表的过程。
3.4 点击历史电影搜索:
# 历史电影数据def history(self):# showerror(title="失败", message="历史电影数据获取失败")if self.treeview is not None:self.clear_tree(self.treeview) # 清空表格self.create_tree_history()self.btn_top2['text'] = '正在努力搜索'list = history(int(self.movie_num_entry.get()))self.add_tree(list, self.treeview)self.btn_top2['state'] = NORMALself.btn_top2['text'] = '历史电影搜索'
调用movie_util.py中的方法
def history(num:int):data = pd.read_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie//data/moviesBoxOffice.csv", encoding='gbk')data = np.array(data[:num]).tolist()print(data)return data
3.5 在映电影搜索
# 在映电影搜索def showing(self):# showerror(title="失败", message="系统爬虫失效或超时,请联系系统开发者")if self.treeview is not None:self.clear_tree(self.treeview) # 清空表格self.create_tree_showing()self.btn_top['text'] = '正在努力搜索'showing(int(self.movie_num_entry.get()))list = np.array(pd.read_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/data/recentlyMovies.csv", encoding='gbk')).tolist()self.add_tree(list, self.treeview) # 将数据添加到tree中self.btn_top['state'] = NORMALself.btn_top['text'] = '在映电影搜索'
def showing(num: int):header = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.82 Safari/537.36',"Cookie": 'JSESSIONID=edf01a0d-deae-4143-9071-2e7eda2c5055; route=4e39643a15b7003e568cadd862137cf3; Hm_lvt_82932fc4fc199c08b9a83c4c9d02af11=1649859039,1649900037,1649983572,1649988152; Hm_lpvt_82932fc4fc199c08b9a83c4c9d02af11=1650016413'}url_total = "https://ys.endata.cn/enlib-api/api/movie/getMovie_BoxOffice_Day_List.do"total_post_data = {'r': 0.08330054546930543,'datetype': 'Day','date': datetime.now().strftime('%Y-%m-%d'),'sdate': datetime.now().strftime('%Y-%m-%d'),'edate': datetime.now().strftime('%Y-%m-%d'),'bserviceprice': 1,'columnslist': '100,102,103,119,105,107,109,106,112,129,142,143,163,164,165','pageindex': 1,'pagesize': 20,'order': 103,'ordertype': 'desc',}total_res = requests.post(url_total, headers=header, data=total_post_data).texttotal_json_data = json.loads(total_res)pagesize = total_json_data['data']['table2'][0]['TotalCounts']total_post_data = {'r': 0.08330054546930543,'datetype': 'Day','date': datetime.now().strftime('%Y-%m-%d'),'sdate': datetime.now().strftime('%Y-%m-%d'),'edate': datetime.now().strftime('%Y-%m-%d'),'bserviceprice': 1,'columnslist': '100,102,103,119,105,107,109,106,112,129,142,143,163,164,165','pageindex': 1,'pagesize': pagesize,'order': 103,'ordertype': 'desc',}total_res = requests.post(url_total, headers=header, data=total_post_data).texttotal_json_data = json.loads(total_res)['data']['table1']print(total_json_data)print('total=',len(total_json_data))movies_rank = []movies_MovieName = []movies_BoxOffice = []movies_ReleaseDate = []movies_TotalBoxOffice = []movies_ShowCount = []movies_AudienceCount = []movies_BoxOfficePercent = []movies_ReleaseDay = []movies_ShowDay = []movies_HjBoxOffice = []movies_HjShowCount = []movies_HjBoxOfficePercent = []movies_HjShowCountPercent = []movies_HjAudienceCountPercent = []movies_MaoYanWantToSee = []movies_TaoPiaoPiaoWantToSee = []movies_DouBanWantToSee = []for i in range(num):if total_json_data[i]['EntMovieID'] != 0:movies_rank.append(total_json_data[i]['Irank'])movies_MovieName.append(total_json_data[i]['MovieName'])movies_BoxOffice.append(total_json_data[i]['BoxOffice'])movies_ReleaseDate.append(total_json_data[i]['ReleaseDate'])movies_TotalBoxOffice.append(total_json_data[i]['TotalBoxOffice'])movies_ShowCount.append(total_json_data[i]['ShowCount'])movies_AudienceCount.append(total_json_data[i]['AudienceCount'])movies_BoxOfficePercent.append(total_json_data[i]['BoxOfficePercent'])movies_ReleaseDay.append(total_json_data[i]['ReleaseDay'])movies_ShowDay.append(total_json_data[i]['ShowDay'])movies_HjBoxOffice.append(total_json_data[i]['HjBoxOffice'])movies_HjShowCount.append(total_json_data[i]['HjShowCount'])movies_HjBoxOfficePercent.append(total_json_data[i]['HjBoxOfficePercent'])movies_HjShowCountPercent.append(total_json_data[i]['HjShowCountPercent'])movies_HjAudienceCountPercent.append(total_json_data[i]['HjAudienceCountPercent'])post_data = {'r': 0.3270070971758279,'entmovieid': total_json_data[i]['EntMovieID']}res = json.loads(requests.post(url="https://ys.endata.cn/enlib-api/api/movie/getMovie_HeadBoxOfficeByMovieID.do",headers=header, data=post_data).text)print(total_json_data[i]['EntMovieID'])print(res)movies_MaoYanWantToSee.append(res['data']['table0'][0]['MaoYanWantToSee'])print(movies_MaoYanWantToSee)movies_TaoPiaoPiaoWantToSee.append(res['data']['table0'][0]['TaoPiaoPiaoWantToSee'])movies_DouBanWantToSee.append(res['data']['table0'][0]['DouBanWantToSee'])total_data = pd.DataFrame({"排名": movies_rank,"影片名称": movies_MovieName,"当前票房": movies_BoxOffice,"上映日期": movies_ReleaseDate,"累计票房": movies_TotalBoxOffice,"当前场次": movies_ShowCount,"当前人次": movies_AudienceCount,"票房占比": movies_BoxOfficePercent,"累计上映天数": movies_ReleaseDay,"当前统计天数": movies_ShowDay,"淘票票想看数": movies_TaoPiaoPiaoWantToSee,"猫眼想看数": movies_MaoYanWantToSee,"豆瓣想看数": movies_DouBanWantToSee,"黄金场票房": movies_HjBoxOffice,"黄金场场次": movies_HjShowCount,"黄金场票房占比": movies_HjBoxOfficePercent,"黄金场场次占比": movies_HjShowCountPercent,"黄金场人次占比": movies_HjAudienceCountPercent})total_data.to_csv("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/data/recentlyMovies.csv", encoding='gbk', index=False)
3.6 统计图
def clicking(self):# showerror(title="失败", message="跳转在映电影数据分析失败")recently()from python.yy_movie.utils.pyec import ShowingShowing()webbrowser.open("C:/Users/LoveB/PycharmProjects/pythonProject1/python/yy_movie/在映电影分析.html")
4.总结
以上就是电影项目大致流程,更详细点击我的主页获取完整代码,如果对你有帮助,点赞关注评论一下呗~~