UCAS Data Mining (Course 0812): Homework Assignments

Preface: This post records the three homework assignments from the fall 2024 data mining course. The answers are for reference only.

Assignment 1

1

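Suppose a data warehouse contains four dimensions (date, product, vendor, location) and two measures (sales_volume and sales_cost).

1) Draw the star schema diagram for this data warehouse.
[Figure omitted]
2) Starting from the base cuboid [date, product, vendor, location], list the sales_volume of each vendor in Los Angeles for each year.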

roll up on product from basic (key) to all
roll up on location from basic (key) to city
roll up on date from basic (key) to year
slice for location = 'Los Angeles'
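
As a rough illustration of those four operations, the sketch below applies them to a few made-up fact rows in plain Python. The rows, the `Fact` tuple, and the `yearly_volume_by_vendor` helper are all hypothetical stand-ins for the base cuboid; a real warehouse would run this in an OLAP engine or in SQL.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One cell of the base cuboid [date, product, vendor, location]."""
    date: str          # ISO date string, e.g. "2023-04-01"
    product: str
    vendor: str
    city: str          # location already mapped to city level (assumption)
    sales_volume: int


def yearly_volume_by_vendor(facts, city):
    """Roll up date to year and product to all, then slice on one city."""
    totals = defaultdict(int)
    for f in facts:
        if f.city != city:        # slice: location = city
            continue
        year = f.date[:4]         # roll up: date -> year
        # product is left out of the key, i.e. rolled up to "all"
        totals[(year, f.vendor)] += f.sales_volume
    return dict(totals)


# Hypothetical sample rows, for illustration only.
facts = [
    Fact("2023-01-15", "p1", "v1", "Los Angeles", 10),
    Fact("2023-06-02", "p2", "v1", "Los Angeles", 5),
    Fact("2023-03-09", "p1", "v2", "Los Angeles", 7),
    Fact("2024-02-11", "p3", "v1", "Los Angeles", 4),
    Fact("2023-05-20", "p1", "v1", "Chicago", 99),
]

result = yearly_volume_by_vendor(facts, "Los Angeles")
for (year, vendor), volume in sorted(result.items()):
    print(year, vendor, volume)
```

The vendor dimension stays at its base level, which is why it remains part of the grouping key.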

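3) Bitmap indexes are useful in data warehouses. Taking this cube as an example, briefly discuss the advantages and problems of using a bitmap index structure.
[Figure omitted]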

2

Design a data warehouse for a regional weather bureau. The weather bureau has about 1000 probes, which are scattered throughout various land and ocean locations in the region to collect basic weather data, including air pressure, temperature, and precipitation at each hour. All data are sent to the central station, which has collected such data for over 10 years. Your design should facilitate efficient querying and online analytical processing, and derive general weather patterns in multidimensional space. (note: please present the schema, the fact table(s) and the dimension tables with concept hierarchy)

[Figure omitted]

3

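The sales data of supermarket product A over 20 consecutive months are given below (unit: hundred yuan). Series B is the data for product B, which is used in part 4.
A: 21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26.
B: 38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50.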

1) Calculate the mean, median, and standard deviation of the sales data.

Mean = 21.5, median = 21.5, standard deviation ≈ 3.22 (population form, dividing by n = 20).
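
These three values can be checked with Python's standard `statistics` module. This is only a verification sketch; the variable names are my own. Note that 3.22 corresponds to the population standard deviation, while the sample form (dividing by n - 1) gives a slightly larger value.

```python
import statistics

# Product A, 20 consecutive months (unit: hundred yuan).
sales_a = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17,
           16, 20, 23, 22, 18, 24, 26, 25, 20, 26]

mean_a = statistics.mean(sales_a)
median_a = statistics.median(sales_a)
std_population = statistics.pstdev(sales_a)  # divides by n
std_sample = statistics.stdev(sales_a)       # divides by n - 1

print(mean_a, median_a, round(std_population, 2), round(std_sample, 2))
```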

2) Draw the boxplot.

Min = 16, Q1 = 19, median = 21.5, Q3 = 24, Max = 27.
[Figure omitted]
3) Normalize the values based on min-max normalization.
[Figure omitted]
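
Min-max normalization maps each value v to (v - min) / (max - min) * (new_max - new_min) + new_min. The sketch below applies it to the product A data; the target range [0, 1] is an assumption, since the original answer is only available as an image, and `min_max_normalize` is a name I introduce for illustration.

```python
# Product A, 20 consecutive months (unit: hundred yuan).
sales_a = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17,
           16, 20, 23, 22, 18, 24, 26, 25, 20, 26]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    span = hi - lo  # 27 - 16 = 11 for this data; must be non-zero
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]


normalized = min_max_normalize(sales_a)
print([round(v, 3) for v in normalized])
```

With this data the smallest value 16 maps to 0 and the largest value 27 maps to 1; for example 21 maps to 5/11.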

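4) Suppose the sales data of product B over 20 consecutive months (unit: hundred yuan) are as follows: 38, 24, 38, 45, 46, 44, 42, 34, 40, 30, 31, 40, 40, 32, 36, 42, 50, 47, 46, 50.
Calculate the correlation coefficient (Pearson's product moment coefficient). Are these products positively or negatively correlated?

The computed correlation coefficient is about 0.831, which indicates that products A and B are positively correlated.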
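
The coefficient can be reproduced directly from the definition r = Σ(a - ā)(b - b̄) / sqrt(Σ(a - ā)² · Σ(b - b̄)²). The following is a small sketch using only the standard library; `pearson_r` is a helper name of my own.

```python
import math

sales_a = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17,
           16, 20, 23, 22, 18, 24, 26, 25, 20, 26]
sales_b = [38, 24, 38, 45, 46, 44, 42, 34, 40, 30,
           31, 40, 40, 32, 36, 42, 50, 47, 46, 50]


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(sales_a, sales_b)
print(round(r, 3))  # a positive value means positive correlation
```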

5) Draw the scatter plot for the sales data of the two products.

[Figure omitted]

4

The sales data of supermarket product A over 20 consecutive months are given below (unit: hundred yuan): 21, 16, 19, 24, 27, 23, 22, 21, 20, 17, 16, 20, 23, 22, 18, 24, 26, 25, 20, 26. Smooth the noise in these data using equal-depth binning with a depth of 5.

Answer:

First sort the 20 values: 16, 16, 17, 18, 19, 20, 20, 20, 21, 21, 22, 22, 23, 23, 24, 24, 25, 26, 26, 27. With equal-depth binning of depth 5, the bins are:
Bin1: 16, 16, 17, 18, 19
Bin2: 20, 20, 20, 21, 21
Bin3: 22, 22, 23, 23, 24
Bin4: 24, 25, 26, 26, 27

1) Smooth by bin medians.

Bin1 has median 17, so it becomes Bin1: 17, 17, 17, 17, 17
Bin2 has median 20, so it becomes Bin2: 20, 20, 20, 20, 20
Bin3 has median 23, so it becomes Bin3: 23, 23, 23, 23, 23
Bin4 has median 26, so it becomes Bin4: 26, 26, 26, 26, 26

2) Smooth by bin boundaries.

After smoothing:
Bin1: 16, 16, 16, 19, 19
Bin2: 20, 20, 20, 21, 21
Bin3: 22, 22, 22, 22, 24 or 22, 22, 24, 24, 24 (23 is equidistant from both boundaries)
Bin4: 24, 24, 27, 27, 27
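
The binning and both smoothing variants can be sketched in a few lines of Python. The helper names are my own, and the tie rule in `smooth_by_boundaries` (a value equally far from both boundaries goes to the lower one) is an arbitrary assumption; it corresponds to the first of the two Bin3 alternatives.

```python
import statistics

sales_a = [21, 16, 19, 24, 27, 23, 22, 21, 20, 17,
           16, 20, 23, 22, 18, 24, 26, 25, 20, 26]


def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_median(bins):
    """Replace every value in a bin by that bin's median."""
    return [[statistics.median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary (assumption, see lead-in).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = equal_depth_bins(sales_a, 5)
by_median = smooth_by_median(bins)
by_boundary = smooth_by_boundaries(bins)

for name, rows in (("median", by_median), ("boundary", by_boundary)):
    for i, row in enumerate(rows, start=1):
        print(name, f"Bin{i}:", row)
```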

Assignment 2

1

Given a data set below for attributes {Height, Hair, Eye} and two classes {C1, C2}.
[Figure omitted]
1) Compute the Information Gain for Height, Hair and Eye.
[Figures omitted]
2) Construct a decision tree with Information Gain.

[Figures omitted]

2

Classify the unknown sample Z based on the training data set in Q1:

Z = (Height = Short, Hair = blond, Eye = brown). How would a naïve Bayesian classifier classify Z?

[Figure omitted]

3

Note: the answer to this question is probably not entirely correct.
1) Design a multilayer feed-forward neural network (one hidden layer) for the data set in Q1. Label the nodes in the input and output layers.
[Figure omitted]
2) Using the neural network obtained above, show the weight values after one iteration of the back propagation algorithm, given the training instance "(Tall, Red, Brown)". Indicate your initial weight values and biases and the learning rate used.
[Figures omitted]

4

Consider the data set shown in Table 1 (min_sup = 60%, min_conf = 70%).
[Table 1 omitted]

1) Find all frequent itemsets using Apriori by treating each transaction ID as a market basket.
[Figure omitted]
2) Use the results in part 1) to compute the confidence for the association rules {a, b}->{c} and {c}->{a, b}. Is confidence a symmetric measure?
[Figure omitted]
3) List all of the strong association rules (with support s and confidence c) matching the following metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g. "A", "B", etc.)
[Figures omitted]

5

Assume a supermarket would like to promote pasta. Use the data in “transactions” as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy pasta or not.

Build a decision tree using data set “transactions” that predicts pasta as a function of the other fields. Set the “type” of each field to “Flag”, set the “direction” of “pasta” as “out”, set the “type” of COD as “Typeless”, select “Expert” and set the “pruning severity” to 65, and set the “minimum records per child branch” to be 95. Hand-in: A figure showing your tree.

[Figure omitted]

6

Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the “rollout” data to determine whether the customer would buy pasta.
1) Hand-in: your prediction for each of the 20 customers. (10 points)

[Figure omitted]
2) Hand-in: rules for positive (yes) prediction of pasta purchase identified from the decision tree (up to the fifth level. The root is considered as level 1). (10 points)
[Figure omitted]

Assignment 3

1

Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters:

A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7)

The distance function is Euclidean distance. Suppose initially we assign A2, B2, C2 as the center of each cluster, respectively. Use the K-Means algorithm to show only:

1) The three cluster centers after the first round of execution

[Figure omitted]
2) The final three clusters
[Figure omitted]
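
The two requested results can be reproduced with a small pure-Python K-Means sketch. One caveat: in the first round A1 is at squared distance 14 from both B2 and C2, so the outcome depends on how the tie is broken. The code below sends ties to the lower-numbered cluster (B2), which is my assumption; sending A1 to the C2 cluster instead leads to a different final grouping, and the original figure may have chosen either.

```python
points = {
    "A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
    "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
    "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7), "C4": (5, 6, 7),
}


def squared_distance(p, q):
    """Squared Euclidean distance (same ordering as the true distance)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centers):
    """Put each point in the cluster of its nearest center."""
    clusters = [[] for _ in centers]
    for name, p in points.items():
        distances = [squared_distance(p, c) for c in centers]
        # index() returns the first minimum, so ties go to the lower index.
        clusters[distances.index(min(distances))].append(name)
    return clusters


def update(points, clusters):
    """Recompute each center as the mean of its members (none is empty here)."""
    centers = []
    for members in clusters:
        coords = [points[name] for name in members]
        centers.append(tuple(sum(axis) / len(coords) for axis in zip(*coords)))
    return centers


# Round 1: initial centers are A2, B2, C2.
centers = [points["A2"], points["B2"], points["C2"]]
first_clusters = assign(points, centers)
first_centers = update(points, first_clusters)
print("centers after round 1:", first_centers)

# Iterate until the assignment stops changing.
clusters, centers = first_clusters, first_centers
while True:
    new_clusters = assign(points, centers)
    if new_clusters == clusters:
        break
    clusters = new_clusters
    centers = update(points, clusters)
print("final clusters:", clusters)
```

Under this tie rule the assignment is already stable after the first round, so the final clusters equal the first-round clusters.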

2

Table 2 gives a User-Product rating matrix.
[Table 2 omitted]
1) List the top 3 users most similar to User 2 based on cosine similarity
[Figure omitted]
2) Predict User 2's rating for Product 2
[Figure omitted]

3

The goal of this assignment is to introduce churn management using decision trees, logistic regression and neural network. You will try different combinations of the parameters to see their impacts on the accuracy of your models for this specific data set. This data set contains summarized data records for each customer for a phone company. Our goal is to build a model so that this company can predict potential churners.

Two data sets are available, churn_training.txt and churn_validation.txt. Each data set has 21 variables. They are:

State:

Account_length: how long this person has been in this plan

Area_code:

Phone_number:

International_plan: this person has international plan=1, otherwise=0

Voice_mail_plan: this person has voice mail plan=1, otherwise=0

Number_vmail_messages: number of voice mails

Total_day_minutes:

Total_day_calls:

Total_day_charge:

Total_eve_minutes:

Total_eve_calls:

Total_eve_charge:

Total_night_minutes:

Total_night_calls:

Total_night_charge:

Total_intl_minutes:

Total_intl_calls:

Total_intl_charge:

Number_customer_service_calls:

Class: churn=1, did not churn=0

Each row in "churn_training" represents one customer record. The training data contains 2000 rows and the validation data contains 1033 rows.

1) Perform decision tree classification on the training data set. Select all the input variables except state, area_code, and phone_number (since they are not informative for this analysis). Set the "Direction" of class as "out", "type" as "Flag". Then, specify the "minimum records per child branch" as 40, "pruning severity" as 70, click "use global pruning". Hand-in the confusion matrices for validation data.

Running the Decision Tree algorithm in Clementine with the settings above produced the confusion matrix in the figure below.
[Figure omitted]
2)Perform neural network on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.

Running the neural network algorithm in Clementine with the settings above produced the confusion matrix in the figure below.
[Figure omitted]

3)Perform logistic regression on training data set using default settings. Again, select all the input variables except state, area_code, and phone_number. Hand-in the confusion matrix for validation data.

Running the logistic regression algorithm in Clementine with the settings above produced the confusion matrix in the figure below.
[Figure omitted]

4) Hand-in your observations on the model quality for decision tree, neural network and logistic regression using the confusion matrices.
[Figure omitted]

4

Learn the use of market basket analysis for the purpose of making product purchase recommendations to the customers.

The data set contains transactions from a large supermarket. Each transaction is made by someone holding the loyalty card. We limited the total number of categories in this supermarket data to 20 categories for simplicity. The field value for a certain product in the transaction basket is 1 if the customer has bought it and 0 if he/she has not. The file named “Transactions” has data for 46243 transactions.

The data are available from the class web page.

Your submission should consist only of those deliverables marked by "Hand-in".

Market basket analysis has the objective to discover individual products, or groups of products that tend to occur together in transactions. The knowledge obtained from a market basket analysis can be employed by a business to recognize products frequently sold together in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.

Run Apriori on “transaction” data set. Set the “Type” of “COD” as “Typeless”, set the “direction” of all the other 20 categories as “Both”, set their “Type” as “Flag”. Set “Minimum antecedent support” to be 7%, “Minimum confidence” to be 45%, and “Maximum number of antecedents” to be 4 in the modeling node (Apriori node). In general you should explore by trying different values of these parameters to see what type of rules you get.

· Hand-in: The list of association rules generated by the model.
[Figure omitted]

Sort the rules by lift, support, and confidence, respectively to see the rules identified. Hand-in: For each case, choose top 5 rules (note: make sure no redundant rules in the 5 rules) and give 2-3 lines comments. Many of the rules will be logically redundant and therefore will have to be eliminated after you think carefully about them.

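In Clementine, sorting the rules by lift, support, and confidence in turn gives the association rules shown in the figure above. The top 5 rules selected in each case are listed in the table below.
[Table omitted]
1) lift: based on panel (a) of the figure above, we first pick the top 5 rules:

  a) tomato sauce → pasta: people who buy tomato sauce also buy pasta, which is fairly reasonable;
  b) coffee, milk → pasta: people who buy coffee and milk also buy pasta, which is not very reasonable, so this rule is excluded;
  c) biscuits, pasta → milk: people who buy biscuits and pasta also buy milk, which is fairly reasonable;
  d) pasta, water → milk: people who buy pasta and water also buy milk, which is fairly reasonable;
  e) juices → milk: people who buy juices also buy milk, which is fairly reasonable.

Because rule b) is not very reasonable, it is dropped and the next rule in the ranking is added:

  f) yoghurt → milk: people who buy yoghurt also buy milk, which is fairly reasonable.

The resulting top 5 rules are listed in the second column of the table above.

2) support: based on panel (b) of the figure above, we first pick the top 5 rules: pasta → milk, water → milk, biscuits → milk, brioches → milk, and yoghurt → milk. All five are fairly reasonable; for example, shoppers who buy drinks such as water or yoghurt often buy milk as well. These five rules are therefore selected, as listed in the third column of the table above.

3) confidence: based on panel (c) of the figure above, we first pick the top 5 rules: biscuits, pasta → milk; water, pasta → milk; juices → milk; tomato sauce → pasta; and yoghurt → milk. These also agree fairly well with common sense; for example, shoppers who buy tomato sauce are likely to buy pasta. These five rules are therefore selected, as listed in the fourth column of the table above.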
