1. Preparation
(1) First, get the Hadoop cluster, Hive, and Spark installed and configured.
For the Hadoop cluster and Hive configuration, see this author's blog:
大数据_蓝净云的博客-CSDN博客
Or watch the Heima Programmer (黑马程序员) video course:
黑马程序员大数据入门到实战教程,大数据开发必会的Hadoop、Hive,云平台实战项目全套一网打尽_哔哩哔哩_bilibili
My own Hadoop cluster and Hive setup follows this article directly:
黑马程序员hadoop三件套(hdfs,Mapreduce,yarn)的安装配置以及hive的安装配置-CSDN博客
Reference article for the Spark configuration:
spark的安装配置_spark基本配置-CSDN博客
(2) It also helps to have FinalShell installed; the following article walks through the download:
保姆级教程下载finalshell以及连接云服务器基础的使用教程_finalshell下载安装-CSDN博客
2. Configuring spark-sql
(1) On node1, log in as root, go to the conf directory under the Hive installation directory, and edit hive-site.xml:
cd /export/server/apache-hive-3.1.3-bin/conf/
vi hive-site.xml
Add the following:
<property>
<name>hive.spark.client.jar</name>
<value>${SPARK_HOME}/lib/spark-assembly-*.jar</value>
</property>
(2) Copy hive-site.xml to /export/server/spark-3.4.4-bin-hadoop3/conf and distribute it to node2 and node3:
cp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml /export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
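The copy-then-scp pattern used here comes up again in the later steps, so it can be wrapped in a small loop. This is only a sketch: the `distribute` function and the `DRY_RUN` switch are hypothetical names introduced for illustration, and the hostnames and paths are the ones used in this article. By default it only prints the commands it would run.

```shell
# Hypothetical helper (not part of Spark or Hive): copy one file to the same
# directory on every worker node. Hostnames are the ones used in this
# article; adjust them for your own cluster.
WORKERS="node2 node3"

distribute() {
    # $1 = local file, $2 = destination directory on each worker
    for node in $WORKERS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run (the default): only print the command
            echo "scp $1 $node:$2"
        else
            scp "$1" "$node:$2"
        fi
    done
}

# Prints the two scp commands from this step without running them
distribute /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml \
    /export/server/spark-3.4.4-bin-hadoop3/conf/
```

Calling it as `DRY_RUN=0 distribute <file> <dir>` would perform the actual copy.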
(3) Copy the MySQL driver to /export/server/spark-3.4.4-bin-hadoop3/jars/ and distribute it to node2 and node3:
cp /export/server/apache-hive-3.1.3-bin/lib/mysql-connector-java-5.1.34.jar /export/server/spark-3.4.4-bin-hadoop3/jars/
scp /export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar node2:/export/server/spark-3.4.4-bin-hadoop3/jars/
scp /export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar node3:/export/server/spark-3.4.4-bin-hadoop3/jars/
(4) On node1, configure the MySQL driver in /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh, then distribute the file to node2 and node3:
vi /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh
Add the following:
export SPARK_CLASSPATH=/export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar
Distribute:
scp /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
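If this step is repeated, the same export line can end up in spark-env.sh more than once. A minimal sketch of an idempotent append is shown below; `append_once` is a hypothetical helper name, and the demonstration runs against a scratch file rather than the real spark-env.sh.

```shell
# Hypothetical helper: append a line to a file only if that exact line is
# not already present, so repeated runs do not create duplicates.
append_once() {
    # $1 = line to add, $2 = target file
    grep -qxF -- "$1" "$2" 2>/dev/null || printf '%s\n' "$1" >> "$2"
}

# Demonstration on a scratch file instead of the real spark-env.sh
env_file=$(mktemp)
line='export SPARK_CLASSPATH=/export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar'
append_once "$line" "$env_file"
append_once "$line" "$env_file"          # second call changes nothing
grep -c 'SPARK_CLASSPATH' "$env_file"    # prints 1
```

To apply it for real, pass /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh as the second argument.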
(5) On node1, change the log level, then distribute the file to node2 and node3:
cp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties.template /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties
vi /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties
Comment out the following lines:
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
After commenting them out, they look like this:
# rootLogger.level = info
# rootLogger.appenderRef.stdout.ref = console
Then add the following:
rootLogger.level = warn
rootLogger.appenderRef.console.ref = console
Then distribute:
scp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
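The log-level change can also be scripted instead of edited by hand in vi. The sketch below takes a slightly different route from the manual steps: it rewrites the existing `rootLogger.level` line in place rather than commenting it out and adding a new one. `set_root_log_level` is a hypothetical helper name, and the demonstration uses a scratch file containing the two template lines quoted above.

```shell
# Hypothetical helper: set rootLogger.level in a log4j2 properties file by
# rewriting the existing line in place (a .bak backup is kept next to it).
set_root_log_level() {
    # $1 = properties file, $2 = level (for example: warn)
    sed -i.bak "s/^rootLogger\.level *=.*/rootLogger.level = $2/" "$1"
}

# Demonstration on a scratch file with the two lines from the template
props=$(mktemp)
printf '%s\n' 'rootLogger.level = info' \
    'rootLogger.appenderRef.stdout.ref = console' > "$props"
set_root_log_level "$props" warn
cat "$props"
```

The appender reference line is left untouched; only the level changes.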
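3. Trying out spark-sql
(1) First start everything that needs to be running. On node1 (logged in as root at this point), copy the following commands into the terminal and run them: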
su - hadoop
start-dfs.sh
start-yarn.sh
nohup /export/server/hive/bin/hive --service metastore >> /export/server/hive/logs/metastore.log 2>&1 &
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./start-all.sh
jps
spark-sql
The result looks like this:
[root@node1 ~]# su - hadoop
Last login: Wed Dec 4 20:52:45 CST 2024 on pts/0
[hadoop@node1 ~]$ start-dfs.sh
Starting namenodes on [node1]
Starting datanodes
Starting secondary namenodes [node1]
[hadoop@node1 ~]$ start-yarn.sh
Starting resourcemanager
Starting nodemanagers
[hadoop@node1 ~]$ nohup /export/server/hive/bin/hive --service metastore >> /export/server/hive/logs/metastore.log 2>&1 &
[1] 39039
[hadoop@node1 ~]$ cd /export/server/spark-3.4.4-bin-hadoop3/sbin
[hadoop@node1 sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out
node2: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node2.out
node3: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node3.out
[hadoop@node1 sbin]$ jps
39952 Jps
37553 SecondaryNameNode
38978 WebAppProxyServer
36902 NameNode
39127 Master
39143 VersionInfo
38537 NodeManager
37118 DataNode
38335 ResourceManager
[hadoop@node1 sbin]$ spark-sql
24/12/04 22:20:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/12/04 22:20:20 WARN HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
24/12/04 22:20:20 WARN HiveConf: HiveConf of name hive.spark.client.jar does not exist
Spark master: spark://node1:7077, Application Id: app-20241204222029-0000
spark-sql (default)> use demo1;
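Rather than reading the jps listing by eye, the check can be scripted. The following is a sketch under the assumption that the process names match those in the output above; `check_daemons` is a hypothetical helper that reads jps output on standard input, so it is demonstrated here on lines copied from that output.

```shell
# Hypothetical helper: read `jps` output on stdin and report whether every
# process name given as an argument appears in it.
check_daemons() {
    jps_names=$(awk '{ print $2 }')   # second column of jps output = process name
    missing=""
    for name in "$@"; do
        printf '%s\n' "$jps_names" | grep -qx -- "$name" || missing="$missing $name"
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all running"
}

# Sample input copied from the jps output above. On a live node use:
#   jps | check_daemons NameNode DataNode ResourceManager NodeManager Master
printf '%s\n' '36902 NameNode' '37118 DataNode' '38335 ResourceManager' \
    '38537 NodeManager' '39127 Master' |
    check_daemons NameNode DataNode ResourceManager NodeManager Master
```

On node2 and node3 the expected names would differ (for example `Worker` instead of `Master`).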
(2) Try writing some SQL in spark-sql
Pick any database to work in:
use demo1;
-- Create a new table
CREATE TABLE employees (
  id INT,
  name STRING,
  salary DOUBLE
);
-- Insert a single row
INSERT INTO employees VALUES (4, 'Alice', 1300);
-- Insert multiple rows
INSERT INTO employees VALUES
(5, 'Bob', 1400),
(6, 'Charlie', 1100);
-- Query all rows in the table
SELECT * FROM employees;
-- Drop the table
DROP TABLE IF EXISTS employees;
The result looks like this:
spark-sql (default)> use demo1;
Time taken: 9.682 seconds
spark-sql (demo1)> -- Create a new table
spark-sql (demo1)> CREATE TABLE employees (
                 > id INT,
                 > name STRING,
                 > salary DOUBLE
                 > );
24/12/04 22:22:50 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
[TABLE_OR_VIEW_ALREADY_EXISTS] Cannot create table or view `demo1`.`employees` because it already exists.
Choose a different name, drop or replace the existing object, or add the IF NOT EXISTS clause to tolerate pre-existing objects.
spark-sql (demo1)> -- Drop the table
spark-sql (demo1)> DROP TABLE IF EXISTS employees;
Time taken: 4.462 seconds
spark-sql (demo1)> -- Create a new table
spark-sql (demo1)> CREATE TABLE employees (
                 > id INT,
                 > name STRING,
                 > salary DOUBLE
                 > );
24/12/04 22:23:29 WARN ResolveSessionCatalog: A Hive serde table will be created as there is no table provider specified. You can set spark.sql.legacy.createHiveTableByDefault to false so that native data source table will be created instead.
24/12/04 22:23:29 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
Time taken: 1.541 seconds
spark-sql (demo1)> -- Insert a single row
spark-sql (demo1)> INSERT INTO employees VALUES (4, 'Alice', 1300);
Time taken: 15.817 seconds
spark-sql (demo1)> -- Insert multiple rows
spark-sql (demo1)> INSERT INTO employees VALUES
                 > (5, 'Bob', 1400),
                 > (6, 'Charlie', 1100);
Time taken: 10.018 seconds
spark-sql (demo1)> -- Query all rows in the table
spark-sql (demo1)> SELECT * FROM employees;
4 Alice 1300.0
5 Bob 1400.0
6 Charlie 1100.0
Time taken: 5.835 seconds, Fetched 3 row(s)
spark-sql (demo1)> -- Drop the table
spark-sql (demo1)> DROP TABLE IF EXISTS employees;
Time taken: 0.784 seconds
spark-sql (demo1)>
At this point, the setup is essentially working!
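The same statements can also be run non-interactively from a script file with `spark-sql -f`. The sketch below drops the table before creating it, as was done by hand in the session above after the TABLE_OR_VIEW_ALREADY_EXISTS error; note that this discards any existing `employees` table. `RUN_SQL` is a hypothetical switch introduced for this sketch so that nothing touches the cluster unless requested.

```shell
# Write the statements from this section to a script file. The table is
# dropped first so a leftover table from an earlier run does not cause
# TABLE_OR_VIEW_ALREADY_EXISTS (this deletes any existing employees table).
sql_file=$(mktemp)
cat > "$sql_file" <<'EOF'
USE demo1;
DROP TABLE IF EXISTS employees;
CREATE TABLE employees (id INT, name STRING, salary DOUBLE);
INSERT INTO employees VALUES (4, 'Alice', 1300), (5, 'Bob', 1400), (6, 'Charlie', 1100);
SELECT * FROM employees;
DROP TABLE IF EXISTS employees;
EOF

# Only run against the cluster when explicitly requested; otherwise just
# print the command that would be used.
if [ "${RUN_SQL:-0}" = "1" ]; then
    spark-sql -f "$sql_file"
else
    echo "spark-sql -f $sql_file"
fi
```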
(3) Commands to shut everything down
First press Ctrl+C to exit spark-sql:
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./stop-all.sh
cd
stop-yarn.sh
stop-dfs.sh
jps
Then use kill -9 to stop the RunJar process (the Hive Metastore started earlier).
The result looks like this:
Time taken: 0.784 seconds
spark-sql (demo1)> [hadoop@node1 sbin]$
[hadoop@node1 sbin]$ cd /export/server/spark-3.4.4-bin-hadoop3/sbin
[hadoop@node1 sbin]$ ./stop-all.sh
node2: stopping org.apache.spark.deploy.worker.Worker
node3: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
[hadoop@node1 sbin]$ cd
[hadoop@node1 ~]$ stop-yarn.sh
Stopping nodemanagers
Stopping resourcemanager
Stopping proxy server [node1]
[hadoop@node1 ~]$ stop-dfs.sh
Stopping namenodes on [node1]
Stopping datanodes
Stopping secondary namenodes [node1]
[hadoop@node1 ~]$ jps
64996 Jps
39039 RunJar
[hadoop@node1 ~]$ kill -9 39039
[hadoop@node1 ~]$ jps
66159 Jps
[1]+ Killed nohup /export/server/hive/bin/hive --service metastore >> /export/server/hive/logs/metastore.log 2>&1
[hadoop@node1 ~]$ jps
66219 Jps
[hadoop@node1 ~]$
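Looking up the RunJar PID by hand can be replaced with a small filter over the jps output. This is a sketch that assumes, as in the session above, that the only RunJar process on the node is the Hive Metastore; `runjar_pids` is a hypothetical helper name.

```shell
# Hypothetical helper: print the PIDs of RunJar processes found in `jps`
# output read from stdin. In this article the only RunJar is the Hive
# Metastore; any other job launched through "hadoop jar" would also match.
runjar_pids() {
    awk '$2 == "RunJar" { print $1 }'
}

# Sample input copied from the jps output above
printf '%s\n' '64996 Jps' '39039 RunJar' | runjar_pids   # prints 39039

# On a live node, one option is to try a plain kill (SIGTERM) first and keep
# kill -9 as a fallback:
#   for pid in $(jps | runjar_pids); do kill "$pid"; done
```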
4. Reference articles
(1)黑马大数据学习笔记4-Hive部署和基本操作_黑马大数据 hive笔记-CSDN博客
(2)spark的安装配置_spark基本配置-CSDN博客
(3)Scala配置教程_统信 scala-CSDN博客
(4)大数据_蓝净云的博客-CSDN博客
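5. Supplementary notes
(1) Hive 3.x generally handles the integration with Spark on its own
Question: after opening the script with vi /export/server/apache-hive-3.1.3-bin/bin/hive, why doesn't my hive file contain the line sparkAssemblyPath=`ls ${SPARK_HOME}/lib/spark-assembly-*.jar`?
In Hive 3.x, the `sparkAssemblyPath` line is not a standard part of the script, so it is perfectly normal not to find it in the `hive` launch script.
### Why is this line missing?
1. **Hive version changes**: Between Hive 2.x and Hive 3.x, the way Hive integrates with Spark changed. In particular, with Spark 2.x and later the integration no longer relies on the traditional `spark-assembly` JAR. `spark-assembly` was a key file in Spark 1.x; from Spark 2.x onward Spark was restructured and the single assembly was split into multiple JAR files, so it no longer needs to be loaded through `sparkAssemblyPath` in the Hive launch script.
2. **How Hive integrates with Spark**: In Hive 3.x, Spark is integrated through `hive-site.xml` and other configuration files, and the Spark JAR path no longer has to be given explicitly in the Hive launch script. Spark loads the JARs it needs automatically based on environment variables or configuration files.
### Solution
If you want to integrate Spark with Hive and need the related Spark configuration, you can set it up as follows:
1. **Make sure the Hive configuration files are correct**:
- Set the Spark-related parameters in `hive-site.xml`, especially the Spark SQL engine and Metastore settings.
2. **Set the `SPARK_HOME` environment variable**:
- Make sure the Hive environment can find the right Spark installation by pointing `SPARK_HOME` at the Spark installation path.
3. **Add the Spark JARs to `SPARK_CLASSPATH`**:
- If you really do need to specify Spark's JAR files by hand (for example, because the Spark configuration is not being picked up automatically), you can set `SPARK_CLASSPATH` in `spark-env.sh` to the path of the Spark JAR files: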
```bash
export SPARK_CLASSPATH=/path/to/spark/jars/*.jar
```
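4. **Start the services that combine Hive and Spark**:
- Run Spark SQL queries through `spark-submit` or `spark-sql`, and make sure the Hive Metastore is already running.
### How to check that the Hive and Spark integration works
1. **Check that the Hive Metastore runs correctly**:
The command below starts the Hive Metastore service in the foreground; if it comes up without errors, the Metastore is working: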
```bash
hive --service metastore
```
2. **Test the connection between Hive and Spark**:
Launch `spark-sql` with the command below and check that Spark can read the Hive databases:
```bash
spark-sql --conf spark.sql.warehouse.dir=/path/to/hive/warehouse
```
3. **Check Spark's JAR file path**:
Make sure all the JAR files Spark needs are in Spark's `jars` directory. The following command lists every JAR file shipped with Spark:
```bash
ls $SPARK_HOME/jars
```
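### Summary
In Hive 3.x, the `sparkAssemblyPath` setting is no longer used; the integration between Spark and Hive is managed through configuration files and environment variables instead. If you need to configure the Spark JAR path by hand, set the `SPARK_CLASSPATH` environment variable, for example in `spark-env.sh`.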
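To see which JAR layout a given Spark installation uses, a quick check along these lines may help. It is a sketch only: `spark_jar_layout` is a hypothetical helper, and the demonstration runs on a scratch directory that mimics the `jars` layout instead of a real installation.

```shell
# Hypothetical helper: report which JAR layout a Spark installation uses.
spark_jar_layout() {
    # $1 = Spark installation directory
    if ls "$1"/lib/spark-assembly-*.jar >/dev/null 2>&1; then
        echo "assembly jar (Spark 1.x layout)"
    elif [ -d "$1/jars" ]; then
        echo "jars directory (Spark 2.x+ layout)"
    else
        echo "unknown layout"
    fi
}

# Demonstration on a scratch directory that mimics the Spark 3 layout;
# on a real node pass "$SPARK_HOME" instead.
fake_home=$(mktemp -d)
mkdir "$fake_home/jars"
spark_jar_layout "$fake_home"
```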
(2) Output I ran into earlier and took for an error
[hadoop@node1 sbin]$ ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out
node3: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node3.out
node2: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node2.out
[hadoop@node1 sbin]$
You have successfully run the `./start-all.sh` script to start the Spark cluster. According to the output:
```
starting org.apache.spark.deploy.master.Master, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out
node3: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node3.out
node2: starting org.apache.spark.deploy.worker.Worker, logging to /export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node2.out
```
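This means:
- The **Master** has started, and the log file `spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out` is now recording the Master process's log.
- The **Workers** have also started on `node2` and `node3`, and the corresponding log files have been created.
### Next, you can verify that the cluster started correctly with these steps:
#### 1. **Check the Master status**
You can check whether the Master started correctly through the Spark Master web UI. By default, the Master web UI listens on port `8080` (port `7077` is the port that Workers and applications use to connect to the Master). Open the following URL: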
```
http://node1:8080
```
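This is the Spark Master web UI, where you should see the cluster status, the Worker nodes, and other information. If the page loads and shows the Workers (such as `node2` and `node3`), the Master and Worker nodes are working.
#### 2. **Check the Worker status**
You can also open each Worker's web UI to check its status. By default, the Spark Worker web UI listens on port `8081`. Open the following URLs: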
```
http://node2:8081
http://node3:8081
```
If the Workers also look healthy and show no error messages, they have joined the cluster successfully.
#### 3. **Check the log files**
Check the log files to make sure they contain no errors:
- **Master log**:
`/export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node1.out`
- **Worker logs**:
`/export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node2.out`
`/export/server/spark-3.4.4-bin-hadoop3/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node3.out`
If the logs do contain errors, diagnose the problem further based on what they report.
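The log check in step 3 could be scripted roughly as follows. This is a sketch: `scan_spark_logs` is a hypothetical helper, the search pattern (`ERROR` or `Exception`) is an assumption about what is worth flagging, and the demonstration file contains placeholder text rather than real Spark output.

```shell
# Hypothetical helper: print lines containing ERROR or Exception from the
# *.out files in a Spark logs directory; returns 1 if any are found.
scan_spark_logs() {
    # $1 = logs directory
    if grep -H -E 'ERROR|Exception' "$1"/*.out 2>/dev/null; then
        return 1
    fi
    echo "no errors found in $1"
}

# Demonstration on a scratch directory with placeholder content (not real
# Spark output); on a real node pass the Spark logs directory, e.g.
# /export/server/spark-3.4.4-bin-hadoop3/logs
log_dir=$(mktemp -d)
echo 'INFO placeholder line for the demo' > "$log_dir/master.out"
scan_spark_logs "$log_dir"
```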
#### 4. **Submit a Spark job**
You can also submit a simple Spark job to test whether the cluster works. For example, the following command submits one of the bundled example programs:
```bash
spark-submit --class org.apache.spark.examples.SparkPi --master spark://node1:7077 $SPARK_HOME/examples/jars/spark-examples_2.12-3.4.4.jar 10
```
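This submits an example job that estimates the value of Pi. If the cluster is healthy, the job should run and print its result.
### Summary:
- Check the Spark cluster status through the web UIs (`http://node1:8080`, `http://node2:8081`, and `http://node3:8081`).
- Check the log files to make sure they contain no errors.
- Submit a simple job to test that the cluster is usable.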
(3) Plain-text configuration reference
1. Go to the conf directory under the Hive installation directory and edit hive-site.xml
cd /export/server/apache-hive-3.1.3-bin/conf/
vi hive-site.xml
Add the following:
<property>
<name>hive.spark.client.jar</name>
<value>${SPARK_HOME}/lib/spark-assembly-*.jar</value>
</property>
2. Copy hive-site.xml to /export/server/spark-3.4.4-bin-hadoop3/conf
cp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml /export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/apache-hive-3.1.3-bin/conf/hive-site.xml node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
3. Copy the MySQL driver to /export/server/spark-3.4.4-bin-hadoop3/jars
cd /export/server/apache-hive-3.1.3-bin/lib/
cp /export/server/apache-hive-3.1.3-bin/lib/mysql-connector-java-5.1.34.jar /export/server/spark-3.4.4-bin-hadoop3/jars/
scp /export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar node2:/export/server/spark-3.4.4-bin-hadoop3/jars/
scp /export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar node3:/export/server/spark-3.4.4-bin-hadoop3/jars/
4. Configure the MySQL driver in /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh on all nodes
vi /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh
Add the following:
export SPARK_CLASSPATH=/export/server/spark-3.4.4-bin-hadoop3/jars/mysql-connector-java-5.1.34.jar
Distribute:
scp /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/spark-3.4.4-bin-hadoop3/conf/spark-env.sh node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
5. Change the log level on each node:
cp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties.template /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties
vi /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties
Find the rootLogger settings in the file. The original lines are:
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = console
Comment out those original lines, then add the following:
rootLogger.level = warn
rootLogger.appenderRef.console.ref = console
Then distribute:
scp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties node2:/export/server/spark-3.4.4-bin-hadoop3/conf/
scp /export/server/spark-3.4.4-bin-hadoop3/conf/log4j2.properties node3:/export/server/spark-3.4.4-bin-hadoop3/conf/
6. Start everything that needs to be running, then open spark-sql
su - hadoop
start-dfs.sh
start-yarn.sh
nohup /export/server/hive/bin/hive --service metastore >> /export/server/hive/logs/metastore.log 2>&1 &
cd /export/server/spark-3.4.4-bin-hadoop3/sbin
./start-all.sh
jps
spark-sql
7. Try writing some SQL in spark-sql
use demo1;
-- Create a new table
CREATE TABLE employees (
id INT,
name STRING,
salary DOUBLE
);
-- Insert a single row
INSERT INTO employees VALUES (4, 'Alice', 1300);
-- Insert multiple rows
INSERT INTO employees VALUES
(5, 'Bob', 1400),
(6, 'Charlie', 1100);
-- Query all rows in the table
SELECT * FROM employees;
-- Drop the table
DROP TABLE IF EXISTS employees;