2. Spark on Hive Setup
2025-08-03 · Updated 2026-03-03
Tags: spark basics

Table of Contents

Spark on Hive Setup
Principle
Setup
Start/Stop Script
Verification
Notes

Spark on Hive Setup

Prerequisites (a quick sanity check follows the list):

  1. A running Hadoop cluster
  2. Hive already set up, with MySQL as the metastore backend
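
A hedged way to verify both before continuing; the MySQL database name `metastore` is an assumption here and should match whatever Hive was configured with:

```bash
# Check the core Hadoop daemons are up (names are the Hadoop defaults;
# some may live on other nodes in a multi-node cluster)
jps | grep -E 'NameNode|DataNode|ResourceManager|NodeManager'

# Check the Hive metastore tables exist in MySQL
# ("metastore" is an assumed database name; use the one from hive-site.xml)
mysql -uroot -p -e 'SHOW TABLES;' metastore
```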

Principle

Hive: provides only the metadata (the metastore).

Spark: handles SQL parsing, optimization, task splitting, and execution.
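
Once the setup below is complete, this split is easy to observe. A hedged sketch: `spark-sql` fetches the table definition from the Hive metastore, while the plan it prints is produced and executed entirely by Spark (`student` is a hypothetical table name used only for illustration):

```bash
# Databases and tables come from the Hive metastore; the EXPLAIN output is
# a Spark physical plan, not a Hive/MapReduce one ("student" is hypothetical).
/opt/module/spark-yarn/bin/spark-sql -e "SHOW DATABASES; EXPLAIN SELECT count(*) FROM student;"
```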

Setup

```bash
## Download the Spark package and extract it
[atguigu@hadoop102 ~]$ wget -P /opt/software https://archive.apache.org/dist/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
[atguigu@hadoop102 ~]$ tar -zxvf /opt/software/spark-3.3.1-bin-hadoop3.tgz -C /opt/module/
[atguigu@hadoop102 ~]$ mv /opt/module/spark-3.3.1-bin-hadoop3 /opt/module/spark-yarn

## Set the SPARK_HOME environment variable
[atguigu@hadoop102 ~]$ export SPARK_HOME=/opt/module/spark-yarn

## Copy the full hive-site.xml into Spark's conf directory
[atguigu@hadoop102 spark-yarn]$ cp /opt/module/hive/conf/hive-site.xml /opt/module/spark-yarn/conf/
[atguigu@hadoop102 spark-yarn]$ ll /opt/module/spark-yarn/conf/
total 40
-rw-r--r--. 1 atguigu atguigu 1105 Oct 15  2022 fairscheduler.xml.template
-rw-rw-r--. 1 atguigu atguigu 2756 Aug  3 14:15 hive-site.xml
-rw-r--r--. 1 atguigu atguigu 3350 Oct 15  2022 log4j2.properties.template
-rw-r--r--. 1 atguigu atguigu 9141 Oct 15  2022 metrics.properties.template
-rw-r--r--. 1 atguigu atguigu 1292 Oct 15  2022 spark-defaults.conf
-rwxr-xr-x. 1 atguigu atguigu 4559 Aug  3 14:14 spark-env.sh
-rw-r--r--. 1 atguigu atguigu  865 Oct 15  2022 workers.template

## Copy the MySQL driver from Hive's lib directory into Spark's jars directory
[atguigu@hadoop102 spark-yarn]$ cp /opt/module/hive/lib/mysql-connector-java-5.1.37.jar /opt/module/spark-yarn/jars/
[atguigu@hadoop102 spark-yarn]$ ll /opt/module/spark-yarn/jars/ | grep mysql
-rw-rw-r--. 1 atguigu atguigu 985600 Aug  3 15:24 mysql-connector-java-5.1.37.jar

## Create the Spark event-log directory on HDFS
[atguigu@hadoop102 spark-yarn]$ hadoop fs -mkdir -p /tmp/spark

## Edit the Spark defaults: schedule resources via YARN, enable event logging,
## set the serializer and the driver memory
[atguigu@hadoop102 spark-yarn]$ vim conf/spark-defaults.conf
# Run on YARN
spark.master                     yarn
spark.eventLog.enabled           true
# The address and port must match fs.defaultFS of the cluster
spark.eventLog.dir               hdfs://hadoop102:8082/tmp/spark
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              512m
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

## Start the Hive metastore service
[atguigu@hadoop102 spark-yarn]$ nohup hive --service metastore &
[atguigu@hadoop102 spark-yarn]$ jps
2498 NameNode
31922 Jps
2708 DataNode
3064 NodeManager
31790 RunJar

## JDBC info for connecting an external tool such as IDEA to the Spark Thrift
## Server: jdbc:hive2://hadoop102:10000
## Note:
## 1. The Spark Thrift Server is an improved HiveServer2 and is fully
##    compatible with HiveServer2's interface and protocol.
## 2. Even when no authentication is configured (no username/password set up),
##    the connection username must still be a Hadoop user with access to the
##    Spark log directory on HDFS; the password can be left empty.

<!-- Host for HiveServer2 connections -->
<property>
    <name>hive.server2.thrift.bind.host</name>
    <value>hadoop102</value>
</property>
<!-- Port for HiveServer2 connections -->
<property>
    <name>hive.server2.thrift.port</name>
    <value>10000</value>
</property>

## Start the Spark Thrift Server
[atguigu@hadoop102 spark-yarn]$ sbin/start-thriftserver.sh
starting org.apache.spark.sql.hive.thriftserver.HiveThriftServer2, logging to /opt/module/spark-yarn/logs/spark-atguigu-org.apache.spark.sql.hive.thriftserver.HiveThriftServer2-1-hadoop102.out

## Verify startup via the listening port and the log file
[atguigu@hadoop102 spark-yarn]$ ss -tanlp | grep 10000
LISTEN 0 50 [::ffff:192.168.61.102]:10000 [::]:* users:(("java",pid=32335,fd=348))
```
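
Besides IDEA, the `beeline` client bundled with Spark can exercise the same JDBC endpoint; a minimal sketch using the connection details above:

```bash
# Username must map to a Hadoop user that can access the Spark log directory
# on HDFS; the password may be left empty when no authentication is configured.
/opt/module/spark-yarn/bin/beeline -u jdbc:hive2://hadoop102:10000 -n atguigu
```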

Log output when the Spark Thrift Server starts successfully:

[screenshot: Spark Thrift Server startup log]

Start/Stop Script

To make the Spark-on-Hive services easy to manage, once Hadoop is running you can start and stop everything with the following script.

spark_thrift_services.sh

```bash
#!/usr/bin/env bash
#===============================================================================================
# Script name: spark_thrift_services.sh
# Purpose:     Start or stop the Hive Metastore and Spark Thrift Server services
# Usage:
#   chmod +x spark_thrift_services.sh
#   ./spark_thrift_services.sh start   # start the services
#   ./spark_thrift_services.sh stop    # stop the services
#===============================================================================================

#--- Environment ---
export SPARK_HOME=/opt/module/spark-yarn
export HIVE_HOME=/opt/module/hive
LOG_DIR="$SPARK_HOME/logs"
METASTORE_PORT=9083
THRIFT_SERVER_PORT=10000

#--- Colors ---
COLOR_GREEN='\033[0;32m'
COLOR_RED='\033[0;31m'
COLOR_YELLOW='\033[0;33m'
COLOR_NC='\033[0m'

#--- Helpers ---
log_info()  { echo -e "${COLOR_GREEN}[INFO] $1${COLOR_NC}"; }
log_warn()  { echo -e "${COLOR_YELLOW}[WARN] $1${COLOR_NC}"; }
log_error() { echo -e "${COLOR_RED}[ERROR] $1${COLOR_NC}"; }

# Check whether a port is listening
is_port_in_use() {
    ss -tanlp | grep -q ":${1}\b"
}

# Poll until a port opens
wait_for_port() {
    local port=$1
    local timeout=$2
    echo -n "Waiting for port $port to open... "
    for ((i=0; i<timeout; i+=2)); do
        if is_port_in_use "$port"; then
            echo -e "${COLOR_GREEN}done!${COLOR_NC}"
            return 0
        fi
        echo -n "."
        sleep 2
    done
    echo -e "${COLOR_RED}failed! (timeout: ${timeout}s)${COLOR_NC}"
    return 1
}

# Poll until a port closes
wait_for_port_to_close() {
    local port=$1
    local timeout=$2
    echo -n "Waiting for port $port to close... "
    for ((i=0; i<timeout; i+=2)); do
        if ! is_port_in_use "$port"; then
            echo -e "${COLOR_GREEN}done!${COLOR_NC}"
            return 0
        fi
        echo -n "."
        sleep 2
    done
    echo -e "${COLOR_RED}failed! (timeout: ${timeout}s)${COLOR_NC}"
    return 1
}

# Get the Metastore PID
get_metastore_pid() {
    # pgrep -f is more reliable than jps | grep: it matches the full command line
    pgrep -f "org.apache.hadoop.hive.metastore.HiveMetaStore"
}

#--- Start logic ---
start_services() {
    log_info "========== Starting services =========="

    # 1. Start the Hive Metastore
    log_info "--- Step 1: start Hive Metastore ---"
    if is_port_in_use $METASTORE_PORT; then
        log_warn "Hive Metastore (port $METASTORE_PORT) is already running."
    else
        log_info "Starting Hive Metastore..."
        METASTORE_LOG_FILE="$LOG_DIR/hive-metastore-$(date +%Y%m%d).log"
        nohup $HIVE_HOME/bin/hive --service metastore > "$METASTORE_LOG_FILE" 2>&1 &
        if wait_for_port $METASTORE_PORT 30; then
            log_info "Hive Metastore started. PID: $(get_metastore_pid)"
        else
            log_error "Hive Metastore failed to start! Check the log: $METASTORE_LOG_FILE"
            exit 1
        fi
    fi

    # 2. Start the Spark Thrift Server
    log_info "--- Step 2: start Spark Thrift Server ---"
    if is_port_in_use $THRIFT_SERVER_PORT; then
        log_warn "Spark Thrift Server (port $THRIFT_SERVER_PORT) is already running."
    else
        log_info "Starting Spark Thrift Server..."
        $SPARK_HOME/sbin/start-thriftserver.sh
        if wait_for_port $THRIFT_SERVER_PORT 60; then
            log_info "Spark Thrift Server started."
        else
            log_error "Spark Thrift Server failed to start! Check the Spark logs in: $LOG_DIR"
            exit 1
        fi
    fi

    log_info "========== All services started =========="
}

#--- Stop logic ---
stop_services() {
    log_info "========== Stopping services =========="

    # 1. Stop the Spark Thrift Server
    log_info "--- Step 1: stop Spark Thrift Server ---"
    if is_port_in_use $THRIFT_SERVER_PORT; then
        log_info "Stopping Spark Thrift Server..."
        $SPARK_HOME/sbin/stop-thriftserver.sh
        if wait_for_port_to_close $THRIFT_SERVER_PORT 30; then
            log_info "Spark Thrift Server stopped."
        else
            log_warn "Could not confirm Spark Thrift Server shutdown; please check manually."
        fi
    else
        log_warn "Spark Thrift Server (port $THRIFT_SERVER_PORT) is not running."
    fi

    # 2. Stop the Hive Metastore
    log_info "--- Step 2: stop Hive Metastore ---"
    METASTORE_PID=$(get_metastore_pid)
    if [ -n "$METASTORE_PID" ]; then
        log_info "Found Hive Metastore process (PID: $METASTORE_PID), sending SIGTERM for a graceful shutdown..."
        kill "$METASTORE_PID"
        if wait_for_port_to_close $METASTORE_PORT 30; then
            log_info "Hive Metastore stopped."
        else
            log_warn "Hive Metastore did not shut down gracefully; sending SIGKILL..."
            kill -9 "$METASTORE_PID"
            sleep 2
            if get_metastore_pid > /dev/null; then
                log_error "Force kill failed! Handle PID $METASTORE_PID manually."
            else
                log_info "Hive Metastore was force-stopped."
            fi
        fi
    else
        log_warn "Hive Metastore is not running."
    fi

    log_info "========== All services stopped =========="
}

#--- Main ---
case "$1" in
    start)
        start_services
        ;;
    stop)
        stop_services
        ;;
    *)
        echo "Usage: $0 {start|stop}"
        exit 1
        ;;
esac

exit 0
```

Verification

To sum up: the Spark Thrift Server is deployed, Spark jobs rely on YARN for resource scheduling, and the deploy mode is client. The environment details can be inspected in the Spark web UI, which defaults to http://<thrift node>:4040

[screenshot: Spark web UI]
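
The same checks can be done from a terminal (hostname and ports taken from the setup above):

```bash
# The JDBC endpoint of the Thrift Server is listening
ss -tanlp | grep 10000
# The driver's web UI answers on the default port 4040
curl -sI http://hadoop102:4040 | head -n 1
```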

After submitting a query from IDEA, check the application UI provided by Hadoop's ResourceManager, which defaults to http://<ResourceManager node>:8088

[screenshots: YARN application UI]

Notes

  1. Without read/write permission on HDFS, any SQL that needs to write data to HDFS fails (a fix is sketched after the screenshots below). For example:
```sql
create table if not exists senior_candidates as
WITH
-- 1. Mock the Candidates table
Candidates AS (
    SELECT 1 AS employee_id, 'Junior' AS experience, 10000 AS salary
    UNION ALL SELECT 9,  'Junior', 10000
    UNION ALL SELECT 2,  'Senior', 80000
    UNION ALL SELECT 11, 'Senior', 80000
    UNION ALL SELECT 13, 'Senior', 80000
    UNION ALL SELECT 4,  'Junior', 40000
)
select * from Candidates;

select * from senior_candidates;
```

[screenshots: SQL execution results]
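
One hedged way to clear the permission failure, assuming the Hive default warehouse path /user/hive/warehouse (use the value of hive.metastore.warehouse.dir if it was customized):

```bash
# Give the connecting user ownership of the warehouse directory so the
# CREATE TABLE ... AS above can write its data files
hadoop fs -chown -R atguigu:atguigu /user/hive/warehouse
# or, more coarsely, relax the permissions:
# hadoop fs -chmod -R 777 /user/hive/warehouse
```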

  2. Spark on Hive uses the client deploy mode

The Spark Thrift Server is a long-running service that must expose a stable, fixed network endpoint, so the node that launches the Driver has to be fixed; that is exactly what client mode provides (see the sketch below).
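
`start-thriftserver.sh` forwards `spark-submit` flags, so the mode can be spelled out explicitly; a sketch with values matching the spark-defaults.conf above:

```bash
# Client mode keeps the Driver (and thus the fixed JDBC endpoint
# jdbc:hive2://hadoop102:10000) on the node where this command runs.
$SPARK_HOME/sbin/start-thriftserver.sh \
  --master yarn \
  --deploy-mode client \
  --driver-memory 512m
```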

  3. Prerequisites for starting the Spark Thrift Server:

    • The Hadoop cluster is running normally
    • The Hive metastore service is running normally
  4. The Spark Thrift Server is bundled by default only in the packages built with Hadoop dependencies (spark-x.x.x-bin-hadoopx.tgz)

    [screenshot: Spark download packages] Do not choose a package without Hadoop dependencies; otherwise starting the Thrift service fails with the error below:

```bash
Error: Failed to load class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
Failed to load main class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.
You need to build Spark with -Phive and -Phive-thriftserver.
2025-08-03 16:15:29,167 INFO util.ShutdownHookManager: Shutdown hook called
```
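
Before deploying a downloaded package, one hedged way to confirm it actually bundles the Thrift Server is to look for the corresponding jar:

```bash
# Present in spark-3.3.1-bin-hadoop3; if grep finds nothing, expect the
# class-loading error shown above.
ls /opt/module/spark-yarn/jars/ | grep -i thriftserver
```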

Author: hedeoer

Unless otherwise stated, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!