Data Engineer (Practice Notes)

Hadoop Cluster Operation (Hive Container) Practice [hive]

by 세용용용용 2024. 6. 13.

1. Hive (Container)

Hive is open-source data warehouse software that provides SQL-like access to data stored in Hadoop.
It is mainly used for handling large volumes of data and was developed as part of the Hadoop ecosystem.

 

1-1. How Hive Works

Hive runs on top of Hadoop's MapReduce framework.
When a query is executed, Hive translates the HiveQL into MapReduce jobs and runs them as distributed work on the cluster.
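
For example, with hive.execution.engine set to mr (as in the hive-site.xml further below), an aggregate query like the one in this minimal sketch is compiled into a MapReduce job. The table and column names here are illustrative only, not from this cluster.

# Illustrative only: assumes a table named logs with a dt column already exists.
# With hive.execution.engine=mr, this GROUP BY is compiled into a MapReduce job
# that shows up as a YARN application while it runs.
hive -e "SELECT dt, COUNT(*) AS cnt FROM logs GROUP BY dt;"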

 

1-2. hdfs mkdir (Create the Hive Warehouse Directory)

hdfs dfs -mkdir -p /user/hive/warehouse
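
Hive also needs a writable scratch directory on HDFS. The following sketch follows the standard Hive setup; the /tmp path and the group-write permissions are the usual defaults, not something configured elsewhere in this post:

hdfs dfs -mkdir -p /tmp
hdfs dfs -chmod g+w /tmp
hdfs dfs -chmod g+w /user/hive/warehouse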

 

2. Dockerfile [Additions to the Hadoop Dockerfile]

# Base image
FROM openjdk:8

# Install bash and required packages
RUN apt-get update && apt-get install -y bash wget tar

# Environment variables
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
    HADOOP_HOME=/data/sy0218/hadoop-3.3.5 \
    HADOOP_COMMON_HOME=/data/sy0218/hadoop-3.3.5 \
    HADOOP_MAPRED_HOME=/data/sy0218/hadoop-3.3.5 \
    HADOOP_HDFS_HOME=/data/sy0218/hadoop-3.3.5 \
    HADOOP_YARN_HOME=/data/sy0218/hadoop-3.3.5 \
    HADOOP_CONF_DIR=/data/sy0218/hadoop-3.3.5/etc/hadoop \
    HADOOP_LOG_DIR=/logs/hadoop \
    HADOOP_PID_DIR=/var/run/hadoop/hdfs \
    HADOOP_COMMON_LIB_NATIVE_DIR=/data/sy0218/hadoop-3.3.5/lib/native \
    HADOOP_OPTS="-Djava.library.path=/data/sy0218/hadoop-3.3.5/lib/native" \
    HIVE_HOME=/data/sy0218/apache-hive-3.1.3-bin \
    HIVE_AUX_JARS_PATH=/data/sy0218/apache-hive-3.1.3-bin/aux

ENV PATH=$JAVA_HOME/bin:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$HIVE_HOME/bin:$HIVE_AUX_JARS_PATH/bin:$PATH

# Create /usr/lib/jvm and copy the JDK there
RUN mkdir -p /usr/lib/jvm && cp -r /usr/local/openjdk-8 /usr/lib/jvm/java-8-openjdk-amd64

# Create the Hadoop install, download, and data directories
RUN mkdir -p /data/sy0218 /data/download_tar \
    /hadoop/data1 /hadoop/data2 /hadoop/hdfs /hadoop/hdfs_work /hadoop/jn /hadoop/data

# Copy the Hadoop and Hive tarballs into the download_tar directory
COPY hadoop-3.3.5.tar.gz /data/download_tar/hadoop-3.3.5.tar.gz
COPY apache-hive-3.1.3-bin.tar.gz /data/download_tar/apache-hive-3.1.3-bin.tar.gz

# Extract the Hadoop tarball to the install path
RUN tar xzvf /data/download_tar/hadoop-3.3.5.tar.gz -C /data/sy0218/

# Hadoop configuration files
COPY core-site.xml /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY hdfs-site.xml /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY mapred-site.xml /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY yarn-site.xml /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY workers /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY hadoop-env.sh /data/sy0218/hadoop-3.3.5/etc/hadoop/
COPY hadoop-config.sh /data/sy0218/hadoop-3.3.5/libexec/

# Extract the Hive tarball to the install path
RUN tar xzvf /data/download_tar/apache-hive-3.1.3-bin.tar.gz -C /data/sy0218/

# Copy the PostgreSQL JDBC driver and the Hive configuration file
COPY postgresql-42.2.11.jar /data/sy0218/apache-hive-3.1.3-bin/lib/
COPY hive-site.xml /data/sy0218/apache-hive-3.1.3-bin/conf/

# Default command: keep the container running
CMD tail -f /dev/null
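
A minimal build command, assuming the tarballs and config files listed in section 2-2 sit next to this Dockerfile in /data/hadoop_docker; the image name and tag here are placeholders, and the script in section 2-3 automates the full rebuild:

docker build -t hadoop_hive:1.0 /data/hadoop_docker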

 

2-1. hive-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:postgresql://192.168.56.10:5432/sy0218</value>
        <description>metadata is stored in a PostgreSQL server</description>
    </property>
    <property>
        <name>hive.metastore.db.type</name>
        <value>postgres</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.postgresql.Driver</value>
        <description>PostgreSQL JDBC driver class</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
        <description>Username to use against metastore database</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>!hive0218</value>
        <description>password to use against metastore database</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>/user/hive/warehouse</value>
        <description>location of the warehouse directory on HDFS</description>
    </property>
    <property>
        <name>hive.execution.engine</name>
        <value>mr</value>
    </property>
    <property>
        <name>hive.server2.transport.mode</name>
        <value>binary</value>
    </property>
    <property>
        <name>hive.server2.webui.port</name>
        <value>0</value>
    </property>
    <property>
        <name>hive.server2.enable.doAs</name>
        <value>false</value>
    </property>
    <property>
        <name>hive.vectorized.execution.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.input.fileinputformat.split.maxsize</name>
        <value>64000000</value>
    </property>
    <property>
        <name>hive.exec.max.dynamic.partitions</name>
        <value>3000</value>
    </property>
    <property>
        <name>hive.exec.max.dynamic.partitions.pernode</name>
        <value>100</value>
    </property>
</configuration>
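
The connection properties above assume a PostgreSQL instance at 192.168.56.10:5432 with a database named sy0218 that the hive user can log in to. A hedged sketch of the matching database-side setup, run as a PostgreSQL superuser (this step is not shown in the original post):

# Hypothetical setup for the metastore database assumed by hive-site.xml
psql -h 192.168.56.10 -U postgres <<'SQL'
CREATE USER hive WITH PASSWORD '!hive0218';
GRANT ALL PRIVILEGES ON DATABASE sy0218 TO hive;
SQL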

 
 

2-2. Other Required Files and Tarballs

/data/hadoop_docker  >> hadoop_hive Docker build directory
apache-hive-3.1.3-bin.tar.gz
core-site.xml
dockerfile
fair-scheduler.xml
hadoop-3.3.5.tar.gz
hadoop-config.sh
hadoop-env.sh
hdfs-site.xml
hive-site.xml
mapred-site.xml
masters
postgresql-42.2.11.jar
workers
yarn-site.xml

 
 
 

2-3. Container Rebuild/Restart Script (required arguments: image name, image tag, Docker directory)

- Before restarting, you must stop the Hadoop cluster with /data/sy0218/hadoop-3.3.5/sbin/stop-all.sh!!

 

/data/work/hadoop_auto_fun.sh

#!/usr/bin/bash

# Check the argument count
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 [hadoop_hive_image_name] [image_tag] [docker_image_dir]"
    exit 1
fi


check_sc_home="/data/check"
hadoop_hive_image_name=$1
docker_image_tag=$2
docker_image_dir=$3

echo "[`date`] Time_Stamp : hadoop auto run  Start...."; echo "";
#####################################################################################################
echo "[`date`] Time_Stamp : hadoop 필요 디렉토리 생성  Start...."
${check_sc_home}/all_command.sh "rm -rf /hadoop"

for dir_name in /hadoop/data1 /hadoop/data2 /hadoop/hdfs /hadoop/hdfs_work /hadoop/jn /hadoop/data
do
        ${check_sc_home}/all_command.sh "mkdir -p ${dir_name}"
done
echo "[`date`] Time_Stamp : hadoop 필요 디렉토리 생성  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop_logs_files rm  Start...."
${check_sc_home}/all_command.sh "rm -rf /logs/hadoop/*"
${check_sc_home}/all_command.sh "rm -rf /var/run/hadoop/hdfs"
echo "[`date`] Time_Stamp : hadoop_logs_files rm  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop docker_container_stop  Start...."
${check_sc_home}/all_command.sh "docker stop ${hadoop_hive_image_name}"
echo "[`date`] Time_Stamp : hadoop docker_container_stop  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop docker_container_rm  Start...."
${check_sc_home}/all_command.sh "docker rm ${hadoop_hive_image_name}"
echo "[`date`] Time_Stamp : hadoop docker_container_rm  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop docker_container_image_rm  Start...."
${check_sc_home}/all_command.sh "docker rmi ${hadoop_hive_image_name}:${docker_image_tag}"
echo "[`date`] Time_Stamp : hadoop docker_container_image_rm  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop docker_build  Start...."
${check_sc_home}/all_command.sh "docker build -t ${hadoop_hive_image_name}:${docker_image_tag} ${docker_image_dir}"
echo "[`date`] Time_Stamp : hadoop docker_build  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop docker_run  Start...."
${check_sc_home}/all_command.sh "docker run -d --name ${hadoop_hive_image_name} --network host -v /root/.ssh:/root/.ssh -v /hadoop:/hadoop -v /var/run/hadoop/hdfs:/var/run/hadoop/hdfs ${hadoop_hive_image_name}:${docker_image_tag}"
echo "[`date`] Time_Stamp : hadoop docker_run  End...."; echo "";
#####################################################################################################


#####################################################################################################
echo "[`date`] Time_Stamp : hadoop bin_file cp  Start...."
${check_sc_home}/all_command.sh "docker cp ${hadoop_hive_image_name}:/data/sy0218/ /data/"
echo "[`date`] Time_Stamp : hadoop bin_file cp  End...."; echo "";
#####################################################################################################
echo "[`date`] Time_Stamp : hadoop auto run  End...."

 

2-4. Hadoop Restart (HA initialization sequence)

1) hdfs zkfc -formatZK (master1)
2) start-dfs.sh (master1)
3) hdfs namenode -format (master1)
4) stop-dfs.sh (master1)
5) start-all.sh (master1)
6) hdfs namenode -bootstrapStandby (master2)
7) restart the cluster and check the HA state (master1):
/data/sy0218/hadoop-3.3.5/sbin/stop-all.sh
/data/sy0218/hadoop-3.3.5/sbin/start-all.sh

hdfs haadmin -getServiceState namenode1
hdfs haadmin -getServiceState namenode2

 

jps : shows the name and PID of every Java process currently running on the host

/data/check/jps_check.sh

#!/usr/bin/bash

for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        echo "------------jps on ${host_name}------------"
        ssh ${host_name} "jps"
        echo "-------------------------------------------"; echo"";
done


******************************** Sample Output ********************************

------------jps on kube-control1------------
9873 ResourceManager
9093 NameNode
10663 Jps
9534 DFSZKFailoverController
9342 JournalNode
-------------------------------------------

------------jps on kube-node1------------
5980 JournalNode
6605 Jps
5870 NameNode
6110 DFSZKFailoverController
6239 ResourceManager
-------------------------------------------

------------jps on kube-data1------------
5250 JournalNode
5380 NodeManager
5125 DataNode
5658 Jps
-------------------------------------------

------------jps on kube-data2------------
4616 NodeManager
4874 Jps
4478 DataNode
-------------------------------------------

------------jps on kube-data3------------
4629 NodeManager
4935 Jps
4491 DataNode
-------------------------------------------

 
 

/data/check/all_command.sh (run a given command on every node)

#!/usr/bin/bash


# Check the argument count
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <command>"
    exit 1
fi

command=$1

for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        echo "------command ${host_name}------------"
        ssh ${host_name} "${command}"
        echo "--------------------------------------"; echo"";
done
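
The directory-creation and docker commands in hadoop_auto_fun.sh all go through this script; it can also be run ad hoc, for example (illustrative commands):

/data/check/all_command.sh "df -h /hadoop"
/data/check/all_command.sh "docker ps --format '{{.Names}}'"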

 

 


docker_container_restart.sh (restart Docker containers)

#!/usr/bin/bash

echo "[`date`] Time_Stamp : restart docker container Start...."; echo "";

echo "[`date`] Time_Stamp : restart zookeeper Start...."
for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        containers_id=$(ssh ${host_name} "docker ps -a | grep zookeeper | awk '{print \$1}'")
        if [ -n "${containers_id}" ]; then
                echo "------------Containers restart ${host_name}------------"
                ssh ${host_name} "docker start ${containers_id}"
                echo "-------------------------------------------------------"; echo"";
        else
                echo "------------Containers restart ${host_name}------------"
                echo "no zeekeeper server"
                echo "-------------------------------------------------------"; echo"";
        fi
done
echo "[`date`] Time_Stamp : restart zookeeper End...."; echo "";


echo "[`date`] Time_Stamp : restart postgresql Start...."
for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        containers_id=$(ssh ${host_name} "docker ps -a | grep postsql | awk '{print \$1}'")
        if [ -n "${containers_id}" ]; then
                echo "------------Containers restart ${host_name}------------"
                ssh ${host_name} "docker start ${containers_id}"
                echo "-------------------------------------------------------"; echo"";
        else
                echo "------------Containers restart ${host_name}------------"
                echo "no postsql server"
                echo "-------------------------------------------------------"; echo"";
        fi
done
echo "[`date`] Time_Stamp : restart postgresql End...."; echo "";


echo "[`date`] Time_Stamp : restart hadoop_hive Start...."
for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        containers_id=$(ssh ${host_name} "docker ps -a | grep hadoop_hive | awk '{print \$1}'")
        if [ -n "${containers_id}" ]; then
                echo "------------Containers restart ${host_name}------------"
                ssh ${host_name} "docker start ${containers_id}"
                echo "-------------------------------------------------------"; echo"";
        else
                echo "------------Containers restart ${host_name}------------"
                echo "no hadoop_hive server"
                echo "-------------------------------------------------------"; echo"";
        fi
done
echo "[`date`] Time_Stamp : restart hadoop_hive End...."; echo "";

echo "[`date`] Time_Stamp : restart docker container End...."

 

 

docker_container.sh (check running Docker containers)

#!/usr/bin/bash

for host_name in kube-control1 kube-node1 kube-data1 kube-data2 kube-data3
do
        echo "------Containers on ${host_name}------------"
        ssh  ${host_name} 'docker ps --format "{{.Names}}"'
        echo "--------------------------------------------"; echo"";
done

 

 

zookeeper_check.sh (check ZooKeeper status)

#!/usr/bin/bash

for host_name in kube-control1 kube-node1 kube-data1
do
        echo "------------zookeeper type ${host_name}------------"
        ssh -t ${host_name} "docker exec -it zookeeper /data/sy0218/apache-zookeeper-3.7.2-bin/bin/zkServer.sh status"
        echo "---------------------------------------------------"; echo"";
done

 

 

3. Starting Hive

1. Initialize the Apache Hive metastore schema
/data/sy0218/apache-hive-3.1.3-bin/bin/schematool -initSchema -dbType postgres
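
As a hedged follow-up (not in the original steps), schematool can also report the metastore schema version to confirm the initialization succeeded:

# Shows the Hive distribution version and the metastore schema version
/data/sy0218/apache-hive-3.1.3-bin/bin/schematool -info -dbType postgres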


2. Run HiveServer2 in the background and append its logs to the specified file
mkdir -p /hive/log
nohup hive --service hiveserver2 >> /hive/log/hiveserver2.log 2>&1 &
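
A hedged way to confirm HiveServer2 actually came up before running queries; HiveServer2 launched this way appears as a RunJar process in jps output:

# HiveServer2 shows up as RunJar in jps once its JVM is running
jps | grep RunJar

# Watch the log for startup progress or errors
tail -n 20 /hive/log/hiveserver2.log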

 

3-1. hive_test

hive -e "show databases;"