Data Engineer ( Hands-on Notes )

ansible - Dynamic Hadoop Installation

세용용용용 2024. 7. 23. 17:10

Previous post: Docker container practice [Hadoop/Hive] dynamic execution (sy02229.tistory.com)

 

The existing container-based approach had the drawback of extra overhead...

This time the goal is to remove that drawback by installing everything dynamically in the local environment with Ansible. Let's go!

 

 

0. Shared configuration resource for the distributed applications ( /data/work/system_download.txt ) - a parsing sketch follows the listing below

[server_ip]|192.168.56.10|192.168.56.11|192.168.56.12

[zookeeper_ip]|192.168.56.10|192.168.56.11|192.168.56.12
-----------------------[zoo.cfg-start]-----------------------
[tickTime=]|2000
[initLimit=]|11
[syncLimit=]|5
[dataDir=]|/data/sy0218/apache-zookeeper-3.7.2-bin/data
[clientPort=]|2181
-----------------------[zoo.cfg-end]-----------------------


[postgresql_data_directory]|/pgdb/pg_data
-----------------------[postgresql.conf-start]-----------------------
[data_directory]|'/pgdb/pg_data'
[listen_addresses]|'*'
[port]|5432
-----------------------[postgresql.conf-end]-----------------------


[hadoop_ip]|192.168.56.10|192.168.56.11|192.168.56.12
[need_dir]|/hadoop/hdfs_work|/hadoop/hdfs|/hadoop/data1|/hadoop/data2|/hadoop/jn|/hadoop/data
-----------------------[core-site.xml-start]-----------------------
[fs.default.name]|hdfs://192.168.56.10:9000
[fs.defaultFS]|hdfs://my-hadoop-cluster
[hadoop.tmp.dir]|file:///hadoop/hdfs_work/hadoop-root
[ha.zookeeper.quorum]|192.168.56.10:2181,192.168.56.11:2181,192.168.56.12:2181
-----------------------[core-site.xml-end]-----------------------

-----------------------[hdfs-site.xml-start]-----------------------
[dfs.namenode.name.dir]|file:///hadoop/hdfs/nn
[dfs.datanode.data.dir]|file:///hadoop/data1,file:///hadoop/data2
[dfs.journalnode.edits.dir]|/hadoop/jn
[dfs.namenode.rpc-address.my-hadoop-cluster.namenode1]|192.168.56.10:8020
[dfs.namenode.rpc-address.my-hadoop-cluster.namenode2]|192.168.56.11:8020
[dfs.namenode.http-address.my-hadoop-cluster.namenode1]|192.168.56.10:50070
[dfs.namenode.http-address.my-hadoop-cluster.namenode2]|192.168.56.11:50070
[dfs.namenode.shared.edits.dir]|qjournal://192.168.56.10:8485;192.168.56.11:8485;192.168.56.12:8485/my-hadoop-cluster
[dfs.name.dir]|/hadoop/data/name
[dfs.data.dir]|/hadoop/data/data
-----------------------[hdfs-site.xml-end]-----------------------

-----------------------[mapred-site.xml-start]-----------------------
[mapreduce.framework.name]|yarn
-----------------------[mapred-site.xml-end]-----------------------

-----------------------[yarn-site.xml-start]-----------------------
[yarn.resourcemanager.hostname.rm1]|192.168.56.10
[yarn.resourcemanager.hostname.rm2]|192.168.56.11
[yarn.resourcemanager.webapp.address.rm1]|192.168.56.10:8088
[yarn.resourcemanager.webapp.address.rm2]|192.168.56.11:8088
[yarn.resourcemanager.zk-address]|192.168.56.10:2181,192.168.56.11:2181,192.168.56.12:2181
[yarn.nodemanager.resource.memory-mb]|8192
-----------------------[yarn-site.xml-end]-----------------------

-----------------------[hadoop-env.sh-start]-----------------------
[export JAVA_HOME=]|/usr/lib/jvm/java-8-openjdk-amd64
[export HADOOP_HOME=]|/data/sy0218/hadoop-3.3.5
-----------------------[hadoop-env.sh-end]-----------------------

-----------------------[workers-start]-----------------------
192.168.56.10
192.168.56.11
192.168.56.12
-----------------------[workers-end]-----------------------
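Every key in this file is wrapped in square brackets and separated from its values by '|', and multi-line sections are enclosed by [name-start] / [name-end] markers. As a quick way to check the format (a minimal sketch reusing the same grep/awk patterns that entrypoint.sh uses below), a single key or a whole block can be read back like this:

#!/bin/bash
# Minimal sketch: read values back out of the shared settings file
system_file="/data/work/system_download.txt"

# A single key with multiple values, e.g. the ZooKeeper server IPs
zk_ips=($(grep zookeeper_ip ${system_file} | awk -F '|' '{for(i=2; i<=NF; i++) print $i}'))
echo "zookeeper servers: ${zk_ips[@]}"

# A whole block between its start/end markers, e.g. the zoo.cfg section
awk '/\[zoo.cfg-start\]/{flag=1; next} /\[zoo.cfg-end\]/{flag=0} flag' ${system_file}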

 

 

 

1. Inventory ( /data/work/hadoop_3.3.5_auto_ansible/hosts.ini )

An inventory is the list of systems to be managed.

[servers]
192.168.56.10
192.168.56.11
192.168.56.12
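Before running anything, it is worth confirming that Ansible can reach every host in the inventory (this assumes SSH key access to the three servers is already in place):

ansible -i /data/work/hadoop_3.3.5_auto_ansible/hosts.ini servers -m ping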

 

 

 

2. Playbook variable file ( /data/work/hadoop_3.3.5_auto_ansible/main.yml )

hadoop_tar_path: "/data/download_tar"
hadoop_tar_filename: "hadoop-3.3.5.tar.gz"
work_dir: "/data/sy0218"
play_book_dir: "/data/work/hadoop_3.3.5_auto_ansible"
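The plays below load these values through vars_files, so changing a path only means editing main.yml. For a one-off run they can also be overridden with ansible-playbook's standard extra-vars option; the alternative work directory here is just a hypothetical example:

ansible-playbook -i /data/work/hadoop_3.3.5_auto_ansible/hosts.ini /data/work/hadoop_3.3.5_auto_ansible/hadoop_deploy.yml -e "work_dir=/data/other_dir"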

 

 

 

3. entrypoint.sh for dynamic Hadoop configuration ( /data/work/hadoop_3.3.5_auto_ansible/entrypoint.sh )

 

(0) Download the Hadoop tarball with wget

wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.5/hadoop-3.3.5.tar.gz
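To guard against a corrupted download, the published SHA-512 checksum can be compared as well (assuming the matching .sha512 file is available next to the tarball on archive.apache.org):

wget https://archive.apache.org/dist/hadoop/core/hadoop-3.3.5/hadoop-3.3.5.tar.gz.sha512
sha512sum hadoop-3.3.5.tar.gz   # compare against the value in the .sha512 file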

 

(1) Hadoop configuration file collection

The base configuration files come from the earlier post: Hadoop cluster operation ( Docker containers ) practice [Hadoop] (sy02229.tistory.com)

 

(2) entrypoint.sh

#!/bin/bash

# Shared settings file that drives every dynamic value below
system_file="/data/work/system_download.txt"
# $1: playbook/config directory, $2: directory where the Hadoop tarball was extracted
conf_dir=$1
work_dir=$2
# Target server IPs and required directories, parsed from system_download.txt
ip_array=($(grep hadoop_ip ${system_file} | awk -F '|' '{for(i=2; i<=NF; i++) print $i}'))
len_ip_array=${#ip_array[@]}
hadoop_need_dir=($(grep need_dir ${system_file} | awk -F '|' '{for(i=2; i<=NF; i++) print $i}'))
len_need_dir=${#hadoop_need_dir[@]}


# Rewrite <value> entries in core-site.xml: find the matching <name> line, then replace the following line
for core_config_low in $(awk '/\[core-site.xml-start\]/{flag=1; next} /\[core-site.xml-end\]/{flag=0} flag' ${system_file});
do
        file_name=$(find ${conf_dir} -type f -name "*core-site.xml*")
        core_site_name=$(echo ${core_config_low} | awk -F '|' '{print $1}' | sed 's/[][]//g')
        core_site_value=$(echo ${core_config_low} | awk -F '|' '{print $2}')
        # sed: on the line right after <name>...</name>, change the whole line to the new <value>
        sed -i "/<name>${core_site_name}<\/name>/!b;n;c<value>${core_site_value}</value>" ${file_name}
done


# Rewrite <value> entries in hdfs-site.xml
for hdfs_site_config_low in $(awk '/\[hdfs-site.xml-start\]/{flag=1; next} /\[hdfs-site.xml-end\]/{flag=0} flag' ${system_file});
do
        file_name=$(find ${conf_dir} -type f -name "*hdfs-site.xml*")
        hdfs_site_name=$(echo ${hdfs_site_config_low} | awk -F '|' '{print $1}' | sed 's/[][]//g')
        hdfs_site_value=$(echo ${hdfs_site_config_low} | awk -F '|' '{print $2}')
        sed -i "/<name>${hdfs_site_name}<\/name>/!b;n;c<value>${hdfs_site_value}</value>" ${file_name}
done


# Rewrite <value> entries in mapred-site.xml
for mapred_site_config_low in $(awk '/\[mapred-site.xml-start\]/{flag=1; next} /\[mapred-site.xml-end\]/{flag=0} flag' ${system_file});
do
        file_name=$(find ${conf_dir} -type f -name "*mapred-site.xml*")
        mapred_site_name=$(echo ${mapred_site_config_low} | awk -F '|' '{print $1}' | sed 's/[][]//g')
        mapred_site_value=$(echo ${mapred_site_config_low} | awk -F '|' '{print $2}')
        sed -i "/<name>${mapred_site_name}<\/name>/!b;n;c<value>${mapred_site_value}</value>" ${file_name}
done


# Rewrite <value> entries in yarn-site.xml
for yarn_site_config_low in $(awk '/\[yarn-site.xml-start\]/{flag=1; next} /\[yarn-site.xml-end\]/{flag=0} flag' ${system_file});
do
        file_name=$(find ${conf_dir} -type f -name "*yarn-site.xml*")
        yarn_site_name=$(echo ${yarn_site_config_low} | awk -F '|' '{print $1}' | sed 's/[][]//g')
        yarn_site_value=$(echo ${yarn_site_config_low} | awk -F '|' '{print $2}')
        sed -i "/<name>${yarn_site_name}<\/name>/!b;n;c<value>${yarn_site_value}</value>" ${file_name}
done


# Update the export lines in hadoop-env.sh (values contain spaces, so read line by line)
hadoop_env_config=$(awk '/\[hadoop-env.sh-start\]/{flag=1; next} /\[hadoop-env.sh-end\]/{flag=0} flag' ${system_file})
while IFS= read -r hadoop_env_config_low;
do
        file_name=$(find ${conf_dir} -type f -name "*hadoop-env.sh*")
        hadoop_env_name=$(echo $hadoop_env_config_low | awk -F '|' '{print $1}' | sed 's/[][]//g')
        hadoop_env_value=$(echo $hadoop_env_config_low | awk -F '|' '{print $2}')
        sed -i "s|^${hadoop_env_name}.*$|${hadoop_env_name}${hadoop_env_value}|" ${file_name}
done <<< "$hadoop_env_config"


# Regenerate the workers file from the [workers] block
work_file_name=$(find ${conf_dir} -type f -name "*workers*")
truncate -s 0 $work_file_name
for workers_low in $(awk '/\[workers-start\]/{flag=1; next} /\[workers-end\]/{flag=0} flag' ${system_file});
do
        echo $workers_low >> $work_file_name
done

# Recreate the required directories on every Hadoop server
for ((i=0; i<len_ip_array; i++)); do
        current_ip=${ip_array[$i]}

        for ((j=0; j<len_need_dir; j++)); do
                current_dir=${hadoop_need_dir[$j]}
                echo "${current_ip}: removing and recreating required directory ${current_dir}"
                ssh ${current_ip} "rm -rf ${current_dir}"
                ssh ${current_ip} "mkdir -p ${current_dir}"
        done
done


# scp the dynamically generated Hadoop config files to every node in the cluster
for cp_file in $(ls ${conf_dir});
do
        if [[ "$cp_file" != "entrypoint.sh" && "$cp_file" != "hadoop-3.3.5.tar.gz" && "$cp_file" != *.yml && "$cp_file" != *.ini ]]; then
                # fair-scheduler.xml has no counterpart in the extracted tarball, so target the etc/hadoop conf directory itself
                if [[ "$cp_file" == "fair-scheduler.xml" ]]; then
                        local_path=$(find ${work_dir}/*hadoop*/etc/hadoop -name hadoop -type d)
                else
                        local_path=$(find ${work_dir}/ -name ${cp_file} -type f ! -path "*/sample-conf/*")
                fi
                for ((i=0; i<len_ip_array; i++));
                do
                        current_ip=${ip_array[$i]}
                        scp ${conf_dir}/${cp_file} root@${current_ip}:${local_path}
                done
        fi
done
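For reference, the playbook in section 4 runs this script with the playbook directory and the work directory from main.yml as its two arguments, so an equivalent manual run on the control node looks like this:

/data/work/hadoop_3.3.5_auto_ansible/entrypoint.sh /data/work/hadoop_3.3.5_auto_ansible /data/sy0218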

 

 

 

4. Playbook YAML file ( /data/work/hadoop_3.3.5_auto_ansible/hadoop_deploy.yml )

---
- name: Create hadoop_tar directory
  hosts: servers
  become: yes
  vars_files:
    - /data/work/hadoop_3.3.5_auto_ansible/main.yml
  tasks:
    - name: Create hadoop_tar directory
      file:
        path: "{{ hadoop_tar_path }}"
        state: directory

    - name: Create work directory
      file:
        path: "{{ work_dir }}"
        state: directory


- name: Copy hadoop_tar to servers
  hosts: localhost
  become: yes
  vars_files:
    - /data/work/hadoop_3.3.5_auto_ansible/main.yml
  tasks:
    - name: Copy hadoop_tar to servers
      copy:
        src: "{{ play_book_dir }}/{{ hadoop_tar_filename }}"
        dest: "{{ hadoop_tar_path }}/{{ hadoop_tar_filename }}"
        mode: "0644"
      delegate_to: "{{ item }}"
      with_items:
        - '192.168.56.10'
        - '192.168.56.11'
        - '192.168.56.12'


- name: Extract hadoop_tar
  hosts: servers
  become: yes
  vars_files:
    - /data/work/hadoop_3.3.5_auto_ansible/main.yml
  tasks:
    - name: Extract the hadoop tarball
      unarchive:
        src: "{{ hadoop_tar_path }}/{{ hadoop_tar_filename }}"
        dest: "{{ work_dir }}"
        remote_src: yes


- name: entrypoint_sh start
  hosts: localhost
  become: yes
  vars_files:
    - /data/work/hadoop_3.3.5_auto_ansible/main.yml
  tasks:
    - name: entry_point_sh start
      shell: "{{ play_book_dir }}/entrypoint.sh {{ play_book_dir }} {{ work_dir }}"

 

Run command

ansible-playbook -i /data/work/hadoop_3.3.5_auto_ansible/hosts.ini /data/work/hadoop_3.3.5_auto_ansible/hadoop_deploy.yml
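Before the real run, the playbook can also be validated without touching the servers; both of these are standard ansible-playbook options:

ansible-playbook --syntax-check /data/work/hadoop_3.3.5_auto_ansible/hadoop_deploy.yml
ansible-playbook -i /data/work/hadoop_3.3.5_auto_ansible/hosts.ini /data/work/hadoop_3.3.5_auto_ansible/hadoop_deploy.yml --list-tasks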

 

 

 

5. Starting Hadoop

hdfs zkfc -formatZK (master1): formats the ZooKeeper Failover Controller (ZKFC) state for HDFS
- initializes the ZKFC state and writes fresh ZKFC data into ZooKeeper

start-dfs.sh (master1): script that starts the HDFS components
/data/sy0218/hadoop-3.3.5/sbin/start-dfs.sh

hdfs namenode -format (master1): formats the HDFS NameNode

stop-dfs.sh (master1): script that stops the HDFS components
/data/sy0218/hadoop-3.3.5/sbin/stop-dfs.sh
/data/sy0218/hadoop-3.3.5/sbin/stop-all.sh

start-all.sh (master1): starts all of the Hadoop cluster components
/data/sy0218/hadoop-3.3.5/sbin/start-all.sh

hdfs namenode -bootstrapStandby (master2): bootstraps the second NameNode so it can join the cluster as the standby NameNode
/data/sy0218/hadoop-3.3.5/sbin/stop-all.sh (master1)
/data/sy0218/hadoop-3.3.5/sbin/start-all.sh (master1)

hdfs haadmin -getServiceState namenode1 (master1)
hdfs haadmin -getServiceState namenode2 (master1)
hdfs dfsadmin -report (master1)
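To see which daemons actually came up on each node, a quick jps sweep over the three servers is handy (a minimal sketch, assuming the JDK's jps is on the PATH and root SSH access is set up as in the rest of this post):

for ip in 192.168.56.10 192.168.56.11 192.168.56.12; do echo "== ${ip} =="; ssh root@${ip} jps; done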

 

Final setup check ( hdfs dfsadmin -report )

Configured Capacity: 198771019776 (185.12 GB)
Present Capacity: 158007382016 (147.16 GB)
DFS Remaining: 158007283712 (147.16 GB)
DFS Used: 98304 (96 KB)
DFS Used%: 0.00%
Replicated Blocks:
        Under replicated blocks: 0
        Blocks with corrupt replicas: 0
        Missing blocks: 0
        Missing blocks (with replication factor 1): 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0
Erasure Coded Block Groups:
        Low redundancy block groups: 0
        Block groups with corrupt internal blocks: 0
        Missing block groups: 0
        Low redundancy blocks with highest priority to recover: 0
        Pending deletion blocks: 0

-------------------------------------------------
Live datanodes (3):

Name: 192.168.56.10:9866 (kube-control1)
Hostname: kube-control1
Decommission Status : Normal
Configured Capacity: 66257006592 (61.71 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 10655657984 (9.92 GB)
DFS Remaining: 52202463232 (48.62 GB)
DFS Used%: 0.00%
DFS Remaining%: 78.79%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jul 23 17:09:03 KST 2024
Last Block Report: Tue Jul 23 16:57:26 KST 2024
Num of Blocks: 0


Name: 192.168.56.11:9866 (kube-node1)
Hostname: kube-node1
Decommission Status : Normal
Configured Capacity: 66257006592 (61.71 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9957601280 (9.27 GB)
DFS Remaining: 52900519936 (49.27 GB)
DFS Used%: 0.00%
DFS Remaining%: 79.84%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jul 23 17:09:03 KST 2024
Last Block Report: Tue Jul 23 16:55:05 KST 2024
Num of Blocks: 0


Name: 192.168.56.12:9866 (kube-data1)
Hostname: kube-data1
Decommission Status : Normal
Configured Capacity: 66257006592 (61.71 GB)
DFS Used: 32768 (32 KB)
Non DFS Used: 9953820672 (9.27 GB)
DFS Remaining: 52904300544 (49.27 GB)
DFS Used%: 0.00%
DFS Remaining%: 79.85%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 0
Last contact: Tue Jul 23 17:09:02 KST 2024
Last Block Report: Tue Jul 23 16:54:59 KST 2024
Num of Blocks: 0