
Hadoop-Ecosystem


1 Overview

Below are the Hadoop components that together form the Hadoop ecosystem:

  • HDFS -> Hadoop Distributed File System
  • YARN -> Yet Another Resource Negotiator
  • MapReduce -> Data processing using a programming model
  • Spark -> In-memory Data Processing
  • PIG -> Data Processing Services using Query (SQL-like)
  • HBase -> NoSQL Database
  • Mahout & Spark MLlib -> Machine Learning
  • Drill -> SQL on Hadoop
  • Zookeeper -> Managing Cluster
  • Oozie -> Job Scheduling
  • Flume -> Data Ingesting Services
  • Solr & Lucene -> Searching & Indexing
  • Ambari -> Provision, Monitor and Maintain cluster

(Figure: Hadoop ecosystem overview)

2 HDFS

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

(Figure: HDFS architecture)
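
A minimal client-side sketch of this write/read path, assuming a cluster reachable at hdfs://hadoop:8020 (the address used by the Docker deployment below) and the hadoop-client library on the classpath:

import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The client asks the NameNode for namespace operations; the actual bytes
        // are written to and read from DataNodes.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop:8020"), conf)) {
            Path file = new Path("/tmp/hello.txt");

            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
            }

            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}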

2.1 Deployment via Docker

Important paths inside the container:

  • /var/log/hadoop
    • /var/log/hadoop/hadoop-hadoop-namenode-$(hostname).out
    • /var/log/hadoop/hadoop-hadoop-datanode-$(hostname).out
    • /var/log/hadoop/hadoop-hadoop-resourcemanager-$(hostname).out
    • /var/log/hadoop/hadoop-hadoop-nodemanager-$(hostname).out
    • userlogs: Logs for applications that are submitted to YARN.

Important ports:

  • 8020: The NameNode’s default RPC port.
  • 9866: The DataNode’s server address and port for data transfer.
  • 8042: NodeManager (Web UI).
  • 8088: ResourceManager (Web UI).
SHARED_NS=hadoop-ns
HADOOP_CONTAINER_NAME=hadoop

docker run -dit --name ${HADOOP_CONTAINER_NAME} --hostname ${HADOOP_CONTAINER_NAME} --network ${SHARED_NS} --privileged -p 8020:8020 -p 9866:9866 -p 8042:8042 -p 8088:8088 apache/hadoop:3.3.6 bash
docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/core-site.xml << EOF
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://${HADOOP_CONTAINER_NAME}:8020</value>
</property>
</configuration>
EOF"

docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << EOF
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>${HADOOP_CONTAINER_NAME}:9866</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>${HADOOP_CONTAINER_NAME}:9864</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>${HADOOP_CONTAINER_NAME}:9867</value>
</property>
<property>
<name>dfs.datanode.hostname</name>
<value>${HADOOP_CONTAINER_NAME}</value>
</property>
</configuration>
EOF"

docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/yarn-site.xml << EOF
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>${HADOOP_CONTAINER_NAME}</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
</property>
</configuration>
EOF"

docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/mapred-site.xml << EOF
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>/opt/hadoop/share/hadoop/mapreduce/*,/opt/hadoop/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>
EOF"

# Format
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs namenode -format'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'mkdir -p /opt/hadoop/data'

# Restart all daemons
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs --daemon stop namenode; hdfs --daemon start namenode'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs --daemon stop datanode; hdfs --daemon start datanode'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'yarn --daemon stop resourcemanager; yarn --daemon start resourcemanager'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'yarn --daemon stop nodemanager; yarn --daemon start nodemanager'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'mapred --daemon stop historyserver; mapred --daemon start historyserver'

# Report status
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs dfsadmin -report'

Test:

docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 10 100'

2.2 Deployment via Docker and Kerberos

2.2.1 Kerberos Basics

Concepts:

  • Key Distribution Center, KDC
    • The central server responsible for managing authentication.
    • Comprises two sub-components:
      • Authentication Server (AS): Authenticates the client and issues the Ticket Granting Ticket (TGT).
      • Ticket Granting Server (TGS): Issues service-specific tickets upon request.
  • Principal: A unique identity (user, service, or host) in the Kerberos system, e.g., user@REALM or service/hostname@REALM.
  • Realm: A logical network served by a single KDC, identified by an uppercase string, e.g., EXAMPLE.COM.
  • Keytab: A file storing pre-shared credentials for a user or service, used for automated authentication.
  • Ticket: A temporary set of credentials that allows a principal to authenticate to services.
    • Ticket Granting Ticket (TGT): Issued by the AS, used to request service tickets.
    • Service Ticket: Allows access to a specific service.

Kerberos Commands:

  • kinit <principal>[@<kerberos_realm>]: Login with password.
  • kinit -kt <keytabpath> <principal>[@<kerberos_realm>]: Login with keytab.
  • kinit -c <cache_path> <principal>[@<kerberos_realm>]: Use a specific credential cache path.
  • klist
  • klist -c <cache_path>
  • kdestroy
  • kdestroy -c <cache_path>
  • Environments:
    • export KRB5_TRACE=/dev/stdout
    • export KRB5_CONFIG=<path/to/krb5.conf>
    • export KRB5CCNAME=FILE:/tmp/krb5cc_testuser: Use a local file as the credential cache.
    • export KRB5CCNAME=MEMORY:: Use memory as the credential cache.

kadmin.local Commands:

  • ?: Show the help documentation.

Tips:

  • Make sure the target user has permission to read the related files, including the config file, the TGT cache, and the keytab.
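
For Hadoop clients, the keytab login shown above with kinit can also be done programmatically. A minimal sketch, assuming a testuser principal in the EXAMPLE.COM realm and a keytab exported for it (the keytab path below is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLoginExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the Hadoop client stack that the cluster uses Kerberos.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Equivalent of `kinit -kt <keytabpath> <principal>` for this JVM.
        UserGroupInformation.loginUserFromKeytab(
                "testuser@EXAMPLE.COM",                   // assumed principal
                "/etc/security/keytabs/testuser.keytab"); // hypothetical keytab path

        System.out.println("Logged in as: " + UserGroupInformation.getLoginUser());
    }
}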

2.2.2 Kerberos Container

SHARED_NS=hadoop-ns
HADOOP_CONTAINER_NAME=hadoop-with-kerberos
KERBEROS_CONTAINER_NAME=kerberos
REAL_DOMAIN=liuyehcf.org
HADOOP_HOSTNAME=${HADOOP_CONTAINER_NAME}.${REAL_DOMAIN}
KERBEROS_HOSTNAME=${KERBEROS_CONTAINER_NAME}.${REAL_DOMAIN}
KERBEROS_LOGIC_DOMAIN=example.com
KERBEROS_LOGIC_DOMAIN_UPPER=$(echo ${KERBEROS_LOGIC_DOMAIN} | tr "[:lower:]" "[:upper:]")

docker run -dit --name ${KERBEROS_CONTAINER_NAME} --hostname ${KERBEROS_HOSTNAME} --network ${SHARED_NS} --privileged -p 88:88 -p 464:464 -p 749:749 ubuntu:xenial

# Install kerberos
docker exec ${KERBEROS_CONTAINER_NAME} bash -c 'apt update'
docker exec ${KERBEROS_CONTAINER_NAME} bash -c 'DEBIAN_FRONTEND=noninteractive apt install -y ntp python-dev python-pip python-wheel python-setuptools python-pkg-resources krb5-admin-server krb5-kdc'
docker exec ${KERBEROS_CONTAINER_NAME} bash -c 'apt install -y vim iputils-ping iproute2'

# Setup kerberos config
docker exec ${KERBEROS_CONTAINER_NAME} bash -c "tee /etc/krb5.conf > /dev/null << EOF
[libdefaults]
default_realm = ${KERBEROS_LOGIC_DOMAIN_UPPER}
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
${KERBEROS_LOGIC_DOMAIN_UPPER} = {
kdc = ${KERBEROS_HOSTNAME}
admin_server = ${KERBEROS_HOSTNAME}
}

[domain_realm]
${REAL_DOMAIN} = ${KERBEROS_LOGIC_DOMAIN_UPPER}
.${REAL_DOMAIN} = ${KERBEROS_LOGIC_DOMAIN_UPPER}
EOF"

docker exec ${KERBEROS_CONTAINER_NAME} bash -c "cat > /etc/krb5kdc/kdc.conf << EOF
[kdcdefaults]
kdc_ports = 88
kdc_tcp_ports = 88

[realms]
${KERBEROS_LOGIC_DOMAIN_UPPER} = {
database_name = /var/lib/krb5kdc/principal
admin_keytab = /etc/krb5kdc/kadm5.keytab
acl_file = /etc/krb5kdc/kadm5.acl
key_stash_file = /etc/krb5kdc/stash
log_file = /var/log/krb5kdc.log
kdc_ports = 88
max_life = 10h 0m 0s
max_renewable_life = 7d 0h 0m 0s
}
EOF"

docker exec ${KERBEROS_CONTAINER_NAME} bash -c "cat > /etc/krb5kdc/kadm5.acl << EOF
*/admin@${KERBEROS_LOGIC_DOMAIN_UPPER} *
EOF"

docker exec ${KERBEROS_CONTAINER_NAME} bash -c 'kdb5_util create -s <<EOF
!Abcd1234
!Abcd1234
EOF'
docker exec ${KERBEROS_CONTAINER_NAME} bash -c '/usr/sbin/krb5kdc'
docker exec ${KERBEROS_CONTAINER_NAME} bash -c '/usr/sbin/kadmind'

# Create principal for hadoop
docker exec ${KERBEROS_CONTAINER_NAME} bash -c 'mkdir -p /etc/security/keytabs'
docker exec ${KERBEROS_CONTAINER_NAME} bash -c "kadmin.local <<EOF
addprinc -randkey nn/${HADOOP_HOSTNAME}@${KERBEROS_LOGIC_DOMAIN_UPPER}
addprinc -randkey dn/${HADOOP_HOSTNAME}@${KERBEROS_LOGIC_DOMAIN_UPPER}
listprincs

ktadd -k /etc/security/keytabs/nn.service.keytab nn/${HADOOP_HOSTNAME}@${KERBEROS_LOGIC_DOMAIN_UPPER}
ktadd -k /etc/security/keytabs/dn.service.keytab dn/${HADOOP_HOSTNAME}@${KERBEROS_LOGIC_DOMAIN_UPPER}

quit
EOF"

# Create principal for user
docker exec ${KERBEROS_CONTAINER_NAME} bash -c "kadmin.local <<EOF
addprinc -pw 123456 testuser
listprincs

quit
EOF"

Test:

# password: 123456
kinit testuser
klist

2.2.3 Hadoop Container

SHARED_NS=hadoop-ns
HADOOP_CONTAINER_NAME=hadoop-with-kerberos
KERBEROS_CONTAINER_NAME=kerberos
REAL_DOMAIN=liuyehcf.org
HADOOP_HOSTNAME=${HADOOP_CONTAINER_NAME}.${REAL_DOMAIN}
KERBEROS_HOSTNAME=${KERBEROS_CONTAINER_NAME}.${REAL_DOMAIN}
KERBEROS_LOGIC_DOMAIN=example.com
KERBEROS_LOGIC_DOMAIN_UPPER=$(echo ${KERBEROS_LOGIC_DOMAIN} | tr "[:lower:]" "[:upper:]")

docker run -dit --name ${HADOOP_CONTAINER_NAME} --hostname ${HADOOP_HOSTNAME} --network ${SHARED_NS} --privileged -p 8020:8020 -p 9866:9866 apache/hadoop:3.3.6 bash
docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/core-site.xml << EOF
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://${HADOOP_HOSTNAME}:8020</value>
</property>
<property>
<name>hadoop.security.authentication</name>
<value>kerberos</value>
</property>
<property>
<name>hadoop.security.authorization</name>
<value>true</value>
</property>
</configuration>
EOF"

docker exec ${HADOOP_CONTAINER_NAME} bash -c "cat > /opt/hadoop/etc/hadoop/hdfs-site.xml << EOF
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data</value>
</property>
<property>
<name>dfs.datanode.address</name>
<value>${HADOOP_HOSTNAME}:9866</value>
</property>
<property>
<name>dfs.datanode.http.address</name>
<value>${HADOOP_HOSTNAME}:9864</value>
</property>
<property>
<name>dfs.datanode.ipc.address</name>
<value>${HADOOP_HOSTNAME}:9867</value>
</property>
<property>
<name>dfs.datanode.hostname</name>
<value>${HADOOP_HOSTNAME}</value>
</property>

<property>
<name>dfs.namenode.kerberos.principal</name>
<value>nn/_HOST@${KERBEROS_LOGIC_DOMAIN_UPPER}</value>
</property>
<property>
<name>dfs.namenode.keytab.file</name>
<value>/etc/security/keytabs/nn.service.keytab</value>
</property>
<property>
<name>dfs.datanode.kerberos.principal</name>
<value>dn/_HOST@${KERBEROS_LOGIC_DOMAIN_UPPER}</value>
</property>
<property>
<name>dfs.datanode.keytab.file</name>
<value>/etc/security/keytabs/dn.service.keytab</value>
</property>

<property>
<name>dfs.block.access.token.enable</name>
<value>true</value>
</property>
<property>
<name>dfs.block.access.token.master.key.num</name>
<value>2</value>
</property>
<property>
<name>dfs.block.access.token.lifetime</name>
<value>600</value>
</property>
<property>
<name>ignore.secure.ports.for.testing</name>
<value>true</value>
</property>
</configuration>
EOF"

# Setup kerberos config
docker exec ${HADOOP_CONTAINER_NAME} bash -c "sudo tee /etc/krb5.conf > /dev/null << EOF
[libdefaults]
default_realm = ${KERBEROS_LOGIC_DOMAIN_UPPER}
dns_lookup_realm = false
dns_lookup_kdc = false
ticket_lifetime = 24h
renew_lifetime = 7d
forwardable = true

[realms]
${KERBEROS_LOGIC_DOMAIN_UPPER} = {
kdc = ${KERBEROS_HOSTNAME}
admin_server = ${KERBEROS_HOSTNAME}
}

[domain_realm]
${REAL_DOMAIN} = ${KERBEROS_LOGIC_DOMAIN_UPPER}
.${REAL_DOMAIN} = ${KERBEROS_LOGIC_DOMAIN_UPPER}
EOF"

# Copy keytab
docker cp ${KERBEROS_CONTAINER_NAME}:/etc/security/keytabs/nn.service.keytab /tmp/nn.service.keytab
docker cp ${KERBEROS_CONTAINER_NAME}:/etc/security/keytabs/dn.service.keytab /tmp/dn.service.keytab
docker cp /tmp/nn.service.keytab ${HADOOP_CONTAINER_NAME}:/etc/security/keytabs/nn.service.keytab
docker cp /tmp/dn.service.keytab ${HADOOP_CONTAINER_NAME}:/etc/security/keytabs/dn.service.keytab
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo chmod 644 /etc/security/keytabs/nn.service.keytab'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo chmod 644 /etc/security/keytabs/dn.service.keytab'

# Install jsvc
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo_bak'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo wget -O /etc/yum.repos.d/CentOS-Base.repo http://mirrors.aliyun.com/repo/Centos-7.repo'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo yum install -y jsvc'

# Format
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs namenode -format'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'sudo mkdir -p /opt/hadoop/data'

# Restart all daemons
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs --daemon stop namenode; hdfs --daemon start namenode'
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'HDFS_DATANODE_SECURE_USER=root; \
JSVC_HOME=/bin; \
JAVA_HOME=/usr/lib/jvm/jre; \
sudo -E /opt/hadoop/bin/hdfs --daemon stop datanode; \
sudo -E /opt/hadoop/bin/hdfs --daemon start datanode'

# Report status
docker exec ${HADOOP_CONTAINER_NAME} bash -c "kinit testuser <<EOF
123456
EOF"
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'hdfs dfsadmin -report'

Key points:

  • The directory /opt/hadoop/data must be accessible to the user that starts the DataNode.
  • The credential file /etc/security/keytabs/dn.service.keytab must be readable by the user that starts the DataNode.
  • If ignore.secure.ports.for.testing is set to false, the DataNode’s HTTP and TCP ports must be privileged ports (1023 or lower); otherwise they cannot pass the secure-mode check.

2.3 Configuration

  1. core-site.xml
    • Path: $HADOOP_HOME/etc/hadoop/core-site.xml
    • Description: Contains configuration settings for Hadoop’s core system, including the default filesystem URI.
    • core-default.xml
  2. hdfs-site.xml
    • Path: $HADOOP_HOME/etc/hadoop/hdfs-site.xml
    • Description: Contains configuration settings specific to HDFS.
    • hdfs-default.xml
  3. yarn-site.xml
    • Path: $HADOOP_HOME/etc/hadoop/yarn-site.xml
    • Description: Contains configuration settings for YARN (Yet Another Resource Negotiator).
    • yarn-default.xml
  4. mapred-site.xml
    • Path: $HADOOP_HOME/etc/hadoop/mapred-site.xml
    • Description: Contains configuration settings specific to MapReduce.
    • mapred-default.xml
  5. hadoop-env.sh
    • Path: $HADOOP_HOME/etc/hadoop/hadoop-env.sh
    • Description: Sets environment variables for Hadoop processes, such as JAVA_HOME.
  6. yarn-env.sh
    • Path: $HADOOP_HOME/etc/hadoop/yarn-env.sh
    • Description: Sets environment variables for YARN.
  7. log4j.properties
    • Path: $HADOOP_HOME/etc/hadoop/log4j.properties
    • Description: Configures logging for Hadoop.

2.3.1 How to configure dfs.nameservices

core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
</property>

<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>p0,p1,p2</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.p0</name>
<value>192.168.0.1:12000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.p1</name>
<value>192.168.0.2:12000</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.p2</name>
<value>192.168.0.3:12000</value>
</property>

<property>
<name>dfs.client.failover.proxy.provider.mycluster</name>
<value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled.mycluster</name>
<value>true</value>
</property>
</configuration>
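
With this configuration in place, clients address the logical nameservice instead of a single NameNode. A minimal sketch that sets the same properties programmatically instead of relying on the XML files (addresses taken from the example above):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameserviceClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://mycluster");
        conf.set("dfs.nameservices", "mycluster");
        conf.set("dfs.ha.namenodes.mycluster", "p0,p1,p2");
        conf.set("dfs.namenode.rpc-address.mycluster.p0", "192.168.0.1:12000");
        conf.set("dfs.namenode.rpc-address.mycluster.p1", "192.168.0.2:12000");
        conf.set("dfs.namenode.rpc-address.mycluster.p2", "192.168.0.3:12000");
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // "mycluster" is resolved to whichever NameNode is currently active.
        try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
            System.out.println("Root exists: " + fs.exists(new Path("/")));
        }
    }
}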

2.3.2 Configuration hot reloading

FileSystem.CACHE caches FileSystem objects. The cache key is built from the URI and the Configuration object, so if you only modify the hdfs-site.xml file itself, the Configuration object remains exactly the same and the cached FileSystem instance is returned (a cache hit), which means your changes are not picked up.

To disable the cache, set the property fs.<protocol>.impl.disable.cache to true in hdfs-site.xml (it can also be set programmatically, as sketched below):

  • For hdfs, the property name is: fs.hdfs.impl.disable.cache
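
A minimal client-side sketch of getting a FileSystem that reflects the current configuration instead of a cached instance (the hdfs://hadoop:8020 address is illustrative):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class FreshFileSystem {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // With the cache disabled for the hdfs scheme, every FileSystem.get()
        // builds a new instance from the current configuration.
        conf.setBoolean("fs.hdfs.impl.disable.cache", true);

        URI uri = URI.create("hdfs://hadoop:8020");
        try (FileSystem fs = FileSystem.get(uri, conf)) {
            System.out.println("Non-cached FileSystem: " + fs);
        }

        // Alternative: FileSystem.newInstance() always bypasses the cache,
        // regardless of the disable-cache property.
        try (FileSystem fresh = FileSystem.newInstance(uri, conf)) {
            System.out.println("Fresh FileSystem: " + fresh);
        }
    }
}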

2.3.3 Hedged read

  • dfs.client.hedged.read.threadpool.size
  • dfs.client.hedged.read.threshold.millis
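
Hedged reads let the HDFS client start a second read against a different replica when the first DataNode is slow to respond, and use whichever copy returns first; the thread pool size enables the feature (0 disables it) and the threshold controls how long to wait before hedging. A minimal client-side sketch, with illustrative values and a hypothetical file path:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HedgedReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Thread pool for the extra (hedged) read requests; 0 disables the feature.
        conf.setInt("dfs.client.hedged.read.threadpool.size", 20);
        // How long to wait for the first DataNode before hedging, in milliseconds.
        conf.setLong("dfs.client.hedged.read.threshold.millis", 500);

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop:8020"), conf)) {
            Path file = new Path("/tmp/some-file"); // hypothetical path
            if (fs.exists(file)) {
                fs.open(file).close();              // reads on this stream may be hedged
            }
        }
    }
}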

2.4 Command

2.4.1 daemon

hdfs --daemon stop namenode
hdfs --daemon stop datanode
yarn --daemon stop resourcemanager
yarn --daemon stop nodemanager
mapred --daemon stop historyserver

hdfs --daemon start namenode
hdfs --daemon start datanode
yarn --daemon start resourcemanager
yarn --daemon start nodemanager
mapred --daemon start historyserver

2.4.2 hdfs

2.4.2.1 File Path

hdfs dfs -ls -R <path>

2.4.2.2 File Status

hdfs fsck <path>
hdfs fsck <path> -files -blocks -replication

2.4.2.3 Show Content

# For text file
hdfs dfs -cat <path>

# For avro file
hdfs dfs -text <path.avro>

2.4.2.4 Grant Permission

hdfs dfs -setfacl -R -m user:hive:rwx /
hdfs dfs -setfacl -R -m default:user:hive:rwx /
hdfs dfs -getfacl /

2.4.3 yarn

2.4.3.1 Node

yarn node -list
yarn node -list -showDetails

2.4.3.2 Application

yarn application -list
yarn application -list -appStates ALL

yarn application -status <appid>
yarn logs -applicationId <appid>

yarn application -kill <appid>

2.5 Tips

2.5.1 How to access a hadoop cluster started via docker

On Linux, this is straightforward:

  1. Access Hadoop via the container’s IP.

On a Mac with an Apple Silicon (M-series) chip, the approach above does not work, because there is an extra virtualization layer between macOS and the container. In this case, access both the NameNode and the DataNode by hostname; the hostname must be resolvable both inside the container and on the Mac, and the container’s name is the most convenient choice. The steps are as follows (a client-side sketch follows the list):

  1. Set up port mappings for the NameNode’s port (8020 by default) and the DataNode’s port (9866 by default).
  2. Update /etc/hosts on the Mac, mapping the Hadoop container’s name to 127.0.0.1.
  3. Configure dfs.datanode.hostname on the Hadoop side (i.e. the container side), setting its value to the container’s name.
  4. Configure dfs.client.use.datanode.hostname to true on the client side (i.e. the Mac side); otherwise the client will use the container’s IP address, which is unreachable from the Mac because of the extra virtualization layer.
  5. Access Hadoop via the container’s name.
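
A minimal client-side sketch for step 4, assuming the container is named hadoop and /etc/hosts on the Mac maps that name to 127.0.0.1:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MacDockerClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to DataNodes by hostname instead of the container-internal IP,
        // which is unreachable from macOS.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);

        try (FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop:8020"), conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.println(status.getPath());
            }
        }
    }
}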

2.5.2 SDK does not recognize HADOOP_CONF_DIR automatically

You need to resolve the directory and load the configuration files yourself:

public static void main(String[] args) {
    Configuration hadoopConf = new Configuration();
    String hadoopConfDir = System.getenv(HADOOP_CONF_ENV);
    if (StringUtils.isNotBlank(hadoopConfDir)) {
        addHadoopConfIfFound(hadoopConf, hadoopConfDir);
    }
    // ...
}

private static void addHadoopConfIfFound(Configuration configuration, String possibleHadoopConfPath) {
    if (new File(possibleHadoopConfPath).exists()) {
        if (new File(possibleHadoopConfPath + "/core-site.xml").exists()) {
            configuration.addResource(
                    new org.apache.hadoop.fs.Path(possibleHadoopConfPath + "/core-site.xml"));
            LOGGER.debug("Adding {}/core-site.xml to hadoop configuration", possibleHadoopConfPath);
        }
        if (new File(possibleHadoopConfPath + "/hdfs-site.xml").exists()) {
            configuration.addResource(
                    new org.apache.hadoop.fs.Path(possibleHadoopConfPath + "/hdfs-site.xml"));
            LOGGER.debug("Adding {}/hdfs-site.xml to hadoop configuration", possibleHadoopConfPath);
        }
    }
}

3 Spark

What Is Apache Spark?

(Figure: Spark cluster overview)

3.1 Deployment

#!/bin/bash

ROOT=$(dirname "$0")
ROOT=$(cd "$ROOT"; pwd)

wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -zxf spark-3.5.1-bin-hadoop3.tgz

export SPARK_HOME=${ROOT}/spark-3.5.1-bin-hadoop3
cat > ${SPARK_HOME}/conf/spark-env.sh << 'EOF'
SPARK_MASTER_HOST=localhost
EOF

${SPARK_HOME}/sbin/start-master.sh
${SPARK_HOME}/sbin/start-worker.sh spark://127.0.0.1:7077

Stop:

${SPARK_HOME}/sbin/stop-master.sh
${SPARK_HOME}/sbin/stop-worker.sh

Test:

${SPARK_HOME}/bin/spark-shell

scala> val nums = sc.parallelize(1 to 10)
scala> println(nums.count())
scala> :quit

${SPARK_HOME}/bin/spark-submit ${SPARK_HOME}/examples/src/main/python/pi.py 10

3.2 Tips

3.2.1 Spark-Sql

show catalogs;
set catalog <catalog_name>;

show databases;
use <database_name>;

3.2.2 Spark-Shell

3.2.2.1 Read parquet/orc/avro file

# avro is not a built-in format
spark-shell --packages org.apache.spark:spark-avro_2.12:3.5.2
spark.read.format("parquet").load("hdfs://192.168.64.2/user/iceberg/demo/db/table/data/00000-0-e424d965-50e8-4f61-abc7-6e6c117876f4-0-00001.parquet").show(truncate=false)
spark.read.format("avro").load("hdfs://192.168.64.2/user/iceberg/demo/demo_namespace/demo_table/metadata/snap-2052751058123365495-1-7a31848a-3e5f-43c7-886a-0a8d5f6c8ed7.avro").show(truncate=false)

4 Hive

What is Apache Hive?

(Figure: Hive core architecture)

4.1 Components

4.1.1 HiveServer2 (HS2)

HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
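
A minimal JDBC sketch against HiveServer2, using the same jdbc:hive2://localhost:10000/ URL as the beeline tests later in this section (no authentication assumed; the hive-jdbc standalone jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class Hs2JdbcExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn =
                     DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW DATABASES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}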

4.1.2 Hive Metastore Server (HMS)

The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions in a relational database, and provides clients (including Hive, Impala and Spark) access to this information using the metastore service API. It has become a building block for data lakes that utilize the diverse world of open-source software, such as Apache Spark and Presto. In fact, a whole ecosystem of tools, open-source and otherwise, is built around the Hive Metastore.

4.2 Deployment via Docker

Here is a summary of the compatible versions of Apache Hive and Hadoop (refer to Apache Hive Download for details):

  • Hive 4.0.0: Works with Hadoop 3.3.6, Tez 0.10.3

4.2.1 Use built-in Derby

Apache Hive - Quickstart

Start a hive container joining the shared network.

SHARED_NS=hadoop-ns
HADOOP_CONTAINER_NAME=hadoop
HIVE_PREFIX=hive-with-derby
HIVE_METASTORE_CONTAINER_NAME=${HIVE_PREFIX}-metastore
HIVE_SERVER_CONTAINER_NAME=${HIVE_PREFIX}-server

# Download tez resources and put to hdfs
if [ ! -e /tmp/apache-tez-0.10.3-bin.tar.gz ]; then
wget -O /tmp/apache-tez-0.10.3-bin.tar.gz https://downloads.apache.org/tez/0.10.3/apache-tez-0.10.3-bin.tar.gz
fi
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'mkdir -p /opt/tez'
docker cp /tmp/apache-tez-0.10.3-bin.tar.gz ${HADOOP_CONTAINER_NAME}:/opt/tez
docker exec ${HADOOP_CONTAINER_NAME} bash -c '
if ! hdfs dfs -ls /opt/tez/tez.tar.gz > /dev/null 2>&1; then
rm -rf /opt/tez/apache-tez-0.10.3-bin
tar -zxf /opt/tez/apache-tez-0.10.3-bin.tar.gz -C /opt/tez
hdfs dfs -mkdir -p /opt/tez
hdfs dfs -put -f /opt/tez/apache-tez-0.10.3-bin/share/tez.tar.gz /opt/tez
fi
'

HIVE_SITE_CONFIG_COMMON=$(cat << EOF
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.tez.exec.inplace.progress</name>
<value>false</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/opt/${HIVE_PREFIX}/scratch_dir</value>
</property>
<property>
<name>hive.user.install.directory</name>
<value>/opt/${HIVE_PREFIX}/install_dir</value>
</property>
<property>
<name>tez.runtime.optimize.local.fetch</name>
<value>true</value>
</property>
<property>
<name>hive.exec.submit.local.task.via.child</name>
<value>false</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>tez.local.mode</name>
<value>false</value>
</property>
<property>
<name>tez.lib.uris</name>
<value>/opt/tez/tez.tar.gz</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
<property>
<name>metastore.warehouse.dir</name>
<value>/opt/${HIVE_PREFIX}/data/warehouse</value>
</property>
<property>
<name>metastore.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
EOF
)

cat > /tmp/hive-site-for-metastore.xml << EOF
<configuration>
${HIVE_SITE_CONFIG_COMMON}
</configuration>
EOF

cat > /tmp/hive-site-for-hiveserver2.xml << EOF
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://${HIVE_METASTORE_CONTAINER_NAME}:9083</value>
</property>
${HIVE_SITE_CONFIG_COMMON}
</configuration>
EOF

# Copy hadoop config file to hive container
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml /tmp/core-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml /tmp/hdfs-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml /tmp/yarn-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml /tmp/mapred-site.xml

# Use customized entrypoint
cat > /tmp/updated_entrypoint.sh << 'EOF'
#!/bin/bash

echo "IS_RESUME=${IS_RESUME}"
FLAG_FILE=/opt/hive/already_init_schema

if [ -z "${IS_RESUME}" ] || [ "${IS_RESUME}" = "false" ]; then
if [ -f ${FLAG_FILE} ]; then
echo "Skip init schema when restart."
IS_RESUME=true /entrypoint.sh
else
echo "Try to init schema for the first time."
touch ${FLAG_FILE}
IS_RESUME=false /entrypoint.sh
fi
else
echo "Skip init schema for every time."
IS_RESUME=true /entrypoint.sh
fi
EOF
chmod a+x /tmp/updated_entrypoint.sh

# Start standalone metastore
docker create --name ${HIVE_METASTORE_CONTAINER_NAME} --network ${SHARED_NS} -p 9083:9083 -e SERVICE_NAME=metastore --entrypoint /updated_entrypoint.sh apache/hive:4.0.0

docker cp /tmp/hive-site-for-metastore.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hive/conf/hive-site.xml
docker cp /tmp/core-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml
docker cp /tmp/hdfs-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml
docker cp /tmp/yarn-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml
docker cp /tmp/mapred-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml
docker cp /tmp/updated_entrypoint.sh ${HIVE_METASTORE_CONTAINER_NAME}:/updated_entrypoint.sh

docker start ${HIVE_METASTORE_CONTAINER_NAME}

# Start standalone hiveserver2
docker create --name ${HIVE_SERVER_CONTAINER_NAME} --network ${SHARED_NS} -p 10000:10000 -e SERVICE_NAME=hiveserver2 -e IS_RESUME=true apache/hive:4.0.0

docker cp /tmp/hive-site-for-hiveserver2.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hive/conf/hive-site.xml
docker cp /tmp/core-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml
docker cp /tmp/hdfs-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml
docker cp /tmp/yarn-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml
docker cp /tmp/mapred-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml

docker start ${HIVE_SERVER_CONTAINER_NAME}

Test:

docker exec -it ${HIVE_SERVER_CONTAINER_NAME} beeline -u 'jdbc:hive2://localhost:10000/' -e "
create table hive_example(a string, b int) partitioned by(c int);
alter table hive_example add partition(c=1);
insert into hive_example partition(c=1) values('a', 1), ('a', 2),('b',3);
select * from hive_example;
drop table hive_example;
"

4.2.2 Use External Postgres

SHARED_NS=hadoop-ns
POSTGRES_CONTAINER_NAME=postgres
POSTGRES_USER="hive_postgres"
POSTGRES_PASSWORD="Abcd1234"
POSTGRES_DB="hive-metastore"
HADOOP_CONTAINER_NAME=hadoop
HIVE_PREFIX=hive-with-postgres
HIVE_METASTORE_CONTAINER_NAME=${HIVE_PREFIX}-metastore
HIVE_SERVER_CONTAINER_NAME=${HIVE_PREFIX}-server
IS_RESUME="false"

# How to use sql:
# 1. docker exec -it ${POSTGRES_CONTAINER_NAME} bash
# 2. psql -U ${POSTGRES_USER} -d ${POSTGRES_DB}
docker run --name ${POSTGRES_CONTAINER_NAME} --network ${SHARED_NS} \
-e POSTGRES_USER="${POSTGRES_USER}" \
-e POSTGRES_PASSWORD="${POSTGRES_PASSWORD}" \
-e POSTGRES_DB="${POSTGRES_DB}" \
-d postgres:17.0

# Download tez resources and put to hdfs
if [ ! -e /tmp/apache-tez-0.10.3-bin.tar.gz ]; then
wget -O /tmp/apache-tez-0.10.3-bin.tar.gz https://downloads.apache.org/tez/0.10.3/apache-tez-0.10.3-bin.tar.gz
fi
docker exec ${HADOOP_CONTAINER_NAME} bash -c 'mkdir -p /opt/tez'
docker cp /tmp/apache-tez-0.10.3-bin.tar.gz ${HADOOP_CONTAINER_NAME}:/opt/tez
docker exec ${HADOOP_CONTAINER_NAME} bash -c '
if ! hdfs dfs -ls /opt/tez/tez.tar.gz > /dev/null 2>&1; then
rm -rf /opt/tez/apache-tez-0.10.3-bin
tar -zxf /opt/tez/apache-tez-0.10.3-bin.tar.gz -C /opt/tez
hdfs dfs -mkdir -p /opt/tez
hdfs dfs -put -f /opt/tez/apache-tez-0.10.3-bin/share/tez.tar.gz /opt/tez
fi
'

HIVE_SITE_CONFIG_COMMON=$(cat << EOF
<property>
<name>hive.server2.enable.doAs</name>
<value>false</value>
</property>
<property>
<name>hive.tez.exec.inplace.progress</name>
<value>false</value>
</property>
<property>
<name>hive.exec.scratchdir</name>
<value>/opt/${HIVE_PREFIX}/scratch_dir</value>
</property>
<property>
<name>hive.user.install.directory</name>
<value>/opt/${HIVE_PREFIX}/install_dir</value>
</property>
<property>
<name>tez.runtime.optimize.local.fetch</name>
<value>true</value>
</property>
<property>
<name>hive.exec.submit.local.task.via.child</name>
<value>false</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>tez.local.mode</name>
<value>false</value>
</property>
<property>
<name>tez.lib.uris</name>
<value>/opt/tez/tez.tar.gz</value>
</property>
<property>
<name>hive.execution.engine</name>
<value>tez</value>
</property>
<property>
<name>metastore.warehouse.dir</name>
<value>/opt/${HIVE_PREFIX}/data/warehouse</value>
</property>
<property>
<name>metastore.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
EOF
)

cat > /tmp/hive-site-for-metastore.xml << EOF
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:postgresql://${POSTGRES_CONTAINER_NAME}/${POSTGRES_DB}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.postgresql.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>${POSTGRES_USER}</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>${POSTGRES_PASSWORD}</value>
</property>
${HIVE_SITE_CONFIG_COMMON}
</configuration>
EOF

cat > /tmp/hive-site-for-hiveserver2.xml << EOF
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://${HIVE_METASTORE_CONTAINER_NAME}:9083</value>
</property>
${HIVE_SITE_CONFIG_COMMON}
</configuration>
EOF

# Copy hadoop config file to hive container
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml /tmp/core-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml /tmp/hdfs-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml /tmp/yarn-site.xml
docker cp ${HADOOP_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml /tmp/mapred-site.xml

# Prepare jdbc driver
if [ ! -e /tmp/postgresql-42.7.4.jar ]; then
wget -O /tmp/postgresql-42.7.4.jar https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
fi

# Use customized entrypoint
cat > /tmp/updated_entrypoint.sh << 'EOF'
#!/bin/bash

echo "IS_RESUME=${IS_RESUME}"
FLAG_FILE=/opt/hive/already_init_schema

if [ -z "${IS_RESUME}" ] || [ "${IS_RESUME}" = "false" ]; then
if [ -f ${FLAG_FILE} ]; then
echo "Skip init schema when restart."
IS_RESUME=true /entrypoint.sh
else
echo "Try to init schema for the first time."
touch ${FLAG_FILE}
IS_RESUME=false /entrypoint.sh
fi
else
echo "Skip init schema for every time."
IS_RESUME=true /entrypoint.sh
fi
EOF
chmod a+x /tmp/updated_entrypoint.sh

# Start standalone metastore
docker create --name ${HIVE_METASTORE_CONTAINER_NAME} --network ${SHARED_NS} -p 9083:9083 -e SERVICE_NAME=metastore -e DB_DRIVER=postgres -e IS_RESUME=${IS_RESUME} --entrypoint /updated_entrypoint.sh apache/hive:4.0.0

docker cp /tmp/hive-site-for-metastore.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hive/conf/hive-site.xml
docker cp /tmp/core-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml
docker cp /tmp/hdfs-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml
docker cp /tmp/yarn-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml
docker cp /tmp/mapred-site.xml ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml
docker cp /tmp/updated_entrypoint.sh ${HIVE_METASTORE_CONTAINER_NAME}:/updated_entrypoint.sh
docker cp /tmp/postgresql-42.7.4.jar ${HIVE_METASTORE_CONTAINER_NAME}:/opt/hive/lib/postgresql-42.7.4.jar

docker start ${HIVE_METASTORE_CONTAINER_NAME}

# Start standalone hiveserver2
docker create --name ${HIVE_SERVER_CONTAINER_NAME} --network ${SHARED_NS} -p 10000:10000 -e SERVICE_NAME=hiveserver2 -e DB_DRIVER=postgres -e IS_RESUME=true apache/hive:4.0.0

docker cp /tmp/hive-site-for-hiveserver2.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hive/conf/hive-site.xml
docker cp /tmp/core-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/core-site.xml
docker cp /tmp/hdfs-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/hdfs-site.xml
docker cp /tmp/yarn-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/yarn-site.xml
docker cp /tmp/mapred-site.xml ${HIVE_SERVER_CONTAINER_NAME}:/opt/hadoop/etc/hadoop/mapred-site.xml
docker cp /tmp/postgresql-42.7.4.jar ${HIVE_SERVER_CONTAINER_NAME}:/opt/hive/lib/postgresql-42.7.4.jar

docker start ${HIVE_SERVER_CONTAINER_NAME}

Test:

docker exec -it ${HIVE_SERVER_CONTAINER_NAME} beeline -u 'jdbc:hive2://localhost:10000/' -e "
create table hive_example(a string, b int) partitioned by(c int);
alter table hive_example add partition(c=1);
insert into hive_example partition(c=1) values('a', 1), ('a', 2),('b',3);
select * from hive_example;
drop table hive_example;
"

4.3 Hive Metastore Demo

hive_metastore.thrift

mkdir hive_metastore_demo
cd hive_metastore_demo

wget https://raw.githubusercontent.com/apache/hive/master/standalone-metastore/metastore-common/src/main/thrift/hive_metastore.thrift

Modify hive_metastore.thrift to remove the fb303 parts:

25,26d24
< include "share/fb303/if/fb303.thrift"
<
2527c2525
< service ThriftHiveMetastore extends fb303.FacebookService
---
> service ThriftHiveMetastore
thrift --gen cpp hive_metastore.thrift

mkdir build

# create libhms.a
gcc -o build/ThriftHiveMetastore.o -c gen-cpp/ThriftHiveMetastore.cpp -O3 -Wall -fPIC
gcc -o build/hive_metastore_constants.o -c gen-cpp/hive_metastore_constants.cpp -O3 -Wall -fPIC
gcc -o build/hive_metastore_types.o -c gen-cpp/hive_metastore_types.cpp -O3 -Wall -fPIC
ar rcs build/libhms.a build/ThriftHiveMetastore.o build/hive_metastore_constants.o build/hive_metastore_types.o

# main.cpp
cat > main.cpp << 'EOF'
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/transport/TSocket.h>
#include <thrift/transport/TTransportUtils.h>

#include <iostream>

#include "gen-cpp/ThriftHiveMetastore.h"

using namespace apache::thrift;
using namespace apache::thrift::transport;
using namespace apache::thrift::protocol;

int main(int argc, char* argv[]) {
if (argc < 5) {
std::cerr << "requires 4 arguments: <hms_ip> <hms_port> <db_name> <table_name>" << std::endl;
return 1;
}
const std::string hms_ip = argv[1];
const int hms_port = std::atoi(argv[2]);
const std::string db_name = argv[3];
const std::string table_name = argv[4];

std::cout << "hms_ip: " << hms_ip << ", hms_port: " << hms_port << ", db_name: " << db_name
<< ", table_name: " << table_name << std::endl;

std::shared_ptr<TTransport> socket(new TSocket(hms_ip, hms_port));
std::shared_ptr<TTransport> transport(new TBufferedTransport(socket));
std::shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
Apache::Hadoop::Hive::ThriftHiveMetastoreClient client(protocol);

try {
transport->open();

// Fetch and print the list of databases
std::vector<std::string> databases;
client.get_all_databases(databases);
std::cout << "Databases:" << std::endl;
for (const auto& db : databases) {
std::cout << " " << db << std::endl;
}

// Fetch and print the list of tables in a specific database
std::vector<std::string> tables;
client.get_all_tables(tables, db_name);
std::cout << "Tables in database '" << db_name << "':" << std::endl;
for (const auto& table : tables) {
std::cout << " " << table << std::endl;
}

// Fetch and print the details of a specific table
Apache::Hadoop::Hive::Table table;
client.get_table(table, db_name, table_name);
std::cout << "Table details for '" << table_name << "':" << std::endl;
std::cout << " Table name: " << table.tableName << std::endl;
std::cout << " Database name: " << table.dbName << std::endl;
std::cout << " Owner: " << table.owner << std::endl;
std::cout << " Create time: " << table.createTime << std::endl;
std::cout << " Location: " << table.sd.location << std::endl;

transport->close();
} catch (TException& tx) {
std::cerr << "Exception occurred: " << tx.what() << std::endl;
}

return 0;
}
EOF

gcc -o build/main main.cpp -O3 -Lbuild -lhms -lstdc++ -std=gnu++17 -lthrift -lm
build/main <ip> 9083 default hive_test_table

4.4 Syntax

4.4.1 Partition

CREATE TABLE sales (
sale_id INT,
product STRING,
amount DOUBLE
)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;

INSERT INTO TABLE sales PARTITION (year=2023, month=6)
VALUES
(1, 'Product A', 100.0),
(2, 'Product B', 150.0),
(3, 'Product C', 200.0);

INSERT INTO TABLE sales PARTITION (year=2023, month=7)
VALUES
(4, 'Product D', 120.0),
(5, 'Product E', 130.0),
(6, 'Product F', 140.0);

SELECT * FROM sales;

5 Flink

Here’s an example of how to use docker-flink to start a Flink cluster and run some tests.

git clone https://github.com/big-data-europe/docker-flink.git
cd docker-flink
docker-compose up -d

cat > person.csv << 'EOF'
1,"Tom",18
2,"Jerry",19
3,"Spike",20
4,"Tyke",21
EOF

docker exec flink-master mkdir -p /opt/data
docker exec flink-worker mkdir -p /opt/data
docker cp person.csv flink-master:/opt/data/person.csv
docker cp person.csv flink-worker:/opt/data/person.csv

docker exec -it flink-master bash
sql-client.sh

CREATE TABLE person (
id INT,
name STRING,
age INT
) WITH (
'connector' = 'filesystem',
'path' = 'file:///opt/data/person.csv',
'format' = 'csv'
);

SELECT * FROM person;

6 Docker-Compose