1 Overview
Below are the Hadoop components that together form the Hadoop ecosystem:
- HDFS -> Hadoop Distributed File System
- YARN -> Yet Another Resource Negotiator
- MapReduce -> Data processing using programming
- Spark -> In-memory Data Processing
- PIG -> Data Processing Services using Query (SQL-like)
- HBase -> NoSQL Database
- Mahout & Spark MLlib -> Machine Learning
- Drill -> SQL on Hadoop
- Zookeeper -> Managing Cluster
- Oozie -> Job Scheduling
- Flume -> Data Ingesting Services
- Solr & Lucene -> Searching & Indexing
- Ambari -> Provision, Monitor and Maintain cluster
2 HDFS
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.
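For example, the block-to-DataNode mapping maintained by the NameNode can be inspected for any file already stored in HDFS (the path below is only an illustration):

```bash
# Ask the NameNode which blocks make up a file and which DataNodes hold each replica
${HADOOP_HOME}/bin/hdfs fsck /user/data/sample.txt -files -blocks -locations
```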
2.1 Deployment
Prerequisite: JDK 8 (versions higher than 8 may encounter runtime problems).
A pseudo-distributed cluster means the whole cluster runs on a single machine.
Run the following script to set up the cluster:
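A minimal sketch of such a script, assuming fs.defaultFS (core-site.xml) and dfs.replication=1 (hdfs-site.xml) are already configured for a single node:

```bash
# Format the NameNode storage directory (first run only)
${HADOOP_HOME}/bin/hdfs namenode -format
# Start the NameNode and a single DataNode as background daemons
${HADOOP_HOME}/bin/hdfs --daemon start namenode
${HADOOP_HOME}/bin/hdfs --daemon start datanode
```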
Stop all:
```bash
${HADOOP_HOME}/bin/hdfs --daemon stop namenode
```
Test:
```bash
${HADOOP_HOME}/bin/hdfs dfs -ls /
```
3 Spark
3.1 Deployment
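A minimal sketch for a single-machine standalone deployment, assuming SPARK_HOME points at an unpacked Spark distribution (7077 is the default master port, matching the spark.master value used later for Hive on Spark):

```bash
# Start the standalone master (web UI on http://localhost:8080 by default)
${SPARK_HOME}/sbin/start-master.sh
# Attach one worker to the master (on Spark 2.x the script is start-slave.sh)
${SPARK_HOME}/sbin/start-worker.sh spark://localhost:7077
```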
Stop:
```bash
${SPARK_HOME}/sbin/stop-master.sh
```
Test:
```bash
${SPARK_HOME}/bin/spark-shell
```
4 Hive
4.1 Components
4.1.1 HiveServer2 (HS2)
HS2 supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
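For example, with HS2 running on its default port 10000, the bundled Beeline client can connect over JDBC (a sketch assuming no authentication has been configured):

```bash
# Start HiveServer2 in the background, then open a JDBC session with Beeline
${HIVE_HOME}/bin/hive --service hiveserver2 &
${HIVE_HOME}/bin/beeline -u jdbc:hive2://localhost:10000 -n $(whoami)
```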
4.1.2 Hive Metastore Server (HMS)
The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions in a relational database, and provides clients (including Hive, Impala and Spark) access to this information using the metastore service API. It has become a building block for data lakes that utilize the diverse world of open-source software, such as Apache Spark and Presto. In fact, a whole ecosystem of tools, open-source and otherwise, is built around the Hive Metastore.
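For example, HMS can also run as a standalone Thrift service that remote clients such as Spark or Presto connect to (a sketch; 9083 is the default metastore port):

```bash
# Run the metastore as its own service; remote clients then point
# hive.metastore.uris at thrift://<host>:9083
${HIVE_HOME}/bin/hive --service metastore &
```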
4.2 Deployment
Prerequisite: JDK 8 (versions higher than 8 may encounter runtime problems).
Step1: Installation
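A minimal sketch, assuming a Hive 3.1.3 binary release (version and mirror URL are only examples) and that HADOOP_HOME is already set:

```bash
# Download and unpack a Hive binary release, then point HIVE_HOME at it
wget https://archive.apache.org/dist/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar -zxvf apache-hive-3.1.3-bin.tar.gz
export HIVE_HOME=$(pwd)/apache-hive-3.1.3-bin
```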
A template for ${HIVE_HOME}/conf/hive-site.xml ships as ${HIVE_HOME}/conf/hive-default.xml.template.
Step2: Start Hive Metastore with Derby
- Dependencies are already installed with Hive.
- Edit ${HIVE_HOME}/conf/hive-site.xml as follows:
```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:;databaseName=metastore_db;create=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>APP</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>mine</value>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
</configuration>
```
- Init:
```bash
# You must set env 'HADOOP_HOME'
${HIVE_HOME}/bin/schematool -dbType derby -initSchema
```
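To confirm the initialization worked, schematool can report the schema version it finds (same HADOOP_HOME requirement as above):

```bash
${HIVE_HOME}/bin/schematool -dbType derby -info
```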
Step2: Or, start Hive Metastore with MySQL
- Start a MySQL server via Docker:
```bash
docker run -dit -p 13306:3306 -e MYSQL_ROOT_PASSWORD='Abcd1234' -v /path/xxx/mysql:/var/lib/mysql mysql:5.7.37 mysqld --lower_case_table_names=1
```
- Download the MySQL JDBC driver from MySQL Community Downloads (choose Platform Independent):
```bash
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-j-8.3.0.tar.gz
tar -zxvf mysql-connector-j-8.3.0.tar.gz
cp -f mysql-connector-j-8.3.0/mysql-connector-j-8.3.0.jar ${HIVE_HOME}/lib/
```
- Edit ${HIVE_HOME}/conf/hive-site.xml as follows:
```xml
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:13306/metastore_db?createDatabaseIfNotExist=true</value>
    <description>Metadata storage DB connection URL</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.cj.jdbc.Driver</value>
    <description>Driver class for the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username for accessing the DB</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Abcd1234</value>
    <description>Password for accessing the DB</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
  <property>
    <name>hive.exec.local.scratchdir</name>
    <value>/tmp/hive</value>
  </property>
</configuration>
```
- Init:
```bash
# You must set env 'HADOOP_HOME'
${HIVE_HOME}/bin/schematool -dbType mysql -initSchema
```
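To confirm the schema landed in MySQL, list the metastore tables (assumes a mysql client is available on the host; credentials match the container started above):

```bash
mysql -h 127.0.0.1 -P 13306 -u root -pAbcd1234 -e 'SHOW TABLES IN metastore_db;'
```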
Step3: Hive On MapReduce
Test:
- Text Format
```
cat > /tmp/employees.txt << 'EOF'
1,John Doe,12000,IT
2,Jane Doe,15000,HR
3,Jim Beam,9000,Marketing
4,Sarah Connor,18000,IT
5,Gordon Freeman,20000,R&D
EOF
${HADOOP_HOME}/bin/hdfs dfs -mkdir -p /user/hive/warehouse/test_db/
${HADOOP_HOME}/bin/hdfs dfs -put /tmp/employees.txt /user/hive/warehouse/test_db/
${HIVE_HOME}/bin/hive
hive> CREATE DATABASE IF NOT EXISTS test_db;
hive> USE test_db;
hive> CREATE TABLE IF NOT EXISTS employees (
        id INT,
        name STRING,
        salary INT,
        department STRING
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      STORED AS TEXTFILE;
hive> LOAD DATA INPATH '/user/hive/warehouse/test_db/employees.txt' INTO TABLE employees;
hive> SELECT * FROM employees;
hive> SELECT name, salary FROM employees WHERE department = 'IT';
hive> SELECT department, AVG(salary) AS average_salary FROM employees GROUP BY department;
```
- Orc Format
```
hive> CREATE TABLE employees_orc (
        id INT,
        name STRING,
        salary INT,
        department STRING
      ) STORED AS ORC;
hive> INSERT INTO employees_orc (id, name, salary, department) VALUES
        (1, 'Alice', 70000, 'Engineering'),
        (2, 'Bob', 60000, 'HR'),
        (3, 'Charlie', 80000, 'Finance'),
        (4, 'David', 75000, 'Engineering'),
        (5, 'Eve', 65000, 'Marketing');
hive> SELECT * FROM employees_orc;
```
- Parquet Format
```
hive> CREATE TABLE employees_parquet (
        id INT,
        name STRING,
        salary INT,
        department STRING
      ) STORED AS PARQUET;
hive> INSERT INTO employees_parquet (id, name, salary, department) VALUES
        (1, 'Alice', 70000, 'Engineering'),
        (2, 'Bob', 60000, 'HR'),
        (3, 'Charlie', 80000, 'Finance'),
        (4, 'David', 75000, 'Engineering'),
        (5, 'Eve', 65000, 'Marketing');
hive> SELECT * FROM employees_parquet;
```
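The managed tables created above are stored under hive.metastore.warehouse.dir (configured earlier as /user/hive/warehouse); the files Hive wrote can be listed directly from HDFS:

```bash
# Each database gets a <name>.db directory; managed tables live underneath it
${HADOOP_HOME}/bin/hdfs dfs -ls -R /user/hive/warehouse/
```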
Step3: Or, Hive On Spark
- Edit ${HIVE_HOME}/conf/hive-site.xml, adding the following properties:
```xml
<property>
  <name>hive.execution.engine</name>
  <value>spark</value>
</property>
<property>
  <name>spark.master</name>
  <value>spark://localhost:7077</value>
</property>
```
- Test
```
${HIVE_HOME}/bin/hive
hive> CREATE TABLE employees_parquet (
        id INT,
        name STRING,
        salary INT,
        department STRING
      ) STORED AS PARQUET;
hive> INSERT INTO employees_parquet (id, name, salary, department) VALUES
        (1, 'Alice', 70000, 'Engineering'),
        (2, 'Bob', 60000, 'HR'),
        (3, 'Charlie', 80000, 'Finance'),
        (4, 'David', 75000, 'Engineering'),
        (5, 'Eve', 65000, 'Marketing');
hive> SELECT * FROM employees_parquet;
```
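To confirm that queries now run on Spark rather than MapReduce, print the active engine and watch the application appear in the standalone master's web UI (http://localhost:8080 by default):

```bash
# Prints hive.execution.engine=spark when the setting above is in effect
${HIVE_HOME}/bin/hive -e 'SET hive.execution.engine;'
```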