
Setting Up a CDH 6 Big Data Platform

This walkthrough demonstrates building a CDH 6.1.1 big data platform on virtual machines running CentOS 7.


Environment Preparation

The CentOS 7 installation itself is not covered here. The environment is as follows:

Item Description
Operating system CentOS 7.6.1810
Number of servers 3
Hostnames devcdh1.cdh.com; devcdh2.cdh.com; devcdh3.cdh.com
IP addresses 192.168.153.200; 192.168.153.201; 192.168.153.202
JDK 1.8
MySQL 5.7.30

Install Common Utilities

yum -y install wget iptables-services telnet net-tools git curl unzip sysstat lsof ntpdate lrzsz vim

Set Up Time Synchronization

Install the NTP (Network Time Protocol) service and enable it:

yum -y install ntp
systemctl start ntpd
systemctl enable ntpd
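By default ntpd syncs against the public pool servers defined in /etc/ntp.conf. A common refinement for an isolated cluster — shown here only as a sketch of my own, not part of the original steps — is to have devcdh2 and devcdh3 sync from devcdh1 once all three VMs exist, so the nodes agree on time even without Internet access:

# On devcdh2/devcdh3 (assumed layout): comment out the default servers and point at devcdh1
sed -i 's/^server /#server /' /etc/ntp.conf
echo "server devcdh1.cdh.com iburst" >> /etc/ntp.conf
systemctl restart ntpd
ntpq -p    # verify that devcdh1 shows up as a peer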

Disable the Firewall

To simplify installation and use, stop the firewall and disable it at boot:

systemctl stop iptables
systemctl disable iptables
systemctl stop firewalld
systemctl disable firewalld

Install the JDK

  1. Download the JDK RPM (a multi-threaded download tool such as IDM is recommended) and install it:
rpm -ivh oracle-j2sdk1.8-1.8.0+update181-1.x86_64.rpm
  2. Add the environment variables:
# Use single quotes so the variables are written literally and expanded at login time
echo 'export JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera' >> /etc/profile
echo 'export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib' >> /etc/profile
echo 'export PATH=$PATH:$JAVA_HOME/bin' >> /etc/profile

source /etc/profile
  3. Verify that the JDK was installed successfully:
# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

Configure hosts

echo "192.168.153.200 devcdh1 devcdh1.cdh.com" >> /etc/hosts
echo "192.168.153.201 devcdh2 devcdh2.cdh.com" >> /etc/hosts
echo "192.168.153.202 devcdh3 devcdh3.cdh.com" >> /etc/hosts

Configure the hosts file on the host machine (the machine running the VMs) in the same way.
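Once all three VMs are up, a quick sanity check of my own to confirm name resolution works on every node:

# Each FQDN should resolve to the address configured above
ping -c 1 devcdh1.cdh.com
ping -c 1 devcdh2.cdh.com
ping -c 1 devcdh3.cdh.com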

Install the MySQL JDBC Driver

Install mysql-connector-java:

# Download the driver
wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.46.tar.gz

# Extract it
tar -zxvf mysql-connector-java-5.1.46.tar.gz

# Create the target directory
mkdir /usr/share/java/

# Move the jar into place under the name Cloudera expects
mv mysql-connector-java-5.1.46/mysql-connector-java-5.1.46.jar /usr/share/java/mysql-connector-java.jar

Clone the Virtual Machine

Make two copies of this VM, then change the IP address and hostname on each copy:

# Change the hostname
# vi /etc/hostname
- devcdh1.cdh.com
+ devcdh2.cdh.com

# Change the IP address
# vi /etc/sysconfig/network-scripts/ifcfg-ens33
TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="none"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
IPV6INIT="yes"
IPV6_AUTOCONF="yes"
IPV6_DEFROUTE="yes"
IPV6_FAILURE_FATAL="no"
IPV6_ADDR_GEN_MODE="stable-privacy"
NAME="ens33"
UUID="0d3a0253-024f-4033-97e0-06309768cc9d"
DEVICE="ens33"
ONBOOT="yes"
- IPADDR="192.168.153.200"
+ IPADDR="192.168.153.201"
PREFIX="24"
GATEWAY="192.168.153.2"
DNS1="8.8.8.8"
IPV6_PRIVACY="no"

# Restart the network service
# service network restart

# Reboot the server
# reboot

Install MySQL

Install MySQL on devcdh1:

# Add the MySQL 5.7 yum repository (alternatively, download the release RPM first and
# install it with: yum -y localinstall mysql57-community-release-el7-11.noarch.rpm)
rpm -ivh https://dev.mysql.com/get/mysql57-community-release-el7-11.noarch.rpm

# List the enabled repositories to confirm the MySQL repos were added
# yum repolist enabled | grep "mysql.*-community.*"
!mysql-connectors-community/x86_64 MySQL Connectors Community 153
!mysql-tools-community/x86_64 MySQL Tools Community 110
!mysql57-community/x86_64 MySQL 5.7 Community Server 424

# Install MySQL
yum -y install mysql-community-server

# Start MySQL
systemctl start mysqld

# Look up the generated root password
grep 'temporary password' /var/log/mysqld.log
2020-05-27T09:13:13.177687Z 1 [Note] A temporary password is generated for root@localhost: ayCk+>Otq7YQ

# Run the secure-installation script: set the root password, remove anonymous users,
# disallow remote root login, and so on
mysql_secure_installation
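Two small follow-ups of my own, not part of the original steps: make MySQL start at boot and confirm that the new root password works:

# Enable MySQL at boot and verify the login
systemctl enable mysqld
mysql -uroot -p -e "SELECT VERSION();"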

Create the Databases

Depending on which services you plan to install, create the corresponding databases and database users according to the table below; the databases must use utf8 encoding. Record each user name and its password as you go:

Service Database User
Cloudera Manager Server scm scm
Activity Monitor amon amon
Reports Manager rman rman
Hue hue hue
Hive Metastore Server metastore hive
Sentry Server sentry sentry
Cloudera Navigator Audit Server nav nav
Cloudera Navigator Metadata Server navms navms
Oozie oozie oozie
CREATE DATABASE scm DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON scm.* TO 'scm'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE amon DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON amon.* TO 'amon'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE rman DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON rman.* TO 'rman'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE hue DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON hue.* TO 'hue'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE metastore DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON metastore.* TO 'hive'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE sentry DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON sentry.* TO 'sentry'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE nav DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON nav.* TO 'nav'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE navms DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON navms.* TO 'navms'@'%' IDENTIFIED BY '2020sannaha.MOE';
CREATE DATABASE oozie DEFAULT CHARACTER SET utf8 DEFAULT COLLATE utf8_general_ci;
GRANT ALL ON oozie.* TO 'oozie'@'%' IDENTIFIED BY '2020sannaha.MOE';

flush privileges;
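To confirm the grants took effect, you can log in as one of the new users — a quick check of my own, using the password chosen above:

# The scm user should be able to see its own database
mysql -uscm -p'2020sannaha.MOE' -e "SHOW DATABASES;"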

Cloudera Manager

Overview

Cloudera Manager is the application for managing CDH clusters and provides a polished web UI. With Cloudera Manager you can easily deploy and centrally operate the complete CDH stack and other managed services. The application can:

  • automate the installation process, shortening deployment time;
  • give a real-time view of the hosts and services running in the cluster;
  • provide a single central console for making configuration changes across the cluster;
  • incorporate a range of reporting and diagnostic tools to help you optimize performance and utilization.

At the core of Cloudera Manager is the Cloudera Manager Server, which hosts the admin console web server and the application logic, and is responsible for installing software, configuring, starting, and stopping services, and managing the cluster on which the services run.

cm_arch

The Cloudera Manager Server works with several other components:

  • Agent: installed on every host. It is responsible for starting and stopping processes, unpacking configurations, triggering installations, and monitoring the host.
  • Management Service: a service made up of a set of roles that perform various monitoring, alerting, and reporting functions.
  • Database: stores configuration and monitoring information. Typically, multiple logical databases run across one or more database servers; for example, the Cloudera Manager Server and the monitoring roles use different logical databases.
  • Cloudera Repository: a repository of software for distribution by Cloudera Manager.
  • Client: the interfaces for interacting with the server:
    • Admin Console: the web-based UI that administrators use to manage clusters and Cloudera Manager.
    • API: the API that developers use to build custom Cloudera Manager applications.

Cloudera Management Service

The Cloudera Management Service implements various management features as a set of roles:

  • Activity Monitor: collects information about activities run by services
  • Host Monitor: collects health and metric information about hosts
  • Service Monitor: collects health and metric information about services
  • Event Server: aggregates component events and makes them available for alerting and searching
  • Alert Publisher: generates and delivers alerts for certain types of events
  • Reports Manager: generates report charts, providing historical views of disk usage by user and user group, disk and I/O activity, and so on

Heartbeating

By default, the Agent sends a heartbeat to the Cloudera Manager Server every 15 seconds. To reduce user-perceived latency, however, the frequency is increased while state is changing.

State Management

  • Model state captures what processes should run where and with what configuration.
  • Runtime state captures which processes are running where and what commands are currently being executed (for example, rebalancing HDFS, running a backup/disaster-recovery schedule, a rolling upgrade, or a stop).
  • When you update a configuration (for example, the Hue Server web port), you are updating the model state. However, if Hue is running while you make the change, it still uses the old port. When this mismatch occurs, the role is marked as having an "outdated configuration". To resynchronize, restart the role, which triggers the configuration to be regenerated and the process to be restarted.
  • As a special case, properties that the Cloudera Manager console does not expose can be embedded through the "Advanced Configuration Snippet" fields.

Server and Client Configuration

  • With HDFS, for example, the file /etc/hadoop/conf/hdfs-site.xml contains only configuration relevant to HDFS clients.
  • HDFS role instances (for example, the NameNode and DataNodes) instead obtain their configuration from a per-process directory under /var/run/cloudera-scm-agent/process/unique-process-name.

Process Management

  • In a cluster managed by Cloudera Manager, services can only be started and stopped through Cloudera Manager. Cloudera Manager uses an open-source process management tool called supervisord, which redirects log files, notifies of process failures, sets the effective user ID of the calling process to the appropriate user, and so on.
  • Cloudera Manager supports automatically restarting a crashed process. It also flags a role instance with a bad state if the instance fails repeatedly right after starting.
  • Note in particular that stopping Cloudera Manager and the Cloudera Manager Agents does not stop the cluster; all running instances keep running.
  • One of the Agent's main responsibilities is starting and stopping processes. When the Agent detects a new process from a heartbeat, it creates a directory for it under /var/run/cloudera-scm-agent and unpacks the configuration there.
  • The Agent itself is monitored as part of Cloudera Manager's host monitoring: if the Agent stops sending heartbeats, the host is marked as having bad health.

Host Management

  • Cloudera Manager automatically deploys to the hosts all of the software they need to participate as managed hosts in the cluster: JDK, Cloudera Manager Agent, CDH, Impala, Solr, and so on.
  • Cloudera Manager provides operations for managing the lifecycle of participating hosts and for adding and removing hosts.
  • The Cloudera Management Service Host Monitor role performs health checks and collects host metrics, allowing you to monitor the health and performance of the hosts.

Security

Authentication

  • The purpose of authentication in Hadoop is simply to prove that a user or service really is who he, she, or it claims to be. Typically, authentication in an enterprise is managed through a single distributed system, such as an LDAP (Lightweight Directory Access Protocol) directory. LDAP authentication consists of straightforward username/password services backed by a variety of storage systems.
  • Many components of the Hadoop ecosystem converge on using Kerberos for authentication, with options for managing and storing credentials in LDAP or Active Directory.

Authorization
CDH currently provides the following forms of access control:

  • Traditional POSIX-style permissions for directories and files.
  • Extended Access Control Lists (ACLs) for HDFS.
  • Apache HBase uses ACLs to authorize various operations (READ, WRITE, CREATE, ADMIN) by column, column family, and column family qualifier.
  • Role-based access control with Apache Sentry.

Encryption
Requires the enterprise edition of Cloudera (a Cloudera Navigator license).

Download

Download Cloudera Manager: https://archive.cloudera.com/cm6/6.1.1/redhat7/yum/RPMS/x86_64/

├── cloudera-manager-agent-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-daemons-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-server-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-server-db-2-6.1.1-853290.el7.x86_64.rpm
└── enterprise-debuginfo-6.1.1-853290.el7.x86_64.rpm

Also download the allkeys.asc file to avoid an error when installing the Agents: https://archive.cloudera.com/cm6/6.1.1/allkeys.asc

Download the Parcel (choose el6 or el7 to match your operating system): https://archive.cloudera.com/cdh6/6.1.1/parcels/

├── CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel
└── manifest.json

A multi-threaded download tool such as IDM is recommended.
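If you prefer to stay on the command line, the same files can be fetched with wget — a sketch of my own using the base URLs and file names listed above (add the server-db-2 and debuginfo packages if you need them):

BASE=https://archive.cloudera.com/cm6/6.1.1
wget ${BASE}/redhat7/yum/RPMS/x86_64/cloudera-manager-daemons-6.1.1-853290.el7.x86_64.rpm
wget ${BASE}/redhat7/yum/RPMS/x86_64/cloudera-manager-agent-6.1.1-853290.el7.x86_64.rpm
wget ${BASE}/redhat7/yum/RPMS/x86_64/cloudera-manager-server-6.1.1-853290.el7.x86_64.rpm
wget ${BASE}/allkeys.asc
wget https://archive.cloudera.com/cdh6/6.1.1/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel
wget https://archive.cloudera.com/cdh6/6.1.1/parcels/manifest.json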

Build a Local yum Repository

First install httpd and createrepo:

yum -y install httpd createrepo

Start the httpd service and enable it at boot:

systemctl start httpd
systemctl enable httpd

Move the RPM packages and the allkeys.asc file into /var/www/html/cloudera-repos/; the directory tree then looks like this:

# tree /var/www/html/cloudera-repos/
/var/www/html/cloudera-repos/
├── allkeys.asc
├── cloudera-manager-agent-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-daemons-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-server-6.1.1-853290.el7.x86_64.rpm
├── cloudera-manager-server-db-2-6.1.1-853290.el7.x86_64.rpm
└── enterprise-debuginfo-6.1.1-853290.el7.x86_64.rpm

Generate the RPM metadata (run inside /var/www/html/cloudera-repos/):

createrepo .

Create cloudera-repos.repo:

# vim /etc/yum.repos.d/cloudera-repos.repo
[cloudera-repos]
name=cloudera-repos
baseurl=http://devcdh1.cdh.com/cloudera-repos/
gpgcheck=0
enabled=1
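The Agent nodes need this repo definition as well. One way — my own addition, assuming root SSH access between the nodes — is to copy the file to devcdh2 and devcdh3:

scp /etc/yum.repos.d/cloudera-repos.repo devcdh2.cdh.com:/etc/yum.repos.d/
scp /etc/yum.repos.d/cloudera-repos.repo devcdh3.cdh.com:/etc/yum.repos.d/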

Clear and rebuild the yum metadata cache:

yum clean all && yum makecache

Visit http://devcdh1.cdh.com/cloudera-repos/ to check that the repository is being served.
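The same check can be done from the command line (my own shortcut):

# Should return HTTP/1.1 200 OK
curl -I http://devcdh1.cdh.com/cloudera-repos/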

If the page shows a permission-denied error:

(Screenshot: local yum repository access denied)

Check whether SELinux is enabled; disabling it and rebooting the server resolves the issue:

# vim /etc/selinux/config

# This file controls the state of SELinux on the system.
# SELINUX= can take one of these three values:
# enforcing - SELinux security policy is enforced.
# permissive - SELinux prints warnings instead of enforcing.
# disabled - No SELinux policy is loaded.
- SELINUX=enforcing
+ SELINUX=disabled
# SELINUXTYPE= can take one of three values:
# targeted - Targeted processes are protected,
# minimum - Modification of targeted policy. Only selected processes are protected.
# mls - Multi Level Security protection.
SELINUXTYPE=targeted
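If you would rather not reboot right away, SELinux can also be switched to permissive mode for the current boot; the config change above still takes care of later reboots:

setenforce 0
getenforce    # should now report Permissive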

Install Cloudera Manager

Host Roles
devcdh1 cloudera-manager-daemons; cloudera-manager-agent; cloudera-manager-server
devcdh2 cloudera-manager-daemons; cloudera-manager-agent
devcdh3 cloudera-manager-daemons; cloudera-manager-agent

Install all three packages on devcdh1; on devcdh2 and devcdh3, install only the daemons and agent:

# Run on devcdh1
yum -y install cloudera-manager-daemons cloudera-manager-agent cloudera-manager-server

# Run on devcdh2 and devcdh3
yum -y install cloudera-manager-daemons cloudera-manager-agent

Configure the Local Parcel Repository

After the Cloudera Manager Server is installed, move the Parcel files into the default parcel repository directory /opt/cloudera/parcel-repo/; the directory tree then looks like this:

# tree /opt/cloudera/parcel-repo
/opt/cloudera/parcel-repo
├── CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel
├── CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha
└── manifest.json

Generate the .sha file (run inside /opt/cloudera/parcel-repo/):

sha1sum CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel | awk '{ print $1 }' > CDH-6.1.1-1.cdh6.1.1.p0.875250-el7.parcel.sha

Change the file ownership:

chown -R cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/*

Set Up the Cloudera Manager Database

Run the database setup script:

# /opt/cloudera/cm/schema/scm_prepare_database.sh mysql scm scm
Enter SCM password:
JAVA_HOME=/usr/java/jdk1.8.0_181-cloudera
Verifying that we can write to /etc/cloudera-scm-server
Creating SCM configuration file in /etc/cloudera-scm-server
Executing: /usr/java/jdk1.8.0_181-cloudera/bin/java -cp /usr/share/java/mysql-connector-java.jar:/usr/share/java/oracle-connector-java.jar:/usr/share/java/postgresql-connector-java.jar:/opt/cloudera/cm/schema/../lib/* com.cloudera.enterprise.dbutil.DbCommandExecutor /etc/cloudera-scm-server/db.properties com.cloudera.cmf.db.
Thu May 28 03:17:56 CST 2020 WARN: Establishing SSL connection without server's identity verification is not recommended. According to MySQL 5.5.45+, 5.6.26+ and 5.7.6+ requirements SSL connection must be established by default if explicit option isn't set. For compliance with existing applications not using SSL the verifyServerCertificate property is set to 'false'. You need either to explicitly disable SSL by setting useSSL=false, or set useSSL=true and provide truststore for server certificate verification.
[ main] DbCommandExecutor INFO Successfully connected to database.
All done, your SCM database is configured correctly!

Start Cloudera Manager

Start the Cloudera Manager Server; when the log shows "Started Jetty server.", startup has succeeded:

systemctl start cloudera-scm-server && tail -f /var/log/cloudera-scm-server/cloudera-scm-server.log
...
2020-05-28 03:25:09,778 INFO WebServerImpl:com.cloudera.server.cmf.WebServerImpl: Started Jetty server.

Open http://devcdh1.cdh.com:7180 in a browser; the login page appears:

(Screenshot: login page)

The default username and password are both admin.

Install CDH

Log in to Cloudera Manager, accept the license agreement, choose the free edition, and then proceed with the cluster installation.

For reference, the components shipped in CDH 6.1.1:

Component Version Changes
Apache Avro 1.8.2 Changes
Apache Flume 1.8.0 Changes
Apache Hadoop 3.0.0 Changes
Apache HBase 2.1.1 Changes
HBase Indexer 1.5 Changes
Apache Hive 2.1.1 Changes
Hue 4.3.0 Changes
Apache Impala 3.1.0 Changes
Apache Kafka 2.0 Changes
Kite SDK 1.0.0
Apache Kudu 1.8.0 Changes
Apache Solr 7.4 Changes
Apache Oozie 5.0.0 Changes
Apache Parquet 1.9.0 Changes
Parquet-format 2.3.1 Changes
Apache Pig 0.17.0 Changes
Apache Sentry 2.1.0 Changes
Apache Spark 2.4 Changes
Apache Sqoop 1.4.7 Changes
Apache ZooKeeper 3.4.5 Changes

Cluster Installation

Enter the host names, click Search, and select the hosts:

(Screenshot 1-1: Cluster Installation - Specify Hosts)

Choose a custom repository and fill in the address of the local yum repository built earlier:

(Screenshot 1-2: Cluster Installation - Select Repository)

The JDK was already installed earlier, so leave this option unchecked:

(Screenshot 1-3: Cluster Installation - JDK install options)

Enter the password used for SSH login:

(Screenshot 1-4: Cluster Installation - Provide SSH credentials)

Wait for the installation to finish:

(Screenshot 1-5: Cluster Installation - Install Agents)

Cluster Setup

Choose which roles to install and which host each role runs on:

(Screenshot 2-1: Cluster Setup - Customize Role Assignments)

Depending on the roles installed, fill in the database names and user names created in the "Create the Databases" step:

(Screenshot 2-2: Cluster Setup - Database Setup)

Benchmarking

Hadoop ships several jars for benchmarking:

# Commonly used tests: TestDFSIO, mrbench, nnbench
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar
...
An example program must be given as the first argument.
Valid program names are:
DFSCIOTest: Distributed i/o benchmark of libhdfs.
MRReliabilityTest: A program that tests the reliability of the MR framework by injecting faults/failures
TestDFSIO: Distributed i/o benchmark.
fail: a job that always fails
gsleep: A sleep job whose mappers create 1MB buffer for every record.
loadgen: Generic map/reduce load generator
mapredtest: A map/reduce test check.
mrbench: A map/reduce benchmark that can create many small jobs
nnbench: A benchmark that stresses the namenode w/ MR.
nnbenchWithoutMR: A benchmark that stresses the namenode w/o MR.
sleep: A job that sleeps at each map and reduce task.
testbigmapoutput: A map/reduce program that works on a very big non-splittable file and does identity map/reduce
testfilesystem: A test for FileSystem read/write.
testmapredsort: A map/reduce program that validates the map-reduce framework’s sort.
testsequencefile: A test for flat files of binary key value pairs.
testsequencefileinputformat: A test for sequence file input format.
testtextinputformat: A test for text input format.
threadedmapbench: A map/reduce benchmark that compares the performance of maps with multiple spills over maps with 1 spill
timelineperformance: A job that launches mappers to test timeline service performance.

# Commonly used examples: randomwriter, sort, terasort
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

TestDFSIO

Test HDFS write performance:

# Write ten 128 MB files to the HDFS cluster
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 128MB
...
20/07/14 10:58:12 INFO fs.TestDFSIO: ----- TestDFSIO ----- : write
20/07/14 10:58:12 INFO fs.TestDFSIO: Date & time: Tue Jul 14 10:58:12 CST 2020
20/07/14 10:58:12 INFO fs.TestDFSIO: Number of files: 10
20/07/14 10:58:12 INFO fs.TestDFSIO: Total MBytes processed: 1280
20/07/14 10:58:12 INFO fs.TestDFSIO: Throughput mb/sec: 38.51
20/07/14 10:58:12 INFO fs.TestDFSIO: Average IO rate mb/sec: 40.69
20/07/14 10:58:12 INFO fs.TestDFSIO: IO rate std deviation: 11.09
20/07/14 10:58:12 INFO fs.TestDFSIO: Test exec time sec: 47.86
20/07/14 10:58:12 INFO fs.TestDFSIO:

Test HDFS read performance:

# Read ten 128 MB files from the HDFS cluster
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 128MB
...
20/07/14 11:02:05 INFO fs.TestDFSIO: ----- TestDFSIO ----- : read
20/07/14 11:02:05 INFO fs.TestDFSIO: Date & time: Tue Jul 14 11:02:05 CST 2020
20/07/14 11:02:05 INFO fs.TestDFSIO: Number of files: 10
20/07/14 11:02:05 INFO fs.TestDFSIO: Total MBytes processed: 1280
20/07/14 11:02:05 INFO fs.TestDFSIO: Throughput mb/sec: 1584.16
20/07/14 11:02:05 INFO fs.TestDFSIO: Average IO rate mb/sec: 1641.68
20/07/14 11:02:05 INFO fs.TestDFSIO: IO rate std deviation: 305.71
20/07/14 11:02:05 INFO fs.TestDFSIO: Test exec time sec: 23.92
20/07/14 11:02:05 INFO fs.TestDFSIO:

Delete the test data:

$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar TestDFSIO -clean
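The same jobclient tests jar also contains the mrbench and nnbench tests mentioned at the top of this section. A sketch with flag values of my own choosing (check the jar's usage output for the full option list):

# mrbench: run a series of small MapReduce jobs to measure job setup/teardown overhead
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar mrbench -numRuns 10

# nnbench: stress the NameNode with many small metadata operations
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-client-jobclient-3.0.0-cdh6.1.1-tests.jar nnbench -operation create_write -maps 4 -numberOfFiles 1000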

TeraSort

TeraSort sorts data and is used to benchmark MapReduce performance. It consists of three MapReduce programs:

  • TeraGen: generates the large dataset to be sorted; each row is 100 bytes.
  • TeraSort: reads the input and sorts it with MapReduce.
  • TeraValidate: validates the sorted output to make sure the keys are sorted within each file; if anything is wrong with the sorted output, this reducer's output reports it.
# Generate the dataset to sort. The argument 10737418 means 10737418 rows (100 bytes each), roughly 1 GB
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar teragen 10737418 /1g_gen

# Sort the data with MapReduce
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar terasort /1g_gen /1g_sorted

# Validate the sort results
$ yarn jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar teravalidate /1g_sorted /1g_validated

Usage

Hive

Using the Hive CLI:

$ hive
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hive-common-2.1.1-cdh6.1.1.jar!/hive-log4j2.properties Async: false

WARNING: Hive CLI is deprecated and migration to Beeline is recommended.

hive> show databases;
OK
default
myhive
Time taken: 0.935 seconds, Fetched: 2 row(s)

Using Beeline:

$ beeline
WARNING: Use "yarn jar" to launch YARN applications.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/log4j-slf4j-impl-2.8.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Beeline version 2.1.1-cdh6.1.1 by Apache Hive

beeline> !connect jdbc:hive2://devcdh1.cdh.com:10000
Connecting to jdbc:hive2://devcdh1.cdh.com:10000
Enter username for jdbc:hive2://devcdh1.cdh.com:10000: root
Enter password for jdbc:hive2://devcdh1.cdh.com:10000: ****
Connected to: Apache Hive (version 2.1.1-cdh6.1.1)
Driver: Hive JDBC (version 2.1.1-cdh6.1.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ

0: jdbc:hive2://devcdh1.cdh.com:10000> show databases;
INFO : Compiling command(queryId=hive_20200606183616_d1077ff0-fac9-4280-8e4c-86794c63a9ed): show databases
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:database_name, type:string, comment:from deserializer)], properties:null)
INFO : Completed compiling command(queryId=hive_20200606183616_d1077ff0-fac9-4280-8e4c-86794c63a9ed); Time taken: 0.754 seconds
INFO : Executing command(queryId=hive_20200606183616_d1077ff0-fac9-4280-8e4c-86794c63a9ed): show databases
INFO : Starting task [Stage-0:DDL] in serial mode
INFO : Completed executing command(queryId=hive_20200606183616_d1077ff0-fac9-4280-8e4c-86794c63a9ed); Time taken: 0.026 seconds
INFO : OK
+----------------+
| database_name |
+----------------+
| default |
| myhive |
+----------------+
2 rows selected (1.158 seconds)

Compression

Snappy compression

The CDH build of Hadoop already integrates snappy compression, so nothing extra needs to be downloaded:

# Check whether Hadoop supports snappy compression
$ hadoop checknative
20/07/11 22:51:10 INFO bzip2.Bzip2Factory: Successfully loaded & initialized native-bzip2 library system-native
20/07/11 22:51:10 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
Native library checking:
hadoop: true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libhadoop.so.1.0.0
zlib: true /lib64/libz.so.1
zstd : true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libzstd.so.1
snappy: true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libsnappy.so.1
lz4: true revision:10301
bzip2: true /lib64/libbz2.so.1
openssl: true /lib64/libcrypto.so
ISA-L: true /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/lib/hadoop/lib/native/libisal.so.2

Enable snappy compression in the Hive shell; this takes effect only for the current session:

-- Map-stage compression (not enabled in this test)
-- Compress Hive's intermediate MR files
set hive.exec.compress.intermediate=true;
-- Compress map output in MR
set mapreduce.map.output.compress=true;
-- Set the compression codec for the MR map stage:
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Reduce-stage compression
-- Compress the final MR output files
set mapreduce.output.fileoutputformat.compress=true;
-- Compression granularity for SequenceFile output; BLOCK is recommended, the other options are NONE and RECORD
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- Set the compression codec for the final MR output files
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- Compress Hive's final query result files
set hive.exec.compress.output=true;

-- Insert data, which runs an MR job
insert overwrite table test_table2 select * from test_table;

Inspect the table files; the .snappy suffix shows that snappy compression was applied:

$ hdfs dfs -ls /hivetable/test_table2
-rwxr-xr-x 3 root supergroup 18 2020-07-11 17:59 /hivetable/test_table2/000000_0.snappy

LZO compression

Unlike snappy, LZO-compressed files can also be split. CDH does not support LZO by default; to use it you need to download an additional Parcel.

  1. Add the Parcel repository:
https://archive.cloudera.com/gplextras6/6.1.1/parcels/

(Screenshot: Enable LZO - add the remote Parcel repository)

  2. Download, distribute, and activate:

(Screenshot: Enable LZO - download, distribute, activate)

  3. In core-site.xml, add the LZO codecs to io.compression.codecs:

(Screenshot: Enable LZO - add the compression codecs)

com.hadoop.compression.lzo.LzoCodec
com.hadoop.compression.lzo.LzopCodec
  4. In mapred-site.xml, append the GPL Extras library directory to mapreduce.application.classpath:
/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/*

(Screenshot: Enable LZO - add the MR application classpath)

After restarting the cluster, LZO compression can be used:

-- Compress the final MR output files
set mapreduce.output.fileoutputformat.compress=true;
-- Compression granularity for SequenceFile output; BLOCK is recommended, the other options are NONE and RECORD
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
-- Set the compression codec for the final MR output files
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;

-- Compress Hive's final query result files
set hive.exec.compress.output=true;

-- Insert data, which runs an MR job
insert overwrite table test_table2 select * from test_table;

Create an Index

The splittability of an LZO file depends on its index; you must create the index by hand, otherwise there will be only one split.

Create an index for a compressed file:

hadoop jar /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer bigfile.lzo

As test data, use the video table from the 快视频 project; after LZO compression the file is 187.65 MB and spans 2 blocks:

-- Enable LZO compression
set mapreduce.output.fileoutputformat.compress=true;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec;
set hive.exec.compress.output=true;

-- Create the video table
create table quickvideo_video_lzo(
videoId string,
uploader string,
age int,
category array<string>,
length int,
views int,
rate float,
rating int,
comment int,
relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&";

-- Insert the data
insert into table quickvideo_video_lzo select * from quickvideo_video_orc;

Before creating the index, run WordCount once; the number of splits is 1:

hadoop jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar wordcount /user/hive/warehouse/myhive.db/quickvideo_video_lzo /hdfsdata/wordcount/

(Screenshot: Enable LZO - before creating the index)

Create an index for the compressed file; an index file 000000_0.lzo.index is generated in the same directory:

$ hadoop jar /opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/myhive.db/quickvideo_video_lzo/000000_0.lzo

After creating the index, run WordCount again; the number of splits is now 2:

$ hadoop jar /opt/cloudera/parcels/CDH-6.1.1-1.cdh6.1.1.p0.875250/jars/hadoop-mapreduce-examples-3.0.0-cdh6.1.1.jar wordcount /user/hive/warehouse/myhive.db/quickvideo_video_lzo /hdfsdata/wordcount2/

(Screenshot: Enable LZO - after creating the index)

Directory Layout

When installed in the way Cloudera officially recommends, the directory layout is as follows:

$ tree /etc/hadoop/
/etc/hadoop/
├── conf -> /etc/alternatives/hadoop-conf
├── conf.cloudera.hdfs
│ ├── __cloudera_generation__
│ ├── __cloudera_metadata__
│ ├── core-site.xml
│ ├── hadoop-env.sh
│ ├── hdfs-site.xml
│ ├── log4j.properties
│ ├── ssl-client.xml
│ ├── topology.map
│ └── topology.py
└── conf.cloudera.yarn
├── __cloudera_generation__
├── __cloudera_metadata__
├── core-site.xml
├── hadoop-env.sh
├── hdfs-site.xml
├── log4j.properties
├── mapred-site.xml
├── ssl-client.xml
├── topology.map
├── topology.py
└── yarn-site.xml

3 directories, 20 files

$ ll /etc/alternatives/hadoop-conf
lrwxrwxrwx 1 root root 30 5月 28 20:54 /etc/alternatives/hadoop-conf -> /etc/hadoop/conf.cloudera.yarn

Notice that /etc/hadoop contains three directories, and conf ultimately resolves through symlinks to conf.cloudera.yarn.

CDH's configuration files live under /var/run/cloudera-scm-agent/process/, for example /var/run/cloudera-scm-agent/process/193-hdfs-NAMENODE/core-site.xml. These files are generated when Cloudera Manager starts the corresponding service (such as HDFS), and their contents come from the database, that is, from the parameters configured through the UI.

Changing a configuration in the CM UI is not reflected in the configuration files immediately: the change is stored in the database, and the files are only regenerated the next time the service is restarted. A fresh set of configuration files is produced on every start, and CM generates a separate configuration directory (and set of files) for each service process. All configurations are generated on the server side by querying the database (the scm database is only accessible on localhost); the Agent then downloads a zip containing the configuration files over the network and unpacks it into the designated directory.

Editing the client-side configuration files by hand therefore has no effect. Instead, make the change in the web UI through an "Advanced Configuration Snippet" (for example, extra hive-site.xml properties) and restart the service.
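To see this in action — an illustration of my own — list the per-process configuration directories the Agent generates; a new numbered directory appears each time a process is started:

# Each service process gets its own numbered directory containing a full set of config files
ls -lt /var/run/cloudera-scm-agent/process/ | head
ls /var/run/cloudera-scm-agent/process/*-hdfs-NAMENODE/ | head    # on the NameNode host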

ISSUES

Zookeeper

An abrupt power loss caused ZooKeeper to report "Unable to load database on disk".

(Screenshot: ZooKeeper IOException caused by abrupt power loss)

Fix:

  1. Locate the version-2 directory:
$ find / -name version-2
/var/lib/zookeeper/version-2
  2. Back it up, then clear the data in that directory:
$ mv version-2/ /data/

$ mkdir version-2
$ chown zookeeper:zookeeper version-2/
  3. Restart the ZooKeeper service.

References

Cloudera Manager Overview
大数据平台CDH搭建
CentOS7下完全离线安装CDH6集群