Learning to set up Hadoop on Linux
Download the installation package:
The overseas mirror may require a proxy; otherwise, look for a domestic mirror.
cd /opt
wget https://dlcdn.apache.org/hadoop/common/hadoop-2.10.2/hadoop-2.10.2.tar.gz
tar -zxvf hadoop-2.10.2.tar.gz
mv hadoop-2.10.2 hadoop
Configure environment variables
# Add the following to /etc/profile
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Apply the environment variables
source /etc/profile
Test the installation by running this on the command line:
hadoop version
Running it here failed. As mentioned earlier, Hadoop is written in Java, so a Java environment has to be configured as well.
[root@win-local-17 ~]# hadoop version
Error: JAVA_HOME is not set and could not be found.
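For this step you can refer to: https://blog.csdn.net/qq_42402854/article/details/108164936
Note that if the JDK was originally installed from an rpm package, the java command may work on your host and you can still hit this error.
In that case the environment variables have to be configured manually:
# Find the directory java lives in, then add it to /etc/profile
# (JAVA_HOME is the parent of the bin directory, e.g. /usr when which prints /usr/bin/java)
which java
export JAVA_HOME=/usr/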
export PATH=$PATH:$JAVA_HOME/bin
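The lookup above can also be scripted. The following is a minimal Python sketch, not part of the original setup, that derives a JAVA_HOME candidate from the java binary found on PATH. It assumes the usual layout where the binary sits at <JAVA_HOME>/bin/java. On rpm-based installs /usr/bin/java is often a symlink, so the sketch resolves symlinks first. The function names are hypothetical helpers introduced for illustration.

```python
import os
import shutil
from typing import Optional


def java_home_from_binary(java_path: str) -> str:
    """Return the directory two levels above a .../bin/java path."""
    return os.path.dirname(os.path.dirname(java_path))


def detect_java_home() -> Optional[str]:
    """Resolve the `java` found on PATH through symlinks and derive a JAVA_HOME candidate."""
    java = shutil.which("java")
    if java is None:
        # No java on PATH: nothing to derive.
        return None
    return java_home_from_binary(os.path.realpath(java))


# The unresolved path used in the text: /usr/bin/java gives /usr.
print(java_home_from_binary("/usr/bin/java"))
# The symlink-resolved candidate on this machine (None if java is not installed).
print(detect_java_home())
```

If the resolved candidate differs from /usr, either value should work for Hadoop as long as $JAVA_HOME/bin/java exists and is executable.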
Then run the command again:
[root@win-local-17 ~]# hadoop version
Hadoop 2.10.2
Subversion Unknown -r 965fd380006fa78b2315668fbc7eb432e1d8200f
Compiled by ubuntu on 2022-05-24T22:35Z
Compiled with protoc 2.5.0
From source with checksum d3ab737f7788f05d467784f0a86573fe
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-2.10.2.jar
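At this point the Hadoop application itself is installed.
Next we deploy standalone mode and then the pseudo-distributed service. Since this is for learning, a pseudo-distributed deployment on a single host is enough. For a fully distributed deployment across several hosts, see the references at the end of this article.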
Standalone mode
Start with a simple example: counting how many times each word occurs.
# First create a directory
cd /opt/hadoop
mkdir input
cd ./input
Then create a file and write some simple data into it:
cat test
hadoop yarn
hadoop mapreduce
spark
spark
Then run a MapReduce program from /opt/hadoop and look at the result:
hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.2.jar wordcount input/ wcoutput
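The job prints a lot of output, and the host's CPU usage rises while it runs. The main thing is to check for obvious errors. When it finishes, a directory named wcoutput is created under /opt/hadoop.
You can see that it has counted the occurrences of each word in our test data.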
[root@win-local-17 hadoop]# ll wcoutput
总用量 4
-rw-r--r--. 1 root root 36 5月 8 18:05 part-r-00000
-rw-r--r--. 1 root root 0 5月 8 18:05 _SUCCESS
[root@win-local-17 hadoop]# cat wcoutput/part-r-00000
hadoop 2
mapreduce 1
spark 2
yarn 1
[root@win-local-17 hadoop]# pwd
/opt/hadoop
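To make the result easier to check, here is a small Python sketch of what the wordcount job computes for the test file. It is a plain in-memory count, not Hadoop code, and it assumes words are separated by whitespace. The function name word_count is a hypothetical helper.

```python
from collections import Counter


def word_count(text: str) -> dict:
    """Count whitespace-separated words and return them sorted by word."""
    counts = Counter(text.split())
    return dict(sorted(counts.items()))


# Same content as the test file created above.
sample = "hadoop yarn\nhadoop mapreduce\nspark\nspark\n"

for word, n in word_count(sample).items():
    print(f"{word}\t{n}")
```

For the four-line test file this gives hadoop 2, mapreduce 1, spark 2 and yarn 1, matching part-r-00000.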
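Next, let's analyze an nginx log. The test above shows that Hadoop can count simple words, so it can also analyze content in a given log format. For example, we can have it count the requests per client IP and per status code in an nginx log file.
We cannot simply run the same command as above, because the bundled wordcount example cannot do this on its own. We need to customize the MapReduce process, which is split into a map phase and a reduce phase. Hadoop Streaming allows any executable (such as a Python script) to be used as the mapper and reducer of a MapReduce job.
Here we write two Python scripts of our own to implement the map phase and the reduce phase.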
mapper.py
import sys
import re

# Regular expression used to parse an Nginx log line
NGINX_LOG_PATTERN = re.compile(r'^([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)')

for line in sys.stdin:
    line = line.strip()
    match = NGINX_LOG_PATTERN.match(line)
    if match:
        # Extract the IP address
        ip = match.group(1)
        # Extract the status code
        status_code = match.group(4)
        # Emit the IP address with a count of 1
        print(f"{ip}\t1")
        # Emit the status code with a count of 1
        print(f"{status_code}\t1")
reducer.py
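import sys

current_key = None
current_count = 0

# Input arrives sorted by key, so identical keys are on consecutive lines
for line in sys.stdin:
    line = line.strip()
    key, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_key == key:
        current_count += count
    else:
        # The key changed: emit the total for the previous key
        if current_key:
            print(f"{current_key}\t{current_count}")
        current_key = key
        current_count = count

# Emit the total for the last key
if current_key:
    print(f"{current_key}\t{current_count}")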
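mapper.py filters the incoming log lines, extracts the IP and the status code, and emits each as a key-value pair. reducer.py takes the key-value pairs emitted by the map phase and accumulates the values of identical keys to get the number of occurrences. The reducer relies on Hadoop sorting the mapper output by key, so that identical keys arrive consecutively.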
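The map, sort and reduce flow can be tried without Hadoop. The following Python sketch simulates the three steps in memory, using the same regular expression as mapper.py. It is only an illustration of the data flow under the assumption of a single reducer. The function names and the sample log lines are hypothetical and not taken from the real nginx.log.

```python
import re
from itertools import groupby

# Same pattern as mapper.py
NGINX_LOG_PATTERN = re.compile(r'^([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) (\d+)')


def map_lines(lines):
    """Map phase: emit (ip, 1) and (status, 1) for every line that parses."""
    for line in lines:
        match = NGINX_LOG_PATTERN.match(line.strip())
        if match:
            yield match.group(1), 1
            yield match.group(4), 1


def reduce_pairs(pairs):
    """Sort by key (what the framework does between phases), then sum per key."""
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)


def run_job(lines):
    """Run map, sort and reduce over an iterable of log lines."""
    return dict(reduce_pairs(map_lines(lines)))


# Hypothetical sample lines in the same format as nginx.log
sample = [
    '192.168.112.125 - - [03/Apr/2025:20:25:15 +0800] "HEAD /a HTTP/2.0" 200 0 "-" "-" "-"',
    '192.168.112.125 - - [03/Apr/2025:20:25:16 +0800] "GET /b HTTP/2.0" 206 10 "-" "-" "-"',
    '192.168.89.136 - - [03/Apr/2025:20:26:00 +0800] "POST /c HTTP/1.1" 405 157 "-" "-" "-"',
    'this line is not in nginx format and is skipped',
]

for key, total in run_job(sample).items():
    print(f"{key}\t{total}")
```

A similar check on the real scripts is to pipe the log through them in a shell with a sort in between (mapper, then sort, then reducer), which mimics the sorting that Hadoop performs between the two phases.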
Then create the directory and put the scripts and the log file in it:
mkdir nginx-input
cd nginx-input
[root@win-local-17 nginx-input]# ll
总用量 36
-rw-r--r--. 1 root root 502 5月 8 18:34 mapper.py
-rw-r--r--. 1 root root 26326 5月 8 18:48 nginx.log
-rw-r--r--. 1 root root 474 5月 8 18:35 reducer.py
[root@win-local-17 nginx-input]# head nginx.log -n 3
192.168.112.125 - - [03/Apr/2025:20:25:15 +0800] "HEAD /app/psychicai/psychicai_test.ipa HTTP/2.0" 200 0 "-" "com.apple.appstored/1.0 iOS/17.5.1 model/iPhone13,2 hwp/t8101 build/21F90 (6; dt:229) AMS/1" "-"
192.168.112.125 - - [03/Apr/2025:20:25:15 +0800] "GET /app/psychicai/icon_1024@1x.png HTTP/2.0" 200 162747 "-" "com.apple.appstored/1.0 iOS/17.5.1 model/iPhone13,2 hwp/t8101 build/21F90 (6; dt:229) AMS/1" "-"
192.168.112.125 - - [03/Apr/2025:20:25:21 +0800] "GET /app/psychicai/psychicai_test.ipa HTTP/2.0" 200 74397692 "-" "com.apple.appstored/1.0 iOS/17.5.1 model/iPhone13,2 hwp/t8101 build/21F90 (6; dt:229) AMS/1" "-"
Once the files are ready, run the job so that both phases use the custom scripts (the host needs a Python 3 environment):
cd /opt/hadoop/nginx-input
hadoop jar ../share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input ./nginx.log \
-output /output \
-mapper "python3 mapper.py" \
-reducer "python3 reducer.py"
While the job runs there is a lot of output; watch it for obvious errors.
[root@win-local-17 nginx-input]# hadoop jar ../share/hadoop/tools/lib/hadoop-streaming-*.jar \
> -input ./nginx.log \
> -output /output \
> -mapper "python3 mapper.py" \
> -reducer "python3 reducer.py"
25/05/08 18:50:59 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
25/05/08 18:50:59 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
25/05/08 18:51:19 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
25/05/08 18:51:40 INFO mapred.FileInputFormat: Total input files to process : 1
25/05/08 18:51:40 INFO mapreduce.JobSubmitter: number of splits:1
25/05/08 18:51:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1405416554_0001
25/05/08 18:51:41 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
25/05/08 18:51:41 INFO mapred.LocalJobRunner: OutputCommitter set in config null
25/05/08 18:51:41 INFO mapreduce.Job: Running job: job_local1405416554_0001
25/05/08 18:51:41 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
25/05/08 18:51:41 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
25/05/08 18:51:41 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
25/05/08 18:51:41 INFO mapred.LocalJobRunner: Waiting for map tasks
25/05/08 18:51:41 INFO mapred.LocalJobRunner: Starting task: attempt_local1405416554_0001_m_000000_0
25/05/08 18:51:42 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
25/05/08 18:51:42 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
25/05/08 18:51:42 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
25/05/08 18:51:42 INFO mapred.MapTask: Processing split: file:/opt/hadoop/nginx-input/nginx.log:0+26326
25/05/08 18:51:42 INFO mapred.MapTask: numReduceTasks: 1
25/05/08 18:51:42 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
25/05/08 18:51:42 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
25/05/08 18:51:42 INFO mapred.MapTask: soft limit at 83886080
25/05/08 18:51:42 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
25/05/08 18:51:42 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
25/05/08 18:51:42 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
25/05/08 18:51:42 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/local/bin/python3, mapper.py]
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
25/05/08 18:51:42 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
25/05/08 18:51:42 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
25/05/08 18:51:42 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
25/05/08 18:51:42 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
25/05/08 18:51:42 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
25/05/08 18:51:42 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
25/05/08 18:51:42 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
25/05/08 18:51:42 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
25/05/08 18:51:42 INFO streaming.PipeMapRed: Records R/W=100/1
25/05/08 18:51:42 INFO streaming.PipeMapRed: MRErrorThread done
25/05/08 18:51:42 INFO streaming.PipeMapRed: mapRedFinished
.......
During this time you can also see the host's CPU usage and load rise noticeably:
top - 18:51:42 up 2 days, 8:32, 3 users, load average: 0.16, 0.05, 0.06
Tasks: 154 total, 1 running, 153 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.5 us, 2.8 sy, 0.0 ni, 84.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 1867048 total, 211396 free, 249168 used, 1406484 buff/cache
KiB Swap: 1048572 total, 1048404 free, 168 used. 1304388 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25764 root 20 0 2304916 138548 22392 S 155.5 7.4 0:12.16 java
I tested with 100 log lines and the job finished in about 10 seconds. Then we can look at the result.
[root@win-local-17 nginx-input]# ll /output/
总用量 4
-rw-r--r--. 1 root root 107 5月 8 18:51 part-00000
-rw-r--r--. 1 root root 0 5月 8 18:51 _SUCCESS
[root@win-local-17 nginx-input]# cat /output/part-00000
192.168.112.125 3
192.168.113.187 5
192.168.114.57 1
192.168.115.0 3
192.168.89.136 88
200 53
206 2
405 45
Here it has counted both the client IPs and the status codes in the log. Now let's filter the file with shell commands and compare the results.
[root@win-local-17 nginx-input]# cat nginx.log | awk '{print $1}' |sort -n | uniq -c
3 192.168.112.125
5 192.168.113.187
1 192.168.114.57
3 192.168.115.0
88 192.168.89.136
[root@win-local-17 nginx-input]# cat nginx.log |grep -w '200' |wc -l
53
[root@win-local-17 nginx-input]# cat nginx.log |grep -w '206' |wc -l
2
[root@win-local-17 nginx-input]# cat nginx.log |grep -w '405' |wc -l
45
[root@win-local-17 nginx-input]# wc -l nginx.log
100 nginx.log
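One caveat about the manual check: grep -w '200' matches the word 200 anywhere in a line, so a line with another status code and a response size of exactly 200 bytes would also be counted. The numbers agree for this log, but a field-based count is stricter. The following Python sketch compares the two approaches. It assumes the status code is the three-digit field right after the first quoted request string. The sample lines are hypothetical, and grep_w only approximates grep -w with word boundaries.

```python
import re
from collections import Counter

# Status code: three digits right after the first quoted string (the request)
STATUS_FIELD = re.compile(r'"[^"]*" (\d{3}) ')


def count_status(lines):
    """Count status codes by reading the status field only."""
    counts = Counter()
    for line in lines:
        match = STATUS_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def grep_w(lines, word):
    """Approximate `grep -w word | wc -l`: lines containing word as a whole word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return sum(1 for line in lines if pattern.search(line))


# Hypothetical lines: the second one has status 404 and a body size of 200
sample = [
    '10.0.0.1 - - [03/Apr/2025:20:25:15 +0800] "GET /a HTTP/1.1" 200 512 "-" "-" "-"',
    '10.0.0.2 - - [03/Apr/2025:20:25:16 +0800] "GET /b HTTP/1.1" 404 200 "-" "-" "-"',
]

print("grep -w style count for 200:", grep_w(sample, "200"))
print("status field count for 200:", count_status(sample)["200"])
```

On these two lines the word-based count reports 2 while the field-based count reports 1, which is the kind of difference to keep in mind when the totals do not match.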
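Counting the log file by hand gives the same results as the MapReduce job above. This kind of custom mapper and reducer can be adapted to many scenarios, and as we keep learning there will be better tools that simplify custom data analysis.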
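Pseudo-distributed mode
Pseudo-distributed mode simulates the running state of each distributed process on a single host, which makes it convenient for learning.
Configure the cluster by editing the Hadoop configuration file /opt/hadoop/etc/hadoop/core-site.xml. This is one of the core configuration files of a Hadoop cluster. It defines the global settings and common properties of the distributed system and controls communication, data transfer and basic behavior across Hadoop components, so it determines how the cluster runs and how its services interact.
core-site.xml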
<configuration>
    <!-- Address of the NameNode in HDFS -->
    <property>
        <name>fs.defaultFS</name>
        <!-- NameNode host address (the address of this machine) -->
        <value>hdfs://192.168.44.17:8020</value>
    </property>
    <!-- Directory where Hadoop stores the files it generates at runtime -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
</configuration>
hdfs-site.xml
<configuration>
    <!-- Number of HDFS replicas -->
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
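Since all the *-site.xml files share the same property layout, it can be handy to read them back and confirm the values before starting any daemon. Below is a minimal Python sketch that parses such a file into a dictionary with the standard library. The function name parse_site_xml is a hypothetical helper, and the embedded XML simply repeats the core-site.xml settings used in this article.

```python
import xml.etree.ElementTree as ET


def parse_site_xml(xml_text: str) -> dict:
    """Turn the <property><name>/<value> pairs of a Hadoop site file into a dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


# Same settings as the core-site.xml shown in this article
CORE_SITE = """<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://192.168.44.17:8020</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/data/tmp</value>
    </property>
</configuration>"""

for name, value in parse_site_xml(CORE_SITE).items():
    print(f"{name} = {value}")
```

To check the real file, read /opt/hadoop/etc/hadoop/core-site.xml into a string and pass it to the same function. A malformed file raises a parse error here, which is easier to read than a daemon failing at startup.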
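After changing the configuration above, start the cluster.
- Format the NameNode (only before the first start; do not format it again later, because formatting wipes the existing HDFS metadata)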
hdfs namenode -format
# The command prints a lot of output; check it for obvious errors
25/05/09 11:16:23 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at win-local-17/192.168.44.17
************************************************************/
- Start the NameNode
cd /opt/hadoop/etc/hadoop
[root@win-local-17 hadoop]# hadoop-daemon.sh start namenode
starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-win-local-17.out
[root@win-local-17 hadoop]# cat /opt/hadoop/logs/hadoop-root-namenode-win-local-17.out
ulimit -a for user root
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7206
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7206
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- Start the DataNode
cd /opt/hadoop/etc/hadoop
[root@win-local-17 hadoop]# hadoop-daemon.sh start datanode
starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-win-local-17.out
[root@win-local-17 hadoop]# cat /opt/hadoop/logs/hadoop-root-datanode-win-local-17.out
ulimit -a for user root
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7206
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 7206
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
- Check that the processes started successfully
[root@win-local-17 hadoop]# jps
80615 NameNode
80726 DataNode
80790 Jps
[root@win-local-17 hadoop]# netstat -anltp |grep java
tcp 0 0 192.168.44.17:8020 0.0.0.0:* LISTEN 80615/java
tcp 0 0 0.0.0.0:50070 0.0.0.0:* LISTEN 80615/java
tcp 0 0 0.0.0.0:50010 0.0.0.0:* LISTEN 80726/java
tcp 0 0 0.0.0.0:50075 0.0.0.0:* LISTEN 80726/java
tcp 0 0 0.0.0.0:50020 0.0.0.0:* LISTEN 80726/java
tcp 0 0 127.0.0.1:43496 0.0.0.0:* LISTEN 80726/java
tcp 0 0 192.168.44.17:37402 192.168.44.17:8020 ESTABLISHED 80726/java
tcp 0 0 192.168.44.17:8020 192.168.44.17:37402 ESTABLISHED 80615/java
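A brief description of the services behind the ports above:
8020: the NameNode RPC service of HDFS (the port set in fs.defaultFS).
50070: the NameNode web UI, which can be opened in a browser.
50010: the DataNode data transfer port. The DataNode uses it to transfer data blocks to and from other DataNodes and clients. When a client reads or writes data, it talks to the relevant DataNode through this port.
50075: the DataNode web UI.
50020: the DataNode IPC port, used for RPC calls made to the DataNode. Reports from the DataNode to the NameNode about the blocks it stores go to the NameNode RPC port 8020, as the ESTABLISHED connection above shows.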
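The netstat check can also be done from any machine that can reach the host. The following Python sketch tries a TCP connection to each port and reports whether something accepts it. It only tests reachability and says nothing about which service is behind the port. The function names and the HDFS_PORTS table are hypothetical helpers built from the list above.

```python
import socket

# Ports from the list above, with a short label for each
HDFS_PORTS = {
    8020: "NameNode RPC",
    50070: "NameNode web UI",
    50010: "DataNode data transfer",
    50075: "DataNode web UI",
    50020: "DataNode IPC",
}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


def check_ports(host: str, ports=HDFS_PORTS) -> dict:
    """Map each port to True/False depending on whether it accepts connections."""
    return {port: is_listening(host, port) for port in ports}


def report(host: str) -> None:
    """Print one line per port for the given host."""
    for port, up in check_ports(host).items():
        state = "open" if up else "closed"
        print(f"{port}\t{HDFS_PORTS[port]}\t{state}")
```

Calling report("192.168.44.17") on the host used in this article should show all five ports as open once the NameNode and DataNode are running. Note that 8020 is bound to that address only, so checking 127.0.0.1 would report it as closed.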
Open the HDFS web UI in a browser:
Here you can view system information, as well as node and log information.
Start YARN and run a MapReduce program
Edit the yarn-site.xml configuration file and add the following:
/opt/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <!-- List of auxiliary services that run on the NodeManager. In Hadoop, mapreduce_shuffle is a key auxiliary service that handles the shuffle phase of MapReduce jobs. -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Address of the YARN ResourceManager web UI; the default port is 8088 -->
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>0.0.0.0:8088</value>
    </property>
</configuration>
mv mapred-site.xml.template mapred-site.xml
Edit mapred-site.xml
<configuration>
    <!-- Run MapReduce on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
- Start the cluster
The NameNode and DataNode must already be running before you start YARN.
Start the ResourceManager
[root@win-local-17 hadoop]# yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/hadoop/logs/yarn-root-resourcemanager-win-local-17.out
Start the NodeManager
[root@win-local-17 hadoop]# yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-win-local-17.out
Check the running processes:
[root@win-local-17 hadoop]# jps
90321 NodeManager
90593 ResourceManager
80615 NameNode
80726 DataNode
90716 Jps
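With four daemons involved, it is easy to overlook one that failed to start. The following Python sketch parses jps output and lists the expected daemons that are missing. The daemon names come from the output above, while the function names and the EXPECTED set are hypothetical helpers.

```python
# Daemons that should be running in this pseudo-distributed setup
EXPECTED = {"NameNode", "DataNode", "ResourceManager", "NodeManager"}


def parse_jps(output: str) -> dict:
    """Parse `jps` output lines of the form '<pid> <name>' into {name: pid}."""
    processes = {}
    for line in output.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].isdigit():
            processes[parts[1].strip()] = int(parts[0])
    return processes


def missing_daemons(output: str, expected=EXPECTED) -> list:
    """Return the expected daemon names that do not appear in the jps output."""
    return sorted(set(expected) - set(parse_jps(output)))


# The jps output shown above
JPS_OUTPUT = """90321 NodeManager
90593 ResourceManager
80615 NameNode
80726 DataNode
90716 Jps
"""

print(missing_daemons(JPS_OUTPUT))  # prints [] because all four daemons are present
```

For a live check, the text passed in can come from running jps through subprocess.run with capture_output=True and text=True. If a daemon is missing, its log under /opt/hadoop/logs is the place to look.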
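If you hit an error saying the port cannot be used, you can change the port in yarn-site.xml to 8098 and restart the ResourceManager; after that it ran normally here.
Then open the web UI and you will see: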
References:
https://developer.aliyun.com/article/1046126
https://www.cnblogs.com/liugp/p/16607424.html#%E4%BA%8Chadoop-hdfs-ha-%E6%9E%B6%E6%9E%84%E4%B8%8E%E5%8E%9F%E7%90%86