Installing Spark on Ubuntu 20.04
Install OpenJDK 21
- Download
wget https://download.java.net/openjdk/jdk21/ri/openjdk-21+35_linux-x64_bin.tar.gz
- Extract
tar -xvf openjdk-21+35_linux-x64_bin.tar.gz
sudo mv jdk-21/ /opt/jdk-21/
- Set environment variables
echo 'export JAVA_HOME=/opt/jdk-21' | sudo tee /etc/profile.d/java21.sh
echo 'export PATH=$JAVA_HOME/bin:$PATH'|sudo tee -a /etc/profile.d/java21.sh
source /etc/profile.d/java21.sh
- Verify the installation
java --version
Install Spark
- Install the required packages
sudo apt update
sudo apt install default-jdk scala git -y
java -version; javac -version; scala -version; git --version
- Download Spark
mkdir /home/yourname/spark
cd spark/
wget https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
- Verify the downloaded package (a Python fallback is sketched below)
cd /home/yourname/spark
wget https://dlcdn.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz.sha512
shasum -a 512 -c spark-3.5.6-bin-hadoop3.tgz.sha512
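If shasum is not available, the same check can be done with the Python standard library. A rough sketch, assuming both the .tgz and the .sha512 file are in the current directory:

import hashlib
import re

archive = "spark-3.5.6-bin-hadoop3.tgz"

# Stream the archive through SHA-512 (the file is several hundred MB).
h = hashlib.sha512()
with open(archive, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)

# The .sha512 file mixes the digest with the file name and whitespace,
# so keep only the tokens that are pure hexadecimal and join them.
with open(archive + ".sha512") as f:
    tokens = re.split(r"[\s:]+", f.read())
expected = "".join(t for t in tokens if re.fullmatch(r"[0-9a-fA-F]+", t)).lower()

print("OK" if h.hexdigest() == expected else "MISMATCH")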
- Extract
tar -xvf spark-3.5.6-bin-hadoop3.tgz
- After extracting, Spark is ready to use; verify the installation
/home/yourname/spark/spark-3.5.6-bin-hadoop3/bin/spark-shell --version
- Set environment variables
vim ~/.profile
export SPARK_HOME=/home/yourname/spark/spark-3.5.6-bin-hadoop3/
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
source ~/.profile
- Configure remote access to the web UI (the conf files are under $SPARK_HOME/conf)
cp spark-defaults.conf.template spark-defaults.conf
Add the following lines (a per-application equivalent is sketched after this step):
spark.ui.host 0.0.0.0
spark.ui.port 8080
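The same properties can also be set per application from PySpark instead of editing spark-defaults.conf. A minimal sketch; the property names simply mirror the file above and are not required if the file is already configured:

from pyspark.sql import SparkSession

# Pass the UI settings through the builder rather than spark-defaults.conf.
spark = (
    SparkSession.builder
    .appName("ui-config-example")
    .config("spark.ui.host", "0.0.0.0")
    .config("spark.ui.port", "8080")
    .getOrCreate()
)
print(spark.conf.get("spark.ui.port"))
spark.stop()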
- Set up spark-env.sh
cp spark-env.sh.template spark-env.sh
Add the following:
export JAVA_HOME=/opt/jdk-21/
SPARK_MASTER_HOST=192.168.220.132
SPARK_MASTER_PORT=7077
- Start Standalone Spark Master Server
$SPARK_HOME/sbin/start-master.sh
The master web UI is now reachable on port 8080 of this machine (a quick reachability check is sketched below).
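If the UI does not open from another machine, a quick client-side check helps distinguish a network or firewall issue from a Spark issue. A minimal sketch using only the Python standard library, assuming the master address used in this guide:

from urllib.request import urlopen

# The standalone master serves its web UI on port 8080 by default;
# 192.168.220.132 is the master address used throughout this guide.
with urlopen("http://192.168.220.132:8080", timeout=5) as resp:
    print(resp.status, resp.reason)  # 200 OK means the master UI is reachable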
- Start a worker
$SPARK_HOME/sbin/start-worker.sh spark://192.168.220.132:7077
- Basic Commands to Start and Stop Master Server and Workers
The following table lists the basic commands for starting and stopping the Apache Spark (driver) master server and workers in a single-machine setup.
| Command | Description |
| --- | --- |
| start-master.sh | Start the driver (master) server instance on the current machine. |
| stop-master.sh | Stop the driver (master) server instance on the current machine. |
| start-worker.sh spark://master_server:port | Start a worker process and connect it to the master server (use the master's IP or hostname). |
| stop-worker.sh | Stop a running worker process. |
| start-all.sh | Start both the driver (master) and worker instances. |
| stop-all.sh | Stop all the driver (master) and worker instances. |
Reading CSV files with PySpark
- Install the same PySpark version as the Spark installation (a quick version check is sketched below)
pip install pyspark==3.5.6
If you previously installed a different PySpark version, it is best to also delete the leftover pyspark folder from your site-packages directory after uninstalling it.
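A quick way to confirm that the client-side PySpark matches the server installation is to compare version strings; a minimal check:

import pyspark

# Should print 3.5.6, the same version as spark-3.5.6-bin-hadoop3 on the server
# (the server side reports its version via spark-shell --version).
print(pyspark.__version__)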
- Local debugging (a follow-up sketch appears after the example)
from pyspark.sql import SparkSession
import os
os.environ["JAVA_HOME"] = r"D:\java21opensdk\jdk-21.0.1"

spark = SparkSession.builder \
    .appName("test") \
    .getOrCreate()
# .master("spark://192.168.220.132:7077") \
# .getOrCreate()

path = r"C:\Users\test\Desktop\test.csv"
df = spark.read.csv(path, header=True, inferSchema=True)

rows = df.collect()
for row in rows:
    print(row.asDict())
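collect() pulls every row back to the driver, which is fine for a small test file but not for larger data. For a quick look without collecting everything, schema inspection and show() are usually enough; a short follow-up sketch using the same df as above:

# Inspect the inferred schema and a sample of rows without collecting them all.
df.printSchema()
df.show(5, truncate=False)
print("row count:", df.count())

spark.stop()  # release the local session when done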
- Run the job on the cluster worker (a write-back sketch follows the example)
from pyspark.sql import SparkSession
import os
os.environ["JAVA_HOME"] = r"D:\java21opensdk\jdk-21.0.1"

spark = SparkSession.builder \
    .appName("test") \
    .master("spark://192.168.220.132:7077") \
    .getOrCreate()

# Note: this CSV file must be copied to the Spark server first.
path = "file:///opt/data/test.csv"
df = spark.read.csv(path, header=True, inferSchema=True)

rows = df.collect()
for row in rows:
    print(row.asDict())
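Because the job now runs on the standalone cluster, input and output paths must be accessible on the Spark server, not just on the client, which is why test.csv is copied to /opt/data/ first. As a rough sketch of finishing the job, writing results back and shutting down might look like this; the output path is an assumption, not part of the original steps:

# /opt/data/test_out is a hypothetical output directory on the Spark server;
# it is written by the executors, so it must be writable there.
df.write.mode("overwrite").parquet("file:///opt/data/test_out")

spark.stop()  # disconnect from the standalone master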
References