A short note on the startup environment differences between PySpark 2.x and PySpark 3.x
- If the Spark version in use is 2.x, we can create the SparkContext explicitly and then build the SparkSession from the same configuration (the builder's `getOrCreate()` picks up the already-running context):
```python
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession


def get_spark_context():
    conf = SparkConf().setAppName("myApp")
    conf.set("spark.sql.execution.arrow.enabled", "False")
    sc = SparkContext(conf=conf)
    spark = SparkSession.builder.config(conf=conf) \
        .enableHiveSupport() \
        .getOrCreate()
    return sc, spark
```
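For reference, a minimal usage sketch of the helper above; the sample data, column names, and the `sc.stop()` call are purely illustrative and not part of the original setup:

```python
# Illustrative only: exercise both handles returned by get_spark_context().
sc, spark = get_spark_context()

# RDD work goes through the explicit SparkContext...
rdd = sc.parallelize([("a", 1), ("b", 2)])

# ...while DataFrame/Hive work goes through the SparkSession built on top of it.
df = spark.createDataFrame(rdd, ["key", "value"])
df.show()

sc.stop()
```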
- But with Spark 3.x we should avoid creating the SparkContext explicitly, because the SparkSession manages its SparkContext automatically:
```python
from pyspark.sql import SparkSession


def get_spark_session():
    # In Spark 3.x this Arrow key still works but is deprecated in favor of
    # spark.sql.execution.arrow.pyspark.enabled.
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.sql.execution.arrow.enabled", "true") \
        .enableHiveSupport() \
        .getOrCreate()
    return spark
```
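If RDD APIs are still needed under 3.x, the managed SparkContext can be pulled off the session instead of being constructed by hand. A minimal sketch, again with made-up sample data:

```python
# Illustrative only: the session owns the SparkContext in Spark 3.x.
spark = get_spark_session()

# Retrieve the managed context when RDD APIs are still needed.
sc = spark.sparkContext
rdd = sc.parallelize([("a", 1), ("b", 2)])

df = spark.createDataFrame(rdd, ["key", "value"])
df.show()

spark.stop()
```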
- When submitting the PySpark script to the Spark cluster, the only difference is which Spark version's spark-submit you invoke. Taking Spark 3.1 as an example:
```bash
/usr/lib/software/spark/spark-3.1/bin/wq-spark-submit \
    --master yarn \
    --deploy-mode cluster \
    --num-executors 10 \
    --executor-cores 4 \
    --executor-memory 2g \
    --name your_name \
    --queue your_executor_queue \
    --conf spark.type=SPARKCORE \
    --conf spark.yarn.report.interval=10000 \
    --conf spark.dynamicAllocation.maxExecutors=200 \
    --driver-memory 20g \
    --conf spark.executor.memoryOverhead=8192 \
    --conf spark.driver.maxResultSize=20g \
    --conf spark.default.parallelism=6000 \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./python_venv/py38/bin/python \
    --conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./python_venv/py38/bin/python \
    --archives=path/to/your/python_envs/py38.tar.gz#python_venv \
    pyspark_script.py
```
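Here `--archives` unpacks `py38.tar.gz` into each container as `python_venv`, and the two `PYSPARK_PYTHON` settings point the application at that interpreter. A small, purely illustrative sanity check that could sit at the top of `pyspark_script.py` to confirm which Python actually runs on the driver and executors (everything below is an assumption for illustration, not part of the original script):

```python
# Illustrative sanity check: the printed paths should point into the
# unpacked ./python_venv/py38 environment shipped via --archives.
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myApp").enableHiveSupport().getOrCreate()

print("driver python :", sys.executable)
print("spark version :", spark.version)

# One probe per partition; executors report their own interpreter path.
executor_pythons = (
    spark.sparkContext
    .parallelize(range(4), 4)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("executor python:", executor_pythons)
```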