Learn how to run PySpark the easy way with GCP Dataproc Serverless. A clear, practical guide with commands, setup steps, fixes, and real-world tips beginners can follow.
If you’ve ever tried running PySpark jobs on cloud servers, you already know the drill: long setup times, dependency headaches, and constant VM babysitting. That’s why many teams are switching to GCP Dataproc Serverless, a simple way to process data without managing clusters. You submit code, Google handles the Spark clusters behind the scenes, and you only pay for what you use.
But before jumping in, you should know how to build and run a clean pipeline. So let’s walk through the same steps developers follow in real projects: no fluff, no generic wording, just real commands and clarity.

Overview
1. Prepare Your Environment
First, make sure your machine has:
- Poetry
- gcloud
- gsutil
- bq CLI
- make
And in your GCP project:
- Dataproc API enabled
- Billing enabled
- Private Google Access on your subnet
- BigQuery API enabled
This setup matters because GCP Dataproc Serverless needs these services to handle staging, compute, and BigQuery writes.
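Before going further, a quick check of the local side of that list can save a failed build later. Here is a minimal sketch that looks for each tool on your PATH. The executable names are assumed to match the checklist (poetry, gcloud, gsutil, bq, make), and the function name is just illustrative.

```python
import shutil

# Executable names assumed from the checklist above
REQUIRED_TOOLS = ["poetry", "gcloud", "gsutil", "bq", "make"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All required tools found.")
```

This only covers the local tools. The project-side items (APIs, billing, Private Google Access) still need to be checked in your GCP project.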
2. Update Makefile Values
Open your Makefile and set:
PROJECT_ID ?= my-gcp-project-292607
REGION ?= europe-west2
Confirm you’re using the right project:
gcloud config list
Then create buckets and datasets:
make setup
You’ll see new buckets like:
- serverless-spark-code-repo-<project_number>
- serverless-spark-staging-<project_number>
- serverless-spark-data-<project_number>
And a dataset named serverless_spark_demo.
This is where GCP Dataproc Serverless stores your code, temp files, and CSVs.
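If you want to refer to those resources from a script, the naming pattern above is easy to reproduce. The sketch below assumes the names follow exactly the pattern shown in this step. The dictionary keys and the project number in the usage note are made up for illustration.

```python
def demo_resource_names(project_number: str) -> dict:
    """Build the bucket and dataset names that follow the pattern shown above."""
    return {
        # Keys are illustrative labels, not Makefile variable names
        "code_bucket": f"serverless-spark-code-repo-{project_number}",
        "staging_bucket": f"serverless-spark-staging-{project_number}",
        "data_bucket": f"serverless-spark-data-{project_number}",
        "dataset": "serverless_spark_demo",
    }
```

For example, `demo_resource_names("123456789")["data_bucket"]` gives `serverless-spark-data-123456789`, where `123456789` is a placeholder project number.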
3. Package Your Code
Dataproc Serverless accepts .py, .egg, or .zip. Most engineers pick zip, so run:
make build
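The build target installs dependencies, exports them, packages them into a zip, and uploads the result. One line in that target is worth noting:
@cp ./src/main.py ./dist
That line copies the entry file into dist, so it gets uploaded to the bucket alongside the zip.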
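To make the packaging step less of a black box, here is a rough Python sketch of the zipping part only. It is not the Makefile's actual implementation. It assumes your modules live under a source directory and bundles every .py file with paths relative to that directory, which is the layout Python needs to import them from a zip. The function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path


def build_zip(src_dir, zip_path):
    """Zip every .py file under src_dir, keeping paths relative to src_dir."""
    src = Path(src_dir)
    zip_path = Path(zip_path)
    # Create the output folder (for example ./dist) if it does not exist yet
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for py_file in sorted(src.rglob("*.py")):
            # Store "pkg/util.py", not the absolute path on disk
            zf.write(py_file, py_file.relative_to(src).as_posix())
    return zip_path
```

Third-party dependencies are not handled here. In this guide, that part is done by the build target through Poetry.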
4. Run Your PySpark Job
Now for the real action:
make run
The underlying command looks like this:
gcloud beta dataproc batches submit --project ${PROJECT_ID} --region ${REGION} pyspark \
gs://${CODE_BUCKET}/dist/main.py --py-files=gs://${CODE_BUCKET}/dist/${APP_NAME}_${VERSION_NO}.zip \
--subnet default --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_serverless_repo_exemplar \
--jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
-- --project=${PROJECT_ID} --file-uri=gs://${DATA_BUCKET}/stocks.csv --temp-bq-bucket=${TEMP_BUCKET}
Once it finishes, check BigQuery for a table named stock_prices.
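The guide doesn't show main.py itself, so here is a minimal sketch of what an entry file matching those arguments could look like. It is an assumption, not the original code. It reads the CSV with a header row, infers the schema, and writes to serverless_spark_demo.stock_prices through the BigQuery connector jar passed with --jars. The overwrite mode is also an assumption.

```python
import argparse


def parse_args(argv=None):
    """Parse the three arguments passed after `--` in the submit command."""
    parser = argparse.ArgumentParser(description="Load a CSV into BigQuery")
    parser.add_argument("--project", required=True)
    parser.add_argument("--file-uri", required=True)
    parser.add_argument("--temp-bq-bucket", required=True)
    return parser.parse_args(argv)


def run(args):
    # Imported here so argument parsing can be tested without Spark installed
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Assumes the CSV has a header row; column types are inferred
    df = spark.read.csv(args.file_uri, header=True, inferSchema=True)
    (
        df.write.format("bigquery")
        # The connector stages data in this bucket before loading it
        .option("temporaryGcsBucket", args.temp_bq_bucket)
        .mode("overwrite")  # assumption: replace the table on each run
        .save(f"{args.project}.serverless_spark_demo.stock_prices")
    )
    spark.stop()
```

To use this as a batch entry point, you would call `run(parse_args())` under an `if __name__ == "__main__":` guard at the bottom of the file.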
To view job logs:
- Open Dataproc
- Click Batches
- Open your Batch ID
This is where you monitor every GCP Dataproc Serverless run.

5. Local Dev Setup (Optional but Helpful)
If you prefer local testing:
poetry new it-depends
cd it-depends
mkdir src
touch ./src/main.py
Run a simple test:
poetry run python ./src/main.py
Then submit it through:
make run
This helps validate dependencies before sending them to GCP Dataproc Serverless.
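One cheap way to do that validation is to check that every package your job imports actually resolves inside the Poetry environment. The helper below is a small sketch with a hypothetical name. It expects top-level module names (for example "numpy", not "numpy.linalg").

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Saved in a file and run with `poetry run python`, a call like `missing_modules(["pyspark", "numpy"])` returns the names that are not installed locally. An empty list means everything resolved. This confirms the packages exist on your machine, not that they will behave the same on the Dataproc Serverless runtime.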
Conclusion
If your project needs heavy packages like numpy, elasticsearch, or xgboost, you may hit compatibility issues using --py-files. In that case, the cleanest route is a custom container image, since Dataproc Serverless supports image streaming. That sidesteps most dependency failures and can shorten startup time.
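To give a feel for how the container route changes the submit step, here is a hedged sketch that builds the command as an argument list. It reuses the `gcloud beta dataproc batches submit` form from step 4 and adds the `--container-image` flag. The function name and the image URI in the usage note are hypothetical, and building and pushing the image is not covered.

```python
def container_batch_args(project_id, region, main_uri, image_uri):
    """Build a batch submit command that runs the job in a custom container image."""
    return [
        "gcloud", "beta", "dataproc", "batches", "submit",
        "--project", project_id,
        "--region", region,
        "pyspark", main_uri,
        # The image replaces the zip of dependencies passed via --py-files
        "--container-image", image_uri,
        "--subnet", "default",
    ]
```

You could pass the result to `subprocess.run(args, check=True)`, for example with an image URI such as `europe-west2-docker.pkg.dev/my-gcp-project-292607/my-repo/spark-job:0.1.0`, which is a placeholder and not a real image.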
