Learn how to run PySpark the easy way with GCP Dataproc Serverless. A clear, practical guide with commands, setup steps, fixes, and real-world tips beginners can follow.


If you’ve ever run PySpark jobs on cloud servers, you already know the drill: long setup times, dependency headaches, and constant VM babysitting. That’s why many teams are switching to Dataproc Serverless on GCP, a way to process data without managing clusters. You submit code, Google provisions the Spark infrastructure behind the scenes, and you pay only for what you use.

But before jumping in, you should know how to build and run a clean pipeline. So let’s walk through the steps developers follow in real projects: no fluff, just real commands and clarity.

1. Prepare Your Environment

First, make sure your machine has:

  • Poetry
  • gcloud
  • gsutil
  • bq CLI
  • make

And in your GCP project:

  • Dataproc API enabled
  • Billing enabled
  • Private Google Access on your subnet
  • BigQuery API enabled

This setup matters because Dataproc Serverless needs these services to handle staging, compute, and BigQuery writes.
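Before going further, it can help to confirm the local tools are actually on your PATH. The snippet below is a minimal sketch of such a check, not part of the project itself. The executable names are assumptions (in particular, "bq CLI" is assumed to be installed as bq), and it cannot verify the project-side settings such as enabled APIs or Private Google Access.

```python
import shutil

# Assumed executable names for the tools listed above.
REQUIRED_TOOLS = ["poetry", "gcloud", "gsutil", "bq", "make"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```

The which parameter exists only so the lookup can be swapped out in a test; in normal use you would call missing_tools() with no arguments.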

2. Update Makefile Values

Open your Makefile and set:

PROJECT_ID ?= my-gcp-project-292607
REGION ?= europe-west2

Confirm you’re using the right project:

gcloud config list

Then create buckets and datasets:

make setup

You’ll see new buckets like:

  • serverless-spark-code-repo-<project_number>
  • serverless-spark-staging-<project_number>
  • serverless-spark-data-<project_number>

And a dataset named serverless_spark_demo.

This is where Dataproc Serverless stores your code, temp files, and CSVs.
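If you need these names in a script, the naming pattern above is easy to reproduce. This is a small sketch assuming the buckets follow exactly the three patterns listed; the function name and the dictionary keys are hypothetical, and the project number in the usage note is a placeholder.

```python
def bucket_names(project_number):
    """Build the three bucket names from the pattern shown above."""
    suffix = str(project_number)
    return {
        "code": f"serverless-spark-code-repo-{suffix}",
        "staging": f"serverless-spark-staging-{suffix}",
        "data": f"serverless-spark-data-{suffix}",
    }


def gcs_uri(bucket, path=""):
    """Turn a bucket name and an optional object path into a gs:// URI."""
    return f"gs://{bucket}/{path}" if path else f"gs://{bucket}"
```

For example, gcs_uri(bucket_names("123456789012")["data"], "stocks.csv") gives the kind of URI the job later receives as its input file.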

3. Package Your Code

Dataproc Serverless accepts .py, .egg, or .zip files for Python dependencies. A zip is a common choice for multi-file projects, so run:

make build

The build target installs dependencies, exports them, packages them into a zip, and uploads the result. It also includes this line:

@cp ./src/main.py ./dist

It copies the entry file into ./dist so that main.py is uploaded to the code bucket next to the zip.
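To make the packaging step less of a black box, here is a simplified sketch of what "package into a zip" means. It is not the Makefile's actual logic: it only zips the .py files under a source directory and skips the dependency export, and the function name and paths are hypothetical.

```python
import zipfile
from pathlib import Path


def build_zip(src_dir, zip_path):
    """Zip every .py file under src_dir, keeping paths relative to src_dir.

    Returns the list of archive member names that were written.
    """
    src_dir = Path(src_dir)
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)

    names = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src_dir.rglob("*.py")):
            arcname = path.relative_to(src_dir).as_posix()
            archive.write(path, arcname)
            names.append(arcname)
    return names
```

Keeping paths relative to the source directory matters because Spark adds the zip itself to the Python path, so packages must sit at the top level of the archive to be importable.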

4. Run Your PySpark Job

Now for the real action:

make run

The underlying command looks like this:

gcloud beta dataproc batches submit --project ${PROJECT_ID} --region ${REGION} pyspark \
  gs://${CODE_BUCKET}/dist/main.py \
  --py-files=gs://${CODE_BUCKET}/dist/${APP_NAME}_${VERSION_NO}.zip \
  --subnet default \
  --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_serverless_repo_exemplar \
  --jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
  -- --project=${PROJECT_ID} --file-uri=gs://${DATA_BUCKET}/stocks.csv --temp-bq-bucket=${TEMP_BUCKET}

Once it finishes, check BigQuery for a table named stock_prices.
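Everything after the bare -- in that command is passed straight to main.py, so the entry file has to parse those three arguments. The original main.py isn't shown here, so what follows is one plausible shape for it, not the actual file. The assumptions: the CSV has a header row, the table is overwritten on each run, and the project and temporaryGcsBucket options behave as documented for your version of the spark-bigquery connector.

```python
import argparse

# Dataset and table names taken from the walkthrough above.
TARGET_TABLE = "serverless_spark_demo.stock_prices"


def parse_args(argv=None):
    """Parse the arguments that follow `--` in the batch submit command."""
    parser = argparse.ArgumentParser(description="Load a CSV from GCS into BigQuery")
    parser.add_argument("--project", required=True, help="GCP project ID")
    parser.add_argument("--file-uri", required=True, help="gs:// URI of the input CSV")
    parser.add_argument("--temp-bq-bucket", required=True,
                        help="Bucket for temporary files during the BigQuery write")
    return parser.parse_args(argv)


def run(args):
    """Read the CSV with Spark and write it to BigQuery (sketch only)."""
    # Imported lazily so argument parsing can be tested without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Assumption: the file has a header row and types can be inferred.
    df = spark.read.csv(args.file_uri, header=True, inferSchema=True)
    (
        df.write.format("bigquery")
        .option("project", args.project)
        .option("temporaryGcsBucket", args.temp_bq_bucket)
        .mode("overwrite")  # Assumption: replace the table on every run.
        .save(TARGET_TABLE)
    )

# In main.py, the entry point would be: run(parse_args())
```

Note that argparse turns --file-uri into args.file_uri and --temp-bq-bucket into args.temp_bq_bucket, which is why the attribute names use underscores.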

To view job logs:

  • Open Dataproc
  • Click Batches
  • Open your Batch ID

This is where you monitor every Dataproc Serverless run.


5. Local Dev Setup (Optional but Helpful)

If you prefer local testing:

poetry new it-depends
cd it-depends
mkdir src
touch ./src/main.py

Run a simple test:

poetry run python ./src/main.py

Then submit it through:

make run

Running locally first helps you catch dependency problems before the job reaches Dataproc Serverless.
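One cheap way to do that validation is to check that every module your job imports can be found in the Poetry environment. This is a rough sketch; the function name is hypothetical, and the module names you pass in are whatever your own main.py imports.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]
```

Run it inside the environment, for example through poetry run python, with a list such as ["pyspark", "numpy"]. An empty result means the imports resolve locally. It does not prove the same versions will be available on the Dataproc Serverless runtime.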

Conclusion

If your project needs heavy packages like numpy, elasticsearch, or xgboost, you may hit compatibility issues using --py-files. In that case, the cleanest route is a custom container image, since Dataproc Serverless supports image streaming. Baking the dependencies into the image avoids those failures and can shorten job startup.