Running PySpark Jobs the Smart Way Using GCP Dataproc Serverless

Learn how to run PySpark the easy way with GCP Dataproc Serverless. A clear, practical guide with commands, setup steps, fixes, and real-world tips that beginners can follow. Our Google Cloud Live Support Team is always here to help you.

If you’ve ever run PySpark jobs on cloud servers, you already know the drill: long setup times, dependency headaches, and constant VM babysitting. That’s why many teams are switching to GCP Dataproc Serverless, a simple way to process data without managing clusters. You submit code, Google handles the Spark infrastructure behind the scenes, and you pay only for what you use.

But before jumping in, you should know how to build and run a clean pipeline. So let’s walk through the steps developers follow in real projects, with real commands and no fluff.


Overview

Prepare Your Environment
Update Makefile Values
Package Your Code
Run Your PySpark Job
Local Dev Setup

1. Prepare Your Environment

First, make sure your machine has:

Poetry
gcloud
gsutil
bq CLI
make

And in your GCP project:

Dataproc API enabled
Billing enabled
Private Google Access on your subnet
BigQuery API enabled

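This setup matters because GCP Dataproc Serverless needs these services to handle staging, compute, and BigQuery writes. Private Google Access, in particular, lets the workload reach Google APIs such as Cloud Storage and BigQuery from a subnet without external IP addresses.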
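Before going further, it can help to confirm the local tools are actually installed. The following is a minimal sketch, not part of the original repo: it assumes the executables are named poetry, gcloud, gsutil, bq, and make, and it only checks that they are on PATH, not that they are configured.

```python
import shutil

# CLI tools the article lists as local prerequisites (executable names assumed).
REQUIRED_TOOLS = ["poetry", "gcloud", "gsutil", "bq", "make"]


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools are on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in a test; in normal use the default is enough.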

2. Update Makefile Values

Open your Makefile and set:

PROJECT_ID ?= my-gcp-project-292607

REGION ?= europe-west2

Confirm you’re using the right project:

gcloud config list

Then create buckets and datasets:

make setup

You’ll see new buckets like:

serverless-spark-code-repo-<project_number>
serverless-spark-staging-<project_number>
serverless-spark-data-<project_number>

And a dataset named serverless_spark_demo.

This is where GCP Dataproc Serverless stores your code, temp files, and CSVs.
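To make the naming pattern concrete, here is a small sketch that rebuilds the three bucket names from a project number. The helper name and the placeholder number are illustrative, not taken from the Makefile. The project number itself can usually be looked up with gcloud projects describe PROJECT_ID --format="value(projectNumber)".

```python
# Dataset name created by `make setup`, as stated in the article.
DATASET = "serverless_spark_demo"


def bucket_names(project_number):
    """Build the three bucket names shown above for a given project number."""
    suffixes = ("code-repo", "staging", "data")
    return {s: f"serverless-spark-{s}-{project_number}" for s in suffixes}


if __name__ == "__main__":
    # 123456789012 is a placeholder project number, not a real one.
    for role, name in bucket_names("123456789012").items():
        print(f"{role}: gs://{name}")
    print("dataset:", DATASET)
```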

3. Package Your Code

Dataproc Serverless accepts .py, .egg, or .zip files for --py-files. Most engineers pick zip, so run:

make build

The build target installs dependencies, exports them, packages them, and uploads them. One line in it is worth pointing out:

@cp ./src/main.py ./dist

That line copies the entry file into ./dist, so it is uploaded to the bucket next to the zip.
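If you want to see what the packaging step amounts to, the sketch below zips the .py files under a source directory with Python's standard zipfile module. It is an illustration under assumptions, not the Makefile's actual build: the function name and paths are hypothetical, and it does not install or bundle third-party dependencies the way the build target does.

```python
import zipfile
from pathlib import Path


def build_zip(src_dir, zip_path):
    """Zip every .py file under src_dir, keeping paths relative to src_dir."""
    src_dir = Path(src_dir)
    zip_path = Path(zip_path)
    zip_path.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(src_dir.rglob("*.py")):
            # Store "pkg/util.py" rather than the absolute path on disk.
            archive.write(path, path.relative_to(src_dir).as_posix())
    return zip_path


if __name__ == "__main__":
    # "src" and "dist/app_0.1.0.zip" are illustrative paths.
    if Path("src").is_dir():
        print(build_zip("src", "dist/app_0.1.0.zip"))
```

Keeping paths relative to the source directory matters because Spark adds the zip to the Python path, so imports resolve against the archive root.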

4. Run Your PySpark Job

Now for the real action:

make run

The underlying command looks like this:

gcloud beta dataproc batches submit --project ${PROJECT_ID} --region ${REGION} pyspark \
gs://${CODE_BUCKET}/dist/main.py --py-files=gs://${CODE_BUCKET}/dist/${APP_NAME}_${VERSION_NO}.zip \
--subnet default --properties spark.executor.instances=2,spark.driver.cores=4,spark.executor.cores=4,spark.app.name=spark_serverless_repo_exemplar \
--jars gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar \
-- --project=${PROJECT_ID} --file-uri=gs://${DATA_BUCKET}/stocks.csv --temp-bq-bucket=${TEMP_BUCKET}
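The command is easier to read once it is split into its parts. The sketch below rebuilds the same argument list in Python so the Spark properties and the job's own arguments stand apart. The function and its parameter names are hypothetical; the flags and values are copied from the command above.

```python
import shlex


def build_submit_command(project_id, region, code_bucket, data_bucket,
                         temp_bucket, app_name, version_no):
    """Assemble the batch submit command from the article as an argv list."""
    properties = {
        "spark.executor.instances": "2",
        "spark.driver.cores": "4",
        "spark.executor.cores": "4",
        "spark.app.name": "spark_serverless_repo_exemplar",
    }
    return [
        "gcloud", "beta", "dataproc", "batches", "submit",
        "--project", project_id,
        "--region", region,
        "pyspark", f"gs://{code_bucket}/dist/main.py",
        f"--py-files=gs://{code_bucket}/dist/{app_name}_{version_no}.zip",
        "--subnet", "default",
        "--properties", ",".join(f"{k}={v}" for k, v in properties.items()),
        "--jars",
        "gs://spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.23.2.jar",
        "--",  # everything after this goes to main.py, not to gcloud
        f"--project={project_id}",
        f"--file-uri=gs://{data_bucket}/stocks.csv",
        f"--temp-bq-bucket={temp_bucket}",
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cmd = build_submit_command("my-project", "europe-west2", "code-bucket",
                               "data-bucket", "temp-bucket", "app", "0.1.0")
    print(shlex.join(cmd))
```

This only prints the command. Passing the list to subprocess.run would submit the batch, which is left out here on purpose.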

Once it finishes, check BigQuery for a table named stock_prices.
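The three arguments after the bare -- in the submit command are handed to main.py itself. The article does not show how main.py reads them, so the following is only one plausible sketch using argparse; the function name and help texts are assumptions, while the flag names match the command.

```python
import argparse


def parse_job_args(argv=None):
    """Parse the job arguments that follow `--` in the submit command."""
    parser = argparse.ArgumentParser(description="PySpark job arguments")
    parser.add_argument("--project", required=True,
                        help="GCP project ID")
    parser.add_argument("--file-uri", required=True,
                        help="gs:// URI of the input CSV")
    parser.add_argument("--temp-bq-bucket", required=True,
                        help="bucket for temporary BigQuery files")
    return parser.parse_args(argv)


if __name__ == "__main__":
    # Sample values for illustration; a real job would call parse_job_args()
    # with no arguments so that it reads sys.argv.
    args = parse_job_args([
        "--project=my-project",
        "--file-uri=gs://data-bucket/stocks.csv",
        "--temp-bq-bucket=temp-bucket",
    ])
    print(args.project, args.file_uri, args.temp_bq_bucket)
```

Note that argparse turns the dashes in --file-uri and --temp-bq-bucket into underscores, so the values are read as args.file_uri and args.temp_bq_bucket.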

To view job logs:

Open Dataproc
Click Batches
Open your Batch ID

This is where you monitor every GCP Dataproc Serverless run.


5. Local Dev Setup (Optional but Helpful)

If you prefer local testing:

poetry new it-depends

cd it-depends

mkdir src

touch ./src/main.py

Run a simple test:

poetry run python ./src/main.py

Then submit it through:

make run

This helps validate dependencies before sending them to GCP Dataproc Serverless.
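One cheap way to validate dependencies locally is to check that each module can be found in the Poetry environment before you submit. This is a minimal sketch under assumptions: the module list is an example, and the check only covers top-level module names, not versions or compatibility with the Dataproc runtime.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Example list; replace it with the modules your job imports.
    wanted = ["pyspark"]
    missing = missing_modules(wanted)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All listed modules are importable.")
```

Saved under a name of your choice, it can be run inside the project environment with poetry run python, in the same way as the main.py test above.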

Conclusion

If your project needs heavy packages like numpy, elasticsearch, or xgboost, you may hit compatibility issues using --py-files. In that case, the cleanest route is a custom container image, since Dataproc Serverless supports image streaming. That avoids most dependency failures and can speed up your runs.
