
gcloud Dataproc Jobs submit PySpark | In Action

by | Jan 12, 2024

Learn how to use the gcloud Dataproc Jobs submit PySpark command. Our Google Cloud Support team is here to help you with your questions and concerns.

All About “gcloud Dataproc Jobs submit PySpark” Command

The Google Cloud Dataproc service offers an environment for running Apache Spark and Apache Hadoop clusters.

 

gcloud Dataproc Jobs submit PySpark

Today, we are going to take a closer look at submitting PySpark jobs with the `gcloud dataproc jobs submit pyspark` command.
Understanding the command’s various options and real-world examples can help us improve our workflow.

Command Syntax

The `gcloud dataproc jobs submit pyspark` command comes with several options. Here’s a breakdown of its syntax:

 

gcloud dataproc jobs submit pyspark PY_FILE (--cluster=CLUSTER | --cluster-labels=[KEY=VALUE,…]) [--archives=[ARCHIVE,…]] [--async] [--bucket=BUCKET] [...]

Here are some of the key parameters:

  • `PY_FILE`: The Python file containing our PySpark script.
  • `--cluster`: The target cluster for job submission.
  • `--archives`: Additional archives to be used by the job.
  • `--async`: Run the job asynchronously.
  • `--bucket`: Cloud Storage bucket for job resources.

Here are some real-world examples:

  1. Submitting a PySpark Job with Local Script and Custom Flags

    gcloud dataproc jobs submit pyspark --cluster=my-cluster my_script.py -- --custom-flag

  2. Submitting a Spark Job with a Script Already on the Cluster

    gcloud dataproc jobs submit pyspark --cluster=my-cluster file:///usr/lib/spark/examples/src/main/python/pi.py -- 100
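
The `--async` flag from the option list above is useful for long-running jobs: the command returns as soon as the job is submitted, and we can check on it later. As a rough sketch (JOB_ID stands for the job ID printed at submission time):

    gcloud dataproc jobs submit pyspark my_script.py --cluster=my-cluster --region=us-central1 --async

    gcloud dataproc jobs wait JOB_ID --region=us-central1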

Submitting and Reviewing a PySpark Job

Before we begin, we have to prepare the environment as seen here:

  • Enable the Dataproc API:

    gcloud services enable dataproc.googleapis.com

  • Create a Cloud Storage bucket:

    gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data

  • Create the Dataproc cluster:

    gcloud dataproc clusters create wordcount --region=us-central1 --zone=us-central1-f --single-node --master-machine-type=n1-standard-2

  • Download the PySpark script:

    gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .

After preparing the environment, it is time to submit the job as seen here:


gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt gs://$DEVSHELL_PROJECT_ID-data/output/
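
The wordcount.py script comes from the resources bucket downloaded earlier and is not reproduced here, but a minimal PySpark word count that takes the same two arguments (input text file and output folder) could look roughly like this:

    import sys
    from pyspark.sql import SparkSession

    # Job arguments passed after "--" above: input file and output folder
    input_path, output_path = sys.argv[1], sys.argv[2]

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Split each line into words, count the occurrences, and save the result
    lines = spark.read.text(input_path).rdd.map(lambda row: row[0])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(output_path)

    spark.stop()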

Then, we can review the output by downloading the output files as seen here:

gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .

Additionally, our experts suggest deleting the Dataproc cluster when it is no longer needed. We can do this from the web console by selecting Dataproc under BIG DATA, choosing the cluster, and clicking DELETE.
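
Alternatively, the same cleanup can be done from the command line:

gcloud dataproc clusters delete wordcount --region=us-central1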

If our PySpark script relies on external modules, we can ship them with the `--py-files` flag (or `--files` for auxiliary data files):

gcloud dataproc jobs submit pyspark /run/script.py --cluster=clustername --region=regionname --py-files=/lib/lib.py

This ensures seamless imports within your script.
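
As a quick sketch, suppose the hypothetical helper module lib.py defines a small function; once it is shipped with `--py-files`, the main script imports it like any local module:

    # lib.py -- hypothetical helper module shipped with --py-files
    def clean(word):
        return word.strip().lower()

    # script.py -- the main job script submitted as PY_FILE
    from pyspark.sql import SparkSession
    from lib import clean  # resolvable because lib.py is distributed with the job

    spark = SparkSession.builder.appName("pyfiles-example").getOrCreate()
    words = spark.sparkContext.parallelize([" Foo", "BAR "]).map(clean)
    print(words.collect())  # ['foo', 'bar']
    spark.stop()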

[Need assistance with a different issue? Our team is available 24/7.]

Conclusion

In brief, our Support Experts demonstrated how to use the gcloud Dataproc Jobs submit PySpark command.

