PySpark MongoDB Pipeline | Setup Tutorial

This tutorial will explain how to set up the PySpark-MongoDB pipeline easily. Bobcares, as part of our Server Management Services, offers solutions to every query that comes our way.

How to set up the PySpark-MongoDB pipeline?

Apache Spark is an open-source distributed computing platform and collection of tools for real-time processing of massive data sets, and PySpark is its Python API.

In order to set up the PySpark-MongoDB pipeline, we have to run the following steps (a complete pipeline.py sketch follows the list):

  • Firstly, install PySpark using pip with the below code (make sure Python 3.6 or higher is installed on the system):
    pip install pyspark
  • Now install the PyMongo driver; the srv extra is needed for mongodb+srv Atlas connection strings:
    pip install "pymongo[srv]"
    (The MongoDB Spark connector itself is a JVM package, so it is not installed with pip; it is pulled in later through the SparkSession configuration.)
  • Install MongoDB on the computer or use an existing server instance (for example, a MongoDB Atlas cluster).
  • Then start the MongoDB server.
  • Make a new Python script and open it in a text editor.
  • Now import the required PySpark modules:
    from pyspark.sql import SparkSession
  • Then create a SparkSession. The spark.jars.packages setting pulls in the MongoDB Spark connector from Maven (version 3.0.1 built for Scala 2.12 here; use the build that matches the Spark installation):
    spark = SparkSession.builder \
        .appName("PySpark MongoDB Pipeline") \
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
        .getOrCreate()
  • Now set up the MongoDB connection details:
    mongodb_uri = "mongodb+srv://<username>:<password>@<cluster-url>/<database>.<collection>?retryWrites=true&w=majority"

    Replace username & password with the MongoDB Atlas cluster credentials.

    Replace cluster-url with the URL of the MongoDB Atlas cluster.

    Replace database & collection with the required database and collection names.

  • Load data from MongoDB into a DataFrame:
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .load()
  • Now carry out transformations on the DataFrame, such as data cleaning and filtering. For example, keep only the documents where age is greater than 30:
    transformed_df = df.filter(df["age"] > 30)
  • Write the transformed data back to MongoDB (mode("overwrite") replaces the current contents of the target collection):
    transformed_df.write.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .mode("overwrite") \
        .save()
  • Save the script and exit the text editor.
  • Now open the terminal and go to the directory containing the pipeline.py script.
  • Finally, run the script using the below spark-submit command (the connector can also be supplied here with the --packages flag instead of in the SparkSession config):
    spark-submit pipeline.py
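
Putting the pieces together, pipeline.py would look roughly like the sketch below. This is a minimal sketch, not a drop-in script: the connector version, the age field used in the filter, and the URI placeholders are assumptions carried over from the steps above.

    # pipeline.py -- end-to-end sketch of the steps above
    from pyspark.sql import SparkSession

    # Atlas connection string; fill in the placeholders before running.
    mongodb_uri = "mongodb+srv://<username>:<password>@<cluster-url>/<database>.<collection>?retryWrites=true&w=majority"

    # spark.jars.packages downloads the MongoDB Spark connector
    # (a JVM package) the first time the job runs.
    spark = SparkSession.builder \
        .appName("PySpark MongoDB Pipeline") \
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
        .getOrCreate()

    # Read the source collection into a DataFrame.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .load()

    # Example transformation: keep only documents where age > 30.
    transformed_df = df.filter(df["age"] > 30)

    # Write the result back, replacing the collection's current contents.
    transformed_df.write.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .mode("overwrite") \
        .save()

    spark.stop()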

Keep an eye on the job while it runs and look out for any errors or warnings in the Spark output.
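
If the job cannot reach the database at all, a quick connectivity check with PyMongo (installed earlier) narrows the problem down before Spark is involved. A minimal sketch, assuming the same URI placeholders as above:

    # check_connection.py -- verify the URI and credentials work
    from pymongo import MongoClient

    mongodb_uri = "mongodb+srv://<username>:<password>@<cluster-url>/?retryWrites=true&w=majority"

    client = MongoClient(mongodb_uri, serverSelectionTimeoutMS=5000)
    try:
        # "ping" is a lightweight server command; it fails quickly if the
        # host, credentials, or Atlas network access list are misconfigured.
        client.admin.command("ping")
        print("MongoDB connection OK")
    except Exception as exc:
        print(f"MongoDB connection failed: {exc}")
    finally:
        client.close()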

[Need to know more? We are just a click away.]

Conclusion

The article explains the steps to set up a PySpark-MongoDB pipeline.
