
PySpark MongoDB Pipeline | Setup Tutorial

Jul 8, 2023

This tutorial explains how to set up a PySpark-MongoDB pipeline easily. Bobcares, as part of our Server Management Services, offers solutions to every query that comes our way.

How to set up the PySpark-MongoDB pipeline?

Apache Spark is an open-source distributed computing platform and collection of tools for real-time, large-scale data processing, and PySpark is its Python API.

In order to set up the PySpark-MongoDB pipeline, we have to run through the following steps:

  • Firstly, install PySpark using pip with the command below (make sure Python 3.6 or higher is installed on the system):
    pip install pyspark
  • Now install pymongo, the Python driver for MongoDB (the MongoDB connector for Spark itself is a Java package, which is pulled in at submit time in the final step):
    pip install "pymongo[srv]"
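
    To confirm the Python packages installed cleanly, a quick check like the one below can be run; it verifies only the Python imports, not the Spark connector:

    import pyspark
    import pymongo

    print(pyspark.__version__, pymongo.version)
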
  • Install MongoDB on the computer or, if one is already available, use an existing server instance (a MongoDB Atlas cluster also works).
  • Then start the MongoDB server.
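
    Before wiring up Spark, a small pymongo check like the following can confirm the server is reachable; the localhost URI is an assumption, so substitute the Atlas URI if that is what is being used:

    from pymongo import MongoClient

    # Assumed local instance; replace the URI as needed.
    client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=5000)
    client.admin.command("ping")  # raises an exception if the server is unreachable
    print("MongoDB is up")
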
  • Make a new Python script (for example, pipeline.py, which the later steps assume) and open it in a text editor.
  • Now import the required PySpark modules:
    from pyspark.sql import SparkSession
  • Then create a SparkSession:
    spark = SparkSession.builder \
        .appName("PySpark MongoDB Pipeline") \
        .getOrCreate()
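
    The MongoDB Spark connector can also be requested here rather than on the spark-submit command line shown in the final step; the connector coordinates below are an example for a Scala 2.12 build of Spark and may need adjusting:

    spark = SparkSession.builder \
        .appName("PySpark MongoDB Pipeline") \
        .config("spark.jars.packages",
                "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
        .getOrCreate()
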
  • Now set up the MongoDB connection details:
    mongodb_uri = "mongodb+srv://<username>:<password>@<cluster-url>/<database>.<collection>?retryWrites=true&w=majority"

    Replace <username> and <password> with the MongoDB Atlas cluster credentials.

    Replace <cluster-url> with the URL of the MongoDB Atlas cluster.

    Replace <database> and <collection> with the required database and collection names.
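
    To keep credentials out of the script itself, the URI can also be assembled from environment variables; the variable names and the db.people namespace below are hypothetical:

    import os

    # Hypothetical environment variables; export them in the shell before running.
    mongodb_uri = (
        f"mongodb+srv://{os.environ['MONGO_USER']}:{os.environ['MONGO_PASS']}"
        f"@{os.environ['MONGO_CLUSTER']}/db.people?retryWrites=true&w=majority"
    )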

  • Load data from MongoDB into a DataFrame:
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .load()
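
    A quick look at what was loaded helps confirm the read worked:

    df.printSchema()
    df.show(5)
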
  • Now carry out transformations such as data cleaning and filtering on the DataFrame (this example assumes the documents have an age field):
    transformed_df = df.filter(df["age"] > 30)
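
    A slightly fuller sketch, still assuming the collection has name and age fields:

    from pyspark.sql import functions as F

    transformed_df = (
        df.filter(F.col("age") > 30)
          .withColumn("age_group",
                      F.when(F.col("age") >= 60, "senior").otherwise("adult"))
          .select("name", "age", "age_group")
    )
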
  • Write the transformed data back to MongoDB (mode("overwrite") replaces the collection's existing contents; use "append" to add to them instead):
    transformed_df.write.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .mode("overwrite") \
        .save()
  • Save the script and exit the text editor.
  • Now open the terminal and go to the directory containing the pipeline.py script.
  • Finally, run the script using the spark-submit command below. The --packages option downloads the MongoDB Spark connector; the version shown (for a Scala 2.12 build of Spark) is an example and may need adjusting:
    spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 pipeline.py
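
    Put together, a minimal pipeline.py under the assumptions above (connector supplied at submit time, documents with an age field) might look like this:

    from pyspark.sql import SparkSession

    # Placeholder Atlas-style URI; fill in real credentials and namespace.
    mongodb_uri = "mongodb+srv://<username>:<password>@<cluster-url>/<database>.<collection>?retryWrites=true&w=majority"

    spark = SparkSession.builder \
        .appName("PySpark MongoDB Pipeline") \
        .getOrCreate()

    # Read the collection into a DataFrame.
    df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .load()

    # Example transformation: keep documents where age > 30.
    transformed_df = df.filter(df["age"] > 30)

    # Write the result back, replacing the collection's contents.
    transformed_df.write.format("com.mongodb.spark.sql.DefaultSource") \
        .option("uri", mongodb_uri) \
        .mode("overwrite") \
        .save()

    spark.stop()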

Keep an eye on the process while it runs and look out for any issues or warnings in the spark-submit output.

[Need to know more? We are just a click away.]

Conclusion

The article explains the steps to set up a PySpark-MongoDB pipeline, from installing the packages to running the finished script with spark-submit.

