Apache Nutch Solr Integration – The way we do it
Trying to integrate Apache Nutch with Solr? We’ll help you.
An efficient site search can help a lot in growing your business.
Pairing a web crawler like Apache Nutch with the Solr search platform delivers fast search results.
At Bobcares, we install advanced search solutions as part of our Server Management Services.
Today, we’ll see how we help our customers with Apache Nutch Solr integration.
What is Apache Nutch and Solr?
To begin with, let’s get an idea of Apache Nutch and Solr.
Apache Nutch is an open-source, highly extensible web crawler. It periodically browses websites on the internet and creates an index of their content.
Likewise, Apache Solr is a powerful, fast search engine. It comes with features like full-text search and automated failover. Additionally, Solr can work alongside MongoDB database servers, which enables search over a MongoDB-based datastore.
Why do we need Apache Nutch Solr integration?
It’s time now to see the need for Apache Nutch Solr integration.
Recently, one of our customers was trying to build a search tool on his website. He wanted a quick way to search across his public pages.
These pages included his public website assets, as well as the developer guides and documentation. He wanted to feed the search tool with a couple of his website links.
How we do Apache Nutch Solr integration
Let’s check how our Support Engineers did this on the customer’s AWS server, which was running Ubuntu 16.04 LTS.
The setup involved multiple steps. We’ll now check them one by one.
Installing Java dependency
Nutch is written in Java. Therefore, as the first step, we install Java on the server. For this, as the root user, we install OpenJDK using:
apt-get install openjdk-8-jdk
Then, we confirm the working of Java.
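One way to confirm this is to check that the java binary is on the PATH and print its version. The snippet below is a minimal sketch; the java_status variable is only an illustrative name and is not part of the original setup.

```shell
# Check that a Java runtime is on PATH and report its version.
if command -v java >/dev/null 2>&1; then
    # java prints its version banner on stderr, so redirect it to stdout.
    java -version 2>&1 | head -n 1
    java_status="found"
else
    echo "java not found on PATH; re-run the OpenJDK install" >&2
    java_status="missing"
fi
```

If the first line of output does not mention version 1.8, another JDK is probably taking precedence on the PATH.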
Adding packages MongoDB and Nutch
Moving on, we install the MongoDB server. This works as the database server that stores the crawl data.
The exact steps include downloading and extracting the MongoDB package. We then change to the MongoDB directory and start the server.
wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
mkdir data logs
tar xvfz mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
cd mongodb-linux-x86_64-ubuntu1604-3.4.7/bin
./mongod --dbpath ~/data/ --logpath ~/logs/mongodb.log --fork
Similarly, we add the Nutch package.
wget https://downloads.apache.org/nutch/2.4/apache-nutch-2.4-src.tar.gz
tar xvfz apache-nutch-2.4-src.tar.gz
cd apache-nutch-2.4/conf
Next, we configure the Nutch package.
For this, we edit the file at apache-nutch-2.4/conf/nutch-site.xml. Here we define the crawldb database driver, enable plugins, and set the crawling behavior so that the crawl stays within the specific domain.
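The article does not show the file itself, so the following is only a sketch of what such a nutch-site.xml might contain. The property names are the ones Nutch 2.x is generally documented to use; the agent name, the plugin selection, and the CONF_DIR scratch directory are assumptions made for illustration.

```shell
# CONF_DIR is a hypothetical scratch directory; point it at
# apache-nutch-2.4/conf once the values have been reviewed.
CONF_DIR="${CONF_DIR:-./nutch-conf-example}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Store crawl data in MongoDB through Apache Gora -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
  <!-- Crawler name sent in HTTP requests (placeholder value) -->
  <property>
    <name>http.agent.name</name>
    <value>ExampleSiteCrawler</value>
  </property>
  <!-- Plugins to load (illustrative selection) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|more)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- Do not follow links that leave the seeded hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
EOF
```

Setting db.ignore.external.links to true is one common way to keep the crawl on the seeded hosts; a URL filter rule is an alternative.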
Moving on, we instruct Nutch to use MongoDB via the apache-nutch-2.4/conf/gora.properties file. Nutch uses Apache Gora to manage persistence. Since MongoDB runs on the same server, we specify the parameters as:
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutchdb
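To double-check that the values were saved as intended, a small helper can read a key back from the properties file. This is a sketch only: get_prop is a hypothetical helper name, and the example runs against a scratch copy of the three settings rather than the real gora.properties.

```shell
# Print the value stored under key $2 in properties file $1.
# If the key appears more than once, the last occurrence wins.
get_prop() {
    awk -F= -v key="$2" '
        $1 == key { sub(/^[^=]*=/, ""); value = $0; found = 1 }
        END { if (found) print value }
    ' "$1"
}

# Example with a scratch copy of the three settings.
props_file="$(mktemp)"
cat > "$props_file" <<'EOF'
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutchdb
EOF

get_prop "$props_file" gora.mongodb.servers
get_prop "$props_file" gora.mongodb.db
```

Pointing get_prop at apache-nutch-2.4/conf/gora.properties instead of the scratch file checks the live configuration.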
Also, we uncomment the MongoDB line in the file
<!-- Uncomment this to use MongoDB as Gora backend. -->
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />
Then, we build Nutch using Apache Ant.
cd /home/ubuntu/apache-nutch-2.4
ant runtime
That completes the Nutch install.
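A quick sanity check after the build is to confirm that the nutch launcher script exists under runtime/local/bin, which is the path used for indexing later in this article. The helper below is a sketch; check_nutch_build is a hypothetical name, not a Nutch command.

```shell
# Report whether the Ant build produced an executable nutch launcher.
# $1 is the Nutch source directory, e.g. /home/ubuntu/apache-nutch-2.4.
check_nutch_build() {
    nutch_home="$1"
    if [ -x "$nutch_home/runtime/local/bin/nutch" ]; then
        echo "ok: $nutch_home/runtime/local/bin/nutch"
    else
        echo "missing: $nutch_home/runtime/local/bin/nutch" >&2
        return 1
    fi
}
```

For example, check_nutch_build /home/ubuntu/apache-nutch-2.4 returns a non-zero status if the build did not finish.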
Next, it’s time to set up the Solr package. Here, we download and install the package first. Then we create a core named nutch.
wget http://archive.apache.org/dist/lucene/solr/8.4.1/solr-8.4.1.tgz
tar xvfz solr-8.4.1.tgz
cd solr-8.4.1/bin
./solr start
./solr create_core -c nutch -d basic_configs
./solr stop
Next, we remove the managed-schema and copy the schema.xml file.
cp ~/apache-nutch-2.4/conf/schema.xml .
In this schema.xml, we remove all instances of
Likewise, we fix the solrconfig.xml file. After that, we check the connection settings in the PHP config file of the search function and confirm that they are correct.
$cfg['solr']['host'] = 'xx.yy.xx.188';
$cfg['solr']['port'] = '8983';
$cfg['solr']['path'] = '/solr';
$cfg['solr']['core'] = 'nutch';
Set up crawl and index
Moving on, we set up the URLs that Nutch has to crawl. We edit and add the links in the file
Then we inject the URLs using Nutch, which adds them to the MongoDB database. Further, we fetch and parse the data.
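The article does not list the exact commands for this step, so the following is a sketch of one crawl round using the subcommands that Nutch 2.x is generally documented to provide (inject, generate, fetch, parse, updatedb). The seed directory name urls, the -topN value of 50, and the run and crawl_round helpers are assumptions for illustration.

```shell
# NUTCH_BIN and SEED_DIR are assumptions; adjust them to the actual install.
NUTCH_BIN="${NUTCH_BIN:-runtime/local/bin/nutch}"
SEED_DIR="${SEED_DIR:-urls}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One crawl round: seed the URLs, select a batch, fetch it, parse it,
# and write the results back to the web table.
crawl_round() {
    run "$NUTCH_BIN" inject "$SEED_DIR" &&
    run "$NUTCH_BIN" generate -topN 50 &&
    run "$NUTCH_BIN" fetch -all &&
    run "$NUTCH_BIN" parse -all &&
    run "$NUTCH_BIN" updatedb -all
}
```

Setting DRY_RUN=1 before calling crawl_round prints the five commands without running them, which is a safe way to review the sequence first. Each step stops the round if the previous one fails.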
As the next step, we index the pages in Solr using:
runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all
That’s it. Now the data can be viewed from the Solr admin console.
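The indexed documents can also be counted from the command line through Solr's select handler. This is a minimal sketch that assumes Solr is listening on localhost:8983 with the nutch core created earlier; solr_count_query and solr_count are hypothetical helper names.

```shell
# Base URL of the core; an assumption matching the solrindex command above.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/nutch}"

# Build a select URL that matches every document but returns no rows,
# so the response carries only the numFound count.
solr_count_query() {
    printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

# Run the query; requires curl and a running Solr instance.
solr_count() {
    curl -s "$(solr_count_query "$SOLR_URL")"
}
```

Calling solr_count prints a JSON response whose numFound field should be greater than zero once indexing has succeeded.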
[Do you need help in creating a custom search with Apache Nutch and Solr? We are available 24×7!]
In short, Apache Nutch Solr integration helps to fetch search results quickly. Today, we saw how our Support Engineers set up Nutch with MongoDB and Solr on an AWS server.