Apache Nutch Solr Integration

Please Note: This article is part of our historical archive. Because it was published a while ago, some of the information, links, or context may now be outdated.

Trying to integrate Apache Nutch with Solr? We can help.

An efficient site search can help a lot in growing your business.

Combining a web crawler like Apache Nutch with the Solr search platform delivers search results quickly.

At Bobcares, we install advanced search solutions as part of our Server Management Services.

Today, we’ll see how we help our customers with Apache Nutch Solr integration.

What is Apache Nutch and Solr?

To begin with, let’s get an idea of Apache Nutch and Solr.

Apache Nutch is an open-source, highly extensible web crawler. It periodically crawls websites and builds an index of their content.

Likewise, Apache Solr is a fast, powerful search platform. It comes with features like full-text search, automated failover, etc. Solr can also index data held in MongoDB, which adds search capabilities to a MongoDB-based datastore.

Why do we need Apache Nutch Solr integration?

It’s time now to see the need for Apache Nutch Solr integration.

Recently, one of our customers was trying to build a search tool on his website. He wanted quick search across his public pages.

These pages included his public website assets, as well as the developer guides and documentation. He wanted to feed the search tool with a couple of his website links.

How we do Apache Nutch Solr integration

Let’s check how our Support Engineers did this on the customer’s AWS server. It had Ubuntu 16.04 LTS running.

The setup involved multiple steps. We’ll now check them one by one.

Installing Java dependency

Nutch is written in Java. Therefore, as the first step, we install Java on the server. For this, as the root user, we install OpenJDK using:

apt-get install openjdk-8-jdk

Then, we confirm the working of Java.
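The article does not show the check itself. A minimal sketch of one way to confirm Java, assuming only that the `java` binary should be on the PATH (the function name `java_version_line` is made up for this example):

```shell
# Report the first line of the Java version banner,
# or a notice if Java is not on the PATH.
java_version_line() {
    if command -v java >/dev/null 2>&1; then
        # java prints its version banner on stderr, so merge it into stdout
        java -version 2>&1 | head -n 1
    else
        echo "java not found on PATH"
    fi
}

java_version_line
```

If the notice is printed instead of a version string, the OpenJDK install most likely did not complete.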

Adding packages MongoDB and Nutch

Moving on, we install the MongoDB server. This works as the database server that stores the crawl data.

The exact steps include downloading and extracting the MongoDB package. We then change to the MongoDB directory and start the server.

wget https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
mkdir -p ~/data ~/logs
tar xvfz mongodb-linux-x86_64-ubuntu1604-3.4.7.tgz
cd mongodb-linux-x86_64-ubuntu1604-3.4.7/bin
./mongod --dbpath ~/data/  --logpath ~/logs/mongodb.log --fork

Similarly, we add the Nutch package.

wget https://downloads.apache.org/nutch/2.4/apache-nutch-2.4-src.tar.gz 
tar xvfz apache-nutch-2.4-src.tar.gz
cd apache-nutch-2.4/conf

Configuring Nutch

Next, we configure the Nutch package.

For this, we edit the file apache-nutch-2.4/conf/nutch-site.xml. Here we define the crawl database driver, enable plugins, and set the crawling behavior so that the crawl stays within the specific domain.
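The article does not list the exact properties used. As a rough sketch, a nutch-site.xml covering those three points could be generated as below. The agent name `ExampleSiteCrawler`, the plugin list, and the scratch output directory are assumptions for illustration; the values should be checked against conf/nutch-default.xml before copying the file into apache-nutch-2.4/conf/.

```shell
# Write a sample nutch-site.xml into a scratch directory (an assumption;
# review it before copying over apache-nutch-2.4/conf/nutch-site.xml).
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"

cat > "$OUT_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Store crawl data in MongoDB through Apache Gora -->
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.mongodb.store.MongoStore</value>
  </property>
  <!-- Crawler name sent in the User-Agent header (placeholder value) -->
  <property>
    <name>http.agent.name</name>
    <value>ExampleSiteCrawler</value>
  </property>
  <!-- Plugins for fetching, parsing and Solr indexing (illustrative list) -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  </property>
  <!-- Do not follow links that lead to other hosts -->
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
  </property>
</configuration>
EOF

echo "Wrote $OUT_DIR/nutch-site.xml"
```

Setting `db.ignore.external.links` to true is one common way to keep a crawl on the seeded hosts; URL filter rules are an alternative.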

Moving on, we instruct Nutch to use MongoDB. Nutch uses Apache Gora to manage persistence, so we make this change in the apache-nutch-2.4/conf/gora.properties file. Since MongoDB runs on the same server, we specify the parameters as:

gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutchdb
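Before building, it can help to confirm that the properties file really contains the MongoDB settings. The sketch below is a hypothetical helper (`check_gora_props` is not part of Nutch or Gora). It also looks for `gora.datastore.default`, which the steps above do not mention, on the assumption that it should point to the MongoDB store as well.

```shell
# Return 0 and print "ok" if every expected key is set in the given file;
# otherwise print the first missing key and return 1.
check_gora_props() {
    props_file="$1"
    for key in gora.datastore.default gora.mongodb.servers gora.mongodb.db; do
        if ! grep -q "^${key}=" "$props_file"; then
            echo "missing: $key"
            return 1
        fi
    done
    echo "ok"
}

# Demonstration against a throwaway sample file
sample="$(mktemp)"
printf '%s\n' \
    'gora.datastore.default=org.apache.gora.mongodb.store.MongoStore' \
    'gora.mongodb.servers=localhost:27017' \
    'gora.mongodb.db=nutchdb' > "$sample"
check_gora_props "$sample"
```

On the real server, the same function can be pointed at apache-nutch-2.4/conf/gora.properties.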

Also, we uncomment the MongoDB line in the file apache-nutch-2.4/ivy/ivy.xml:

<!-- Uncomment this to use MongoDB as Gora backend. -->
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.6.1" conf="*->default" />

Then, we build Nutch using Apache Ant.

cd /home/ubuntu/apache-nutch-2.4
ant runtime

That completes the Nutch install.

Install Solr

Now, it’s time to set up the Solr package. Here, we download and extract the package first. Then we start Solr and create a core named nutch.

wget http://archive.apache.org/dist/lucene/solr/8.4.1/solr-8.4.1.tgz
tar xvfz solr-8.4.1.tgz
cd solr-8.4.1/bin
./solr start
./solr create_core -c nutch -d _default
./solr stop

Next, in the conf directory of the nutch core, we remove the managed-schema file and copy in the schema.xml file that ships with Nutch.

cp ~/apache-nutch-2.4/conf/schema.xml .

In this schema.xml, we remove all instances of enablePositionIncrements="true".
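Removing the attribute by hand is error-prone when it appears many times. One possible way to do it with sed is sketched below; the function name `strip_position_increments` is made up for this example, and the demo runs on a temporary file rather than the real schema.

```shell
# Remove every enablePositionIncrements="true" attribute (and the spaces
# before it) from the given file, keeping the original as <file>.bak.
strip_position_increments() {
    sed -i.bak 's/ *enablePositionIncrements="true"//g' "$1"
}

# Demonstration on a throwaway file
demo="$(mktemp)"
echo '<filter class="solr.StopFilterFactory" ignoreCase="true" enablePositionIncrements="true"/>' > "$demo"
strip_position_increments "$demo"
cat "$demo"
# prints: <filter class="solr.StopFilterFactory" ignoreCase="true"/>
```

The `.bak` copy makes it easy to compare the result with the original before restarting Solr.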

Likewise, we fix the solrconfig.xml file. Finally, we check the connection settings in the PHP config file of the search function and confirm that they are correct.

$cfg['solr']['host'] = 'xx.yy.xx.188';
$cfg['solr']['port'] = '8983';
$cfg['solr']['path'] = '/solr';
$cfg['solr']['core'] = 'nutch';

Setup crawl and index

Moving on, we set up the URLs that Nutch has to crawl. We add the links to the file ~/apache-nutch-2.4/urls/seeds.txt.

Then we inject the URLs using Nutch. This adds the seed data to the MongoDB database. After that, we fetch and parse the pages.
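The article does not show the exact inject, fetch, and parse commands. A minimal sketch of one crawl round with the Nutch 2.x command-line tool might look like the following. The batch size of 50, the seed directory name, and the dry-run wrapper are assumptions added for illustration. By default the script only prints the commands; set DRY_RUN=0 to actually run them.

```shell
NUTCH="${NUTCH:-runtime/local/bin/nutch}"   # Nutch launcher built by ant
SEED_DIR="${SEED_DIR:-urls}"                # directory holding seeds.txt
TOP_N="${TOP_N:-50}"                        # hypothetical batch size

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# One full crawl round: inject, generate, fetch, parse, update.
crawl_round() {
    run "$NUTCH" inject "$SEED_DIR"        # load seed URLs into the store
    run "$NUTCH" generate -topN "$TOP_N"   # pick URLs for the next batch
    run "$NUTCH" fetch -all                # download the generated batch
    run "$NUTCH" parse -all                # extract text and outlinks
    run "$NUTCH" updatedb -all             # merge results back into the store
}

crawl_round
```

Repeating the round lets Nutch follow links discovered in earlier rounds, so deeper pages get picked up over time.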

As the next step, we index the pages in Solr using

runtime/local/bin/nutch solrindex http://localhost:8983/solr/nutch -all

That’s it. The indexed data can now be viewed from the Solr admin console.
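Besides the admin console, a quick command-line check is to ask the core how many documents it holds. This is a sketch that assumes Solr listens on localhost:8983 and the core is named nutch, as in the steps above; the helper name `count_url` is made up for this example.

```shell
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"
CORE="${CORE:-nutch}"

# Build a select URL that matches all documents but returns no rows,
# so the response carries only the numFound count.
count_url() {
    echo "$SOLR_URL/$CORE/select?q=*:*&rows=0"
}

# Query Solr if curl is available; otherwise just print the URL to open.
if command -v curl >/dev/null 2>&1; then
    curl --silent --fail --max-time 5 "$(count_url)" \
        || echo "Solr not reachable at $SOLR_URL"
else
    count_url
fi
```

A `numFound` value greater than zero in the response suggests that the indexing step pushed documents into the core.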

Conclusion

In short, Apache Nutch Solr integration helps to fetch search results quickly. Today, we saw how our Support Engineers set up Nutch on an AWS server.
