Nutch1-Solr5 Integration, Searching the Web

Now we are going to integrate our web crawler with our search server to get our complete search engine solution.

Platform: Linux Ubuntu

Note: before following these guidelines make sure you have followed the quick tutorial on Solr and Nutch, otherwise this won’t make much sense.

So, we want to post the crawled data from Nutch (our crawler) to Solr for indexing and subsequent searching.

1. Start the Solr server.
First make sure the Solr Server is up and running.

bin/solr start

2. Create a new Solr core
Let’s create a core that we will use to store and index the Nutch-crawled data. (Note: there are underscores in the core name; sometimes they don’t show up on WordPress…)

bin/solr create -c nutch_solr_data_core

3. Modifying the core schema settings.

Solr creates our new core with the default managed-schema settings: e.g. field types (int, long, date …), field names (my_int, my_long …) and other settings.
However, Nutch posts data with slightly different field types and settings, so we have to modify the schema configuration file. The instructions in the Nutch tutorial only got me so far (I hit several errors at core initialization), so I dug in a bit and modified the file until Solr was happy and things “appear” to be working.

Backup the current schema file.

cd /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf
mv managed-schema solr_managed-schema.bak

Now download the modified schema.xml file and put it in the configuration directory.
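For orientation, here is the general shape of the fields Nutch expects the core to have. This excerpt is illustrative only (the field names follow Nutch’s stock conf/schema.xml, the types are approximate); use the full downloaded file, not this snippet:

```xml
<!-- Illustrative excerpt: fields Nutch populates at indexing time -->
<field name="url"     type="string"       stored="true"  indexed="true"/>
<field name="title"   type="text_general" stored="true"  indexed="true"/>
<field name="content" type="text_general" stored="true"  indexed="true"/>
<field name="host"    type="string"       stored="false" indexed="true"/>
<field name="digest"  type="string"       stored="true"  indexed="false"/>
<field name="boost"   type="float"        stored="true"  indexed="false"/>
<field name="tstamp"  type="date"         stored="true"  indexed="false"/>
```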

4. Modifying the Solr configuration file.

Now, because we are providing our own schema.xml file, Solr needs to know that we don’t want it to auto-generate its managed schema file; we want it to load our core from our modified schema instead.

Backup the current solrconfig file.

cd /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf
mv solrconfig.xml solrconfig.xml.bak

Now download the modified solrconfig.xml file and put it in the configuration directory.
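The key change for Solr 5 here is telling it to read the classic schema.xml from disk instead of managing the schema itself. If you inspect the modified solrconfig.xml, the relevant setting should look like this (and any update-processor chain that auto-adds unknown fields to the managed schema has to go as well):

```xml
<!-- Load schema.xml from the conf directory instead of a managed schema -->
<schemaFactory class="ClassicIndexSchemaFactory"/>
```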

5. Restart the Solr server
Now restart the server for changes to take effect.

cd /opt/solr-5.2.1
bin/solr restart

Now you can open the Solr admin UI in the browser to make sure Solr restarted with no errors or core initialization exceptions.

6. Post Nutch data to Solr for Indexing
Now we are going to take our previously fetched data from Nutch and post it to Solr.
cd /opt/apache-nutch-1.10
bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core \
    crawl/crawldb/ -linkdb crawl/linkdb/ $s1

If all works well there shouldn’t be any errors and indexing should complete successfully.
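A quick way to confirm documents actually landed in the index is to ask Solr for the total document count and read numFound out of the JSON response. The helper below is just a sketch (the function name is mine, not part of Nutch or Solr):

```shell
# num_found: pull the "numFound" count out of a Solr JSON response on stdin.
# (Hypothetical helper; assumes the wt=json response format.)
num_found() {
  sed -n 's/.*"numFound":\([0-9]*\).*/\1/p'
}

# Usage against the live core (uncomment to run):
# curl -s "http://localhost:8983/solr/nutch_solr_data_core/select?q=*:*&rows=0&wt=json" | num_found
```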

7. Querying your new Index.
Now let’s actually use Solr’s search functionality to search our index.

curl "http://localhost:8983/solr/nutch_solr_data_core/select?wt=json&indent=true&q=foundation"
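The query string parameters can get fiddly to retype; here is a tiny sketch of a helper that assembles the select URL used above (the function name is my own, not part of Solr, and it assumes Solr on localhost:8983 with a URL-safe query term):

```shell
# solr_select_url CORE QUERY: print the Solr select URL for a core and query.
# (Hypothetical helper; does not URL-encode the query term.)
solr_select_url() {
  printf 'http://localhost:8983/solr/%s/select?wt=json&indent=true&q=%s' "$1" "$2"
}

# Usage:
# curl "$(solr_select_url nutch_solr_data_core foundation)"
```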

8. Deleting duplicates
Once everything is indexed, we must get rid of duplicate URLs, which ensures that every URL in the index is unique.

bin/nutch dedup http://localhost:8983/solr/nutch_solr_data_core

9. Cleaning Solr Index
This job scans the crawldb directory looking for entries with status DB_GONE (404) and sends delete requests to Solr for those documents. Once Solr receives the requests, the documents are duly deleted. This maintains a healthier Solr index.

bin/nutch solrclean crawl/crawldb/ \
    http://localhost:8983/solr/nutch_solr_data_core

10. Putting all the commands together.
Now, if you want to inject the seeds, generate a fetch list, fetch content, parse it, invert links, and post the data to Solr for indexing in one command…
The guys at Nutch already thought of it:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch_solr_data_core urls/ crawl/  2   
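Under the hood, bin/crawl runs roughly the same generate/fetch/parse/updatedb loop we did by hand in the earlier tutorial. As a sketch, one round of that loop could be scripted like this (the DRY_RUN switch and the helper names are my own additions for illustration, not part of Nutch):

```shell
#!/bin/sh
# One round of the manual Nutch crawl loop that bin/crawl automates.
# Set DRY_RUN=1 to print the commands instead of executing them.
NUTCH=${NUTCH:-bin/nutch}

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi
}

crawl_round() {
  run "$NUTCH" generate crawl/crawldb crawl/segments -topN 1000
  # newest segment directory produced by generate
  seg=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
  run "$NUTCH" fetch "$seg"
  run "$NUTCH" parse "$seg"
  run "$NUTCH" updatedb crawl/crawldb "$seg"
}

# After the rounds: invert links and index into Solr, e.g.
# run "$NUTCH" invertlinks crawl/linkdb -dir crawl/segments
# run "$NUTCH" solrindex http://localhost:8983/solr/nutch_solr_data_core \
#     crawl/crawldb/ -linkdb crawl/linkdb/ "$seg"
```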

This concludes our series of three very quick tutorials on Solr, Nutch and both working together.

I have to say, I am sure there are typos here and there. If you find one, let me know so others can benefit as well 🙂

Cheers, that was not bad!


8 thoughts on “Nutch1-Solr5 Integration, Searching the Web”

  1. Thank you for these tutorials. I had a hard time finding the info I needed when moving to Nutch 1.11. I’ve got everything running now, except my core has no documents in it.
    From the nutch directory, I run bin/nutch solrindex \ http://localhost:8983/solr/nutch_solr_data_core \ crawl/crawldb/ -linkdb crawl/linkdb/ $s1

    It appears to work well, but no documents make their way into the nutch_solr_data_core according to Solr Admin’s Core Admin.

    Searching on any term doesn’t bring back any results.

    The only thing that looks like an error message to me is: java.io.IOException: No FileSystem for scheme: http

    Can you point me in the right direction? I’m not looking for you to fix my problem. You’ve already been so helpful. Just need a nudge.

    • GM Same thing with me – I don’t see any error, but no documents are returned. I did an echo $s1 and it appears not set. I think I had set it a couple of tutorials ago, but I guess I need to find out how to set it again.

      Anyway – here’s what happened for me:

      steve@quark:/usr/local/nutch/framework/apache-nutch-1.6$ sudo bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1
      SolrIndexer: starting at 2016-07-08 00:15:45
      SolrIndexer: deleting gone documents: false
      SolrIndexer: URL filtering: false
      SolrIndexer: URL normalizing: false
      SolrIndexer: finished at 2016-07-08 00:15:55, elapsed: 00:00:10

  2. : Could not load conf for core nutch_solr_data_core: Error loading solr config from /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf/solrconfig.xml

    Please help out

  3. Thank you Camilo,

    for this best hint, I could find anywhere, where Nutch/Solr integration is discussed.
    Nutch worked before, Solr worked before, but your guide above is the required glue in files and in process.

    After making Nutch 1.12 work with Solr 4.10.4, as documented elsewhere,
    and based on your tips, it worked with Solr 5.5.2.
    And finally I was able to make that work also with Solr 6.1.0 today on OS X El Capitan.

    What I did:
    download and extract nutch 1.12
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
    echo $JAVA_HOME
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    s1=$(ls -d crawl/segments/2* | tail -1)
    echo $s1
    bin/nutch fetch $s1
    bin/nutch parse $s1
    bin/nutch updatedb crawl/crawldb $s1
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s2=$(ls -d crawl/segments/2* | tail -1)
    echo $s2
    bin/nutch fetch $s2
    bin/nutch parse $s2
    bin/nutch updatedb crawl/crawldb $s2
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s3=$(ls -d crawl/segments/2* | tail -1)
    echo $s3
    bin/nutch fetch $s3
    bin/nutch parse $s3
    bin/nutch updatedb crawl/crawldb $s3
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    download and extract solr 6.1.0
    bin/solr start
    bin/solr create -c foo
    -> this created “foo” in $solr_home/server/solr/ with conf inside
    -> following your process above, I replaced the files in foo/conf with those downloaded from your links above
    bin/solr restart
    bin/nutch solrindex http://127.0.0.1:8983/solr/foo crawl/crawldb -linkdb crawl/linkdb $s3 -filter -normalize

    -> documents could be queried successfully within Solr or simply in the browser:
    http://localhost:8983/solr/foo/select?q=$querystring

    Thanks again 🙂

  4. Got stuck at step 6 and ended up doing this to get everything working:

    Copy the conf/schema.xml file from Nutch into your Solr core conf folder (e.g. “cp /opt/apache-nutch-1.13/conf/schema.xml /opt/solr-6.5.1/server/solr/nutch_solr_data_core/conf/schema.xml”)
    Restart Solr.
    If it’s throwing some errors along the lines of “Plugin init failure for….”, edit the new schema.xml and remove all instances of enablePositionIncrements=”true”. Restart Solr again.
    From the main Apache Nutch directory, run the following command:
    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1

    Notes:
    * The command “solrindex” is deprecated, hence why I’m using the “index” command above.
    * If the indexer says the job failed, make sure your URL does not contain the #, verify the port number and the core name, and try using full paths to make sure.
    * If you have write.lock errors, verify the ownership and permissions of the nutch_solr_data_core directory. After fixing those, delete the nutch_solr_data_core/data/index/write.lock file and then restart Solr.
