Nutch1-Solr5 Integration, Searching the Web

Blog moved to new address.

8 thoughts on “Nutch1-Solr5 Integration, Searching the Web”

  1. Thank you for these tutorials. I had a hard time finding the info I needed when moving to Nutch 1.11. I’ve got everything running now, except my core has no documents in it.
    From the nutch directory, I run: bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1

    It appears to work well, but no documents make their way into the nutch_solr_data_core according to Solr Admin’s Core Admin.

    Searching on any term doesn’t bring back any results.

    The only thing that looks like an error message to me is: java.io.IOException: No FileSystem for scheme: http

    Can you point me in the right direction? I’m not looking for you to fix my problem. You’ve already been so helpful. Just need a nudge.

    • GM Same thing with me – I don’t see any error, but no documents are returned. I did an echo $s1 and it appears not set. I think I had set it a couple of tutorials ago, but I guess I need to find out how to set it again.

      Anyway – here’s what happened for me:

      steve@quark:/usr/local/nutch/framework/apache-nutch-1.6$ sudo bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1
      SolrIndexer: starting at 2016-07-08 00:15:45
      SolrIndexer: deleting gone documents: false
      SolrIndexer: URL filtering: false
      SolrIndexer: URL normalizing: false
      SolrIndexer: finished at 2016-07-08 00:15:55, elapsed: 00:00:10
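
      An empty $s1 means the command above runs without a segment path, which would plausibly explain an indexer run that finishes cleanly but adds no documents. A minimal sketch for setting it again, assuming the crawl/segments layout used in this thread (latest_segment is a hypothetical helper name, not a Nutch command):

```shell
# Hypothetical helper: print the newest segment directory under the given path.
# Segment directories are named by timestamp, so lexical order is chronological.
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | tail -1
}

# Assumes the current directory is the Nutch directory used for the crawl.
s1=$(latest_segment crawl/segments)

if [ -z "$s1" ]; then
  echo "No segment found under crawl/segments; run generate/fetch/parse first." >&2
else
  echo "Using segment: $s1"
fi
```

      Once echo $s1 prints a segment path, the solrindex command can be rerun in the same shell session.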

  2. : Could not load conf for core nutch_solr_data_core: Error loading solr config from /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf/solrconfig.xml

    Please help out

  3. Thank you Camilo,

    This is the best guide I could find anywhere that Nutch/Solr integration is discussed.
    Nutch worked before, Solr worked before, but your guide above is the required glue, both in files and in process.

    First I got Nutch 1.12 working with Solr 4.10.4 in the usual way, as documented elsewhere.
    Based on your tips, it then worked with Solr 5.5.2.
    And finally, today I was able to make it work with Solr 6.1.0 as well, on OS X El Capitan.

    What I did:
    download and extract nutch 1.12
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
    echo $JAVA_HOME
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    s1=$(ls -d crawl/segments/2* | tail -1)
    echo $s1
    bin/nutch fetch $s1
    bin/nutch parse $s1
    bin/nutch updatedb crawl/crawldb $s1
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s2=$(ls -d crawl/segments/2* | tail -1)
    echo $s2
    bin/nutch fetch $s2
    bin/nutch parse $s2
    bin/nutch updatedb crawl/crawldb $s2
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    s3=$(ls -d crawl/segments/2* | tail -1)
    echo $s3
    bin/nutch fetch $s3
    bin/nutch parse $s3
    bin/nutch updatedb crawl/crawldb $s3
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    download and extract solr 6.1.0
    bin/solr start
    bin/solr create -c foo
    (This created “foo” in $solr_home/server/solr/, with a conf directory inside.)
    Following your process above, I replaced the files in foo/conf with the ones downloaded from your links.
    bin/solr restart
    bin/nutch solrindex http://127.0.0.1:8983/solr/foo crawl/crawldb -linkdb crawl/linkdb $s3 -filter -normalize

    Result: the documents could be queried successfully within Solr, or simply in a browser:
    http://localhost:8983/solr/foo/select?q=$querystring
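
    To check whether any documents reached the core at all, a match-all query with rows=0 returns only the count. A rough sketch, assuming the core name “foo” and the default port from the steps above (solr_count_url is a hypothetical helper name):

```shell
# Hypothetical helper: build a match-all query URL for a core.
# rows=0 asks Solr for the count only, without returning any documents.
solr_count_url() {
  printf '%s/select?q=*:*&rows=0&wt=json\n' "$1"
}

url=$(solr_count_url http://localhost:8983/solr/foo)
echo "$url"

# With Solr running, fetch it and look at "numFound" in the response:
#   curl -s "$url"
```

    A numFound of 0 right after indexing suggests the indexer sent nothing, for example because the segment variable was empty.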

    Thanks again 🙂

  4. Got stuck at step 6 and ended up doing this to get everything working:

    Copy the conf/schema.xml file from Nutch into your Solr core conf folder (e.g. cp /opt/apache-nutch-1.13/conf/schema.xml /opt/solr-6.5.1/server/solr/nutch_solr_data_core/conf/schema.xml)
    Restart Solr.
    If it’s throwing some errors along the lines of “Plugin init failure for….”, edit the new schema.xml and remove all instances of enablePositionIncrements="true". Restart Solr again.
    From the main Apache Nutch directory, run the following command:
    bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1

    Notes:
    * The command “solrindex” is deprecated, which is why I’m using the “index” command above.
    * If the indexer says the job failed, make sure your URL does not contain the #, verify the port number and the core name, and try using full paths to make sure.
    * If you have write.lock errors, verify the ownership and permissions of the nutch_solr_data_core directory. After fixing those, delete the nutch_solr_data_core/data/index/write.lock file and then restart Solr.
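
    The schema edit described above (removing every enablePositionIncrements="true") can also be done non-interactively. A minimal sketch, assuming the attribute appears exactly in that form; strip_position_increments is a hypothetical helper name, and the path in the usage comment is the one from the cp example:

```shell
# Hypothetical helper: remove every enablePositionIncrements="true" attribute
# from a schema file, keeping the original as <file>.bak.
strip_position_increments() {
  schema=$1
  cp "$schema" "$schema.bak" || return 1
  # Also drop the whitespace before the attribute so no double spaces remain.
  sed 's/[[:space:]]*enablePositionIncrements="true"//g' "$schema.bak" > "$schema"
}

# Usage (then restart Solr):
#   strip_position_increments /opt/solr-6.5.1/server/solr/nutch_solr_data_core/conf/schema.xml
```

    Writing from the backup copy rather than using sed -i keeps the command portable between GNU and BSD sed.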
