Nutch1 Quick Tutorial, Learning to Crawl

Here is a quick hands-on tutorial to gain some familiarity with Apache Nutch 1.x (a web crawler), as well as for my own reference just in case my memory bails on me :).

Note that this is not intended to be an official guide or anything formal per se; I am not discussing the architecture, implementation details, or the history of the project (for all details regarding Nutch, refer to the official Nutch wiki).

Platform: Linux Ubuntu

1. Download and extract Apache Nutch 1.x
Go to the Apache Nutch website and download Apache Nutch 1.10 (the bin.tar.gz archive); we want the binary version, not the source.

Now extract the compressed archive into your /opt/ directory (you can use tar if you would like, or the file manager; it doesn’t matter).
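
For example, assuming the archive was downloaded to ~/Downloads and kept its default name (adjust the path and file name to match your download):

sudo tar -xzf ~/Downloads/apache-nutch-1.10-bin.tar.gz -C /opt/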

2. Check your Nutch Install
In the terminal, cd to the /opt/apache-nutch-1.10 directory. Now check your installation by running the nutch binary:

cd /opt/apache-nutch-1.10
bin/nutch

You should get the Nutch version and a short usage summary of the nutch command.

3. Set up your Java home.
Note: the location of your Java home is system-dependent (but this worked for me).
To set the environment variable temporarily, run:

export JAVA_HOME=/usr

If you want to make it persistent over login sessions, edit your ~/.profile file and add the line:

export JAVA_HOME=/usr
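
If /usr does not work on your system, one way to find out where your JDK actually lives (the exact path will vary):

readlink -f "$(which java)"

This prints something like /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java; strip the trailing bin/java (and jre, if present) and you have a value you can use for JAVA_HOME instead of /usr.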

4. Set up your crawl properties.
Change directory to /opt/apache-nutch-1.10/conf and modify the nutch-site.xml file.

cd /opt/apache-nutch-1.10/conf
nano nutch-site.xml

Add your agent name property to the XML so external servers can identify your crawler.

<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>
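
The property has to sit inside the existing <configuration> element, so after the edit nutch-site.xml should look roughly like this (the header lines may differ slightly in your copy):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>My Nutch Spider</value>
    </property>
</configuration>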

5. Create a seed list, a list of URLs to crawl.
Create a new urls directory and a text file containing a single URL per line.

mkdir -p urls
cd urls
nano seed.txt

Add the URL you would like to crawl (I will use the Nutch website as a valid example); add the following line to the seed.txt file:
https://nutch.apache.org
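
If you prefer to skip the editor, the same seed file can be created in one go from the /opt/apache-nutch-1.10 directory:

mkdir -p urls
echo "https://nutch.apache.org" > urls/seed.txt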

6. Limit your crawling scope, configure regular expression filters.
Now we would like to limit our crawling to our chosen URL and any other URLs within that domain only (i.e., we don’t want to do a whole-web crawl).

We need to edit the file conf/regex-urlfilter.txt

cd /opt/apache-nutch-1.10/conf
nano regex-urlfilter.txt

Now replace the last line and its comment:

# accept anything else
+.

with

# Limit our search to this domain and any pages within it.
+^https://([a-z0-9]*\.)*nutch\.apache\.org/

NOTE: If you do not restrict the domains in regex-urlfilter.txt, every domain linked from your seed URLs will be crawled as well.
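
You can sanity-check the pattern outside of Nutch with grep -E (the URLs below are just illustrative strings): the first command prints the URL because it matches, while the second prints nothing because the host is outside our domain.

echo "https://nutch.apache.org/downloads.html" | grep -E '^https://([a-z0-9]*\.)*nutch\.apache\.org/'
echo "https://www.apache.org/" | grep -E '^https://([a-z0-9]*\.)*nutch\.apache\.org/'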

7. Seed the Nutch crawl database with URLs
We can use the nutch injector to add our seed list to the crawl database.

cd /opt/apache-nutch-1.10/
bin/nutch inject crawl/crawldb urls
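
To verify that the seed URL made it into the database, you can print some statistics from the crawldb (the exact counters shown depend on your Nutch version):

bin/nutch readdb crawl/crawldb -stats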

8. Generate a fetch list
Now we need to generate a list of pages to be fetched from our domain URLs. You can imagine that each domain URL might link to a large number of pages; we need to gather all of these into a list before we can start crawling. Do the following:

bin/nutch generate crawl/crawldb crawl/segments -topN 10

This generates a fetch list for all of the pages due to be fetched. We are limiting the number of pages in our list to 10 (to keep the example short and simple), but you can remove this argument if you want to list all of the pages in the domain.
The fetch list is placed in a newly created segment directory, whose name is the timestamp of when it was created. For convenience, we will set a shell variable to keep track of this segment; for example, in my case:

s1=crawl/segments/20150813112625

and check the variable’s value:

echo $s1
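
Alternatively, you can let the shell pick the newest segment for you instead of copying the timestamp by hand (this assumes all segment names start with a 2, i.e. a 21st-century timestamp):

s1=$(ls -d crawl/segments/2* | tail -1)
echo $s1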

9. Start fetching content.
Now that we have our list of pages ready, let’s start fetching content from our newly created segment.

bin/nutch fetch $s1

and parse all of the fetched entries:

bin/nutch parse $s1

Finally, update our crawl database with the fetched results:

bin/nutch updatedb crawl/crawldb $s1
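
A single generate/fetch/parse/updatedb pass usually only covers the pages in the first fetch list; to crawl deeper, simply repeat the cycle, since each pass picks up the links discovered by the previous one. A sketch of a second round (the -topN value is again just an example):

bin/nutch generate crawl/crawldb crawl/segments -topN 10
s2=$(ls -d crawl/segments/2* | tail -1)
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2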

10. Prepare the database for indexing: invert links
Before indexing we first invert all of the links, so that we may index incoming anchor text with the pages.

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
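
If you are curious about what the inversion produced, Nutch also ships a reader for the link database; this dumps it as plain text into an output directory (linkdump is just an arbitrary name here):

bin/nutch readlinkdb crawl/linkdb -dump linkdump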

Great!
We just crawled a website with Apache Nutch, fun 🙂

The next step is to index our data using Apache Solr; for that, make sure you have already gone through the Solr tutorial and then refer to the quick tutorial on Nutch-Solr integration.

If there are any errors/typos or if I missed something, post a comment or send me an e-mail and let me know. Thanks!

Reference for this post: Official Nutch Tutorial

6 thoughts on “Nutch1 Quick Tutorial, Learning to Crawl”

  1. Hi, I have a situation where I need to integrate Solr and Nutch with Drupal 7.
    I have integrated solr-4.10.4 with Drupal 7 through a module, and the search operation works fine with the Apache Solr Search module available in Drupal 7. The point is to fetch the hyperlinks that are available on the page, and for that I found Apache Nutch to be a good fit. I have configured Solr in Drupal with the following files changed from the Drupal module: 1) schema.xml 2) solrconfig.xml 3) protwords.xml.

    How do I connect solr-4.10.4, nutch-1.12, and Drupal 7? Kindly help with this.

    • Hi Karthik, unfortunately I am no longer working on this project and I cannot help you. But have you thought about joining the mailing lists of the Solr and Nutch projects? There you can probably find people who can help you.
