
Crawling in Open Source, Part 1
This is the first of a two part series of articles that will
focus on Open Source web crawlers implemented in Java programming
language. The goal is familiarize user in some basic concepts of
crawling and also dig deeper into some implementations such as Apache
Nutch and Apache Droids. This first part covers the generic part as
well as Apache Nutch.
Table of Contents
In this article I will give you a short introduction to crawling
in general and then move on to Apache Nutch, its history, and
architecture, and explanations of its core processing steps and
MapReduce functions at a very technical level. After reading this
article readers should be somewhat familiar with the basic crawling
concepts and core MapReduce jobs in Nutch.
A Web Crawler is a computer program that usually discovers and
downloads content from the web via an HTTP protocol. The discovery
process of a crawler is usually simple and straightforward. A crawler is
first given a set of URLs, often called seeds. Next the crawler goes and
downloads the content from those URLs and then extracts hyperlinks or
URLs from the downloaded content. This is exactly the same thing that
happens in the real world when a human is interfacing with a web browser
and clicks on links from a homepage, and pages that follow, one after
another.
When a web crawler discovers new URLs to follow, it then goes to
fetch the content from those URLs - just like a normal browser fetches
the content when links on a web page are clicked. The download part is
even simpler. In the simplified case of HTTP, a network connection to
the host is opened, an HTTP request is sent, an HTTP response is read
from the wire and then the network connection is closed. A crawler might
also support other means of getting the content, making it capable of
finding and fetching content from various different places like local or
shared file systems, ftp servers, mail boxes etc.
Generally speaking, crawlers are used to find and bring in the
content that interests you. Often the reason is that you want to make
that information searchable by indexing it. The use case and context for
crawling can vary greatly depending on the task at hand. For example the
most popular web search engines, like Google and Yahoo, deploy a web
crawler that primarily finds content that is publicly available via the
web. The main challenge for such a generic web crawler is obviously
scalability. The crawler architecture needs to be efficient because the
amount of content to crawl is enormous and growing all the time at ever
increasing speeds.
Another example, of a completely different kind of use case, is a
price comparison site where users are presented with products and their
price information from various online shops. To gather such information
one could implement a crawler that continuously re-crawls content from a
predefined set of web sites, and is mainly interested in just a few
specific URLs from those sites.
A crawler in an enterprise search context possesses yet another
set of features important for that context. Usually, an enterprise
crawler needs to understand and extract information from various
different document formats and access those documents from many
different places like mail boxes, CRM systems, and content management
systems. Understanding many different document formats plays an
important role.
And there are many other examples of crawling needs that lie in
between these examples that each require a slightly different set of
requirements for the crawler to fulfill.
There are a few features or characteristics that are very
important for a crawler. The most important single feature every crawler
should contain is a sense of politeness. Since crawling content from web
sites uses resources from the target server, like network bandwidth and
processor time, it is mandatory to limit the frequency a crawler
accesses particular content or else the target server would soon become
inaccessible to real users. Owners of a website may also want to expose
only some part of the available content to web crawlers. For such cases
there exists a method called Robots Exclusion Standard that gives the
owner of a web site the means to control what and how often something is
fetched from the server. A polite crawler respects such rules.
Politeness is not just following the rules, but respecting a site's
resources even when such rules are not explicitly specified by limiting
the frequency of your crawler's visits.
The most common components of a crawler include a queue, fetcher,
extractor and content repository. The queue contains URLs to be fetched.
It may be a simple memory based, first in, first out queue, but usually
it's more advanced and consists of host based queues, a way to
prioritize fetching of more important URLs, an ability to store parts or
all of the data structures on disk and so on. The fetcher is a component
that does the actual work of getting a single piece of content, for
example one single HTML page. The extractor is a component responsible
for finding new URLs to fetch, for example by extracting that
information from an HTML page. The newly discovered URLs are then
normalized and queued to be fetched. The content repository is a place
where you store the content. Later the content is processed and finally
indexed. I am not going to go into more detail about crawler
architectures and implementation details, but will instead take a closer
look at one open source java crawler implementation: Apache Nutch.
Nutch was originally implemented by Doug Cutting and Michael
Cafarella et al. in around 2002. The goal was to make Nutch a web scale
crawler and search application capable of fetching billions of URLs per
month, maintain an index of these pages and allow searching of that
index 1000 times per second. The original design was proven to scale up
to 100 million documents but became impractical with problems of
maintenance at such a scale. During 2004, after the incubation process,
Nutch became part of Apache. Soon after Google published its MapReduce
paper in 2004, the Nutch architecture was rewritten to take advantage of
MapReduce, and a new data storage system called
NutchDistributedFileSystem (NDFS) was implemented. This new architecture
allowed Nutch to be run on a large cluster of machines, making the
architecture scalable both in processing power and data storage. Later
both the MapReduce execution framework and NDFS were promoted to a top
level Apache project called Apache Hadoop. Hadoop solves much of the
maintenance issues Nutch had in the pre-MapReduce era. After setting up
a Hadoop cluster you can control Nutch crawling from one single machine
despite the size of the cluster you are running Nutch on.
Nutch as it exists today is still pretty much an application that
helps you to build a generic web search engine. It supports fetching
content with various protocols such as HTTP, HTTPS, FTP and local file
system. Nutch can also extract textual content from several document
formats like HTML, RSS, ATOM, PDF, ms formats (doc, excel, ppt), etc
right out of the box. Nutch is not for everyone as some quite low level
skills are still required to run a crawl and maintain the
indexes.
Running Nutch consists of executing several commands from the
shell in sequence. Those commands each launch one or more MapReduce jobs
that are executed by Hadoop. Next I will walk you through some of the
most important commands and how they are constructed as MapReduce jobs
and tasks. I'll use Python-like pseudocode to describe the functions in
simplified examples for illustration purposes, but this should still
capture the most essential aspects of each job. Before going into the
individual commands,the core terms and components that I will refer to
later in this article need to be explained. The Crawl Database is a data
store where Nutch stores every URL, together with the metadata that it
knows about. In Hadoop terms it's a Sequence file (meaning all records
are stored in sequential manner) consisting of tuples of URL and
CrawlDatum. Many other data structures in Nutch are of similar structure
and no relational databases are used. The rationale behind this kind of
data structure is scalability. The model of simple, flat data storage
works well in a distributed environment. Each node gets a one or more
split of the whole data and operates on that (The Map phase of
MapReduce). Data can be stored inside the Hadoop Distributed File System
so nodes can access the splits from the nearest host that contains a
replica of it. Operations (like inserts, deletes and updates) in Crawl
Database and other data are processed in batch mode. Here is an example
of the contents of crawldb:
http://www.example.com/page1.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ... http://www.example.com/page2.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ... http://www.example.com/page3.html -> status=..., fetchTime=..., retries=..., fetchInterval=..., ...
The Fetch List is a data structure (SequenceFile,
URL->Crawldatum) that contains the URL, crawldatum tuples that are
going to be fetched in one batch, usually the Fetch list contents are a
subset of CrawlDB that was created by the generate command. FetchList is
stored inside Segment.
Segment is basically a folder containing all the data related to
one fetching batch. Besides the Fetch List, the fetched content itself
will be stored there in addition to the extracted plain text version of
the content, anchor texts and URLs of outlinks, protocol and document
level metadata etc.
The Link Database is a data structure (Sequence file, URL ->
Inlinks) that contains all inverted links. In the parsing phase Nutch
can extract outlinks from a document and store them in format source url
-> target_url,anchor_text. In the process of inversion we invert the
order and combine all instances making the data records in Link Database
look like: targetURL -> anchortext[] text so we can use that
information later when individual documents are indexed.
IThe Inject command in Nutch has one responsibility: inject more
URLs into Crawl Database. Normally you should collect a set of URLs to
add and then process them in one batch to keep the time of a single
insert small.
Job1: Convert plain text into URL,CrawlDatum tuples and dedupe
MAP:
in: <lineNumber: Long>,<lineOfText:Text>
out: <url:Text>,<CrawlDatum>
def map(lineNumber, lineOfText, collector):
url=normalize(lineOfText, scope_inject)
url=filter(url)
if url:
collector.collect(url, getCrawlDatum(url))
REDUCE:
in: <url:Text>,<CrawlDatum>
out: <url:Text>,<CrawlDatum>
def reduce(url, values, collector):
preserved = pickValue(values)
preserved.setStatus(unfetched)
output.collect(key, preserved)
Job2: Merge with existing CrawlDB, dedupe
MAP:
in: <url:Text>,<CrawlDatum>
out: <url:Text>,<CrawlDatum>
def map(url, datum):
if normalize:
url=normalize(url)
if filter:
url = filter(url, scope_crawldb)
if url:
collector.collect(url, datum)
REDUCE: in: <url:Text>,<CrawlDatum> out: <url:Text>,<CrawlDatum> def reduce(url, values, collector): preserved = pickValue(values) output.collect(key, preserved)
The Generate command in Nutch is used to generate a list of URLs
to fetch from Crawl Database URLs with the highest scores are
preferred.
Job1: Select potential URLs to fetch. Stop collecting globally
once the topN is met. Stop collecting per host if host limit is set at
the set limit.
COMPARATOR: DecreasingFloatComparator, sorts entries by decreasing score.
Highest scoring entries at top.
MAP:
in: <url:Text>,<CrawlDatum>
out: <score:float><SelectorEntry>
def map(url, datum, collector):
if filtering:
filter(url)
if shouldfetch(url):
entry.setDatum(datum)
collector.output(score, entry)
PARTITION:
def getPartition():
return url.getHost()%numReducers
REDUCE:
in: <score:FloatWritable, entry: SelectorEntry>
out: <score:FloatWritable, entry: SelectorEntry> dir <hadoop_temp>/temp-file
def reduce(key, values, collector):
for entry in values:
if collectcount < topN/numreducers:
finalUrl = normalize(url)
if hostCount(finalUrl.getHost()) < maxPerHost:
incHostCount(finalUrl.getHost())
incCollectCount()
collector.output(key, entry)
Job2: Randomize the order of Fetch List. Partition URLs so that
URLs from one host will be on the same partition.
MAP:
in: <score:FloatWritable><SelectorEntry>
out: <url:Text><SelectorEntry>
def map(url, entry, collector):
collector.collect(entry.getUrl, entry)
PARTITION:
def getPartition():
return url.getHost()%numReducers
COMPARATOR:
HostHashComparator, sorts entries by hash of URL so URLs in fetchlist are in semi randomized order (if they were sorted by URL all URLs from same host would be available to fetcher in sequel).
REDUCE:
in: <url:Text, entry:SelectorEntry>
out <url:Text, datum: CrawlDatum> dir <segment-dir>/crawl_generate
def reduce(key, values, collector):
for entry in values:
collector.output(entry.getUrl(), entry)
Fetcher is responsible for fetching content from URLs and writing
them to disk. It also optionally parses the content. URLs are read from
a Fetch List generated by Generator.
MAP:
in: <url:Text, SelectorEntry> dir: <segment-dir>/crawl_generate
out: <url:Text, CrawlDatum> dir: <segment-dir>/crawl_fetch
out: <url:Text, Content> dir: <segment-dir>/content
optional (if parsing is enabled):
out: <url:Text, text: ParseText> dir: <segment-dir>/parse_text
out: <url:Text, data: ParseData> dir: <segment-dir>/parse_data
out: <url:Text, datum: CrawlDatum> dir: <segment-dir>/crawl_parse
def map(url, entry, collector):
collector.collect(url, entry.crawldatum)
content = fetch(url)
collector.collect(url, entry)
if parsing:
parsed = parse(content)
collector.collect(url, parsed)
Parser reads raw fetched content, parses it and stores the results.
MAP: in: <url, content:Content> dir = <segment_dir>/content out: <url:Text, parsedContent:ParseText> def map(url, content, collector): parses = parse(content) collector.collect(url, parsed) REDUCE: in: <url:Text, Iterator<parseImpl>> out: <url:Text, text:ParseText> dir = <segment_dir>/parse_text out: <url:Text, data:ParseData> dir = <segment_dir>/parse_data out: <url:Text, datum:CrawlDatum> dir = <segment_dir>/crawl_parse def reduce(url, parseImpl,collector): collector.collect(url,content.next())
The UpdateDB command reads the CrawlDatums from Segment (extracted
URLs) and merges them to the existing CrawlDB.
MAP:
in: <url: Text, datum:CrawlDatum> dir = crawldb_dir/current
in: <url: Text, datum:CrawlDatum> dir = segment_dir/crawl_fetch
in: <url: Text, datum:CrawlDatum> dir = segment_dir/crawl_parse
out: <url: Text, datum:CrawlDatum> dir = crawldb_dir/<tmp_name>
def map(url, datum, collector):
if normalize:
url = normalize(url, scope_normalize)
if filter:
url = filter(url, scope_normalize)
collector.collect(url,datum)
REDUCE:
in: <url:Text, Iterator<datum:CrawlDatum>
out: <url:Text, datum:CrawlDatum>
def reduce(url, datums, collector):
datum = merge(datums)
collector.collect(url,datum)
Inverts link information so we can use anchor texts from other
documents that point to a document together with the rest of the
document data.
Job1: Invert link data from source,<target, anchortext> to:
<target,anchortext>
in: <fromUrl:Text, parseData:ParseData> dir = <segment_dir>/parse_data
out:<targetUrl:Text, inLinks: InLinks>
def map(url, parseData):
if filter:
url=filter(url, scope_linkdb)
if normalize:
url=normalize(url, scope_linkdb)
for target in parseData.getOutLinks():
collector.collect(targetUrl,
collector.collect(target.getToUrl(), target.getAnchorText())
COMBINE:
in: <targetUrl:Text, inLinks: InLinks>
out:<targetUrl:Text, inLinks: InLinks>
def reduce(toUrl, inLinksIterator):
newInLinks = []
for inLinks in inLinksIterator:
if len(inLinks)<maxInLinks:
addAll(newInLInks, inLinks)
collector.collect(toUrl, newInLinks)
REDUCE (same as combiner):
in: <targetUrl:Text, inLinks: InLinks>
out:<targetUrl:Text, inLinks: InLinks>
def reduce(toUrl, inLinksIterator):
newInLinks = []
for inLinks in inLinksIterator:
if len(inLinks)<maxInLinks:
addAll(newInLInks, inLinks)
collector.collect(toUrl, newInLinks)
Job2: Merge additions with existing LinkDB:
MAP
in: <targetUrl:Text, inLinks: InLinks>
out:<targetUrl:Text, inLinks: InLinks>
def map(toUrl, anchortexts, collector):
if normalize:
toUrl=normalize(toUrl)
if filter:
toUrl=filter(toUrl)
if toUrl:
collector.collect(toUrl, anchortexts)
REDUCE
in: <targetUrl:Text, inLinks: InLinks>
out:<targetUrl:Text, inLinks: InLinks>
def reduce(toUrl, anchortexts, collector):
collector.collect(toUrl, collectall(anchortexts))
I have now gone through the core MapReduce jobs in Nutch that are
related to crawling. There are also many other data processing jobs in
Nutch like indexing, page scoring calculation, removing duplicate
content from index etc. The best way to get familiar with the remaining
algorithms to look at the following resources (the link is in in
references). In the next article of this series I will take a look at
other open source crawlers.

Post new comment