nlpcrawl: a suite of tools for crawling the Internet

Introduction [top]

nlpcrawl is a "suite of tools [that works] asynchronously in order to crawl the web for relevant content". nlpcrawl is developed at the University of Latvia in order to efficiently crawl web pages and perform natural language operations on the downloaded content. The crawler software is licensed in full under the permissive 3-clause BSD license. For more information on the underlying language research, please contact Dr Guntis Bārzdiņš, guntis at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.

The intent of nlpcrawl is to provide a portable, robust, and simple set of tools to acquire mark-up data (i.e., not images, binary documents, or other "rich media") and keep it fresh. The system is heavily tuned for per-language crawling, or in other words, crawling pages in a particular language.

The nlpcrawl suite of tools is being developed by Kristaps Džonsons, kristaps dot dzonsons at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.

Site administrators: if your site is being abused by nlpcrawl, please contact us with the offending source IP address. The crawler User-Agent field is set to nlpcrawl-x.y.z, where x.y.z correspond to the version.

Some features of the system follow:

  • archival of any modified pages (temporarily removed)
  • a simple monitoring-mode and one-line startup command (temporarily removed)
  • conformance to the Robots Exclusion Standard: the first edition and IETF draft with non-standard extensions (Koster, 1994; Koster, 1996), and per-page mark-up designations
  • page-fetch politeness policy: per-physical-host tunable politeness with sane defaults (5 seconds, 1 page per burst), augmented by per-virtual-host crawl delays; see the sketch after this list
  • name-resolution politeness policy: per-server concurrent fetch limits, politeness times, and run-time aging cache (5 seconds, 32 requests per burst, one-hour maximum age)
  • path ascendancy: tunable link path-ascendancy (Cothey, 2004)
  • language selection: language selection is enforced at several stages, from character sets to dictionary matching, with an emphasis on early enforcement
  • standards compliance: the system makes constant use of a variety of standards, from URI construction to HTTP headers (Berners-Lee, et al, 2005; Fielding, et al, 1999; Berners-Lee, et al, 1996)
  • multi-threaded operation in I/O-bound areas (page fetching, name resolution, database reading and writing)
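
Concretely, the page-fetch politeness above can be pictured as a gate that remembers when each physical host was last contacted and refuses a new fetch until the configured interval has elapsed. The following is a minimal sketch, assuming a fixed 5-second delay and a small in-memory host table; the names and data structures are illustrative only, not those of nlpcrawl.

/*
 * Illustrative per-host politeness gate: a fetch against a physical
 * host is permitted only if at least POLITE_DELAY seconds have passed
 * since the last fetch from that host.  All identifiers here are
 * hypothetical, not taken from nlpcrawl.
 */
#include <string.h>
#include <time.h>

#define POLITE_DELAY 5          /* seconds between fetches per physical host */
#define MAX_HOSTS    1024       /* size of the toy host table */

struct hostslot {
        char    addr[64];       /* textual physical (IP) address */
        time_t  lastfetch;      /* time of the last permitted fetch */
};

static struct hostslot hosts[MAX_HOSTS];
static size_t          nhosts;

/* Return non-zero if a fetch from `addr' may proceed right now. */
int
polite_may_fetch(const char *addr)
{
        time_t now = time(NULL);
        size_t i;

        for (i = 0; i < nhosts; i++)
                if (strcmp(hosts[i].addr, addr) == 0) {
                        if (now - hosts[i].lastfetch < POLITE_DELAY)
                                return 0;       /* too soon: stay polite */
                        hosts[i].lastfetch = now;
                        return 1;
                }

        /* First contact with this host: record it and allow the fetch. */
        if (nhosts < MAX_HOSTS) {
                strncpy(hosts[nhosts].addr, addr, sizeof(hosts[nhosts].addr) - 1);
                hosts[nhosts].lastfetch = now;
                nhosts++;
                return 1;
        }
        return 0;       /* table full: defer rather than risk impoliteness */
}

The per-burst page limit and per-virtual-host crawl delays mentioned in the list are further refinements of the same idea and are omitted here for brevity.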

 

Status [top]

This project has been used to crawl medium volumes (several million pages, over twenty gigabytes). The database is not very stable and tends to lock up from time to time. Instructions on handling corruption may be found in nlpcrawl(1). Please monitor the system while it's in use (I suggest using -m with nlpcd(1)).

 

News [top]

15-08-2007: considerable updates (version 0.1.32) to almost every sub-system. Uses multiple databases (to avoid very slow insert/delete operations), much smaller record size, many bug-fixes, etc. You'll almost certainly want to upgrade. Note that the databases are not backward-compatible; scans must start afresh.

25-06-2007: a new tool, nlpc-dump(1), archives verified pages to a gzip'd tar file.

19-06-2007: the system has been tuned for large databases. Fetches are now in bulk, and the page size has been increased, which leads to significantly faster throughput in multi-gigabyte databases.

07-06-2007: minor fixes to optimise throughput. I strongly suggest using the newest version of libcurl: the development environment uses the stable version found on the web-site (7.16.2). Older versions will cause strange errors. Note that a release may be forthcoming to address archive usability. nlpc-vrfy(1) and nlpc-scan(1) are now multi-threaded.

06-06-2007: built-in archival, in-line REP processing, a standalone tool (nlpcd(1)) for simpler execution, and a monitor-mode for nlpc-fetch(1). This concludes feature development: future builds, until stable 1.0.0, will be bug-fixes found during extensive use.

30-05-2007: system now has a NLPCDBR_READY state and a corresponding utility, nlpc-resolve(1), which resolves host names to addresses separately from nlpc-fetch(1). These were separated in order to allow better grouping by server address (politeness) and to allow more control over the resolution sequence. The system also sports REP conformance, including some non-standard extensions. Significant internal re-writing has occurred within nlpc-fetch(1) in order to accommodate per-physical-address queues and per-host REP policies.

25-05-2007: fetcher is now multi-threaded for significantly greater performance, politeness policy fixed (was not polite), "parent" field culled from database, charset and langlen reduced to 32 bytes (from 64 bytes)

21-05-2007: optional path-ascendancy, politeness policy implemented, signal handling routines modified to be simpler (no execution from signal stack)

03-05-2007: bug-fixes, stability, and documentation updates. Small optimisations to the database. Statistic collection is active but a coherent tool for extraction is not yet available. The system now regularly collects over ten gigabytes of data during each [test] run.

28-04-2007: all basic elements in place and thoroughly tested. Stable release candidate tagged. Focus will now change to testing, then writing a controlling utility and a scan-window auto-optimisation tool.

23-04-2007: significant code has been re-written in order to scale upward properly. The nlpc-fetch(1), nlpc-scan(1), and nlpc-vrfy(1) utilities are routinely run in parallel to collect gigabyte-range data. Development focus is on database stability (i.e., deadlocking). All utilities but these and nlpc-list(1) are on hold until the first three components stabilise.

29-03-2007: nlpc-scan(1) now fully scans databases and extracts addresses. This utility is heavily standards-compliant (RFC 2396, see manual). One may freely use a combination of nlpc-scan(1) and nlpc-fetch(1).

24-03-2007: nlpc-fetch(1) updated to recognise expiration; a new utility, nlpc-list(1), that lists database entities; and the nlpc-up(1) utility is at its first version

22-03-2007: nlpc-fetch(1) should work and nlpc-scan(1) may be used to seed a database

13-03-2007: pre-release peek at design documentation

 

Documentation [top]

The Unix manuals are, and will continue to be, the canonical sources for system documentation. Data in these manuals override anything you may read on this web page.

In order to compile, you'll need the following libraries installed (tested versions are parenthesised):

  • libdb4: for the database store (4.5)
  • libcurl: for page downloading (7.16.2)
  • libcares: for name resolution (1.3.2)
  • libxml: for mark-up parsing (2.6.27)
  • libtar: for creating dumps (1.2.11)
  • zlib: for compressing dumps (1.2.3)

Note that these all use BSD- or MIT-derivative licenses. Edit the Makefile to system-dependent values, then execute make and make install. nlpcrawl should compile and run in any Unix environment. The development platform is Debian GNU/Linux "etch" (amd64).

 

Strategy [top]

A general view of the state transitions and their correspondence to the individual utilities follows:

[Figure: utility state machine (inter-utility diagram)]
[Figure: record state machine (inter-state diagram)]

The nlpcrawl tools follow this state machine in order to optimise crawling for language-specific pages. The state machine supports a sequence of pruning steps for language-specific pages under the following hypothesis: that language-specific texts generally link to other pages in the same language. Although the ultimately correct approach would be to crawl all pages and then analyse languages over the downloaded corpus, this is infeasible given the Internet's scope. Given the state machine above, the language-enforcement sequence follows.

  1. nlpc-fetch(1): check content encoding and language as received from the content server (before downloading content). Only allow matching entities.
  2. nlpc-vrfy(1): check in-line character encoding and language (before scanning the document), followed by a dictionary-match ratio (sketched below). Only allow matching entities.
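
The dictionary-match ratio in step 2 can be read as the share of a page's tokens that occur in the word list given via --filter-dict, with the page accepted only when that share is high enough. The sketch below illustrates that reading; the tokenisation, the example 0.3 threshold, and all function names are assumptions for illustration, not the nlpc-vrfy(1) implementation.

/*
 * Illustrative dictionary-match ratio: the share of whitespace-separated
 * tokens in `text' that occur in the sorted word list `dict' of `ndict'
 * entries.  All names here are assumptions, not nlpcrawl's own.
 */
#include <stdlib.h>
#include <string.h>

static int
wordcmp(const void *a, const void *b)
{
        return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Returns the ratio of dictionary hits to total tokens (0.0 if no tokens). */
double
dict_match_ratio(char *text, const char **dict, size_t ndict)
{
        size_t hits = 0, total = 0;
        char  *tok;

        for (tok = strtok(text, " \t\r\n"); tok != NULL;
            tok = strtok(NULL, " \t\r\n")) {
                total++;
                if (bsearch(&tok, dict, ndict, sizeof(*dict), wordcmp))
                        hits++;
        }
        return total ? (double)hits / (double)total : 0.0;
}

/* A page might then be accepted when, say, dict_match_ratio(...) > 0.3. */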

This guarantees that verified pages have been fully language-verified, with emphasis on early enforcement. The base case of this induction is the set of seed sites provided to nlpc-scan(1).

Note that path-ascendancy may violate the inductive sequence by path-ascending links at the scanning stage. Thus, nlpc-scan(1) has options to path-ascend only verified links, not new ones. Whether this affects the signal-to-noise ratio is unknown.
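
Path ascendancy (Cothey, 2004) means that, for a harvested link such as http://host/a/b/c.html, the crawler also considers the ancestor paths http://host/a/b/, http://host/a/, and http://host/. The sketch below only enumerates those ancestors; it assumes the scheme and host have already been split off and is not the nlpc-scan(1) code.

/*
 * Illustrative path ascendancy: print every ancestor of a URI's path
 * component.  Purely an illustration; identifiers are hypothetical.
 */
#include <stdio.h>
#include <string.h>

static void
ascend_paths(const char *prefix, const char *path)
{
        char  buf[1024];
        char *slash;

        if (strlen(path) >= sizeof(buf))
                return;         /* overly long path: skip */
        strcpy(buf, path);

        /* Repeatedly cut the path back to its parent directory. */
        while ((slash = strrchr(buf, '/')) != NULL && slash != buf) {
                *(slash + 1) = '\0';            /* keep the trailing slash */
                printf("%s%s\n", prefix, buf);
                *slash = '\0';                  /* then drop it and ascend */
        }
        printf("%s/\n", prefix);                /* finally, the root path */
}

int
main(void)
{
        /* http://host/a/b/c.html -> /a/b/, /a/, / */
        ascend_paths("http://host", "/a/b/c.html");
        return 0;
}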

The system doesn't follow any particular order (e.g., breadth- or depth-first) when iterating over the link database. Links are sorted by hash: a fetch-queue is filled by exhausting the database of unique physical addresses. A small cache of links per address is maintained, but the focus is on broadening the search domain of physical addresses, following the hypothesis that pages link to those within the same hierarchy.
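
Sorting by hash simply means the record key is a hash value rather than the order of discovery, so iterating the database visits links in an essentially arbitrary order and neighbouring pages from one site do not cluster in the queue. The toy sketch below uses a 64-bit FNV-1a hash of the link text purely for illustration; the hash function and key layout actually used by nlpcrawl are not documented here.

/*
 * Illustrative hash key for ordering link records: 64-bit FNV-1a over
 * the link text.  Only a demonstration of why hash order scatters
 * links from the same site; not nlpcrawl's actual key.
 */
#include <stdint.h>
#include <stdio.h>

static uint64_t
link_key(const char *s)
{
        uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */

        for (; *s != '\0'; s++) {
                h ^= (uint64_t)(unsigned char)*s;
                h *= 1099511628211ULL;          /* FNV prime */
        }
        return h;
}

int
main(void)
{
        /* Neighbouring pages on one site get widely separated keys. */
        printf("%016llx\n", (unsigned long long)link_key("http://host/a/1.html"));
        printf("%016llx\n", (unsigned long long)link_key("http://host/a/2.html"));
        return 0;
}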

A common execution strategy is to simply run all daemons and collect pages. These may be further processed with other tools. In the following example, daemons are started in order to collect language-specific pages. The database rests in /tmp/nlpcrawl/db; the database files, in /tmp/nlpcrawl/data; and the file cache in /tmp/nlpcrawl/cache. These are the defaults. In this example, the file /usr/share/dict/lang contains a utf-8 dictionary of words. The system restricts itself to the http scheme; and utf-8, windows-1257, iso-8859-4, and iso-8859-13 content-type character sets. This combination of parameters is entirely context-dependent.

$ nlpcd -m --filter-charset "utf-8 windows-1257 iso-8859-4 iso-8859-13" \
    --filter-dict /usr/share/dict/lang --filter-lang lv \
    --filter-scheme http http://the-seed-site

 

References [top]

  • Berners-Lee, T., Fielding, R., and Frystyk, H. (1996). Hypertext Transfer Protocol -- HTTP/1.0. RFC 1945.
  • Berners-Lee, T., Fielding, R., and Masinter, L. (2005). Uniform Resource Identifier (URI): Generic Syntax. RFC 3986.
  • Cothey, V. (2004). Web-crawling reliability. Journal of the American Society for Information Science and Technology, 55(14).
  • Fielding, R., et al. (1999). Hypertext Transfer Protocol -- HTTP/1.1. RFC 2616.
  • Koster, M. (1994). A Standard for Robot Exclusion.
  • Koster, M. (1996). A Method for Web Robots Control. IETF Internet Draft.

 

Download [top]
nlpcrawl-0.1.32.tgz 15-08-2007 [md5]
nlpcrawl-0.1.20.tgz 19-06-2007 [md5]
nlpcrawl-0.1.15.tgz 06-06-2007 [md5]
nlpcrawl-0.1.9.tgz 30-05-2007 [md5]
nlpcrawl-0.1.6.tgz 25-05-2007 [md5]
nlpcrawl-0.1.5.tgz 21-05-2007 [md5]
nlpcrawl-0.1.3.tgz 03-05-2007 [md5]
nlpcrawl-0.1.2.tgz 28-04-2007 [md5]
nlpcrawl-0.0.8.tgz 23-04-2007 [md5]
nlpcrawl-0.0.4.tgz 29-03-2007 [md5]
nlpcrawl-0.0.3.tgz 24-03-2007 [md5]
nlpcrawl-0.0.2.tgz 22-03-2007 [md5]
nlpcrawl-0.0.1.tgz 13-03-2007 [md5]

 

$Id: nlpcrawl.html,v 1.55 2007-08-15 10:40:16 kristaps Exp $