nlpcrawl: a suite of tools for crawling the Internet

Introduction [top]

nlpcrawl is a "suite of tools [that works] asynchronously in order to crawl the web for relevant content". nlpcrawl is developed at the University of Latvia in order to efficiently crawl web-pages and perform natural language operations on the downloaded content. The crawler software is licensed in full under the permissible BSD 3-part license. For more information on the underlying language research, please contact Dr Guntis Bārzdiņš, guntis at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.

The intent of nlpcrawl is to provide a portable, robust, and simple set of tools to acquire mark-up data (i.e., not images, binary documents, or other "rich media") and keep it fresh. The system is heavily tuned for per-language crawling, or in other words, crawling pages in a particular language.

The nlpcrawl suite of tools is being developed by Kristaps Džonsons, kristaps dot dzonsons at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.

Some features of the system follow:

  • conformance to the Robots Exclusion Standard (first edition and IETF draft) and some non-standard extensions (Koster, 1994; Koster, 1996)
  • page-fetch politeness policy: highly tunable politeness with sane defaults (10 seconds, 2 pages per burst); a sketch of this policy appears after the list
  • name-resolution politeness policy: per-server concurrent fetch limits, politeness times, and run-time aging cache (5 seconds, 64 requests per burst, 120 second maximum age)
  • path ascendancy: tunable link path-ascendancy (Cothey, 2004)
  • pipelining: if the underlying libcurl is up-to-date, the system will pipeline bursts of requests (6 pages) (Fielding, et al., 1999)
  • language selection: language selection is enforced at several stages, from character sets to dictionary matching, with an emphasis on early enforcement
  • standards compliance: the system makes constant use of a variety of standards, from URI construction to HTTP headers (Berners-Lee, et al., 2005; Fielding, et al., 1999; Berners-Lee, et al., 1996)
  • multi-threaded operation in speed-critical areas (page fetching, name resolution)
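
As an illustration of the page-fetch politeness policy above, the following is a minimal sketch of a per-host gate using the default values (10 seconds between bursts, 2 pages per burst). The structure and function names are illustrative only and are not taken from the nlpcrawl sources.

#include <time.h>

/*
 * Illustrative per-host politeness gate (not taken from the nlpcrawl
 * sources): allow at most POLITE_BURST page fetches, then leave the
 * host alone for POLITE_DELAY seconds before the next burst.
 */
#define POLITE_DELAY 10 /* seconds between bursts */
#define POLITE_BURST 2  /* pages per burst */

struct hostgate {
        time_t last; /* time the current burst began */
        int    used; /* pages fetched in the current burst */
};

/* Return non-zero if a page may be fetched from this host now. */
static int
may_fetch(struct hostgate *g)
{
        time_t now = time(NULL);

        if (now - g->last >= POLITE_DELAY) {
                /* The politeness interval has elapsed: begin a new burst. */
                g->last = now;
                g->used = 0;
        }
        if (g->used < POLITE_BURST) {
                g->used++;
                return 1;
        }
        return 0; /* burst exhausted; re-queue the page for later */
}

The name-resolution policy described above can be pictured in the same way, with the burst size and delay replaced by the per-server resolution limits and an additional maximum age applied to cached results.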

 

Status [top]

This project has been used to crawl fairly small volumes (several hundred thousand pages, ten to twenty gigabytes). It is stable at these volumes; nevertheless, please monitor the system while it is in use.

Some outstanding features follow:

  • in-line HTML Robots Exclusion Standard conformance
  • simple one-command controlling daemon
  • in-line archiving of downloaded corpus

 

News [top]

x-05-2007: the system now has an NLPCDBR_READY state and a corresponding utility, nlpc-resolve(1), which resolves addresses from names separately from nlpc-fetch(1). These were separated in order to allow better grouping by server address (politeness) and more control over the resolution sequence. The system also sports REP conformance, including some non-standard extensions. Significant internal re-writing has occurred within nlpc-fetch(1) in order to accommodate per-physical-address queues and per-host REP policies.

25-05-2007: fetcher is now multi-threaded for significantly greater performance, politeness policy fixed (was not polite), "parent" field culled from database, charset and langlen reduced to 32 bytes (from 64 bytes)

21-05-2007: optional path-ascendancy, politeness policy implemented, signal handling routines modified to be simpler (no execution from signal stack)

03-05-2007: bug fixes, stability, and documentation updates. Small optimisations to the database. Statistics collection is active, but a coherent tool for extraction is not yet available. The system now regularly collects over ten gigabytes of data during each [test] run.

28-04-2007: all basic elements in place and thoroughly tested. Stable release candidate tagged. Focus will now change to testing, then writing a controlling utility and a scan-window auto-optimisation tool.

23-04-2007: significant code has been re-written in order to scale upward properly. The nlpc-fetch(1), nlpc-scan(1), and nlpc-vrfy(1) utilities are routinely run in parallel to collect gigabyte-range data. Development focus is on database stability (i.e., deadlocking). All utilities other than these and nlpc-list(1) are on hold until the first three components stabilise.

29-03-2007: nlpc-scan(1) now fully scans databases and extracts addresses. This utility is heavily standards-compliant (RFC 2396, see manual). One may freely use a combination of nlpc-scan(1) and nlpc-fetch(1).

24-03-2007: nlpc-fetch(1) updated to recognise expiration; a new utility, nlpc-list(1), lists database entities; and the nlpc-up(1) utility is at its first version

22-03-2007: nlpc-fetch(1) should work and nlpc-scan(1) may be used to seed a database

13-03-2007: pre-release peek at design documentation

 

Documentation [top]

Pre-release manuals have been released for comment (these manuals are installed in Unix manual format along with system executables):

The Unix manuals are, and will continue to be, the canonical source for system documentation. In order to compile, you'll need the following libraries installed:

Note that these all use BSD- or MIT-derivative licenses. Edit the Makefile to reflect system-dependent values, then execute make and make install. nlpcrawl should compile and run in any Unix environment. The development platform is Debian GNU/Linux "etch" (amd64).

A general view of the state transitions and their correspondence to utilities follows:

Figure: inter-utility diagram (utility state machine)
Figure: inter-state diagram (record state machine)

A common execution strategy is simply to run all daemons and collect pages; these may be further processed with other tools. In the following example, daemons are started in order to collect language-specific pages. The database rests in /tmp/nlpcrawl/db; the database files, in /tmp/nlpcrawl/data; and the file cache, in /tmp/nlpcrawl/cache. These are the defaults. In this example, the file /usr/share/dict/lang contains a utf-8 dictionary of words. The system restricts itself to the http and https schemes, and to the utf-8, windows-1257, iso-8859-4, and iso-8859-13 content-type character sets. This combination of parameters is entirely context-dependent.

$ nlpc-up &
$ nlpc-vrfy --filter-charset "utf-8 windows-1257 iso-8859-4 iso-8859-13" \
  --filter-dict /usr/share/dict/lang &
$ nlpc-resolv &
$ nlpc-fetch --filter-lang lv --head-lang lv --filter-scheme "http https" \
  --filter-charset "utf-8 windows-1257 iso-8859-4 iso-8859-13" \
  --head-charset "utf-8 windows-1257 iso-8859-4 iso-8859-13" &
$ nlpc-scan --filter-scheme "http https" http://the-seed-site &

 

References [top]

 

Download [top]
nlpcrawl-0.1.9.tgz x-05-2007
nlpcrawl-0.1.6.tgz 25-05-2007 28c824720405e948d8c6b9eda0df4693
nlpcrawl-0.1.5.tgz 21-05-2007 5d042d30aa8d81c36cdbd6456cd152d1
nlpcrawl-0.1.3.tgz 03-05-2007 7f6bd8d5fcd10d930efa8ea5563cf69e
nlpcrawl-0.1.2.tgz 28-04-2007 320fc98c5029925eaad3fa9bc68859fc
nlpcrawl-0.0.8.tgz 23-04-2007 11f23972f2790e2343fb4d9ccb417f0a
nlpcrawl-0.0.4.tgz 29-03-2007 730dea28391e8dcd2b23734340f5695d
nlpcrawl-0.0.3.tgz 24-03-2007 fab07c82992017123fb95052ac3c2013
nlpcrawl-0.0.2.tgz 22-03-2007 99fc757ec313bc20106d0dac10254d71
nlpcrawl-0.0.1.tgz 13-03-2007 870a9d4cf839f2e3007310dd65718fe9

 

$Id: nlpcrawl.html,v 1.39 2007-05-29 17:39:36 kristaps Exp $