nlpcrawl, nlpcd, nlpc-fetch, nlpc-vrfy, nlpc-resolv, nlpc-up, nlpc-scan, nlpc-list, nlpc-dump - a suite of simple tools for crawling the Internet
The tools in the nlpcrawl suite work asynchronously to crawl the web for relevant content. Although developed specifically for natural language processing, they may be deployed in any crawling environment.
Each tool in nlpcrawl has a specific function, defined more precisely as transitions in the Cache State section. In brief:
The centre of nlpcrawl is a database containing detailed information on crawled entities. The database model was designed for maximum simplicity. The database itself is laid out as a series of records. Each record may be represented by the following structure:
URI structures are thus organised (the terminology follows RFC 3986 URI Generic Syntax):
Technically, the URI specification imposes no finite bounds on URI length; however, we bound URIs here for simplicity (and database efficiency).
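Such bounding might look like the following: one fixed-size buffer per RFC 3986 component. The field names and sizes here are hypothetical illustrations, not the actual nlpcrawl layout:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical fixed-bound URI record, one buffer per RFC 3986
 * component.  All sizes are illustrative only. */
#define URI_SCHEME_MAX	32
#define URI_HOST_MAX	256
#define URI_PATH_MAX	1024
#define URI_QUERY_MAX	1024

struct uri {
	char	scheme[URI_SCHEME_MAX];	/* e.g. "http" */
	char	host[URI_HOST_MAX];	/* e.g. "www.example.com" */
	char	path[URI_PATH_MAX];	/* e.g. "/index.html" */
	char	query[URI_QUERY_MAX];	/* e.g. "q=term" */
};
```

Fixed bounds keep every record the same size, which simplifies both the on-disk format and in-memory handling, at the cost of rejecting over-long URIs.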
Multi-byte types are written to and read from the database in network byte order. Records are keyed by the 16-byte MD5 hash of the stringified uri field. Note that date fields are cast from the native time_t to uint64_t, since some hosts store time in 64 bits and others in 32.
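The byte-order rule can be sketched as follows. The helper names are hypothetical, but the technique (explicit big-endian byte placement, independent of host endianness) is what writing in network byte order requires for the uint64_t date fields:

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: serialise a 64-bit value in network
 * (big-endian) byte order, regardless of host endianness. */
static void
put_be64(unsigned char *buf, uint64_t v)
{
	int	 i;

	for (i = 0; i < 8; i++)
		buf[i] = (unsigned char)(v >> (8 * (7 - i)));
}

/* The time_t is widened to uint64_t before writing, so 32-bit
 * and 64-bit hosts produce identical on-disk records. */
static void
put_date(unsigned char *buf, time_t t)
{
	put_be64(buf, (uint64_t)t);
}
```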
The database is currently implemented as a Berkeley DB (4.4) ‘‘btree’’ database file. This may change as the system matures.
Note that the database is not very stable and uses considerable system resources. Large scans (millions of pages or more) should be run on a lightly-loaded system. The database state should be monitored for corruption (see the Corruption sub-section for details). Read the Berkeley database manual carefully in order to tune the surrounding environment: we recommend the ext3 file-system on a system with considerable physical memory.
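One common tuning step is a DB_CONFIG file in the environment directory (here, /var/crawl/data/), which Berkeley DB reads at environment open. The cache size below is an illustrative value, not a recommendation:

```
# /var/crawl/data/DB_CONFIG
# Give the environment a 512 MB cache in a single segment.
set_cachesize 0 536870912 1
```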
Each record in the database is governed by a state machine. Entities may be in one of the following states:
State transitions are as follows:
All utilities maintain a separate file tree for archival purposes. The volume of this tree will, in general, be significantly greater than the active cache tree. Any modification to a cache file will result in a complete backup in the archive tree. The notion of archival is still under consideration, and will likely see greater elegance in future versions. (NOTE: DEPRECATED)
If any of the databases becomes corrupt (e.g., from shutting down while data is still being flushed, or from killing processes), the options for recovery are limited. Corruption results in endless blocking or outright errors. The environment must first be repaired (in these examples, /var/crawl/ is the argument passed as -p at utility start-up):
% db_recover -v -h /var/crawl/data/
Next, determine which databases are corrupt by verifying their contents:
% db_verify -h /var/crawl/data/ /var/crawl/db/*
Any corrupt databases (in this example, vrfy.db) must be re-built:
% db_dump -r -h /var/crawl/data/ /var/crawl/db/vrfy.db >sv.db
% mv /var/crawl/db/vrfy.db corrupt.db
% db_load -h /var/crawl/data/ /var/crawl/db/vrfy.db <sv.db
During extensive crawls, this procedure may be necessary several times.
Most nlpcrawl utilities are equipped to run in daemon mode, controlling other elements of the state machine with signals. All of the following signals guarantee that the system exits in a recoverable state.
Do not send SIGINT or other process-termination signals: doing so raises the likelihood that a database will become corrupt.
nlpcd(1), nlpc-fetch(1), nlpc-scan(1), nlpc-list(1), nlpc-resolv(1), nlpc-proc(1), nlpc-up(1), db(3), btree(3), libcurl(3), libxml(3)
All sources link to Colin Plumb’s public domain implementation of Ron Rivest’s MD5 algorithm, and Nate Nielson’s BSD-licensed mkdir_p().
The nlpcrawl suite was developed in full by Kristaps Dzonsons <firstname.lastname@example.org> for the University of Latvia’s Institute of Mathematics and Computer Science.
Database integral types are internally represented in little-endian byte order (database files may still be moved between architectures: see lorder in btree(3)). Big-endian systems will incur a penalty as the database internally converts types. If you use an exclusively big-endian architecture, consider initialising your database on a big-endian machine, or setting the appropriate ordering in db.c.
When reading addresses, nlpcrawl does not yet convert encoded multi-byte sequences into percent-encoded strings.
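The missing conversion would map each octet of a multi-byte (e.g. UTF-8) sequence to its %XX form, as RFC 3986 requires for octets outside the unreserved set. A hedged sketch of such an encoder; the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoder: percent-encode every byte outside the
 * RFC 3986 "unreserved" set, including each byte of a
 * multi-byte UTF-8 sequence.  Returns the number of bytes
 * written, not counting the NUL terminator; out must hold at
 * least 3 * strlen(in) + 1 bytes. */
static size_t
pct_encode(char *out, const char *in)
{
	static const char	 hex[] = "0123456789ABCDEF";
	const unsigned char	*p;
	size_t			 n = 0;

	for (p = (const unsigned char *)in; *p != '\0'; p++) {
		if ((*p >= 'A' && *p <= 'Z') ||
		    (*p >= 'a' && *p <= 'z') ||
		    (*p >= '0' && *p <= '9') ||
		    strchr("-._~", *p) != NULL) {
			out[n++] = (char)*p;
		} else {
			out[n++] = '%';
			out[n++] = hex[*p >> 4];
			out[n++] = hex[*p & 0x0F];
		}
	}
	out[n] = '\0';
	return n;
}
```

For example, the UTF-8 sequence for U+00E9 (two bytes, 0xC3 0xA9) would encode to "%C3%A9".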