
Name

nlpcrawl, nlpcd, nlpc-fetch, nlpc-vrfy, nlpc-resolv, nlpc-up, nlpc-scan, nlpc-list, nlpc-dump - a suite of simple tools for crawling the Internet

Synopsis

nlpcd
nlpc-fetch
nlpc-vrfy
nlpc-resolv
nlpc-up
nlpc-scan
nlpc-dump
nlpc-list

Description

The nlpcrawl suite of tools works asynchronously in order to crawl the web for relevant content. Although these tools were developed specifically for natural language processing, they may be deployed in any crawling environment.

Each tool in nlpcrawl has a specific function, further defined as transitions in Cache State. In brief:

nlpcd(1)
Run all tools in concert.
nlpc-fetch(1)
Fetch pages from accessible media.
nlpc-resolv(1)
Resolve IP addresses from names.
nlpc-vrfy(1)
Examine pages for authenticity: valid language, character encoding, and so forth.
nlpc-scan(1)
Extract addresses from fresh pages.
nlpc-up(1)
Prune, kill, or revive dead and unreachable records.
nlpc-list(1)
List entities in an nlpc database (not technically part of the state machine).
nlpc-dump(1)
Archive relevant (verified) content from the nlpc cache.

Database
The centre of nlpcrawl is a database containing detailed information on crawled entities. The database model was designed for maximum simplicity. The database itself is laid out as a series of records. Each record may be represented by the following structure:

struct record

struct uri uri
structure containing the parsed and normalised URI
uint64_t state
an integer representing the cache state that may be one of NEW, FRESH, SCANNED, VERIFIED, RESOLVED, DIRTY, UNREACH, or DEAD: see Cache State for an explanation
uint64_t code
a code signifying the error when in the UNREACH or DEAD state
uint64_t revived
number of times that this record has been marked UNREACH after failed re-validation attempts (reset upon entering the SCANNED state)
uint64_t ctime
time of initial database addition
uint64_t mtime
time of last modification
uint64_t ftime
time of last fetch
uint64_t size
size of on-disc content (in bytes), if applicable (see Files for details)
uint64_t type
type of cache data (integral symbol for HTML, XHTML, etc.)
char charset[32]
nil-terminated character set identifier for the cached document (zero, if unset; see the ASN.1 standard for the reason behind this length)
char lang[32]
nil-terminated language identifier for the cached document (zero, if unset)
uint64_t expires
expiration time of the cached object (zero, if unset; UINT64_MAX, if content immediately expires)

URI structures are thus organised (the terminology follows RFC 3986 URI Generic Syntax):

struct uri

char scheme[16]
one of http, https, or other schemes
char auth[128]
location authority (for appropriate schemes)
char path[128]
hierarchic path component or opaque part
char query[128]
query string

Technically, URIs are not permitted finite bounds; however, we choose to bound them in deference to simplicity (and database efficiency).
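
For reference, the two structures may be rendered in C roughly as follows. This is a sketch only: the exact declarations, field order, and any padding in the nlpcrawl sources may differ.

#include <stdint.h>

struct uri {
	char     scheme[16];   /* e.g. "http" or "https" */
	char     auth[128];    /* location authority */
	char     path[128];    /* hierarchic path or opaque part */
	char     query[128];   /* query string */
};

struct record {
	struct uri uri;        /* parsed and normalised URI */
	uint64_t state;        /* cache state: NEW, FRESH, ... */
	uint64_t code;         /* error code in UNREACH or DEAD */
	uint64_t revived;      /* failed re-validation count */
	uint64_t ctime;        /* time of initial database addition */
	uint64_t mtime;        /* time of last modification */
	uint64_t ftime;        /* time of last fetch */
	uint64_t size;         /* on-disc content size, in bytes */
	uint64_t type;         /* content type (HTML, XHTML, ...) */
	char     charset[32];  /* character set, or zero */
	char     lang[32];     /* language, or zero */
	uint64_t expires;      /* expiry time, or zero */
};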

Multi-byte types are written and read from the database in network-byte order. Records are keyed by the 16-byte MD5 hash of the stringified uri field. Note that date fields are cast from native time_t to uint64_t as some hosts store 64-bit time, and others, 32-bit.
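
As an illustration of the keying and byte-order conventions, the following sketch derives a record key and serialises an integral field. It assumes OpenSSL's MD5() and the glibc htobe64() (found in <sys/endian.h> on the BSDs); the helper names are hypothetical.

#include <endian.h>          /* htobe64(): glibc */
#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>     /* MD5(): any MD5 implementation will do */

/* Key a record by the 16-byte MD5 hash of the stringified URI. */
static void
record_key(const char *uristr, unsigned char key[16])
{
	MD5((const unsigned char *)uristr, strlen(uristr), key);
}

/* Multi-byte fields are written in network-byte (big-endian) order. */
static uint64_t
field_to_disc(uint64_t native)
{
	return htobe64(native);
}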

The database is currently implemented in a Berkeley (4.4) ‘‘btree’’ database file. This may change as the system matures.
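
A minimal sketch of opening such a file with the Berkeley DB C API follows. The file name is illustrative, error handling is abbreviated, and the real utilities open their databases within a shared environment.

#include <err.h>
#include <db.h>

DB *dbp;
int ret;

if ((ret = db_create(&dbp, NULL, 0)) != 0)
	errx(1, "db_create: %s", db_strerror(ret));
if ((ret = dbp->open(dbp, NULL, "/tmp/nlpcrawl/db/db/vrfy.db",
    NULL, DB_BTREE, DB_CREATE, 0644)) != 0)
	errx(1, "DB->open: %s", db_strerror(ret));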

Note that the database is not very stable and uses considerable system resources. Large scans (multiple millions of pages) should be run on a lightly-loaded system. The database state should be monitored for corruption (see the Corruption sub-section for details). Read the Berkeley database manual carefully in order to tune the surrounding environment: we recommend using the ext3 file-system on a system with considerable physical memory.

Cache State
Each record in the database is governed by a state machine. Entities may be in one of the following states:

NEW
discovered by nlpc-scan(1), not yet resolved by nlpc-resolv(1)
FRESH
fetched successfully by nlpc-fetch(1), not yet verified by nlpc-vrfy(1)
VERIFIED
verified by nlpc-vrfy(1), not yet scanned by nlpc-scan(1)
RESOLVED
name-resolved by nlpc-resolv(1), not yet fetched by nlpc-fetch(1)
SCANNED
scanned by nlpc-scan(1)
DIRTY
found by nlpc-up(1) to be out-of-date or in need of refetching
UNREACH
marked as being unreachable (or having suffered some other temporary failure); note that this is synonymous with TEMP as may appear elsewhere in the documentation
DEAD
marked as having syntax errors (or some other fatal failure)

State transitions are as follows:

NEW
-> RESOLVED | UNREACH
RESOLVED
-> FRESH | DEAD | UNREACH
FRESH
-> VERIFIED | UNREACH
VERIFIED
-> SCANNED | UNREACH
SCANNED
-> DIRTY
DIRTY
-> RESOLVED | UNREACH
DEAD
-> DIRTY | (delete)
UNREACH
-> DIRTY | DEAD
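
Expressed in C, the machine reduces to a small transition predicate. This is a sketch only: the state encodings and the function name are illustrative, not those of the nlpcrawl sources.

enum state {
	NEW, FRESH, SCANNED, VERIFIED, RESOLVED, DIRTY, UNREACH, DEAD
};

/* Return non-zero if the transition from "from" to "to" is legal.
 * Deletion of a DEAD record is handled outside the machine. */
static int
transition_ok(enum state from, enum state to)
{
	switch (from) {
	case NEW:      return to == RESOLVED || to == UNREACH;
	case RESOLVED: return to == FRESH || to == DEAD || to == UNREACH;
	case FRESH:    return to == VERIFIED || to == UNREACH;
	case VERIFIED: return to == SCANNED || to == UNREACH;
	case SCANNED:  return to == DIRTY;
	case DIRTY:    return to == RESOLVED || to == UNREACH;
	case DEAD:     return to == DIRTY;
	case UNREACH:  return to == DIRTY || to == DEAD;
	}
	return 0;
}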

Archival
All utilities maintain a separate file tree for archival purposes. The volume of this tree will, in general, be significantly greater than that of the active cache tree. Any modification to a cache file results in a complete backup in the archive tree. The notion of archival is still under consideration and will likely see greater elegance in future versions. (NOTE: DEPRECATED)

Corruption
If any of the databases becomes corrupt (from shutting down while data is still being flushed, or from killing processes), options for repair are limited. Corruption results in endless blocking or outright errors. The environment must first be recovered (in these examples, /var/crawl/ is the argument passed as -p at utility startup):

% db_recover -v -h /var/crawl/data/

At that point, corrupt databases must be rebuilt. See which databases are corrupt by verifying their contents:

% db_verify -h /var/crawl/data/ /var/crawl/db/*

Any corrupt databases (in this example, vrfy.db) must be re-built:

% db_dump -r -h /var/crawl/data/ /var/crawl/db/vrfy.db >sv.db
% mv /var/crawl/db/vrfy.db corrupt.db
% db_load -h /var/crawl/data/ /var/crawl/db/vrfy.db <sv.db

During extensive crawls, this procedure may be necessary several times.

Signals

Most nlpcrawl utilities are equipped to run in daemon-mode, controlling other elements of the state machine with signals. All of the following signals guarantee that the system exits in a recoverable state.

SIGHUP
Triggers a re-scan. If delivered during a scan, the signal is queued until blocking, at which time it’s processed in turn. Multiple signals delivered while blocking will be merged into one.
SIGTERM
Triggers a graceful shut-down after any active scanning.
SIGUSR1
Enlarge the scan window (number of records matched before blocking for a re-scan signal).
SIGUSR2
Narrow the scan window (number of records matched before blocking for a re-scan signal). Minimum scan window is 0 records.

Do not send a SIGINT or other process-termination signals. Doing so raises the likelihood that a database will become corrupt.
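
The queue-and-merge behaviour described above can be sketched as follows. The flag and handler names are hypothetical; the point is that handlers only raise flags, so repeated SIGHUPs collapse into a single pending re-scan.

#include <signal.h>

static volatile sig_atomic_t rescan, terminate;

static void
handler(int sig)
{
	if (sig == SIGHUP)
		rescan = 1;	/* repeated deliveries merge into one */
	else if (sig == SIGTERM)
		terminate = 1;	/* honoured after the active scan */
}

/* At daemon start-up:
 *	signal(SIGHUP, handler);
 *	signal(SIGTERM, handler);
 * and in the main loop, between scans:
 *	if (terminate) shut down gracefully;
 *	if (rescan) { rescan = 0; begin a new scan; }
 */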

Files

/tmp/nlpcrawl/
default prefix for the database (/db) and cache (/cache) file trees
/db/data/
database data directory within the database prefix
/db/db/
database directory within the database prefix
/db/run/
contains pid-files for active processes
/cache/00/00/.../00
/cache/00/00/.../01
...
/cache/ff/ff/.../ff
directory system for storing cache files: each URI hash is broken up into pairs of hex digits representing directories, with the last two digits being the file name (these are created on the fly as a cache file is stored; see the sketch after this list)
/cache/reps
default directory for storing REP entities
/archive/00/00/.../00/
/archive/00/00/.../01/
...
/archive/ff/ff/.../ff/
archive directories (similar to cache directories), where each directory will be filled by a snapshot of an altered file, with the file format being %S-%M-%H-%j-%y in strftime(3) notation

See Also

nlpcd(1), nlpc-fetch(1), nlpc-scan(1), nlpc-list(1), nlpc-resolv(1), nlpc-proc(1), nlpc-up(1), db(3), btree(3), libcurl(3), libxml(3)

Acknowledgements

All sources link to Colin Plumb's public-domain implementation of Ron Rivest's MD5 algorithm, and Nate Nielson's BSD-licensed mkdir_p().

Authors

The nlpcrawl suite was developed in full by Kristaps Dzonsons <kristaps.dzonsons@latnet.lv> for the University of Latvia’s Institute of Mathematics and Computer Science.

Caveats

Database integral types are internally represented in little-endian order (database files may still be moved between architectures: see lorder in btree(3)). Big-endian systems will incur a penalty as the database converts types internally. If you use an exclusively big-endian architecture, consider initialising your database on a big-endian machine, or setting the appropriate ordering in db.c.
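
If you do elect to force the ordering, the Berkeley DB C API allows it at creation time. A sketch, with error handling omitted; the call must precede DB->open(), and the order is fixed when the file is first created.

#include <db.h>

DB *dbp;

db_create(&dbp, NULL, 0);
dbp->set_lorder(dbp, 4321);	/* 4321: big-endian; 1234: little-endian */
/* ...then DB->open() as usual. */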

When reading addresses, nlpcrawl still doesn’t know how to convert from encoded multi-byte sequences into percent-encoded strings.

