Name

nlpc-fetch - fetches content to an nlpc database

Synopsis

nlpc-fetch [-dhmv] [-f path] [-p path] [-r path] [LONG_OPTS]

Description

The nlpc-fetch utility fetches on-line content and records meta-information; it is built atop libcurl(3). URIs are harvested from an nlpc database. nlpc-fetch is part of the nlpcrawl(1) series of tools. The arguments are as follows:

-d
Enable debugging. Use multiple times for more verbosity.

-h
Print a help message and exit.

-m
Disable debugging and enable “monitor-mode”, which displays the current hosts on the console, refreshed once per second by default (see --monitor-up).

-v
Print version information and exit.

-f path
The cache file path. See the nlpcrawl(1) FILES section.

-p path
The database environment path. See the nlpcrawl(1) FILES section.

-r path
The REP entity path. See the nlpcrawl(1) FILES section.

In addition, the following long arguments may be used:

--filter-auth string
Match new address authorities against string. If the authority isn’t matched, do not add it to the database. Multiple authorities may be space-separated. Example: --filter-auth ‘www.a.lv www.b.lv’. (This space-separated matching is sketched after the list of long arguments.)

--filter-charset string
For relevant protocols, if a charset is provided by the server, match it now against string. If the charset is not matched, discontinue fetching the page. Multiple charsets may be space-separated. Example: --filter-charset ‘utf-8 utf-16’.

--filter-lang string
For relevant protocols, if a language is provided by the server, match it now against string. If the language is not matched, discontinue fetching the page. Multiple languages may be space-separated. Example: --filter-lang ‘lv en’.

--filter-scheme string
Match new address schemes against string. If the scheme isn’t matched, do not add it to the database. Multiple schemes may be space-separated. Example: --filter-scheme ‘http ftp’.

--filter-ascend
Path-ascend new records (each ancestor path of a newly-added address is also added).

--fetch-burst int
Number of hosts to fetch at once at each access point. This tries to use pipelining, should the underlying protocol support this feature. Defaults to 1.

--fetch-depth int
The maximum number of requests to queue for each host over each fetch session. Note that setting this too deep may result in a single host being waited upon for an upper bound of fetch-depth * fetch-wait seconds (75 seconds with the defaults). Defaults to 15.

--fetch-width int
Maximum hosts in fetch queue. Defaults to 2048. This and --fetch-parallel largely govern the impact of politeness on throughput.

--fetch-wait int
Amount of time to wait between fetches. Defaults to 5 (seconds). Last-access times are marked at the end of a fetch attempt.

--fetch-parallel int
Maximum number of parallel connections to open. This shouldn’t exceed the shell’s file descriptor limit. Defaults to 256. This and --fetch-width largely govern the impact of politeness on throughput.

--fetch-trigger int
Trigger for re-filling the wait queue. Defaults to the queue width, --fetch-width, divided by two.

--rep-expires int
Interval (in seconds) after which time a REP entity expires. If set to zero, REP entities are re-fetched with every access (you probably don’t want this). Defaults to seven days.

--rep-maxallow int
Maximum allow clauses per REP. If set to zero, the queue is not limited. Defaults to 1024.

--rep-maxdeny int
Maximum deny clauses per REP. If set to zero, the queue is not limited. Defaults to 1024.

--rep-cachesz int
Maximum number of cached REP entities. Once this limit is reached, the least-used caches (and/or oldest) are pruned. If set to zero, the queue is not limited. Defaults to 4096.

--monitor-up int
Number of seconds between -m monitor-mode updates. Defaults to 1 second.
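
The --filter-auth, --filter-charset, --filter-lang, and --filter-scheme options above all match a candidate value against a single space-separated string. A minimal sketch of that matching, using a hypothetical match_filter() helper rather than the utility's own internals, might look as follows:

    #include <stdio.h>
    #include <string.h>
    #include <strings.h>

    /*
     * Illustration only: return non-zero if value appears in the
     * space-separated list filter (e.g. "http ftp" for --filter-scheme).
     */
    static int
    match_filter(const char *filter, const char *value)
    {
        const char *p = filter;
        size_t len = strlen(value);

        while (*p != '\0') {
            const char *end = strchr(p, ' ');
            size_t n = end != NULL ? (size_t)(end - p) : strlen(p);

            if (n == len && strncasecmp(p, value, n) == 0)
                return 1;
            if (end == NULL)
                break;
            p = end + 1;
        }
        return 0;
    }

    int
    main(void)
    {
        /* As with --filter-scheme 'http ftp': http passes, gopher does not. */
        printf("%d %d\n", match_filter("http ftp", "http"),
            match_filter("http ftp", "gopher"));
        return 0;
    }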

Monitor Mode
Monitor mode is enabled with the -m flag. It displays a table of the currently valid physical hosts; in other words, those that are awaiting activity. Since active connections generally start and complete too quickly to show meaningfully, the display concentrates instead on the next entity to be fetched. The table contains the following fields:

ip
The physical host’s address.

pend
The number of active connections to the physical host.

size
The number of queued virtual hosts for the physical host.

elapsed
Seconds elapsed since the previous fetch.

next
If no connections are queued, this displays (empty); if the REP is being fetched, this displays (acquiring); otherwise, it displays first the REP delay, then the estimated time until the next fetch, then the next address to fetch.
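
As a rough illustration of how the next column could be assembled from those three states (hypothetical structure and function names, not the utility's own), consider:

    #include <stdio.h>
    #include <time.h>

    /* Hypothetical per-host state, for illustration only. */
    struct host {
        int    queued;       /* queued virtual hosts */
        int    rep_pending;  /* robots.txt currently being fetched */
        long   rep_delay;    /* Crawl-delay in effect, in seconds */
        time_t next_fetch;   /* estimated time of the next fetch */
        char   next_uri[256];
    };

    static void
    format_next(const struct host *h, char *buf, size_t sz, time_t now)
    {
        if (h->queued == 0)
            snprintf(buf, sz, "(empty)");
        else if (h->rep_pending)
            snprintf(buf, sz, "(acquiring)");
        else
            snprintf(buf, sz, "%lds %lds %s", h->rep_delay,
                (long)(h->next_fetch > now ? h->next_fetch - now : 0),
                h->next_uri);
    }

    int
    main(void)
    {
        time_t now = time(NULL);
        struct host h = { 1, 0, 10, now + 4, "http://www.a.lv/" };
        char buf[512];

        format_next(&h, buf, sizeof(buf), now);
        puts(buf);
        return 0;
    }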

Politeness Policy
The nlpc-fetch utility uses a politeness policy in order to reduce strain on content servers. This policy separates a burst of per-authority accesses by a configurable amount of time. The politeness policy may be tuned using the --fetch group of long options. Note that politeness is per-server as dictated by the resolved IP address in nlpc-resolv(1). The politeness interval dictated by --fetch-wait may be considered a minimum: a host’s (or virtual host’s) robots.txt may create considerable delay between hits to a particular physical address.
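
As a hedged sketch of the policy described above (hypothetical names; that the effective wait is the larger of --fetch-wait and any applicable Crawl-delay is an assumption drawn from the preceding paragraph), the earliest time a physical address becomes eligible for another fetch might be computed as:

    #include <stdio.h>
    #include <time.h>

    /*
     * Illustration only: earliest time another fetch may be issued to a
     * physical (resolved) address.  fetch_wait is the --fetch-wait value;
     * crawl_delay is any REP Crawl-delay applying to the virtual host
     * currently being accessed (0 if none).
     */
    static time_t
    next_eligible(time_t last_access, long fetch_wait, long crawl_delay)
    {
        long wait = fetch_wait > crawl_delay ? fetch_wait : crawl_delay;

        return last_access + wait;
    }

    int
    main(void)
    {
        time_t now = time(NULL);

        /* Default --fetch-wait of 5 seconds, robots.txt Crawl-delay of 20. */
        printf("wait %ld more seconds\n",
            (long)(next_eligible(now, 5, 20) - now));
        return 0;
    }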

Robots Exclusion Standard
The User-Agent, Disallow, Allow, and Crawl-delay REP (Robots Exclusion Protocol) tokens are processed. See the --rep family of command-line options. The robots.txt file is fetched when a host (not IP address) is first accessed or when the extant file expires. If the server returns HTTP code 404, the scan is not limited; code 200 results in the returned file being parsed. All other codes disable scanning of the host. Note that this processing occurs on a per-host basis, not a per-IP basis (in the event of virtual hosting). Thus, if Crawl-delay is specified, this will influence all sites hosted at a particular IP address. Note that the maximum Crawl-delay is 999 seconds, while the minimum is the politeness wait time.
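
A minimal sketch of the status-code handling and Crawl-delay clamping described above (hypothetical names, not the utility's own source):

    #include <stdio.h>

    enum rep_state { REP_ALLOW_ALL, REP_PARSE, REP_DENY_ALL };

    /* 404: scan unrestricted; 200: parse robots.txt; otherwise: skip host. */
    static enum rep_state
    rep_classify(long http_code)
    {
        if (http_code == 404)
            return REP_ALLOW_ALL;
        if (http_code == 200)
            return REP_PARSE;
        return REP_DENY_ALL;
    }

    /* Clamp Crawl-delay to [politeness wait, 999] seconds. */
    static long
    rep_clamp_delay(long crawl_delay, long fetch_wait)
    {
        if (crawl_delay < fetch_wait)
            crawl_delay = fetch_wait;
        if (crawl_delay > 999)
            crawl_delay = 999;
        return crawl_delay;
    }

    int
    main(void)
    {
        printf("%d %ld\n", rep_classify(404), rep_clamp_delay(2, 5));    /* 0 5 */
        printf("%d %ld\n", rep_classify(503), rep_clamp_delay(5000, 5)); /* 2 999 */
        return 0;
    }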

The nlpc-fetch utility considers accesses to multiple virtual hosts on the same physical host as being disjoint: a crawl-delay applies only to the virtual host currently being accessed (a smaller crawl-delay on another virtual host on the same physical host will be obeyed as if it were a completely different host). Lastly, the REP for a host is upheld for all scheme accesses (FTP, etc.).

This utility advertises itself as nlpcrawl-version, as in nlpcrawl-0.1.9.

See Also

nlpcrawl(1)

Standards

Although nlpc-fetch depends on libcurl(3) to handle most standards conformance, it does manipulate HTTP/1.1 and HTTP/1.0 headers as specified in RFC 2616 and RFC 1945, respectively. It requests UTF-8 encoding and only HTML, XHTML, or XML IANA media types (RFC 2045, 2046). Cache control occurs via HTTP/1.0 mechanisms (RFC 1945). REP processing occurs as per the IETF draft version (with full backwards compatibility) with some industry-used extensions.
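
As a hedged illustration of the request headers such a fetch implies (the exact Accept values and URL are assumptions, and this is not nlpc-fetch's actual code), a libcurl(3) setup might resemble:

    #include <curl/curl.h>

    int
    main(void)
    {
        CURL *curl;
        struct curl_slist *hdrs = NULL;

        curl_global_init(CURL_GLOBAL_DEFAULT);
        curl = curl_easy_init();
        if (curl == NULL)
            return 1;

        /* Ask only for HTML/XHTML/XML media types, UTF-8 preferred. */
        hdrs = curl_slist_append(hdrs,
            "Accept: text/html, application/xhtml+xml, text/xml, application/xml");
        hdrs = curl_slist_append(hdrs, "Accept-Charset: utf-8");

        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_USERAGENT, "nlpcrawl-0.1.9");
        curl_easy_setopt(curl, CURLOPT_URL, "http://www.a.lv/");

        (void)curl_easy_perform(curl);

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }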

