nlpc-fetch - fetches content to an nlpc database
nlpc-fetch [-dhmv] [-f path] [-p path] [-r path] [LONG_OPTS]
The nlpc-fetch utility fetches on-line content and records meta-information;
it’s built atop libcurl(3). URIs are harvested from an nlpc
database. nlpc-fetch is part of the nlpcrawl(1) series of tools. The
arguments are as follows:
- -d
- Enable debugging. Use multiple times for more verbosity.
- -h
- Print a help message and exit.
- -m
- Disable debugging and view “monitor mode”, which displays current
hosts on the console at a refresh rate of once per second.
- -v
- Print version information and exit.
- -f path
- The cache file path. See the FILES section of nlpcrawl(1).
- -p path
- The database environment path. See the FILES section of nlpcrawl(1).
- -r path
- The REP entity path. See the FILES section of nlpcrawl(1).
In addition, the following long arguments may be used:
- --filter-auth string
-
Match new address authorities against string. If the authority
isn’t matched, do not add it to the database. Multiple authorities
may be space-separated. Example:
--filter-auth 'www.a.lv www.b.lv'.
- --filter-charset string
-
For relevant protocols, if a charset is provided by the server,
match it now against string. If the charset is not matched, discontinue
fetching the page. Multiple charsets may be space-separated.
Example:
--filter-charset 'utf-8 utf-16'.
- --filter-lang string
-
For relevant protocols, if a language is provided by the server,
match it now against string. If the language is not matched,
discontinue fetching the page. Multiple languages may be space-separated.
Example:
--filter-lang 'lv en'.
- --filter-scheme string
-
Match new address schemes against string. If the scheme isn’t
matched, do not add it to the database. Multiple schemes may be
space-separated. Example:
--filter-scheme 'http ftp'.
- --filter-ascend
-
Path-ascend new records.
- --fetch-burst int
-
Number of hosts to fetch at once at each access point. This
tries to use pipelining, should the underlying protocol support
this feature. Defaults to 1.
- --fetch-depth int
-
The maximum number of requests to queue for each host over each
fetch session. Note that setting this too deep may result
in a single host being waited upon for up to fetch-depth *
fetch-wait seconds. Defaults to 15.
- --fetch-width int
-
Maximum hosts in fetch queue. Defaults to 2048. This and
--fetch-parallel largely govern the impact of politeness on
throughput.
- --fetch-wait int
-
Amount of time to wait between fetches. Defaults to 5 (seconds).
Last-access times are marked at the end of a fetch attempt.
- --fetch-parallel int
-
Maximum number of parallel connections to open. This shouldn’t
exceed the shell’s file descriptor limit. Defaults to 256. This
and --fetch-width largely govern the impact of politeness on
throughput.
- --fetch-trigger int
-
Trigger for re-filling the wait queue. Defaults to the queue
width, --fetch-width, divided by two.
- --rep-expires int
-
Interval (in seconds) after which time a REP entity expires. If
set to zero, REP entities are re-fetched with every access (you
probably don’t want this). Defaults to seven days.
- --rep-maxallow int
-
Maximum allow clauses per REP. If set to zero, the number of
clauses is not limited. Defaults to 1024.
- --rep-maxdeny int
-
Maximum deny clauses per REP. If set to zero, the number of
clauses is not limited. Defaults to 1024.
- --rep-cachesz int
-
Maximum number of cached REP entities. Once this limit is
reached, the least-used (and/or oldest) entries are pruned. If
set to zero, the cache is not limited. Defaults to 4096.
- --monitor-up int
-
Number of seconds between -m monitor-mode updates. Defaults to 1
second.
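The following is a minimal sketch of how some of the long options relate to
one another. The function names are hypothetical and are not part of
nlpc-fetch; only the default values and the space-separated filter syntax are
taken from the descriptions above. Exact (case-sensitive) matching and integer
division are assumptions of the sketch.

```python
# Illustrative sketch; function names are hypothetical, not nlpc-fetch internals.


def matches_filter(value: str, filter_string: str) -> bool:
    """Return True if value equals one of the space-separated filter entries.

    Exact, case-sensitive comparison is an assumption of this sketch.
    """
    return value in filter_string.split()


def default_trigger(fetch_width: int = 2048) -> int:
    """--fetch-trigger defaults to --fetch-width divided by two."""
    return fetch_width // 2  # integer division is an assumption


def worst_case_host_wait(fetch_depth: int = 15, fetch_wait: int = 5) -> int:
    """Upper bound, in seconds, on waiting for a single host in one session."""
    return fetch_depth * fetch_wait


print(matches_filter("http", "http ftp"))    # True
print(matches_filter("gopher", "http ftp"))  # False
print(default_trigger())                     # 1024
print(worst_case_host_wait())                # 75
```

With the defaults, a single slow host can therefore hold its queue slot for
up to 75 seconds per fetch session.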
Monitor Mode
Monitor mode is enabled with the -m flag. It displays a table of the
currently-valid physical hosts; in other words, those that are awaiting
activity. Since active connections generally start and complete too
quickly to show meaningfully, this concentrates instead upon the next
entity to be fetched. The table displays the following fields:
- ip
- The physical host’s address.
- pend
- The number of active connections to the physical host.
- size
- The number of queued virtual hosts for the physical host.
- elapsed
- Seconds elapsed since previous fetch.
- next
- If no connections are queued, this displays (empty); if
the REP is being fetched, this displays (acquiring); otherwise,
it displays first the REP delay, then the estimated
time until next fetch, then the next address to
fetch.
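The rules for the next field can be sketched as a small function. This is
an illustration only: the function name, its parameters, and the
space-separated layout of the final case are assumptions, since the exact
column format is not specified here.

```python
from typing import List


def next_field(queue: List[str], acquiring_rep: bool,
               rep_delay: int, eta: int) -> str:
    """Render the monitor-mode 'next' column for one physical host.

    queue         -- addresses queued for this host (hypothetical structure)
    acquiring_rep -- True while the REP (robots.txt) is being fetched
    rep_delay     -- REP delay in seconds
    eta           -- estimated seconds until the next fetch
    """
    if not queue:
        return "(empty)"
    if acquiring_rep:
        return "(acquiring)"
    # Order follows the description: REP delay, time to next fetch, address.
    return f"{rep_delay} {eta} {queue[0]}"


print(next_field([], False, 0, 0))                       # (empty)
print(next_field(["http://www.a.lv/"], True, 0, 0))      # (acquiring)
print(next_field(["http://www.a.lv/"], False, 10, 3))    # 10 3 http://www.a.lv/
```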
Politeness Policy
The nlpc-fetch utility uses a politeness policy in order to reduce strain
on content servers. This policy separates a burst of per-authority
accesses by a configurable amount of time. The politeness policy may be
tuned by using the --fetch group of long options. Note that politeness
is per-server as dictated by the resolved IP address in nlpc-resolv(1).
The politeness as dictated with --fetch-wait may be considered a minimum:
a host’s (or virtual host’s) robots.txt may create considerable delay
between hits to a particular physical address.
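As a rough sketch of the policy described above, the class below tracks the
last access time per resolved IP address and treats --fetch-wait as a minimum
that a robots.txt delay may extend. The class, its methods, and the in-memory
dictionary are hypothetical; they only illustrate the timing rule and say
nothing about how nlpc-fetch stores this state.

```python
from typing import Dict


class PolitenessTracker:
    """Per-IP politeness timing (illustrative, not the nlpc-fetch implementation)."""

    def __init__(self, fetch_wait: float = 5.0) -> None:
        self.fetch_wait = fetch_wait              # --fetch-wait, in seconds
        self.last_access: Dict[str, float] = {}   # resolved IP -> timestamp

    def mark(self, ip: str, now: float) -> None:
        """Record the end of a fetch attempt against this physical address."""
        self.last_access[ip] = now

    def ready(self, ip: str, now: float, rep_delay: float = 0.0) -> bool:
        """True if the address may be fetched again at time `now`.

        rep_delay is a delay imposed by robots.txt; --fetch-wait is the floor.
        """
        last = self.last_access.get(ip)
        if last is None:
            return True
        return now - last >= max(self.fetch_wait, rep_delay)


tracker = PolitenessTracker(fetch_wait=5.0)
tracker.mark("192.0.2.1", now=10.0)
print(tracker.ready("192.0.2.1", now=14.0))                  # False
print(tracker.ready("192.0.2.1", now=15.0))                  # True
print(tracker.ready("192.0.2.1", now=15.0, rep_delay=30.0))  # False
```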
Robots Exclusion Standard
The User-Agent, Disallow, Allow, and Crawl-delay REP (Robots Exclusion
Protocol) tokens are processed. See the --rep family of command-line
options. The robots.txt file is fetched when a host (not IP address) is
first accessed or the extant file expires. If the server returns an HTTP
code 404, the scan is not limited; code 200 results in the returned file
being parsed. All other codes disable scanning of the host. Note that
this processing occurs on a per-host basis, not a per-IP basis (in the
event of virtual hosting). Thus if Crawl-delay is specified, this will
influence all sites hosted at a particular IP address. Note that the
maximum crawl-delay is 999 seconds, while the minimum is the politeness
wait time.
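The status-code handling and the crawl-delay bounds described above can be
sketched as follows. The enum and function names are hypothetical, and the
default of 5 seconds is simply the --fetch-wait default; this is an
illustration of the stated rules, not the utility's own code.

```python
from enum import Enum


class RepPolicy(Enum):
    UNRESTRICTED = "unrestricted"  # no robots.txt: the scan is not limited
    PARSE = "parse"                # robots.txt returned: parse its clauses
    DISABLED = "disabled"          # any other response: do not scan the host


def rep_policy(status: int) -> RepPolicy:
    """Map the HTTP status of a robots.txt fetch to a scanning policy."""
    if status == 404:
        return RepPolicy.UNRESTRICTED
    if status == 200:
        return RepPolicy.PARSE
    return RepPolicy.DISABLED


def effective_crawl_delay(crawl_delay: int, fetch_wait: int = 5) -> int:
    """Cap Crawl-delay at 999 seconds and never go below the politeness wait."""
    return max(fetch_wait, min(crawl_delay, 999))


print(rep_policy(404).value)        # unrestricted
print(rep_policy(503).value)        # disabled
print(effective_crawl_delay(1))     # 5
print(effective_crawl_delay(5000))  # 999
```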
The nlpc-fetch utility considers accesses to multiple virtual hosts on
the same physical host as being disjoint: a crawl-delay applies only to
the virtual host currently being accessed (a smaller crawl-delay on
another virtual host on the same physical host will be obeyed as if it
were a completely different host). Lastly, the REP for a host is upheld
for all scheme accesses (FTP, etc.).
This utility advertises itself as nlpcrawl-version, as in nlpcrawl-0.1.9.
nlpcrawl(1)
Although nlpc-fetch relies on libcurl(3) for most standards conformance,
it does manipulate HTTP/1.1 and 1.0 headers as specified in
RFC 2616 and 1945, respectively. It requests UTF-8 encoding and only
HTML, XHTML, or XML IANA media types (RFC 2045, 2046). Cache control
occurs via HTTP/1.0 mechanisms (RFC 1945). REP processing occurs as per
the IETF draft version (with full backwards compatibility) with some
industry-used extensions.
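For illustration, request headers consistent with the description above
might look like the following. The exact header set and values that
nlpc-fetch sends are not given here, so the function name, the choice of
Accept and Accept-Charset, and the specific media-type strings are all
assumptions; only the nlpcrawl-version agent format comes from this page.

```python
from typing import Dict


def build_request_headers(version: str) -> Dict[str, str]:
    """Build hypothetical HTTP request headers for a crawler of this kind."""
    return {
        # Agent string format as documented: nlpcrawl-version
        "User-Agent": f"nlpcrawl-{version}",
        # Assumed way of requesting UTF-8 encoding
        "Accept-Charset": "utf-8",
        # Assumed media types for HTML, XHTML, and XML
        "Accept": "text/html, application/xhtml+xml, application/xml",
    }


headers = build_request_headers("0.1.9")
print(headers["User-Agent"])  # nlpcrawl-0.1.9
```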