Table of Contents

Name

nlpc-scan - scans nlpc database and adds new addresses

Synopsis

nlpc-scan [-dhv] [-f path] [-p path] [uri ...]

Description

The nlpc-scan utility scans an nlpc database for verified entities, then parses their contents for further addresses to harvest. It is part of the nlpcrawl(1) suite of tools. The arguments are as follows:

-d
Enable debugging. Use multiple times for more verbosity.

-f path
The cache file path. See the nlpcrawl(1) FILES section.

-h
Print a help message and exit.

-p path
The database environment path. See the nlpcrawl(1) FILES section.

-v
Print version information and exit.

In addition, the following long arguments may be used:

--filter-auth string
Match new address authorities against string. If an address's authority does not match, the address is not added to the database. Multiple authorities may be given, separated by spaces. Example: --filter-auth 'www.foo.com www.bar.com'.

--filter-scheme string
Match new address schemes against string. If an address's scheme does not match, the address is not added to the database. Multiple schemes may be given, separated by spaces. Example: --filter-scheme 'http https'.

--filter-ascend
Path-ascend new records.

--filter-ascend-scan
Path-ascend only scanned records (limits path-ascendancy to already-verified records).

--use-interval num
Define the wait period between read scans, in seconds. Defaults to 120. Must be greater than 1.

One may supply a list of seed URIs (these must be absolute URIs) in order to jump-start the scan sequence.
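Putting the options together, a typical invocation restricted to HTTP(S) links on a single host and seeded with one absolute URI might look as follows. The host name and file-system paths here are placeholders for illustration, not defaults:

```shell
# Scan the database environment at /var/db/nlpc using the cache file
# /var/cache/nlpc, keeping only http/https links on www.foo.com,
# with a 60-second wait between read scans, seeded with one URI.
nlpc-scan -p /var/db/nlpc -f /var/cache/nlpc \
    --filter-scheme 'http https' \
    --filter-auth 'www.foo.com' \
    --use-interval 60 \
    http://www.foo.com/index.html
```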

Media Types
At this time, the nlpc-scan utility is capable of parsing HTML pages (text/html) and XML pages (text/xml, application/xml, application/xhtml+xml). HTML-ish pages are processed and new links harvested. Plain-text pages are processed directly, without being scanned for links.

Recognised Tags
The nlpc-scan utility recognises the following HTML node names:

A
extract new links
BASE
extract the base URI for relative reference resolution
LINK
extract new links
META
define REP parameters (see the Robots Exclusion Standard subsection of this document)
NOINDEX
define REP parameters (see the Robots Exclusion Standard subsection of this document)

Robots Exclusion Standard
The robots variant of the META tag (HTML only) is obeyed. A directive not to follow is honoured by exiting the parse sequence. A directive not to index still allows new addresses to be harvested, but the source page is marked UNREACH. If a NOINDEX tag is encountered, all children of the node pair are skipped.
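For example, a page containing the following META directive would still have its links harvested (it may be followed), but the page itself would not be indexed and would be marked UNREACH. The fragment is illustrative only:

```
<html>
  <head>
    <!-- "noindex": do not index this page; "follow": links may still be harvested -->
    <meta name="robots" content="noindex, follow">
  </head>
  <body>...</body>
</html>
```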

See Also

nlpcrawl(1)

