nlpc-scan - scans the nlpc database and adds new addresses
nlpc-scan [-dhv] [-f path] [-p path] [uri ...]
The nlpc-scan utility scans an nlpc database for verified entities, then
parses their contents for further addresses to harvest. It’s part of the
nlpcrawl(1)
series of tools. The arguments are as follows:
- -d
- Enable debugging. Use multiple times for more verbosity.
- -f path
- The cache file path. See the nlpcrawl(1)
FILES section.
- -h
- Print a help message and exit.
- -p path
- The database environment path. See the nlpcrawl(1)
FILES section.
- -v
- Print version information and exit.
In addition, the following long arguments may be used:
- --filter-auth string
- Match new address authorities against string. If the authority isn't matched, do not add it to the database. Multiple authorities may be space-separated. Example: --filter-auth 'www.foo.com www.bar.com'.
- --filter-scheme string
- Match new address schemes against string. If the scheme isn't matched, do not add it to the database. Multiple schemes may be space-separated. Example: --filter-scheme 'http https'.
- --filter-ascend
- Path-ascend new records (see the sketch following this list).
- --filter-ascend-scan
- Path-ascend only scanned records (limits path ascension to already-verified records).
- --use-interval num
- Define the wait period, in seconds, between read scans. Defaults to 120. Must be greater than 1.
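To make the effect of --filter-ascend concrete, the following is a minimal sketch in Python of what path-ascending a single harvested address produces: every ancestor directory of the URI's path becomes a candidate address. It is not nlpc-scan's own code; the function name path_ascend and the example URI are illustrative only.

```python
# Minimal sketch of path-ascension (illustrative only, not nlpc-scan's code).
from urllib.parse import urlsplit, urlunsplit

def path_ascend(uri):
    """Yield every ancestor directory URI of an absolute URI."""
    parts = urlsplit(uri)
    # Drop the leading empty segment and the final path component.
    segments = parts.path.split('/')[1:-1]
    for i in range(len(segments), -1, -1):
        path = '/' + '/'.join(segments[:i])
        if not path.endswith('/'):
            path += '/'
        yield urlunsplit((parts.scheme, parts.netloc, path, '', ''))

# http://www.foo.com/a/b/c.html ascends to:
#   http://www.foo.com/a/b/
#   http://www.foo.com/a/
#   http://www.foo.com/
for ancestor in path_ascend('http://www.foo.com/a/b/c.html'):
    print(ancestor)
```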
One may supply a list of seed URIs (these must be absolute URIs) in order
to jump-start the sequence.
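For example: nlpc-scan --filter-scheme 'http https' --filter-auth 'www.foo.com' --use-interval 60 http://www.foo.com/index.html is a hypothetical invocation, using only the options described above, that seeds the scan with a single absolute URI and restricts harvested addresses to HTTP and HTTPS links on www.foo.com.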
Media Types
At this time, the nlpc-scan utility is capable of parsing HTML pages
(text/html) and XML (text/xml, application/xml, application/xhtml+xml)
pages. HTML-like pages are parsed and new links harvested. Plain-text
pages transition directly to processing without being scanned for links.
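As a rough sketch (not nlpc-scan's implementation), the media-type gate amounts to stripping any parameters from a document's Content-Type and checking it against the parseable types listed above; the function name is_parseable is hypothetical.

```python
# Illustrative sketch only: gate documents on the media types listed above.
PARSEABLE_TYPES = {
    'text/html',
    'text/xml',
    'application/xml',
    'application/xhtml+xml',
}

def is_parseable(content_type):
    """Return True if a Content-Type header names a type the scanner parses."""
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES

print(is_parseable('text/html; charset=utf-8'))  # True: parsed for links
print(is_parseable('text/plain'))                # False: passed on unscanned
```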
Recognised Tags
The nlpc-scan utility recognises the following HTML node names:
- A
- extract new links
- BASE
- extract the base URI for relative reference resolution
- LINK
- extract new links
- META
- define REP parameters (see the Robots Exclusion
Standard subsection of this document)
- NOINDEX
- define REP parameters (see the Robots Exclusion
Standard subsection of this document)
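The roles of these node names can be illustrated with a short sketch built on Python's standard html.parser module. This is not nlpc-scan's own parser: the class name LinkHarvester and the sample markup are hypothetical, and the NOINDEX element (whose subtree is skipped, as described in the next subsection) is not modelled here.

```python
# Illustrative sketch only: how the recognised node names are typically used.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkHarvester(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base     # BASE: base URI for relative reference resolution
        self.links = []      # A, LINK: harvested addresses
        self.robots = []     # META name="robots": REP directives

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'base' and attrs.get('href'):
            self.base = attrs['href']
        elif tag in ('a', 'link') and attrs.get('href'):
            self.links.append(urljoin(self.base, attrs['href']))
        elif tag == 'meta' and (attrs.get('name') or '').lower() == 'robots':
            self.robots = [d.strip().lower()
                           for d in (attrs.get('content') or '').split(',')]

p = LinkHarvester('http://www.foo.com/dir/page.html')
p.feed('<base href="http://www.foo.com/">'
       '<a href="sub/x.html">x</a>'
       '<meta name="robots" content="noindex, nofollow">')
print(p.links)   # ['http://www.foo.com/sub/x.html']
print(p.robots)  # ['noindex', 'nofollow']
```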
Robots Exclusion Standard
The robots variety of META tag (HTML only) is obeyed. The directive not to
follow is honoured by exiting the parse sequence. The directive not to
index still allows new pages to be added, but the source page itself is
marked UNREACH. If a NOINDEX tag is encountered, all children of the node
pair are skipped.
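Schematically (a hypothetical sketch, not nlpc-scan's code), the two META directives reduce to a pair of decisions:

```python
# Hypothetical sketch: map robots META directives onto scanner behaviour.
def apply_robots_meta(directives):
    """Decide whether to follow links and whether to index the source page."""
    follow = 'nofollow' not in directives  # nofollow: exit the parse sequence
    index = 'noindex' not in directives    # noindex: mark the source page UNREACH
    return follow, index

follow, index = apply_robots_meta(['noindex'])
# follow == True:  new addresses may still be added to the database
# index == False:  the source page itself is marked UNREACH
```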
See Also
nlpcrawl(1)