nlpcrawl: a suite of tools for crawling the Internet
introduction
nlpcrawl is a "suite of tools [that works] asynchronously in order to crawl the web for relevant content." nlpcrawl was originally developed at the University of Latvia to provide a reliable, simple means of crawling web pages and performing natural language operations on the downloaded corpus. This research is still underway.
Since the requirements of these operations are fairly simple, heavily reworking existing tools to fit them was considered overkill. Instead, new tools were either written from scratch or carefully re-fitted from existing ones. The software is licensed in full under the permissive 3-clause BSD license.
The intent of nlpcrawl is to provide a portable, robust, and simple set of tools to acquire mark-up data (i.e., not images, binary documents, or other "rich media") and keep it fresh. Off-line analysis tools must be able to iterate easily over new content and have a reliable means of referencing sources (URL, access time, etc.). All tools must be entirely architecture- and system-indifferent.
The nlpcrawl suite of tools is being developed by Kristaps Dzonsons, kristaps dot dzonsons at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.
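The goal of keeping content fresh while letting off-line tools reference each source by URL and access time can be illustrated with a minimal sketch. It is not nlpcrawl's actual database layout or code: the PageRecord type, the is_stale and stale_records functions, and the seven-day default age are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class PageRecord:
    """Hypothetical per-page record; not nlpcrawl's real schema."""
    url: str                      # where the mark-up was fetched from
    accessed: datetime            # when it was last downloaded
    expires: Optional[datetime]   # server-supplied expiry time, if any


def is_stale(rec: PageRecord, now: datetime,
             max_age: timedelta = timedelta(days=7)) -> bool:
    """Return True if the page should be fetched again.

    An explicit expiry time wins; without one, fall back to an
    assumed maximum age counted from the last access time.
    """
    if rec.expires is not None:
        return now >= rec.expires
    return now - rec.accessed >= max_age


def stale_records(records: Iterable[PageRecord],
                  now: datetime) -> Iterator[PageRecord]:
    """Yield only the records that are due for a refresh."""
    return (rec for rec in records if is_stale(rec, now))
```

An off-line analysis tool could iterate over the remaining (fresh) records in the same way and cite each one by its url and accessed fields.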
This project is still in the early/design phase.
23-03-2007: nlpc-fetch(1) updated to recognise expirations; a new utility, nlpc-list(1), that lists database entities; and the nlpc-up(1) utility is at its first version
22-03-2007: nlpc-fetch(1) should work and nlpc-scan(1) may be used to seed a database
13-03-2007: pre-release peek at design documentation
Pre-release manuals have been released for comment (these manuals are installed in Unix manual format along with system executables):
The Unix manuals are, and will continue to be, the canonical source for system documentation.
In order to compile, you'll need Berkeley DB, libcurl, and libxml. Edit the Makefile to set system-dependent values, then execute
$Id: nlpcrawl.html,v 1.9 2007-03-23 11:51:48 kristaps Exp $