nlpcrawl: a suite of tools for crawling the Internet

Introduction [top]

nlpcrawl is a "suite of tools [that works] asynchronously in order to crawl the web for relevant content." nlpcrawl was originally developed at the University of Latvia to provide a reliable, simple means to crawl web pages and perform natural language operations on the downloaded corpus. This research is still underway. Since the requirements of these operations are fairly simple, the considerable retrofitting that existing tools would require was considered overkill. Instead, new tools were either written from scratch or carefully refitted from existing ones. The software is licensed in full under the permissible BSD 3-clause license.

The intent of nlpcrawl is to provide a portable, robust, and simple set of tools to acquire mark-up data (i.e., not images, binary documents, or other "rich media") and keep it fresh. Off-line analysis tools must be able to easily iterate over new content and have a reliable means of referencing sources (URL, access time, etc.). All tools must be entirely architecture- and system-indifferent.

The nlpcrawl suite of tools is being developed by Kristaps Dzonsons, kristaps dot dzonsons at latnet dot lv, at the University of Latvia's Institute of Mathematics and Computer Science.


Status [top]

This project is still in the early/design phase.


News [top]

23-03-2007: nlpc-fetch(1) updated to recognise expirations; a new utility, nlpc-list(1), lists database entities; and the nlpc-up(1) utility reaches its first version

22-03-2007: nlpc-fetch(1) should work and nlpc-scan(1) may be used to seed a database

13-03-2007: pre-release peek at design documentation


Documentation [top]

Pre-release manuals have been released for comment; they are installed in Unix manual-page format alongside the system executables.

The Unix manuals are, and will continue to be, the canonical source of system documentation. To compile, you'll need Berkeley DB, libcurl, and libxml. Edit the Makefile with your system-dependent values, then run make and make install. nlpcrawl should build and install in any Unix environment.
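The steps above might look roughly as follows. This is only a sketch: the exact Makefile variable names (PREFIX, CFLAGS, LDFLAGS) and library paths are assumptions, not taken from the distribution, so consult the shipped Makefile for the actual knobs.

```shell
# Unpack the release tarball from the download section.
tar -zxf nlpcrawl-0.0.3.tgz
cd nlpcrawl-0.0.3

# Edit the Makefile for your system before building, e.g. (hypothetical
# variable names -- check the real Makefile):
#   PREFIX  = /usr/local
#   CFLAGS += -I/usr/local/include
#   LDFLAGS += -L/usr/local/lib -ldb -lcurl -lxml2

# Build and install; the install step may require root privileges.
make
make install
```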


Download [top]
nlpcrawl-0.0.3.tgz 23-03-2007 -
nlpcrawl-0.0.2.tgz 22-03-2007 99fc757ec313bc20106d0dac10254d71
nlpcrawl-0.0.1.tgz 13-03-2007 870a9d4cf839f2e3007310dd65718fe9


$Id: nlpcrawl.html,v 1.9 2007-03-23 11:51:48 kristaps Exp $