TextGrab - software for grabbing text from internet sites
Last update: 7. July 2007
NEW: Demo version of TextGrab for MS-Windows
The demo version of TextGrab works with one web site only, and its output file can be processed
only with the line or paragraph format of TextQuest.
Only *.HTM and *.TXT files are downloaded.
TextGrab (Text Grabbing) is a tool for the content analysis of web sites.
It copies the text files (*.txt, *.htm, *.php, *.xml, and *.cfm files) from a web site to the hard
disk of your computer and prepares them for processing with content analysis software.
RTags (remove tags) is also part of the distribution. It removes HTML tags and translates
special characters like ä or ß.
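To illustrate the kind of clean-up RTags performs, here is a minimal Python sketch. It is not the RTags implementation: it assumes that "translated" means HTML entity references such as &auml; become the characters they stand for, and the names TagStripper and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text of an HTML page and drops all markup."""

    def __init__(self):
        # convert_charrefs=True turns &auml;, &szlig; etc. into real characters
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the plain text of html_text with whitespace normalised."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling strip_tags on the text of a downloaded *.htm file gives a plain string that can be passed on to content analysis software.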
TextGrab is a special kind of offline reader (or web spider). You specify
the web site to be downloaded; TextGrab reads all the files (for the extensions, see above) and stores them
in one output file. A separator is written between the files, so that one
can easily detect where a new file begins. This separator is TextQuest/Intext
compatible, so these text analysis programs can process the output of TextGrab directly.
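The idea of one output file with a separator between the downloaded files can be sketched in Python as follows. This is an illustration only: the separator string shown here is a placeholder, because the actual TextQuest/Intext separator is not given on this page, and merge_texts and merge_files are hypothetical helper names.

```python
from pathlib import Path

# Placeholder only: the real TextQuest/Intext separator is not documented here.
SEPARATOR = "#### FILE: {name} ####"


def merge_texts(texts, separator=SEPARATOR):
    """Join (name, text) pairs into one string, with a separator line per file."""
    chunks = []
    for name, text in texts:
        chunks.append(separator.format(name=name))
        chunks.append(text.rstrip("\n"))
    return "\n".join(chunks) + "\n"


def merge_files(paths, output_path, separator=SEPARATOR):
    """Read every file in paths and write the merged text to output_path."""
    texts = []
    for p in paths:
        path = Path(p)
        texts.append((path.name, path.read_text(encoding="utf-8", errors="replace")))
    Path(output_path).write_text(merge_texts(texts, separator), encoding="utf-8")
```

Because every file starts with its own separator line, a program reading the output file can tell where each new file begins.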
The program works with Windows 9x or later, Red Hat Linux, and HP-UX.
Versions for other operating systems with a C++ compiler
can be generated on demand.
TextGrab is a command line program and has no graphical user interface
(unlike typical Windows programs). The following options are currently implemented:
-h = Get a document header
Other output formats will be implemented on demand.
You invoke the demo version by typing the following command at the MS-DOS prompt:
grabdemo -r output.txt 1
Instead of output.txt you can specify any other valid destination for the output file. Please
note that the demo version downloads www.intext.de only; no other web site is possible with
the demo version. The full version, however, can download any web site until your hard disk is full
or your internet connection fails.
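If you want to start the demo from a script instead of typing the command, a small Python wrapper along the following lines may help. It is only a sketch: the argument order is taken from the single example above, and it is an assumption that the other mode options and the format number 2 go in the same positions. The function names demo_command and run_demo are invented for this example.

```python
import subprocess

# Options and format numbers as listed on this page
MODES = {"single": "-s", "links": "-l", "recursive": "-r"}
FORMATS = {"textquest": "1", "liwc": "2"}


def demo_command(output_file, mode="recursive", fmt="textquest", program="grabdemo"):
    """Build the argument list: program, mode option, output file, format number.

    The order follows the documented example 'grabdemo -r output.txt 1';
    using it for other modes and formats is an assumption.
    """
    return [program, MODES[mode], output_file, FORMATS[fmt]]


def run_demo(output_file, **kwargs):
    """Run the demo; this only works where grabdemo is installed and on the PATH."""
    return subprocess.run(demo_command(output_file, **kwargs), check=True)
```

With the default arguments, demo_command("output.txt") produces exactly the command shown above.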
Of course you can do the work that TextGrab does yourself, but doing so
takes considerable time, effort, and cost. The following tasks
must be done:
TextGrab saves you a lot of time because it does steps 2 and 3 automatically.
With a hundred files on a web site, TextGrab saves you hours of boring work.
TextGrab is a command line driven program without a graphical user interface,
so it runs in the background while downloading the web site. Command line options
can be used to specify how to follow links. RTags, a program that removes
HTML tags, is also part of the TextGrab distribution.
You can order TextGrab using this order form.
-l = Grab a file plus its links
-r = Grab a file and recursively follow each link
-s = Grab the specified file only (default)
Format = Output format: 1 = TextQuest/Intext
                        2 = LIWC
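The difference between the -s, -l, and -r options can be illustrated with a small Python sketch that works on a toy link table instead of a real web site. It is not TextGrab's code: whether TextGrab follows links breadth-first, skips pages it has already seen, or stays on one host is not stated on this page, so those choices are assumptions, and pages_to_grab is a made-up name.

```python
from collections import deque


def pages_to_grab(start, links, mode="-s"):
    """Return the pages that would be fetched, in order.

    links maps each page to the list of pages it links to
    (a toy stand-in for parsing real HTML).
    """
    if mode == "-s":
        # The specified file only
        return [start]
    if mode == "-l":
        # The file plus the pages it links to directly
        order = [start]
        for target in links.get(start, []):
            if target not in order:
                order.append(target)
        return order
    if mode == "-r":
        # Follow every link recursively; each page is visited once (assumption)
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            page = queue.popleft()
            order.append(page)
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
        return order
    raise ValueError(f"unknown mode: {mode}")
```

For a site where index.htm links to a.htm and b.htm, and a.htm links on to c.htm, the -s mode yields index.htm only, -l adds a.htm and b.htm, and -r also reaches c.htm.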
Advantages of TextGrab