TextGrab - software for grabbing text from internet sites

Last update: 17 January 2014

NEW: Demo version of TextGrab for MS-Windows

The demo version of TextGrab works with one web site only, and its output file can be processed only with the line or paragraph format of TextQuest. Only *.HTM and *.TXT files are downloaded.

TextGrab (Text Grabbing) is a tool for the content analysis of web sites. It copies the text files (*.txt, *.htm, *.php, *.xml, and *.cfm files) from a web site to the hard disk of your computer and prepares them for processing with content analysis software. RTags (remove tags) is also part of the distribution. HTML tags are removed, and special characters like ä or ß are translated.
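The tag removal described above can be illustrated with a short Python sketch. This is not the RTags code, only a minimal example built on Python's standard html.parser module. It assumes that "translating" special characters means turning HTML entities such as &auml; or &szlig; into the characters ä and ß; the names TagStripper and strip_tags are made up for this example.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects the text of an HTML page and drops the markup.

    Illustrative only: this is not how RTags itself is implemented.
    """

    def __init__(self):
        # convert_charrefs=True decodes entities such as &auml; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep visible text only; script and style bodies are ignored.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html_text):
    """Return the plain text of html_text with whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(strip_tags("<p>Gr&ouml;&szlig;e <b>und</b> L&auml;nge</p>"))
```

Whether the real program writes ä and ß as characters or converts them to some other representation is not stated here, so the decoding step is an assumption.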

TextGrab is a special kind of offline reader (or web spider). You specify the web site to be downloaded; TextGrab reads all the files (for the extensions, see above) and stores them in one output file. A separator is written between the files, so that it is easy to see where a new file begins. This separator is TextQuest/Intext compatible, so these text analysis programs can process the output of TextGrab directly.

The program works with Win9x or better, Red Hat Linux, and HP-UX. Versions for other operating systems with a C++ compiler can be generated on demand.

TextGrab is a command line program and has no graphical user interface of the kind found in Windows programs. The following options are currently implemented:

-h = Get a document header
-l = Grab a file plus its links
-r = Grab a file and recursively follow each link
-s = Grab the specified file only (default)
Format = Output format
    1 = TextQuest/Intext
    2 = LIWC

Other formats will be implemented on demand.
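The difference between the -s, -l, and -r options can be shown with a small Python sketch. It is only a model of the three link-following modes, not TextGrab's code: the web site is replaced by an in-memory dictionary, the names extract_links and grab are hypothetical, and the -h option is left out.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def grab(pages, start, mode="s"):
    """Return the page names that would be grabbed, in visit order.

    pages: dict mapping page name -> HTML text (a stand-in for a web site)
    mode:  "s" = the specified file only, "l" = the file plus its links,
           "r" = follow every link recursively
    """
    if mode not in ("s", "l", "r"):
        raise ValueError("mode must be 's', 'l' or 'r'")
    max_depth = {"s": 0, "l": 1, "r": None}[mode]  # None = no depth limit
    visited = [start]
    queue = [(start, 0)]
    while queue:
        name, depth = queue.pop(0)
        if max_depth is not None and depth >= max_depth:
            continue  # do not follow links from this page
        for link in extract_links(pages.get(name, "")):
            # Only follow links that stay on the site and are not yet seen.
            if link in pages and link not in visited:
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


if __name__ == "__main__":
    site = {
        "a.htm": '<a href="b.htm">b</a>',
        "b.htm": '<a href="c.htm">c</a> <a href="a.htm">a</a>',
        "c.htm": "no links here",
    }
    for mode in ("s", "l", "r"):
        print(mode, grab(site, "a.htm", mode))
```

The list of visited pages prevents the recursive mode from looping when pages link back to each other.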

You invoke the demo version by typing the following command at the MS-DOS prompt:

grabdemo -r output.txt 1

Instead of output.txt you can specify any other valid destination for the output file. Please note that the demo version downloads www.intext.de only; no other web site is possible with the demo version. The full version, however, can download any web site until your hard disk is full or your internet connection fails.

Advantages of TextGrab

Of course you can do the work TextGrab does yourself, but it will cost you considerable time and effort. The following steps are required:

  1. Copy the text files from the web site(s) to your local hard drive. Doing that with your browser takes a long time. An alternative is an offline reader/browser that copies these files; several are available, some of them free. Because they let you read the files as if you were online, they store all the files the way they are stored on the web site: in directories and files. The structure of the web site is kept, but text analysis software cannot make use of files stored in this structure.
  2. Edit each file and insert control sequences. Before you can work with text analysis software, you have to segment your texts and insert control sequences that set the values of external variables (e.g. name, web site, date, number of pictures). You have to do that for each file, and even if you are fast, it will take about a minute per file. This kind of work is also error-prone.
  3. Merge all the files into one big output file. Most text analysis programs require this operation, and you can do it with the tools that come with the operating system. But remember: if you have a lot of directories, copying everything may be difficult, and at the very least you have to make sure that no file is forgotten. This step may take up to 15 minutes, depending on the number of files and your working speed.
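To give an idea of what steps 2 and 3 involve, here is a minimal Python sketch that walks a directory tree saved by an offline reader and merges the text files into one output file, writing a separator line before each file. The separator shown ("#FILE" plus the relative path) is purely hypothetical and is not the TextQuest/Intext control sequence; the function name merge_tree is also an assumption made for this example.

```python
import os

# File extensions treated as text, taken from the list given earlier.
TEXT_EXTENSIONS = (".txt", ".htm", ".php", ".xml", ".cfm")


def merge_tree(root, output_path, separator="#FILE {path}"):
    """Merge all text files below root into output_path.

    A separator line (hypothetical format) is written before each file.
    Returns the number of files merged.
    """
    count = 0
    output_abs = os.path.abspath(output_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for dirpath, dirnames, filenames in os.walk(root):
            dirnames.sort()  # visit subdirectories in a stable order
            for filename in sorted(filenames):
                if not filename.lower().endswith(TEXT_EXTENSIONS):
                    continue
                full = os.path.join(dirpath, filename)
                if os.path.abspath(full) == output_abs:
                    continue  # never merge the output file into itself
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                with open(full, encoding="utf-8", errors="replace") as src:
                    out.write(separator.format(path=rel) + "\n")
                    out.write(src.read().rstrip("\n") + "\n")
                count += 1
    return count


if __name__ == "__main__":
    merged = merge_tree("downloaded_site", "output.txt")
    print(merged, "files merged")
```

Because every file below the root directory is visited by the walk, no file can be forgotten, which is the main risk when the merge is done by hand.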

TextGrab saves you a lot of time because it does steps 2 and 3 automatically. With a hundred files on a web site, TextGrab saves you hours of boring work.

TextGrab is a command line driven program without a graphical user interface, so it can run in the background while it downloads the web site. Command line options specify how links are followed. RTags, a program that removes HTML tags, is also part of the TextGrab distribution.

