
gTextCvt v0.3.1
(C)2007 Paul Schuurmans


==============================================================================
Description
==============================================================================

gTextCvt is a simple utility to convert text files to other formats (primarily 
HTML and DICT).  I had two goals in mind:

My first goal was to make it easier to create Web pages from the plain-text 
ebooks found at Project Gutenberg.  Although plain-text is perfectly fine for 
most purposes, I wanted an easy-to-read format that I could read using a Web 
browser.  I also wanted the books split up into smaller (chapter-sized) files 
to make it easier to resume reading at a later time without having to put too 
much effort into finding where I left off.

My second goal was to make it easier to create dictionaries in a format that 
the dictd server could use.  Although single-file plain-text dictionaries have 
their advantages, I like the idea of being able to type a word (or phrase) 
into a dict client and having the dictd server do the rest of the work.

gTextCvt now includes a mode called "Plain Text" which allows you to use this 
program as a simple text editor.  The items in the Text menu are described 
below.

"Center Lines" / "UnCenter Lines" centers or left-justifies the selected 
lines.  This function works on a single line or a selection of multiple lines.

"Change Case" changes the case (lower or upper) of selected text.  If no text 
is selected, this function changes the case of the character at the current 
cursor position.

"Title Case" works only on selections of text.  The first letter of each 
word in the selection is capitalized and all other letters in that word are 
converted to lowercase.
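
As a rough illustration (a Python sketch, not gTextCvt's actual code), the 
rule described above can be written as follows.  It assumes that "word" means 
a run of characters separated by whitespace; the function name is made up.

```python
def title_case(selection: str) -> str:
    """Capitalize the first character of each word and lowercase the rest."""
    out = []
    start_of_word = True
    for ch in selection:
        if ch.isspace():
            # Whitespace ends the current word; the next character starts a new one.
            start_of_word = True
            out.append(ch)
        elif start_of_word:
            out.append(ch.upper())
            start_of_word = False
        else:
            out.append(ch.lower())
    return "".join(out)
```

For example, title_case("the QUICK brown fOX") gives "The Quick Brown Fox".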

"Tabify" / "UnTabify" will convert spaces to tabs or tabs to spaces, 
respectively.  This function works only on selections of text.

"Split Line" takes the current line and adds a NewLine (LF) character somewhere 
before the nth character (set in File|Preferences).  This function splits only 
the first n characters of a line.  So, for example, if n is 80 and the current 
line is 800 characters long, you'll need to do this 9 or 10 times to split the 
whole line.
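
The behavior can be sketched in Python as shown below.  This is only an 
illustration under one assumption: that the break is placed at the last space 
found within the first n characters (the text above only says "somewhere 
before the nth character").  The names split_once and split_fully are made up.

```python
def split_once(line: str, n: int = 80) -> tuple[str, str]:
    """Split `line` at the last space within its first n characters.

    Returns (head, rest).  If the line already fits, or there is no usable
    space to break on, the line is returned unchanged and rest is empty.
    """
    if len(line) <= n:
        return line, ""
    cut = line.rfind(" ", 0, n)
    if cut <= 0:
        return line, ""
    return line[:cut], line[cut + 1:]


def split_fully(line: str, n: int = 80) -> list[str]:
    """Repeat split_once until nothing is left, like invoking the menu item
    over and over on the remainder of a long line."""
    pieces = []
    rest = line
    while rest:
        head, rest = split_once(rest, n)
        pieces.append(head)
    return pieces
```

Calling split_once corresponds to one use of "Split Line"; split_fully shows 
why a very long line needs the function applied several times.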

"Concatenate Lines" works on a selection of lines.  Basically, what this 
function does is it removes all NewLine (LF) characters from the selected 
text.  This essentially causes all of the selected lines to become one.

"Reduce Whitespace" works only on a selection of text.  This function looks for 
a series of consecutive spaces in the selected text and reduces them down to a 
single space.
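
A minimal Python sketch of the same idea, assuming that only the space 
character is affected (tabs and line breaks are left alone); the function name 
is hypothetical:

```python
import re


def reduce_whitespace(selection: str) -> str:
    """Collapse every run of two or more spaces into a single space."""
    return re.sub(r" {2,}", " ", selection)
```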

"Add LFs To Line" works on a single line only.  This function adds a NewLine 
(LF) to the beginning and end of the current line.  This basically puts a blank 
line above and below the current line.



==============================================================================
Installation
==============================================================================

Due to its size and the fact that it doesn't work on all systems, the 
configure script has been omitted from this package.  There is, however, 
a script included (autogen.sh) which creates the configure script.

To install this package:

  tar zxvf gtextcvt-0.3.tar.gz
  cd gtextcvt-0.3
  sh autogen.sh
  make
  make install

The autogen script creates the configure script.  If you need to pass any 
special parameters to the configure script, you can do so before running 
make.



==============================================================================
Using Anjuta
==============================================================================

This package contains an Anjuta 1.2.2 project file.  Because of the missing 
configure script, "Build|Build" won't work initially; you'll need to select 
"Build|Auto generate..." first.



==============================================================================
Conversion Modes
==============================================================================

DOS Text > UNIX Text
====================
In this mode, the CRLF (Carriage Return / Line Feed) found at the end of each 
line in a DOS text file is replaced with LF.


UNIX Text > DOS Text
====================
In this mode, the LF (Line Feed) found at the end of each line in a UNIX text 
file is replaced with CRLF (Carriage Return / Line Feed).
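
This mode and the DOS Text > UNIX Text mode above are simple byte 
substitutions.  The Python sketch below illustrates both; it is not taken from 
gTextCvt.  The function names are made up, and the normalization step in 
unix_to_dos is an assumption added here so that a file which already contains 
some CRLF pairs does not end up with doubled carriage returns.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace every CRLF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")


def unix_to_dos(data: bytes) -> bytes:
    """Replace every LF with CRLF.

    Existing CRLF pairs are first reduced to LF (an assumption of this
    sketch) so that they are not turned into CR CR LF.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
```

Working on bytes rather than text avoids any automatic newline translation 
when the files are opened in binary mode.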


Book > HTML
===========
In this mode, a tagged etext book (from Project Gutenberg) is processed to 
create one or more chapter files in HTML format.

The file to be converted should be tagged as follows: an underscore (_) and a 
single capital letter, on a line by itself, specifies what to do with the next 
line in the file:

Command Tags:
_S	Stop Chapter.  Close the current chapter file and start a new one.
_C	Same as _S, but use the next line as the new Caption (Title).
_H	Heading1.  Use the next line as a main entry in the index file.
_P	Heading2.  Use the next line as a sub-entry in the index file.

Mode Tags:
_L	Highlight.  Highlight the first line of the following paragraphs in bold.
_N	Same as _L, and add break at the end of the first line.
_B	Breaks.  Add <BR> to end of lines and indent lines starting with 2 spaces.
_Q	Same as _B, but enclose in <BLOCKQUOTE>.
_K	Keep "As Is".  Enclose in <PRE>.
_X	End.  Exit the current mode.

Lines beginning with these special characters will be parsed as follows:
~	Highlight line in bold (<B>).
!	Highlight line in bold (<B>), and add break (<BR>) at end.

How Tagged Files Are Processed
------------------------------
The first 3 lines of a tagged file should contain the following information 
about the book:
 1. Title    This is printed at the top and bottom of each chapter page and 
             serves as a link back to the index page.  
 2. Author   This is used only once near the top of the index page (just below 
             the title of the book).
 3. Etext #  This is used in a string at the top and bottom of the index page.

After the 3 information lines, the next part of the file is assumed to be the 
Project Gutenberg Information (included at the top of each PG etext).  This PG 
Info will be written to a file called "<BaseName>-pg.html" until gTextCvt 
encounters the first Stop tag.  A Stop tag can be either a Stop Chapter (_S) or 
Caption (_C) tag.  When the first Stop tag is encountered, gTextCvt closes the 
PG Info file and starts a new file.  This new file is the first chapter file 
and gTextCvt continues writing to it until it encounters another Stop tag.  
Each subsequently encountered Stop tag will close the current file and start a 
new one until the end of the tagged etext file is reached.
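
The file-splitting logic described above can be sketched in Python as follows.  
This is an illustration only, not gTextCvt's code: it groups lines by output 
file and writes no HTML, it treats every other tag (and the caption line after 
_C) as ordinary body text, and the ".html" extension and numbering from 1 for 
chapter files are assumptions.  The function name is made up.

```python
STOP_TAGS = {"_S", "_C"}  # Stop Chapter and Caption tags both start a new file


def split_tagged_book(lines, base_name="book", digits=3):
    """Group the lines of a tagged etext by the file they would be written to.

    Returns (info, files): info holds the three header lines, and files maps
    an output filename to the list of lines collected for it.
    """
    info = {"title": lines[0], "author": lines[1], "etext": lines[2]}
    current = f"{base_name}-pg.html"  # PG Info goes here until the first Stop tag
    files = {current: []}
    chapter = 0
    for line in lines[3:]:
        if line.strip() in STOP_TAGS:
            # Close the current file and start the next chapter file.
            chapter += 1
            current = f"{base_name}{chapter:0{digits}d}.html"
            files[current] = []
            continue
        files[current].append(line)
    return info, files
```

With base_name "book" and digits 3, a file containing two Stop tags would 
produce book-pg.html, book001.html, and book002.html.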

The Heading1 tag (_H) specifies that the next line should be used as a main 
entry in the index file.  The Heading2 tag (_P) specifies that the next line 
should be used as a sub-entry in the index file.  Main entries are links that 
point to chapter files, while sub-entries point to specific sections of a 
chapter file.

Note: Newer Project Gutenberg eBooks have PG info at the top and bottom of the 
file.  To avoid having 2 separate HTML pages with PG info, it may be a good 
idea to cut the PG info from the bottom of the file and paste it near the top.  
Ideally, the bottom PG info should be moved to a location that is _after_ the 
top PG info and _before_ the first Stop tag.

Options
-------
"BaseName" is used to name the newly-created HTML files.
"Digits" specifies how many digits to use when naming sequential files (e.g., 3 
means that files will be named BaseName001, BaseName002, and so on).
"Use BaseName for Index File" specifies that BaseName should be used as the 
main index file (otherwise, index.html is used).  Use this option if you plan 
to keep several eBooks in the same directory.
"Head Style" specifies which H value to use for main topics.
"Sub" specifies which H value to use for subtopics.
"Graphical Nav Buttons" specifies that bitmaps should be used for the 
navigation buttons at the top and bottom of each HTML page.  The bitmaps must 
be named "prev.jpg" and "next.jpg" and must reside in the same directory as the 
HTML files.


Book > HTML PRE
===============
This is basically the same as Book > HTML (see above) except that the text 
will be enclosed in HTML <PRE> and </PRE> tags.  This essentially leaves the 
body of the text "as is".


Text > Dict Source Data
=======================
This creates a tagged Dict Source Data file which can then be used to create 
the actual Dict DZ and Index files.

The conversion starts by creating the following 3 keywords (dictionary entries) 
at the top of the output file: 00-database-url, 00-database-short, and 
00-database-info.  The first part of the input text file is assumed to be 
introductory text describing the dictionary.  This introductory text will be 
appended to the 00-database-info topic.  When gTextCvt encounters at least 2 
blank lines, the next non-blank line is assumed to be a new keyword.  The text 
following this new keyword becomes the body of the definition until another 
series of blank lines is encountered.  This process continues until the end of 
the text file is reached.

Note 1: If your dictionary has no introductory text (i.e., it begins with a 
dictionary keyword instead), you'll need to insert at least 2 completely blank 
lines (with no spaces) at the top of the file.

Note 2: If you need more than one blank line inside of a definition, you can 
use lines consisting of one or more spaces to give them the appearance of being 
blank lines.
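
The keyword detection described above (including Notes 1 and 2) can be 
sketched in Python as follows.  This is an illustration under stated 
assumptions, not gTextCvt's code: it only separates the introduction from the 
keyword/body pairs and does not write the tagged output file, it keeps a 
single blank line as part of a definition, and the function name is made up.

```python
def split_dictionary_text(text: str):
    """Return (intro, entries) for a plain-text dictionary.

    intro is the list of lines before the first keyword; entries is a list of
    (keyword, body_lines) pairs.  Two or more completely empty lines mean that
    the next non-empty line is a new keyword.
    """
    intro = []
    entries = []
    body = intro  # lines are collected here until the first keyword appears
    blanks = 0    # number of consecutive empty lines seen so far
    for line in text.splitlines():
        if line == "":
            blanks += 1
            continue
        if blanks >= 2:
            # Start a new entry: this line is the keyword.
            body = []
            entries.append((line.strip(), body))
        else:
            # A single empty line stays inside the current body.
            body.extend([""] * blanks)
            body.append(line)
        blanks = 0
    return intro, entries
```

A line that contains only spaces is not empty, so it is treated as body text.  
That is why the workaround in Note 2 works.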


Dict Source Data > Dict Dz
==========================
This creates the actual Dict files (xxxx.dict.dz and xxxx.index) to be used 
by the dictd server.  The Dict Source Data should have the following three 
keywords (dictionary entries) at the top of the file: 

 00-database-short  A short name for the dictionary (e.g., "New Dictionary").
 00-database-url    The site where the original database can be found.
 00-database-info   Can be any kind of pertinent introductory info.

The Dict Source Data file is processed as follows:

 :entry:    Use "entry" as a dictionary entry.  The text following this entry 
            becomes the body of the dictionary definition, until a new entry 
            is encountered.
 {keyword}  Use "keyword" as an index entry.  Index entries are words that 
            correspond to dictionary entries.  Therefore, "keyword" must be a 
            valid entry in the current dictionary.

After creating the xxxx.dict file, gTextCvt will try to compress it using 
dictzip to create xxxx.dict.dz.  dictzip is part of the dictd package, so if 
you don't end up with a dict.dz file, it probably means that you don't have 
dictd installed.

After creating the xxxx.index file, gTextCvt will try to sort the lines in that 
file using sort.  AFAIK, all GNU/Linux systems include sort, so there shouldn't 
be a problem here.  However... it appears that some versions of sort ignore the 
-d (dictionary order) option.  If you come across missing entries (ones that 
you know are in the dictionary, but aren't being found by the dict client), a 
possible cause might be incorrectly sorted lines in the index file.  If the 
index file is fairly small, you can try editing it manually, moving lines into 
the correct order.  Otherwise, you can try sorting the index on a different 
distro (Debian machines are known to include a version of sort that works 
correctly).
Sort the file using:  sort -df [name2].index > [name].index
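
If no working sort is at hand, the ordering can be approximated in Python.  
The sketch below is an assumption-laden stand-in, not a replacement for sort: 
it mimics -d by comparing only letters, digits, and blanks, and -f by folding 
case, but it ignores locale rules, so the result may differ from what sort 
produces on a given system.  The function names are made up, and the sample 
lines in the usage note assume tab-separated index fields.

```python
def dictionary_key(line: str) -> str:
    """Sort key: keep only letters, digits, spaces, and tabs, then fold case."""
    kept = [ch for ch in line if ch.isalnum() or ch in " \t"]
    return "".join(kept).upper()


def sort_index_lines(lines):
    """Return the index lines in (approximate) dictionary order."""
    return sorted(lines, key=dictionary_key)
```

For example, the lines "zebra<TAB>A<TAB>B", "Apple<TAB>C<TAB>D", and 
"ant-eater<TAB>E<TAB>F" come back in the order ant-eater, Apple, zebra.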


Text > HTML
===========
In this mode, the contents of an input file are read and placed inside a set of 
HTML header and footer tags.


Text > HTML BODY Only
=====================
This is basically the same as Text > HTML (see above) except that no HTML 
header (<HTML><BODY>) or footer (</BODY></HTML>) is added to the output file.  
The output is meant to be copied and pasted into an existing HTML file.


Text > HTML Index Entries
=========================
This is basically only useful if you're creating something like a dictionary in 
HTML format, where the index consists of several sub-links that point to the 
same page.  The output from this mode is meant to be copied and pasted into an 
index.html file.

In this mode, the input file is read 3 lines at a time.  The 3 lines are then 
inserted inside <A HREF> and </A> pairs.  The output will be something like:

  <A HREF="tmpname.html#line1">line1</A>, <A HREF="tmpname.html#line2">line2</A>, <A HREF="tmpname.html#line3">line3</A>, 
  <A HREF="tmpname.html#line4">line4</A>, <A HREF="tmpname.html#line5">line5</A>, <A HREF="tmpname.html#line6">line6</A>, 

In the output file, the filename "tmpname.html" should be changed (using a text 
editor's "Search & Replace") to whatever filename the sub-links point to.
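
A short Python sketch of how such output could be produced is shown below.  It 
is an illustration that follows the sample output above, not gTextCvt's code; 
the function and parameter names are made up.

```python
def index_entries(lines, target="tmpname.html", per_row=3):
    """Build <A HREF> links from the input lines, per_row links per output row."""
    rows = []
    for i in range(0, len(lines), per_row):
        group = lines[i:i + per_row]
        # Each link is followed by ", " as in the sample output.
        links = [f'<A HREF="{target}#{name}">{name}</A>, ' for name in group]
        rows.append("".join(links))
    return "\n".join(rows)
```

Passing the real filename as target would make the later search and replace 
unnecessary.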


Text > HTML PRE
===============
This is basically the same as Text > HTML (see above) except that the text 
will be enclosed in HTML <PRE> and </PRE> tags.  This essentially leaves the 
body of the text "as is".


Text > HTML Contents Table
==========================
This reads an input file and creates an HTML file containing a table.  Each 
line of the input file is inserted 3 times to create an entry of the form 
"<TR><TD><B>line</B></TD><TD><LI><A HREF="line">line</A></TD></TR>".


Text > HTML Table Data
======================
This also reads an input file and creates an HTML file containing a table.  
However, the input file should be a tab-delimited text file (usually exported 
from a spreadsheet file) and there should be no more than 1 tab character 
separating the fields.
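
The Python sketch below shows one way a tab-delimited file maps onto table 
rows, and why doubled tabs matter.  The exact markup gTextCvt writes is not 
documented here, so the <TABLE>, <TR>, and <TD> layout and the function name 
are assumptions.

```python
def table_from_tsv(text: str) -> str:
    """Turn tab-delimited text into a simple HTML table, one row per line."""
    rows = []
    for line in text.splitlines():
        cells = line.split("\t")  # every tab starts a new cell
        tds = "".join(f"<TD>{cell}</TD>" for cell in cells)
        rows.append(f"<TR>{tds}</TR>")
    return "<TABLE>\n" + "\n".join(rows) + "\n</TABLE>"
```

In this sketch, two consecutive tabs produce an empty <TD></TD> cell and shift 
the remaining fields one column to the right.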


HTML > Text
===========
In this mode, an HTML input file is read to produce an output file with the 
HTML tags stripped.  Processing proceeds as follows:

- When gTextCvt encounters a '<' character, a flag is set indicating that we 
  are now inside of an HTML tag.  The following characters are ignored until a
  '>' is encountered.
- When gTextCvt encounters a '>' character, the "In Tag" flag is cleared and 
  the following characters are written to the output file until the next '<'.

Note: If the HTML input file contains '<' or '>' characters which are not used 
to specify HTML tags, the characters should be replaced with "&lt;" (less than) 
and "&gt;" (greater than) strings (or any other unique strings) which can later 
be replaced with '<' and '>'.
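
The two rules above amount to a small state machine.  A minimal Python sketch 
(an illustration, not gTextCvt's code; the function name is made up):

```python
def strip_tags(html_text: str) -> str:
    """Remove everything between '<' and '>' using a single "in tag" flag."""
    out = []
    in_tag = False
    for ch in html_text:
        if ch == "<":
            in_tag = True       # start ignoring characters
        elif ch == ">":
            in_tag = False      # resume copying after this character
        elif not in_tag:
            out.append(ch)
    return "".join(out)
```

For example, strip_tags("<P>Hello <B>world</B>!</P>") returns "Hello world!".  
A stray '<' or '>' confuses this approach: strip_tags("a < b and c > d") 
drops everything between the two symbols, which is the reason for the note 
above.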


==============================================================================
Miscellaneous Notes
==============================================================================

The File|Save menu item writes the current contents of the editor to the file 
named in the 'Src' field, NOT the file named in the 'Dst' field.  The file or 
directory named in the 'Dst' field is used as the destination path when the 
[Convert] button is pressed.  It's important not to confuse the two.

