4Spell, a spell-checker for Windows 95/98/NT

Wietse Dol
Erik Frambach

Abstract

In this paper we describe the features of 4Spell 1.1, a Windows spell-checker for TeX documents. Since there aren't many good spell-checkers around and since 4Spell only works on Windows platforms, we also explain how the spell-checking is done. This should make it possible to write a spell-checker for other platforms (why not use Perl and become platform independent? :-). 4Spell is part of the new 4TeX for Windows (release expected by the end of March 1999). We realized, however, that this tool could be useful for people who do not want to use 4TeX, and hence we made it a stand-alone freeware program.

Introduction

Spell-checkers are nowadays widely used in word processors such as MS-Word and WordPerfect. They are extremely useful in correcting spelling errors, especially when writing in a non-native language.

Since TeX users often claim that TeX is better than those word processors, one wonders why there are so few good spell-checkers for TeX documents.

The main reason for this is that TeX documents contain not only "normal" words, but also complex TeX commands. TeX commands may or may not take parameters, and parameters can be delimited in almost any imaginable way.

Within certain TeX environments you want the words to be checked (e.g. in tables) and in others you want them to be ignored (e.g. mathematics). All in all this is a complex situation, especially once you realize that TeX commands, mathematics, and normal words needn't be separated by spaces and line feeds. Any good spell-checker for TeX documents requires a TeX parser that reads the text and decides whether or not a word or a part of a word should be spell-checked. Is writing a parser difficult? The answer probably would be "yes", since there aren't many spell-checkers around. 4Spell proves that writing such a spell-checker can be done, and that it's not that hard to write a spell-checker that can handle even more than just TeX.

[Figure: the 4Spell main window]

4Spell features

When we started to write 4TeX for Windows we still needed the "old" MS-DOS based AmSpell as a spell-checker. AmSpell has some serious problems/bugs when it checks your documents. We will not give you a list of those problems, but after AmSpell has checked your document you can still find spelling errors. This is because AmSpell skips parts of your document and doesn't tell you it did.

In September 1998 we discussed whether we needed to write a spell-checker for 4TeX and concluded that it would be too time consuming to write a good program, since TeX documents are too complex. As often happens, complex material tends to become much simpler when you have a closer look and spend more time thinking about the structures (TeX is a structured language, isn't it?). When starting to write 4Spell we started not on the spell-checking routines but on describing how a TeX document should be parsed by the spell-checker. This parsing is the engine of a good spell-checker (and here AmSpell makes its mistakes). The spell-checking routines were supplied by Aleksander Simonic. Alex is the author of WinEdt, probably the best TeX-aware shareware editor there is for OS/2 and Windows. Cooperation with Alex means that we can all benefit from the same dictionaries, which makes maintenance a lot easier.

In the next section we will describe the parsing of a document, but now we will summarize some of 4Spell's features:

The parser

All functionality above is mostly the result of writing a (TeX) parser. To make it easier for others to write their own parser, and for those who are just curious to know how it works, we will explain the parsing algorithms. The parser reads words until the end of the file is reached. This is done by placing a pointer at the beginning of the file and calling the procedure READWORD.

STEP 1: get a word (procedure READWORD)

  1. Skip EndOfWord characters until the first non-EndOfWord character.
  2. Read and remember characters until the first EndOfWord character.
The result of 1 and 2 is a word.

This READWORD procedure is repeated until the end of the file. With these words you need to do a lot of checks before you can spell-check (since a word as defined above can contain (TeX) commands, etc.). Note also that (within TeX) the EndOfWord characters are defined as: a space, a hyphen, a tilde, a Carriage-Return, a Line-Feed, and an End-Of-File character.
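
The two READWORD steps can be sketched in a few lines of Python. This is only an illustration of the description above, not the code used in 4Spell; the names END_OF_WORD, read_word and read_words are hypothetical, and the End-Of-File character is modelled simply as the end of the string.

```python
# Illustrative sketch of READWORD; all names here are assumptions, not 4Spell's.
# EndOfWord characters: space, hyphen, tilde, CR, LF (end of file = end of string).
END_OF_WORD = set(" -~\r\n")


def read_word(text, pos):
    """Return (word, new_pos), or (None, new_pos) at the end of the text."""
    n = len(text)
    # 1. Skip EndOfWord characters until the first non-EndOfWord character.
    while pos < n and text[pos] in END_OF_WORD:
        pos += 1
    if pos == n:
        return None, pos
    # 2. Read and remember characters until the first EndOfWord character.
    start = pos
    while pos < n and text[pos] not in END_OF_WORD:
        pos += 1
    return text[start:pos], pos


def read_words(text):
    """Repeat READWORD from the start of the text until the end is reached."""
    words = []
    pos = 0
    while True:
        word, pos = read_word(text, pos)
        if word is None:
            return words
        words.append(word)
```

Note that a "word" in this sense can still be something like \textbf{Hello}; separating commands from ordinary text is left to the later steps.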

STEP 2: check the word for properties

For every word check the following:
  1. check if the Language Switch (i.e., the command that is used to change dictionaries) is part of the word
  2. check if one of the commands in the (TeX) Begin Environments list is part of the word
  3. check if one of the commands in the (TeX) commands list is part of the word
  4. check if one of the commands in the Begin Mathematics Environments list is part of the word
  5. check if the Mathematics Command (e.g., $x+y$ or $$x+y$$) is part of the word
  6. check if the Verbatim command is part of the word
  7. check if part of the word starts a (TeX) comment (i.e. the % sign)
If one of the above is true, you keep on reading words until the corresponding condition below is met:
  1. the filename of the dictionary has been read (the characters after the Language Switch command are the filename of the dictionary that should be loaded at that point)
  2. the End Environment command is part of the word
  3. the End Command command is part of the word
  4. the End Mathematics Environment command is part of the word
  5. the End Mathematics command is part of the word
  6. the End Verbatim character is reached
  7. the end of the line is reached
This seems easy, but the problem is that while looking for, say, the end of an environment, the same environment can start again. Hence we must not stop at the first end-environment command, but at the second (or an even later) one. This example will hopefully explain the problem:

\begin{skipping}
To explain the spell problem see this example
\begin{skipping}
This won't work if you do not count the number
of begin environments
\end{skipping}
You understand the example?
\end{skipping}
What the spell-checker should do is skip the complete example above. It should not stop skipping at the first \end{skipping} command.
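
Counting the begin environments can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the 4Spell implementation: the function name skip_environment is hypothetical, it searches the raw text instead of working word by word, and it assumes environments are written as \begin{name} and \end{name}.

```python
# Illustrative sketch; skip_environment is a hypothetical name, not part of 4Spell.
def skip_environment(text, start, name):
    """Return the index just past the end command that matches the begin
    command ending at `start`; return len(text) if it is never closed."""
    begin_cmd = "\\begin{" + name + "}"
    end_cmd = "\\end{" + name + "}"
    depth = 1  # we are already inside one environment
    pos = start
    while depth > 0:
        next_begin = text.find(begin_cmd, pos)
        next_end = text.find(end_cmd, pos)
        if next_end == -1:
            return len(text)  # unbalanced input: skip to the end of the file
        if next_begin != -1 and next_begin < next_end:
            depth += 1  # the same environment starts again
            pos = next_begin + len(begin_cmd)
        else:
            depth -= 1  # one level closed
            pos = next_end + len(end_cmd)
    return pos
```

Called with the position just after the first \begin{skipping} of the example, the counter rises to 2 at the inner begin and only returns to 0 at the last \end{skipping}, so the whole example is skipped.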

When performing the actions above, we were looking for parts of the words. This means that after these actions we will have found (part of) a word preceding the action and (part of) a word after ending the action. With these two words (which may be empty) we proceed as with a word that doesn't trigger one of the actions described above.

STEP 3: divide word into subwords

Check whether the word contains SubWordPunctuationMarks. If so, divide the word into subwords. The SubWordPunctuationMarks are

,;:.!@#$&*?"%(){}[]-+=0123456789\`~^*_/|'
An example could clarify the meaning of subwords. Suppose we have the word

\def\hello{\textbf{Hello}}
This will be divided into four subwords:

\def
\hello
\textbf
Hello
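
The division into subwords could be sketched like this in Python. The set of marks is copied from the list above; the function name split_subwords is hypothetical, and keeping a backslash attached to the subword that follows it is our assumption, inferred from the four subwords shown in the example.

```python
# Illustrative sketch; split_subwords is a hypothetical name, not part of 4Spell.
SUBWORD_MARKS = set(r""",;:.!@#$&*?"%(){}[]-+=0123456789\`~^*_/|'""")


def split_subwords(word):
    """Divide a word into subwords at the SubWordPunctuationMarks."""
    subwords = []
    current = ""
    for ch in word:
        if ch in SUBWORD_MARKS:
            # A mark ends the current subword; a lone backslash is dropped.
            if current not in ("", "\\"):
                subwords.append(current)
            # Assumption: a backslash starts the next subword (a TeX command).
            current = "\\" if ch == "\\" else ""
        else:
            current += ch
    if current not in ("", "\\"):
        subwords.append(current)
    return subwords
```

With this sketch, split_subwords(r"\def\hello{\textbf{Hello}}") gives the four subwords listed above.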

STEP 4: check the subwords for properties

These subwords are candidates for spell-checking, but before we spell-check these subwords we check:
  1. If the subword starts with a (TeX) Command Character ("\"), then skip the subword (so the first three subwords of the example are skipped).
  2. If the subword is one of the words in the Ignore words list, then ignore it.
  3. If the subword is one of the words in the Replace words list, then replace it automatically.
  4. If the subword is one of the words in the User Dictionary, then skip the subword.
  5. If the subword is one of the words in the Ignore Dictionary, then skip the subword.
  6. If the subword is one of the words in the Auto Replace word list, then replace the subword with the correct word from the Auto Replace with word list.
If the subword doesn't belong to any of the six categories above, we spell-check the subword (Alex's routines do the job fast and easy). If the subword is correct we skip the subword. If it is not a correct word, we will search for alternatives for this (sub)word. The user will be prompted by 4Spell what to do in this case: select one of the alternatives, enter your own text, ignore this word, or add it to the user dictionary.
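
The order of these checks can be sketched as a simple decision function in Python. Everything here is hypothetical: the function name, the dictionary keys, and the is_correct callback (which merely stands in for the real spell-checking routines) are assumptions made for the illustration, not 4Spell's interfaces.

```python
# Illustrative sketch; names and data layout are assumptions, not 4Spell's.
def classify_subword(subword, lists, is_correct):
    """Return (action, text): action is 'skip', 'replace' or 'ask_user'."""
    if subword.startswith("\\"):                  # 1. TeX command
        return ("skip", subword)
    if subword in lists["ignore_words"]:          # 2. Ignore words list
        return ("skip", subword)
    if subword in lists["replace_words"]:         # 3. Replace words list
        return ("replace", lists["replace_words"][subword])
    if subword in lists["user_dictionary"]:       # 4. User Dictionary
        return ("skip", subword)
    if subword in lists["ignore_dictionary"]:     # 5. Ignore Dictionary
        return ("skip", subword)
    if subword in lists["auto_replace"]:          # 6. Auto Replace word list
        return ("replace", lists["auto_replace"][subword])
    if is_correct(subword):                       # the actual spell-check
        return ("skip", subword)
    return ("ask_user", subword)                  # prompt for alternatives


# Toy data for demonstration only.
demo_lists = {
    "ignore_words": {"NTG"},
    "replace_words": {"Tex": "TeX"},
    "user_dictionary": {"4Spell"},
    "ignore_dictionary": {"WinEdt"},
    "auto_replace": {"teh": "the"},
}


def demo_is_correct(word):
    return word.lower() in {"hello", "world"}
```

For example, classify_subword("teh", demo_lists, demo_is_correct) yields a replacement by "the", while an unknown subword such as "Helo" ends in the "ask_user" case, where the user chooses what to do.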

It seems easy, but be aware that when building a parser you will need to do a lot of bookkeeping, and you will need some more advanced programming tricks (e.g., all word actions and subword actions are recursive procedures).


Published in NTG's MAPS 22, 1999.
4Spell is free software, which means that you don't have to pay for using it. The standard GNU public license applies. You can download it from any CTAN (mirror) site (e.g. ftp://ftp.ntg.nl/pub/tex-archive/support/4spell/), or from ftp://4tex.ntg.nl/4spell/.