How spell-checking is done by 4Spell

Most of 4Spell's functionality is the result of writing a (TeX) parser. To make it easier for others to write their own parser, and for those who are simply curious about how it works, we explain the parsing algorithms here.

The parser reads words until the end of the file is reached. A pointer is set to the beginning of the file, and the procedure READWORD is started from there.

STEP 1: get a word

procedure READWORD
  1. Skip EndOfWord characters until the first non-EndOfWord character.
  2. Read and remember characters until the first EndOfWord character.

The result of 1. and 2. is a word.

This READWORD procedure is repeated until the end of the file. Many checks are needed on these words before they can be spell-checked, since a word as defined above can contain (TeX) commands, etc. Note also that (within TeX) the EndOfWord characters are defined as: a space, a hyphen, a tilde, a Carriage-Return, a Line-Feed, and an End-Of-File character.
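
The READWORD loop can be sketched in Python as follows. This is only an illustration, not 4Spell's actual code: the names END_OF_WORD, read_word and read_words are made up for this sketch, and the End-Of-File character is modelled simply as running out of characters.

```python
# EndOfWord characters as listed in the text: space, hyphen, tilde, CR, LF.
# Assumption of this sketch: end of file is modelled as the end of the string.
END_OF_WORD = set(" -~\r\n")


def read_word(text, pos):
    """Return (word, new_pos); word is '' when the end of the text is reached."""
    n = len(text)
    # 1. Skip EndOfWord characters until the first non-EndOfWord character.
    while pos < n and text[pos] in END_OF_WORD:
        pos += 1
    start = pos
    # 2. Read and remember characters until the first EndOfWord character.
    while pos < n and text[pos] not in END_OF_WORD:
        pos += 1
    return text[start:pos], pos


def read_words(text):
    """Repeat read_word until the end of the text and collect the words."""
    words, pos = [], 0
    while True:
        word, pos = read_word(text, pos)
        if not word:
            return words
        words.append(word)
```

Note that a word produced this way can still contain TeX commands, for example \textbf{Hello}, which is why the checks in the following steps are needed.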

STEP 2: check the word for properties

For every word check the following:
  1. check if the Language Switch (i.e., the command that is used to change dictionaries) is part of the word
  2. check if one of the commands in the (TeX) Begin Environments list is part of the word
  3. check if one of the commands in the (TeX) commands list is part of the word
  4. check if one of the commands in the Begin Mathematics Environments list is part of the word
  5. check if the Mathematics Command (e.g., $x+y$ or $$x+y$$) is part of the word
  6. check if the Verbatim command is part of the word
  7. check if part of the word starts a (TeX) comment (i.e., the % sign)

If one of the above is true, you keep on reading words until the corresponding condition below holds:

  1. the filename following the Language Switch command has been read; these characters name the dictionary that should be loaded at that point
  2. the End Environment command is part of the word
  3. the End Command is part of the word
  4. the End Mathematics Environment command is part of the word
  5. the End Mathematics command is part of the word
  6. the End Verbatim character is reached
  7. the end of the line is reached

This seems easy, but the problem is that while looking for, say, the end of an environment, the same environment can start again. In that case we must not stop at the first end-environment command, but at the second (or an even later) one. This example will hopefully explain the problem:

\begin{skipping}
  To explain the spell problem see this example
  \begin{skipping}
    This won't work if you do not count the number
    of begin environments
  \end{skipping}
  You understand the example?
\end{skipping}

What the spell-checker should do is skip the complete example above. It should not stop skipping at the first \end{skipping} command.
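
One way to get this behaviour is to count nesting depth. The sketch below is an assumption about how this could be done, not 4Spell's actual code; the function name skip_environment is hypothetical, and it works on a whole string rather than word by word to keep the example short.

```python
def skip_environment(text, start, name):
    """Return the index just past the \\end{name} that closes the
    \\begin{name} found at position `start`."""
    begin = "\\begin{%s}" % name
    end = "\\end{%s}" % name
    if not text.startswith(begin, start):
        raise ValueError("no %s at position %d" % (begin, start))
    depth = 1                      # we are inside one environment
    pos = start + len(begin)
    while depth > 0:
        next_begin = text.find(begin, pos)
        next_end = text.find(end, pos)
        if next_end == -1:
            # Unterminated environment: skip to the end of the text.
            return len(text)
        if next_begin != -1 and next_begin < next_end:
            depth += 1             # the same environment starts again
            pos = next_begin + len(begin)
        else:
            depth -= 1             # one level is closed
            pos = next_end + len(end)
    return pos
```

Applied to the example above, the counter goes 1, 2, 1, 0, so skipping only stops after the second \end{skipping}.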

When performing the actions above, we were looking for parts of words. This means that after these actions we will have found (part of) a word preceding the action and (part of) a word following the end of the action. With these two words (which may be empty) we proceed as with a word that doesn't trigger any of the actions described above.

STEP 3: divide word into subwords

Check whether the word contains SubWordPunctuationMarks. If so, divide the word into subwords. The SubWordPunctuationMarks are
   ,;:.!@#$&*?"%(){}[]-+=0123456789\`~^*_/|'

An example could clarify the meaning of subwords. Suppose we have the word

   \def\hello{\textbf{Hello}}

This will be divided into four subwords:

\def
\hello
\textbf
Hello
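
The division into subwords can be sketched as below. The names SUBWORD_MARKS and split_subwords are hypothetical. The sketch also assumes, based only on the example above, that a backslash stays attached to the command name that follows it, even though the backslash is itself in the list of marks; the real rule used by 4Spell may differ.

```python
import re

# The SubWordPunctuationMarks from the text, as a Python string.
SUBWORD_MARKS = ",;:.!@#$&*?\"%(){}[]-+=0123456789\\`~^*_/|'"

# A subword is a run of non-mark characters, optionally preceded by one
# backslash (assumption: this keeps "\def" together, as in the example).
_SUBWORD = re.compile(r"\\?[^%s]+" % re.escape(SUBWORD_MARKS))


def split_subwords(word):
    """Return the list of subwords found in `word`."""
    return _SUBWORD.findall(word)
```

With this sketch, split_subwords(r"\def\hello{\textbf{Hello}}") gives the four subwords listed above. Because the apostrophe and the digits are also marks, a word such as don't is divided into don and t.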

STEP 4: check the subwords for properties

These subwords are candidates for spell-checking, but before we spell-check them we check:
  1. If the subword starts with the (TeX) Command Character ("\"), then skip the subword (so the first three subwords of the example are skipped).
  2. If the subword is one of the words in the Ignore words list, then ignore it.
  3. If the subword is one of the words in the Replace words list, then replace it automatically.
  4. If the subword is one of the words in the User Dictionary, then skip the subword.
  5. If the subword is one of the words in the Ignore Dictionary, then skip the subword.
  6. If the subword is one of the words in the Auto Replace word list, then replace the subword with the correct word from the Auto Replace with word list.

If the subword doesn't belong to any of the six categories above, we spell-check the subword (Alex's routines do the job quickly and easily). If the subword is correct, we skip it. If it is not a correct word, we search for alternatives for this (sub)word. 4Spell then prompts the user for what to do: select one of the alternatives, enter new text, ignore this word, or add it to the user dictionary.
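
The order of these checks can be sketched as a single function. Everything here is hypothetical: the function name, the parameter names and the returned action labels are made up for illustration, the word lists are plain Python sets and dicts, and the real spell-check routine is replaced by a simple lookup in a set of known words.

```python
def handle_subword(subword, ignore_words, replace_words, user_dictionary,
                   ignore_dictionary, auto_replace, known_words):
    """Return an (action, text) pair for one subword.

    The replace lists are dicts mapping a word to its replacement; the
    other lists are sets. `known_words` stands in for the spell-checker.
    """
    if subword.startswith("\\"):
        return ("skip", subword)                      # 1. TeX command
    if subword in ignore_words:
        return ("skip", subword)                      # 2. Ignore words list
    if subword in replace_words:
        return ("replace", replace_words[subword])    # 3. Replace words list
    if subword in user_dictionary:
        return ("skip", subword)                      # 4. User Dictionary
    if subword in ignore_dictionary:
        return ("skip", subword)                      # 5. Ignore Dictionary
    if subword in auto_replace:
        return ("replace", auto_replace[subword])     # 6. Auto Replace list
    if subword in known_words:
        return ("skip", subword)                      # correctly spelled
    return ("prompt", subword)                        # ask the user what to do
```

A caller would loop over the subwords of each word and act on the returned pair; only the "prompt" case needs interaction with the user.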

It seems easy, but be aware that when building a parser you will need to do a lot of bookkeeping, and you will need some more advanced programming techniques (e.g., all word actions and subword actions are recursive procedures).