While working on TypograFix, I ended up with a massive state machine that walks through HTML one character at a time. Because some replacements depend on the characters that follow, it has to look ahead, using code similar to the following:

case '-':
  if (CannotReplace) goto default;
  // a dash surrounded by spaces is an en dash
  if (i > 0 && i + 1 < text.Length && text[i - 1] == ' ' && text[i + 1] == ' ')
    hb.Append("&ndash;");
  // a double dash that is not followed by '&' is an em dash
  else if (i + 1 < text.Length && text[i + 1] == '-' &&
           (i + 2 == text.Length || text[i + 2] != '&'))
  {
    hb.Append("&mdash;");
    ++i; // skip the second dash
  }
  // "--&gt;" (an HTML-escaped "-->") is a right arrow
  else if (i + 5 < text.Length &&
           text[i + 1] == '-' &&
           text[i + 2] == '&' &&
           text[i + 3] == 'g' &&
           text[i + 4] == 't' &&
           text[i + 5] == ';')
  {
    hb.Append("&rarr;");
    i += 5; // skip the rest of "--&gt;"
  }
  else goto default;
  break;
  #endregion

This got me thinking: apart from maybe tidying the look-ahead with something like text.Substring(i + 2).StartsWith("&gt;"), is there anything that can fundamentally improve the way the code looks? My attention turned, predictably enough, to using F# lists for string processing. (Disclaimer: using lists is probably not the best way of going through strings, but it might just be the clearest.)

F# actually treats strings as sequences (think IEnumerable<char>) rather than as lists (an F# list is an immutable linked list, not .NET's List<T>). Consequently, if you want to pass a string as a list, you have to convert it first:

Parse(List.ofSeq "hello")
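As a quick illustration of that conversion, here is a minimal sketch of the round trip from a string to a character list and back. It assumes a current F# release, where the conversion function is spelled List.ofSeq; the names chars and back are only illustrative.

```fsharp
// Convert a string into an immutable list of characters.
let chars = List.ofSeq "hello"

// Rebuild a string from the list by going through a char array.
let back = System.String(List.toArray chars)

printfn "%A" chars
printfn "%s" back
```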

However, for a string parsing function, it’s OK to just declare the parameter as a plain string in the function signature and do the conversion inside.

let Proc (text : string) =

Then, we can define some recursive processor that goes through the string one character at a time:

let Proc (text : string) =
  let rec Other html =
    match html with

Now comes the part with matching. Getting something like --> is, however, a bit ugly:

let Proc (text : string) =
  let rec Other html =
    match html with
    | '-' :: '-' :: '>' :: tail -> ("&rarr;" |> List.ofSeq) @ Other tail
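The fragment above only shows the interesting case. A minimal sketch of how it might be completed follows; the fallback case that copies characters through unchanged, the empty-list case, and the conversion back to a string are my assumptions about what the rest of the function would need, not the actual TypograFix code.

```fsharp
let Proc (text : string) =
  let rec Other html =
    match html with
    // "-->" becomes the right-arrow entity
    | '-' :: '-' :: '>' :: tail -> List.ofSeq "&rarr;" @ Other tail
    // assumed fallback: copy any other character through unchanged
    | c :: tail -> c :: Other tail
    // end of input
    | [] -> []
  // convert to a char list, process it, and rebuild the string
  System.String(text |> List.ofSeq |> Other |> List.toArray)

printfn "%s" (Proc "a --> b") // prints: a &rarr; b
```

Note that Other is not tail-recursive, so this sketch could overflow the stack on very long inputs.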

It’s already looking weird, because the three characters of --> are spread out as separate list elements. What I can do is declare a temporary variable to hold the first few elements of the list (via Seq.ofList), build a string from them, and then use String.StartsWith. Which gets me more or less back to where I started.
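For comparison, here is a rough sketch of that temporary-string idea. The helper startsWith and the function name ProcWithPrefix are hypothetical, and the code assumes F# 4.0 or later for List.truncate and List.skip. Because the helper keeps at most prefix.Length characters, a plain equality test does the job of StartsWith here.

```fsharp
// Hypothetical helper: build a string from the first few list elements
// and compare it with the wanted prefix.
let startsWith (prefix : string) (html : char list) =
  let head = System.String(html |> List.truncate prefix.Length |> List.toArray)
  head = prefix

let ProcWithPrefix (text : string) =
  let rec Other html =
    match html with
    // the prefix is readable again, but the tail has to be skipped by hand
    | _ when startsWith "-->" html -> List.ofSeq "&rarr;" @ Other (List.skip 3 html)
    | c :: tail -> c :: Other tail
    | [] -> []
  System.String(text |> List.ofSeq |> Other |> List.toArray)

printfn "%s" (ProcWithPrefix "a --> b") // prints: a &rarr; b
```

The pattern is easier to read, but the explicit length bookkeeping (List.skip 3) is the same kind of index arithmetic the C# version had.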