While working on TypograFix, I ended up with a massive state machine that walks through HTML one character at a time. Because some replacements depend on the characters that follow the current one, it has to do look-ahead, using code similar to the following:
case '-':
    if (CannotReplace) goto default;
    // if surrounded by spaces, it's an en dash
    if (i > 0 && i + 1 < text.Length && text[i - 1] == ' ' && text[i + 1] == ' ')
        hb.Append("–");
    // a double dash that is not the start of "--&gt;" is an em dash
    else if (i + 1 < text.Length && text[i + 1] == '-' &&
             (i + 2 == text.Length || text[i + 2] != '&'))
    {
        hb.Append("—");
        ++i; // ignore the second dash
    }
    // "--&gt;" (an HTML-escaped "-->") is a right arrow
    else if (i + 5 < text.Length &&
             text[i + 1] == '-' &&
             text[i + 2] == '&' &&
             text[i + 3] == 'g' &&
             text[i + 4] == 't' &&
             text[i + 5] == ';')
    {
        hb.Append("→");
        i += 5; // skip the remaining "-&gt;"
    }
    else goto default;
    // same for em dash
    break;
This got me thinking: apart from maybe tidying the code with something like text.Substring(i).StartsWith("--&gt;", StringComparison.Ordinal), is there anything that can fundamentally improve the way the code looks? My attention turned, predictably enough, to using F# lists for string processing. (Disclaimer: using lists is probably not the best way of going through strings, but it might just be the clearest.)
F# actually treats strings as sequences (think IEnumerable<char>) rather than, err, lists (meaning F#'s immutable linked list, not the .NET List<T>). Consequently, if you want to pass a string as a list, you would write
Parse(List.of_seq "hello")
However, for a string-parsing function, it's OK to just declare the parameter as a string in the function signature and do the conversion inside.
let Proc (text : string) =
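To make the string/list distinction concrete, here is a minimal sketch of the round trip. It assumes a current F# release, where List.of_seq is spelled List.ofSeq; the names chars and roundTrip are only for illustration.

```fsharp
// A string is a seq<char>, so it has to be converted before list patterns apply.
let chars : char list = List.ofSeq "hello"

// Going back: build a string from the characters of the list.
let roundTrip : string = System.String(List.toArray chars)
```

After this, chars can be taken apart with the :: pattern, which a plain string cannot.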
Then, we can define some recursive processor that goes through the string one character at a time:
let Proc (text : string) =
    let rec Other html =
        match html with
Now comes the pattern matching itself. Recognizing something like --> is, however, a bit ugly:
let Proc (text : string) =
    let rec Other html =
        match html with
        | '-' :: '-' :: '>' :: tail -> ("→" |> List.of_seq) @ Other tail
It's already looking weird, because the three characters of --> are spread out as separate list elements. What I can do instead is copy the first few elements of the list into a temporary string (via Seq.of_list) and then use string.StartsWith, which gets me more or less back to where I started.
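One way to keep the list-based shape without spelling out every character is to move the prefix check into a small helper. The following is only a sketch under some assumptions: it uses current F# names (List.ofSeq), it assumes the input is HTML so that --> arrives as --&gt;, and skipPrefix is a hypothetical helper that is not part of TypograFix. It handles just the arrow and the double dash, and only loosely mirrors the C# rules above.

```fsharp
/// Hypothetical helper: if `input` begins with `prefix`,
/// return the remainder of the list; otherwise return None.
let rec skipPrefix (prefix : char list) (input : char list) : char list option =
    match prefix, input with
    | [], rest -> Some rest
    | p :: ps, c :: cs when p = c -> skipPrefix ps cs
    | _ -> None

let Proc (text : string) : string =
    // The whole token is written once as a string, not as scattered list elements.
    let arrow = List.ofSeq "--&gt;"
    let rec Other (html : char list) : char list =
        match skipPrefix arrow html with
        | Some tail -> '→' :: Other tail
        | None ->
            match html with
            | '-' :: '-' :: tail -> '—' :: Other tail   // double dash
            | c :: tail -> c :: Other tail              // anything else is copied through
            | [] -> []
    System.String(List.toArray (Other (List.ofSeq text)))
```

With this sketch, Proc "a--&gt;b" yields "a→b" and Proc "x--y" yields "x—y". Other is not tail-recursive, so very long inputs would need an accumulator or a different traversal.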