I’m developing a small console application in C# to convert .gde files (the file format of the wonderful outliner “The Guide”) to .chm (Microsoft Compiled HTML Help).

The first step is to convert a .gde file to XML. This can be done with gdeutil, a tool included with The Guide. However, gdeutil.exe does not create valid XML files: the character ‘&’ in node titles is not escaped to ‘&amp;’.

So I had to incorporate an XML preprocessing step in my tool, in which unescaped characters are replaced by their XML entities. Otherwise, the document cannot be parsed by the .NET XML parser (or most other parsers).
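To illustrate why the parser rejects such a file, here is a minimal sketch of a well-formedness check built on XmlDocument.LoadXml. The class name XmlCheck, the method name IsWellFormed and the sample element used below are made up for illustration; they are not part of my tool or of the gdeutil output format.

```csharp
using System;
using System.Xml;

public static class XmlCheck
{
    /// <summary>
    /// Returns true if the text parses as a well-formed XML document.
    /// (Hypothetical helper, for illustration only.)
    /// </summary>
    public static bool IsWellFormed(string xmlText)
    {
        try
        {
            new XmlDocument().LoadXml(xmlText);
            return true;
        }
        catch (XmlException)
        {
            // Thrown for a bare '&', among other well-formedness errors
            return false;
        }
    }
}
```

With this helper, a made-up fragment such as `<node title="Tom & Jerry" />` should be rejected, while the same fragment with `&amp;` in place of `&` should load without complaint.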

This is the method I created for this purpose:

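// Requires: using System; using System.Text;

/// <summary>
/// Inserts '&amp;' for every '&' character in XML text that does not start an entity.
/// </summary>
/// <param name="xmlText">Complete XML document text, tags included.</param>
/// <returns>The text with bare ampersands escaped.</returns>
public static String PreProcess(String xmlText)
{
    if (String.IsNullOrEmpty(xmlText))
        return xmlText;

    bool ampersand = false;

    StringBuilder output = new StringBuilder();
    StringBuilder buffer = new StringBuilder(); // characters seen after a pending '&'
    for (int i = 0; i < xmlText.Length; i++)
    {
        char c = xmlText[i];
        if (c == '&')
        {
            // A pending '&' followed by another '&' was not an entity
            if (ampersand)
                output.Append("&amp;").Append(buffer.ToString());
            // Maybe this is the start of an entity
            buffer.Clear();
            ampersand = true;
        }
        else if (ampersand && (Char.IsLetterOrDigit(c) || c == '#'))
        {
            // Could still be an entity name or a numeric reference such as &#38;
            buffer.Append(c);
        }
        else if (ampersand && c == ';' && buffer.Length > 0)
        {
            // Turns out to be an entity; don't change the output
            output.Append('&').Append(buffer.ToString()).Append(c);
            buffer.Clear();
            ampersand = false;
        }
        else if (ampersand)
        {
            // Turns out not to be an entity: escape the '&', keep what followed it
            output.Append("&amp;").Append(buffer.ToString()).Append(c);
            buffer.Clear();
            ampersand = false;
        }
        else
        {
            output.Append(c);
        }
    }
    // A pending '&' at the end of the text was not an entity either
    if (ampersand)
        output.Append("&amp;").Append(buffer.ToString());
    return output.ToString();
}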

Note that this is not a way of escaping entities in a text string (that’s what HttpUtility.HtmlEncode is for), but a method of escaping characters in a complete XML document that includes tags. The trick is to ignore already escaped characters; otherwise a simple search for ‘&’ and replace with ‘&amp;’ would suffice.
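The same trick can also be written as a single regular expression with a negative lookahead, which makes the "search and replace, but skip what is already escaped" idea explicit. This is an alternative sketch, not the method I use in the tool; the class name AmpersandEscaper and the method name EscapeBareAmpersands are made up for illustration, and the pattern assumes entities look like &name;, &#38; or &#x26;.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandEscaper
{
    // Matches '&' unless it starts a named (&name;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)");

    /// <summary>
    /// Replaces every bare '&' with '&amp;' and leaves existing entities alone.
    /// (Hypothetical helper, for illustration only.)
    /// </summary>
    public static string EscapeBareAmpersands(string xmlText)
    {
        if (string.IsNullOrEmpty(xmlText))
            return xmlText;
        return BareAmpersand.Replace(xmlText, "&amp;");
    }
}
```

Like the loop-based method, this knows nothing about CDATA sections or comments, so it is a heuristic rather than a general XML repair tool.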

I’m aware that this is not a fail-safe method. Nevertheless, I’m confident that it is robust enough for use with XML files produced by gdeutil.
