static void

Read XHTML into XmlDocument

We want HTML named entities (such as &nbsp; and &euro;) to resolve in our XML, but we don't need to validate against the full XHTML DTDs. So the custom XmlResolver maps every XHTML DTD reference to a single document that contains only the entity declarations.

Testing

    XmlDocument doc = new XmlDocument();
    doc.PreserveWhitespace = true; //keep all line breaks
    doc.XmlResolver = new HtmlResolver(); //will resolve entities
    XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
    ns.AddNamespace("html", "http://www.w3.org/1999/xhtml");
    string html =
@"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN""
""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">
<html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"" lang=""en"">
<head><title>None</title>
</head>
<body><p class=""nbsp"">Hello&nbsp;you</p><p>It costs &euro;5</p></body>
</html>";
    doc.LoadXml(html);

 

    //The DocType is always rewritten as one line (no internal line breaks)
    //with a "[]" appended (the empty internal DTD subset, i.e. inline DTD).
    //So we do a quick fix here: swap the rewritten DocType for the original one.
    string dtd = html.Substring(0, html.IndexOf("<html"));
    string newhtml = doc.OuterXml;
    //cut at the position of "<html" in the rewritten string, because its DocType differs in length
    newhtml = dtd + newhtml.Remove(0, newhtml.IndexOf("<html"));
    Assert.AreEqual(html, newhtml);

 

    //The entity is kept as an XmlEntityReference node between two text nodes
    XmlNode p = doc.SelectSingleNode("//html:p[@class='nbsp']", ns);
    XmlText txt1 = (XmlText)p.ChildNodes[0];
    Assert.AreEqual("Hello", txt1.Value);
    XmlEntityReference ent = (XmlEntityReference)p.ChildNodes[1];
    Assert.AreEqual("nbsp", ent.Name);
    XmlText txt2 = (XmlText)p.ChildNodes[2];
    Assert.AreEqual("you", txt2.Value);

Using LINQ to XML

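Reading into an XDocument is much the same as reading into an XmlDocument. The difference is on saving: LINQ to XML has no entity reference node, so named entities are expanded to their characters on load and are not written back as names. To get them back you have to do lots of ugly string replaces. For more detail, see the fuller XDocument example.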

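//From .NET 4.0, ProhibitDtd is obsolete: use DtdProcessing = DtdProcessing.Parse instead
var settings = new XmlReaderSettings { ProhibitDtd = false, XmlResolver = new HtmlResolver() };
XDocument doc;
using (var reader = XmlReader.Create(path, settings)) //path is the XHTML file to read
    doc = XDocument.Load(reader, LoadOptions.PreserveWhitespace);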
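
The "ugly string replaces" on saving could look something like the following. This is only a minimal sketch: the class name EntityRestorer and its three-entry map are made up for illustration, and a real map would need an entry for every entity you care about. It also replaces blindly across the whole string (including comments and CDATA sections), and the result is only well-formed XML if the output keeps a DOCTYPE that declares the entities.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of the original article): turns selected
// characters in serialized XML back into XHTML named entities.
public static class EntityRestorer
{
    // Deliberately tiny, illustrative map; extend it from the entity files.
    private static readonly Dictionary<char, string> NamedEntities =
        new Dictionary<char, string>
        {
            { '\u00A0', "&nbsp;" },
            { '\u00A3', "&pound;" },
            { '\u20AC', "&euro;" },
        };

    public static string Restore(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        foreach (char c in xml)
        {
            string entity;
            if (NamedEntities.TryGetValue(c, out entity))
                sb.Append(entity); // mapped character: write the named entity
            else
                sb.Append(c);      // everything else is copied unchanged
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // U+00A0 between the words becomes &nbsp; again
        Console.WriteLine(Restore("<p>Hello\u00A0you</p>"));
    }
}
```

With an XDocument, it would be applied to the serialized text, for example EntityRestorer.Restore(doc.ToString(SaveOptions.DisableFormatting)).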

HtmlResolver

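Adjust the namespaces to match your project. This matters most for the embedded resource: its name must exactly match the string passed to GetManifestResourceStream.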

using System;
using System.Xml;

namespace Library.ParseXHtml
{
    public class HtmlResolver : XmlUrlResolver
    {
        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            if (absoluteUri.AbsoluteUri.Equals("urn:XHTMLEntities", StringComparison.OrdinalIgnoreCase))
            {
                //ensure the embedded resource is suitably namespaced
                return System.Reflection.Assembly.GetExecutingAssembly().
                    GetManifestResourceStream("Library.ParseXHtml.xhtml-entities.ent");
            }
            return null; //we don't return any other external resources
        }

        public override Uri ResolveUri(Uri baseUri, string relativeUri)
        {
            //make all the XHTML public identifiers resolve to the single "dtd" which is actually just the entities
            if (relativeUri.Equals("-//W3C//DTD XHTML 1.0 Transitional//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.0 Strict//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.0 Frameset//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.1//EN", StringComparison.OrdinalIgnoreCase))
            {
                return new Uri("urn:XHTMLEntities");
            }
            return base.ResolveUri(baseUri, relativeUri);
        }
    }
}
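
GetManifestResourceStream returns null when no resource has the requested name, so a mismatched namespace only shows up later as a failure to resolve the DTD. One way to check the name is to list what the assembly actually contains. The sketch below assumes the resource name used above; the ResourceNameCheck class and its HasResource method are made up for illustration.

```csharp
using System;
using System.Reflection;

// Hypothetical diagnostic helper (not part of the original article).
public static class ResourceNameCheck
{
    // True if the assembly has an embedded resource with exactly this name.
    public static bool HasResource(Assembly assembly, string resourceName)
    {
        return Array.IndexOf(assembly.GetManifestResourceNames(), resourceName) >= 0;
    }

    public static void Main()
    {
        Assembly assembly = Assembly.GetExecutingAssembly();

        // Print every embedded resource name, to compare with the resolver's string
        foreach (string name in assembly.GetManifestResourceNames())
            Console.WriteLine(name);

        // The name assumed by HtmlResolver above; adjust to your own project
        Console.WriteLine(HasResource(assembly, "Library.ParseXHtml.xhtml-entities.ent"));
    }
}
```

Resource names are case-sensitive, so the comparison here is an exact match.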

XHTML Entity DTD (extract)

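The full file is too long to post here, so these are just the first few lines. Append the entity declarations from the three URLs listed in the header comment. The file is loaded as an embedded resource, so check that its resource name (namespace plus file name) matches the one used by the HtmlResolver above.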

<!-- A local copy of (X)HTML entities, containing the three subsets:
Latin-1 characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Special characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
Mathematical, Greek and Symbolic characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
-->
<!-- Portions (C) International Organization for Standardization 1986:
    Permission to copy in any form is granted for use with
    conforming SGML systems and applications as defined in
    ISO 8879, provided this notice is included in all copies.
-->

<!ENTITY nbsp   "&#160;"> <!-- no-break space = non-breaking space,
                                  U+00A0 ISOnum -->
<!ENTITY iexcl  "&#161;"> <!-- inverted exclamation mark, U+00A1 ISOnum -->
<!ENTITY cent   "&#162;"> <!-- cent sign, U+00A2 ISOnum -->
<!ENTITY pound  "&#163;"> <!-- pound sign, U+00A3 ISOnum -->
<!ENTITY curren "&#164;"> <!-- currency sign, U+00A4 ISOnum -->
<!ENTITY yen    "&#165;"> <!-- yen sign = yuan sign, U+00A5 ISOnum -->
... (etc)
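
To try the approach before assembling the full entity file, the declarations can be served from a string instead of an embedded resource. The following is a rough sketch under that assumption, not a replacement for the resolver above: the InMemoryHtmlResolver class is made up for illustration, it declares only two entities, and it matches any public identifier starting with "-//W3C//DTD XHTML 1." rather than listing each one.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical variant of HtmlResolver that needs no embedded resource.
public class InMemoryHtmlResolver : XmlUrlResolver
{
    // Illustrative subset only; the real file declares all XHTML entities.
    private const string Entities =
        "<!ENTITY nbsp \"&#160;\">\n<!ENTITY euro \"&#8364;\">";

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (absoluteUri.AbsoluteUri.Equals("urn:XHTMLEntities", StringComparison.OrdinalIgnoreCase))
            return new MemoryStream(Encoding.UTF8.GetBytes(Entities));
        return null; // no other external resources are returned
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // Map every XHTML 1.x public identifier to the entity-only "dtd"
        if (relativeUri != null &&
            relativeUri.StartsWith("-//W3C//DTD XHTML 1.", StringComparison.OrdinalIgnoreCase))
            return new Uri("urn:XHTMLEntities");
        return base.ResolveUri(baseUri, relativeUri);
    }
}

public static class InMemoryResolverDemo
{
    public static void Main()
    {
        var doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.XmlResolver = new InMemoryHtmlResolver();
        doc.LoadXml(
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" " +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<p>Hello&nbsp;you, it costs &euro;5</p></body></html>");

        // OuterXml should still show the named entity references
        Console.WriteLine(doc.DocumentElement.OuterXml);
    }
}
```

An entity that is missing from the string fails at load time with an undeclared-entity error, which makes it easy to see which declarations still need to be added.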