Read XHTML into XmlDocument
We want HTML entities in our XML, but we don't need to validate against the full XHTML DTDs. So the XmlResolver serves a single document containing all the entity declarations in place of every XHTML DTD.
- XHTML documents must be well formed (but not necessarily valid XHTML). If a document is not well formed, use Html Agility Pack instead.
- If you edit and save, XmlDocument.PreserveWhitespace ensures the original formatting is preserved. However, XmlDocument rewrites the DocType (see the test below for a quick fix).
- Reading into an XDocument is much the same. See a fuller XDocument example.
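The well-formedness rule in the list above could be checked up front with something like the following. This is only a sketch: the `XhtmlLoader` class and its `TryLoad` method are hypothetical names, not part of the original code, and the fallback to Html Agility Pack is left to the caller.

```csharp
using System.Xml;

public static class XhtmlLoader
{
    // Hypothetical helper: returns false instead of throwing when the markup
    // is not well-formed XML, so the caller can fall back to an HTML parser.
    public static bool TryLoad(string markup, XmlResolver resolver, out XmlDocument doc)
    {
        doc = new XmlDocument();
        doc.PreserveWhitespace = true; // keep all line breaks
        doc.XmlResolver = resolver;    // e.g. new HtmlResolver() to resolve entities
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            doc = null; // not well formed (or an entity could not be resolved)
            return false;
        }
    }
}
```

A caller would try `XhtmlLoader.TryLoad(markup, new HtmlResolver(), out doc)` first and only hand the markup to a more forgiving HTML parser when it returns false.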
Testing
XmlDocument doc = new XmlDocument();
doc.PreserveWhitespace = true; //keep all line breaks
doc.XmlResolver = new HtmlResolver(); //will resolve entities
XmlNamespaceManager ns = new XmlNamespaceManager(doc.NameTable);
ns.AddNamespace("html", "http://www.w3.org/1999/xhtml");
string html =
@"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Transitional//EN""
""http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"">
<html xmlns=""http://www.w3.org/1999/xhtml"" xml:lang=""en"" lang=""en"">
<head><title>None</title>
</head>
<body><p class=""nbsp"">Hello&nbsp;you</p><p>It costs &euro;5</p></body>
</html>";
doc.LoadXml(html);
//The DocType is always rewritten as one line (no internal line breaks)
//and appends a "[]" (the internal DTD subset- i.e. inline dtd).
//So we do a quick fix here.
string dtd = html.Substring(0, html.IndexOf("<html"));
string newhtml = doc.OuterXml;
//find the root element in the serialized output, as the rewritten DocType may differ in length
newhtml = dtd + newhtml.Remove(0, newhtml.IndexOf("<html"));
Assert.AreEqual(html, newhtml);
XmlNode p = doc.SelectSingleNode("//html:p[@class='nbsp']", ns);
XmlText txt1 = (XmlText)p.ChildNodes[0];
Assert.AreEqual("Hello", txt1.Value);
XmlEntityReference ent = (XmlEntityReference)p.ChildNodes[1];
Assert.AreEqual("nbsp", ent.Name);
XmlText txt2 = (XmlText)p.ChildNodes[2];
Assert.AreEqual("you", txt2.Value);
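The DocType quick fix from the test can be pulled into a small reusable function. The sketch below is an assumption about how that might look, not the original author's code: `DocTypeFixer` and `RestoreDocType` are made-up names, and it assumes the first occurrence of `<html` marks the root element in both strings.

```csharp
using System;

public static class DocTypeFixer
{
    // Hypothetical helper: swaps the prolog that XmlDocument serialized
    // (one-line DocType with a trailing "[]") for the original prolog.
    public static string RestoreDocType(string originalHtml, string serializedHtml)
    {
        int originalRoot = originalHtml.IndexOf("<html", StringComparison.Ordinal);
        int serializedRoot = serializedHtml.IndexOf("<html", StringComparison.Ordinal);
        if (originalRoot < 0 || serializedRoot < 0)
        {
            return serializedHtml; // no root element found, leave the output untouched
        }
        // Everything before <html in the original: DocType plus surrounding whitespace
        string prolog = originalHtml.Substring(0, originalRoot);
        return prolog + serializedHtml.Substring(serializedRoot);
    }
}
```

With the document from the test, the call would be `DocTypeFixer.RestoreDocType(html, doc.OuterXml)`. Each string is searched separately, so the two DocTypes can have different lengths.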
Using Linq to Xml
Reading into an XDocument is much the same as reading into an XmlDocument. Saving does not preserve the named entities, because they are expanded when the document is loaded, so you have to restore them with lots of ugly string replaces. See also a fuller XDocument example.
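//ProhibitDtd = false is obsolete since .NET 4.0; DtdProcessing.Parse is its replacement
var settings = new XmlReaderSettings { DtdProcessing = DtdProcessing.Parse, XmlResolver = new HtmlResolver() };
XDocument doc;
using (var reader = XmlReader.Create(path, settings)) //dispose the reader once loaded
    doc = XDocument.Load(reader, LoadOptions.PreserveWhitespace);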
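The "ugly string replaces" needed when saving an XDocument might be gathered into one pass over the serialized text. This is a minimal sketch under stated assumptions: `EntityEncoder` and its four-entry map are illustrative only, and a real map would cover every entity in the .ent file. It also cannot tell a character that was originally typed literally from one that came from an entity, and it will rewrite matching characters inside comments and CDATA sections too.

```csharp
using System.Collections.Generic;
using System.Text;

public static class EntityEncoder
{
    // Illustrative subset only; extend it from the entity declarations you use.
    private static readonly Dictionary<char, string> Names = new Dictionary<char, string>
    {
        { '\u00A0', "nbsp" },
        { '\u00A3', "pound" },
        { '\u00A5', "yen" },
        { '\u20AC', "euro" }
    };

    // Replaces each mapped character with its named entity reference.
    public static string ToNamedEntities(string xml)
    {
        var sb = new StringBuilder(xml.Length);
        foreach (char c in xml)
        {
            string name;
            if (Names.TryGetValue(c, out name))
            {
                sb.Append('&').Append(name).Append(';');
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

It would be applied to the serialized string, for example `EntityEncoder.ToNamedEntities(doc.ToString(SaveOptions.DisableFormatting))`, before writing the result to disk.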
HtmlResolver
Adjust the namespaces to suit your project, especially the name of the embedded resource.
using System;
using System.Xml;
namespace Library.ParseXHtml
{
    public class HtmlResolver : XmlUrlResolver
    {
        public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
        {
            if (absoluteUri.AbsoluteUri.Equals("urn:XHTMLEntities", StringComparison.OrdinalIgnoreCase))
            {
                //ensure the embedded resource is suitably namespaced
                return System.Reflection.Assembly.GetExecutingAssembly().
                    GetManifestResourceStream("Library.ParseXHtml.xhtml-entities.ent");
            }
            return null; //we don't return any other external resources
        }

        public override Uri ResolveUri(Uri baseUri, string relativeUri)
        {
            //make all the XHTML urls resolve to the single "dtd" which is actually just the entities
            if (relativeUri.Equals("-//W3C//DTD XHTML 1.0 Transitional//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.0 Strict//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.0 Frameset//EN", StringComparison.OrdinalIgnoreCase)
                || relativeUri.Equals("-//W3C//DTD XHTML 1.1//EN", StringComparison.OrdinalIgnoreCase))
            {
                return new Uri("urn:XHTMLEntities");
            }
            return base.ResolveUri(baseUri, relativeUri);
        }
    }
}
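GetManifestResourceStream returns null when the resource name does not match exactly, so it helps to check the name the compiler actually generated. The sketch below is one way to do that; `ResourceCheck` is a hypothetical helper, not part of the resolver.

```csharp
using System;
using System.Reflection;

public static class ResourceCheck
{
    // True when the assembly embeds a resource with exactly this (case-sensitive) name.
    public static bool HasResource(Assembly assembly, string resourceName)
    {
        return Array.IndexOf(assembly.GetManifestResourceNames(), resourceName) >= 0;
    }

    // Writes every embedded resource name to the console, to spot the right one.
    public static void PrintResources(Assembly assembly)
    {
        foreach (string name in assembly.GetManifestResourceNames())
        {
            Console.WriteLine(name);
        }
    }
}
```

Calling `ResourceCheck.PrintResources(typeof(HtmlResolver).Assembly)` lists the names to compare against the string passed to GetManifestResourceStream. The file itself must have its build action set to Embedded Resource.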
XHTML Entity DTD (extract)
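The full file is too long to post here, so these are the first few lines. Append the entities from the three URLs listed in the opening comment. The HtmlResolver above loads this file as an embedded resource, so check that the resource name matches your namespace.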
<!-- A local copy of (X)HTML entities, containing the three subsets:
Latin-1 characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
Special characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
Mathematical, Greek and Symbolic characters: http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
-->
<!-- Portions (C) International Organization for Standardization 1986:
Permission to copy in any form is granted for use with
conforming SGML systems and applications as defined in
ISO 8879, provided this notice is included in all copies.
-->
<!ENTITY nbsp "&#160;"> <!-- no-break space = non-breaking space,
U+00A0 ISOnum -->
<!ENTITY iexcl "&#161;"> <!-- inverted exclamation mark, U+00A1 ISOnum -->
<!ENTITY cent "&#162;"> <!-- cent sign, U+00A2 ISOnum -->
<!ENTITY pound "&#163;"> <!-- pound sign, U+00A3 ISOnum -->
<!ENTITY curren "&#164;"> <!-- currency sign, U+00A4 ISOnum -->
<!ENTITY yen "&#165;"> <!-- yen sign = yuan sign, U+00A5 ISOnum -->
... (etc)