A few years ago, while working on a web-scraping tool in .NET, I found an amazing library, SgmlReader, which made it easy to convert HTML documents to XHTML. With it I was able to run XPath queries to extract whatever information I wanted from any damn website, even one written in the worst possible malformed HTML. Had it not been for SgmlReader, I would have had to write tedious parsing code to extract the tokens from the HTML string.

With this simple code you can clean up the mess that many webmasters leave behind!

Here is the function in VB.NET. You need to download SgmlReader and reference it in your project first.

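    Imports System.IO
    Imports System.Xml

    ' Converts a (possibly malformed) HTML string to well-formed XML.
    Public Function Html2Xml(ByVal txtHTMLString As String) As String
        Dim XHTML As New Sgml.SgmlReader()
        Dim sw As New StringWriter()
        Dim w As New XmlTextWriter(sw)
        XHTML.DocType = "HTML"
        XHTML.InputStream = New StringReader(txtHTMLString)
        ' Copy every node from the reader to the writer until the input is exhausted.
        While Not XHTML.EOF
            w.WriteNode(XHTML, True)
        End While
        ' Make sure everything buffered by the writer reaches the StringWriter.
        w.Flush()
        Return sw.ToString()
    End Function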


Recently I ran into a similar need in PHP and went looking for an SgmlReader equivalent. My search led me to the php_tidy extension. Once you enable this extension, you get the same HTML clean-up functionality.

	$opts = array("clean" => true, "output-xml" => true);
	$xhtml = tidy_parse_file("http://www.example.com", $opts);
	tidy_clean_repair($xhtml); // apply the clean and repair pass before output
	echo $xhtml;
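
Once the markup is well-formed, the XPath step mentioned at the top works in PHP too. The following is only a minimal sketch, assuming the tidy and dom extensions are both enabled. The sample HTML string is made up for illustration, and `tidy_repair_string` is used instead of `tidy_parse_file` so the example does not depend on a remote site.

```php
<?php
// Hypothetical malformed input: unclosed <p> and <br>, unquoted attribute,
// and no <html> or <body> wrapper.
$html = '<p>First paragraph<p>Second with <a href="/one">a link</a><br>'
      . '<a href=/two>another</a>';

// Same options as above, applied to a string rather than a URL.
$opts = array("clean" => true, "output-xml" => true);
$xml  = tidy_repair_string($html, $opts, "utf8");

// Load the repaired markup as XML so it can be queried with XPath.
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// local-name() keeps the query working whether or not the output
// carries an XML namespace on its elements.
$links = array();
foreach ($xpath->query("//*[local-name()='a']/@href") as $attr) {
    $links[] = $attr->nodeValue;
}

print_r($links);
```

If `loadXML` fails on real-world pages, checking the tidy error buffer or switching to other tidy output options may help; which options suit a given site is something to experiment with.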


For more information about php_tidy, go to http://us.php.net/tidy

