What did you learn today? - HTML Agility Pack
Phil Denoncourt's Technology Rants
 
 Wednesday, March 24, 2004
HTML Agility Pack by Phil Denoncourt

As I was preparing for my talk tonight on web scraping, I came across a class library that has proved invaluable.  The HTML Agility Pack is awesome.  It lets you download the HTML from a website and navigate it like an XML document, either by walking the nodes or by running XPath queries.  You could do this before by hosting an IE browser control in your scraping app and walking the document through the DOM.  However, the IE browser control has problems with badly formed HTML, and unfortunately most of the HTML on the web is not well formed.  The HTML Agility Pack handles badly formed HTML just as easily as well-formed HTML.  This cut the time I needed to write a scraper from a couple of days to a couple of hours.
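
To make the idea concrete, here is a minimal sketch of the same approach: parse sloppy HTML leniently into a tree, then query the tree with XPath-style paths. This is not the HTML Agility Pack API (that is a .NET library); it is an illustration using only Python's standard library, and the `LenientTreeBuilder` class, the `parse_html` and `home_runs` helpers, and the sample table with its made-up player names are all assumptions for the example. The implicit-close rule is deliberately simple and would mishandle nested tables or nested lists.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never take a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}
# Tags that authors often leave unclosed; opening a new one closes the old one.
# (Simplification: this would wrongly close an outer <li> or <td> when nesting.)
IMPLICIT_CLOSE = {"p", "li", "tr", "td", "th"}


class LenientTreeBuilder(HTMLParser):
    """Builds an ElementTree from HTML, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("document")
        self.stack = [self.root]  # open elements, innermost last

    def _close(self, tag):
        # Pop open elements up to and including the nearest one named `tag`.
        # Index 0 is the synthetic root and is never popped.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_starttag(self, tag, attrs):
        if tag in IMPLICIT_CLOSE:
            self._close(tag)
        elem = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # A closing tag with no matching open element is silently ignored.
        self._close(tag)

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):  # text after a child element goes in that child's tail
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def parse_html(html):
    """Parse possibly malformed HTML and return the synthetic root element."""
    builder = LenientTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def home_runs(root):
    """Return (player, home runs) pairs from rows with 'player' and 'hr' cells."""
    rows = []
    for row in root.findall(".//tr"):
        player = row.findtext("td[@class='player']")
        hr = row.findtext("td[@class='hr']")
        if player is not None and hr is not None:
            rows.append((player.strip(), int(hr)))
    return rows


# Hypothetical sample: unclosed <td>/<tr> tags and a stray </b>.
BAD_HTML = """
<table id="stats">
  <tr><td class="player">Player A<td class="hr">31
  <tr><td class="player">Player B<td class="hr">27</b>
</table>
"""

if __name__ == "__main__":
    print(home_runs(parse_html(BAD_HTML)))
```

An XML parser would reject this sample outright. The lenient builder recovers two rows, so the `.//tr` query still works. That tolerance is the property that makes a library like the HTML Agility Pack so handy for scraping.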

People in my fantasy baseball league, beware.  I've downloaded a significant amount of baseball data and I know how to use it!

Wednesday, March 24, 2004 5:44:00 PM (GMT Standard Time, UTC+00:00)
Copyright © 2008 Phil Denoncourt III. All rights reserved.