In Python, a crawler is typically written with urllib2 to fetch the pages and Beautiful Soup to parse the HTML and pull out the content.
Here's an example of reading a page:
http://docs.python.org/library/urllib2.html#examples
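As a rough sketch of the fetching step: the snippet below uses urllib.request, which is the Python 3 successor to urllib2. The function name `fetch` and the 10-second default timeout are illustrative choices, not something taken from the linked examples.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw response body of `url` as bytes.

    `fetch` and the default timeout are hypothetical choices made for
    this sketch; adjust them for a real crawler.
    """
    # urlopen returns a response object that can be used as a context
    # manager, so the connection is closed even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Returning bytes rather than text leaves the decoding to the parser, which is where the encoding detection happens.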
Here's an example of parsing the page:
http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML
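A minimal sketch of the parsing step, assuming the current Beautiful Soup 4 package (`bs4`) rather than the older release the link above documents. The helper name `extract_links` is made up for illustration, and "html.parser" is just the parser bundled with Python; others can be swapped in.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return the href of every <a> tag in `html`, in document order.

    `extract_links` is a hypothetical helper for this sketch.
    """
    # "html.parser" is the parser from the standard library; Beautiful
    # Soup builds a tree even when tags are left unclosed.
    soup = BeautifulSoup(html, "html.parser")
    # href=True skips anchors that have no href attribute at all.
    return [a["href"] for a in soup.find_all("a", href=True)]


# Deliberately broken markup: nothing after the first </a> is closed.
broken = "<html><body><p>Unclosed paragraph<a href='/a'>A</a><a href='/b'>B"
links = extract_links(broken)
```

The bytes returned by the fetch step can be passed straight to `BeautifulSoup`, which will then try to work out the encoding on its own.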
While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superior support for encoding detection.
As stated earlier, BeautifulSoup is more tolerant than libxml2 when parsing HTML, but libxml2 is faster. I still prefer BeautifulSoup, since the web is full of faulty HTML pages.