BeautifulSoup parser or lxml2 parser?



Which should I use, the BeautifulSoup parser or the lxml2 parser? I am writing a crawler in Python.



In Python, a crawler is typically written with urllib2 to fetch the pages and Beautiful Soup to parse the HTML for the content you want.

Here's an example of reading a page:

http://docs.python.org/library/urllib2.html#examples
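For a rough idea of what the fetching step looks like, here is a minimal sketch. It assumes Python 3, where urllib2's functionality lives in urllib.request. The fetch helper and the User-Agent string are made up for illustration and are not taken from the linked docs.

```python
# Python 3 sketch: urllib.request is the successor to Python 2's urllib2.
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Return the raw response body for url as bytes."""
    # Hypothetical User-Agent value; some sites reject the library default.
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    body = fetch("http://docs.python.org/")
    print(len(body), "bytes fetched")
```

Returning bytes rather than decoded text is deliberate here: the parser can then do its own encoding detection.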

Here's an example of parsing the page:

http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing HTML
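As a sketch of the parsing step, something like the following pulls link targets out of a page. It assumes the current bs4 package (pip install beautifulsoup4) rather than the older release the link above documents. The extract_links helper is a made-up name for illustration.

```python
# Sketch: collect link targets from (possibly broken) HTML with Beautiful Soup.
try:
    from bs4 import BeautifulSoup
except ImportError:  # bs4 is third-party: pip install beautifulsoup4
    BeautifulSoup = None


def extract_links(html):
    """Return the href value of every <a> tag that has one."""
    if BeautifulSoup is None:
        raise RuntimeError("beautifulsoup4 is not installed")
    # "html.parser" uses the standard library's parser; "lxml" can be
    # passed instead if lxml is installed and speed matters more.
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]


if __name__ == "__main__":
    # The second <a> and the <p> are never closed, as on many real pages.
    broken = "<p>See <a href='/a'>one</a> and <a href=\"/b\">two"
    print(extract_links(broken))
```

The second argument to BeautifulSoup picks the underlying parser, so the same calling code can be tried against more than one parser before committing to either.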

While libxml2 (and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more forgiving and has superior support for encoding detection.

As stated earlier, BeautifulSoup is more tolerant of malformed HTML than libxml2, but libxml2 is faster. I still prefer BeautifulSoup, since the web is full of faulty HTML pages.