The system I was building allowed users to enter HTML, but many of those users were not familiar with HTML.
To keep broken markup out of the database, where it could break the site layout when displayed, I needed a check that would flag broken markup and be informative enough to point to where the error is.
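Luckily, the HTML Tidy library for Python does just that! You can install it with pip or easy_install under the package name "pytidylib". It is a wrapper around the HTML Tidy C library, so depending on your operating system you may need to install that separately.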
Once you're set up, doing the actual check is easy.
from tidylib import tidy_fragment
import re

# data holds the user-submitted HTML fragment as a string
# Check for missing close tags (bold, italics, links, etc.)
document, errors = tidy_fragment(data)
reobj = re.compile(r"(line \d+ column \d+ - Warning: missing </\w+>)")
missing_tags = []
for match in reobj.finditer(errors):
    missing_tags.append(match.group())
print(missing_tags)
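If you want to show users something friendlier than the raw warning text, one option is to capture the line, column and tag name separately. The following is only a rough sketch: the helper names `parse_missing_tags` and `describe_missing_tags` are made up for illustration, and `sample_errors` is a hand-written stand-in for the error report that `tidy_fragment` returns, so the snippet runs without Tidy installed.

```python
import re

# Same warning format as the check above, but with capture groups
# for the line number, column number and tag name.
MISSING_TAG_RE = re.compile(
    r"line (\d+) column (\d+) - Warning: missing </(\w+)>"
)


def parse_missing_tags(errors):
    """Return (line, column, tag) tuples found in a Tidy-style error report."""
    return [
        (int(line), int(column), tag)
        for line, column, tag in MISSING_TAG_RE.findall(errors)
    ]


def describe_missing_tags(errors):
    """Turn each missing-close-tag warning into a short message for the user."""
    return [
        "Line {0}, column {1}: <{2}> is never closed".format(line, column, tag)
        for line, column, tag in parse_missing_tags(errors)
    ]


# Hand-written sample report (an assumption, not real Tidy output).
sample_errors = (
    "line 1 column 1 - Warning: missing </b>\n"
    "line 2 column 5 - Warning: missing </a>\n"
)

for message in describe_missing_tags(sample_errors):
    print(message)
```

In the real check you would pass the `errors` string from `tidy_fragment` instead of the sample text. An empty list means no missing close tags were reported.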
Done and dusted!
Time for a long-awaited feel-good "cat pushing another cat off a shelf" GIF!