The system I was building accepted HTML markup as input, but many of its users were not familiar with HTML.
To keep broken HTML out of the database, where it could break the site layout when displayed, I needed a way to flag broken markup that was also informative enough to point to where the error is.
Luckily, the HTML Tidy library for Python does just that! Depending on your operating system, you can install it via pip or easy_install under the name "pytidylib".
Once you're set up, doing the actual check is easy.
    from tidylib import tidy_fragment
    import re

    # `data` holds the user-submitted HTML fragment to check.
    # Check for missing close tags (bold, italics, links, etc)
    document, errors = tidy_fragment(data)
    reobj = re.compile(r"(line \d+ column \d+ - Warning: missing </\w+>)")

    # Collect every "missing close tag" warning from Tidy's error report
    missing_tags = []

    for match in reobj.finditer(errors):
        missing_tags.append(match.group())

    print(missing_tags)
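If you want to show users something friendlier than the raw warning text, one option is to pull the line, column and tag name out of each warning. The following is only a sketch: it uses the same message shape as the regex above with named groups added, the helper name `parse_missing_tags` is made up for illustration, and `sample_errors` is a hand-written string in the shape the pattern expects, not captured Tidy output.

```python
import re

# Same message shape as the pattern above, with named groups added.
MISSING_TAG = re.compile(
    r"line (?P<line>\d+) column (?P<column>\d+) - Warning: missing </(?P<tag>\w+)>"
)


def parse_missing_tags(errors):
    """Return (line, column, tag) tuples for each missing-close-tag warning."""
    return [
        (int(m.group("line")), int(m.group("column")), m.group("tag"))
        for m in MISSING_TAG.finditer(errors)
    ]


# Hypothetical sample input, written by hand to match the pattern.
sample_errors = (
    "line 1 column 1 - Warning: missing </b>\n"
    "line 2 column 5 - Warning: missing </a>\n"
)

for line, column, tag in parse_missing_tags(sample_errors):
    print("Unclosed <%s> tag at line %d, column %d" % (tag, line, column))
```

In real use you would pass the `errors` string returned by `tidy_fragment` instead of the sample, and an empty result would mean no missing close tags were reported.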
Done and dusted!
Time for a long-awaited feel-good "cat pushing another cat off a shelf" GIF!