Hi Vincent,
Can we use an Enum type instead?
Yep, sounds good.
I'm not sure why you changed CleaningFilter to HTMLFilter. They are
HTML filters for sure but not any type of HTML filters
they are HTML
Cleaning Filters. It's not very important but I had chosen
CleaningFilter name voluntarily so I wanted to know what's your
rationale for changing the name.
Now I think about it, CleaningFilter sounds better. I was used to HTMLFilter
in officeimporter and wanted keep the same name so that I can change
officeimporter to use core-xml classes instead of it's own types. I will
change officeimporter also to use CleaningFilters from xwiki-xml.
Why have you removed the doc type cleaning filter and instead
"hardcoded" it in the default cleaner
implementation?
This was an unforseen tradgedy that happened to me. I decided to switch to
w3c dom from jdom for several reasons:
1. HTMLCleaner returns a w3c DOM and we convert TagNode (htmlcleaner) into
JDOM only to apply the cleaning filters and then we convert it back to w3c
DOM. Now, If we can base our CleaningFilters on w3c DOM api, there is no
need to convert to jdom and we can directly work with w3c DOM eliminating
the overhead of converting between types.
2. DOM Api is easy to use compared to JDOM api (from what i've been through)
3. OpenOfficeHTMLCleaner already uses w3c DOM filters (it has to, because
HTMLCleaner returns a w3c DOM)
However because of this problem :
http://sourceforge.net/forum/forum.php?thread_id=3021756&forum_id=637245 +
the fact that w3c DOM does not allow DocType to be set after the Document
has been created (jdom allows this), made me hard code the DocType into the
DefaultHTMLCleaner and we still have to do the conversion to jdom just to
set the DocType.
Anyway, I believe
http://sourceforge.net/forum/forum.php?thread_id=3021756&forum_id=637245 is
a bug in htmlcleaner (waiting for a reply) and when it get's fixed (if it
get's fixed) we can remove the dependency on jdom.
Note: Haven't checked the criterion stuff yet, so I'm not sure what
this is about.
Question: have you been careful about performance? HTML cleaning is
the most costly thing in our rendering system so we have to be very
careful. Have you measured the rendering test speed before and after
as an indication?
After you mentioned it, I did some testing and following are the results:
Brefore:
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.248 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.15 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.259 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.511 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.788 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.61 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.614 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.586 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.625 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.235 sec
Average: 5.4626
After:
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.39 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.674 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.264 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.465 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.423 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.367 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.232 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.34 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.408 sec
Tests run: 762, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.437 sec
Average: 5.40
The results seem to be pretty consistent with the earlier ones. I Don't
think the slight reduction in average test execution times means much(?)
But I believe that once we manage to get rid of the conversion to jdom, the
performance would increase. I'm going to try this out with a custom method
to directly convert into w3c DOM and see hwo much we can gain from that.
Thanks.
- Asiri