Is there anything like the XWiki feed plugin, except that instead of fetching
a feed, it would fetch an HTML document via HTTP, returning a DOM structure
that can be scanned or filtered by API calls, e.g.:
$fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
$images = $fetchedDom.getImgList()
$media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
$content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
Since this would happen on the server, you'd probably need to "fake" being a
real browser (or just capture the user's browser configuration and pass it
via the call to the hypothetical "getDocumentDOM()") in order to get an
accurate scraped representation of a modern site.
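To make the idea concrete, here's a rough sketch of what the fetch side might
look like if it were built on top of jsoup -- purely an assumption on my
part; the class name, User-Agent string, and timeout are placeholders, not an
existing XWiki API:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchPluginSketch {
    // Hypothetical backing method for the getDocumentDOM() call above.
    // Sends a browser-like User-Agent so the remote server returns the same
    // markup it would serve to a real browser.
    public static Document getDocumentDOM(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0")
                .timeout(10000) // milliseconds; fail fast on slow hosts
                .get();         // fetch over HTTP and parse into an in-memory DOM
    }
}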
The existing examples I've seen store an XWiki document in the database
first. I was hoping there was an "in memory" option that would keep the
document in the app's context just long enough to process the remaining
stream of plugin calls, such as getDivListById() or
getAnchorHREFsByExtension(), and then dispose of the DOM via garbage
collection once it's no longer referenced. Then again, compared to the
implementation headaches -- retrieving a potentially large document into
memory incrementally, parsing it into a DOM incrementally, making it
available in the context, and so on -- maybe I should just write the damn
document into the database, scrape it, and delete it.
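For what it's worth, the "in memory" usage I have in mind would look
something like this (again assuming a jsoup-backed DOM; the CSS selectors
just mirror the hypothetical calls above):

import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class InMemoryScrape {
    public static void main(String[] args) throws Exception {
        Document dom = FetchPluginSketch.getDocumentDOM("http://nielsmayer.com");

        // getImgList()
        Elements images = dom.select("img");

        // getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
        Elements media = dom.select("a[href$=.mp3], a[href$=.m4v], a[href$=.mp4]");

        // getDivListById(['xwikicontent', 'container', 'content'])
        Elements content = dom.select("div#xwikicontent, div#container, div#content");

        System.out.println(images.size() + " images, " + media.size()
                + " media links, " + content.size() + " content divs");
        // Once 'dom' goes out of scope at the end of the request, the whole
        // tree is unreferenced and eligible for ordinary garbage collection --
        // no explicit disposal step needed.
    }
}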
Since I would use XWiki to store a JSON "scrape" of the document in the DB
(as an XWiki doc), I could store it in XWiki.JavaScriptExtension[0] of the
retrieved document, and then just delete the wiki contents after
scraping.... So if anybody has suggestions for "scraping" a retrieved
document stored as an XWiki doc, please suggest those as well! This seems
like an area fraught with peril that many people have already dealt with, so
I would appreciate advice.
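For the JSON "scrape" itself, I'm picturing something like the following,
with the resulting string written into XWiki.JavaScriptExtension[0] before
the page content is deleted (Gson is used here purely as an example
serializer, and the field names are made up):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.google.gson.Gson;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScrapeToJson {
    // Collapse the scraped DOM into a JSON payload small enough to store
    // in a field of an XWiki.JavaScriptExtension object.
    public static String toJson(Document dom) {
        Map<String, List<String>> scrape = new LinkedHashMap<>();

        List<String> imageUrls = new ArrayList<>();
        for (Element img : dom.select("img")) {
            imageUrls.add(img.attr("abs:src")); // absolute URL of each image
        }
        scrape.put("images", imageUrls);

        List<String> mediaUrls = new ArrayList<>();
        for (Element a : dom.select("a[href$=.mp3], a[href$=.m4v], a[href$=.mp4]")) {
            mediaUrls.add(a.attr("abs:href")); // absolute URL of each media link
        }
        scrape.put("media", mediaUrls);

        return new Gson().toJson(scrape);
    }
}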
Thanks,
Niels
http://nielsmayer.com