Is there anything like the Xwiki-feed-plugin, except that instead of
fetching a feed, it would fetch an HTML document via HTTP, returning a
DOM structure that can be scanned or filtered by API calls, e.g.:
#set($fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com"))
#set($images = $fetchedDom.getImgList())
#set($media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"]))
#set($content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content']))
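
To make the idea concrete, here is a rough sketch of what those three calls might do on an already-fetched page, held entirely in memory. It is plain Python using only the standard library, not XWiki code; the class name ScrapedDocument and its method names are made up to mirror the hypothetical plugin calls above, and it collects only the text inside each div rather than building a full DOM.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScrapedDocument(HTMLParser):
    """Hypothetical in-memory stand-in for what getDocumentDOM() would return."""

    def __init__(self, html):
        super().__init__(convert_charrefs=True)
        self.images = []      # src of every <img>
        self.hrefs = []       # href of every <a>
        self.div_text = {}    # div id -> text fragments found inside that div
        self._open_divs = []  # stack of ids (or None) for currently open <div>s
        self.feed(html)
        self.close()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])
        elif tag == "div":
            div_id = attrs.get("id")
            self._open_divs.append(div_id)
            if div_id is not None:
                self.div_text.setdefault(div_id, [])

    def handle_endtag(self, tag):
        if tag == "div" and self._open_divs:
            self._open_divs.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Credit the text to every enclosing div that has an id.
        for div_id in self._open_divs:
            if div_id is not None:
                self.div_text[div_id].append(text)

    def get_img_list(self):
        return list(self.images)

    def get_anchor_hrefs_by_extension(self, extensions):
        # Compare against the URL path only, so query strings do not interfere.
        wanted = tuple(ext.lower() for ext in extensions)
        return [h for h in self.hrefs
                if urlparse(h).path.lower().endswith(wanted)]

    def get_div_list_by_id(self, ids):
        return {i: " ".join(self.div_text[i]) for i in ids if i in self.div_text}
```

Nothing here touches a database: once the last reference to the ScrapedDocument goes away, the garbage collector reclaims it along with the collected lists.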
Since this would happen on the server, you'd probably need to "fake"
being a real browser (or just capture the user's browser configuration
and pass it via the call to the hypothetical "getDocumentDOM()") in
order to capture an accurate scraped representation of a modern site.
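
As a rough illustration of the "fake a browser" part, the sketch below builds a server-side request that either forwards a few of the visitor's own headers or falls back to a browser-like default. It is standard-library Python rather than XWiki code; the function names, the default User-Agent string, and the choice of which headers to forward are all assumptions made for the example. Note that it only changes what the server sees in the request headers; a site that builds its pages with JavaScript would still need a real rendering engine.

```python
import urllib.request

# Assumed fallback; any realistic browser User-Agent string would do here.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)

# Headers worth passing through from the visitor's own request (an assumption).
FORWARDED_HEADERS = ("User-Agent", "Accept", "Accept-Language")


def build_browser_request(url, client_headers=None):
    """Build a request that presents itself as a browser."""
    headers = {
        "User-Agent": DEFAULT_USER_AGENT,
        "Accept": "text/html,application/xhtml+xml",
    }
    for name in FORWARDED_HEADERS:
        value = (client_headers or {}).get(name)
        if value:
            headers[name] = value
    return urllib.request.Request(url, headers=headers)


def fetch_html(url, client_headers=None, timeout=10):
    """Fetch the page and decode it using the charset the server declares."""
    request = build_browser_request(url, client_headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Only an explicit allow-list of headers is forwarded, so things like the visitor's cookies never leak to the scraped site.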
The existing examples I've seen store an Xwiki document in the database
first. I was hoping there was an "in memory" option that would allow
the document to be maintained in the app's context for long enough to
process the remaining stream of plugin calls, such as "getDivListById()"
or "getAnchorHREFsByExtension()", and then have the DOM appropriately
disposed of via garbage collection once it is no longer referenced.
But compared to the implementation headaches (retrieving a potentially
large document into memory incrementally, parsing it into a DOM
incrementally, making that available in the context, etc.), maybe I
should just write the damn document into the database, scrape it, and
delete it.
Since I would use Xwiki to store a JSON "scrape" of the document in the
DB (as an Xwiki doc), I could store it in XWiki.JavaScriptExtension[0]
of the retrieved document, and then just delete the wiki contents after
scraping. So actually, if anybody has any suggestions for "scraping" a
retrieved document that is stored as an Xwiki doc, please suggest them
as well! This seems like an area potentially fraught with peril that
many people have already dealt with, so I would appreciate advice.
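
For what it's worth, the JSON "scrape" I have in mind would be something small like the record built below. This is only a sketch in standard-library Python; the function name and the field names (source, images, media, content) are placeholders I made up, not an existing Xwiki format.

```python
import json


def build_scrape_record(source_url, images, media, content):
    """Serialize the scraped pieces of one page as a JSON string.

    images and media are lists of URLs; content maps a div id to its text.
    """
    record = {
        "source": source_url,
        "images": list(images),
        "media": list(media),
        "content": dict(content),
    }
    # sort_keys keeps the output stable, which makes stored copies easy to diff.
    return json.dumps(record, indent=2, sort_keys=True)
```

The resulting string is what would be saved in the Xwiki doc, after which the raw page contents could be thrown away.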
Thanks,
Niels
http://nielsmayer.com