Hi Niels,
You could easily call $xwiki.getURLContent(), which returns the content at a URL.
Then you can use our XHTML parser to generate an XDOM and do whatever you want with it.
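For example, a rough Groovy sketch (the Utils component lookup and the "xhtml/1.0" parser hint are my assumptions about the component wiring; adjust to your version):

    import org.xwiki.rendering.parser.Parser
    import com.xpn.xwiki.web.Utils

    // Fetch the raw page markup; getURLContent() is part of the XWiki API
    // exposed to scripts as "xwiki".
    def html = xwiki.getURLContent("http://nielsmayer.com")

    // Look up the XHTML parser component and build an XDOM from the markup.
    // The Utils lookup and the "xhtml/1.0" hint are assumptions; adjust to
    // whatever the component manager expects in your XWiki version.
    def parser = Utils.getComponent(Parser.class, "xhtml/1.0")
    def xdom = parser.parse(new StringReader(html))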
Only one small issue: the parser isn't available from wiki content right now. But if you're using Groovy it should be easy.
For large documents we could easily add a method to the Parser interface: parse(Reader, Listener). All you'd need to do is implement Listener in a Groovy script, for example, and you'd get called for each element in the page.
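As a hypothetical sketch of that (the parse(Reader, Listener) signature doesn't exist yet): Groovy can coerce a Map of closures into an interface, so a scraper would only write bodies for the events it cares about:

    import org.xwiki.rendering.listener.Listener

    // Map-of-closures coerced to the Listener interface; a complete
    // scraper would need no-op closures for every other Listener event,
    // since uncovered events would fail at runtime.
    def listener = [
        onImage: { image, isFreeStanding, parameters ->
            println "image: ${image}"
        }
    ] as Listener

    // The proposed (not yet existing) streaming call would then be:
    // parser.parse(new StringReader(html), listener)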
Thanks
-Vincent
On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
Is there anything like the Xwiki-feed-plugin except that, instead of fetching a feed, it would fetch an HTML document via HTTP, returning a DOM structure that can be scanned or filtered by API calls, e.g.:
$fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
$images = $fetchedDom.getImgList()
$media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
$content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
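In XDOM terms, I imagine the equivalent of getImgList() and getAnchorHREFsByExtension() would look roughly like this Groovy (getChildrenByType() and the block classes are guesses at the rendering API; check the javadoc for your version):

    import org.xwiki.rendering.block.ImageBlock
    import org.xwiki.rendering.block.LinkBlock

    // Given an XDOM parsed from the fetched page, pull out every image
    // block, then keep only links whose target ends in a media extension.
    def images = xdom.getChildrenByType(ImageBlock.class, true)
    def media = xdom.getChildrenByType(LinkBlock.class, true).findAll { block ->
        def ref = block.link.reference
        [".mp3", ".m4v", ".mp4"].any { ext -> ref?.endsWith(ext) }
    }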
Since this would happen on the server, you'd probably need to "fake" being a real browser (or just capture the user's browser configuration and pass it via the call to the hypothetical "getDocumentDOM()") in order to capture an accurate scraped representation of a modern site.
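For the user-agent part, plain java.net would do; something like this Groovy (assuming the servlet request is reachable from the script as "request"):

    // Forward the requesting user's User-Agent so the remote site serves
    // the same markup it would serve that browser; fall back to a generic
    // string if the header is missing.
    def conn = new URL("http://nielsmayer.com").openConnection()
    def ua = request?.getHeader("User-Agent") ?: "Mozilla/5.0"
    conn.setRequestProperty("User-Agent", ua)
    def html = conn.inputStream.getText("UTF-8")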
The existing examples I've seen store an XWiki document in the database first. I was hoping there was an "in memory" option that would keep the document in the app's context long enough to process the remaining stream of plugin calls such as getDivListById() or getAnchorHREFsByExtension(), and then dispose of the DOM via garbage collection once it's no longer referenced. Then again, compared to the implementation headaches of retrieving a potentially large document into memory incrementally, parsing it into a DOM incrementally, making that available in the context, etc., maybe I should just write the damn document into the database, scrape it, and delete it.
Since I would use XWiki to store a JSON "scrape" of the document in the DB (as an XWiki doc), I could store it in XWiki.JavaScriptExtension[0] of the retrieved document and then just delete the wiki content after scraping... So if anybody has suggestions for "scraping" a retrieved document stored as an XWiki doc, please suggest those as well! This seems like an area potentially fraught with peril that many people have already dealt with, so I would appreciate advice.
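For the storage half, I imagine something like this Groovy against the public document API (the page name is made up, and it assumes an XWiki.JavaScriptExtension object already exists on the page):

    // Hypothetical scratch page holding the fetched document and its scrape.
    def doc = xwiki.getDocument("Scratch.ScrapeTarget")
    def jsonScrape = '{"images":[],"media":[]}'   // placeholder scrape result

    // Stash the JSON in the first JavaScriptExtension object, then drop
    // the fetched wiki content now that it has been scraped.
    def jsx = doc.getObject("XWiki.JavaScriptExtension")
    jsx.set("code", jsonScrape)
    doc.setContent("")
    doc.save()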
Thanks,
Niels
http://nielsmayer.com