Hi Niels,
You could easily call $xwiki.getURLContent(), which returns the content at a URL.
Then you can use our XHTML parser to generate an XDOM and do whatever you want with it.
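For example, a rough Groovy sketch (the Utils component lookup and the "xhtml/1.0" parser hint are my assumptions about the component wiring; adjust to your version):

    import org.xwiki.rendering.parser.Parser
    import com.xpn.xwiki.web.Utils

    // Fetch the raw page markup; getURLContent() is part of the XWiki API
    // exposed to scripts as "xwiki".
    def html = xwiki.getURLContent("http://nielsmayer.com")

    // Look up the XHTML parser component and build an XDOM from the markup.
    // The Utils lookup and the "xhtml/1.0" hint are assumptions; adjust to
    // whatever the component manager expects in your XWiki version.
    def parser = Utils.getComponent(Parser.class, "xhtml/1.0")
    def xdom = parser.parse(new StringReader(html))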
Only one small issue: the parser isn't available from wiki content right now. But if you're using Groovy it should be easy.
For large documents we could easily add a method to the Parser interface: parse(Reader, Listener). All you'd need to do is implement Listener in a Groovy script, for example, and you'd get called for each element in the page.
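As a hypothetical sketch of that (the parse(Reader, Listener) signature doesn't exist yet): Groovy can coerce a Map of closures into an interface, so a scraper would only write bodies for the events it cares about:

    import org.xwiki.rendering.listener.Listener

    // Map-of-closures coerced to the Listener interface; a complete
    // scraper would need no-op closures for every other Listener event,
    // since uncovered events would fail at runtime.
    def listener = [
        onImage: { image, isFreeStanding, parameters ->
            println "image: ${image}"
        }
    ] as Listener

    // The proposed (not yet existing) streaming call would then be:
    // parser.parse(new StringReader(html), listener)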
Thanks
-Vincent
On Jun 18, 2009, at 8:01 PM, Niels Mayer wrote:
Is there anything like the Xwiki-feed-plugin except that, instead of fetching a feed, it would fetch an HTML document via HTTP, returning a DOM structure that can be scanned or filtered by API calls, e.g.:
$fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
$images = $fetchedDom.getImgList()
$media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
$content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
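In XDOM terms, I imagine the equivalent of getImgList() and getAnchorHREFsByExtension() would look roughly like this Groovy (getChildrenByType() and the block classes are guesses at the rendering API; check the javadoc for your version):

    import org.xwiki.rendering.block.ImageBlock
    import org.xwiki.rendering.block.LinkBlock

    // Given an XDOM parsed from the fetched page, pull out every image
    // block, then keep only links whose target ends in a media extension.
    def images = xdom.getChildrenByType(ImageBlock.class, true)
    def media = xdom.getChildrenByType(LinkBlock.class, true).findAll { block ->
        def ref = block.link.reference
        [".mp3", ".m4v", ".mp4"].any { ext -> ref?.endsWith(ext) }
    }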
Since this would happen on the server, you'd probably need to "fake" being a real browser (or just capture the user's browser configuration and pass it via the call to the hypothetical "getDocumentDOM()") in order to capture an accurate scraped representation of a modern site.
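For the user-agent part, plain java.net would do; something like this Groovy (assuming the servlet request is reachable from the script as "request"):

    // Forward the requesting user's User-Agent so the remote site serves
    // the same markup it would serve that browser; fall back to a generic
    // string if the header is missing.
    def conn = new URL("http://nielsmayer.com").openConnection()
    def ua = request?.getHeader("User-Agent") ?: "Mozilla/5.0"
    conn.setRequestProperty("User-Agent", ua)
    def html = conn.inputStream.getText("UTF-8")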
The existing examples I've seen store an XWiki document in the database first. I was hoping there was an "in memory" option that would keep the document in the app's context long enough to process the remaining stream of plugin calls such as getDivListById() or getAnchorHREFsByExtension(), and then dispose of the DOM via garbage collection once it's no longer referenced. Then again, compared to the implementation headaches of retrieving a potentially large document into memory incrementally, parsing it into a DOM incrementally, making that available in the context, etc., maybe I should just write the damn document into the database, scrape it, and delete it.
Since I would use XWiki to store a JSON "scrape" of the document in the DB (as an XWiki doc), I could store it in XWiki.JavaScriptExtension[0] of the retrieved document and then just delete the wiki content after scraping... So if anybody has suggestions for "scraping" a retrieved document stored as an XWiki doc, please suggest those as well! This seems like an area potentially fraught with peril that many people have already dealt with, so I would appreciate advice.
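For the storage half, I imagine something like this Groovy against the public document API (the page name is made up, and it assumes an XWiki.JavaScriptExtension object already exists on the page):

    // Hypothetical scratch page holding the fetched document and its scrape.
    def doc = xwiki.getDocument("Scratch.ScrapeTarget")
    def jsonScrape = '{"images":[],"media":[]}'   // placeholder scrape result

    // Stash the JSON in the first JavaScriptExtension object, then drop
    // the fetched wiki content now that it has been scraped.
    def jsx = doc.getObject("XWiki.JavaScriptExtension")
    jsx.set("code", jsonScrape)
    doc.setContent("")
    doc.save()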
Thanks,
Niels
http://nielsmayer.com