Is there anything like the XWiki feed plugin, except that instead of fetching
a feed, it would fetch an HTML document via HTTP, returning a DOM structure
that can be scanned or filtered by API calls, e.g.:
$fetchedDom = $xwiki.FetchPlugin.getDocumentDOM("http://nielsmayer.com")
$images = $fetchedDom.getImgList()
$media = $fetchedDom.getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
$content = $fetchedDom.getDivListById(['xwikicontent', 'container', 'content'])
Since this would happen on the server, you'd probably need to "fake" being a
real browser (or just capture the user's browser configuration and pass it
via the call to the hypothetical "getDocumentDOM()") in order to get an
accurate scraped representation of a modern site.
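To make the idea concrete, here's a rough sketch of what the fetch side might
look like if it were built on top of jsoup -- purely an assumption on my
part; the class name, User-Agent string, and timeout are placeholders, not an
existing XWiki API:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchPluginSketch {
    // Hypothetical backing method for the getDocumentDOM() call above.
    // Sends a browser-like User-Agent so the remote server returns the same
    // markup it would serve to a real browser.
    public static Document getDocumentDOM(String url) throws IOException {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0")
                .timeout(10000) // milliseconds; fail fast on slow hosts
                .get();         // fetch over HTTP and parse into an in-memory DOM
    }
}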
The existing examples I've seen store an XWiki document in the database
first. I was hoping there was an "in memory" option that would keep the
document in the app's context just long enough to process the remaining
stream of plugin calls, such as getDivListById() or
getAnchorHREFsByExtension(), and then dispose of the DOM via garbage
collection once it's no longer referenced. Then again, compared to the
implementation headaches -- retrieving a potentially large document into
memory incrementally, parsing it into a DOM incrementally, making it
available in the context, and so on -- maybe I should just write the damn
document into the database, scrape it, and delete it.
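For what it's worth, the "in memory" usage I have in mind would look
something like this (again assuming a jsoup-backed DOM; the CSS selectors
just mirror the hypothetical calls above):

import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class InMemoryScrape {
    public static void main(String[] args) throws Exception {
        Document dom = FetchPluginSketch.getDocumentDOM("http://nielsmayer.com");

        // getImgList()
        Elements images = dom.select("img");

        // getAnchorHREFsByExtension([".mp3", ".m4v", ".mp4"])
        Elements media = dom.select("a[href$=.mp3], a[href$=.m4v], a[href$=.mp4]");

        // getDivListById(['xwikicontent', 'container', 'content'])
        Elements content = dom.select("div#xwikicontent, div#container, div#content");

        System.out.println(images.size() + " images, " + media.size()
                + " media links, " + content.size() + " content divs");
        // Once 'dom' goes out of scope at the end of the request, the whole
        // tree is unreferenced and eligible for ordinary garbage collection --
        // no explicit disposal step needed.
    }
}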
Since I would use XWiki to store a JSON "scrape" of the document in the DB
(as an XWiki doc), I could store it in XWiki.JavaScriptExtension[0] of the
retrieved document, and then just delete the wiki contents after
scraping.... So if anybody has suggestions for "scraping" a retrieved
document stored as an XWiki doc, please suggest those as well! This seems
like an area fraught with peril that many people have already dealt with, so
I would appreciate advice.
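For the JSON "scrape" itself, I'm picturing something like the following,
with the resulting string written into XWiki.JavaScriptExtension[0] before
the page content is deleted (Gson is used here purely as an example
serializer, and the field names are made up):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import com.google.gson.Gson;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScrapeToJson {
    // Collapse the scraped DOM into a JSON payload small enough to store
    // in a field of an XWiki.JavaScriptExtension object.
    public static String toJson(Document dom) {
        Map<String, List<String>> scrape = new LinkedHashMap<>();

        List<String> imageUrls = new ArrayList<>();
        for (Element img : dom.select("img")) {
            imageUrls.add(img.attr("abs:src")); // absolute URL of each image
        }
        scrape.put("images", imageUrls);

        List<String> mediaUrls = new ArrayList<>();
        for (Element a : dom.select("a[href$=.mp3], a[href$=.m4v], a[href$=.mp4]")) {
            mediaUrls.add(a.attr("abs:href")); // absolute URL of each media link
        }
        scrape.put("media", mediaUrls);

        return new Gson().toJson(scrape);
    }
}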
Thanks,
Niels
http://nielsmayer.com