  Scraper Host Object
Added by Jonathan Marsh, last edited by Keith Chapman on Sep 10, 2007

1.0 Introduction

The Scraper object allows data to be extracted from HTML pages and presented in XML format.  It provides a bridge to data sources that don't have XML or Web service representations at present.

1.1 Example

var config =
    <config>
        <var-def name='response'>
            <html-to-xml>
                <http method='get' url='http://ww2.wso2.org/~builder/'/>
            </html-to-xml>
        </var-def>
    </config>;
var scraper = new Scraper(config);
result = scraper.response;
// strip off the XML declaration and parse as XML.
resultXML = new XML(result.substring(result.indexOf('?>') + 2));
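
The substring expression above assumes the result always begins with an XML declaration. If no declaration is present, indexOf returns -1 and the expression silently drops the first character of the result. A minimal sketch of a more defensive approach, in plain JavaScript; the function name stripXmlDeclaration is hypothetical and is not part of the Scraper host object:

```javascript
// Hypothetical helper, not part of the Scraper API.
// Removes a leading XML declaration (<?xml ... ?>) and any whitespace
// around it. Text that does not start with a declaration is returned
// unchanged.
function stripXmlDeclaration(text) {
    return text.replace(/^\s*<\?xml\s[\s\S]*?\?>\s*/, '');
}

// Inside a mashup, the cleaned string could then be handed to the E4X
// constructor, for example: new XML(stripXmlDeclaration(result))
var cleaned = stripXmlDeclaration('<?xml version="1.0"?>\n<html><body/></html>');
```

The regular expression only matches at the start of the string, so processing instructions elsewhere in the document are left alone.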

2.0 Scraper Object

The Scraper object exposes a single operation, scrape, which takes a set of scraping instructions written in an XML-based configuration language. The scraping component wraps the WebHarvest open source scraper, and the configuration language is described in the WebHarvest User Manual.

Note that there are a few caveats when using the screen scraping language from within the Scraper object and within E4X, as listed below:

  1. The result of the scrape must be saved in a variable named "response":
    var config =
        <config>
            <var-def name='response'>
                <html-to-xml>
                    <http method='get' url='http://ww2.wso2.org/~builder/'/>
                </html-to-xml>
            </var-def>
        </config>;
  2. The result comes back as a string at present. When the result represents XML, not only do you have to parse it into XML yourself, but you have to make sure you remove the XML declaration. The XML constructor does not parse documents, but only node lists, and rejects the declaration as an illegal processing instruction:
    var scraper = new Scraper();
    result = scraper.scrape(config);
    // strip off the XML declaration and parse as XML.
    resultXML = new XML(result.substring(result.indexOf('?>') + 2));
    return resultXML;
  3. The WebHarvest language <template> instruction allows variables to be referenced, using the notation ${variable-name}. The curly brackets conflict with the use of XML literals in E4X, where they cause evaluation of the enclosed data. To escape the curly brackets in E4X (so they will be interpreted by WebHarvest), use the numeric character references &#x7B; and &#x7D; for '{' and '}' respectively.
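
The escaping described in caveat 3 can be illustrated with a small sketch in plain JavaScript. It rewrites WebHarvest template text so the curly brackets appear as the standard XML numeric character references &#x7B; and &#x7D;, which is the form to type inside an E4X literal. The function name escapeBracesForE4X and the sample template text are hypothetical, chosen only for illustration:

```javascript
// Hypothetical helper, not part of the Scraper API or WebHarvest.
// Replaces every '{' and '}' with its XML numeric character reference
// so that E4X does not evaluate the enclosed text as an expression.
function escapeBracesForE4X(text) {
    return text.replace(/\{/g, '&#x7B;').replace(/\}/g, '&#x7D;');
}

// Sample template text referencing a WebHarvest variable named "name".
var escaped = escapeBracesForE4X('Hello ${name}');
// escaped is 'Hello $&#x7B;name&#x7D;'
```

The escaped text is what would be written between <template> and </template> in the E4X config literal; the XML parser turns the references back into curly brackets before WebHarvest sees them.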

2.1 Scraper Interface

{
    string scrape(XML config);
}

2.2 API Documentation

Member | Description | Supported in version
string scrape(XML config) | Runs the WebHarvest instructions in the <config> element, which define how to scrape data from the Web, and returns the result as a string. | 0.1
