1.0 Introduction
The Scraper object extracts data from HTML pages and presents it as XML. It provides a bridge to data sources that do not currently have XML or Web service representations.
1.1 Example
var config =
<config>
<var-def name='response'>
<html-to-xml>
<http method='get' url='http: </html-to-xml>
</var-def>
</config>;
var scraper = new Scraper(config);
result = scraper.response;
resultXML = new XML(result.substring(result.indexOf('?>') + 2));
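The substring call above assumes the result always begins with an XML declaration. If no '?>' is found, indexOf returns -1 and the expression drops the first character of the result instead. A slightly more defensive variant might look like the sketch below. The function name stripXmlDeclaration is a hypothetical helper, not part of the Scraper API, and it uses only plain JavaScript string operations.

```javascript
// Hypothetical helper (an assumption for illustration, not part of the Scraper API).
// Removes a leading XML declaration such as <?xml version="1.0"?> together with
// any whitespace around it. If the text does not start with a declaration,
// it is returned unchanged.
function stripXmlDeclaration(text) {
    // "<?xml" must be followed by whitespace, so processing instructions such
    // as <?xml-stylesheet ...?> are left alone.
    return text.replace(/^\s*<\?xml\s[\s\S]*?\?>\s*/, '');
}

// Sample input standing in for a scrape result.
var sample = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body/></html>';
console.log(stripXmlDeclaration(sample));
```

Under these assumptions, the cleaned string could then be passed to the E4X constructor in place of the substring expression, for example new XML(stripXmlDeclaration(result)).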
2.0 Scraper Object
The Scraper object has a single operation, scrape, which takes a set of scraping instructions written in an XML language. The scraping component wraps the open source WebHarvest scraper, whose XML configuration language is described in the WebHarvest User Manual.
Note that there are a few caveats when using the screen scraping language from within the Scraper object and within E4X, as listed below:
- The result of the scrape must be saved in a variable named "response":
var config =
<config>
<var-def name='response'>
<html-to-xml>
<http method='get' url='http: </html-to-xml>
</var-def>
</config>;
- The result comes back as a string at present. When the result represents XML, not only do you have to parse it into XML yourself, but you have to make sure you remove the XML declaration. The XML constructor does not parse documents, but only node lists, and rejects the declaration as an illegal processing instruction:
var scraper = new Scraper();
result = scraper.scrape(config);
resultXML = new XML(result.substring(result.indexOf('?>') + 2));
return resultXML;
- The WebHarvest language <template> instruction allows variables to be referenced, using the notation ${variable-name}. The curly brackets conflict with the use of XML literals in E4X, where they cause evaluation of the enclosed data. To escape the curly brackets in E4X (so they will be interpreted by WebHarvest), use the character entity references &#123; and &#125; for '{' and '}' respectively.
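As an illustration of the last caveat, the sketch below rewrites a WebHarvest snippet so that it can be pasted into an E4X literal. It replaces each curly bracket with its numeric character reference (&#123; for '{' and &#125; for '}'). The function name escapeBracesForE4X and the variable name greeting are hypothetical, chosen only for this example.

```javascript
// Hypothetical helper (an assumption for illustration, not part of the Scraper API).
// Replaces every '{' and '}' in a WebHarvest snippet with the numeric character
// references &#123; and &#125;, so that E4X does not evaluate the enclosed text
// when the snippet appears inside an XML literal.
function escapeBracesForE4X(snippet) {
    return snippet.replace(/[{}]/g, function (ch) {
        return ch === '{' ? '&#123;' : '&#125;';
    });
}

// "greeting" is a made-up WebHarvest variable name.
console.log(escapeBracesForE4X('<template>${greeting}</template>'));
// prints: <template>$&#123;greeting&#125;</template>
```

The escaping applies only to XML literals written directly in E4X source. Braces inside an ordinary JavaScript string are not evaluated by E4X.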
2.1 Scraper Interface
{
string scrape(XML config);
}
2.2 API Documentation
| Member | Description | Supported in version |
| string scrape(XML config) | Runs the WebHarvest instructions in the <config> element, which define how to scrape data from the Web, and returns the result as a string. | 0.1 |
3.0 References