How we do it
This is the era of specialized AI: AI judiciously tailored to specific problems such as image recognition, machine translation, or, with Wrapidity, automatic data extraction. Extracting structured data from the web is a long-standing challenge in search and knowledge acquisition that has withstood repeated attempts at a generic solution. With Wrapidity we have developed an object extraction system that exploits extensive metadata about the relevant objects (in the form of both a schema and sample instances). With this approach we outperform existing semi-supervised and unsupervised approaches by a wide margin (>95% accuracy across a wide range of domains and sites).
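To make the idea of schema-plus-samples metadata concrete, here is a minimal sketch of what such object metadata could look like for a real-estate vertical. All names, fields, and the validation helper are illustrative assumptions, not Wrapidity's actual data model or API.

```python
# Hypothetical metadata for one vertical: a schema describing the target
# object plus a few sample instances (structure is illustrative only).
REAL_ESTATE_SCHEMA = {
    "entity": "property_listing",
    "attributes": {
        "price":    {"type": "currency", "required": True},
        "location": {"type": "address",  "required": True},
        "bedrooms": {"type": "integer",  "required": False},
    },
}

SAMPLE_INSTANCES = [
    {"price": "£450,000", "location": "Oxford, OX1", "bedrooms": 3},
    {"price": "£295,000", "location": "Reading, RG1", "bedrooms": 2},
]

def matches_schema(candidate: dict, schema: dict) -> bool:
    """Check that a candidate record provides every required attribute."""
    return all(
        name in candidate
        for name, spec in schema["attributes"].items()
        if spec["required"]
    )
```

A candidate extraction that lacks a required attribute (say, a record with a price but no location) would fail this check, which is one simple way sample-backed metadata can constrain extraction.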
How do websites work, and what data is relevant to your problem?
Most of this knowledge is generic, but some of it is task- or vertical-specific and thus needs to be acquired for each task or vertical, e.g., that location is key in real estate. While this acquisition often requires some human supervision, it is needed only once for an entire vertical.
Humans think in patterns, and thus most websites follow a common set of conventions for presenting data; e.g., most shops display price information prominently.
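One way such a convention could be exploited is by scoring candidate text nodes on simple visual cues. The sketch below is a toy illustration under assumed node fields and weights; it is not Wrapidity's actual model.

```python
# Hypothetical sketch: the "prominent price" convention as a scoring
# heuristic. Larger fonts near the top of the page score higher.
# Node fields and weights are illustrative assumptions.

def prominence_score(node: dict) -> float:
    """Score a text node by font size and vertical position."""
    return node["font_size"] * 2.0 - node["y_position"] * 0.01

candidates = [
    {"text": "£450,000", "font_size": 28, "y_position": 120},
    {"text": "Ref: 8841", "font_size": 11, "y_position": 90},
    {"text": "£12 delivery", "font_size": 12, "y_position": 900},
]

best = max(candidates, key=prominence_score)
# The large, high-up price outranks the reference number and the
# small delivery-fee line.
```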
Wrapidity exploits redundancy at many levels, whether in the presentation of the data, in the actual instances within the same source, or in instances shared between sources.
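A simple illustration of redundancy within one source: if an extraction rule claims to find the price field, the repeated records on a listing page should agree with it. The sketch below (pattern, threshold, and data are all assumptions for illustration) accepts a rule only when most records yield a well-formed value.

```python
import re

# Hypothetical sketch of redundancy-based validation: agreement across
# repeated records is evidence that an extraction rule is correct.
PRICE_RE = re.compile(r"£[\d,]+")

def consensus_price_rule(record_texts: list) -> bool:
    """Accept the rule only if >=80% of records yield a well-formed price."""
    hits = sum(1 for text in record_texts if PRICE_RE.search(text))
    return hits / len(record_texts) >= 0.8

records = [
    "3 bed house £450,000 Oxford",
    "2 bed flat £295,000 Reading",
    "Studio £180,000 London",
]
```

The same idea extends across sources: a value seen for the same object on several sites reinforces both the value and the rules that extracted it.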
Wrapidity has developed a specialized AI for the autonomous exploration and classification of websites and their constituent objects. This AI adapts to each website by automatically composing atomic exploration actions, and it relies on a specialized entity recognition that considers page context rather than textual context.
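The composition of atomic exploration actions can be sketched as chaining small navigation steps into a site-specific plan. The action names and state representation below are hypothetical, chosen only to show the composition pattern.

```python
# Hypothetical sketch: atomic exploration actions composed into a
# navigation plan. Each action maps an exploration state to a new one.

def fill_search_form(state: dict) -> dict:
    return {**state, "page": "results", "steps": state["steps"] + ["search"]}

def follow_next_page(state: dict) -> dict:
    return {**state, "page": "results_2", "steps": state["steps"] + ["next"]}

def open_detail_page(state: dict) -> dict:
    return {**state, "page": "detail", "steps": state["steps"] + ["detail"]}

def compose(*actions):
    """Chain atomic actions into one exploration plan."""
    def plan(state: dict) -> dict:
        for action in actions:
            state = action(state)
        return state
    return plan

explore = compose(fill_search_form, follow_next_page, open_detail_page)
final = explore({"page": "home", "steps": []})
# final["steps"] == ["search", "next", "detail"]
```

Because different sites need different sequences (some have no search form, some paginate differently), composing a plan per site from the same small vocabulary of actions is what lets one system adapt automatically.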