Chapter 9. Heterogeneous and Distributed Searching

Table of Contents

The surface web and the deep web
Architecture

Because of its dynamic nature, a personalized portal that integrates applications and various backend services in real time has difficulty providing a top-level search facility. Much of the discussion in this chapter draws on an article on the future of internet search by Axel Uhl.

The surface web and the deep web

Uhl distinguishes between static web pages (the surface web) and dynamically generated pages (sometimes within a session context) or dynamic queries (the deep web). Regular search engines cannot access content in the deep web, because content returned from HTTP POST requests is not indexable (it does not have a URL). This content is growing at a frightening rate and is already more than 500 times larger than what is available on the surface web. Uhl suggests that applications offer a query interface that a search framework can use to map a top-level query to the different underlying applications and data sources. The keywords here are heterogeneous and distributed search.
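The idea of a common query interface can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Uhl's design: the names Searchable, Hit, InMemorySource and federated_search are invented here, and the scoring (fraction of query terms found) is a placeholder for whatever ranking a real application would provide.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Hit:
    source: str   # name of the application that produced the hit
    title: str
    score: float  # relevance as judged by the source, 0.0 to 1.0


class Searchable(Protocol):
    """Hypothetical query interface that each application would expose."""

    name: str

    def search(self, query: str, limit: int) -> list[Hit]: ...


class InMemorySource:
    """Toy data source standing in for a real backend application."""

    def __init__(self, name: str, documents: list[str]) -> None:
        self.name = name
        self._documents = documents

    def search(self, query: str, limit: int) -> list[Hit]:
        terms = query.lower().split()
        hits = []
        for doc in self._documents:
            words = doc.lower().split()
            matched = sum(1 for term in terms if term in words)
            if matched:
                # Score is the fraction of query terms found in the document.
                hits.append(Hit(self.name, doc, matched / len(terms)))
        hits.sort(key=lambda h: h.score, reverse=True)
        return hits[:limit]


def federated_search(query: str, sources: list[Searchable], limit: int = 10) -> list[Hit]:
    """Send the top-level query to every source and merge the hits by score."""
    merged: list[Hit] = []
    for source in sources:
        merged.extend(source.search(query, limit))
    merged.sort(key=lambda h: h.score, reverse=True)
    return merged[:limit]
```

Each source keeps its own data and its own ranking; the framework only knows the interface. Merging by score assumes the scores of different sources are comparable, which is a simplification that heterogeneous sources do not guarantee in practice.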

From an enterprise portal point of view, it would be nice to be able to offer a) a top-level global search across all services and b) a generated site directory that allows browsing-style access to all information.

There is, of course, the problem of mapping a fragment-based architecture to a search mechanism. Here so-called "topics", a kind of "canned query", could offer a solution. But the biggest and still unsolved problem is the definition of the information model for the portal.
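One possible reading of "topics" as canned queries is a small registry that ties a named, predefined query to the portal fragments that display its results. The sketch below is an assumption made for illustration: the Topic class, the example topic names, queries and fragment ids are all invented, and the word-overlap matching is deliberately naive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    name: str
    canned_query: str    # the stored query that defines the topic
    fragment_ids: tuple  # hypothetical ids of fragments showing this topic


# Hypothetical registry; names, queries, and fragment ids are invented.
TOPICS = [
    Topic("markets", "stock index quotes", ("quotes", "charts")),
    Topic("research", "analyst reports", ("research-list",)),
]


def topics_matching(user_query, topics=TOPICS):
    """Return topics whose name or canned query shares a word with the user query."""
    words = set(user_query.lower().split())
    return [
        topic
        for topic in topics
        if words & ({topic.name} | set(topic.canned_query.lower().split()))
    ]
```

A search for "latest analyst reports" would then resolve to the "research" topic and, through it, to the fragments registered for that topic, without the search mechanism having to index the fragments themselves.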

A different problem is performance. In the chapters above we have shown how backend access affects performance negatively. Offering a search mechanism can easily conflict with the approach of minimizing backend access. So where does a top-level search really work from? The performance problem mirrors the one Uhl has diagnosed for internet search in general: a centralized index causes bandwidth problems (content has to come to the search engine) and performance problems (the query itself is not distributed to the sources, so the sources cannot work on it concurrently). Last but not least, we need a mechanism to cache query results (see the IBM Watson paper on this topic).
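The two remedies named above, distributing the query so that sources work concurrently and caching query results, can be combined in a small sketch. It is a minimal illustration under assumptions, not the approach of the cited paper: the class name CachingDistributedSearch, the time-to-live policy and the modelling of a source as a plain function are all invented here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class CachingDistributedSearch:
    """Fans a query out to all sources in parallel and caches the merged result."""

    def __init__(self, sources, ttl_seconds=60.0, clock=time.monotonic):
        self._sources = sources  # dict: source name -> function(query) -> list of titles
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}         # normalized query -> (expiry time, results)
        self.backend_calls = 0   # counts how often a source was actually queried

    def search(self, query):
        # Normalize case and whitespace so trivially different queries share a cache entry.
        key = " ".join(query.lower().split())
        cached = self._cache.get(key)
        if cached is not None and cached[0] > self._clock():
            return cached[1]     # cache hit: no backend access at all

        # Cache miss: every source works on the query concurrently.
        with ThreadPoolExecutor(max_workers=max(1, len(self._sources))) as pool:
            futures = {name: pool.submit(fn, key) for name, fn in self._sources.items()}
            results = []
            for name, future in futures.items():
                # result() re-raises any exception thrown by the source.
                results.extend((name, title) for title in future.result())

        self.backend_calls += len(self._sources)
        self._cache[key] = (self._clock() + self._ttl, results)
        return results
```

A repeated query within the time-to-live is answered from the cache, which keeps the search facility in line with the goal of minimizing backend access. The price is staleness: results can be up to ttl_seconds old, and a personalized portal would probably also need the user or role as part of the cache key.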