Heterogeneous collection track
Motivation
The primary INEX test collection has been based on a single DTD. In practical
environments, such a restriction will hold in rare cases only. Instead, most
XML collections will consist of documents from different sources, and thus
with different DTDs or Schemas. In addition, distributed systems (federations
or peer-to-peer systems), where each node manages a different type of
collection, will need to be searched and the results combined. If the
collections are semantically diverse, not every collection is suitable
to satisfy the user's information need. Moreover, querying every
collection is expensive in terms of communication costs, so a preselection
of appropriate collections should be performed. A heterogeneous collection
thus poses a number of challenges for XML retrieval, including:
- For content-oriented queries, most current approaches use the
DTD for defining elements that would form reasonable answers. In
heterogeneous collections, DTD-independent methods need to be
developed.
- For content and structure queries, there is the added problem of
mapping structural conditions from one DTD or Schema onto other
(possibly unknown) DTDs and Schemas. Methods from federated databases
could be applied here, where schema mappings between the different
DTDs are defined manually. However, for a larger number of DTDs,
automatic methods must be developed, e.g. based on ontologies. The
goal of an INEX track on heterogeneous collections would be to set up
such a test collection, and investigate the new challenges posed by
such a setting.
- Both content-only and content-and-structure approaches should be able to
preselect suitable collections. This way, retrieval costs can be reduced by
skipping collections that are unlikely to yield valuable answers but are
expensive to query in terms of communication costs.
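As an illustration of the preselection challenge, the following minimal Python sketch ranks collections by how well a compact term summary covers the query and keeps only the top few. It assumes each node publishes a summary with its document count and per-term document frequencies; the summary layout, the names score_collection and preselect, and the averaging score are hypothetical choices made for this example, not a method prescribed by the track.

```python
def score_collection(query_terms, doc_freq, num_docs):
    """Average fraction of documents that contain each query term.

    Illustrative score only: any collection-selection measure could be
    substituted here.
    """
    if num_docs == 0 or not query_terms:
        return 0.0
    total = sum(doc_freq.get(term, 0) / num_docs for term in query_terms)
    return total / len(query_terms)


def preselect(query_terms, summaries, k=2):
    """Return the names of the k best-scoring collections.

    Collections with a score of zero are never selected, so no
    communication cost is spent on them.
    """
    scored = []
    for name, summary in summaries.items():
        score = score_collection(query_terms, summary["df"], summary["n"])
        if score > 0:
            scored.append((score, name))
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical summaries: "n" is the number of documents,
# "df" maps a term to the number of documents containing it.
summaries = {
    "cs_research": {"n": 100, "df": {"xml": 60, "retrieval": 40}},
    "it_business": {"n": 200, "df": {"xml": 20, "market": 150}},
    "travel": {"n": 50, "df": {"hotel": 45}},
}

selected = preselect(["xml", "retrieval"], summaries, k=2)
```

Only the selected collections would then be queried. The trade-off between efficiency and effectiveness is controlled here by k alone, which is a simplification.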
This track aims to answer, among others, the following research questions:
- For content-oriented queries, what methods are possible for
determining which elements contain reasonable answers? Are purely
statistical methods appropriate, or are ontology-based approaches also
helpful?
- What methods can be used to map structural criteria onto other DTDs?
- Should mappings focus on element names only, or also deal with
element content or semantics?
- How can suitable collections be preselected to improve retrieval
efficiency without degrading retrieval effectiveness?
- What are appropriate evaluation criteria for heterogeneous collections?
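To make the mapping questions concrete, the following Python sketch maps the element names of a structural condition onto the element names of another DTD, using name information only. It assumes a small hand-made synonym table and the string similarity from Python's difflib; the SYNONYMS table, the 0.9 synonym score, the 0.6 threshold, and the function names are all hypothetical. A mapping that also considers element content or semantics would need considerably more than this.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for an ontology lookup.
SYNONYMS = {
    "author": {"creator", "writer"},
    "title": {"heading", "headline"},
}


def name_similarity(a, b):
    """Similarity in [0, 1] between two element names."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    if b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set()):
        return 0.9  # assumed score for a synonym match
    return SequenceMatcher(None, a, b).ratio()


def map_element(name, target_names, threshold=0.6):
    """Return the most similar target element name, or None if none is close enough."""
    best = max(target_names, key=lambda t: name_similarity(name, t), default=None)
    if best is None or name_similarity(name, best) < threshold:
        return None
    return best


def map_path(path, target_names, threshold=0.6):
    """Map each step of a slash-separated element path independently."""
    return [map_element(step, target_names, threshold) for step in path.split("/")]


# Element names of a hypothetical target DTD.
target = ["book", "heading", "creator"]
mapped = map_path("book/title", target)
```

Mapping each step independently ignores the nesting structure of the target DTD, so a result may not correspond to a valid path there. Handling this is part of the open question raised above.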
In 2004-2005, the heterogeneous track was mainly exploratory. This year we
intend to expand both the number and the syntactic and semantic diversity of
the collections to be used. The collections are based on different DTDs and
deal with different topics (from computer science research to IT business to
non-IT-related issues). The primary focus will still be on the construction
of an appropriate test collection, and on appropriate tools for evaluation of
heterogeneous retrieval. Of equal importance is the exploration of the
research questions outlined above.