Heterogeneous collection track

Motivation

The primary INEX test collection has been based on a single DTD. In practical environments, such a restriction will rarely hold. Instead, most XML collections will consist of documents from different sources, and thus with different DTDs or Schemas. In addition, distributed systems (federations or peer-to-peer systems), where each node manages a different type of collection, will need to be searched and the results combined. If the collections are semantically diverse, not every collection is suitable for satisfying the user's information need. At the same time, querying every collection is expensive in terms of communication costs, so appropriate collections should be preselected. A heterogeneous collection therefore poses a number of challenges for XML retrieval, including:

  1. For content-oriented queries, most current approaches use the DTD for defining elements that would form reasonable answers. In heterogeneous collections, DTD-independent methods need to be developed.
  2. For content and structure queries, there is the added problem of mapping structural conditions from one DTD or Schema onto other (possibly unknown) DTDs and Schemas. Methods from federated databases could be applied here, where schema mappings between the different DTDs are defined manually. However, for a larger number of DTDs, automatic methods must be developed, e.g. based on ontologies. The goal of an INEX track on heterogeneous collections would be to set up such a test collection, and investigate the new challenges posed by such a setting.
  3. Both content-only and content-and-structure approaches should be able to preselect suitable collections. In this way, retrieval costs can be minimized by skipping collections that are unlikely to yield valuable answers but are expensive to query in terms of communication effort.
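
As an illustration of the manually defined schema mappings mentioned in item 2, the following minimal Python sketch rewrites a structural condition for a second DTD. The DTD names (dtd_b, dtd_c), the element names, and the function map_path are all hypothetical and are not part of the INEX collections or tools. Paths are simplified to slash-separated element names.

```python
# Hypothetical, manually defined mappings from the element names of a
# source DTD to the element names of two illustrative target DTDs.
SCHEMA_MAPPINGS = {
    "dtd_b": {"article": "paper", "sec": "section", "au": "author"},
    "dtd_c": {"article": "doc", "sec": "div"},
}


def map_path(path, target):
    """Rewrite a slash-separated element path for the target DTD.

    Returns None if any element in the path has no counterpart in the
    target DTD, i.e. the structural condition cannot be translated.
    The result is always emitted as a descendant path ("//...").
    """
    mapping = SCHEMA_MAPPINGS[target]
    steps = [step for step in path.split("/") if step]
    mapped = []
    for step in steps:
        if step not in mapping:
            return None  # no manual mapping defined for this element
        mapped.append(mapping[step])
    return "//" + "/".join(mapped)


if __name__ == "__main__":
    print(map_path("//article/sec", "dtd_b"))
    print(map_path("//article/au", "dtd_c"))
```

A table like this has to be written by hand for every pair of DTDs, which is why the text above calls for automatic methods once the number of DTDs grows.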

This track aims to answer, among others, the following research questions:

  1. For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
  2. What methods can be used to map structural criteria onto other DTDs?
  3. Should mappings focus on element names only, or also deal with element content or semantics?
  4. How can suitable collections be preselected to improve retrieval efficiency without degrading retrieval effectiveness?
  5. What are appropriate evaluation criteria for heterogeneous collections?

In 2004-2005, the heterogeneous track was mainly exploratory. This year we intend to expand both the number and the syntactic and semantic diversity of the collections to be used. The collections are based on different DTDs and deal with different topics (from computer science research through IT business to non-IT-related issues). The primary focus will still be on the construction of an appropriate test collection, and on appropriate tools for evaluation of heterogeneous retrieval. Of equal importance is the exploration of the research questions outlined above.


Schedule


May 28 - submission deadline for candidate topics
Jun 08 - distribution of final topics
Jul 30 - submission deadline for search results
Aug 20 - distribution of results to participants for relevance assessments
Oct 15 - deadline for relevance judgements
Oct 30 - distribution of relevance judgements and evaluation scores to participants
Nov 27 - submission of papers for the workshop pre-proceedings
Dec 08 - workshop pre-proceedings and workshop programme online
 

Organisers

Ingo Frommholz
University of Duisburg-Essen
Fak. 5/IIIS
Information Systems
Lotharstr. 65
47048 Duisburg
http://www.is.informatik.uni-duisburg.de
Email: ingo.frommholz@uni-due.de


Ray Larson
School of Information Management and Systems
University of California, Berkeley
Berkeley, California 94720-4600
http://www.sims.berkeley.edu/people/faculty/raylarson
Email: ray@sims.berkeley.edu
Phone: (510)642-6046
Fax: +1 (510)642-5814