
Heterogeneous collection track

Het Track continues in 2007!

Motivation

The primary INEX test collection has been based on a single DTD. In practical environments, such a restriction holds only rarely. Instead, most XML collections consist of documents from different sources, and thus with different DTDs or Schemas. In addition, distributed systems (federations or peer-to-peer systems), where each node manages a different type of collection, need to be searched and their results combined. If the collections are semantically diverse, not every collection is suited to satisfying the user's information need. At the same time, querying every collection is expensive in terms of communication costs, so appropriate collections should be preselected. A heterogeneous collection therefore poses a number of challenges for XML retrieval, including:

  1. For content-oriented queries, most current approaches use the DTD for defining elements that would form reasonable answers. In heterogeneous collections, DTD-independent methods need to be developed.
  2. For content and structure queries, there is the added problem of mapping structural conditions from one DTD or Schema onto other (possibly unknown) DTDs and Schemas. Methods from federated databases could be applied here, where schema mappings between the different DTDs are defined manually. However, for a larger number of DTDs, automatic methods must be developed, e.g. based on ontologies.
  3. Both content-only and content-and-structure approaches should be able to preselect suitable collections. This way, retrieval costs can be minimized by skipping collections that are expensive to query in terms of communication effort but would probably not yield valuable answers.

The goal of an INEX track on heterogeneous collections is to set up such a test collection and to investigate the new challenges posed by this setting.
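As a rough illustration of the manually defined schema mappings mentioned in item 2, the following Python sketch rewrites a simple structural condition for a different DTD. It is only a sketch under stated assumptions: the DTD names (`dtd_b`, `dtd_c`), the element names, and the helper `map_path` are all hypothetical and do not come from the INEX collections, and only paths made of `//name` steps are handled.

```python
# Hypothetical, hand-written mappings from a "source" DTD's element names
# to the corresponding element names in two other (invented) DTDs.
SCHEMA_MAPPINGS = {
    "dtd_b": {"article": "paper", "sec": "section", "au": "author"},
    "dtd_c": {"article": "doc", "sec": "chapter"},  # no equivalent for "au"
}


def map_path(path, target_dtd):
    """Rewrite a simple path such as '//article//sec' for target_dtd.

    Returns None when any element name has no mapping, meaning the
    structural condition cannot be expressed against that DTD.
    """
    mapping = SCHEMA_MAPPINGS[target_dtd]
    # Splitting on '//' yields an empty first item, which is dropped.
    steps = [step for step in path.split("//") if step]
    mapped = []
    for step in steps:
        if step not in mapping:
            return None
        mapped.append(mapping[step])
    return "//" + "//".join(mapped)


if __name__ == "__main__":
    print(map_path("//article//sec", "dtd_b"))
    print(map_path("//article//au", "dtd_c"))
```

A table like this has to be written by hand for every pair of DTDs, which is why the text above calls for automatic methods once the number of DTDs grows.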

This track aims to answer, among others, the following research questions:

  1. For content-oriented queries, what methods are possible for determining which elements contain reasonable answers? Are pure statistical methods appropriate, or are ontology-based approaches also helpful?
  2. What methods can be used to map structural criteria onto other DTDs?
  3. Should mappings focus on element names only, or also deal with element content or semantics?
  4. How can suitable collections be preselected in order to improve retrieval efficiency without degrading retrieval effectiveness?
  5. What are appropriate evaluation criteria for heterogeneous collections?
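To make question 4 more concrete, here is a minimal Python sketch of one possible preselection heuristic. It is an assumption for illustration, not a method prescribed by the track: each collection is assumed to publish a small summary (its number of documents and per-term document frequencies), and the function names, collection names, and numbers are all invented.

```python
def score_collection(query_terms, doc_freq, num_docs):
    """Sum, over query terms, the fraction of documents containing the term."""
    if num_docs == 0:
        return 0.0
    return sum(doc_freq.get(term, 0) / num_docs for term in query_terms)


def preselect(query_terms, summaries, k=2):
    """Return the names of up to k collections with the highest nonzero score."""
    scored = [
        (score_collection(query_terms, s["doc_freq"], s["num_docs"]), name)
        for name, s in summaries.items()
    ]
    # Collections that match no query term are never contacted.
    ranked = sorted((item for item in scored if item[0] > 0), reverse=True)
    return [name for _, name in ranked[:k]]


# Invented collection summaries, used only to exercise the functions.
SUMMARIES = {
    "cs_papers": {"num_docs": 100, "doc_freq": {"xml": 40, "retrieval": 30}},
    "it_news": {"num_docs": 200, "doc_freq": {"xml": 10}},
    "cooking": {"num_docs": 50, "doc_freq": {"recipe": 45}},
}

if __name__ == "__main__":
    print(preselect(["xml", "retrieval"], SUMMARIES))
```

Only the selected collections would then be queried. Whether such a cut-off harms effectiveness is exactly what the question asks, since a skipped collection may still hold relevant elements.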

In 2004-2005, the heterogeneous track was mainly exploratory. This year we intend to expand both the number and the syntactic and semantic diversity of the collections used. The collections are based on different DTDs and deal with different topics (from computer science research to IT business to non-IT-related issues). The primary focus will still be on constructing an appropriate test collection and on appropriate tools for evaluating heterogeneous retrieval. Of equal importance is the exploration of the research questions outlined above.

The het track continues in 2007 with run submissions and evaluation. Please find topics, guidelines, collections and the run submission interface on the internal het track page.


Schedule (obsolete, from 2006)


May 28 - submission deadline for candidate topics
Jun 08 - distribution of final topics
Jul 30 - submission deadline for search results
Aug 20 - distribution of results to participants for relevance assessments
Oct 15 - deadline for relevance judgements
Oct 30 - distribution of relevance judgements and evaluation scores to participants
Nov 27 - submission of papers for the workshop pre-proceedings
Dec 08 - workshop pre-proceedings and workshop programme online
 

Organisers

Ingo Frommholz
University of Duisburg-Essen
Faculty of Engineering Sciences
Information Systems
Lotharstr. 65
D-47048 Duisburg
Germany
http://www.is.inf.uni-due.de/staff/ingo.html.en
Email: ingo.frommholz@uni-due.de
Phone: +49-203-379-3755
Fax: +49-203-379-2549


Ray Larson
School of Information Management and Systems
University of California, Berkeley
Berkeley, California 94720-4600
USA
http://www.sims.berkeley.edu/people/faculty/raylarson
Email: ray@sims.berkeley.edu
Phone: +1 (510)642-6046
Fax: +1 (510)642-5814