Initiative for the Evaluation of XML Retrieval

April 2003 - December 2003


(Last update: Jan 19, 2004)

Track Proposals at INEX 2004

Interactive Track

Proposed by: Birger Larsen, Saadia Malik and Tassos Tombros

Aim/Motivation

The motivation for the track is twofold: first, to investigate the behaviour of users when interacting with components of XML documents, and second, to investigate and develop approaches for XML retrieval that are effective in user-based environments.

In the first year, we plan to address the first issue: to investigate the behaviour of searchers when presented with components of XML documents that have a high probability of being relevant (as estimated by an XML-based IR system). At present, all metrics used to evaluate system effectiveness in the INEX ad hoc track rest on assumptions about user behaviour that have not been empirically validated. The track also aims to investigate those assumptions.

Effort involved for the participants, in particular with respect to the topic development and the relevance assessment task:

Document Collection

The collection of documents will be the same one used for the ad hoc retrieval task. No further relevance assessments are required on behalf of the participating sites.

Searchers

Each participating site will be responsible for recruiting test persons to participate in the study as searchers. A minimum number of test persons will be defined so that the obtained results can be statistically meaningful.

IR System

For the first year of the track, all participating sites will use the same system, which will be made available. The system will provide basic functionality, to be agreed upon with the participating sites. To enable efficient logging of searcher interaction, the system should preferably be installed locally at the participating sites. Additional systems may be employed locally if a participant wishes to compare them with the official system.

Topics

A number of the 2002/2003 (or even 2004) CO topics will be used in the study. The topics may have to be slightly expanded into what is known as simulated work task situations. Compared to the existing topics more context on the motives and background of the topic is provided in the simulated work tasks. Thereby the test persons can better place themselves in a situation where they would be motivated to search for information related to the work tasks. The aim is to enable the test persons to formulate and reformulate their own queries as realistically as possible in the interaction with the IR system.

Searcher Tasks

The test persons will need to identify documents which are useful/relevant for completing the requirements specified in the simulated work task. They can either identify these documents explicitly (e.g. by marking down a relevance score for each document) or implicitly (e.g. by saving or bookmarking useful documents). It may also be interesting to compare relevance assessments obtained from test persons to those obtained from the INEX assessors. A time limit will be set for each simulated work task.

Questionnaires

Three types of questionnaires will be used. The first type (pre-experiment questionnaire) will be used before the start of each experimental session to collect information about the test person's background, experience, etc. The second type (post-search questionnaire) will be used after a test person completes each work task and will collect data on the test person's perception of task completion, system satisfaction, etc. The last type (post-experiment questionnaire) will collect general information about the test persons' feelings about the tasks. A minimum number of questions for each type of questionnaire will be provided.

Data Analysis

Analysis of the collected data will be required in order to draw conclusions from the studies. The collected data will comprise, as a minimum, the questionnaires and the logs of searcher interaction with the system. The logs could consist of the queries issued, the components returned by the system, the components actually viewed, relevance assessments of these, any browsing behaviour, and time stamps for each act of interaction between the test person and the system.
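
As a minimal sketch of what one logged act of interaction could look like, assuming a flat, time-stamped event record written as one JSON line per event: the field names, the event types and the component path format below are illustrative assumptions, not part of the track specification.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical event types, derived from the kinds of data listed above.
EVENT_TYPES = {"query", "results", "view", "assess", "browse"}


@dataclass
class InteractionEvent:
    """One time-stamped act of interaction between a test person and the system."""
    timestamp: float                  # seconds since the start of the session
    searcher_id: str                  # anonymised test person identifier
    task_id: str                      # simulated work task being worked on
    event_type: str                   # one of EVENT_TYPES
    query: Optional[str] = None       # query text, for "query" events
    component: Optional[str] = None   # path of the XML component involved
    relevance: Optional[int] = None   # searcher's assessment, for "assess" events

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")


def to_log_line(event: InteractionEvent) -> str:
    """Serialise an event as one JSON line, omitting unused fields."""
    fields = {k: v for k, v in asdict(event).items() if v is not None}
    return json.dumps(fields, sort_keys=True)


def time_per_task(events):
    """Elapsed time per task: last timestamp minus first timestamp."""
    spans = {}
    for e in events:
        first, last = spans.get(e.task_id, (e.timestamp, e.timestamp))
        spans[e.task_id] = (min(first, e.timestamp), max(last, e.timestamp))
    return {task: last - first for task, (first, last) in spans.items()}
```

A record of this kind keeps every act of interaction comparable across sites, which matters if the logs from all participants are to be analysed together.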

All the above will need to be discussed in detail among the track participants.

Relevance feedback track at INEX 2004

Proposed by: Carolyn Crouch and Mounia Lalmas

Relevance Feedback (RF) has proven through the years to be a powerful tool for improving retrieval results. It improves results in two ways: by increasing the ranks of relevant documents and/or by introducing new, relevant documents into the retrieved set.

RF works as follows. Let us assume that a typical user is interested in or willing to examine a set of documents retrieved in response to his/her request, in the hope of improving the results returned in the next iteration. There are several points to note in this process:

  1. Most users are willing to examine only a few documents in this regard. The upper limit on the number of documents a user is willing to evaluate can reasonably be set at, say, 20, a limit for which there is precedent.
  2. In order to employ RF in this environment, we need access to ranked output and a window of reasonable size (e.g., 20).
  3. To effectively measure RF (and in accordance with 1, above), we need access to the relevance assessments within the window in question. The latter is an important issue since in the INEX environment, relevance assessments can be available for elements within the current window, but these assessments are only available AFTER results are returned by the participating groups when the new queries are processed.

So it would appear that in order to facilitate a RF track, we would need to do the following:

  1. Establish a window within which RF would be applied (say 20).
  2. Ask the participants who are able to retrieve ranked output to provide that output (i.e., the top 20 retrieved documents/elements for each query). Actually, the top 100 retrieved documents/elements for each query might be easier to obtain and at least as useful.
  3. Formulate a way to merge the results so that we are able to specify, for each query, a rank-ordered list of 1 to n top-ranked documents/elements. Here n must be greater than or equal to 20. (And again, up to 100 might be useful, or somewhere in between if the process involves much manual labour).
  4. After the relevance assessments are done by INEX, relevance assessments must be attached to the n top-ranked documents/elements. (It would seem that this would be done automatically by the current INEX process; we would just need to identify those top n documents/elements and their associated relevance assessments.)
  5. We need an evaluation package that allows us to evaluate retrieval results within the defined window (presumably the current INEX package could be modified without too much difficulty). P@20 would be a good measure.
  6. For RF evaluation, we need to compare the RF run to a baseline run. Results obtained through RF would be reported and compared with the results contributed by the other groups (again using the INEX evaluation package). Perhaps INEX would map all baseline and RF results to one graph, as it currently does for participant contributions.
  7. We need to decide on the appropriate methods of evaluation for RF experiments in general, e.g., residual collection. Perhaps other methods, such as rank-freezing, might be considered. (Although residual collection is probably the most generally used method at present, it requires a meaningful number of relevant documents/elements per query so that if say, 20 elements are discarded, improvement is still possible.) We might also decide to leave this decision to the individual participants.
  8. We may want to have two sub-tasks, one for CO and one for CAS queries. Reformulating CAS queries is a new challenge for RF!
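
The window-based evaluation in items 5 to 7 can be sketched as follows. This is a minimal illustration, not the INEX evaluation package: the function names are hypothetical, relevance is treated as binary, and the rank-freezing function is a simplified variant that freezes the whole judged window.

```python
def precision_at_k(ranking, relevant, k=20):
    """Fraction of the top-k items that are relevant (P@k)."""
    top = ranking[:k]
    return sum(1 for item in top if item in relevant) / k


def residual_ranking(ranking, seen):
    """Residual collection: drop every item the user has already judged."""
    return [item for item in ranking if item not in seen]


def residual_p_at_k(baseline, feedback, relevant, window=20, k=20):
    """Compare baseline and RF runs on the residual collection.

    Items in the baseline's feedback window are removed from both runs
    and from the relevant set before P@k is computed.
    """
    seen = set(baseline[:window])
    residual_relevant = set(relevant) - seen
    base = precision_at_k(residual_ranking(baseline, seen), residual_relevant, k)
    rf = precision_at_k(residual_ranking(feedback, seen), residual_relevant, k)
    return base, rf


def frozen_ranking(baseline, feedback, window=20):
    """Rank freezing (simplified): keep the judged window in place, then
    append the feedback run's remaining items in their new order."""
    frozen = baseline[:window]
    frozen_set = set(frozen)
    return frozen + [item for item in feedback if item not in frozen_set]
```

With a window of 20 and only a handful of relevant elements per query, the residual relevant set can become empty, which is the concern raised in item 7.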

    All the above will need to be discussed in detail if the track goes ahead. The main issue is the availability of the relevance assessments needed to perform the relevance feedback process.

    The ranked data used for RF could also be used for pseudo-feedback experiments.

    Natural Query Language Track

    Proposed by: Shlomo Geva and Tony Sahama

    The idea is to specify the information need in unrestricted English. The query may naturally include both structural references and content references. It may or may not make explicit use of XML tag names. The IR system needs to understand both the structural cues and the content cues that are present in the query. The system ought to operate on the input as is, and return a bounded amount of text in response to a query (a limit on volume, not on the number of elements). The result may consist of any set of XML elements in the collection. Assessments are made with respect to that entire set of elements. A result can be, say, a set of sections, paragraphs, figures, and bibliographic references from the entire INEX collection. The IR system constructs the result set autonomously (it is not part of the query).
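
The volume limit could be enforced along the following lines. This is only a sketch under stated assumptions: the limit is taken to be a character budget, elements are considered in rank order, and an element that does not fit is skipped rather than truncated. The proposal leaves all of these choices open, and the function name is hypothetical.

```python
def select_within_volume(ranked_elements, max_chars):
    """Take elements in rank order while the total text length stays within max_chars.

    ranked_elements: list of (element_path, text) pairs, best first.
    An element that would exceed the remaining budget is skipped, so a
    shorter, lower-ranked element can still be included.
    Returns the selected element paths and the number of characters used.
    """
    selected = []
    used = 0
    for path, text in ranked_elements:
        if used + len(text) <= max_chars:
            selected.append(path)
            used += len(text)
    return selected, used
```

Limiting volume rather than element count means a system that returns many short paragraphs and one that returns a few long sections are judged on comparable amounts of text.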

    We feel that this is distinctly different from non-XML tracks (say in TREC) in that with XML collections we can use structural cues in the query and we can use actual document structure in determining where in a document to look and what to return.

    Each element in a given submission result set is assessed individually (with respect to relevance and coverage - the existing interface is probably usable). An automated global score is generated for the submission as a whole (scoring to be determined). The assessor then *moderates* the overall score. After assessing all submissions on hand, the assessor *moderates the scores of all submissions* to reflect the overall satisfaction (which may be somewhat different from what the automated numbers suggest).

    Assessment load - for each topic, the topic author needs to assess a set of submissions. The number of submissions is equal to the number of participants in the track (or a multiple thereof).

    Each submission will be displayed as 3-4 pages of actual document material, as returned by the IR system. The assessor will not see the original documents from which the material was extracted (although that information will be available), so assessment is restricted to what is actually returned by the IR system and seen on the screen. There is no need to assess up or down.

    Double-blind assessment could be used if desired: the assessor need not know which submission is being assessed, and the participants need not know who the topic authors/assessors are. Participants do not assess their own system against their own topics.

    We feel that assessment can be done very effectively - much faster than is the case with the current track. Therefore, multiple topics per participant could be used to increase the total number of topics in the track.

    Topic development effort - not more than with existing track - probably less.

    A user interface will be required for assessment. We are prepared to develop it at QUT, but we need more information. For instance, access to the full set of documents (figures and all) may be required. We would like to see assessment done locally: the assessor will download the entire set of results for a topic, zipped, with each submission in a display format similar to that of the current assessment system. This will minimize on-line access requirements. Assessment scores will be uploaded when the entire set of submissions for a topic is done. It is not clear to us what is involved in integrating this into the existing INEX assessment system. Perhaps we can discuss this later if the track seems plausible.

    Proposal for an INEX track on Heterogeneous Collections

    Proposed by: Norbert Fuhr

    Motivation

    The current INEX collection is based on a single DTD. In practical environments, such a restriction will hold in rare cases only. Instead, most XML collections will comprise documents from different sources, and thus with different DTDs. Also, there will be distributed systems (federations or peer-to-peer systems), where each node manages a different type of collection. So a heterogeneous collection poses new challenges:

    1. For content-only queries, most current approaches use the DTD for defining elements that would form reasonable answers. In heterogeneous collections, DTD-independent methods should be developed.
    2. For CAS queries, there is the problem of mapping structural conditions from one DTD onto other (possibly unknown) DTDs. Methods from federated databases could be applied here, where schema mappings between the different DTDs are defined manually. However, for a larger number of DTDs, automatic methods must be developed, e.g. based on ontologies.

    The goal of an INEX track on heterogeneous collections would be to set up such a test collection, and investigate the new challenges posed by such a setting.
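
The manually defined schema mappings mentioned for CAS queries can be illustrated with a small sketch. It assumes a structural condition is a plain slash-separated path of element names; the target DTD name, the tag names and the mapping table are invented for illustration only.

```python
# Hypothetical manual mapping: target DTD -> {source tag: target tag}.
MAPPINGS = {
    "bibdb": {
        "article": "record",
        "atl": "title",
        "au": "author",
        "abs": "abstract",
    },
}


def map_path(path, target_dtd, mappings=MAPPINGS):
    """Translate each element name in a path to the target DTD.

    Returns None if the target DTD is unknown or if any element name
    has no counterpart, i.e. the condition cannot be expressed there.
    """
    table = mappings.get(target_dtd, {})
    mapped = []
    for name in path.strip("/").split("/"):
        if name not in table:
            return None
        mapped.append(table[name])
    return "/" + "/".join(mapped)
```

A table like this has to be written by hand for every pair of DTDs, which is why the proposal expects automatic methods to be needed once the number of DTDs grows.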

    This track aims to answer, among others, the following research questions:

    1. For CO queries, what methods are feasible for determining elements that would be reasonable answers? Are purely statistical methods appropriate, or might ontology-based approaches also be helpful?
    2. What methods can be used to map structural criteria onto other DTDs?
    3. Should mappings focus on element names only, or also deal with element content?
    4. What are appropriate evaluation criteria for heterogeneous collections?

    In the first year, the track would be mainly explorative. The focus should be on the construction of an appropriate test collection, and the elaboration of the research issues.

    Effort

    For setting up a heterogeneous collection track, the following tasks should be performed:

    1. Creation of a heterogeneous test collection.
    2. Retrieval experiments with a small number of both CO and CAS queries.
    3. Qualitative (rather than quantitative) analysis of the results.

    In the following, we discuss each of these issues in some more detail.

    Testbed creation

    Instead of creating a completely new testbed, we propose to modify and extend the current INEX collection in the following way:

    • Split up the IEEE CS collection by journal and by year, and regard each volume as a collection with a specific DTD. Since the current DTD is a mixture of the DTDs of the different journal volumes, we would reverse this process and derive the volume-specific DTDs as a subset of the complete DTD.
    • Add new collections that are related to computer science:
      • There are a number of Open Archives, and some of them use additional schemas besides Dublin Core. Those archives relevant to computer science could be included.
      • The HCI bibliography is a reference database with bibliographic data as well as abstracts.
      • There are a number of freely available bibliographic databases related to computer science (e.g. the Ley database http://dblp.uni-trier.de/, the Achilles bibliography http://liinwww.ira.uka.de/bibliography/index.html, Citeseer (?) http://citeseer.nj.nec.com/cs)
      • In addition, we will try to get access to some resources with restricted access (e.g. the CompuScience database http://www.zblmath.fiz-karlsruhe.de/COMP/quick.html)
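
Deriving a volume-specific DTD as a subset of the complete DTD could start from the set of element names actually used in each volume. The sketch below uses Python's standard xml.etree.ElementTree and assumes the documents are already grouped by (journal, year); the function name and the grouping key are assumptions. It only collects element names, so content models and attributes would still have to be pruned separately.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def elements_per_volume(documents):
    """Collect the element names used in each volume.

    documents: iterable of ((journal, year), xml_string) pairs.
    Returns {(journal, year): set of element names used in that volume}.
    """
    used = defaultdict(set)
    for volume, xml_text in documents:
        root = ET.fromstring(xml_text)
        # root.iter() walks the root element and all of its descendants.
        for element in root.iter():
            used[volume].add(element.tag)
    return dict(used)
```

Element declarations in the complete DTD whose names never appear in a volume's set are candidates for removal from that volume's DTD.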

    Retrieval experiments

    Since the heterogeneous collection is from the same application domain, the topics formulated for the standard INEX tasks can be used. The CAS queries will possibly have to be modified, e.g. reformulated in a collection-neutral way or as a (sub)collection-specific query (which should then be processed on other sub-collections as well).

    Evaluation

    In the first year of the track, no real quantitative evaluation should be attempted. Instead, track participants should analyse results in a qualitative way and start a discussion about possible quantitative evaluation criteria for the following years.
