Proposed by: Birger Larsen, Saadia Malik and Tassos Tombros
Aim/Motivation
The main motivation for the track is twofold. First, to
investigate the behaviour of users when interacting with
components of XML documents, and secondly to investigate and
develop approaches for XML retrieval which are effective in
user-based environments.
In the first year, we plan to address the first issue: to
investigate the behaviour of searchers when presented with
components of XML documents that have a high probability of being
relevant (as estimated by an XML-based IR system). Presently, all
metrics that are in use for the evaluation of system
effectiveness in the INEX ad hoc track are based on certain
assumptions of user behaviour which are not empirically
validated. The track also aims to investigate those assumptions.
Effort involved for the participants, in particular with respect
to the topic development and the relevance assessment task:
Document Collection
The collection of documents will be the same one used for the ad
hoc retrieval task. No further relevance assessments are required
on behalf of the participating sites.
Searchers
Each participating site will be responsible for recruiting test
persons to participate in the study as searchers. A minimum
number of test persons will be defined so that the obtained
results can be statistically meaningful.
IR System
For the first year of the track, all participating sites will use
the same system which will be made available. The system will
provide a basic functionality which will be agreed upon with the
participating sites. To enable efficient logging of searcher
interaction, the system would preferably be installed locally at
the participating sites. Additional systems may be employed
locally if a participant wishes to compare the official system to
these.
Topics
A number of the 2002/2003 (or even 2004) CO topics will be used
in the study. The topics may have to be slightly expanded into
what is known as simulated work task situations. Compared to the
existing topics more context on the motives and background of the
topic is provided in the simulated work tasks. Thereby the test
persons can better place themselves in a situation where they
would be motivated to search for information related to the work
tasks. The aim is to enable the test persons to formulate and
reformulate their own queries as realistically as possible in the
interaction with the IR system.
Searcher Tasks
The test persons will need to identify documents which are
useful/relevant for completing the requirements specified in the
simulated work task. They can either identify these documents
explicitly (e.g. by marking down a relevance score for each
document) or implicitly (e.g. by saving or bookmarking useful
documents). It may also be interesting to compare relevance
assessments obtained from test persons to those obtained from the
INEX assessors. A time limit will be set for each simulated work
task.
Questionnaires
Three types of questionnaires will be used. The first type will
be used before the start of each experimental session
(pre-experiment questionnaire), where we collect information
related to the test person's background, experience, etc. The
second type will be used after the end of each work task by a
test person (post-search questionnaire) and will collect data
relating to the test person's perception of task completion,
system satisfaction, etc. The last type (post-experiment
questionnaire) will collect general information about the test
person's feelings about the tasks. A minimum number of questions
for each type of questionnaire will be provided.
Data Analysis
Analysis of the collected data will be required in order to
extract conclusions from the studies. The collected data will
comprise, as a minimum, the questionnaires and the logging of
searcher interaction with the system. The logging could consist
of the queries issued, the components returned by the system, the
components actually viewed, relevance assessments of these, any
browsing behaviour, as well as time stamps for each act of
interaction between the test person and the system.
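As a minimal sketch of what one logged act of interaction might look like, the record below captures the items listed above (queries, returned and viewed components, relevance assessments, time stamps). The `InteractionEvent` class, its field names and the `log_event` helper are hypothetical; the track has not yet defined a log format.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Optional


@dataclass
class InteractionEvent:
    """One act of interaction between a test person and the system
    (hypothetical schema, for illustration only)."""
    searcher_id: str
    task_id: str
    action: str                                    # e.g. "query", "view", "assess", "browse"
    query: Optional[str] = None                    # query text, for "query" events
    returned: list = field(default_factory=list)   # components returned by the system
    component: Optional[str] = None                # component viewed or assessed
    relevance: Optional[int] = None                # searcher's relevance score, if given
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise the event as one JSON line."""
        return json.dumps(asdict(self))


def log_event(log: list, event: InteractionEvent) -> None:
    """Append the serialised event to an in-memory log."""
    log.append(event.to_json())


session_log = []
log_event(session_log, InteractionEvent(
    "s01", "task1", "query",
    query="xml retrieval", returned=["/article[1]/sec[2]"],
))
log_event(session_log, InteractionEvent(
    "s01", "task1", "assess",
    component="/article[1]/sec[2]", relevance=2,
))
```

Writing one self-describing record per event keeps the logs from different sites easy to merge for joint analysis.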
All the above will need to be discussed in detail among the track
participants.
Proposed by: Carolyn Crouch and Mounia Lalmas
Relevance Feedback (RF) has proven through the years to be a powerful
tool for improving retrieval results. It improves results in two ways:
by increasing the ranks of relevant documents and/or by introducing
new, relevant documents into the retrieved set.
RF works as follows. Let us assume that a typical user is interested
in or willing to examine a set of documents retrieved in response to
his/her request, in the hope of improving the results returned in the
next iteration. There are several points to note in this process:
- Most users are willing to examine only a few documents in this
regard. The upper limit on the number of documents a user is willing
to evaluate can reasonably be set at, say, 20. And there is precedent
for this limit.
- In order to employ RF in this environment, we need access to ranked
output and a window of reasonable size (e.g., 20).
- To effectively measure RF (and in accordance with the first point above), we need
access to the relevance assessments within the window in question. The
latter is an important issue since in the INEX environment, relevance
assessments can be available for elements within the current window,
but these assessments are only available AFTER results are returned by
the participating groups when the new queries are processed.
So it would appear that in order to facilitate a RF track, we would
need to do the following:
- Establish a window within which RF would be applied (say 20).
- Ask the participants who are able to retrieve ranked output to
provide that output (i.e., the top 20 retrieved documents/elements for
each query). Actually, the top 100 retrieved documents/elements for
each query might be easier to obtain and at least as useful.
- Formulate a way to merge the results so that we are able to specify,
for each query, a rank-ordered list of 1 to n top-ranked
documents/elements. Here n must be greater than or equal to 20. (And
again, up to 100 might be useful, or somewhere in between if the
process involves much manual labour).
- After the relevance assessments are done by INEX, relevance
assessments must be attached to the n top-ranked
documents/elements. (It would seem that this would be done
automatically by the current INEX process; we would just need to
identify those top n documents/elements and their associated relevance
assessments.)
- We need an evaluation package that allows us to evaluate retrieval
results within the defined window (presumably the current INEX package
could be modified without too much difficulty). P@20 would be a good
measure.
- For RF evaluation, we need to compare the RF run to a baseline
run. Results obtained through RF would be reported and compared with
the results contributed by the other groups (using again the INEX
evaluation package). Perhaps INEX would map all baseline and RF
results to one graph, as it currently does for participant
contributions.
- We need to decide on the appropriate methods of evaluation for RF
experiments in general, e.g., residual collection. Perhaps other
methods, such as rank-freezing, might be considered. (Although
residual collection is probably the most generally used method at
present, it requires a meaningful number of relevant
documents/elements per query so that if say, 20 elements are
discarded, improvement is still possible.) We might also decide to
leave this decision to the individual participants.
- We may want to have two sub-tasks, one for CO and one for CAS
queries. Reformulating a CAS query is a new challenge for RF.
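The windowed evaluation discussed in the list above can be sketched as follows. This is an illustration under stated assumptions, not the INEX evaluation package: the function names are hypothetical, items are plain identifiers, relevance is binary, and `frozen_ranking` implements only one common reading of rank freezing (the whole judged window is kept in place).

```python
def precision_at_k(ranking, relevant, k=20):
    """Fraction of the top-k ranked items that are relevant (P@k)."""
    top = ranking[:k]
    return sum(1 for item in top if item in relevant) / k


def residual_ranking(ranking, seen):
    """Residual collection: drop every item the user has already judged."""
    return [item for item in ranking if item not in seen]


def frozen_ranking(baseline, feedback, window=20):
    """Rank freezing: keep the judged top-`window` baseline items in place,
    then append the feedback run's remaining items in their new order."""
    frozen = baseline[:window]
    frozen_set = set(frozen)
    return frozen + [item for item in feedback if item not in frozen_set]


# Toy example with a feedback window of 2 instead of 20.
baseline = ["a", "b", "c", "d", "e", "f"]
feedback = ["c", "a", "g", "e", "h", "b"]
relevant = {"a", "e", "g", "h"}
seen = set(baseline[:2])  # items judged by the user in the first iteration

p_base_residual = precision_at_k(residual_ranking(baseline, seen), relevant, k=2)
p_rf_residual = precision_at_k(residual_ranking(feedback, seen), relevant, k=2)
p_rf_frozen = precision_at_k(frozen_ranking(baseline, feedback, window=2), relevant, k=4)
```

With residual collection, both runs are scored only on items the user has not yet seen, so the RF run gets no credit for simply re-ranking documents that were already judged relevant.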
All the above will need to be discussed in detail, if the track goes
ahead. The main issue is the availability of the relevance assessments
needed to perform the relevance feedback process.
The ranked data used for RF could also be used for pseudo-feedback
experiments.
Proposed by: Shlomo Geva and Tony Sahama
The idea is to specify the information need in unrestricted
English. The query may naturally include both structural
references and content references. It may or may not make
explicit use of XML tag names. The IR system needs to understand both
structural cues and content cues that are present in the query. The
system ought to operate on the input as is, and return a bounded
amount of text in response to a query (a limit on volume, not on the
number of elements). The result may consist of any set of XML
elements in the collection. Assessments are made with respect to
that entire set of elements. A result can be, say, a set
of sections, paragraphs, figures, and bibliographic references from the
entire INEX collection. The IR system constructs the result set
autonomously (it is not part of the query).
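One way to picture a volume limit, as opposed to a limit on the number of elements, is the greedy sketch below. It assumes that volume is measured in characters and that elements which do not fit are skipped; both choices, like the function name `select_within_volume`, are illustrative assumptions rather than part of the proposal.

```python
def select_within_volume(ranked_elements, max_chars):
    """Walk the ranking in order and keep each element whose text still fits
    in the remaining character budget; elements that do not fit are skipped.

    ranked_elements: list of (element_path, text) pairs, best first.
    Returns (selected_paths, characters_used).
    """
    selected, used = [], 0
    for path, text in ranked_elements:
        if used + len(text) <= max_chars:
            selected.append(path)
            used += len(text)
    return selected, used


ranked = [
    ("/article[1]/sec[1]", "x" * 50),
    ("/article[1]/fig[2]", "y" * 70),   # too large for the remaining budget
    ("/article[7]/p[3]", "z" * 30),
]
result, volume = select_within_volume(ranked, max_chars=100)
```

A system could equally stop at the first element that overflows the budget; the proposal leaves such policies to the participating IR systems.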
We feel that this is distinctly different from non-XML tracks (say in
TREC) in that with XML collections we can use structural cues in the
query and we can use actual document structure in determining where in
a document to look and what to return.
Each element in a given submission result set is assessed individually
(with respect to relevance and coverage - the existing interface is
probably useable). An automated global score is generated for the
submission as a whole (scoring to be determined). The assessor then
*moderates* the overall score. After assessing all submissions on hand,
the assessor then *moderates the scores of all submissions* to
reflect the overall satisfaction (which may be somewhat different from
what the automated numbers suggest).
Assessment load - for each topic, the topic author needs to assess a
set of submissions. The number of submissions is equal to the number of
participants in the track (or a multiple thereof).
Each submission will be displayed as 3-4 pages of actual document
material - as returned by the IR system. The assessor will not see
the original documents from which it was extracted (although that
information will be available) so assessment is restricted to what is
actually returned by the IR system and seen on the screen. No need to
assess up or down.
Doubly blind assessment could be used if desired: the assessor need
not know which submission is being assessed. The participants need
not know who the topic authors/assessors are. Participants do not assess
their own system against their own topics.
We feel that assessment can be done very effectively - much faster
than is the case with the current track. Therefore, multiple topics per
participant could be used to increase the total number of topics in
the track.
Topic development effort - not more than with existing track -
probably less.
A user interface will be required for assessment. We are prepared to
develop it at QUT, but we need more information. For instance, access
to the full set of documents (figures and all) may be required. We
would like to see assessment done locally: the assessor will download
the entire set of results for a topic, zipped, where each submission
is in a display format similar to that of the current assessment system. This will
minimize on-line access requirements. Assessment scores will be
uploaded when the entire set of submissions for a topic is done.
It is not clear to us what is involved with integrating this into the
existing INEX assessment system. Perhaps we can discuss this later if
the track seems plausible.
Proposed by: Norbert Fuhr
Motivation
The current INEX collection is based on a single DTD. In practical
environments, such a restriction will hold in rare cases only. Instead, most XML
collections will comprise documents from different sources, and thus with different
DTDs. Also, there will be distributed systems (federations or peer-to-peer systems), where each
node manages a different type of collection.
So a heterogeneous collection poses new challenges:
- For content-only queries, most current approaches use the DTD
for defining elements that would form reasonable answers. In
heterogeneous collections, DTD-independent methods should be developed.
- For CAS queries, there is the problem of mapping structural
conditions from one DTD onto other (possibly unknown) DTDs. Methods from
federated databases could be applied here, where schema mappings between the
different DTDs are defined manually. However, for a larger number of DTDs,
automatic methods must be developed, e.g. based on ontologies. The goal of an INEX
track on heterogeneous collections would be to set up such a test collection,
and investigate the new challenges posed by such a setting.
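The manually defined schema mappings mentioned above can be pictured with a small sketch. The DTD names, the element names and the `map_path` function are all hypothetical, and only simple slash-separated element paths are handled; real CAS queries would need a proper query parser.

```python
# Hypothetical hand-written mappings from a source DTD's element names
# to the equivalent names in two target DTDs.
MAPPINGS = {
    "dtd_b": {"article": "paper", "sec": "section", "au": "author"},
    "dtd_c": {"article": "doc", "sec": "div"},
}


def map_path(path, target_dtd, mappings=MAPPINGS):
    """Rewrite each step of a simple '/'-separated element path.

    Returns None if any step has no counterpart in the target DTD,
    i.e. the structural condition cannot be expressed there.
    """
    table = mappings[target_dtd]
    steps = []
    for step in path.strip("/").split("/"):
        if step not in table:
            return None
        steps.append(table[step])
    return "/" + "/".join(steps)


mapped_b = map_path("/article/sec", "dtd_b")
mapped_c = map_path("/article/au", "dtd_c")
```

Maintaining one such table per pair of DTDs is what becomes impractical as the number of DTDs grows, which motivates the automatic methods called for above.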
This track aims to answer, among others, the following research
questions:
- For CO queries, what methods are feasible for determining
elements that would be reasonable answers? Are pure statistical methods
appropriate, or may ontology-based approaches also be helpful?
- What methods can be used to map structural criteria onto other
DTDs?
- Should mappings focus on element names only, or also deal with
element content?
- What are appropriate evaluation criteria for heterogeneous
collections?
In the first year, the track would be mainly explorative. The
focus should be on the construction of an appropriate test collection, and the elaboration of
the research issues.
Effort
For setting up a heterogeneous collection track, the following tasks
should be performed:
- Creation of a heterogeneous test collection.
- Retrieval experiments with a small number of both CO and CAS
queries.
- Qualitative (rather than quantitative) analysis of the
results.
In the following, we discuss each of these issues in some more
detail.
Testbed creation
Instead of creating a completely new testbed, we propose to
modify and extend the current INEX collection in the following way:
- Split up the IEEE CS collection by journal and by year, and
regard each volume as a collection with a specific DTD. Since the
current DTD is a mixture of the DTDs of the different journal volumes, we would reverse this
process and derive the volume-specific DTDs as a subset of the complete DTD.
- Add new collections that are related to computer science:
  - There are a number of Open Archives, and some of them use
    additional schemas besides Dublin Core. Those archives relevant to
    computer science could be included.
  - The HCI bibliography is a reference database with
    bibliographic data as well as abstracts.
  - There are a number of freely available bibliographic databases
    related to computer science (e.g. the Ley database
    http://dblp.uni-trier.de/, the Achilles bibliography
    http://liinwww.ira.uka.de/bibliography/index.html, Citeseer (?)
    http://citeseer.nj.nec.com/cs).
  - In addition, we will try to get access to some resources
    with restricted access (e.g. the CompuScience database
    http://www.zblmath.fiz-karlsruhe.de/COMP/quick.html).
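For the first step above (splitting the collection by volume and deriving volume-specific DTDs), a starting point could be to collect the element names that actually occur in each volume. The sketch below does only that; the volume identifiers and sample documents are made up, and it does not derive content models or attribute declarations.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def elements_by_volume(documents):
    """Collect the element names used in each volume.

    documents: iterable of (volume_id, xml_string) pairs.
    Returns {volume_id: set of element names occurring in that volume}.
    """
    used = defaultdict(set)
    for volume_id, xml_text in documents:
        root = ET.fromstring(xml_text)
        for element in root.iter():   # root and all of its descendants
            used[volume_id].add(element.tag)
    return dict(used)


# Made-up sample documents for two hypothetical volumes.
docs = [
    ("tc/1995", "<article><fm><ti>A</ti></fm><bdy><sec><p>x</p></sec></bdy></article>"),
    ("tc/2001", "<article><bdy><sec><ss1><p>y</p></ss1></sec></bdy></article>"),
]
tags = elements_by_volume(docs)
```

A volume-specific DTD could then keep only the declarations of the complete DTD whose element names appear in that volume's set.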
Retrieval experiments
Since the heterogeneous collection is from the same application
domain, the topics formulated for the standard INEX tasks can be used. The CAS queries will
possibly have to be modified, e.g. in a collection-neutral way or as a
(sub)collection-specific query (which then should be processed on other sub-collections as well).
Evaluation
In the first year of the track, no real quantitative evaluation should
be attempted. Instead, track participants should analyse results in a qualitative
way and start a discussion about possible quantitative evaluation criteria for the following years.