Book Search track

Introduction

Searching for information in a collection of books is seen as one of the natural application areas of XML retrieval (and more generally, of structured document retrieval) approaches, where a clear benefit to users is to gain direct access to parts of books relevant to their information need.

The ultimate goal of the track is to investigate book-specific relevance ranking strategies, UI issues and user behaviour, exploiting special features, such as back-of-book indexes provided by authors, and linking to associated metadata like catalogue information from libraries. However, searching over a large collection of books comes with many new challenges that need to be addressed first. For example, proper infrastructure has to be developed to allow for the scalable storage, indexing and retrieval of the content. In its first year, the track will explore these issues with the aim of providing a set of recommendations for setting up such an infrastructure. The track will also aim to run a task similar to that of INEX's ad hoc track, where participants can evaluate their relevance ranking strategies.

Participation

The minimum requirement to participate in the track is to provide relevance judgements for two topics. Registration for the track is open until the end of the relevance judgement collection phase (see the schedule below). Please register by filling in this form.

Schedule

Last updated: 26 November 2007


July 23: Distribution of candidate queries and instructions on topic creation.
Oct 5: Submission deadline of candidate topics.
Oct 10: Distribution of final set of topics.
Nov 26: Submission of papers for the workshop pre-proceedings.
Dec 3: Submission deadline of search results.
Dec 7: Workshop pre-proceedings and workshop programme online.
Dec 10: Relevance judgements start.
Dec 17-19: Workshop at Schloss Dagstuhl (http://www.dagstuhl.de/).
Jan 7: Submission deadline for relevance judgements.
Jan 14: Distribution of complete books test collection and evaluation scores.

The Books Corpus

The corpus is provided by Microsoft Live Book Search and the Internet Archive (for non-commercial purposes only). It consists of 42,049 digitized out-of-copyright books (210 GB). The OCR content of each book is stored in djvu.xml format. Each book also has an associated metadata file (*.mrc), which contains publication information (author, title, etc.) and classification information in MAchine-Readable Cataloging (MARC) record format.
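As a rough illustration of how the metadata files might be read, the sketch below parses one binary MARC record using only the Python standard library. It assumes that a *.mrc file holds a record in the usual MARC transmission layout (a 24-byte leader, a directory of 12-byte entries, then the field data) and that the text decodes as UTF-8 or ASCII; neither has been checked against the corpus. The sample record, the helper names and the author and title values are invented for the demonstration. For real work, a maintained library such as pymarc is likely the better choice.

```python
FIELD_END = b"\x1e"    # field terminator
RECORD_END = b"\x1d"   # record terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def parse_record(raw):
    """Return a list of (tag, raw field bytes) for one MARC record."""
    base = int(raw[12:17].decode("ascii"))   # leader bytes 12-16: base address of data
    directory = raw[24:base - 1]             # 12-byte entries after the 24-byte leader
    fields = []
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12].decode("ascii")
        tag, length, start = entry[:3], int(entry[3:7]), int(entry[7:12])
        begin = base + start
        fields.append((tag, raw[begin:begin + length - 1]))  # drop the terminator
    return fields


def subfields(data):
    """Split a data field (tags 010 and above) into (code, value) pairs."""
    pairs = []
    for chunk in data.split(SUBFIELD)[1:]:   # element 0 holds the two indicators
        text = chunk.decode("utf-8", errors="replace")  # assumption: UTF-8 or ASCII
        pairs.append((text[:1], text[1:]))
    return pairs


def first_subfield(fields, tag, code):
    """Return the first value of subfield `code` in field `tag`, or None."""
    for field_tag, data in fields:
        if field_tag == tag:
            for sub_code, value in subfields(data):
                if sub_code == code:
                    return value
    return None


def build_record(fields):
    """Assemble a minimal record for the demo (hypothetical helper)."""
    directory, body = b"", b""
    for tag, data in fields:
        chunk = data + FIELD_END
        directory += f"{tag}{len(chunk):04d}{len(body):05d}".encode("ascii")
        body += chunk
    base = 24 + len(directory) + 1
    total = base + len(body) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + body + RECORD_END


# Invented sample: field 100 is the main author entry, 245 the title statement.
demo = build_record([
    ("100", b"1 " + SUBFIELD + b"aDoe, Jane"),
    ("245", b"10" + SUBFIELD + b"aAn example title"),
])
fields = parse_record(demo)
author = first_subfield(fields, "100", "a")
title = first_subfield(fields, "245", "a")
print(author, "|", title)
```

To try this on a real metadata file, the bytes read from the file in binary mode would be passed to parse_record in place of the invented record.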

The basic XML structure of a book (djvu.xml) is as follows:

<DjVuXML>
<BODY>
<OBJECT data="file.." ...>
<PARAM name="PAGE" value=".."/>
[...]
<REGION>
<PARAGRAPH>
<LINE>
<WORD coords="..."/>
<WORD coords="..."/>
</LINE>
</PARAGRAPH>
</REGION>
[...]
</OBJECT>
[...]

Essentially, each <OBJECT> corresponds to a page. A page counter is embedded in the @value attribute of the <PARAM> element whose @name attribute is "PAGE". The actual page numbers (as printed inside the book) can sometimes, but not always, be found in the header or footer of a page. Note, however, that headers and footers are not explicitly recognised in the OCR: the first paragraph on a page could be a header, and the last one or more paragraphs on a page could be part of a footer. Depending on the book, headers may include chapter titles and page numbers (although, due to OCR errors, the page number is not always present).
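The page structure described above can be walked with a few lines of Python. The following is a minimal sketch, not a reference implementation: it assumes that the recognised text is the text content of each <WORD> element (the skeleton above shows the elements without content), and the sample document, its attribute values and the function names are invented for illustration. The printed page number is guessed with a simple heuristic that looks for a purely numeric token in the first or last paragraph of a page, since headers and footers are not marked up.

```python
import xml.etree.ElementTree as ET

# Hand-made sample following the skeleton above (not taken from the corpus).
# Assumption: the recognised text is the text content of each <WORD> element.
SAMPLE = """<DjVuXML>
<BODY>
<OBJECT data="file://sample.djvu">
<PARAM name="PAGE" value="sample_0001.djvu"/>
<REGION>
<PARAGRAPH>
<LINE><WORD coords="1,2,3,4">CHAPTER</WORD> <WORD coords="5,6,7,8">ONE</WORD> <WORD coords="9,10,11,12">17</WORD></LINE>
</PARAGRAPH>
<PARAGRAPH>
<LINE><WORD coords="1,2,3,4">It</WORD> <WORD coords="5,6,7,8">was</WORD></LINE>
<LINE><WORD coords="1,2,3,4">late.</WORD></LINE>
</PARAGRAPH>
</REGION>
</OBJECT>
</BODY>
</DjVuXML>"""


def paragraph_text(paragraph):
    """Join the non-empty words of one <PARAGRAPH> into a single string."""
    words = ((w.text or "").strip() for w in paragraph.iter("WORD"))
    return " ".join(word for word in words if word)


def iter_pages(root):
    """Yield (page_counter, [paragraph strings]) for each <OBJECT>, i.e. each page."""
    for obj in root.iter("OBJECT"):
        counter = None
        for param in obj.findall("PARAM"):
            if param.get("name") == "PAGE":
                counter = param.get("value")
        yield counter, [paragraph_text(p) for p in obj.iter("PARAGRAPH")]


def guess_printed_page_number(paragraphs):
    """Heuristic: first purely numeric token in the first or last paragraph."""
    if not paragraphs:
        return None
    candidates = [paragraphs[0]]
    if len(paragraphs) > 1:
        candidates.append(paragraphs[-1])
    for para in candidates:
        for token in para.split():
            if token.isdigit():
                return int(token)
    return None


pages = list(iter_pages(ET.fromstring(SAMPLE)))
for counter, paragraphs in pages:
    print(counter, guess_printed_page_number(paragraphs), paragraphs)
```

The heuristic will misfire on pages whose first or last paragraph contains other numbers (dates, for example), and it returns nothing when OCR errors have corrupted the page number, so its output is best treated as a hint. For very large books, ET.iterparse could be used instead of loading the whole document into memory.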

Accessing the corpus

Note that the collection is strictly for research (non-commercial) purposes.

Access to the book corpus will be given to organizations that participate in the Book Search track and accept the terms of the License Agreement. Please complete and sign the License Agreement and send it by post to Microsoft Research Cambridge (full address details are given inside the License document).

Please make sure that you include a cover letter with details of the main contact person to whom information on how to download the collection will be communicated (via email).

The collection can be downloaded (53 GB compressed) or requested on a USB hard disc (at a cost of approximately £150 plus shipping). Requests for the latter option should be sent to gabkaz@microsoft.com.

Organisers

Gabriella Kazai
Microsoft Research Limited
7 J J Thomson Avenue
Cambridge, CB3 0FB, United Kingdom
Email: gabkaz@microsoft.com
Phone: +44 (0)1223 479 700
Direct Dial: +44 (0)1223 479 755
Fax: +44 (0)1223 479 999

Antoine Doucet
University of Caen
GREYC - Campus 2
F-14032 CAEN Cedex, France
Phone: +33 2 31 56 73 98
Fax: +33 2 31 56 73 30