Continuous Access To Cultural Heritage

Research themes

The challenges to the CATCH programme are: (1) to achieve multidisciplinary collaboration between the cultural heritage field and IT research, (2) to make excellent research contributions, and (3) to produce intelligent and personalised tools. This will ensure that the IT research contributes to knowledge enrichment within the cultural heritage domain.

The CATCH strategy centres on three issues relevant to cultural heritage: how to link different collections, how to add information to data automatically or semi-automatically, and how to provide information in a user-friendly way. These issues translate into the following research themes:

  1. Semantic interoperability through metadata
  2. Knowledge enrichment through automated analyses
  3. Personalisation through presentation


  

Ad 1: Semantic interoperability through metadata

Situation in cultural heritage

From the start, the cultural heritage institutes have used registration systems to add metadata to their collections. However, each of the highly autonomous institutes has done so in its own way. Only recently have the institutes become more aware of the need for standards in the structure of the descriptions, the conventions within the descriptions, and the terminological sources. Nowadays, the sheer volume of heritage sources, their great diversity, the number of different registration systems in use, and the ever-evolving wishes of the users make it impossible to provide the “Dutch Heritage Collection” with unambiguous metadata through intellectual human labour alone. The challenge is to achieve the desired situation by combining intelligent IT applications and human expertise.

Hence, cultural heritage may turn to information technology with a clear technology demand for tools and methods (1) to combine and enrich the already registered data and knowledge, (2) to document sources automatically or semi-automatically, and (3) to supply them with the necessary metadata. The (semi-)automatic generation of metadata is an essential prerequisite for the semantic interoperability of the collections. Metadata not only makes sure that a person can find a specific collection or object, it also enables bulk retrieval of digital objects that are related to each other (e.g., created by the same artist, about the same topic, from the same period, from the same geographic location, etc.). Here we reiterate that the creation of such metadata usually requires a considerable intellectual input of curators and others involved in digital heritage collections. Information technology may offer opportunities for semantic interoperability between digital collections and their metadata on a large scale, which could not be achieved by human input alone. Finally, it is remarked that the creation of a Semantic Web can only be achieved by extensive IT research on semantic interoperability.
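
The bulk retrieval of related objects mentioned above can be illustrated with a minimal sketch. It assumes a flat list of records with hypothetical field names ("creator", "period", "place") and invented values; real registration systems will differ.

```python
from collections import defaultdict

# Hypothetical records; field names and values are illustrative only.
RECORDS = [
    {"id": "obj-1", "creator": "Artist A", "period": "17th century", "place": "Amsterdam"},
    {"id": "obj-2", "creator": "Artist A", "period": "17th century", "place": "Leiden"},
    {"id": "obj-3", "creator": "Artist B", "period": "19th century", "place": "Amsterdam"},
]


def build_index(records, fields):
    """Map each (field, value) pair to the ids of the objects that carry it."""
    index = defaultdict(set)
    for record in records:
        for field in fields:
            value = record.get(field)
            if value is not None:  # missing metadata is simply not indexed
                index[(field, value)].add(record["id"])
    return index


def related_objects(index, records, object_id, field):
    """Return ids of all other objects sharing `field` with `object_id`."""
    record = next(r for r in records if r["id"] == object_id)
    value = record.get(field)
    if value is None:
        return set()
    return index[(field, value)] - {object_id}


INDEX = build_index(RECORDS, ["creator", "period", "place"])
print(related_objects(INDEX, RECORDS, "obj-1", "creator"))
```

The sketch also shows why missing values matter: an object without a "creator" value can never be found through that relation, however rich the rest of the collection is.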

Research topics

The leading question is: How can semantic metadata be created through automatic metadata generation? An obvious research agenda reads: (1) by deriving metadata from other collections, and (2) by using ontologies to add elements to metadata corpora that guarantee 'semantic cohesion' between collections and items. Although the main goal is to provide methods and tools that can be used in the “back office” to create semantically rich metadata, there are two more questions, viz. on the speed of the project execution, and on the open structure of the solutions. The tools should minimize the amount of user effort required for creating and maintaining semantic annotations and should help to increase the overall quality level of annotations.

Research will focus on methods and tools for harmonizing ontologies through semantic links between metadata corpora. This research challenge is similar to what is called the “ontology mapping” problem. Research issues with respect to ontology mapping include the following five topics.

  • Inventory of (the composition of) ontologies and vocabularies that are of potential use for cultural heritage applications.
  • Types of mapping relations: e.g., equality, equivalence, subclass, instance.
  • Methods for representation of mapping relations: e.g., how to add mappings without affecting the original metadata vocabularies.
  • Semi-automatic learning of mapping relations; techniques such as emergent semantics (learning semantic relations from user behaviour) may be relevant here.
  • Methods for combining metadata with full text documents within a single query.
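
One way to add mappings without affecting the original vocabularies is to keep the mapping relations in a separate store. The following is a minimal sketch of that idea, not a proposal for an actual CATCH design; the class names, vocabularies and term identifiers are all hypothetical.

```python
from dataclasses import dataclass

# Relation types named in the list above.
RELATION_TYPES = {"equality", "equivalence", "subclass", "instance"}


@dataclass(frozen=True)
class Mapping:
    source: str    # term identifier in the source vocabulary
    relation: str  # one of RELATION_TYPES
    target: str    # term identifier in the target vocabulary


class MappingStore:
    """Holds links between vocabularies without modifying the vocabularies."""

    def __init__(self):
        self._mappings = set()

    def add(self, source, relation, target):
        if relation not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation}")
        self._mappings.add(Mapping(source, relation, target))

    def targets(self, source, relation=None):
        """Return target terms linked from `source`, optionally by relation."""
        return {
            m.target
            for m in self._mappings
            if m.source == source and (relation is None or m.relation == relation)
        }


# Hypothetical vocabularies and term identifiers, for illustration only.
museum_vocab = frozenset({"museum:painting", "museum:etching"})
archive_vocab = frozenset({"archive:artwork", "archive:print"})

store = MappingStore()
store.add("museum:painting", "subclass", "archive:artwork")
store.add("museum:etching", "equivalence", "archive:print")
print(store.targets("museum:painting"))
```

Because the mappings are directional (source to target), a lookup from "archive:artwork" returns nothing unless a reverse mapping is added explicitly. Both vocabularies remain unchanged throughout.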

Background

To understand the research question and the research topics in more depth, we provide some background. The first two bullets underline the importance of metadata once more. Bullets three to five emphasize the various difficulties with semantics.

  • Metadata can refer to various kinds of data types. It turns out that the limited and well-defined semantic scope of keyword-type metadata (like IMDI) can be seen as the backbone for collection maintenance and discovery.
  • Keyword-type metadata is also one of the keys to interoperability, owing to its broad usage (the community has agreed on elements and uses the same concepts) and its well-defined, limited semantics.
  • Achieving semantic interoperability is a hard process in which the goals have to be clear. Experience shows that most relationships between the elements of two disciplines can only be expressed with the help of a fuzzy type such as “mapsTo”. Frameworks such as RDF(S) and OWL do not include such a relation type for good reasons. In practice, the “mapsTo” relation is used as a one-directional equality with some further necessary restrictions.
  • The limited semantics of keyword-type metadata, and the fact that metadata creation is an expensive endeavour that leads to missing values, make it necessary to use all types of contextual information (within metadata hierarchies/environments and outside) to enrich the metadata and to add it to the discovery domain. Both topics are completely new and not sorted out very well. Research has to be done to understand what is possible and how the quality of the metadata will be influenced. It also has to be understood how metadata and context information can be combined to increase the chance of discovery.
  • Semantic annotation has to rely on well-defined domain knowledge to form a coherent discovery space. Therefore, the concepts to be used should be taken from open data category registries (DCRs). If a new concept is introduced because the existing ones are semantically insufficient, then the person intending to use it has the duty to enter it into the data category repository, i.e., to define it properly and, where possible, to define its relationships with other existing concepts. The DCRs are essential to avoid a proliferation of concepts, which would reduce their relevance for the discovery space and for achieving interoperability.


Ad 2: Knowledge enrichment through automated analyses

Situation in cultural heritage

Collection management and research in the cultural heritage field centre on content, i.e., the meaning of texts, objects, images and their mutual relations. For unanalysed objects, this information is hidden and implicit. The goal of knowledge enrichment is to make this implicit information explicitly available. CATCH aims to develop knowledge and to demonstrate its applicability in automated knowledge enrichment tools. One group of tools aims to support experts. Another group of tools enables fully automated analyses.

There are two dimensions in these two groups of tools. First, tools can be used to assist experts, or they can perform fully automatically. Second, tools can follow existing annotation schemes, or they can discover new structures within, and relations between, objects. Knowledge enrichment can be applied to any of the media types which are covered by CATCH: text, images, handwritten documents, archaeological objects, etc.

Both groups of tools aim to alleviate the following problems occurring in the daily work of collection managers, and in the quality of many existing databases, respectively.

  • Cultural heritage experts (collection managers and researchers) have used and developed content annotation schemes and classifications, laid down in thesauri, reference lists, and topic maps. Their ability to apply these schemes and classifications to new data is limited only by time and scale. Knowledge enrichment techniques can alleviate the time and scale bottlenecks by adding machine power to manpower, emulating how experts annotate data. Once such techniques have learned from examples to emulate experts, they can annotate (classify, analyse, relate) very large amounts of new data themselves, in a fraction of the time.
  • Existing databases of objects that are partially or inconsistently marked up with legacy classification systems can automatically be made more consistent with knowledge enrichment techniques. Insofar as they are partially or largely unannotated, disorganized, and unlinked, they can be automatically annotated, organized and linked semantically.

Research topics

The leading question is: How can we arrive at the automatic enrichment of cultural heritage data? We know that the current state of affairs asks for (1) tools to support experts in their manual enrichment work, to alleviate time and scale bottlenecks, and (2) tools for automatic data enrichment, particularly for making existing data cleaner and more consistent, and for discovering new structures and relations in data.

The research agenda that follows from these desiderata starts with the development of methods and software tools that can assist experts in their manual work, allowing them to enrich more data in less time. Such tools should be able to emulate experts' annotations, and suggest annotations of new data at such a high level of precision that experts only need to correct these suggestions occasionally. As a second step, the agenda should list the development of tools that operate in domains that demand even more automation; either because no initial annotation scheme is available (the data is still "raw") and an annotation needs to be bootstrapped from data, or because the annotation needs to be performed automatically, either due to the unavailability of experts or as an initial phase in exploring "raw" data.

This agenda calls for the use and development of methods for automatic knowledge generation in data (a broad field encompassing methods from machine learning, statistical learning, and data mining). Knowledge generation from data is typically needed in situations such as the one central to CATCH, where a digitisation effort has produced (potentially large-scale) databases of unanalysed data, and experts (collection managers) are eager to explore and analyse this data as effectively as possible in as little time as possible. Alternatively, the data is already annotated, or is receiving new annotations through a metadata project (as also present in CATCH), and knowledge enrichment is used to learn this annotation and apply it to yet unanalysed data.

This research is intrinsically empirical; the methods to be developed are based on empirical data, and the function they have can and must be judged and evaluated in terms of measurable improvements in accuracy and speed, both by objective quantitative evaluation and by the collection managers that use the methods.

Background

To understand the research question and the research topics in more depth, we provide some background. Table 1 shows the four types of knowledge enrichment we distinguish.

                                    Expert support    Automatic enrichment
Existing annotation systems               A                    B
Automatic discovery of structure          C                    D

A

  • Expert support, based on existing annotation schemes
  • Supporting experts in the annotation of objects in databases according to an existing annotation scheme, in a software annotation environment that is able to make accurate suggestions.
  • Keywords: semi-automatic annotation, domain knowledge, existing ontologies, semantic web

B

  • Automatic enrichment, based on existing annotation schemes
  • Automatic annotation of unannotated objects, and automatic cleanup of incorrectly annotated objects. This makes possible what could not be done under quadrant A in human time.
  • Keywords: data mining, text mining, automatic classification, machine learning

C

  • Expert support, automatic discovery of structure
  • Confronting experts with statistically salient patterns and structures within and between objects, visualising associations, suggesting new structures.
  • Keywords: exploratory data analysis, data mining, statistical analysis

D

  • Automatic enrichment, automatic discovery of structure
  • Discovering structures within and between objects, and exporting these discoveries to ontologies, associative networks, and clustering.
  • Keywords: knowledge generation from data, self-organization, clustering

Table 1: Four types of knowledge enrichment.

The "A" quadrant represents tools for the direct support of experts in the manual annotation of objects in databases. Precious time can be saved when intelligent software makes accurate suggestions to the annotator, who then only invests time when the suggestion is incorrect. Even more precious time can be saved when the same intelligent software running in the background makes pre-selections of especially salient objects that need to be annotated first.
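
The suggestion mechanism of the "A" quadrant can be sketched as follows. This is a deliberately simple stand-in (nearest neighbour on word overlap) and not a method prescribed by CATCH; the example annotations, labels and the threshold value are hypothetical.

```python
def tokens(text):
    """Lowercased word set of a description."""
    return set(text.lower().split())


def jaccard(a, b):
    """Overlap of two token sets, between 0.0 and 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def suggest(description, examples, threshold=0.3):
    """Suggest the label of the most similar expert-annotated example.

    Returns (label, similarity, needs_review). The threshold is an
    arbitrary illustrative value, not a recommended setting.
    """
    query = tokens(description)
    best_label, best_score = None, 0.0
    for text, label in examples:
        score = jaccard(query, tokens(text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score, best_score < threshold


# Hypothetical expert annotations used as examples to learn from.
EXAMPLES = [
    ("oil painting of a harbour with ships", "maritime"),
    ("portrait of a merchant in black dress", "portrait"),
]

print(suggest("painting of ships in a harbour", EXAMPLES))
print(suggest("bronze statue", EXAMPLES))
```

In this sketch the annotator would only be asked to look at objects flagged with needs_review, and sorting unannotated objects by ascending similarity would give one possible pre-selection of objects to annotate first.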

The "B" quadrant takes over from the "A"-quadrant tools when the scale of the data cannot be tackled by the available human expert time. "B"-quadrant tools automatically annotate large amounts of data, and check for inconsistencies and noise in existing annotated databases. They will not do this flawlessly, but well enough that the automatically annotated data becomes largely searchable and retrievable, where before it was not.

The "C" quadrant is the mirror of the "A" quadrant, except that experts are not helped with annotation, but rather confronted with new patterns and relations that may deserve a new annotation symbol or level. A likely example is a new level of annotation which links pairs of objects to each other on grounds of some significant co-occurrence of the two, that thus far was not acknowledged by any level of annotation.

The "D" quadrant combines "B" and "C": it operates autonomously on data to discover any grouping of objects that might be of interest, on such large amounts of data that a manual inspection of the process would not be feasible, except at the very end of the automatic knowledge discovery process.


Ad 3: Personalisation through presentation

Situation in cultural heritage

Most of the services that are currently available have predefined presentations. The institutions determine the ways a user may view objects and their metadata. Information technology offers many new options for personalisation of the presentation, but these are hardly used at all. The reason is straightforward: there are no easy-to-use tools for this purpose. More research into human-computer interaction and user modelling is needed to specify such tools. A clear instance is the need for better navigation through digital collections. The number of objects held by cultural institutions runs into the millions, if not billions when considered on a global scale. User modelling is considered an attractive option for navigating more quickly, easily and efficiently across digital collections or objects. By automatic analysis of the user's search behaviour and by offering the facility to create personal contexts, it is expected that users can benefit more from such information services than via direct search-and-retrieval actions.

Research topics

The leading research question is: How can we develop methods and tools for generating presentations of cultural-heritage objects that are related in a semantic way? This work also includes (1) user-modelling issues, e.g., how can user groups be related to presentation styles? and (2) user-control issues, e.g., how can the user control the presentation style? More specifically, we list the following three research questions.

  • Is it possible to adequately reduce the user’s effort in expressing the ambitious information need that the system must take into account, besides many other elements?
  • Is it possible to construct a tool that composes an agreed-upon ontology in order to determine the meaning of terms in the user’s questions and in the information sources?
  • To what extent is it possible to find an “optimal” mix of (1) proactive behaviour that is based solely on the user’s known interests and (2) selection of information based on other users’ interests or the importance of certain (un-requested) information?

For the research involved, two observations are important.

  • The availability of syntactically (XML-based) and semantically (RDF/OWL-based) integrated metadata opens new avenues for presentation and personalization.
  • By using semantic relations such as “period” and “style” it becomes possible to generate tailor-made presentations for groups or individuals.

Background

To provide an appropriate insight into the complexity of the three research questions we add some details about context and depth of the investigations. In research question 1, the “many other elements” include a user model containing the interests, goals, background and knowledge of the user, contextual information such as the physical location of the user and perhaps also his/her orientation, the time of day, the device and network he/she is using to interact with the system. Presently research is carried out on adapting the selection and presentation of information to a user based on one type of information about that user (either knowledge, interest, or context). This should be complemented by research on adaptation based on all kinds of information about the user in question and his/her context.

For research question 2 it is beneficial to understand that the answer to a question also consists of objects described by semantic metadata, used to determine how these objects relate to one another. This semantic information needs to be combined with descriptive metadata in order to generate a hypermedia (Web) structure that can be viewed using a “browser”. While currently it is possible to generate such presentations based on one set of metadata, the combination of different types of metadata has to be investigated in order to generate the most appropriate presentation for each individual user.

Research question 3 looks somewhat further into the future: systems can be made to become proactive, selecting and presenting information that matches the user’s interests and needs without the user having to express that need through a question. The automatic provision of information on a person, e.g., architect Max Weber, when dealing with housing of multicultural groups in Amsterdam, is a good example of proactive behaviour. A mix of active and proactive behaviour is needed to prevent an agent from becoming boring: an agent without such a mix will never surprise the user with interesting but unexpected information.

For the research theme personalisation the CATCH programme aims at acquiring new knowledge in three sub-domains: (1) selection of information, (2) automatic generation of presentations, and (3) adaptation or personalisation.

Selection of information. The challenge here is to answer incomplete information requests from users with an accuracy that is comparable with or even better than the database-query accuracy. Four techniques have to be combined into heuristic evaluation tools to achieve this goal. The techniques are: (1) information retrieval techniques based on (potential) natural language understanding of textual contents, (2) information retrieval techniques based on metadata using ontology, (3) selection of objects based on descriptive metadata, and (4) database integration methods.

Automatic generation of presentations. The challenge is to “combine” selected information objects of different media types. The most difficult part is perhaps that these objects have different types of navigational or semantic relationships and must be combined into a single virtual hypermedia (Web) presentation. In that case it is necessary to adapt the result to the device and network capabilities of the user’s environment. This requires a careful (automatic) selection of the use of the “dimensions” layout, time, and navigation.

Adaptation or personalization. The results of almost any possible information request are too large to be presented to and browsed through by a user. Hence, an environment must be designed that derives additional specifications of the information or objects to be selected from past user behaviour. In order to improve this process, and especially its initial stages, users need to be clustered in groups (with similar interests, background, expertise, etc.). Finding scalable algorithms for grouping is an additional research issue here.
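
The grouping of users with similar interests can be illustrated with a minimal sketch. It assumes that each user is described by a set of interest keywords (hypothetical profiles below) and uses a greedy single-pass grouping with an arbitrary similarity threshold; it is one simple option among many, not the scalable algorithm the research issue asks for.

```python
def jaccard(a, b):
    """Overlap of two interest sets, between 0.0 and 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_users(profiles, threshold=0.5):
    """Greedy single-pass grouping of users by interest overlap.

    Each user joins the first group whose seed profile is similar
    enough; otherwise the user starts a new group. The result depends
    on the order in which users are processed.
    """
    groups = []  # list of (seed_interests, [user ids])
    for user, interests in profiles.items():
        for seed, members in groups:
            if jaccard(seed, interests) >= threshold:
                members.append(user)
                break
        else:
            groups.append((interests, [user]))
    return [members for _, members in groups]


# Hypothetical user profiles, for illustration only.
PROFILES = {
    "u1": {"painting", "golden age", "amsterdam"},
    "u2": {"painting", "golden age", "rembrandt"},
    "u3": {"archaeology", "pottery"},
    "u4": {"pottery", "archaeology", "roman"},
}

print(group_users(PROFILES))
```

Each user is compared only with one seed per group, so the cost grows with the number of users times the number of groups rather than with the square of the number of users. The price is that the grouping is order-dependent and coarse.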
