The research program of this collaborative project is the following:
- 1 Cross-lingual interlinking from online knowledge bases
- 1.1 Task 1 - Cross-lingual knowledge linking based on encyclopedic resources
- 1.2 Task 2 - Cross-lingual infobox alignment and complement
- 2 Cross-lingual ontology matching and data interlinking
- 2.1 Task 3 - Cross-lingual ontology matching based on aligned cross-lingual human-readable knowledge bases
- 2.2 Task 4 - Cross-lingual data interlinking
- 2.3 Task 5 - Query-driven cross-lingual data interlinking
- 3 Application to the generation and publication of links across real-world cross-lingual datasets
- 4 Task 7 - Project management
In this first part, we will design new methods for interlinking the resources and structure of cross-lingual human-readable knowledge bases. These bases are made of a set of articles which are organized into categories and interconnected through hypertext links. The articles contain text, multimedia content, and semi-structured data tables called infoboxes.
Our approach will consist in making cross-lingual links emerge between human-readable KBs by relying more on the structure of the articles than on their textual content. To do this, we will first design predictive models using language-independent features such as the hyperlink structure and the categories of articles. A second aspect will be the development of infobox matching algorithms.
Task 1 - Cross-lingual knowledge linking based on encyclopedic resources
Coordinator: Juanzi Li
Objectives Cross-lingual links in Wikipedia connect equivalent articles written in different languages and have been widely exploited in various applications. However, the number of articles in different languages is very unbalanced within Wikipedia, which limits the number of available cross-lingual links. Since there already exist several large-scale encyclopedias other than Wikipedia (Freebase, Hudong Baike, Baidu Baike), we are going to study and develop new methods for discovering cross-lingual links across multiple heterogeneous encyclopedic corpora.
Machine translation is a simple method for bridging the gap between different languages. However, it also brings a lot of noise into the data. We will instead study how to define language-independent article features for detecting cross-lingual links. Inlinks, outlinks and categories of Wiki articles will be explored to construct useful features.
Cross-lingual links in Wikipedia are valuable seeds or training samples for finding new cross-lingual links. We are going to propose a probabilistic model to predict new cross-lingual links. The predictive model should take both lexical and structural information into account to collectively predict new links.
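As a sketch of what such language-independent features could look like, the following toy example combines outlink and category overlap through known seed links (the data layout, feature set and fixed weights are illustrative assumptions; the actual predictive model would be trained from the seed links):

```python
def link_features(a_out, b_out, a_cats, b_cats, seed):
    """Language-independent features for a candidate article pair (a, b).

    a_out/b_out: sets of article ids linked from a and b;
    a_cats/b_cats: sets of category ids; seed: dict of known
    cross-lingual links from language A to language B.
    """
    mapped_out = {seed[x] for x in a_out if x in seed}
    mapped_cats = {seed[x] for x in a_cats if x in seed}
    out_overlap = len(mapped_out & b_out) / max(len(mapped_out | b_out), 1)
    cat_overlap = len(mapped_cats & b_cats) / max(len(mapped_cats | b_cats), 1)
    return out_overlap, cat_overlap

def score(features, weights=(0.7, 0.3)):
    # Stand-in for the trained probabilistic model: a weighted sum.
    return sum(w * f for w, f in zip(weights, features))
```

Both features only use the hyperlink and category structure, never the article text, which is what makes them usable across language pairs.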
Expected results: Algorithms and tools for cross-lingual knowledge linking; cross-lingual linking of articles from English Wikipedia, Hudong and French Wikipedia.
Task 2 - Cross-lingual infobox alignment and complement
Coordinator: Juanzi Li
Participants: THU, INRIA
Objectives Infoboxes in encyclopedic corpora contain useful structural information and are important resources for acquiring semantic knowledge. In order to facilitate cross-lingual knowledge sharing and interaction, we are going to study methods for infobox alignment and complement.
Task 2.1 - Cross-lingual infobox alignment
The goal of infobox alignment is to match infobox properties across languages. It is observed that infoboxes contain rich links. Therefore, we are going to compute semantic similarities between infobox properties based on the link structure in Wikis. Cross-lingual links between Wiki articles will also be fully explored in this task.
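A minimal sketch of a link-structure-based property similarity, assuming each property is represented by the set of articles its values link to (this representation and the Jaccard measure are illustrative choices, not the project's final design):

```python
def property_similarity(vals_a, vals_b, cross_links):
    """Similarity of two infobox properties in different languages.

    vals_a/vals_b: sets of articles linked from the property's values
    in each language; cross_links: known article-level cross-lingual
    links (language A -> language B).
    """
    mapped = {cross_links[v] for v in vals_a if v in cross_links}
    if not mapped and not vals_b:
        return 0.0
    # Jaccard overlap after projecting language-A values into language B.
    return len(mapped & vals_b) / len(mapped | vals_b)
```

Properties such as "birthplace" and its Chinese counterpart should link to largely the same set of articles once projected through the cross-lingual links, yielding a high score.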
Task 2.2 - Cross-lingual infobox complement
The number of articles containing infoboxes is also unbalanced across languages in Wikis. About 30% of English Wiki articles contain infoboxes, but the number of infoboxes in other languages is relatively small. Cross-lingual infobox complement maps infobox information from one language to other languages. We will study methods making use of cross-lingual links and infobox property alignments for this task.
Expected results: Algorithms and tools for cross-lingual infobox alignment and complement; cross-lingual linking of infoboxes from English Wikipedia, Hudong and French Wikipedia.
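The complement step of Task 2.2 could be sketched as follows, assuming dict-based infoboxes, a pre-computed property alignment and article-level cross-lingual links (all names and data layouts are illustrative):

```python
def complement_infobox(src_infobox, tgt_infobox, prop_align, cross_links):
    """Fill missing target-language infobox slots from the source language.

    src_infobox: {property: value-article}; prop_align: source property ->
    aligned target property; cross_links: article-level links (src -> tgt).
    Existing target values are never overwritten.
    """
    completed = dict(tgt_infobox)
    for prop, value in src_infobox.items():
        tgt_prop = prop_align.get(prop)
        tgt_value = cross_links.get(value)
        if tgt_prop and tgt_value and tgt_prop not in completed:
            completed[tgt_prop] = tgt_value
    return completed
```

This shows why Task 2.2 depends on Task 2.1: without the property alignment, values cannot be routed to the right slot in the target infobox.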
We will design ontology matching algorithms and data-interlinking methods which perform well in a cross-lingual environment. The originality and power of the solution will be to combine not only data and ontologies but also links discovered at the syntactic level.
Since the proposed interlinking solution has to scale up to very large datasets, we will also pay particular attention to the development of query-driven strategies for partitioning large datasets into smaller related ones.
Task 3 - Cross-lingual ontology matching based on aligned cross-lingual human-readable knowledge bases
Coordinator: Jérôme Euzenat
Objectives In this task, we will study ontology matching in cross-lingual environments by using the cross-lingual links generated at the corpus level in Tasks 1 and 2. For this task, we will propose new context-based ontology matching algorithms. The hypothesis underlying context-based matching is that considering ontologies within their context helps to match cross-lingual ontologies. The contextual information associated with the ontologies will be the interlinked wiki knowledge bases. We will investigate and combine two different approaches: the first one is multi-alignment and the second is instance-level matching.
Task 3.1 - Context-based ontology matching
Given two ontologies Oc and Of, and two related encyclopedic corpora Wc and Wf respectively written in Chinese and in French, the strategy will consist, in a first step, in matching each ontology to the category architecture of its corresponding encyclopedia. The categories of an encyclopedia are often organized in a specialization hierarchy, which can be seen as a lightweight ontology. In a second step, the category architectures will be matched using a similarity based on the links between the two encyclopedias (A1 and A2). Finally, the alignment between Oc and Of will be computed by composing the alignments in the path Oc - Wc - Wf - Of. One of the key challenges here is to design and develop new similarity measures based on the links discovered between semi-structured encyclopedic corpora.
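The composition along the path Oc - Wc - Wf - Of can be sketched as follows, assuming alignments are stored as dictionaries with confidence values and that confidences are combined by multiplication (an illustrative choice):

```python
def compose(a1, a2):
    """Compose two alignments given as {entity: (matched_entity, confidence)}."""
    out = {}
    for e, (m, c1) in a1.items():
        if m in a2:
            m2, c2 = a2[m]
            out[e] = (m2, c1 * c2)  # confidence of the composed correspondence
    return out

def match(oc_wc, wc_wf, wf_of):
    # Oc -> Wc categories -> Wf categories -> Of, as in the task description.
    return compose(compose(oc_wc, wc_wf), wf_of)
```

Note that composition can only lose correspondences: an Oc entity with no image through one of the intermediate alignments drops out, which is why the quality of the corpus-level links from Tasks 1 and 2 matters here.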
Task 3.2 - Instance-level matching
Instance similarity is a natural feature for ontology matching. However, it is not easy to compute the similarity of two instances in different languages. Some existing approaches employ a translator to address this problem. In this task, we will study how to measure the distance between two instances based on knowledge linking. Useful features for cross-lingual instance similarity include structured information such as infobox properties and values. In addition, with knowledge linking we can transform cross-lingual instance similarity computation into a monolingual problem. Given two instances Aen and Bcn in English and Chinese respectively, if Aen is linked with a Chinese instance Acn and Bcn is linked with an English instance Ben, we can calculate similarity(Aen, Bcn) from similarity(Aen, Ben) and similarity(Acn, Bcn).
Expected results: Algorithms and tools for cross-lingual ontology matching based on cross-lingual links between human-readable knowledge bases.
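The monolingual reduction described in Task 3.2 can be sketched as follows (averaging the two monolingual similarities is one illustrative way to combine them; the entity names and data layouts are toy assumptions):

```python
def cross_similarity(a_en, b_cn, links_en_cn, links_cn_en, mono_sim):
    """Estimate similarity(Aen, Bcn) through linked counterparts.

    Acn is the Chinese instance linked to Aen; Ben is the English
    instance linked to Bcn. Each available monolingual similarity
    contributes to the estimate; with no links at all, return 0.
    """
    sims = []
    b_en = links_cn_en.get(b_cn)
    if b_en is not None:
        sims.append(mono_sim(a_en, b_en))   # English-side comparison
    a_cn = links_en_cn.get(a_en)
    if a_cn is not None:
        sims.append(mono_sim(a_cn, b_cn))   # Chinese-side comparison
    return sum(sims) / len(sims) if sims else 0.0
```

`mono_sim` stands for any monolingual instance similarity, e.g. one computed over infobox properties and values as suggested above.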
Task 4 - Cross-lingual data interlinking
Coordinator: Jérôme David
Participants: THU, INRIA
Objectives THU has achieved large-scale data interlinking in a monolingual environment and proposed a method based on candidate selection. This project will go further in this direction and extend the method to the cross-lingual environment, so as to achieve large-scale cross-lingual data interlinking that does not rely on a translator. The techniques needed are as follows:
Task 4.1 - Wiki based similarity measurement
Classical cross-lingual data interlinking methods employ translators before matching, and are thus not “real” cross-lingual approaches. Obviously, the performance of these approaches directly depends on the precision of the translation tool. To avoid this drawback, we propose a Wiki-knowledge-based similarity measurement using collective intelligence.
Unlike schema matching, data interlinking usually faces huge datasets, which requires efficient algorithms. We will study a candidate-selection-based matching algorithm to avoid time-consuming calculations. The new algorithm is expected to have lower time complexity. In addition, extra research is needed to ensure high precision and recall.
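One simple form of candidate selection is token blocking through an inverted index, sketched below with toy token-set records (the blocking key is an illustrative assumption; the project's actual candidate selection method may differ):

```python
from collections import defaultdict

def candidates(records_a, records_b):
    """Candidate selection by shared tokens (a simple blocking scheme).

    records_*: {id: set of tokens}. Only pairs sharing at least one
    token are compared later, avoiding the quadratic all-pairs
    similarity computation.
    """
    index = defaultdict(set)
    for rid, tokens in records_b.items():
        for t in tokens:
            index[t].add(rid)
    pairs = set()
    for rid, tokens in records_a.items():
        for t in tokens:
            for rid_b in index[t]:
                pairs.add((rid, rid_b))
    return pairs
```

For cross-lingual data, the tokens would have to be language independent (numbers, dates, codes) or projected through known cross-lingual links first.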
Due to the limitation of the original seed set derived from the Wiki, the similarity measurement will be limited to a relatively small instance set. To enlarge the number of matched instances and to improve precision, our algorithm will learn from the matching results to extend the seed set iteratively. In the iterative process, wrong matches will be corrected by new matching results.
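The iterative seed extension could be sketched as follows (for brevity this sketch only adds matches and omits the correction of wrong matches mentioned above; the similarity callback and threshold are illustrative):

```python
def extend_seeds(seeds, candidate_pairs, sim, threshold=0.8, rounds=3):
    """Iteratively grow a seed set of matches.

    sim(a, b, matched) scores a candidate pair given the matches found
    so far; pairs above the threshold are accepted and can in turn
    support new matches in the next round.
    """
    matched = dict(seeds)
    for _ in range(rounds):
        added = False
        for a, b in candidate_pairs:
            if a in matched:
                continue
            if sim(a, b, matched) >= threshold:
                matched[a] = b
                added = True
        if not added:   # fixed point reached before the round limit
            break
    return matched
```

The key property is that `sim` depends on `matched`: each accepted match enriches the evidence available for the remaining candidates.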
Linkkeys are minimal sets of aligned properties which discriminate instances in the two datasets. In a cross-lingual context, this approach is very promising under the hypothesis that some linkkeys are composed only of language-independent properties. The outline of such a linkkey-based method is:
- Match the ontologies used by the two datasets (Task 3).
- Select pairs of ontology properties which are aligned and share similar ranges.
- Select subsets of these properties which are keys in their respective dataset.
- Use these keys to link data.
The challenges of this task are to select and design measures for comparing property ranges and to develop efficient algorithms for discovering keys in large datasets.
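The last two steps of the linkkey outline above can be sketched as follows, assuming records are dictionaries and the aligned property pairs have already been selected (all names and the record layout are illustrative):

```python
def is_key(records, props):
    """A property subset is a key if its value combination is unique."""
    seen = set()
    for rec in records:
        sig = tuple(rec.get(p) for p in props)
        if sig in seen:
            return False
        seen.add(sig)
    return True

def link_with_key(records_a, records_b, key_pairs):
    """Link instances whose values agree on every aligned key property.

    key_pairs: list of (property-in-A, property-in-B) pairs, assumed
    language independent (e.g. dates, numbers, codes).
    """
    index = {tuple(r.get(pb) for _, pb in key_pairs): r["id"] for r in records_b}
    links = {}
    for r in records_a:
        sig = tuple(r.get(pa) for pa, _ in key_pairs)
        if sig in index:
            links[r["id"]] = index[sig]
    return links
```

Because the key values (here a year and a runtime) carry no natural-language content, the same linkkey works regardless of the languages of the two datasets.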
Expected results: Algorithms and tools for cross-lingual data interlinking based on cross-lingual links between human-readable knowledge bases and language-independent linkkeys.
Task 5 - Query-driven cross-lingual data interlinking
Coordinator: Juanzi Li
Participants: THU
Generally speaking, the number of instances, especially in large-scale ontologies, is much larger than the size of the schemas. Therefore, large-scale instance matching becomes a key point in semantic integration systems, and many efforts have previously been spent on this issue. Moreover, such huge datasets are volatile. Secondly, the users of an integration system may only care about some of the instances of a large-scale ontology. Thirdly, with the growing size of ontologies, some effective but resource-intensive matching approaches can no longer be applied. According to these facts, we propose a bootstrapping query-driven instance matching approach. The task is divided into two sections as follows:
In order to reduce the time complexity of data interlinking, a query-driven cross-lingual data interlinking algorithm is needed: the data will not be linked until search results are returned to the user, and only the entries occurring in the search results will be linked. We will research query-driven data interlinking in a cross-lingual setting to improve the efficiency of large-scale cross-lingual data linking.
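A minimal sketch of this lazy, query-driven strategy, with `search` and `link` passed in as callbacks (both are illustrative placeholders for the actual retrieval and interlinking components):

```python
def query_driven_link(query, search, link, cache=None):
    """Link only the entities returned for a user query (lazy linking).

    search(query) returns the result entity ids; link(entity) computes
    the cross-lingual counterpart on demand. An optional cache avoids
    recomputing links for entities seen in earlier queries.
    """
    cache = {} if cache is None else cache
    results = {}
    for entity in search(query):
        if entity not in cache:
            cache[entity] = link(entity)   # computed only when needed
        results[entity] = cache[entity]
    return results
```

The cost of interlinking is thus proportional to the size of the result set rather than to the size of the whole dataset.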
Task 5.2 - QM-Similarity computation method
The query-driven cross-lingual data interlinking model optionally takes advantage of a user feedback process to confirm the recommended matching candidates ranked by QM-similarities. This similarity is computed based on the co-occurrence and matching rate of two instances. We will concentrate on avoiding local-area similarity computation, which would decrease the accuracy of data interlinking.
Expected results: Algorithms and tools for query-driven cross-lingual data interlinking.
The application part will promote the research results of this project by applying the proposed methods to interlinking cross-lingual datasets about news and movies. More precisely, we will contribute by publishing new links between cross-lingual datasets. We will also illustrate the use of such cross-lingual links through cross-lingual information retrieval applications.
Coordinator: Juanzi Li
Participants: INRIA, THU
A typical application of cross-lingual data linking is cross-lingual news and movie linking. With the rapid spread of globalization, people may want to get information in different languages; e.g., someone in China might want to read the comments on a Hollywood film posted by Americans. In this situation, a link between a Chinese movie LOD dataset and an English movie LOD dataset is required. To achieve this goal, several efforts are needed.

The criteria for selecting existing news and movie LOD datasets include the data quality of the dataset and the existing amount of interlinks with other datasets. For English-language datasets, the New York Times dataset for news and the Linked MDB dataset for movies are the most popular ones, and have interlinks with DBpedia, Geonames, Freebase, MusicBrainz, and other datasets. Traditional non-RDF data sources include structured data, e.g., tabular data, XML data and relational databases, and non-structured data, e.g., news and movie webpages. For news, standardized XML news documents like NewsML (News Markup Language) and CNML (Chinese News Markup Language) can be used, as well as news portals like Sina. For movies, we can use webpages, and with the support of M1905.com, we have a large relational database containing 306,942 movies and 2,102,474 film-related persons.

We can use the data linking algorithm to establish cross-lingual links between the entities in movies and news. Converted CNML and Sina datasets can be linked with the NYT dataset, and the converted M1905 dataset can be linked with the Linked MDB dataset. We can then leverage data interlinking methods and our cross-lingual data linking algorithm to establish links between these datasets and surrounding datasets, and build cross-lingual and cross-domain linked datasets. Based on these, a standard linked data portal can be established, containing data browsing and download pages, URI dereference pages, and a SPARQL endpoint. Different applications can then be built upon this platform.
Task 6.1 - Multi-faceted cross-lingual integrated data retrieval
Cross-lingual data linking connects LOD datasets in different languages and to some extent reduces the imbalance of information. In this case, cross-lingual information sharing becomes easier than before. With the support of cross-lingual domain LOD datasets, we can improve the user experience by providing multi-lingual information on news and movie instances.
Task 6.2 - Cross-lingual relevant news retrieval
Current domain news browsing systems can only retrieve news along a specific dimension and from a single source, without the ability to semantically integrate, compare and mine cross-lingual multi-source content. Based on a unified interface over cross-lingual multi-source linked datasets, we can retrieve relevant news and reviews from different sources, making it possible to compare the reactions to a particular movie among speakers of different languages.
Expected results: Cross-lingual linked data in news and movies; prototype system for multi-faceted cross-lingual integrated linked data search and cross-lingual relevant news search.
Task 7 - Project management
Coordinator: Juanzi Li, Jérôme Euzenat
Participants: INRIA, THU
As mentioned, the project management will be kept as light as possible. This involves:
- Setting up a working infrastructure (wiki and storage space) that the other tasks can use.
- Elaborating and signing the consortium agreement.
- Organizing project meetings and teleconferences in charge of monitoring the project progress and replanning if necessary.
- Producing the final report and any report required by the funding agencies.
- Common wiki/git infrastructure (D7.1);
- Consortium agreement (D7.2);
- Final report (D7.3).