Cross-language patent matching via an international patent classification-based concept bridge
Journal of Information Science
Published online on July 08, 2013
Abstract
Patent documents contain sophisticated technical information that is valuable for developing new technologies and products. Because they can be written in almost any language, retrieval often runs into language barriers. Traditionally, cross-language information retrieval and cross-language document matching have relied on text-translation-based or index-set-mapping methods. These traditional methods face several challenges, however, including the difficulty of natural language translation, the complications of bilingual or multilingual translation (translating between two or more languages), and the unavailability of a parallel dual-language document set. This study offers a new and robust solution to cross-language patent document matching: the International Patent Classification (IPC)-based concept bridge approach. The proposed method applies Latent Semantic Indexing to extract concepts from each set of patent documents and uses the IPC codes to construct a cross-language mediator that expresses patent documents in different languages. Experiments were carried out to demonstrate the performance of the proposed method. As training documents, 3000 English patents and 3000 Chinese patents were gathered from the United States Patent and Trademark Office and the Taiwan Intellectual Property Office, respectively. A further 30 English patents and 30 Chinese patents were collected as query patents. Finally, evaluations using an objective measure and subjective judgement were conducted to demonstrate the feasibility and effectiveness of our method. The results show that our method outperforms the traditional text-translation methods.
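The abstract does not spell out how the concept bridge is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each language gets its own rank-k Latent Semantic Indexing space (via a truncated SVD), that each IPC class is summarised by the centroid of its training documents in that space, and that a document is then expressed as its cosine similarities to those centroids. Because both languages share the same IPC classes, those similarity vectors are comparable across languages. The toy matrices, the IPC labels assigned to them, and all function names are hypothetical.

```python
import numpy as np

def lsi_fit(term_doc, k):
    """Rank-k LSI of a terms x docs matrix: returns (U_k, document concept vectors)."""
    u, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    u_k = u[:, :k]
    doc_vecs = vt[:k, :].T * s[:k]  # docs x k, equal to term_doc.T @ u_k
    return u_k, doc_vecs

def cosine(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def ipc_centroids(doc_vecs, labels, classes):
    """Assumed bridge: mean concept vector of the training docs in each IPC class."""
    labels = np.asarray(labels)
    return np.vstack([doc_vecs[labels == c].mean(axis=0) for c in classes])

# Toy training data (hypothetical). Rows are terms, columns are documents.
classes = ["H04L", "A61K"]
en_terms = np.array([[2., 1, 0, 0],
                     [1, 2, 0, 0],
                     [0, 0, 2, 1],
                     [0, 0, 1, 2]])
en_labels = ["H04L", "H04L", "A61K", "A61K"]
zh_terms = np.array([[3., 0, 0, 2],
                     [0, 2, 1, 0],
                     [0, 1, 2, 0]])
zh_labels = ["H04L", "A61K", "A61K", "H04L"]

# One LSI space and one set of IPC centroids per language.
u_en, en_vecs = lsi_fit(en_terms, k=2)
u_zh, zh_vecs = lsi_fit(zh_terms, k=2)
en_bridge = ipc_centroids(en_vecs, en_labels, classes)
zh_bridge = ipc_centroids(zh_vecs, zh_labels, classes)

# Fold an English query into the English LSI space, then express it in IPC space.
query_terms = np.array([[1., 1, 0, 0]])
query_in_ipc = cosine(query_terms @ u_en, en_bridge)  # 1 x n_classes

# Express the Chinese documents in the same IPC space and rank them.
zh_in_ipc = cosine(zh_vecs, zh_bridge)                # n_docs x n_classes
scores = cosine(query_in_ipc, zh_in_ipc)[0]
ranking = np.argsort(-scores)
print(ranking, np.round(scores, 3))
```

In this sketch no text is translated at any point: the only link between the two languages is the shared list of IPC classes, which is the role the abstract assigns to the cross-language mediator. A real system would also need term weighting and a much larger k, neither of which the abstract specifies.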