
Automatic Extraction of Property Norm‐Like Data From Large Text Corpora


Cognitive Science / Cognitive Sciences


Abstract

Traditional methods for deriving property‐based representations of concepts from text have focused either on extracting only a subset of possible relation types, such as hyponymy/hypernymy (e.g., car is‐a vehicle) or meronymy/metonymy (e.g., car has wheels), or on extracting unspecified relations (e.g., car—petrol). We propose a system for the challenging task of automatic, large‐scale acquisition of unconstrained, human‐like property norms from large text corpora, and discuss the theoretical implications of such a system. We employ syntactic, semantic, and encyclopedic information to guide our extraction, yielding concept‐relation‐feature triples (e.g., car be fast, car require petrol, car cause pollution), which approximate property‐based conceptual representations. Our novel method extracts candidate triples from parsed corpora (Wikipedia and the British National Corpus) using syntactically and grammatically motivated rules, then reweights the triples with a linear combination of their frequency and four statistical metrics. We assess our system output in three ways: lexical comparison with norms derived from human‐generated property norm data, direct evaluation by four human judges, and a semantic distance comparison with both WordNet similarity data and human‐judged concept similarity ratings. Our system offers a viable and performant method of plausible triple extraction: Our lexical comparison shows comparable performance to the current state‐of‐the‐art, while subsequent evaluations exhibit the human‐like character of our generated properties.
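The reweighting step mentioned in the abstract (a linear combination of a triple's frequency and four statistical metrics) can be illustrated with a minimal Python sketch. The abstract does not name the four metrics, their scaling, or how the weights are chosen, so everything here is an assumption made for illustration: the `Candidate` class, the `score` and `rerank` functions, the placeholder metric values, and the weights are all hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# A concept-relation-feature triple, e.g. ("car", "require", "petrol").
Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one extracted candidate triple."""
    triple: Triple
    frequency: float                              # assumed already normalised
    metrics: Tuple[float, float, float, float]    # four unnamed placeholder metrics


def score(candidate: Candidate, weights: Sequence[float]) -> float:
    """Linear combination of frequency and the four metric values.

    `weights` holds five coefficients: one for frequency, then one per metric.
    """
    features = (candidate.frequency, *candidate.metrics)
    if len(weights) != len(features):
        raise ValueError("expected one weight per feature (5 in total)")
    return sum(w * x for w, x in zip(weights, features))


def rerank(candidates: Sequence[Candidate], weights: Sequence[float]) -> List[Candidate]:
    """Return candidates sorted from highest to lowest combined score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Illustrative values only; they are not taken from the paper.
candidates = [
    Candidate(("car", "be", "fast"), 0.9, (0.1, 0.1, 0.1, 0.1)),
    Candidate(("car", "require", "petrol"), 0.2, (0.9, 0.8, 0.7, 0.6)),
]
weights = (1.0, 0.5, 0.5, 0.5, 0.5)
ranked = rerank(candidates, weights)
```

With these made-up numbers, the less frequent triple ranks first because its placeholder metric scores outweigh its lower frequency, which is the kind of adjustment a reweighting step is meant to allow.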