Summary
Sentences with a complex linguistic structure are a major stumbling block for semantic applications, whose predictive quality deteriorates with sentence length and complexity. Text Simplification (TS) aims to modify sentences to make them easier to process, using a set of rewriting operations such as reordering, deletion, or splitting. These transformations convert the input into a simplified output while preserving its main idea and keeping it grammatically sound.
State-of-the-art syntactic TS approaches suffer from two major drawbacks. First, they are very conservative in that they tend to retain the input rather than transform it. Second, they ignore the cohesive nature of texts, where context spread across clauses or sentences is needed to infer the true meaning of a statement. We address the first problem by generating a fine-grained output with a simple and regular structure. For this purpose, we decompose a source sentence into a set of self-contained propositions, each of which presents a minimal semantic unit. Moreover, to maximize the expressiveness of the simplified sentences, we propose not only splitting the input into isolated sentences but also incorporating the semantic context in the form of semantic relationships between the split propositions.
To this end, we present a discourse-aware TS framework that splits and rephrases complex English sentences within the semantic context in which they occur. Our framework differs from previous systems in its linguistically grounded transformation stage, which uses clausal and phrasal disembedding mechanisms to transform syntactically complex sentences into smaller units with a simpler structure. By using a recursive top-down approach, our framework generates a hierarchical representation of those units, capturing both their semantic context and their relations to other units in the form of rhetorical relations. In this way, we generate a semantic hierarchy of minimal propositions that benefits downstream Open Information Extraction (Open IE) tasks.
In a comparative analysis, we demonstrate that our baseline implementation DisSim outperforms the state of the art in structural TS in both an automatic evaluation and a manual analysis. It obtains the highest SAMSA scores (0.67, 0.57, 0.54) on three simplification datasets from two different domains. SAMSA is a recently proposed metric for automatically measuring the syntactic complexity of sentences that correlates highly with human judgments of structural simplicity and grammaticality. Furthermore, a comparison with the annotations contained in the RST Discourse Treebank reveals that we capture the contextual hierarchy between the split sentences with a precision of almost 90% and reach an average precision of approximately 70% in classifying the rhetorical relations that hold between them. Finally, an extrinsic evaluation shows that applying our framework as a preprocessing step improves the performance of state-of-the-art Open IE systems by up to 346% in precision and 52% in recall.