Prof. Dr. Christina Niklaus

Assistant Professor in Computer Science with focus on “Databases and Data Engineering”

University of St. Gallen

MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions

Conference paper

Christina Niklaus, André Freitas, Siegfried Handschuh
Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, Tokyo, Japan, October 2019, pp. 118–123


Cite

APA
Niklaus, C., Freitas, A., & Handschuh, S. (2019). MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions. In Proceedings of the 12th International Conference on Natural Language Generation (pp. 118–123). Tokyo, Japan: Association for Computational Linguistics.

Chicago/Turabian
Niklaus, Christina, André Freitas, and Siegfried Handschuh. “MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions.” In Proceedings of the 12th International Conference on Natural Language Generation, 118–123. Tokyo, Japan: Association for Computational Linguistics, 2019.

MLA
Niklaus, Christina, et al. “MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions.” Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, 2019, pp. 118–23.

BibTeX

@inproceedings{niklaus2019a,
  title = {MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions},
  year = {2019},
  month = oct,
  address = {Tokyo, Japan},
  pages = {118--123},
  publisher = {Association for Computational Linguistics},
  author = {Niklaus, Christina and Freitas, André and Handschuh, Siegfried},
  booktitle = {Proceedings of the 12th International Conference on Natural Language Generation},
  month_numeric = {10}
}

Abstract

We compiled a new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences. In contrast to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset in which each input sentence is broken down into a set of minimal propositions, i.e., a sequence of sound, self-contained utterances, each presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. The corpus is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simpler, more regular structure. Such sentences are easier for downstream applications to process, which facilitates and improves their performance.
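As a rough illustration of what one aligned pair might look like in code, the following Python sketch models a complex source sentence together with its minimal propositions. The on-disk layout assumed here (one pair per line, a tab between source and targets, and ` ||| ` between target sentences) is hypothetical and may not match the released files, and the example sentence is invented for illustration rather than taken from the corpus. The names `SplitPair` and `parse_pair` are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical separators: the actual file layout of the corpus may differ.
FIELD_SEP = "\t"
SENTENCE_SEP = " ||| "


@dataclass(frozen=True)
class SplitPair:
    """One aligned example: a complex source and its simplified targets."""
    source: str
    targets: Tuple[str, ...]


def parse_pair(line: str) -> SplitPair:
    """Parse one line of the assumed 'source<TAB>target ||| target' layout."""
    source, _, target_field = line.rstrip("\n").partition(FIELD_SEP)
    if not target_field:
        raise ValueError("expected a source field and a target field")
    # Split the target field into individual propositions, dropping empties.
    targets = tuple(
        s.strip() for s in target_field.split(SENTENCE_SEP) if s.strip()
    )
    return SplitPair(source=source.strip(), targets=targets)


# Invented example for illustration only; it is not taken from the corpus.
EXAMPLE_LINE = (
    "The house, which was built in 1900, was sold by the owner.\t"
    "The house was built in 1900. ||| The house was sold by the owner."
)

pair = parse_pair(EXAMPLE_LINE)
for proposition in pair.targets:
    print(proposition)
```

Keeping the targets as a tuple of separate sentences, rather than one concatenated string, makes it straightforward to count propositions per source sentence or to feed them individually to a downstream component.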