Prof. Dr. Christina Niklaus

Assistant Professor in Computer Science with focus on “Databases and Data Engineering”





University of St. Gallen

Institute of Computer Science



MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions


Conference paper


Christina Niklaus, André Freitas, Siegfried Handschuh
Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, Tokyo, Japan, October 2019, pp. 118–123


APA
Niklaus, C., Freitas, A., & Handschuh, S. (2019). MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions. In Proceedings of the 12th International Conference on Natural Language Generation (pp. 118–123). Tokyo, Japan: Association for Computational Linguistics.


Chicago/Turabian
Niklaus, Christina, André Freitas, and Siegfried Handschuh. “MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions.” In Proceedings of the 12th International Conference on Natural Language Generation, 118–123. Tokyo, Japan: Association for Computational Linguistics, 2019.


MLA
Niklaus, Christina, et al. “MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions.” Proceedings of the 12th International Conference on Natural Language Generation, Association for Computational Linguistics, 2019, pp. 118–23.


BibTeX

@inproceedings{niklaus2019a,
  title = {MinWikiSplit: A Sentence Splitting Corpus with Minimal Propositions},
  year = {2019},
  month = oct,
  address = {Tokyo, Japan},
  pages = {118--123},
  publisher = {Association for Computational Linguistics},
  author = {Niklaus, Christina and Freitas, André and Handschuh, Siegfried},
  booktitle = {Proceedings of the 12th International Conference on Natural Language Generation},
  month_numeric = {10}
}

Abstract

We compiled a new sentence splitting corpus composed of 203K pairs of aligned complex source and simplified target sentences. In contrast to previously proposed text simplification corpora, which contain only a small number of split examples, we present a dataset in which each input sentence is broken down into a set of minimal propositions, i.e., a sequence of sound, self-contained utterances, each presenting a minimal semantic unit that cannot be further decomposed into meaningful propositions. This corpus is useful for developing sentence splitting approaches that learn to transform sentences with a complex linguistic structure into a fine-grained representation of short sentences with a simpler and more regular structure. Such sentences are easier for downstream applications to process, which facilitates and improves their performance.
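
To make the notion of an aligned pair concrete, the following is a minimal Python sketch of how one complex source sentence and its minimal propositions might be held in memory. The `SplitPair` type, the `mean_propositions_per_source` helper, and the example sentence are all hypothetical illustrations invented here; they are not taken from the corpus and do not reflect its actual file format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitPair:
    """Hypothetical container for one aligned example.

    source:       the complex input sentence
    propositions: the short, self-contained target sentences
    """
    source: str
    propositions: List[str]


def mean_propositions_per_source(pairs: List[SplitPair]) -> float:
    """Average number of target sentences per source sentence."""
    if not pairs:
        return 0.0
    return sum(len(p.propositions) for p in pairs) / len(pairs)


# Invented example for illustration only (not a corpus entry).
example = SplitPair(
    source=(
        "Although it was raining, Anna walked to the station, "
        "which is near the river."
    ),
    propositions=[
        "It was raining.",
        "Anna walked to the station.",
        "The station is near the river.",
    ],
)

print(len(example.propositions))
print(mean_propositions_per_source([example]))
```

Each target sentence in the sketch stands on its own, which is the property the abstract attributes to minimal propositions.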

