sentsplit
A flexible sentence segmentation library using a CRF model and regex rules
Working in the field of natural language processing, I am often faced with the task of segmenting a text passage into multiple sentences.
sentsplit is a sentence segmentation library for Python, designed with the following desiderata in mind:
- Be able to extend to new languages or “types” of sentences from the data alone.
- But also provide functionality to segment (or not to segment) lines based on regular expression rules.
- Be able to reconstruct the exact original text paragraphs by joining the segmented sentences.
The basic idea is to train and run a conditional random field (CRF) model on the text in the first pass, then apply regex rules in the second pass for finer control, hopefully allowing for more flexibility.
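To make the two-pass idea concrete, here is a minimal sketch in plain Python. It is not sentsplit's actual API or implementation: `first_pass` is a hypothetical stand-in that uses a naive punctuation rule where the library would use a trained CRF model, and the two regex rules (`NO_SPLIT_AFTER` and `FORCE_SPLIT_AFTER`) are illustrative assumptions. The sketch also keeps all whitespace attached to the segments, so joining them reproduces the original text exactly.

```python
import re
from typing import List

# Hypothetical stand-in for the first pass. A trained CRF model would predict
# the boundaries; this naive rule simply cuts after runs of ., ! or ?
# and keeps any trailing whitespace attached to the segment.
def first_pass(text: str) -> List[str]:
    return re.findall(r".+?(?:[.!?]+\s*|\Z)", text, flags=re.DOTALL)

# Illustrative second-pass rules (assumptions, not the library's rule set).
# Do NOT segment right after these abbreviations:
NO_SPLIT_AFTER = re.compile(r"\b(?:Dr|Mr|Mrs|Ms|Prof)\.\s*$")
# DO segment after a semicolon:
FORCE_SPLIT_AFTER = re.compile(r".+?(?:;\s*|\Z)", flags=re.DOTALL)

def second_pass(pieces: List[str]) -> List[str]:
    # Rule 1: merge a segment into the previous one if the previous one
    # ends with a protected abbreviation.
    merged: List[str] = []
    for piece in pieces:
        if merged and NO_SPLIT_AFTER.search(merged[-1]):
            merged[-1] += piece
        else:
            merged.append(piece)
    # Rule 2: split further wherever the forced-split pattern matches.
    result: List[str] = []
    for piece in merged:
        result.extend(FORCE_SPLIT_AFTER.findall(piece))
    return result

def segment(text: str) -> List[str]:
    return second_pass(first_pass(text))

text = "Dr. Smith arrived; he sat down. Then he left."
sentences = segment(text)
print(sentences)
# ['Dr. Smith arrived; ', 'he sat down. ', 'Then he left.']

# No characters are added or dropped, so the original text is recoverable.
assert "".join(sentences) == text
```

Because neither pass ever discards or normalizes characters, the round-trip property holds by construction, whichever rules are added to the second pass.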
A demo of this library is available here.