sentsplit

A flexible sentence segmentation library using a CRF model and regex rules

Working in the field of natural language processing, I am often faced with the task of segmenting a text passage into multiple sentences.

sentsplit is a sentence segmentation library for Python with the following desiderata in mind:

  • Be able to extend to new languages or “types” of sentences from the data alone.
  • Also provide a way to segment (or not to segment) lines based on regular expression rules.
  • Be able to reconstruct the exact original paragraphs by joining the segmented sentences.
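
The last requirement can be stated as a simple property: joining the output with the empty string must give back the input. The sketch below illustrates that property with a plain regex splitter. It is not sentsplit's API or its model. The function name `split_lossless` and the splitting rule are assumptions made for the example.

```python
import re

def split_lossless(text):
    """Split after runs of . ! or ?, keeping any trailing whitespace
    attached to the preceding sentence so that no character is dropped.

    Hypothetical helper for illustration; not part of sentsplit.
    """
    # First alternative: shortest span ending in sentence-final punctuation
    # plus the whitespace that follows it. Second alternative: whatever is
    # left at the end of the text without final punctuation.
    return re.findall(r".+?[.!?]+\s*|.+", text, flags=re.DOTALL)

text = "Hello there.  How are you?\nFine"
sentences = split_lossless(text)
print(sentences)                    # ['Hello there.  ', 'How are you?\n', 'Fine']
print("".join(sentences) == text)   # True
```

Because whitespace and newlines stay inside the segments rather than being stripped, the original paragraph layout survives a split-and-join round trip.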

The basic idea is to train and run a conditional random field (CRF) model on the text in the first pass, then apply regex rules in the second pass for more control, which should allow for more flexibility.
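
A minimal sketch of this two-pass structure follows. It is an illustration under stated assumptions, not the library's implementation: the first pass here is a trivial regex stand-in where the real library would use a trained CRF, and every name (`first_pass`, `second_pass`, `segment`, and the two rule lists) is hypothetical.

```python
import re

# Hypothetical second-pass rules, not sentsplit's actual configuration.
# A proposed boundary is vetoed when the text to its left ends like this.
NO_SPLIT_RULES = [re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\.\s*$")]
# A boundary is forced at the end of every match of these patterns.
FORCE_SPLIT_RULES = [re.compile(r"\n{2,}")]

def first_pass(text):
    """Stand-in for the statistical model: propose a boundary after each
    run of . ! ? together with the whitespace that follows it."""
    return {m.end() for m in re.finditer(r"[.!?]+\s*", text)}

def second_pass(text, boundaries):
    """Apply regex rules on top of the proposed boundary offsets."""
    kept = {b for b in boundaries
            if not any(rule.search(text[:b]) for rule in NO_SPLIT_RULES)}
    for rule in FORCE_SPLIT_RULES:
        kept.update(m.end() for m in rule.finditer(text))
    return kept

def segment(text):
    """Cut the text at the final boundary offsets without dropping characters."""
    if not text:
        return []
    cuts = sorted(b for b in second_pass(text, first_pass(text))
                  if 0 < b < len(text))
    return [text[s:e] for s, e in zip([0] + cuts, cuts + [len(text)])]

text = "Dr. Smith arrived. He was late.\n\nNext paragraph"
print(segment(text))
# ['Dr. Smith arrived. ', 'He was late.\n\n', 'Next paragraph']
```

Keeping the second pass as a filter over boundary offsets means rules can both suppress a split the model proposed (the "Dr." case) and add one it missed (the blank-line case), while slicing by offsets keeps the output joinable back into the original text.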

A demo of this library is available here.