module docmodel.docstructure_parser
DocumentStructureParser
Document Structure Parser.
This module contains a minimal document structure parser. It is meant as a
temporary default and will be replaced by more sophisticated parsers, which
will act more like the other Tarsqi components.
The main goal of the parser is to add docelement tags to the tag repository on
the TarsqiDocument. Sometimes docelement tags already exist in the tag
repository (for example when reading a ttk file), in which case the parser does
nothing. Otherwise, the parser calls a simple method to recognize paragraphs and
creates a docelement Tag for each of them.
Tarsqi components use the docelements by looping over them and processing
them one by one.
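
The behaviour described above can be sketched as follows. This is an illustration only, not the module's code: Tag, Document, paragraph_offsets and the free-standing parse function are hypothetical stand-ins for the real Tag, TarsqiDocument and parser classes, and the assumption that a docelement carries begin and end character offsets is made for the sake of the example.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    # Hypothetical stand-in for a tag in the tag repository.
    name: str
    begin: int
    end: int


@dataclass
class Document:
    # Hypothetical stand-in for a TarsqiDocument: source text plus tags.
    text: str
    tags: list = field(default_factory=list)


def paragraph_offsets(text):
    """Return (begin, end) offsets of runs of non-blank lines."""
    offsets, begin, end, pos = [], None, None, 0
    for line in text.splitlines(keepends=True):
        if line.strip():
            if begin is None:
                begin = pos
            # Exclude trailing whitespace and the line break from the span.
            end = pos + len(line.rstrip())
        elif begin is not None:
            # A blank line closes the current paragraph.
            offsets.append((begin, end))
            begin = None
        pos += len(line)
    if begin is not None:
        offsets.append((begin, end))
    return offsets


def parse(doc):
    """Add one docelement tag per paragraph unless such tags already exist."""
    if any(tag.name == "docelement" for tag in doc.tags):
        return  # existing structure is left untouched
    for begin, end in paragraph_offsets(doc.text):
        doc.tags.append(Tag("docelement", begin, end))


doc = Document("First paragraph.\n\nSecond paragraph,\nstill second.\n")
parse(doc)
parse(doc)  # second call does nothing: docelement tags are already present

# A downstream component would process the elements one by one.
for element in (t for t in doc.tags if t.name == "docelement"):
    print(element.begin, element.end, repr(doc.text[element.begin:element.end]))
```

Keeping the "do nothing if tags exist" check inside the parser means callers can invoke it unconditionally, whatever the input format was.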
class DocumentStructureParser
Inherits from: object
Simple document structure parser used as a default if no structure tags are
found in the tag repository of the TarsqiDocument.
Public Functions
parse(self, tarsqidoc)
Apply a default document structure parser to the TarsqiDocument if
there are no docelement tags in the tag repository. The parser uses
blank lines to separate the paragraphs.
module functions
slurp(text, offset, test)
Starting at offset in text, find a substring where all characters pass
test. Return the begin and end position and the substring.
slurp_space(text, offset)
Starting at offset consume a string of space characters, then return the
begin and end position and the consumed string.
slurp_token(text, offset)
Starting at offset consume a string of non-space characters, then return
the begin and end position and the consumed string.
split_paragraphs(text)
Very simplistic way to split a text into paragraphs, simply by looking
for empty lines.
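
One way the four module functions could fit together is sketched below. The bodies are written from the descriptions above and are not the module's source. In particular, the return value of split_paragraphs is not documented here, so returning a list of (begin, end) offset pairs is an assumption, as is treating any whitespace run containing two or more newlines as an empty line.

```python
def slurp(text, offset, test):
    """Return (begin, end, substring) for the run of characters starting at
    offset that all pass test. The run may be empty."""
    begin = end = offset
    while end < len(text) and test(text[end]):
        end += 1
    return begin, end, text[begin:end]


def slurp_space(text, offset):
    """Consume whitespace characters starting at offset."""
    return slurp(text, offset, str.isspace)


def slurp_token(text, offset):
    """Consume non-whitespace characters starting at offset."""
    return slurp(text, offset, lambda char: not char.isspace())


def split_paragraphs(text):
    """Return (begin, end) offsets of paragraphs separated by empty lines.
    The return type is an assumption made for this sketch."""
    paragraphs = []
    par_begin = par_end = None
    offset = 0
    while offset < len(text):
        _, space_end, space = slurp_space(text, offset)
        # Two or more newlines in one whitespace run mean an empty line.
        if space.count("\n") >= 2 and par_begin is not None:
            paragraphs.append((par_begin, par_end))
            par_begin = None
        tok_begin, tok_end, token = slurp_token(text, space_end)
        if token:
            if par_begin is None:
                par_begin = tok_begin
            par_end = tok_end
        offset = tok_end
    if par_begin is not None:
        paragraphs.append((par_begin, par_end))
    return paragraphs


text = "One two.\n\nThree.\n"
for begin, end in split_paragraphs(text):
    print(begin, end, repr(text[begin:end]))
```

Alternating slurp_space and slurp_token keeps every position tied to the original text, so the offsets can be used directly as tag boundaries without re-aligning after a split.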