module docmodel.source_parser
SourceParser
SourceParserLIF
SourceParserTTK
SourceParserText
SourceParserXML
Source parsers for Toolkit input

Module that contains classes to parse and represent the input document. All
parsers have a parse_file() and a parse_string() method, and these methods return
an instance of TarsqiDocument. The only thing the parsers in this file do to that
instance is set its source instance variable, which contains an instance of
SourceDoc.

Which parser is used is defined in main.py, which has a mapping from source types
(handed in by the --source-format command line option) to source parsers.
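The mapping described above could look roughly like the sketch below. The format keys ("xml", "text", "ttk", "lif"), the PARSERS name and the create_source_parser() helper are assumptions made for illustration, and the classes are empty stand-ins for the real ones; the actual code in main.py may be organized differently.

```python
class SourceParser(object):
    """Empty stand-in for docmodel.source_parser.SourceParser."""


class SourceParserXML(SourceParser):
    pass


class SourceParserText(SourceParser):
    pass


class SourceParserTTK(SourceParser):
    pass


class SourceParserLIF(SourceParser):
    pass


# Hypothetical mapping from --source-format values to parser classes.
PARSERS = {
    "xml": SourceParserXML,
    "text": SourceParserText,
    "ttk": SourceParserTTK,
    "lif": SourceParserLIF,
}


def create_source_parser(source_format):
    """Return a new parser instance for the given source format."""
    try:
        parser_class = PARSERS[source_format]
    except KeyError:
        raise ValueError("Unknown source format: %s" % source_format)
    return parser_class()
```

Keeping the mapping in one dictionary means that supporting a new input format only requires a new parser class and one extra entry.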

There are now four parsers:

SourceParserXML
   A simple XML parser that splits inline XML into a source string and a list of
   tags. The source string and the tags are stored in the SourceDoc instance,
   which is intended to provide just enough functionality to deal with the input
   in a read-only fashion, that is, additional annotations should not be added
   to this instance.

SourceParserText
   Simply puts the entire text in the SourceDoc instance and leaves the
   TagsRepository empty.

SourceParserTTK
   This parser deals with the TTK format. In the TTK format there are two main
   sources for tags: source_tags and tarsqi_tags. The first are added to the
   tags repository on the SourceDoc (which is considered read-only after that),
   the second are added to the tags repository on the TarsqiDocument.

SourceParserLIF
   Takes the LIF format as input. This results in a source document with empty
   tag repositories. Annotations in the LIF input are stored in a special
   variable on the source document (SourceDoc.lif) so they can be used later when
   producing output.
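
The way SourceParserXML splits inline XML into a source string and a list of tags can be illustrated with a minimal sketch. The function name split_inline_xml() and the (name, begin, end, attrs) tuple layout are assumptions made for this example; the real parser stores its results in a SourceDoc instance instead of returning them.

```python
import xml.parsers.expat


def split_inline_xml(xml_string):
    """Split inline XML into its text and a list of tags with text offsets."""
    chunks = []   # pieces of character data, in document order
    tags = []     # finished (name, begin, end, attrs) tuples
    stack = []    # opening tags still waiting for their closing tag
    offset = 0    # number of text characters seen so far

    def handle_start(name, attrs):
        # Remember where in the text this tag opens.
        stack.append((name, offset, attrs))

    def handle_end(name):
        # The closing tag gives us the end offset of the innermost open tag.
        opened_name, begin, attrs = stack.pop()
        tags.append((opened_name, begin, offset, attrs))

    def handle_characters(data):
        nonlocal offset
        chunks.append(data)
        offset += len(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = handle_start
    parser.EndElementHandler = handle_end
    parser.CharacterDataHandler = handle_characters
    parser.Parse(xml_string, True)
    return "".join(chunks), tags


text, tags = split_inline_xml(
    '<doc>Fido <EVENT class="OCCURRENCE">barks</EVENT>.</doc>')
```

Here text holds the document without markup, and each tag records the character offsets it spans in that text, so the original inline XML can be treated as read-only input.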
class SourceParser
Inherits from: object
class SourceParserLIF
Inherits from: SourceParser

Public Functions

__init__(self)
Just declares the variable for the LIF object.
parse_file(self, filename, tarsqidoc)
Parse the LIF file and put the contents in the appropriate parts of the SourceDoc.
parse_string(self, text, tarsqidoc)
Parse the LIF string and put the contents in the appropriate parts of the SourceDoc.
class SourceParserTTK
Inherits from: SourceParser

Public Functions

__init__(self)
Initialize the three variables dom, topnodes and sourcedoc.
parse_file(self, filename, tarsqidoc)
Parse the TTK file and put the contents in the appropriate parts of the SourceDoc.
parse_string(self, text, tarsqidoc)
Parse the TTK string and put the contents in the appropriate parts of the SourceDoc.

Private Functions

_add_comments(self)
_add_metadata(self)
_add_source_tags(self)
Add the source_tags in the TTK document to the tags repository on the SourceDoc.
_add_tarsqi_tags(self)
Add the tarsqi_tags in the TTK document to the tags repository on the TarsqiDocument.
_add_to_source_tags(self, node)
_add_to_tag_repository(self, node, tag_repository)
_add_to_tarsqi_tags(self, node)
_load_topnodes(self)
Fills the topnodes dictionary with the text, metadata, source_tags, tarsqi_tags and comment keys.
_parse(self, tarsqidoc)
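
The split between source_tags and tarsqi_tags might be sketched as follows. The sample document layout, the "comments" key, the begin/end attributes and the helper names load_topnodes() and collect_tags() are all assumptions made for illustration; the actual TTK format and the private methods listed above may differ in detail.

```python
from xml.dom import minidom

# Hypothetical TTK document; the exact layout of the real format is assumed.
TTK_SAMPLE = """<ttk>
<text>Fido barks.</text>
<metadata/>
<source_tags>
<s begin="0" end="11"/>
</source_tags>
<tarsqi_tags>
<EVENT begin="5" end="10" class="OCCURRENCE"/>
</tarsqi_tags>
<comments/>
</ttk>"""

TOP_LEVEL = ("text", "metadata", "source_tags", "tarsqi_tags", "comments")


def load_topnodes(dom):
    """Map each expected top-level element name to its DOM node, or None."""
    topnodes = {name: None for name in TOP_LEVEL}
    for node in dom.documentElement.childNodes:
        if node.nodeType == node.ELEMENT_NODE and node.tagName in topnodes:
            topnodes[node.tagName] = node
    return topnodes


def collect_tags(container):
    """Turn the element children of container into (name, begin, end, attrs)."""
    tags = []
    if container is None:
        return tags
    for node in container.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            attrs = dict(node.attributes.items())
            begin = int(attrs.pop("begin"))
            end = int(attrs.pop("end"))
            tags.append((node.tagName, begin, end, attrs))
    return tags


dom = minidom.parseString(TTK_SAMPLE)
topnodes = load_topnodes(dom)
text = topnodes["text"].firstChild.data
# These would go to the tags repository on the SourceDoc (read-only afterwards).
source_tags = collect_tags(topnodes["source_tags"])
# These would go to the tags repository on the TarsqiDocument.
tarsqi_tags = collect_tags(topnodes["tarsqi_tags"])
```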
class SourceParserText
Inherits from: SourceParser

Public Functions

parse_file(self, filename, tarsqidoc)
Parses filename and returns a SourceDoc. Simply dumps the full file content into the text variable of the SourceDoc.
parse_string(self, text, tarsqidoc)
Parses a text string and returns a SourceDoc. Simply dumps the full string into the text variable of the SourceDoc.
class SourceParserXML
Inherits from: SourceParser

Simple XML parser, using the Expat parser.

Instance variables
   encoding - a string
   sourcedoc - an instance of SourceDoc
   parser - an Expat parser

Public Functions

__init__(self, encoding='utf-8')
Set up the Expat parser.
parse_file(self, filename, tarsqidoc)
Parses filename and returns a SourceDoc. Uses the ParseFile routine of the expat parser, where all the handlers are set up to fill in the text and tags in SourceDoc.
parse_string(self, text, tarsqidoc)
Parses a text string and returns a SourceDoc. Uses the expat parser, where all the handlers are set up to fill in the text and tags in SourceDoc.

Private Functions

_debug(self, *rest)
_handle_characters(self, string)
Handle character data by asking the SourceDoc to add the data. This will not necessarily add a contiguous string of character data as one data element. This should include ignorable whitespace, but see the comment in the method below; I apparently had reason to think otherwise.
_handle_comment(self, data)
Store comments.
_handle_default(self, string)
Handle default data by asking the SourceDoc to add it as characters. This is here to get the 'ignorable' whitespace, which I do not want to ignore.
_handle_end(self, name)
Add closing tags to the SourceDoc.
_handle_processing_instruction(self, target, data)
Store processing instructions.
_handle_start(self, name, attrs)
Handle opening tags. Takes two arguments: a tag name and a dictionary of attributes. Asks the SourceDoc instance in the sourcedoc variable to add an opening tag.
_handle_xmldecl(self, version, encoding, standalone)
Store the XML declaration.
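
The remark under _handle_characters that contiguous character data is not necessarily added as one data element reflects how Expat reports text: a single run of text may arrive in several callback calls, for example around entity references. The sketch below shows this with the standard xml.parsers.expat module; the helper name collect_character_chunks() is made up for the example and is not part of this module.

```python
import xml.parsers.expat


def collect_character_chunks(xml_string, buffer_text=False):
    """Return the strings handed to the character data handler, in order."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # With buffer_text set, the parser merges adjacent text where possible.
    parser.buffer_text = buffer_text
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_string, True)
    return chunks


unbuffered = collect_character_chunks("<p>one &amp; two</p>")
buffered = collect_character_chunks("<p>one &amp; two</p>", buffer_text=True)

# Both lists join to the same text; only the chunking may differ.
print(unbuffered)
print(buffered)
```

A handler that appends each chunk to a running text, as described above, is not affected by how the data is split, because only the concatenation matters.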
module functions
print_dom(node, indent=0)
Debugging method.
replace_newline(text)
Just used for debugging. Do not use this elsewhere, because it is dangerous: it turns unicode into non-unicode.