module docmodel.document
ClosingTag
OpeningTag
ProcessingStep
SourceDoc
Tag
TagRepository
TarsqiDocument
TarsqiInputError
TarsqiDocument and friends.

This module contains TarsqiDocument and some of the classes used by it.
class ClosingTag
Inherits from: Tag

Like Tag, but self.begin and self.attrs are always None.

Public Functions

__init__(self, name, offset)
__str__(self)
is_closing_tag(self)
class OpeningTag
Inherits from: Tag

Like Tag, but self.end is always None.

Public Functions

__init__(self, name, offset, attrs)
__str__(self)
is_opening_tag(self)
class ProcessingStep
Inherits from: object

Implements an element of the processing history in the metadata. The core
of the ProcessingStep is the pipeline that was executed. In addition, there
is some bookkeeping: the TTK version, the git commit if available, and a
timestamp. Note that the git commit does not uniquely define the state of
the code when the toolkit ran, because there could have been uncommitted
changes.

Public Functions

__init__(self, pipeline=None, dom_node=None)
Initialize from a pipeline or a DOM node.
__str__(self)
as_xml(self)

Private Functions

_initialize_from_dom_node(self, dom_node)
_initialize_from_pipeline(self, pipeline)
class SourceDoc
Inherits from: object

A SourceDoc is created by a SourceParser and contains source data and
annotations of those data. The source data are put in the text variable as a
unicode string; tags are in the source_tags and tarsqi_tags variables and
contain begin and end positions in the source. In addition, metadata,
comments, and any other data from the input are stored here.

Note that the SourceDoc is the input to further Tarsqi processing and it
stores everything that was given as input to the pipeline. This could be a
text document or a TimeBank document without any TimeML annotations, but it
could also be a TTK document that was the result of a prior application of
another pipeline, in which case the document can contain Tarsqi tags. The
metadata and tarsqi_tags will be exported to the relevant places in the
TarsqiDocument when the metadata parser and the document structure parser
are applied to the TarsqiDocument.

Public Functions

__getitem__(self, i)
__init__(self, filename='')
Initialize a SourceDoc on a filename or a string.
add_characters(self, string)
Add a character string to the source and increment the current offset. Used by the CharacterDataHandler of the Expat parser in SourceParserXML.
add_closing_tag(self, name)
Add a closing tag to source_tags. This is used by the EndElementHandler of the Expat parser in SourceParserXML.
add_comment(self, string)
add_opening_tag(self, name, attrs)
Add an opening tag to source_tags. This is used by the StartElementHandler of the Expat parser in SourceParserXML.
add_processing_instruction(self, target, data)
finish(self)
Transform the source text list into a string, merge the begin and end tags, and index the tags on offsets. This should be called by SourceParserXML, which uses the Expat parser and looks for individual elements. It is not needed by SourceParserTTK, which uses a DOM object, or by SourceParserText, which does not deal with tags.
pp(self)
Print source and tags.
print_source(self, filename)
Print the source string to a file, using the utf-8 encoding.
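
The interplay between add_opening_tag, add_closing_tag, add_characters and finish described above can be illustrated with a minimal sketch. It is not the TTK implementation: SketchSource and parse_xml are hypothetical names, and the tuples stored in tmp are an assumed stand-in for OpeningTag and ClosingTag instances. Only the xml.parsers.expat calls are standard library API.

```python
import xml.parsers.expat


class SketchSource:
    """Hypothetical stand-in for SourceDoc, for illustration only."""

    def __init__(self):
        self.chunks = []   # text pieces, joined into one string by finish()
        self.offset = 0    # current character offset in the source text
        self.tmp = []      # (kind, name, offset, attrs) tuples (assumed shape)
        self.text = None

    def add_opening_tag(self, name, attrs):
        # Record where the element starts in the character stream.
        self.tmp.append(('open', name, self.offset, attrs))

    def add_closing_tag(self, name):
        # Record where the element ends in the character stream.
        self.tmp.append(('close', name, self.offset, None))

    def add_characters(self, string):
        # Collect the text and advance the offset by its length.
        self.chunks.append(string)
        self.offset += len(string)

    def finish(self):
        # Turn the list of text pieces into a single string.
        self.text = ''.join(self.chunks)


def parse_xml(xml_string):
    """Feed an XML string through Expat and collect text and tag offsets."""
    source = SketchSource()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = source.add_opening_tag
    parser.EndElementHandler = source.add_closing_tag
    parser.CharacterDataHandler = source.add_characters
    parser.Parse(xml_string, True)
    source.finish()
    return source


doc = parse_xml('<doc>On <t type="DATE">Monday</t> we left.</doc>')
print(doc.text)
for entry in doc.tmp:
    print(entry)
```

Because offsets are counted over the character data only, the recorded positions refer to the tag-free text, which is what allows tags to be stored separately from the source string.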
class Tag
Inherits from: object

A Tag has a name, a begin offset, an end offset and a dictionary of
attributes. All arguments are handed in by the code that creates the Tag
which could be: (1) the code that parses the source document, which will
only assign an identifier if the source had an id attribute, (2) the
preprocessor code, which assigns identifiers for lex, ng, vg and s tags, or
(3) one of the components that creates tarsqi tags.

# TODO: check whether those are still the three that are used

Public Functions

__eq__(self, other)
__ge__(self, other)
__gt__(self, other)
__init__(self, name, o1, o2, attrs)
Initialize name, begin, end and attrs instance variables and make sure that what we have can be turned into valid XML by removing duplicate attribute names.
__le__(self, other)
__lt__(self, other)
__ne__(self, other)
__str__(self)
as_lex_xml_string(self, text)
Return an opening and closing tag wrapped around text. This is used only by the GUTime wrapper to create input for GUTime, and it therefore has a narrow focus and does not get all information from the tag.
as_ttk_tag(self)
Return the tag as a tag in the Tarsqi output format.
attributes_as_string(self)
Return a string representation of the attributes dictionary.
get_identifier(self)
Return the identifier of the event, timex or tlink if there is one, otherwise return None. For an event, the identifier is assumed to be the eiid.
is_closing_tag(self)
is_opening_tag(self)

Private Functions

_compare(self, other)
Order two Tags based on their begin and end offsets. Tags with an earlier begin will be ranked before tags with a later begin; with equal begins, the tag with the higher end will be ranked first. Tags with no begin (that is, it is set to -1) will be ordered at the end. The order of two tags with the same begin and end is undefined.
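
The ordering described for _compare can be expressed as a sort key. The following is a sketch under assumptions, not the actual method: OrderSketchTag and order_key are hypothetical names, and the real class implements the comparison through its rich comparison methods rather than through a key function.

```python
from collections import namedtuple

# Hypothetical minimal tag with just the fields the ordering needs.
OrderSketchTag = namedtuple('OrderSketchTag', 'name begin end')


def order_key(tag):
    # 1. Tags without a begin (begin == -1) sort after all other tags.
    # 2. Otherwise an earlier begin sorts first.
    # 3. With equal begins, the higher end sorts first (hence the negation).
    return (tag.begin == -1, tag.begin, -tag.end)


tags = [
    OrderSketchTag('lex', 543, 547),
    OrderSketchTag('TLINK', -1, -1),
    OrderSketchTag('s', 500, 600),
    OrderSketchTag('NG', 543, 560),
]
ordered = sorted(tags, key=order_key)
names = [tag.name for tag in ordered]
print(names)  # ['s', 'NG', 'lex', 'TLINK']
```

Ranking the higher end first means that an enclosing tag comes before the tags it contains when they start at the same offset.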
class TagRepository
Inherits from: object

Class that provides access to the tags for a document. An instance of this
class is used for the SourceDoc instance, other instances will be used for
the elements in a TarsqiDocument. For now, the repository has the following
structure:

self.tmp
   A list of OpeningTag and ClosingTag elements, used only to build the tags
   list.

self.tags
   A list with Tag instances.

self.opening_tags
   A dictionary of tags indexed on begin offset, the values are lists of Tag
   instances, ordered on id (thereby reflecting text order, but only for
   tags in the original input).

self.closing_tags
   A dictionary indexed on end offset and begin offset, the values are
   dictionaries of tagnames. For example,
      closing_tags[547][543] = {'lex': True, 'NG': True}
   indicates that there is both a lex tag and an NG tag from 543-547. The
   opening tags dictionary will have encoded that the opening NG occurs
   before the opening lex:
      opening_tags[543] = [<Tag 204 NG 543-547 {}>, <Tag 205 lex 543-547 {...}>]
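
As a rough illustration of the two index structures, the sketch below builds them from a list of tags. IndexSketchTag and build_indexes are hypothetical names, and the id field is an assumption used to reproduce the ordering in the example above. The repository's own index method may differ in detail.

```python
from collections import namedtuple

# Hypothetical minimal tag; the id field is assumed for ordering purposes.
IndexSketchTag = namedtuple('IndexSketchTag', 'id name begin end')


def build_indexes(tags):
    """Return (opening_tags, closing_tags) dictionaries for the given tags."""
    opening_tags = {}
    closing_tags = {}
    for tag in sorted(tags, key=lambda t: t.id):
        # begin offset -> list of tags starting there, ordered on id
        opening_tags.setdefault(tag.begin, []).append(tag)
        # end offset -> begin offset -> {tagname: True}
        by_begin = closing_tags.setdefault(tag.end, {})
        by_begin.setdefault(tag.begin, {})[tag.name] = True
    return opening_tags, closing_tags


tags = [
    IndexSketchTag(205, 'lex', 543, 547),
    IndexSketchTag(204, 'NG', 543, 547),
]
opening_tags, closing_tags = build_indexes(tags)
print([t.name for t in opening_tags[543]])  # ['NG', 'lex']
print(closing_tags[547][543])
```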

Public Functions

__init__(self)
add_tag(self, name, begin, end, attrs)
Add a tag to the tags list and the opening_tags and closing_tags dictionaries.
add_tmp_tag(self, tag_instance)
Add an OpeningTag or ClosingTag to a temporary list. Used by the XML handlers.
all_tags(self)
append(self, tag)
Appends an instance of Tag to the tags list.
find_linktags(self, name, o1, o2)
Return all the link tags with type name. Only include the ones that fall between offsets o1 and o2.
find_tag(self, name)
Return the first Tag object with name=name, return None if no such tag exists.
find_tags(self, name, begin=None, end=None)
Return all tags of this name. If the optional begin and end are given only return the tags that fall within those boundaries.
find_tags_at(self, begin_offset)
Return the list of tags which start at begin_offset.
import_tags(self, tag_repository, tagname)
Import all tags with name=tagname from tag_repository into self. This is mostly used when we want to take tags from the SourceDoc and add them to the tags on the TarsqiDocument.
index(self)
Index tags on position.
index_events(self)
index_timexes(self)
is_empty(self)
merge(self)
Take the OpeningTags and ClosingTags in self.tmp and merge them into Tags. Raise errors if tags do not match.
pp(self, indent=' ')
pp_closing_tags(self)
pp_opening_tags(self)
pp_tags(self, indent='')
remove_tag(self, tag)
Remove the tag from the list of tags. This is rather inefficient since the whole list is traversed. Also note that this method does not remove the tag from the opening_tags and closing_tags dictionaries, so depending on when this is done these may need to be re-indexed.
remove_tags(self, tagname)
Remove all tags with name=tagname. Rebuilds the indexes after removing the tags.
reset(self)
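
The merge step, which pairs the OpeningTag and ClosingTag elements in self.tmp, can be pictured as a stack-based match. This is a sketch under assumptions rather than the TagRepository code: Opening, Closing, Merged and merge_tmp are hypothetical names, and ValueError stands in for whatever error the real method raises when tags do not match.

```python
from collections import namedtuple

# Hypothetical lightweight stand-ins for OpeningTag, ClosingTag and Tag.
Opening = namedtuple('Opening', 'name offset attrs')
Closing = namedtuple('Closing', 'name offset')
Merged = namedtuple('Merged', 'name begin end attrs')


def merge_tmp(tmp):
    """Pair opening and closing tags, given in document order."""
    stack = []
    merged = []
    for item in tmp:
        if isinstance(item, Opening):
            stack.append(item)
            continue
        # A closing tag must match the most recently opened tag.
        if not stack or stack[-1].name != item.name:
            raise ValueError('unmatched closing tag: %s' % item.name)
        opening = stack.pop()
        merged.append(
            Merged(opening.name, opening.offset, item.offset, opening.attrs))
    if stack:
        raise ValueError('unclosed tag: %s' % stack[-1].name)
    return merged


tmp = [
    Opening('doc', 0, {}),
    Opening('t', 3, {'type': 'DATE'}),
    Closing('t', 9),
    Closing('doc', 18),
]
merged = merge_tmp(tmp)
for tag in merged:
    print(tag)
```

Tags come out in the order in which they are closed, so inner tags precede the tags that enclose them. A later sort or index step would be needed to restore text order.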
class TarsqiDocument
Inherits from: object

An instance of TarsqiDocument should contain all information that may be
needed by the wrappers to do their work. It includes the source, metadata,
processing options, a set of identifier counters and a TagRepository.

Instance Variables:
   source    -  an instance of SourceDoc
   metadata  -  a dictionary
   options   -  the Options instance from the Tarsqi instance
   tags      -  an instance of TagRepository
   counters  -  a set of counters used to create unique identifiers

Note that the processing options are available to the wrappers only through
this class by accessing the options variable.

Public Functions

__init__(self)
__str__(self)
add_event(self, begin, end, attrs)
Add an EVENT tag to the tarsqi_tags tag repository.
add_options(self, options)
add_timex(self, begin, end, attrs)
Add a TIMEX3 tag to the tag repository.
elements(self)
Method that returns the tags that contain paragraphs, that is, the tags of type docelement.
events(self)
Convenience method for easy access to events.
get_dct(self)
has_event(self, begin, end)
Return True if there is already an event at the given begin and end.
list_of_sentences(self)
next_event_id(self)
next_link_id(self, link_type)
Return a unique lid. The link_type argument is one of {ALINK, SLINK, TLINK} and determines what link counter is incremented. The lid itself is the sum of all the link counts. Assumes that all links are added using the link counters in the document. Breaks down if there are already links added without using those counters.
next_timex_id(self)
pp(self, source_tags=True, tarsqi_tags=True)
print_all(self, fname=None)
Write source string, metadata, comments, source tags and tarsqi tags all to one file or to the standard output.
print_all_lif(self, fh)
print_sentences(self, fname=None)
Write to file (or standard output if no filename was given) a Python variable assignment where the content of the variable is the list of sentences as a list of lists of token strings.
print_source(self, fname)
Print the original source of the document, without the tags, to file fname.
remove_tlinks(self)
Remove all TLINK tags from the tags repository.
slinks(self)
text(self, p1, p2)
timexes(self)
Convenience method for easy access to timexes.
tlinks(self)
update_processing_history(self, pipeline)

Private Functions

_print_comments(self, fh)
_print_metadata(self, fh)
_print_tags(self, fh, tag_group, tags)
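
The counting scheme described for next_link_id, where the lid is derived from the sum of all link counters, can be sketched as follows. LinkCounters is a hypothetical class, and the 'l' prefix on the identifier is an assumption. The actual counters live on the TarsqiDocument instance.

```python
class LinkCounters:
    """Hypothetical holder for per-type link counters."""

    def __init__(self):
        self.counts = {'ALINK': 0, 'SLINK': 0, 'TLINK': 0}

    def next_link_id(self, link_type):
        # Increment the counter for this link type only ...
        self.counts[link_type] += 1
        # ... but number the link by the total across all types, so that
        # identifiers are unique over ALINKs, SLINKs and TLINKs together.
        return 'l%d' % sum(self.counts.values())


counters = LinkCounters()
ids = [
    counters.next_link_id('TLINK'),
    counters.next_link_id('SLINK'),
    counters.next_link_id('TLINK'),
]
print(ids)  # ['l1', 'l2', 'l3']
```

The sketch also shows why the scheme breaks down when links were added without going through the counters: the sum then no longer reflects the number of links in the document, and a newly generated identifier may collide with an existing one.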
class TarsqiInputError
Inherits from: Exception