module docmodel.document
ClosingTag
OpeningTag
ProcessingStep
SourceDoc
Tag
TagRepository
TarsqiDocument
TarsqiInputError
TarsqiDocument and friends.
This module contains TarsqiDocument and some of the classes used by it.
class ClosingTag
Inherits from: Tag
Like Tag, but self.begin and self.attrs are always None.
Public Functions
__init__(self, name, offset)
__str__(self)
is_closing_tag(self)
class OpeningTag
Inherits from: Tag
Like Tag, but self.end is always None.
Public Functions
__init__(self, name, offset, attrs)
__str__(self)
is_opening_tag(self)
class ProcessingStep
Inherits from: object
Implements an element of the processing history in the metadata. The core
of the ProcessingStep is the pipeline that was executed. In addition, some
bookkeeping information is included: the TTK version, the git commit if
available, and a timestamp. Note that the git commit does not uniquely define
the state of the code when the toolkit ran because there could have been
uncommitted changes.
Public Functions
__init__(self, pipeline=None, dom_node=None)
Initialize from a pipeline or a DOM node.
__str__(self)
as_xml(self)
Private Functions
_initialize_from_dom_node(self, dom_node)
_initialize_from_pipeline(self, pipeline)
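The record described for ProcessingStep can be sketched roughly as follows. This is a minimal illustration, not the TTK implementation: the StepRecord class, the element and attribute names, the timestamp format, and the representation of the pipeline as a list of component names are all assumptions.

```python
import time
from xml.sax.saxutils import quoteattr


class StepRecord:
    """Hypothetical stand-in for ProcessingStep (names are illustrative)."""

    def __init__(self, pipeline, version, git_commit=None, timestamp=None):
        self.pipeline = list(pipeline)   # assumed: a list of component names
        self.version = version           # the toolkit version string
        self.git_commit = git_commit     # None when no commit is available
        self.timestamp = timestamp or time.strftime("%Y%m%d:%H%M%S")

    def as_xml(self):
        # Attribute names are assumptions made for this sketch.
        attrs = [("pipeline", ",".join(self.pipeline)),
                 ("ttk_version", self.version),
                 ("timestamp", self.timestamp)]
        if self.git_commit is not None:
            attrs.append(("git_commit", self.git_commit))
        # quoteattr adds the surrounding quotes and escapes the value.
        body = " ".join("%s=%s" % (key, quoteattr(val)) for key, val in attrs)
        return "<processing_step %s/>" % body
```

Because the commit hash says nothing about uncommitted changes, a record like this identifies the code state only approximately.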
class SourceDoc
Inherits from: object
A SourceDoc is created by a SourceParser and contains source data and
annotations of those data. The source data are put in the text variable as a
unicode string, tags are in the source_tags and tarsqi_tags variables and
contain begin and end positions in the source. In addition, metadata,
comments, and any other data from the input is stored here.
Note that the SourceDoc is the input to further Tarsqi processing and it
stores everything that was given as input to the pipeline. This could be a
text document or a TimeBank document without any TimeML annotations. But it
could also be a TTK document that was the result of prior application of
another pipeline and that document can contain Tarsqi tags. The metadata and
tarsqi_tags will be exported to the relevant places in the TarsqiDocument
when the metadata parser and the document structure parsers apply to the
TarsqiDocument.
Public Functions
__getitem__(self, i)
__init__(self, filename='')
Initialize a SourceDoc on a filename or a string.
add_characters(self, string)
Add a character string to the source and increment the current
offset. Used by the CharacterDataHandler of the Expat parser in
SourceParserXML.
add_closing_tag(self, name)
Add a closing tag to source_tags. This is used by the
EndElementHandler of the Expat parser in SourceParserXML.
add_comment(self, string)
add_opening_tag(self, name, attrs)
Add an opening tag to source_tags. This is used by the
StartElementHandler of the Expat parser in SourceParserXML.
add_processing_instruction(self, target, data)
finish(self)
Transform the source text list into a string, merge the begin and end
tags, and index the tags on offsets. This should be called by
SourceParserXML, which uses the Expat parser and looks for individual
elements. It is not needed by SourceParserTTK, since that uses a DOM
object, or by SourceParserText, since that does not deal with tags.
pp(self)
Print source and tags.
print_source(self, filename)
Print the source string to a file, using the utf-8 encoding.
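The way add_characters, add_opening_tag, add_closing_tag and finish cooperate with the Expat handlers can be sketched as below. MiniSource and parse are hypothetical names used only for illustration, and the flat tmp list is a simplification of how SourceDoc stores its tags. The Expat calls themselves are from the standard library.

```python
import xml.parsers.expat


class MiniSource:
    """Hypothetical, simplified stand-in for SourceDoc."""

    def __init__(self):
        self.chunks = []   # text fragments, joined by finish()
        self.offset = 0    # current character offset in the tag-free text
        self.tmp = []      # (kind, name, offset, attrs) in document order
        self.text = None

    def add_opening_tag(self, name, attrs):
        self.tmp.append(("open", name, self.offset, attrs))

    def add_closing_tag(self, name):
        self.tmp.append(("close", name, self.offset, None))

    def add_characters(self, string):
        # Expat may deliver text in several pieces, so accumulate them.
        self.chunks.append(string)
        self.offset += len(string)

    def finish(self):
        self.text = "".join(self.chunks)


def parse(xml_string):
    source = MiniSource()
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = source.add_opening_tag
    parser.EndElementHandler = source.add_closing_tag
    parser.CharacterDataHandler = source.add_characters
    parser.Parse(xml_string, True)
    source.finish()
    return source
```

Calling parse("<doc>We <e>ran</e>.</doc>") leaves the tag-free string in text and records each tag with the offset at which it was seen, so the offsets index into the text rather than into the XML.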
class Tag
Inherits from: object
A Tag has a name, a begin offset, an end offset and a dictionary of
attributes. All arguments are handed in by the code that creates the Tag
which could be: (1) the code that parses the source document, which will
only assign an identifier if the source had an id attribute, (2) the
preprocessor code, which assigns identifiers for lex, ng, vg and s tags, or
(3) one of the components that creates tarsqi tags.
# TODO: check whether those are still the three that are used
Public Functions
__eq__(self, other)
__ge__(self, other)
__gt__(self, other)
__init__(self, name, o1, o2, attrs)
Initialize name, begin, end and attrs instance variables and make sure
that what we have can be turned into valid XML by removing duplicate
attribute names.
__le__(self, other)
__lt__(self, other)
__ne__(self, other)
__str__(self)
as_lex_xml_string(self, text)
Return an opening and closing tag wrapped around text. This is used only by
the GUTime wrapper to create input for GUTime, and it therefore has a narrow
focus and does not get all information from the tag.
as_ttk_tag(self)
Return the tag as a tag in the Tarsqi output format.
attributes_as_string(self)
Return a string representation of the attributes dictionary.
get_identifier(self)
Return the identifier of the event, timex or tlink if there is one; return
None otherwise. For an event, the identifier is assumed to be the eiid.
is_closing_tag(self)
is_opening_tag(self)
Private Functions
_compare(self, other)
Order two Tags based on their begin and end offsets. Tags with an earlier
begin are ranked before tags with a later begin; with equal begins, the tag
with the higher end is ranked first. Tags with no begin (that is, it is set
to -1) are ordered at the end. The order of two tags with the same begin and
end is undefined.
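The ordering described for _compare can be expressed as a sort key. This is a sketch of the ordering rule only, assuming tags are reduced to (begin, end) pairs; tag_sort_key is a hypothetical helper and not part of the module.

```python
def tag_sort_key(begin, end):
    """Sort key for the ordering rule, given a tag's begin and end offsets."""
    # First element: tags without a begin (-1) sort after all others.
    # Second element: earlier begins come first.
    # Third element: with equal begins, the higher end comes first.
    return (begin == -1, begin, -end)


spans = [(5, 9), (0, 4), (-1, -1), (0, 10)]
ordered = sorted(spans, key=lambda span: tag_sort_key(*span))
```

Ranking the longer of two tags with the same begin first means an enclosing tag precedes the tags nested inside it.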
class TagRepository
Inherits from: object
Class that provides access to the tags for a document. An instance of this
class is used for the SourceDoc instance; other instances will be used for
the elements in a TarsqiDocument. For now, the repository has the following
structure:
self.tmp
A list of OpeningTag and ClosingTag elements, used only to build the tags
list.
self.tags
A list with Tag instances.
self.opening_tags
A dictionary of tags indexed on begin offset, the values are lists of Tag
instances, again ordered on id (thereby reflecting text order, but only
for tags in the original input).
self.closing_tags
A dictionary indexed on end offset and begin offset, the values are
dictionaries of tag names. For example,
closing_tags[547][543] = {'lex':True, 'NG':True }
indicates that there is both a lex tag and an NG tag from 543-547. The
opening tags dictionary will have encoded that the opening NG occurs
before the opening lex:
opening_tags[543] = [<Tag 204 NG 543-547 {}>, <Tag 205 lex 543-547 {...}>]
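The two dictionaries described above could be built along these lines. This is a minimal sketch under stated assumptions: SimpleTag is a hypothetical stand-in for Tag with only the fields needed here, index_tags is not a function of the module, and sorting on id is inferred from the description of opening_tags.

```python
from collections import namedtuple

# Hypothetical, reduced stand-in for Tag.
SimpleTag = namedtuple("SimpleTag", "id name begin end")


def index_tags(tags):
    """Return (opening_tags, closing_tags) for a list of SimpleTag."""
    opening_tags = {}
    closing_tags = {}
    for tag in sorted(tags, key=lambda t: t.id):
        # begin offset -> list of tags, in id order
        opening_tags.setdefault(tag.begin, []).append(tag)
        # end offset -> begin offset -> {tag name: True}
        by_begin = closing_tags.setdefault(tag.end, {})
        by_begin.setdefault(tag.begin, {})[tag.name] = True
    return opening_tags, closing_tags


tags = [SimpleTag(205, "lex", 543, 547), SimpleTag(204, "NG", 543, 547)]
opening, closing = index_tags(tags)
```

With these two tags, closing[547][543] holds both names and opening[543] lists NG before lex, matching the example in the class description.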
Public Functions
__init__(self)
add_tag(self, name, begin, end, attrs)
Add a tag to the tags list and the opening_tags and closing_tags
dictionaries.
add_tmp_tag(self, tag_instance)
Add an OpeningTag or ClosingTag to a temporary list. Used by the XML
handlers.
all_tags(self)
append(self, tag)
Appends an instance of Tag to the tags list.
find_linktags(self, name, o1, o2)
Return all the link tags with type name. Only include the ones that
fall between offsets o1 and o2.
find_tag(self, name)
Return the first Tag object with name=name, return None if no such
tag exists.
find_tags(self, name, begin=None, end=None)
Return all tags of this name. If the optional begin and end are given
only return the tags that fall within those boundaries.
find_tags_at(self, begin_offset)
Return the list of tags which start at begin_offset.
import_tags(self, tag_repository, tagname)
Import all tags with name=tagname from tag_repository into self. This
is mostly used when we want to take tags from the SourceDoc and add them
to the tags on the TarsqiDocument.
index(self)
Index tags on position.
index_events(self)
index_timexes(self)
is_empty(self)
merge(self)
Take the OpeningTags and ClosingTags in self.tmp and merge them into
Tags. Raise errors if tags do not match.
pp(self, indent=' ')
pp_closing_tags(self)
pp_opening_tags(self)
pp_tags(self, indent='')
remove_tag(self, tag)
Remove the tag from the list of tags. This is rather inefficient since the
whole list is traversed. Also note that this method does not remove the
tag from the opening_tags and closing_tags dictionaries, so depending on
when this is done these may need to be re-indexed.
remove_tags(self, tagname)
Remove all tags with name=tagname. Rebuilds the indexes after
removing the tags.
reset(self)
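The merge step, which pairs the OpeningTags and ClosingTags in self.tmp, can be sketched with a stack. The tuple layout, the merge_tags name and the use of ValueError are assumptions made for this illustration; the module may represent the tags and signal mismatches differently.

```python
def merge_tags(tmp):
    """Pair opening and closing tags into (name, begin, end, attrs) tuples.

    tmp is a list of (kind, name, offset, attrs) tuples in document order,
    where kind is "open" or "close".
    """
    stack = []
    merged = []
    for kind, name, offset, attrs in tmp:
        if kind == "open":
            stack.append((name, offset, attrs))
            continue
        # A closing tag must match the most recently opened tag.
        if not stack or stack[-1][0] != name:
            raise ValueError("closing tag %r at offset %d does not match"
                             % (name, offset))
        _, begin, open_attrs = stack.pop()
        merged.append((name, begin, offset, open_attrs))
    if stack:
        raise ValueError("tag %r was never closed" % stack[-1][0])
    return merged
```

Tags come out in the order in which they are closed, so inner tags precede the tags that enclose them; a later sort or index step would put them in text order.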
class TarsqiDocument
Inherits from: object
An instance of TarsqiDocument should contain all information that may be
needed by the wrappers to do their work. It includes the source, metadata,
processing options, a set of identifier counters and a TagRepository.
Instance Variables:
source - an instance of SourceDoc
metadata - a dictionary
options - the Options instance from the Tarsqi instance
tags - an instance of TagRepository
counters - a set of counters used to create unique identifiers
Note that the processing options are available to the wrappers only through
this class by accessing the options variable.
Public Functions
__init__(self)
__str__(self)
add_event(self, begin, end, attrs)
Add an EVENT tag to the tarsqi_tags tag repository.
add_options(self, options)
add_timex(self, begin, end, attrs)
Add a TIMEX3 tag to the tag repository.
elements(self)
Method that returns the tags that contain paragraphs, that is, the
tags of type docelement.
events(self)
Convenience method for easy access to events.
get_dct(self)
has_event(self, begin, end)
Return True if there is already an event at the given begin and
end.
list_of_sentences(self)
next_event_id(self)
next_link_id(self, link_type)
Return a unique lid. The link_type argument is one of {ALINK, SLINK,
TLINK} and determines what link counter is incremented. The lid itself
is the sum of all the link counts. Assumes that all links are added
using the link counters in the document. Breaks down if there are
already links added without using those counters.
next_timex_id(self)
pp(self, source_tags=True, tarsqi_tags=True)
print_all(self, fname=None)
Write source string, metadata, comments, source tags and tarsqi tags
all to one file or to the standard output.
print_all_lif(self, fh)
print_sentences(self, fname=None)
Write to file (or standard output if no filename was given) a Python
variable assignment where the content of the variable is the list of
sentences as a list of lists of token strings.
print_source(self, fname)
Print the original source of the document, without the tags, to file
fname.
remove_tlinks(self)
Remove all TLINK tags from the tags repository.
slinks(self)
text(self, p1, p2)
timexes(self)
Convenience method for easy access to timexes.
tlinks(self)
update_processing_history(self, pipeline)
Private Functions
_print_comments(self, fh)
_print_metadata(self, fh)
_print_tags(self, fh, tag_group, tags)
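The identifier scheme described for next_link_id, where the lid is the sum of all the link counts, can be sketched as follows. LinkCounters is a hypothetical class and the "l" prefix on the identifier is an assumption; only the summing behaviour comes from the description above.

```python
class LinkCounters:
    """Hypothetical sketch of the link counters kept on a document."""

    def __init__(self):
        self.counts = {"ALINK": 0, "SLINK": 0, "TLINK": 0}

    def next_link_id(self, link_type):
        # Increment the counter for this link type only.
        self.counts[link_type] += 1
        # The identifier is based on the total over all link types.
        return "l%d" % sum(self.counts.values())


counters = LinkCounters()
first = counters.next_link_id("TLINK")
second = counters.next_link_id("SLINK")
third = counters.next_link_id("TLINK")
```

Identifiers stay unique only while every link is created through the counters. A link added some other way would not be counted, and a later identifier could collide with it.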
class TarsqiInputError
Inherits from: Exception