module components.preprocessing.wrapper

Contains the wrappers for all preprocessing components.

Classes: ChunkerWrapper, PreprocessorWrapper, TagId, TaggerWrapper, TokenizerWrapper, TreeTagger, Wrapper
class ChunkerWrapper
Inherits from: Wrapper

Wrapper for the chunker.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the chunker.
process(self)
Generate input for the chunker from the lex and s tags in the document, run the chunker, and insert the new ng and vg chunks into the TagRepository on the TarsqiDocument.

Private Functions

_export_chunks(self, text)
Export ng and vg tags to the TagRepository on the TarsqiDocument.
_import_tokens(self, element)
Import sentence and lex tags and create the data structure that the chunker needs as input.
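The chunker's contract can be illustrated with a toy pass over POS-tagged tokens: maximal runs of nominal tags become ng chunks and runs of verbal tags become vg chunks. This is only a hedged sketch of the kind of spans _export_chunks writes out, not the real chunker's rules; the tag sets and tuple shapes here are assumptions.

```python
def chunk_sentence(tagged_tokens):
    """Toy chunking pass over (text, pos) pairs. Returns a list of
    (chunk_type, start_index, end_index) triples, where end_index is
    exclusive. Not the actual chunker's grammar."""
    chunks = []
    current = None  # (chunk_type, start_index) of the open chunk
    for i, (text, pos) in enumerate(tagged_tokens):
        # Assumed toy rules: determiners, adjectives and nouns open or
        # extend a noun group; verb tags open or extend a verb group.
        if pos in ('DT', 'JJ') or pos.startswith('NN'):
            kind = 'ng'
        elif pos.startswith('VB'):
            kind = 'vg'
        else:
            kind = None
        if current and current[0] != kind:
            chunks.append((current[0], current[1], i))
            current = None
        if kind and not current:
            current = (kind, i)
    if current:
        chunks.append((current[0], current[1], len(tagged_tokens)))
    return chunks
```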
class PreprocessorWrapper
Inherits from: Wrapper

Wrapper for the preprocessing components.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the TreeTagger.
process(self)
Retrieve the element tags from the TarsqiDocument and hand the text for each element as a string to the preprocessing chain. The result is a shallow tree with sentences and tokens, which are inserted into the tags TagRepository on the TarsqiDocument.

Private Functions

_export(self, text)
Export the preprocessing results, updating the TagRepository on the TarsqiDocument.
_merge_tags(self, tokens, taggedItems)
Merge the tags and lemmas into the tokens, keeping sentence information (unlike the sister method on TaggerWrapper).
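The process method above describes a three-stage chain. A minimal sketch of that control flow, with the real tokenizer, tagger and chunker replaced by injected callables (the function name and signatures are assumptions for illustration):

```python
def preprocess_element(text, tokenize, tag, chunk):
    """Hedged sketch of the chain the wrapper runs on each document
    element: tokenize the raw text, POS-tag the tokens, then chunk
    the tagged tokens. The three callables stand in for the actual
    preprocessing components."""
    tokens = tokenize(text)
    tagged = tag(tokens)
    return chunk(tagged)
```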
class TagId
Inherits from: object

Class to provide fresh identifiers for lex, ng, vg and s tags.
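A minimal sketch of what such a fresh-identifier provider could look like; the class-level counters, method names and reset behavior here are assumptions, not the actual TagId implementation.

```python
class TagId:
    """Hand out fresh identifiers like 'lex1', 's1' per tag prefix."""

    _counters = {'lex': 0, 'ng': 0, 'vg': 0, 's': 0}

    @classmethod
    def next(cls, prefix):
        """Return a fresh identifier for the given tag prefix."""
        cls._counters[prefix] += 1
        return "%s%d" % (prefix, cls._counters[prefix])

    @classmethod
    def reset(cls):
        """Reset all counters, for example between documents."""
        for prefix in cls._counters:
            cls._counters[prefix] = 0
```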
class TaggerWrapper
Inherits from: Wrapper

Wrapper for the tagger.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the TreeTagger.
process(self)
Generate input for the tagger from the lex and s tags in the document, run the tagger, and insert the new information (pos and lemma) into the TagRepository on the TarsqiDocument.

Private Functions

_export_tags(self, tagged_tokens)
Take the token tuples and add their pos and lemma information to the TagRepository in the TarsqiDocument.
_merge_tags(self, tokens, taggedItems)
Merge the tags and lemmas into the tokens. The result is a list of tokens where each token is a 5-tuple of text, tag, lemma, begin offset and end offset. Sentence information is not kept in this list, which makes this method different from its sister on PreprocessorWrapper.
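The merge described above can be sketched as follows, assuming tokens arrive as (text, begin, end) triples and the tagger output as TreeTagger-style tab-separated text/pos/lemma lines (both input shapes are assumptions):

```python
def merge_tags(tokens, tagged_items):
    """Zip POS tags and lemmas into the token stream, producing the
    (text, pos, lemma, begin, end) 5-tuples described above."""
    merged = []
    for (text, begin, end), item in zip(tokens, tagged_items):
        # Each tagged item is assumed to be 'text<TAB>pos<TAB>lemma'.
        _, pos, lemma = item.split('\t')
        merged.append((text, pos, lemma, begin, end))
    return merged
```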
class TokenizerWrapper
Inherits from: Wrapper

Wrapper for the tokenizer.

Public Functions

__init__(self, tarsqidocument)
Set component_name and add the TarsqiDocument.
process(self)
Retrieve the element tags from the TarsqiDocument and hand the text for the elements as strings to the tokenizer. The result is a list of pairs, where the pair is either (<s>, None) or (SomeString, TokenizedLex). In the first case an s tag is inserted in the TarsqiDocument's tags TagRepository and in the second a lex tag.

Private Functions

_export_sentence(self, s_begin, s_end)
Add an s tag to the TagRepository of the TarsqiDocument.
_export_tokens(self, tokens)
Add s tags and lex tags to the TagRepository of the TarsqiDocument.
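The pairing described under process, ('<s>', None) for sentence markers and (SomeString, TokenizedLex) for tokens, can be turned into s and lex tag spans roughly as below; the TokenizedLex stand-in and the (tagname, begin, end) output shape are assumptions.

```python
class LexStub:
    """Stand-in for TokenizedLex, assumed to carry begin/end offsets."""
    def __init__(self, begin, end):
        self.begin, self.end = begin, end

def export_tokens(tokens):
    """Turn tokenizer output pairs into (tagname, begin, end) specs:
    one lex tag per token, plus an s tag spanning each sentence,
    closed when the next sentence opens or the input ends."""
    tags = []
    s_begin = s_end = None
    for text, lex in tokens:
        if lex is None:                 # ('<s>', None): sentence boundary
            if s_begin is not None:
                tags.append(('s', s_begin, s_end))
            s_begin = s_end = None
        else:
            tags.append(('lex', lex.begin, lex.end))
            if s_begin is None:
                s_begin = lex.begin
            s_end = lex.end
    if s_begin is not None:
        tags.append(('s', s_begin, s_end))
    return tags
```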
class TreeTagger
Inherits from: object

Class that wraps the TreeTagger.

Public Functions

__del__(self)
When deleting the wrapper, close the TreeTagger process pipes.
__init__(self, treetagger_dir)
Set up the pipe to the TreeTagger.
tag_text(self, text)
Open a thread to the TreeTagger, pipe in the text and return the results.

Private Functions

_get_executable(self)
Get the TreeTagger executable for the platform.
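The pipe setup in __init__ and tag_text can be sketched with subprocess. The real wrapper selects a platform-specific TreeTagger binary, passes a parameter file, and keeps the pipe open across calls (reading results in a separate thread); this simplified stand-in runs one exchange per call, and the command it runs is a placeholder, not the TreeTagger executable.

```python
import subprocess

class TaggerPipe:
    """Hedged stand-in for the TreeTagger pipe. 'command' is any
    line-oriented filter; the flags and paths of the real TreeTagger
    executable are not modeled here."""

    def __init__(self, command):
        self.command = command

    def tag_text(self, text):
        # One exchange per call; the real wrapper reuses a long-running
        # process and a reader thread instead of restarting it.
        proc = subprocess.run(self.command, input=text.encode('utf-8'),
                              stdout=subprocess.PIPE, check=True)
        return proc.stdout.decode('utf-8').splitlines()
```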
class Wrapper
Inherits from: object

Abstract class for shared functionality of the preprocessing wrappers.

Public Functions

__init__(self, name, tarsqidoc)

Private Functions

_chunk_text(self, text)
Takes a list of sentences and returns the same sentences with chunk tags inserted.
_init_chunker(self)
_init_tagger(self)
_init_tokenizer(self)
_tag_text(self, tokens)
Takes the tokenized text and returns a list of sentences, where each sentence is a list of (token, part-of-speech, lemma) tuples.
_tokenize_text(self, string)
Takes a unicode string and returns a list of objects, where each object is either the pair ('<s>', None) or a pair of a tokenized string and a TokenizedLex instance.
module functions
adjust_lex_offsets(tokens, offset)
The tokenizer works on isolated strings, adding offsets relative to the beginning of the string. For the lex tags, however, the offsets need to be relative to the beginning of the file, not to the beginning of an isolated element string. This procedure increments the offsets on instances of TokenizedLex accordingly.
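A sketch of that offset adjustment, assuming the token list holds (string, lex) pairs with lex set to None for sentence markers, as in the tokenizer output described earlier; the stub class is an assumption standing in for TokenizedLex.

```python
class LexStub:
    """Stand-in for TokenizedLex with mutable begin/end offsets."""
    def __init__(self, begin, end):
        self.begin, self.end = begin, end

def adjust_lex_offsets(tokens, offset):
    """Shift string-relative offsets so they are file-relative."""
    for _, lex in tokens:
        if lex is None:          # ('<s>', None) pairs carry no offsets
            continue
        lex.begin += offset
        lex.end += offset
```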
initialize_treetagger(treetagger_dir)
normalize_POS(pos)
Some simple modifications of the TreeTagger POS tags.
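As an illustration of the kind of modification involved: TreeTagger's English tagset uses SENT for sentence-final punctuation and VV*/VH* verb tags where the Penn Treebank uses . and VB*. The sketch below shows such a mapping; it covers only these cases, and the exact rule set of the real function may differ.

```python
def normalize_POS(pos):
    """Map a few TreeTagger tags onto Penn Treebank counterparts.
    Illustrative subset only, not the complete rule set."""
    if pos == 'SENT':
        # TreeTagger's sentence-final punctuation tag.
        return '.'
    if pos.startswith('VV') or pos.startswith('VH'):
        # TreeTagger splits verbs by lemma class (VV = full verbs,
        # VH = 'have'); the Treebank uses a single VB* series.
        return 'VB' + pos[2:]
    return pos
```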