module components.preprocessing.wrapper

Contains the wrappers for all preprocessing components.

Classes: ChunkerWrapper, PreprocessorWrapper, TagId, TaggerWrapper, TokenizerWrapper, TreeTagger, Wrapper
class ChunkerWrapper
Inherits from: Wrapper

Wrapper for the chunker.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the chunker.
process(self)
Generate input for the chunker from the lex and s tags in the document, run the chunker, and insert the new ng and vg chunks into the TagRepository on the TarsqiDocument.

Private Functions

_export_chunks(self, text)
Export ng and vg tags to the TagRepository on the TarsqiDocument.
_import_tokens(self, element)
Import sentence and lex tags and create the data structure that the chunker needs as input.
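The chunker's contract can be illustrated with a toy pass over POS-tagged tokens: maximal runs of nominal tags become ng chunks and runs of verbal tags become vg chunks. This is only a hedged sketch of the kind of spans _export_chunks writes out, not the real chunker's rules; the tag sets and tuple shapes here are assumptions.

```python
def chunk_sentence(tagged_tokens):
    """Toy chunking pass over (text, pos) pairs. Returns a list of
    (chunk_type, start_index, end_index) triples, where end_index is
    exclusive. Not the actual chunker's grammar."""
    chunks = []
    current = None  # (chunk_type, start_index) of the open chunk
    for i, (text, pos) in enumerate(tagged_tokens):
        # Assumed toy rules: determiners, adjectives and nouns open or
        # extend a noun group; verb tags open or extend a verb group.
        if pos in ('DT', 'JJ') or pos.startswith('NN'):
            kind = 'ng'
        elif pos.startswith('VB'):
            kind = 'vg'
        else:
            kind = None
        if current and current[0] != kind:
            chunks.append((current[0], current[1], i))
            current = None
        if kind and not current:
            current = (kind, i)
    if current:
        chunks.append((current[0], current[1], len(tagged_tokens)))
    return chunks
```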
class PreprocessorWrapper
Inherits from: Wrapper

Wrapper for the preprocessing components.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the TreeTagger.
process(self)
Retrieve the element tags from the TarsqiDocument and hand the text for each element as a string to the preprocessing chain. The result is a shallow tree with sentences and tokens, which are inserted into the tags TagRepository on the TarsqiDocument.

Private Functions

_export(self, text)
Export the preprocessing results, updating the TagRepository on the TarsqiDocument.
_merge_tags(self, tokens, taggedItems)
Merge the tags and lemmas into the tokens, keeping sentence information (unlike the sister method on TaggerWrapper).
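The process method above describes a three-stage chain. A minimal sketch of that control flow, with the real tokenizer, tagger and chunker replaced by injected callables (the function name and signatures are assumptions for illustration):

```python
def preprocess_element(text, tokenize, tag, chunk):
    """Hedged sketch of the chain the wrapper runs on each document
    element: tokenize the raw text, POS-tag the tokens, then chunk
    the tagged tokens. The three callables stand in for the actual
    preprocessing components."""
    tokens = tokenize(text)
    tagged = tag(tokens)
    return chunk(tagged)
```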
class TagId
Inherits from: object

Class to provide fresh identifiers for lex, ng, vg and s tags.
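A minimal sketch of what such a fresh-identifier provider could look like; the class-level counters, method names and reset behavior here are assumptions, not the actual TagId implementation.

```python
class TagId:
    """Hand out fresh identifiers like 'lex1', 's1' per tag prefix."""

    _counters = {'lex': 0, 'ng': 0, 'vg': 0, 's': 0}

    @classmethod
    def next(cls, prefix):
        """Return a fresh identifier for the given tag prefix."""
        cls._counters[prefix] += 1
        return "%s%d" % (prefix, cls._counters[prefix])

    @classmethod
    def reset(cls):
        """Reset all counters, for example between documents."""
        for prefix in cls._counters:
            cls._counters[prefix] = 0
```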
class TaggerWrapper
Inherits from: Wrapper

Wrapper for the tagger.

Public Functions

__init__(self, tarsqidocument)
Set component_name, add the TarsqiDocument and initialize the TreeTagger.
process(self)
Generate input for the tagger from the lex and s tags in the document, run the tagger, and insert the new information (pos and lemma) into the TagRepository on the TarsqiDocument.

Private Functions

_export_tags(self, tagged_tokens)
Take the token tuples and add their pos and lemma information to the TagRepository in the TarsqiDocument.
_merge_tags(self, tokens, taggedItems)
Merge the tags and lemmas into the tokens. The result is a list of tokens where each token is a 5-tuple of text, tag, lemma, begin offset and end offset. Sentence information is not kept in this list, which makes this method different from its sister on PreprocessorWrapper.
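The merge described above can be sketched as follows, assuming tokens arrive as (text, begin, end) triples and the tagger output as TreeTagger-style tab-separated text/pos/lemma lines (both input shapes are assumptions):

```python
def merge_tags(tokens, tagged_items):
    """Zip POS tags and lemmas into the token stream, producing the
    (text, pos, lemma, begin, end) 5-tuples described above."""
    merged = []
    for (text, begin, end), item in zip(tokens, tagged_items):
        # Each tagged item is assumed to be 'text<TAB>pos<TAB>lemma'.
        _, pos, lemma = item.split('\t')
        merged.append((text, pos, lemma, begin, end))
    return merged
```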
class TokenizerWrapper
Inherits from: Wrapper

Wrapper for the tokenizer.

Public Functions

__init__(self, tarsqidocument)
Set component_name and add the TarsqiDocument.
process(self)
Retrieve the element tags from the TarsqiDocument and hand the text for the elements as strings to the tokenizer. The result is a list of pairs, where the pair is either (<s>, None) or (SomeString, TokenizedLex). In the first case an s tag is inserted in the TarsqiDocument's tags TagRepository and in the second a lex tag.

Private Functions

_export_sentence(self, s_begin, s_end)
Add an s tag to the TagRepository of the TarsqiDocument.
_export_tokens(self, tokens)
Add s tags and lex tags to the TagRepository of the TarsqiDocument.
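The pairing described under process, ('<s>', None) for sentence markers and (SomeString, TokenizedLex) for tokens, can be turned into s and lex tag spans roughly as below; the TokenizedLex stand-in and the (tagname, begin, end) output shape are assumptions.

```python
class LexStub:
    """Stand-in for TokenizedLex, assumed to carry begin/end offsets."""
    def __init__(self, begin, end):
        self.begin, self.end = begin, end

def export_tokens(tokens):
    """Turn tokenizer output pairs into (tagname, begin, end) specs:
    one lex tag per token, plus an s tag spanning each sentence,
    closed when the next sentence opens or the input ends."""
    tags = []
    s_begin = s_end = None
    for text, lex in tokens:
        if lex is None:                 # ('<s>', None): sentence boundary
            if s_begin is not None:
                tags.append(('s', s_begin, s_end))
            s_begin = s_end = None
        else:
            tags.append(('lex', lex.begin, lex.end))
            if s_begin is None:
                s_begin = lex.begin
            s_end = lex.end
    if s_begin is not None:
        tags.append(('s', s_begin, s_end))
    return tags
```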
class TreeTagger
Inherits from: object

Class that wraps the TreeTagger.

Public Functions

__del__(self)
When deleting the wrapper, close the TreeTagger process pipes.
__init__(self, treetagger_dir)
Set up the pipe to the TreeTagger.
tag_text(self, text)
Open a thread to the TreeTagger, pipe in the text and return the results.

Private Functions

_get_executable(self)
Get the TreeTagger executable for the platform.
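The pipe setup in __init__ and tag_text can be sketched with subprocess. The real wrapper selects a platform-specific TreeTagger binary, passes a parameter file, and keeps the pipe open across calls (reading results in a separate thread); this simplified stand-in runs one exchange per call, and the command it runs is a placeholder, not the TreeTagger executable.

```python
import subprocess

class TaggerPipe:
    """Hedged stand-in for the TreeTagger pipe. 'command' is any
    line-oriented filter; the flags and paths of the real TreeTagger
    executable are not modeled here."""

    def __init__(self, command):
        self.command = command

    def tag_text(self, text):
        # One exchange per call; the real wrapper reuses a long-running
        # process and a reader thread instead of restarting it.
        proc = subprocess.run(self.command, input=text.encode('utf-8'),
                              stdout=subprocess.PIPE, check=True)
        return proc.stdout.decode('utf-8').splitlines()
```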
class Wrapper
Inherits from: object

Abstract class for shared functionality of the preprocessing wrappers.

Public Functions

__init__(self, name, tarsqidoc)

Private Functions

_chunk_text(self, text)
Takes a list of sentences and returns the same sentences with chunk tags inserted.
_init_chunker(self)
_init_tagger(self)
_init_tokenizer(self)
_tag_text(self, tokens)
Takes the tokenized text and returns a list of sentences, where each sentence is a list of (token, part-of-speech, lemma) tuples.
_tokenize_text(self, string)
Takes a unicode string and returns a list of objects, where each object is either the pair ('<s>', None) or a pair of a tokenized string and a TokenizedLex instance.
module functions
adjust_lex_offsets(tokens, offset)
The tokenizer works on isolated strings, adding offsets relative to the beginning of the string. For the lex tags, however, the offsets need to be relative to the beginning of the file, not to the beginning of an isolated element string. This procedure increments the offsets on instances of TokenizedLex accordingly.
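A sketch of that offset adjustment, assuming the token list holds (string, lex) pairs with lex set to None for sentence markers, as in the tokenizer output described earlier; the stub class is an assumption standing in for TokenizedLex.

```python
class LexStub:
    """Stand-in for TokenizedLex with mutable begin/end offsets."""
    def __init__(self, begin, end):
        self.begin, self.end = begin, end

def adjust_lex_offsets(tokens, offset):
    """Shift string-relative offsets so they are file-relative."""
    for _, lex in tokens:
        if lex is None:          # ('<s>', None) pairs carry no offsets
            continue
        lex.begin += offset
        lex.end += offset
```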
initialize_treetagger(treetagger_dir)
normalize_POS(pos)
Some simple modifications of the TreeTagger POS tags.
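As an illustration of the kind of modification involved: TreeTagger's English tagset uses SENT for sentence-final punctuation and VV*/VH* verb tags where the Penn Treebank uses . and VB*. The sketch below shows such a mapping; it covers only these cases, and the exact rule set of the real function may differ.

```python
def normalize_POS(pos):
    """Map a few TreeTagger tags onto Penn Treebank counterparts.
    Illustrative subset only, not the complete rule set."""
    if pos == 'SENT':
        # TreeTagger's sentence-final punctuation tag.
        return '.'
    if pos.startswith('VV') or pos.startswith('VH'):
        # TreeTagger splits verbs by lemma class (VV = full verbs,
        # VH = 'have'); the Treebank uses a single VB* series.
        return 'VB' + pos[2:]
    return pos
```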