index

module components.preprocessing.tokenizer

TokenizedLex
TokenizedSentence
TokenizedText
Tokenizer

Simple tokenizer that leaves the input unchanged and that generates character
offsets of tokens and sentences.

Usage:

    tokenizer = Tokenizer(text_string)
    tokenizer.tokenize_text()

At this point, you can get lists of sentences offsets and token offsets in the
tokenizer object:

    print tokenizer.sentences
    print tokenizer.tokens

You can also get the output as an xml string or in the old-fashioned
format with one-sentences-per-line and tokens separated by spaces:

    print tokenizer.get_tokenized_as_xml()
    print tokenizer.get_tokenized_as_string()

Both methods return unicode strings.

If you run this file as the main script, it expects a filename as the single
argument. It prints both the xml string and the old-fashioned string to the
standard output, followed by a note on elasped time. Edit the end of the script
to change this.

class TokenizedLex

Inherits from: object

Public Functions

__init__(self, b, e, text)

__str__(self)

as_pairs(self)

as_string(self, indent='')

as_vertical_string(self, indent='')

print_as_string(self, indent='')

print_as_xmlstring(self, indent='')

class TokenizedSentence

Inherits from: object

Public Functions

__init__(self, b, e)

append(self, item)

as_pairs(self)

as_string(self)

as_vertical_string(self)

print_as_string(self)

print_as_xmlstring(self)

class TokenizedText

Inherits from: object

This class takes a list of sentences of the form (begin_offset, end_offset)
and a list of tokens of the form (begin_offset, end_offset, text), and
creates a list of elements. Each element can either be a TokenizedSentence
or a TokenizedLex (the latter for a token outside a sentence tag).

Public Functions

__init__(self, sentences, tokenizer_lexes)

as_pairs(self)
Return self as a list of pairs, where usually each pair contains a
string and a TokenizedLex instance. Also inserts a ('<s>', None) for the
beginning of each sentence.  This is intended to take tokenized text and
prepare it for the TreeTagger (which does not recognize </s> tags.

as_string(self)

as_vertical_string(self)

print_as_string(self)

print_as_xmlstring(self)

class Tokenizer

Inherits from: object

Class to create lex tags and s tags given a text string that is not
modified. The lexes and sentences are gathered in the variables with the
same name. The token variable contains intermediate data, basically starting
with a list of non-whitespace character sequences, splitting and merging
these sequences as processing continues.

One thing that should be added is functionality that forces sentence
boundaries, so that document structure level processing or prior tags can
help correctly tokenize headers and such.

Public Functions

__init__(self, text)

get_tokenized(self, xml, s_open, s_close, lex_open, lex_close, lexindent)
Return the tokenized text as a string.

get_tokenized_as_string(self)
Return the tokenized text as a string where sentences are on one line
and tokens are separated by spaces. Not that each sentence ends with a
space because each token is followed by a space.

get_tokenized_as_xml(self)
Return the tokenized text as an XML string. Crappy way of printing
XML, will only work for lex and s tags. Need to eventually use a method
on TarsqiDocument (now there is a method on DocSource that probably
needs to be moved.

slurp_token(self, offset)
Given a string and an offset in the string, return two tuples, one
for whitespace characters after the offset and one for non-whitespaces
characters immediately after the whitespace. A tuple consists of a begin
offset, an end offset and a string.

tokenize_text(self)
Tokenize a text and return an instance of TokenizedText. Create lists
of sentences and lexes and feed these into the TokenizedText. Each token
and each sentence is a pair of a begin position and end position.

Private Functions

_first_token_start(self)
Return the begin position of the first token. This could be of the
core token or of a punctuation marker that was split off.

_restore_abbreviation(self, core_token, closing_puncts)
Glue the period back onto the core token if the first closing punctuation
is a period and the core token is a known abbreviation.

_set_lexes(self, )
Set lexes list by flattening self.tokens. Sometimes empty core tokens are
created, filter those out at this step.

_set_sentences(self)

_set_tag_indexes(self)
Populate dictionaries that stire tags on first and last offsets.

_slurp(self, offset, test)

_split_contraction(self, puncts1, tok, puncts2)

_split_contractions(self)

_split_punctuation(self, word)
Return a triple of opening punctuations, core token and closing
punctuation. A core token can contain internal punctuation but
token-initial and token-final punctuations are stripped off. If a token
has punctuation characters only, then the core token wil be the empty
string and the closing list will be empty.

_split_word(self, word)
Split a word into it's constitutent parts. A word is a tuple of begin
offset, end offset and a sequence of non-whitespace characters.

module functions

test_nonspace(char)

test_space(char)

token_is_abbreviation(token)
Return True if token is an abbreviation or acronym. Note that this
overgeneralizes since it catches all initials, including 'So would I. This
is...'. Decided that it was better to miss some sentence boundaries than
adding wrong boundaries.