module components.preprocessing.chunker

ChunkUpdater
Sentence

Simple Python Chunker

Usage

   from chunker import chunk_sentences
   chunked_sentences = chunk_sentences(sentences)
   chunked_sentences = chunk_sentences(sentences, terms)

The optional terms argument allows you to hand in a dictionary of terms indexed
on their beginning offsets. With this dictionary, terms are always considered
chunks as long as they are headed by a noun or verb. Terms are instances of
docmodel.document.Tag.
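
As an illustration of the expected shape of the terms argument, the sketch below builds a dictionary keyed on beginning offsets. The Term class is a hypothetical stand-in for docmodel.document.Tag, and its name, begin and end fields are assumptions made for this example only; the helper index_terms_on_begin is likewise not part of the module.

```python
from collections import namedtuple

# Hypothetical stand-in for docmodel.document.Tag. Only a name and the
# begin and end offsets are assumed here; the real class may differ.
Term = namedtuple("Term", ["name", "begin", "end"])


def index_terms_on_begin(terms):
    """Return a dictionary mapping each term's beginning offset to the term."""
    return {term.begin: term for term in terms}


terms = index_terms_on_begin([Term("term", 4, 16), Term("term", 30, 35)])
```

A dictionary of this shape is what would be passed as the second argument to chunk_sentences.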

class ChunkUpdater

Inherits from: object

Class that allows you to take a TarsqiDocument and then update the chunks
in it given tags that were unsuccessfully added to the TarsqiTrees in the
document. Currently this is only done for Timex tags.

Public Functions

__init__(self, tarsqidoc)

update(self)
Uses the orphans in the TarsqiTrees in the document to update chunks.

Private Functions

_add_chunks_for_timexes(self, element)

_remove_overlapping_chunks(self, nodes)
Remove all the noun chunk nodes that were found to be overlapping.

_update_element(self, element)
Uses the orphans in the TarsqiTree of the element to update chunks.

class Sentence

Inherits from: object

The workhorse for the chunker.

Public Functions

__init__(self, sentence)
Set sentence variable and initialize chunk_tags dictionary.

chunk(self, terms=None)
Chunk self.sentence. Updates the sentence variable and returns it. Scans
through the sentence and advances the index if a chunk is found. The
optional terms argument contains a dictionary of terms indexed on start
offset. If a terms dictionary is handed in then use it to make sure that
terms on it are considered chunks (as long as they are headed by a noun
or verb).

pp(self)

pp_tokens(self)

Private Functions

_consume_chunk(self, chunk_type, idx)
Read a constituent of class chunk_type, starting at index idx. Returns
idx if no constituent could be read; otherwise returns the index after the
end of the constituent.
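
The return convention above (idx when nothing could be read, otherwise the index after the constituent) is what lets a scanning loop decide whether to jump ahead or step one token. The following is a minimal sketch of that idea, not the module's own code: the token layout as (word, tag) pairs, the simplified NOUN_GROUP_TAGS set, and the function names are all assumptions.

```python
# Simplified, assumed set of tags that may occur inside a noun group.
NOUN_GROUP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def consume_noun_group(tokens, idx):
    """Return idx if no noun group starts at idx, else the index after it."""
    end = idx
    while end < len(tokens) and tokens[end][1] in NOUN_GROUP_TAGS:
        end += 1
    return end


def scan_for_chunks(tokens):
    """Scan left to right and collect (begin, end) spans, end exclusive."""
    spans = []
    idx = 0
    while idx < len(tokens):
        end = consume_noun_group(tokens, idx)
        if end > idx:
            spans.append((idx, end))
            idx = end  # a chunk was found: jump past it
        else:
            idx += 1  # no chunk here: move on to the next token
    return spans


tokens = [("the", "DT"), ("men", "NNS"), ("see", "VBP"), ("dogs", "NNS")]
spans = scan_for_chunks(tokens)
```

Comparing the returned index with idx is enough to tell whether a chunk was consumed, so no separate success flag is needed.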

_consume_term(self, term, idx)
Now that we know that a term starts at index idx, read the whole term
and, if it matches a few requirements, add it to the chunk_tags
dictionary. A term is an instance of docmodel.document.Tag.

_fix_VBGs(self)
The TreeTagger tends to tag some adjectives as gerunds. As a result,
we get

   [see/VBP sleeping/VBG] [men/NNS]

This method finds these occurrences and moves the VBG into the noun
group:

   [see/VBP] [sleeping/VBG men/NNS]

In order to do this, it finds all occurrences of VGs followed by NGs
where: (i) the VG ends in VBG, (ii) the NG starts with one of NN, NNS,
NNP, NNPS, and (iii) the verb before the VBG is not a form of "be".
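
The three conditions can be sketched as follows. This is an illustration under stated assumptions rather than the method itself: chunks are represented as (chunk_type, tokens) pairs with tokens as (word, tag) pairs, the BE_FORMS list is an assumed approximation of "a form of be", and the function name fix_vbgs is hypothetical.

```python
# Assumed approximation of the forms of "be"; the real check may differ.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def fix_vbgs(chunks):
    """Move a VG-final VBG into the following NG when conditions (i)-(iii) hold.

    chunks is a list of (chunk_type, tokens) pairs, where tokens is a list
    of (word, tag) pairs. The input is not modified; a new list is returned.
    """
    fixed = [(ctype, list(tokens)) for ctype, tokens in chunks]
    for (vg_type, vg), (ng_type, ng) in zip(fixed, fixed[1:]):
        if vg_type != "VG" or ng_type != "NG":
            continue
        # A verb must precede the VBG, otherwise condition (iii) cannot hold.
        if len(vg) < 2 or not ng:
            continue
        ends_in_vbg = vg[-1][1] == "VBG"            # condition (i)
        starts_with_noun = ng[0][1] in NOUN_TAGS    # condition (ii)
        preceded_by_be = vg[-2][0].lower() in BE_FORMS  # condition (iii)
        if ends_in_vbg and starts_with_noun and not preceded_by_be:
            ng.insert(0, vg.pop())
    return fixed


before = [("VG", [("see", "VBP"), ("sleeping", "VBG")]),
          ("NG", [("men", "NNS")])]
after = fix_vbgs(before)
```

With "are" in place of "see", condition (iii) fails and the chunks are returned unchanged, which keeps progressives such as "are sleeping" inside the verb group.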

_fix_common_errors(self)
Phase 2 of processing. Fix some common errors.

_import_chunks(self)
Add chunk tags to the sentence variable.

_is_VB_VBG_NN(self, idx)
Return True if starting at idx, we have the pattern "NOT_BE VBG
</VG> <NG> NN", return False otherwise.

_set_tags(self, chunk_type, begin_idx, end_idx)
Store the beginning and ending position of the chunk in the chunk_tags
dictionary.

module functions

chunk_sentences(sentences, terms=None)
Return a list of sentences with chunk tags added.