module components.preprocessing.chunker
ChunkUpdater
Sentence
Simple Python Chunker
Usage
from chunker import chunk_sentences
chunked_sentences = chunk_sentences(sentences)
chunked_sentences = chunk_sentences(sentences, terms)
The optional terms argument allows you to hand in a dictionary of terms indexed
on their beginning offsets. With this dictionary, terms are always considered
chunks as long as they are headed by a noun or verb. Terms are instances of
docmodel.document.Tag.
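As a minimal sketch of what such a terms dictionary might look like, the snippet below indexes terms on their beginning offsets. The Term class is a hypothetical stand-in for docmodel.document.Tag, and the attribute names begin, end and text are assumptions made for illustration only.

```python
from collections import namedtuple

# Hypothetical stand-in for docmodel.document.Tag; only the offsets matter here.
Term = namedtuple("Term", ["begin", "end", "text"])

def index_terms_by_begin(terms):
    """Return a dictionary mapping each term's beginning offset to the term."""
    return {term.begin: term for term in terms}

terms = index_terms_by_begin([
    Term(0, 12, "sleeping men"),
    Term(20, 27, "chunker"),
])
```

A dictionary built this way could then be passed as the second argument to chunk_sentences, provided the real Tag instances carry comparable offsets.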
class ChunkUpdater
Inherits from: object
Class that allows you to take a TarsqiDocument and then update the chunks
in it given tags that were unsuccessfully added to the TarsqiTrees in the
document. Currently only done for Timex tags.
Public Functions
__init__(self, tarsqidoc)
update(self)
Uses the orphans in the TarsqiTrees in the document to update chunks.
Private Functions
_add_chunks_for_timexes(self, element)
_remove_overlapping_chunks(self, nodes)
Remove all the noun chunk nodes that were found to be overlapping.
_update_element(self, element)
Uses the orphans in the TarsqiTree of the element to update chunks.
class Sentence
Inherits from: object
The workhorse for the chunker.
Public Functions
__init__(self, sentence)
Set sentence variable and initialize chunk_tags dictionary.
chunk(self, terms=None)
Chunk self.sentence. Updates self.sentence and returns it. Scans
through the sentence and advances the index if a chunk is found. The
optional terms argument contains a dictionary of terms indexed on start
offset. If a terms dictionary is handed in then use it to make sure that
terms on it are considered chunks (as long as they are headed by a noun
or verb).
pp(self)
pp_tokens(self)
Private Functions
_consume_chunk(self, chunk_type, idx)
Read constituent of class chunk_type, starting at index idx. Returns
idx if no constituent could be read, returns the index after the end of the
constituent otherwise.
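The return convention described here (idx when nothing is read, the index after the constituent otherwise) and the scan-and-advance loop used by chunk can be sketched roughly as follows. This is a toy illustration, not the chunker's actual grammar: the tag sets, the function names and the (word, tag) token representation are all assumptions.

```python
# Toy tag sets (assumed); the real chunker's grammar is richer than this.
NOUN_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def consume_chunk(tagged, allowed_tags, idx):
    """Return idx if no chunk starts at idx, else the index after the chunk."""
    end = idx
    while end < len(tagged) and tagged[end][1] in allowed_tags:
        end += 1
    return end

def scan(tagged):
    """Collect (chunk_type, begin, end) spans, with end exclusive."""
    spans = []
    idx = 0
    while idx < len(tagged):
        for chunk_type, allowed in (("NG", NOUN_TAGS), ("VG", VERB_TAGS)):
            end = consume_chunk(tagged, allowed, idx)
            if end > idx:
                # A chunk was read, so record it and jump past it.
                spans.append((chunk_type, idx, end))
                idx = end
                break
        else:
            idx += 1  # no chunk starts here, move on to the next token
    return spans

tagged = [("the", "DT"), ("men", "NNS"), ("are", "VBP"),
          ("sleeping", "VBG"), ("quietly", "RB")]
spans = scan(tagged)
```

Because the caller only compares the returned index with idx, it needs no separate success flag to tell whether a chunk was found.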
_consume_term(self, term, idx)
Now that we know that a term starts at index idx, read the whole term
and, if it matches a few requirements, add it to the chunk_tags
dictionary. A term is an instance of docmodel.document.Tag.
_fix_VBGs(self)
The TreeTagger tends to tag some adjectives as gerunds. As a result,
we get
[see/VBP sleeping/VBG] [men/NNS]
This method finds these occurrences and moves the VBG into the noun
group:
[see/VBP] [sleeping/VBG men/NNS]
In order to do this, it finds all occurrences of VGs followed by NGs
where: (i) the VG ends in VBG, (ii) the NG starts with one of NN, NNS,
NNP, NNPS, and (iii) the verb before the VBG is not a form of "be".
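The three conditions can be sketched as follows. This is only an illustration under stated assumptions: chunks are modelled as (chunk_type, tokens) pairs with (word, tag) tokens, noun groups are assumed to be non-empty, the list of "be" forms is a guess, and the function name fix_vbgs is hypothetical. The real method works on the sentence's own representation.

```python
# Assumed list of "be" forms; the real method may use a different list.
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}
NOUN_HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def fix_vbgs(chunks):
    """Move a trailing VBG from a verb group into the following noun group."""
    # Copy the token lists so the caller's data is left untouched.
    fixed = [(chunk_type, list(tokens)) for chunk_type, tokens in chunks]
    for (vg_type, vg), (ng_type, ng) in zip(fixed, fixed[1:]):
        # Need a VG directly followed by an NG, with a verb before the VBG.
        if vg_type != "VG" or ng_type != "NG" or len(vg) < 2:
            continue
        ends_in_vbg = vg[-1][1] == "VBG"
        starts_with_noun = ng[0][1] in NOUN_HEAD_TAGS
        preceded_by_be = vg[-2][0].lower() in BE_FORMS
        if ends_in_vbg and starts_with_noun and not preceded_by_be:
            ng.insert(0, vg.pop())
    return fixed

before = [("VG", [("see", "VBP"), ("sleeping", "VBG")]),
          ("NG", [("men", "NNS")])]
after = fix_vbgs(before)
```

Condition (iii) keeps progressives such as "are sleeping" intact, since there the VBG really is part of the verb group.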
_fix_common_errors(self)
Phase 2 of processing. Fix some common errors.
_import_chunks(self)
Add chunk tags to the sentence variable.
_is_VB_VBG_NN(self, idx)
Return True if starting at idx, we have the pattern "NOT_BE VBG
</VG> <NG> NN", return False otherwise.
_set_tags(self, chunk_type, begin_idx, end_idx)
Store beginning and ending position of the chunk in the chunk_tags
dictionary.
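One way to picture how stored positions could later be turned into chunk tags on the sentence is sketched below. The layout of chunk_tags (a "b" and an "e" sub-dictionary keyed on token index), the inclusive end index, and the angle-bracket markers are assumptions for illustration and may differ from the actual implementation.

```python
def set_tags(chunk_tags, chunk_type, begin_idx, end_idx):
    """Record where a chunk begins and where it ends (end_idx inclusive)."""
    chunk_tags["b"][begin_idx] = chunk_type
    chunk_tags["e"][end_idx] = chunk_type

def import_chunks(tokens, chunk_tags):
    """Return a new token list with opening and closing chunk markers added."""
    result = []
    for idx, token in enumerate(tokens):
        if idx in chunk_tags["b"]:
            result.append("<%s>" % chunk_tags["b"][idx])
        result.append(token)
        if idx in chunk_tags["e"]:
            result.append("</%s>" % chunk_tags["e"][idx])
    return result

# Assumed layout: one sub-dictionary for begin positions, one for end positions.
chunk_tags = {"b": {}, "e": {}}
set_tags(chunk_tags, "NG", 1, 2)
marked = import_chunks(["see", "sleeping", "men"], chunk_tags)
```

Recording positions first and adding the markers in a separate pass leaves room for a repair step to adjust chunk boundaries before the sentence itself is changed.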
module functions
chunk_sentences(sentences, terms=None)
Return a list of sentences with chunk tags added.