TARSQI Toolkit - Evita

Evita and the TarsqiTree class

The EvitaWrapper class is handed the TarsqiDocument and loops over all docelement Tags in it, creating an Evita instance for each of them and then processing the elements one by one. The Evita instance has slots for the TarsqiDocument, the docelement tag and a TarsqiTree instance which contains a document tree for the part of the TarsqiDocument that is being processed. The TarsqiTree instance itself knows what TarsqiDocument it belongs to and what docelement Tag it was created for, since these were handed in from the Evita instance. The TarsqiTree instance also has a list of daughters as well as some other attributes that are ignored here because they are not used by Evita.

components.evita.main.Evita
  tarsqidoc: an instance of docmodel.document.TarsqiDocument
  docelement: an instance of docmodel.document.Tag with name=docelement, the element of the TarsqiDocument that is being processed by Evita
  doctree: an instance of components.common_modules.tree.TarsqiTree
    tarsqidoc: an instance of docmodel.document.TarsqiDocument, this is the same one as on the Evita instance
    docelement: an instance of docmodel.document.Tag, the element that the document tree is created for, this is the same one as on the Evita instance
    dtrs: a list of daughters, typically instances of Sentence

Since the TarsqiTree and its elements are the starting point for Evita and Slinket processing we will dwell on them a bit longer. Here is a pretty print of the TarsqiTree for "The dog barked yesterday.".

<Sentence position=0>
  <NounChunk position=0 checkedEvents=False event=None eid=None>
    <Token position=0 pos=DT text=The>
    <Token position=1 pos=NN text=dog>
  <VerbChunk position=1 checkedEvents=False event=None eid=None>
    <Token position=0 pos=VBD text=barked>
  <NounChunk position=2 checkedEvents=False event=None eid=None>
    <TIMEX3 tid=t1 type=DATE value=20160103>
      <Token position=0 pos=NN text=yesterday>
  <Token position=3 pos=. text=.>

A TarsqiTree contains sentences, chunks, tokens and event and time constituents. It is created from a TarsqiDocument and a Tag by the create_tarsqi_tree() method in components.common_modules.tree. This method uses the intermediary Node object to create a tree hierarchy and then replaces all Node objects with instances of Sentence, NounChunk, VerbChunk, Token, AdjectiveToken, EventTag and TimexTag (all defined in submodules of components.common_modules). These tree elements are all subclasses of Constituent and have the following instance variables:

components.common_modules.constituent.Constituent
  tree: contains the TarsqiTree instance at the top of the tree, through this tree a constituent also has access to the TarsqiDocument and the Tag
  parent: a reference to the parent, which is an instance of TarsqiTree or one of the subclasses of Constituent
  position: an integer reflecting the constituent's position in the dtrs list of the parent
  dtrs: a list of daughters, this is the empty list for leaf nodes
  begin: the beginning offset in the SourceDoc
  end: the ending offset in the SourceDoc
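To make these instance variables concrete, here is a minimal sketch of how they might be filled in for part of the example sentence. The Constituent and Tree classes below are simplified stand-ins written for this illustration, and the label attribute and the add() and set_tree() helpers are hypothetical; none of this is the toolkit's actual code.

```python
class Constituent:
    """Simplified stand-in for the toolkit's Constituent (hypothetical)."""

    def __init__(self, label, begin, end, dtrs=()):
        self.label = label      # illustrative only, e.g. 'Sentence' or 'Token'
        self.tree = None        # the tree object at the top of the hierarchy
        self.parent = None      # the tree or another constituent
        self.position = None    # index in the parent's dtrs list
        self.dtrs = []          # stays empty for leaf nodes
        self.begin = begin      # beginning offset in the source text
        self.end = end          # ending offset in the source text
        for dtr in dtrs:
            self.add(dtr)

    def add(self, dtr):
        # Link the daughter to this node and record its position.
        dtr.parent = self
        dtr.position = len(self.dtrs)
        self.dtrs.append(dtr)

    def set_tree(self, tree):
        # Give this node and everything below it a reference to the tree.
        self.tree = tree
        for dtr in self.dtrs:
            dtr.set_tree(tree)


class Tree:
    """Simplified stand-in for TarsqiTree, holding only the daughters."""

    def __init__(self, sentences):
        self.dtrs = []
        for sentence in sentences:
            sentence.parent = self
            sentence.position = len(self.dtrs)
            sentence.set_tree(self)
            self.dtrs.append(sentence)


# Only the first two chunks of the example sentence are built here.
text = "The dog barked yesterday."
the = Constituent('Token', 0, 3)
dog = Constituent('Token', 4, 7)
barked = Constituent('Token', 8, 14)
noun_chunk = Constituent('NounChunk', 0, 7, [the, dog])
verb_chunk = Constituent('VerbChunk', 8, 14, [barked])
sentence = Constituent('Sentence', 0, 25, [noun_chunk, verb_chunk])
tree = Tree([sentence])
```

In this sketch every node can reach the top of the tree through its tree variable, and the begin and end offsets can be used to slice the token text out of the source string.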

The TarsqiTree is the common data structure that Tarsqi components operate on. For example, Evita and Slinket both run patterns over elements of this tree and major parts of the Evita and Slinket code are expressed as methods on constituents. Components may update elements of the tree, but it is important to note that those changes are local and will not be handed over to the next component in the pipeline. Instead, results from processing have to be exported to the TagRepository on the TarsqiDocument, as in the following figure.

Components import the TarsqiDocument and for each Tag create a TarsqiTree using the TarsqiDocument's data. They then use this sequence of trees one by one as input to processing and export the resulting tags back into the TagRepository. The next component in the pipeline will start afresh with a new set of TarsqiTrees which will be created from the updated TarsqiDocument.
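The import and export cycle described above can be sketched roughly as follows. This is a toy model under stated assumptions: Document, build_tree(), run_component() and the two finder functions are hypothetical names invented for this illustration, the tag store is a plain list standing in for the TagRepository, and the SLINK tags are placeholders rather than real Slinket output.

```python
class Document:
    """Hypothetical stand-in for a TarsqiDocument with a flat tag store."""

    def __init__(self, elements):
        self.elements = elements   # the document elements to process
        self.tags = []             # plays the role of the TagRepository


def build_tree(doc, element):
    # A fresh, component-local structure built from the document's data.
    return {"element": element, "tags": list(doc.tags), "scratch": {}}


def run_component(doc, name, find_tags):
    for element in doc.elements:
        tree = build_tree(doc, element)
        tree["scratch"]["seen_by"] = name   # local change, never exported
        for tag in find_tags(tree):
            doc.tags.append(tag)            # explicit export to the tag store


def event_finder(tree):
    # Pretend that every element yields exactly one event.
    return [("EVENT", tree["element"])]


def link_finder(tree):
    # Sees the events exported earlier, but none of the local scratch data.
    return [("SLINK", element) for (name, element) in tree["tags"]
            if name == "EVENT" and element == tree["element"]]


doc = Document(["d1", "d2"])
run_component(doc, "evita", event_finder)
run_component(doc, "slinket", link_finder)
```

The second component only ever sees what the first one exported to the tag store; anything the first component stored on its own trees is gone once those trees are discarded.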

Back to Evita...

The EvitaWrapper takes the TarsqiDocument on initialization and its process() method loops over all document elements, creating an Evita instance for each of them and then running the process_element() method. The process_element() method loops through all Sentence instances in the TarsqiTree in Evita's doctree variable, and then loops through all daughters of the Sentence. For each daughter it attempts to create an event with the createEvent() method, which is implemented on the NounChunk and VerbChunk classes.

As an illustration, below is a fragment from the code in the Evita class in components.evita.main. Note that the code is slightly simplified and edited and that it is not well-formed Python code anymore. This is true for all code fragments shown in this document.

components.evita.main.Evita

process_element():
   self.doctree = create_tarsqi_tree(self.tarsqidoc, self.docelement)
   for sentence in self.doctree:
      for node in sentence:
         if not node.checkedEvents:
            node.createEvent()

Nominal events are created with createEvent() on NounChunk and verbal and adjectival events are created with createEvent() on VerbChunk. Adjectival events are initially dealt with by VerbChunk because they are created only if preceded by certain verb groups, which is recognized by VerbChunk. The following three sections give details on how the three kinds of events are created.

Nominal events

Most of the code that deals with nominal events is expressed on the NounChunk class, which is a subclass of Chunk and Constituent. Recall that Constituent defines the instance variables tree, parent, position, dtrs, begin and end. These are all filled in during TarsqiTree construction by create_tarsqi_tree(), as mentioned earlier in this section. Several other instance variables are defined on chunks. The table below has just the variables added by NounChunk that are relevant for Evita.

components.common_modules.chunks.NounChunk
  phraseType: the chunktype, basically the chunk tag generated by the chunker, always 'ng' for noun chunks
  head: an integer reflecting the position of the head token in the chunk's dtrs variable, always set to -1 for noun chunks, picking out the last element
  gramchunk: an instance of components.evita.gramChunk.GramNChunk
    node: the NounChunk that the features are for
    tense: the tense of the nominal, by default set to 'NONE'
    aspect: the aspect of the nominal, by default set to 'NONE'
    modality: the modality of the nominal, by default set to 'NONE'
    polarity: the polarity of the nominal, by default set to 'POS'
  gramchunks: the empty list for noun chunks, but can be non-empty for verb chunks
  checkedEvents: a flag indicating whether the chunk has already been checked for events, initially set to False; if set to True, createEvent() will never be run for the chunk

The GramNChunk instance in the gramchunk variable contains the grammatical features for the noun chunk. This instance has a pointer back to the NounChunk it is in and four grammatical features. Three of them are typically 'NONE' for nouns, and it may sound a wee bit odd for nouns to have tense, aspect or modality, but in some cases the nominal will inherit these from a governing verb, for example, with phrases like "This would be a tragedy". When the verb chunk "be" tries to create an event it will recognize that the actual event is the nominal following it. It will then try to create an event on that nominal and hand it the verb chunk features. This is explained further below in the section on verbal events.
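The defaults and the inheritance from a governing verb can be sketched as below. The class names NounFeatures and VerbFeatures are hypothetical stand-ins for GramNChunk and GramVChunk, and the feature values used for "would be" are illustrative guesses, not values taken from the toolkit.

```python
class VerbFeatures:
    """Hypothetical stand-in for the features of a governing verb chunk."""

    def __init__(self, tense, aspect, modality, polarity):
        self.tense = tense
        self.aspect = aspect
        self.modality = modality
        self.polarity = polarity


class NounFeatures:
    """Hypothetical stand-in for GramNChunk: defaults plus inheritance."""

    def __init__(self, node, verb_features=None):
        self.node = node            # the noun chunk the features are for
        self.tense = 'NONE'
        self.aspect = 'NONE'
        self.modality = 'NONE'
        self.polarity = 'POS'
        if verb_features is not None:
            # Features handed in from the governing verb overwrite defaults.
            self.tense = verb_features.tense
            self.aspect = verb_features.aspect
            self.modality = verb_features.modality
            self.polarity = verb_features.polarity


# "a tragedy" on its own, and governed by "would be" (values are assumed).
plain = NounFeatures("a tragedy")
governed = NounFeatures("a tragedy",
                        VerbFeatures('NONE', 'NONE', 'would', 'POS'))
```

Without a governing verb the nominal keeps its defaults; with one, the nominal carries the modality of the verb chunk.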

The top-level of the createEvent() code for nominals looks as follows:

components.common_modules.chunks.NounChunk

createEvent(self, gramvchunk=None):
   self.gramchunk = GramNChunk(self, gramvchunk)
   if self.isEventCandidate():
      self._conditionallyAddEvent()

GramNChunk creation for the NounChunk is a simple affair, setting the defaults and then letting them be overwritten if features were handed in from the governing verb. The main work occurs in the isEventCandidate() method on the chunk, which first checks whether the noun chunk is syntactically able to be an event, which is the case if the chunk has a head, the head is a common noun and the chunk is not a time expression. Next, it checks whether the noun chunk can semantically be an event. It does this by looking up the head token in WordNet and by running a simple Bayesian classifier. This is where the heavy lifting occurs. The general logic is as follows:

  1. If all senses of the head token of the chunk are events in WordNet, return True.
  2. Else, if the classifier has enough data for the head token in its statistical model, let the classifier decide whether the token is an event.
  3. Else, do another WordNet lookup and check either whether the primary sense is an event or whether some senses are events.

The exact application of this logic is driven by settings in the components.evita.settings module, which has a couple of variables that determine whether the Bayesian classifier is used and how WordNet lookup determines event-hood.

components.evita.settings

EVITA_NOM_DISAMB = True
EVITA_NOM_CONTEXT = True
EVITA_NOM_WNPRIMSENSE_ONLY = True

The first variable determines whether the classifier will actually apply and the second whether the classifier will take into account contextual features (as opposed to just the word). The third variable determines the choice to be made in step 3; the default is to decide that a token is an event if its primary sense is.
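The three-step logic and the way the settings steer it can be sketched as follows. This is a simplified model, not Evita's implementation: is_semantic_event() is a hypothetical function, WordNet is reduced to a dictionary that maps a head word to a list of booleans (one per sense, primary sense first), the classifier is reduced to a dictionary of precomputed decisions, and the contextual features controlled by EVITA_NOM_CONTEXT are left out entirely. Returning False for words without any senses is also an assumption of the sketch.

```python
# Stand-ins for the module-level settings, mirroring the defaults above.
EVITA_NOM_DISAMB = True
EVITA_NOM_WNPRIMSENSE_ONLY = True


def is_semantic_event(head, wordnet, classifier,
                      disamb=EVITA_NOM_DISAMB,
                      primary_sense_only=EVITA_NOM_WNPRIMSENSE_ONLY):
    senses = wordnet.get(head, [])
    # Step 1: all senses of the head are events.
    if senses and all(senses):
        return True
    # Step 2: let the classifier decide if it is enabled and has data.
    if disamb and head in classifier:
        return classifier[head]
    # Step 3: fall back on a second lookup of the senses.
    if not senses:
        return False           # assumption: unknown words are not events
    if primary_sense_only:
        return senses[0]       # only the primary sense counts
    return any(senses)         # some sense being an event is enough


# Toy data, invented for this example.
toy_wordnet = {"war": [True, True],
               "party": [False, True],
               "table": [False, False]}
toy_classifier = {"party": True}

war = is_semantic_event("war", toy_wordnet, toy_classifier)
party = is_semantic_event("party", toy_wordnet, toy_classifier)
party_no_classifier = is_semantic_event("party", toy_wordnet, toy_classifier,
                                        disamb=False)
party_any_sense = is_semantic_event("party", toy_wordnet, toy_classifier,
                                    disamb=False, primary_sense_only=False)
```

With the classifier switched off, "party" is rejected under the primary-sense setting because its first toy sense is not an event, and accepted once any event sense is allowed to count.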

The resources used by the logic above are in library/evita/dictionaries. Here is a short description of the WordNet and classifier resources.

Back to the Evita nominal event logic. If the isEventCandidate() method returns true then the _conditionallyAddEvent() method will perform a few last checks and then add an event to the TarsqiDocument by asking the TarsqiTree to do the honours:

components.common_modules.chunks.Chunk

_conditionallyAddEvent(self, gramChunk=None):
    gchunk = self.gramchunk if gramChunk is None else gramChunk
    if (gchunk.head
        and gchunk.head.getText() not in forms.be
        and gchunk.head.getText() not in forms.spuriousVerb
        and gchunk.evClass):
        self.tree.addEvent(Event(gchunk))

This method is defined on Chunk and is used by VerbChunks as well. The core is to invoke addEvent() on the TarsqiTree instance in the tree variable, handing it an Event instance which was created from the GramNChunk (or GramVChunk for verb chunks).

components.common_modules.tree.TarsqiTree

addEvent(self, event):
    event_attrs = dict(event.attrs)
    eid = self.tarsqidoc.next_event_id()
    eiid = "ei%s" % eid[1:]
    event_attrs['eid'] = eid
    event_attrs['eiid'] = eiid
    token = event.tokens[-1]
    self.tarsqidoc.add_event(token.begin, token.end, event_attrs)

The TarsqiTree instance is handed most event features (tense and aspect and so on), but it is responsible for generating unique identifiers for the event (which it gets from the TarsqiDocument). Finally, it asks the TarsqiDocument to add the tag to the tags TagRepository, thereby exporting the component results as shown in the picture earlier in this section.
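The identifier handling can be tried out in isolation with the sketch below. ToyDocument is a hypothetical stand-in for TarsqiDocument, and the identifier format "e1", "e2" and so on is an assumption that merely fits the eid[1:] slice in the fragment above. Tokens are reduced to (begin, end) pairs.

```python
class ToyDocument:
    """Hypothetical stand-in for TarsqiDocument; not the toolkit's class."""

    def __init__(self):
        self.event_count = 0
        self.events = []

    def next_event_id(self):
        # Assumed identifier format: 'e' followed by a running number.
        self.event_count += 1
        return "e%d" % self.event_count

    def add_event(self, begin, end, attrs):
        self.events.append((begin, end, attrs))


def add_event(doc, tokens, attrs):
    # Copy the features, then add identifiers obtained from the document.
    event_attrs = dict(attrs)
    eid = doc.next_event_id()
    event_attrs['eid'] = eid
    event_attrs['eiid'] = "ei%s" % eid[1:]
    # The event is anchored on the offsets of the last token.
    begin, end = tokens[-1]
    doc.add_event(begin, end, event_attrs)


toy_doc = ToyDocument()
features = {'tense': 'NONE', 'polarity': 'POS'}
add_event(toy_doc, [(0, 3), (4, 7)], features)
add_event(toy_doc, [(8, 14)], features)
```

Because the features are copied before the identifiers are added, the dictionary handed in by the caller is left unchanged, and each exported event gets its own eid and a matching eiid.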

Verbal events

Adjectival events