The TARSQI Toolkit Manual

Release Notes and Manual for TTK Version 3.0.1. March 2021.

1. Introduction

The Tarsqi Toolkit (also referred to here as TTK or the Toolkit) is a set of components for extracting temporal information from news wire texts. TTK extracts time expressions, events, subordination links and temporal links; in addition, it can ensure consistency of temporal information.

The first version of the toolkit, version 1.0, was released in 2007, accompanied by many promises and threats that there would be a new version soon, but the release of the next version happened to take a decade. Version 2.0.0 was finally released in March 2017 and it differed from version 1.0 in many respects: code was simplified and debugged, documentation was updated, the Mallet toolkit was used, stand-off annotation was adopted, libraries were redesigned, and test and evaluation code was added. Up to version 2.2.0 only Python 2.7 was supported. Version 2.2.1 and version 3.0.0 were the first versions to also support Python 3 (this was tested on Python 3.8.5). Versions 2.2.1 and 3.0.0 are almost identical, the only differences are in the documentation and some bookkeeping files like the top-level VERSION file. After version 3.0.0, Python 2.7 support is not pursued anymore and future compatibility with Python 2.7 will be accidental. A list of changes is maintained in the changelog on GitHub.

The current version 3.0.1 has few code changes and is mostly concerned with publishing documentation on GitHub Pages.

2. Requirements and Installation

The toolkit runs on recent versions of Linux and Mac OSX (and should run on Windows, but see below). The code requires Python version 2.7 or version 3.8 (perhaps older versions are fine, this was not tested yet), Perl version 5.8 or newer and the Java Development Kit version 8 (version 7 will probably work as well). Java is needed to run the third party software used by the toolkit.

Two Python packages need to be installed:

$ pip install future
$ pip install six

With the requirements in place installation is a four step process:

  1. Install the toolkit code. This is simply a matter of unpacking the downloaded archive or cloning the TTK Git repository at https://github.com/tarsqi/ttk.git. Due to a bug on Windows, you should aim to install the toolkit on a path without spaces.

  2. Install the part-of-speech tagger. The Tarsqi toolkit is designed to work seamlessly with the IMS TreeTagger. Download the packages needed for your platform from the TreeTagger website and follow the directions. You can install the TreeTagger wherever you like, but you may need to tell the Toolkit where it can find the tagger (see below). There is an example script in the build directory which shows how to install the TreeTagger on Mac OSX. (Note: all directory and file names printed in this manual are relative to the path where TTK is installed).

  3. Install MALLET. The classifiers in the Toolkit use the MAchine Learning for LanguagE Toolkit (McCallum, Andrew Kachites. MALLET: A Machine Learning for Language Toolkit. http://mallet.cs.umass.edu. 2002). Download version 2.0.8 from http://mallet.cs.umass.edu/download.php. Version 2.0.7 will work as well. You can put MALLET wherever you want, but as with the TreeTagger you may need to tell the Toolkit where to find it. There is an example script in the build directory which shows how to install MALLET on Mac OSX. For Windows, you need to set the MALLET_HOME environment variable in addition to editing the configuration file.

  4. Create the configuration file. The toolkit comes with an example configuration file in config.sample.txt. Copy that file into config.txt and edit as needed. In most cases, the only settings in the configuration file that you will need to change are the treetagger and mallet options, with the values depending on where you installed the TreeTagger and MALLET. If you use the same installation directories as the two example scripts mentioned above then you will not have to make any changes to the configuration file.

To test the toolkit open a terminal window and make sure your working directory is in the top-level directory of the distribution, then type the following at the prompt (we use the dollar sign as the prompt in all examples):

$ python tarsqi.py data/in/simple-xml/tiny.xml out.xml

If all is well you will now have a file named out.xml with content similar to the output printed below.

<ttk>
<text>

Fido sleeps today.

</text>
<metadata>
  <dct value="20210302"/>
  <processing_steps>
  <processing_step ttk_version="3.0.1" git_commit="bae1870" timestamp="20210302-160635"
                   components="PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER"/>
  </processing_steps>
</metadata>
<source_tags>
  <text id="1" begin="1" end="21" />
</source_tags>
<tarsqi_tags>
  <docelement id="d1" begin="2" end="20" origin="DOCSTRUCTURE" type="paragraph" />
  <s id="s1" begin="2" end="20" origin="PREPROCESSOR" />
  <lex id="l1" begin="2" end="6" lemma="Fido" origin="PREPROCESSOR" pos="NNP" text="Fido" />
  <ng id="c1" begin="2" end="6" origin="PREPROCESSOR" />
  <lex id="l2" begin="7" end="13" lemma="sleep" origin="PREPROCESSOR" pos="VBZ" text="sleeps" />
  <vg id="c2" begin="7" end="13" origin="PREPROCESSOR" />
  <EVENT begin="7" end="13" aspect="NONE" class="OCCURRENCE" eid="e1" eiid="ei1"
         epos="VERB" form="sleeps" origin="EVITA" pos="VBZ" tense="PRESENT" />
  <lex id="l3" begin="14" end="19" lemma="today" origin="PREPROCESSOR" pos="NN" text="today" />
  <ng id="c3" begin="14" end="19" origin="PREPROCESSOR" />
  <TIMEX3 begin="14" end="19" origin="GUTIME" tid="t1" type="DATE" value="20170113" />
  <lex id="l4" begin="19" end="20" lemma="." origin="PREPROCESSOR" pos="." text="." />
  <TLINK eventInstanceID="ei1" origin="BLINKER-Type-1a"
         relType="IS_INCLUDED" relatedToTime="t1" />
  <TLINK origin="LINK_MERGER"
         relType="INCLUDES" relatedToEventInstance="ei1" timeID="t1" />
</tarsqi_tags>
</ttk>

Your output file will not be identical to this in that some dates and identifiers in the metadata will be different. There will also be minor format changes because some tags were put on two lines above for display purposes (TTK output has each tag on one line).

A note for Windows users. The code should run on Windows and we know it actually does run on Windows. However, we have not done extensive testing on Windows and some of the above instructions may not be entirely correct for Windows.

3. Using the TARSQI Toolkit

To run the TARSQI Toolkit, open a terminal and change the working directory to the top-level directory of the distribution. You can run Tarsqi on a file or a directory and the general format of the command is as follows:

$ python tarsqi.py [OPTIONS] INFILE OUTFILE
$ python tarsqi.py [OPTIONS] INDIR OUTDIR

In the first invocation INFILE needs to exist but OUTFILE should not. In the second invocation INDIR has to be a directory and OUTDIR either should not exist, in which case it is created, or it should be a directory. In the latter case you will be asked to confirm whether it is okay to overwrite files in OUTDIR, if not, the program will exit. All files in INDIR will be processed except for hidden files (starting with '.') and backup files (ending in '~').

There are several options you can use, none of them are mandatory. Options can be set in the configuration file (config.txt) or overruled on the command line. For a full list see the configuration file and the module documentation string in the main tarsqi.py script. Here we just present the three options you are most likely to use.

--source-format xml|text|ttk|timebank|...

The source type of the document allows components, especially the source parser and the metadata parser, to be sensitive to idiosyncratic properties of the text. There are three main types: xml, text and ttk. With the xml type the toolkit will assume that the input is well-formed XML and it will separate the text and the tags. The text is input for Tarsqi processing and the XML tags will be put in a separate tag repository. This input type can be used for the files in data/in/simple-xml. The text type makes the toolkit assume that the entire content of the file is to be parsed. Since the toolkit does not try to recognize any tags this type should only be used for real raw text. The ttk source type tells the toolkit that the input is in the TTK format. Typically the input here would be a file previously processed by the toolkit or a file that was converted into the TTK format.

The toolkit will usually run just fine without the --source-format option. In that case the toolkit will make an educated guess as to whether the input is xml, text or ttk. The system can be tricked, for example by giving it a file where the beginning really looks like XML, but in those cases an error will usually follow and in those cases it would be prudent to use this option after all.

The option should always be used when one of the special formats are used. When the input is vanilla text or XML then the toolkit has no good way to determine properties that are expressed in a very idiosyncratic way like the document creation time (DCT) and it will simply default to today's date. The timebank source type is an example of a specific source type. When using this for one of the files in data/in/TimeBank the toolkit will know where to find the DCT, which in the TimeBank case is in the file name or in some tag in the input.

The output of the processing is always in the TTK format, an example of which was given above when we processed data/in/simple-xml/tiny.xml. The TTK format is a standoff format which contains both the original unaltered text as well as the tags added and it contains four main elements:

  1. The text tag with in it the primary data of the input. In the case of the text type this is simply the entire document, for the xml type it is the document with all tags stripped (note that this implies that for the xml type we consider the primary source the document without the XML tags).

  2. The metadata tag which has a dictionary with meta data, which at this point only contains the value of the DCT.

  3. The source_tags tag which has all tags that were available in the input, this is by definition empty for the text type.

  4. The tarsqi_tags tag which contains a flat list of all tags added by the toolkit.

--dct TIMEX_VALUE

This can be used to hand in a DCT so that you are not faced with a situation where all DCTs are set to today's date. You may have a series of documents that you want to parse and that have their own particular format where the DCT is expressed in some way. You could extend the toolkit and define new data types and write metadata parsers for that data type. Or you could do that processing off-line and simply feed it into the toolkit with this option. The value needs to be a normalized time value like 20170412 (for April 12th 2017).

--pipeline STRING

Can be used to overrule the default pipeline specified in the config.txt settings file. A pipeline is a comma-separated string of component names. Allowed component names are PREPROCESSOR, TOKENIZER, TAGGER, CHUNKER, GUTIME, EVITA, SLINKET, S2T, BLINKER, CLASSIFIER and LINK_MERGER. Using other names raises an error. The order of the components in the pipeline specification is significant. Some pipeline examples are:

--pipeline PREPROCESSOR,GUTIME,EVITA
--pipeline TOKENIZER,TAGGER,CHUNKER,GUTIME,EVITA
--pipeline SLINKET,S2T,BLINKER,CLASSIFIER,LINK_MERGER
--pipeline PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER

The first example instructs TTK to take a file, preprocess it and add time expressions and events. The second example is identical except that individual preprocessing modules are used instead of the PREPROCESSOR shorthand. It is worth noting that you should not combine PREPROCESSOR and TOKENIZER in one pipeline since you would then run the tokenizer twice and this will cause errors later in the processing chain. For the third example, preprocessing, times and events are taken for granted and only links are added. In this case the input needs to be a ttk file which has the needed information. The fourth example is the full pipeline except that the link merger component is left out, this is useful for large files since the merger component slows down on large file. This is in fact the default pipeline.

To end this section here is an example to illustrate the interaction of the --source-format and --pipeline options:

$ python tarsqi.py --source-format xml --pipeline TOKENIZER test.xml test.tokenized.xml
$ python tarsqi.py --source-format ttk --pipeline TAGGER test.tokenized.xml test.tagged.xml

Note how the second line uses the ttk source type because output is always in the TTK format. You could leave out the --source-format option and let the toolkit determine the source from the input file.

4. Further Documentation

The code documentation has a list of links to all modules. For each module the documentation strings for classes and functions are printed. Documentation of the code is uneven: some modules are well-documented, other have spotty and underwhelming documentation or are not documented at all. The pages were automatically generated with the make_documentation.py script in the utilities directory, which was created because the pydoc command crashes on many of the toolkit modules.

Several design documents contain notes on the overall design of the toolkit and the algorithms used in many components. These notes are a work in progress and many sections are unfinished. The end of the top-level design document includes a list of tags generated by the toolkit.

The are also various random notes on the Tarsqi toolkit including descriptions of parts of the system, how-to guides and old specifications. Some are fairly recent, some are very old. The recent ones tend to be written in Github flavoured markdown and are best viewed on Github.

Finally, there is a list of published papers on Tarsqi.

6. Reporting bugs

Use https://github.com/tarsqi/ttk/issues for any issue, question or comment you may have. When reporting a bug, please be as specific as you can. Give us the version of the code you are running, your platform specifics, the input that gives you grief, and any other particulars that seem relevant. Support is uneven, but at irregular intervals we do check the list of issues.

5. Contributors and acknowledgements

Many people have contributed to the Tarsqi project, they are listed here in alphabetical order: Alex Baron, John Frank, Swini Garimella, Linda van Guilder, Josh Gieringer, Catherine Havasi, Jerry Hobbs, Seokbae Jang, Bob Knippen, Congmin Lee, Inderjeet Mani, Emin Mimaroglu, Jessica Moszkowicz, Feng Pan, John Phillips, Alex Plotnick, James Pustejovsky, Hongyuan Qiu, Ruth Reeves, Anna Rumshisky, Sanjib Kumar Saha, Roser Saurí, Barry Schiffman, Andrew See, Amber Stubbs, Kevin Thomas, Marc Verhagen, Ben Wellner and Dax Westerman.

From 2001-2008, work on Tarsqi was performed in the context of the IARPA AQUAINT Program and funded under ARDA/DoD/IARPA grant NBCHC040027. We are in particular thankful for the early support of John Prange and Heather McCallum-Bayliss. From 2014 through 2017 further work was supported by a Veterans Administration HSR&D Merit Award IIR 12-364.

8. Copyright and License

The Tarsqi Toolkit is copyright ©2020 of Brandeis University and is distributed under the Apache 2 License.

The Tempex module is copyright of The MITRE corporation and is distributed under the license in tempex-license.pdf.

The Toolkit contains two Python Wordnet modules in utilities/wordnet.py and utilities/wntools.py, which were developed by Oliver Steele. Use of that code is permitted under the Artistic License. See the PyWordNet project page at http://sourceforge.net/projects/pywordnet and the license at http://www.opensource.org/licenses/artistic-license.html.

The data in data/in/TimeBank are copyrighted by the various content providers and can be used for academic purposes only.