Release Notes and Manual for TTK Version 3.0.1. March 2021.
The Tarsqi Toolkit (also referred to here as TTK or the Toolkit) is a set of components for extracting temporal information from news wire texts. TTK extracts time expressions, events, subordination links and temporal links; in addition, it can ensure consistency of temporal information.
The first version of the toolkit, version 1.0, was released in 2007, accompanied by many promises and threats that there would be a new version soon, but the next version took a decade to arrive. Version 2.0.0 was finally released in March 2017 and it differed from version 1.0 in many respects: code was simplified and debugged, documentation was updated, the Mallet toolkit was used, stand-off annotation was adopted, libraries were redesigned, and test and evaluation code was added. Up to version 2.2.0 only Python 2.7 was supported. Version 2.2.1 and version 3.0.0 were the first versions to also support Python 3 (this was tested on Python 3.8.5). Versions 2.2.1 and 3.0.0 are almost identical; the only differences are in the documentation and some bookkeeping files like the top-level VERSION file. After version 3.0.0, Python 2.7 support is no longer pursued and any future compatibility with Python 2.7 will be accidental. A list of changes is maintained in the changelog on GitHub.
The current version 3.0.1 has few code changes and is mostly concerned with publishing documentation on GitHub Pages.
The toolkit runs on recent versions of Linux and Mac OSX (and should run on Windows, but see below). The code requires Python version 2.7 or version 3.8 (older versions may work, but this has not been tested yet), Perl version 5.8 or newer, and the Java Development Kit version 8 (version 7 will probably work as well). Java is needed to run the third-party software used by the toolkit.
Two Python packages need to be installed:
$ pip install future
$ pip install six
With the requirements in place, installation is a four-step process:
Install the toolkit code. This is simply a matter of unpacking the downloaded archive or cloning the TTK Git repository at https://github.com/tarsqi/ttk.git. Due to a bug on Windows, you should aim to install the toolkit on a path without spaces.
Install the part-of-speech tagger. The Tarsqi toolkit is designed to work
seamlessly with the IMS TreeTagger. Download the packages needed for your
platform from
the TreeTagger
website and follow the directions. You can install the TreeTagger
wherever you like, but you may need to tell the Toolkit where it can find
the tagger (see below). There is an
example script in the
build directory which shows how to install the TreeTagger on
Mac OSX. (Note: all directory and file names printed in this manual are
relative to the path where TTK is installed).
Install MALLET. The classifiers in the Toolkit use the MAchine Learning
for LanguagE Toolkit (McCallum, Andrew Kachites. MALLET: A Machine
Learning for Language Toolkit.
http://mallet.cs.umass.edu. 2002). Download
version 2.0.8
from http://mallet.cs.umass.edu/download.php. Version
2.0.7 will work as well. You can put MALLET wherever you want, but as with
the TreeTagger you may need to tell the Toolkit where to find it. There is
an example script in the
build directory which shows how to install MALLET on Mac
OSX. For Windows, you need to set the MALLET_HOME environment variable in
addition to editing the configuration file.
Create the configuration file. The toolkit comes with an example configuration file
in config.sample.txt. Copy that file into
config.txt and edit as needed. In most cases, the only
settings in the configuration file that you will need to change are
the treetagger and mallet options, with the
values depending on where you installed the TreeTagger and MALLET. If you
use the same installation directories as the two example scripts mentioned
above then you will not have to make any changes to the configuration
file.
To test the toolkit, open a terminal window and make sure your working directory is the top-level directory of the distribution, then type the following at the prompt (we use the dollar sign as the prompt in all examples):
$ python tarsqi.py data/in/simple-xml/tiny.xml out.xml
If all is well you will now have a file named out.xml with
content similar to the output printed below.
<ttk>
<text>
Fido sleeps today.
</text>
<metadata>
<dct value="20210302"/>
<processing_steps>
<processing_step ttk_version="3.0.1" git_commit="bae1870" timestamp="20210302-160635"
components="PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER"/>
</processing_steps>
</metadata>
<source_tags>
<text id="1" begin="1" end="21" />
</source_tags>
<tarsqi_tags>
<docelement id="d1" begin="2" end="20" origin="DOCSTRUCTURE" type="paragraph" />
<s id="s1" begin="2" end="20" origin="PREPROCESSOR" />
<lex id="l1" begin="2" end="6" lemma="Fido" origin="PREPROCESSOR" pos="NNP" text="Fido" />
<ng id="c1" begin="2" end="6" origin="PREPROCESSOR" />
<lex id="l2" begin="7" end="13" lemma="sleep" origin="PREPROCESSOR" pos="VBZ" text="sleeps" />
<vg id="c2" begin="7" end="13" origin="PREPROCESSOR" />
<EVENT begin="7" end="13" aspect="NONE" class="OCCURRENCE" eid="e1" eiid="ei1"
epos="VERB" form="sleeps" origin="EVITA" pos="VBZ" tense="PRESENT" />
<lex id="l3" begin="14" end="19" lemma="today" origin="PREPROCESSOR" pos="NN" text="today" />
<ng id="c3" begin="14" end="19" origin="PREPROCESSOR" />
<TIMEX3 begin="14" end="19" origin="GUTIME" tid="t1" type="DATE" value="20170113" />
<lex id="l4" begin="19" end="20" lemma="." origin="PREPROCESSOR" pos="." text="." />
<TLINK eventInstanceID="ei1" origin="BLINKER-Type-1a"
relType="IS_INCLUDED" relatedToTime="t1" />
<TLINK origin="LINK_MERGER"
relType="INCLUDES" relatedToEventInstance="ei1" timeID="t1" />
</tarsqi_tags>
</ttk>
Your output file will not be identical to this because some dates and identifiers in the metadata will be different. There will also be minor formatting differences because some tags were put on two lines above for display purposes (TTK output has each tag on one line).
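To check the result programmatically, the output can be read with any XML parser. The following is a minimal sketch using Python's standard library; the SAMPLE string is an abbreviated, hypothetical document modelled on the output above, and the summarize function is an illustration rather than part of the toolkit.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated TTK document modelled on the example output.
SAMPLE = """<ttk>
<text>
Fido sleeps today.
</text>
<tarsqi_tags>
<EVENT begin="7" end="13" class="OCCURRENCE" eid="e1" eiid="ei1" form="sleeps" tense="PRESENT" />
<TIMEX3 begin="14" end="19" origin="GUTIME" tid="t1" type="DATE" value="20170113" />
<TLINK eventInstanceID="ei1" origin="BLINKER-Type-1a" relType="IS_INCLUDED" relatedToTime="t1" />
</tarsqi_tags>
</ttk>"""


def summarize(ttk_xml):
    """Return the attribute dictionaries of all EVENT, TIMEX3 and TLINK tags."""
    root = ET.fromstring(ttk_xml)
    tags = root.find("tarsqi_tags")
    return {name: [dict(tag.attrib) for tag in tags.findall(name)]
            for name in ("EVENT", "TIMEX3", "TLINK")}


summary = summarize(SAMPLE)
for name, tags in summary.items():
    print(name, len(tags))
```

To inspect a real output file, read its content from disk (for example out.xml) and pass it to summarize.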
A note for Windows users. The code should run on Windows and we know it actually does run on Windows. However, we have not done extensive testing on Windows and some of the above instructions may not be entirely correct for Windows.
To run the TARSQI Toolkit, open a terminal and change the working directory to the top-level directory of the distribution. You can run Tarsqi on a file or a directory and the general format of the command is as follows:
$ python tarsqi.py [OPTIONS] INFILE OUTFILE
$ python tarsqi.py [OPTIONS] INDIR OUTDIR
In the first invocation INFILE needs to exist but OUTFILE should not. In the second invocation INDIR has to be a directory and OUTDIR either should not exist, in which case it is created, or it should be a directory. In the latter case you will be asked to confirm whether it is okay to overwrite files in OUTDIR; if you decline, the program will exit. All files in INDIR will be processed except for hidden files (starting with '.') and backup files (ending in '~').
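If you want to predict which files in INDIR will be picked up, the selection rule can be sketched as follows. This is an illustration of the rule as described here, not the toolkit's own code; the function name files_to_process is hypothetical and the restriction to regular files is an assumption.

```python
import os
import tempfile


def files_to_process(indir):
    """List regular files in indir, skipping hidden files and backup files."""
    names = []
    for name in sorted(os.listdir(indir)):
        # Hidden files start with '.', backup files end in '~'.
        if name.startswith(".") or name.endswith("~"):
            continue
        # Assumption: only regular files count, subdirectories are ignored.
        if os.path.isfile(os.path.join(indir, name)):
            names.append(name)
    return names


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as indir:
    for name in ("a.xml", "b.xml", ".hidden", "a.xml~"):
        open(os.path.join(indir, name), "w").close()
    selected = files_to_process(indir)

print(selected)
```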
There are several options you can use, none of which is mandatory. Options
can be set in the configuration file (config.txt) or overridden on
the command line. For a full list see the configuration file and the module
documentation string in the main tarsqi.py script. Here we just
present the three options you are most likely to use.
--source-format xml|text|ttk|timebank|...
The source type of the document allows components, especially the source
parser and the metadata parser, to be sensitive to idiosyncratic properties of
the text. There are three main types: xml, text
and ttk. With the xml type the toolkit will assume
that the input is well-formed XML and it will separate the text and the
tags. The text is input for Tarsqi processing and the XML tags will be put in a
separate tag repository. This input type can be used for the files
in data/in/simple-xml. The text type makes the toolkit
assume that the entire content of the file is to be parsed. Since the toolkit
does not try to recognize any tags this type should only be used for real raw
text. The ttk source type tells the toolkit that the input is in
the TTK format. Typically the input here would be a file previously processed
by the toolkit or a file that was converted into the TTK format.
The toolkit will usually run just fine without
the --source-format option. In that case the toolkit will make an
educated guess as to whether the input is xml, text
or ttk. The system can be tricked, for example by giving it a file
where the beginning really looks like XML. An error will usually follow in
those cases, and it would then be prudent to use this option after
all.
The option should always be used when one of the special formats is
used. When the input is vanilla text or XML then the toolkit has no good way to
determine properties that are expressed in a very idiosyncratic way like the
document creation time (DCT) and it will simply default to today's
date. The timebank source type is an example of a specific source
type. When using this for one of the files in data/in/TimeBank the
toolkit will know where to find the DCT, which in the TimeBank case is in the
file name or in some tag in the input.
The output of the processing is always in the TTK format, an example of which
was given above when we processed data/in/simple-xml/tiny.xml. The
TTK format is a standoff format which contains both the original unaltered text
as well as the tags added and it contains four main elements:
The text tag, which holds the primary data of the input. In the
case of the text type this is simply the entire document, for
the xml type it is the document with all tags stripped (note that
this implies that for the xml type we consider the primary source
the document without the XML tags).
The metadata tag, which has a dictionary with metadata that
at this point only contains the value of the DCT.
The source_tags tag, which has all tags that were available in
the input; this is by definition empty for the text type.
The tarsqi_tags tag, which contains a flat list of all tags added by the
toolkit.
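Because the format is standoff, a tag is tied to the text only through its begin and end attributes. The sketch below shows how a tag could be mapped back to the text it covers, assuming that begin and end are character offsets into the content of the text element with an exclusive end (which is what the tag lengths in the earlier example suggest). The SAMPLE document is hypothetical and its offsets were chosen to fit its own text.

```python
import xml.etree.ElementTree as ET

# Hypothetical TTK-style document; offsets were chosen to match this text.
SAMPLE = """<ttk>
<text>Fido sleeps today.</text>
<tarsqi_tags>
<lex id="l1" begin="0" end="4" pos="NNP" />
<EVENT begin="5" end="11" eid="e1" />
<TIMEX3 begin="12" end="17" tid="t1" />
</tarsqi_tags>
</ttk>"""


def tagged_spans(ttk_xml):
    """Pair each tag that has offsets with the stretch of text it covers."""
    root = ET.fromstring(ttk_xml)
    text = root.findtext("text")
    spans = []
    for tag in root.find("tarsqi_tags"):
        # Link tags such as TLINK carry no offsets and are skipped.
        if "begin" in tag.attrib and "end" in tag.attrib:
            begin, end = int(tag.get("begin")), int(tag.get("end"))
            spans.append((tag.tag, text[begin:end]))
    return spans


spans = tagged_spans(SAMPLE)
print(spans)
```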
--dct TIMEX_VALUE
This can be used to hand in a DCT so that you are not faced with a situation where all DCTs are set to today's date. You may have a series of documents that you want to parse and that have their own particular format where the DCT is expressed in some way. You could extend the toolkit and define new data types and write metadata parsers for that data type. Or you could do that processing off-line and simply feed it into the toolkit with this option. The value needs to be a normalized time value like 20170412 (for April 12th 2017).
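If you determine the DCT off-line, the normalized value is easy to produce from a date object. A minimal sketch, in which the helper name dct_value is hypothetical:

```python
from datetime import date


def dct_value(day):
    """Format a date as a normalized YYYYMMDD value for the --dct option."""
    return day.strftime("%Y%m%d")


value = dct_value(date(2017, 4, 12))
print("--dct", value)
```

The resulting string can be passed on the command line directly after --dct.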
--pipeline STRING
Can be used to overrule the default pipeline specified in
the config.txt settings file. A pipeline is a comma-separated
string of component names. Allowed component names are PREPROCESSOR,
TOKENIZER, TAGGER, CHUNKER, GUTIME, EVITA, SLINKET, S2T, BLINKER,
CLASSIFIER and
LINK_MERGER. Using other names raises an error. The order of the
components in the pipeline specification is significant. Some pipeline examples
are:
--pipeline PREPROCESSOR,GUTIME,EVITA
--pipeline TOKENIZER,TAGGER,CHUNKER,GUTIME,EVITA
--pipeline SLINKET,S2T,BLINKER,CLASSIFIER,LINK_MERGER
--pipeline PREPROCESSOR,GUTIME,EVITA,SLINKET,S2T,BLINKER,CLASSIFIER
The first example instructs TTK to take a file, preprocess it and add time expressions and events. The second example is identical except that individual preprocessing modules are used instead of the PREPROCESSOR shorthand. Note that you should not combine PREPROCESSOR and TOKENIZER in one pipeline, since you would then run the tokenizer twice and this will cause errors later in the processing chain. For the third example, preprocessing, times and events are taken for granted and only links are added. In this case the input needs to be a ttk file which has the needed information. The fourth example is the full pipeline except that the link merger component is left out. This is useful for large files since the merger component slows down on large files, and it is in fact the default pipeline.
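If you build pipeline strings in a script, you can check them against these rules before calling the toolkit. The following sketch is not the toolkit's own validation code; the function name parse_pipeline and its error messages are hypothetical, and it only encodes the rules stated in this section.

```python
# Component names as listed in this manual.
ALLOWED_COMPONENTS = (
    "PREPROCESSOR", "TOKENIZER", "TAGGER", "CHUNKER", "GUTIME", "EVITA",
    "SLINKET", "S2T", "BLINKER", "CLASSIFIER", "LINK_MERGER")


def parse_pipeline(spec):
    """Split a comma-separated pipeline string and check the component names."""
    components = spec.split(",")
    unknown = [name for name in components if name not in ALLOWED_COMPONENTS]
    if unknown:
        raise ValueError("unknown component(s): " + ", ".join(unknown))
    # PREPROCESSOR already includes tokenization, so adding TOKENIZER
    # would run the tokenizer twice.
    if "PREPROCESSOR" in components and "TOKENIZER" in components:
        raise ValueError("do not combine PREPROCESSOR and TOKENIZER")
    return components


pipeline = parse_pipeline("PREPROCESSOR,GUTIME,EVITA")
print(pipeline)
```

The order of the returned list is the order given in the string, which matters because the components run in that order.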
To end this section here is an example to illustrate the interaction of
the --source-format and --pipeline options:
$ python tarsqi.py --source-format xml --pipeline TOKENIZER test.xml test.tokenized.xml
$ python tarsqi.py --source-format ttk --pipeline TAGGER test.tokenized.xml test.tagged.xml
Note how the second line uses the ttk source type because output is always in
the TTK format. You could leave out the --source-format option and
let the toolkit determine the source from the input file.
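When several such steps are chained in a script, the argument lists can be built with a small helper so that the output of one step is reused as the input of the next. This is a sketch only: the helper ttk_command is hypothetical and it merely assembles the command lines shown above.

```python
def ttk_command(infile, outfile, source_format=None, pipeline=None):
    """Build the argument list for one tarsqi.py invocation."""
    command = ["python", "tarsqi.py"]
    if source_format is not None:
        command += ["--source-format", source_format]
    if pipeline is not None:
        command += ["--pipeline", pipeline]
    return command + [infile, outfile]


# The second step reads the TTK output of the first step.
steps = [
    ttk_command("test.xml", "test.tokenized.xml", "xml", "TOKENIZER"),
    ttk_command("test.tokenized.xml", "test.tagged.xml", "ttk", "TAGGER"),
]
for step in steps:
    print(" ".join(step))
```

To actually run the steps, each list could be handed to subprocess.run with check=True from the top-level directory of the distribution.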
The code documentation has a list of
links to all modules. For each module the documentation strings for classes and
functions are printed. Documentation of the code is uneven: some modules are
well-documented, others have spotty and underwhelming documentation or are not
documented at all. The pages were automatically generated with the
make_documentation.py script in the utilities
directory, which was created because the pydoc command crashes on
many of the toolkit modules.
Several design documents contain notes on the overall design of the toolkit and the algorithms used in many components. These notes are a work in progress and many sections are unfinished. The end of the top-level design document includes a list of tags generated by the toolkit.
There are also various random notes on the Tarsqi toolkit including descriptions of parts of the system, how-to guides and old specifications. Some are fairly recent, some are very old. The recent ones tend to be written in GitHub-flavoured Markdown and are best viewed on GitHub.
Finally, there is a list of published papers on Tarsqi.
Use https://github.com/tarsqi/ttk/issues for any issue, question or comment you may have. When reporting a bug, please be as specific as you can. Give us the version of the code you are running, your platform specifics, the input that gives you grief, and any other particulars that seem relevant. Support is uneven, but at irregular intervals we do check the list of issues.
Many people have contributed to the Tarsqi project; they are listed here in alphabetical order: Alex Baron, John Frank, Swini Garimella, Linda van Guilder, Josh Gieringer, Catherine Havasi, Jerry Hobbs, Seokbae Jang, Bob Knippen, Congmin Lee, Inderjeet Mani, Emin Mimaroglu, Jessica Moszkowicz, Feng Pan, John Phillips, Alex Plotnick, James Pustejovsky, Hongyuan Qiu, Ruth Reeves, Anna Rumshisky, Sanjib Kumar Saha, Roser Saurí, Barry Schiffman, Andrew See, Amber Stubbs, Kevin Thomas, Marc Verhagen, Ben Wellner and Dax Westerman.
From 2001 through 2008, work on Tarsqi was performed in the context of the IARPA AQUAINT Program and funded under ARDA/DoD/IARPA grant NBCHC040027. We are in particular thankful for the early support of John Prange and Heather McCallum-Bayliss. From 2014 through 2017 further work was supported by a Veterans Administration HSR&D Merit Award IIR 12-364.
The Tarsqi Toolkit is copyright ©2020 of Brandeis University and is distributed under the Apache 2 License.
The Tempex module is copyright of The MITRE corporation and is distributed under the license in tempex-license.pdf.
The Toolkit contains two Python Wordnet modules in utilities/wordnet.py and utilities/wntools.py, which were developed by Oliver Steele. Use of that code is permitted under the Artistic License. See the PyWordNet project page at http://sourceforge.net/projects/pywordnet and the license at http://www.opensource.org/licenses/artistic-license.html.
The data in data/in/TimeBank are copyrighted by the
various content providers and can be used for academic purposes only.