index
module tarsqi
Options
Tarsqi
TarsqiError
TarsqiWrapper
tarsqi.py

Main script that drives all tarsqi toolkit processing.

Source-specific processing is delegated to the docmodel package, which has
access to source parsers and metadata parsers. This script also calls on
various tarsqi modules to do the rest of the real work.


USAGE

   % python tarsqy.py [OPTIONS] [INPUT OUTPUT]

   INPUT/OUTPUT

      Input and output files or directories. If the input is a directory then
      the output directory needs to exist. If '--pipe' is one of the options
      then input and output are not required and they are ignored if they are
      there. Output is always in the TTK format, unless the --target-format
      option is set to lif, in which case output will be in the LIF format.

   OPTIONS

      --source-format NAME
          The format of the input; this reflects the source type of the document
          and allows components, especially the source parser and the metadata
          parser, to be sensitive to idiosyncratic properties of the text (for
          example, the location of the DCT and the format of the text). If this
          option is not specified then the system will try to guess one of
          'xml', 'ttk' or 'text', and default to 'text' if no clues can be found
          for the first two cases. Note that currently this guess will fail if
          the --pipe option is used. There are five more types that can be used
          to process the more specific sample data in data/in: lif for the data
          in data/in/lif, timebank for data/in/TimeBank, atee for data/in/ATEE,
          rte3 for data/in/RTE3 and db for data/in/db.

      --target-format NAME
          Output is almost always written in the TTK format. This option allows
          you to overrule that, but at the moment the only other format that is
          experimentally supported is the LIF format. If you use 'lif' for the
          target format then LIF output will be printed. However, currently this
          only works if the --source-format is also LIF.

      --pipeline LIST
          Comma-separated list of Tarsqi components, defaults to the full
          pipeline minus the link merger.

      --dct VALUE
          Use this to pass a document creation time (DCT) to the main script.
          The value is a normalized date expression like 20120830 for August
          30th 2012. If this option is not used then the DCT will be determined
          by the metadata parser that is defined for the input source. Note that
          the value of --dct will overrule any value calculated by the metadata
          parser.

      --pipe
          With this option the script reads input from the standard input and
          writes output to standard output. Without it the script expects the
          INPUT and OUTPUT arguments to be there. Note that when you do this you
          also need to use the --source-format option.

      --perl PATH
          Path to the Perl executable. Typically the operating system default is
          fine here and this option does not need to be used.

      --treetagger PATH
          Path to the TreeTagger.

      --mallet PATH
          Location of Mallet, this should be the directory that contains the
          bin directory.

      --classifier STRING
          The classifier used by the Mallet classifier, the default is MaxEnt.

      --ee-model FILENAME
      --et-model FILENAME
          The models used for classifying event-event and event-timex tlinks,
          these are model files in components/classifier/models, the defaults
          are set to tb-vectors.ee.model and tb-vectors.et.model.

      --import-events
          With this option the Evita component will try to import existing
          events by lifting EVENT tags from the source tags. It is assumed that
          those tags have 'begin', 'end' and 'class' attributes.

      --trap-errors True|False
          Set error trapping, errors are trapped by default.

      --loglevel INTEGER
          Set log level to an integer from 0 to 4, the higher the level the
          more messages will be written to the log, see utilities.logger for
          more details.

      Some of these options (the ones that have values) can also be set in the
      config.txt file.


VARIABLES:

   TTK_ROOT         -  the TTK directory
   CONFIG_FILE      -  file with user settings
   COMPONENTS       -  dictionary with all Tarsqi components
   USE_PROFILER     -  a boolean determining whether the profiler is used
   PROFILER_OUTPUT  -  file that profiler statistics are written to
class Options
Inherits from: object

A class to keep track of all the options. Options can be accessed with
the getopt() method, but standard options are also accessable directly
through the following instance variables: source, dct, pipeline, pipe,
loglevel, trap_errors, import_event_tags, perl, mallet, treetagger,
classifier, ee_model and et_model. There is no instance variable access for
user-defined options in the config.txt file.

Public Functions

__getitem__(self, key)
__init__(self, options)
Initialize options from the config file and the options handed in to the tarsqi script. Put known options in instance variables.
__str__(self)
getopt(self, option_name, default=None)
Return the option, use None as default.
items(self)
Simplistic way to do dictionary emulation.
pp(self)
set_option(self, opt, value)
Sets the value of opt in self._options to value. If opt is also expressed as an instance variable then change that one as well.
set_source_format(self, value)
Sets the source value, both in the dictionary and the instance variable.

Private Functions

_initialize_options(self, command_line_options)
Reads options from the config file and the command line. Also loops through the options dictionary and replaces some of the strings with other objects: (1) replaces 'True', 'False' and 'None', with True, False and None respectively, (2) replaces strings indicating an integer with that integer (but not for the dct), (3) replaces the empty string with True for the --pipe and --import-events options, and (4) replaces the value of the --mallet and --treetagger options, which are known to be paths, with the absolute path.
_initialize_properties(self)
Put options in instance variables for convenience. This is done for those options that are defined for the command line and not for options from config.txt that are user-specific. Note that due to naming rules for attributes (no dashes allowed), options with a dash are spelled with an underscore when they are instance variables.
class Tarsqi
Inherits from: object

Main Tarsqi class that drives all processing.

Instance variables:

   input                -  absolute path
   output               -  absolute path
   basename             -  basename of input file
   options              -  an instance of Options with processing options
   tarsqidoc            -  an instance of TarsqiDocument
   source_parser        -  a source-specific parser for the source
   metadata_parser      -  a source-specific metadata parser
   docstructure_parser  -  a document structure parser
   pipeline             -  list of name-wrapper pairs
   components           -  dictionary of Tarsqi components
   document             -  instance of TarsqiDocument
   tmp_data             -  path to directory for temporary files

The first nine instance variables are initialized using the arguments
provided by the user, the document variable is initialized and changed
during processing.

Public Functions

__init__(self, opts, infile, outfile)
Initialize Tarsqi object conform the data source identifier and the processing options. Does not set the instance variables related to the document model and the meta data. The opts argument has a list of command line options and the infile and outfile arguments are typically absolute paths, but they can be None when we are processing strings.
process_document(self)
Parse the source with the source parser, the metadata parser and the document structure parser, apply all components and write the results to a file. The actual processing itself is driven using the processing options set at initialization. Components are given the TarsqiDocument and update it.
process_string(self, input_string)
Similar to process(), except that it runs on an input string and not on a file, it does not write the output to a file and it returns the TarsqiDocument.

Private Functions

_apply_component(self, name, wrapper, tarsqidocument)
Apply a component by taking the TarsqDocument, which includes the options from the Tarsqi instance, and passing it to the component wrapper. Component-level errors are trapped here if --trap-errors is True. If errors are trapped, it is still possible that partial results were written to the TagRepositories in the TarsqiDocument.
_cleanup_directories(self)
Remove all fragments from the temporary data directory.
_create_pipeline(self)
Return the pipeline as a list of pairs with the component name and wrapper.
_initialize_parsers(self)
_update_processing_history(self)
_write_output(self)
Write the TarsqiDocument to the output file.
class TarsqiError
Inherits from: Exception

Tarsqi Exception class, so far only used in this file.
class TarsqiWrapper
Inherits from: object

Class that wraps the Tarsqi class, taking care of some of the IO aspects.

Public Functions

__init__(self, args)
pp(self)
run(self)
Main method that is called when the script is executed from the command line. It creates a Tarsqi instance and lets it process the input. If the input is a directory, this method will iterate over the contents, setting up Tarsqi instances for all files in the directory. The arguments are the list of arguments given by the user on the command line.

Private Functions

_run_tarsqi_on_directory(self)
Run Tarsqi on all files in a directory.
_run_tarsqi_on_file(self)
_run_tarsqi_on_pipe(self)
Read text from standard input and run tarsqi over it, then print the result to standard out.
module functions
load_ttk_document(fname, loglevel=2, trap_errors=False)
Load a TTK document with all its Tarsqi tags and return the Tarsqi instance and the TarsqiDocument instance. Do not run the pipeline, but run the source parser, metadata parser and the document structure parser. Used by the evaluation code.
process_string(text, pipeline='PREPROCESSOR', loglevel=2, trap_errors=False)
Run tarsqi on a bare string without any XML tags, handing in pipeline, loglevel and error trapping options.
run_profiler(args)
Wrap running Tarsqi in the profiler.