module tarsqi
tarsqi.py
Main script that drives all tarsqi toolkit processing.
Source-specific processing is delegated to the docmodel package, which has
access to source parsers and metadata parsers. This script also calls on
various tarsqi modules to do the rest of the real work.
USAGE
% python tarsqi.py [OPTIONS] [INPUT OUTPUT]
INPUT/OUTPUT
Input and output files or directories. If the input is a directory then
the output directory needs to exist. If '--pipe' is one of the options
then input and output are not required and they are ignored if they are
there. Output is always in the TTK format, unless the --target-format
option is set to lif, in which case output will be in the LIF format.
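For scripted runs, the invocation above can be assembled as an argument list. The following is a minimal sketch that only builds the list and does not execute anything. The helper name build_ttk_command and the file names are hypothetical; the flags are the ones listed under OPTIONS, and the check on --pipe follows the note in the --pipe entry.

```python
def build_ttk_command(infile=None, outfile=None, pipe=False, **options):
    """Return an argv list for tarsqi.py (hypothetical helper).

    Keyword arguments use underscores and are turned into dashed option
    names, for example source_format becomes --source-format.
    """
    command = ["python", "tarsqi.py"]
    for name, value in sorted(options.items()):
        command.extend(["--" + name.replace("_", "-"), str(value)])
    if pipe:
        # With --pipe, input and output are not required and are ignored,
        # but the source format has to be given explicitly.
        if "source_format" not in options:
            raise ValueError("--pipe also requires --source-format")
        command.append("--pipe")
    else:
        if infile is None or outfile is None:
            raise ValueError("INPUT and OUTPUT are required without --pipe")
        command.extend([infile, outfile])
    return command


# Illustrative file names only.
print(build_ttk_command("in.xml", "out.ttk", source_format="xml"))
```

The resulting list can be handed to subprocess.run() from the TTK directory; whether the script accepts the options in this order is an assumption of the sketch.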
OPTIONS
--source-format NAME
The format of the input; this reflects the source type of the document
and allows components, especially the source parser and the metadata
parser, to be sensitive to idiosyncratic properties of the text (for
example, the location of the DCT and the format of the text). If this
option is not specified then the system will try to guess one of
'xml', 'ttk' or 'text', and default to 'text' if no clues can be found
for the first two cases. Note that currently this guess will fail if
the --pipe option is used. There are five more types that can be used
to process the more specific sample data in data/in: lif for the data
in data/in/lif, timebank for data/in/TimeBank, atee for data/in/ATEE,
rte3 for data/in/RTE3 and db for data/in/db.
--target-format NAME
Output is almost always written in the TTK format. This option allows
you to overrule that, but at the moment the only other format that is
experimentally supported is the LIF format. If you use 'lif' for the
target format then LIF output will be printed. However, currently this
only works if the --source-format is also LIF.
--pipeline LIST
Comma-separated list of Tarsqi components; defaults to the full
pipeline minus the link merger.
--dct VALUE
Use this to pass a document creation time (DCT) to the main script.
The value is a normalized date expression like 20120830 for August
30th 2012. If this option is not used then the DCT will be determined
by the metadata parser that is defined for the input source. Note that
the value of --dct will overrule any value calculated by the metadata
parser.
--pipe
With this option the script reads input from the standard input and
writes output to standard output. Without it the script expects the
INPUT and OUTPUT arguments to be there. Note that when you use --pipe you
also need to use the --source-format option.
--perl PATH
Path to the Perl executable. Typically the operating system default is
fine here and this option does not need to be used.
--treetagger PATH
Path to the TreeTagger.
--mallet PATH
Location of Mallet; this should be the directory that contains the
bin directory.
--classifier STRING
The classifier type used by Mallet; the default is MaxEnt.
--ee-model FILENAME
--et-model FILENAME
The models used for classifying event-event and event-timex tlinks.
These are model files in components/classifier/models; the defaults
are tb-vectors.ee.model and tb-vectors.et.model.
--import-events
With this option the Evita component will try to import existing
events by lifting EVENT tags from the source tags. It is assumed that
those tags have 'begin', 'end' and 'class' attributes.
--trap-errors True|False
Set error trapping; errors are trapped by default.
--loglevel INTEGER
Set log level to an integer from 0 to 4; the higher the level, the
more messages are written to the log. See utilities.logger for
more details.
Some of these options (the ones that have values) can also be set in the
config.txt file.
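The value for --dct is a compact date string. A minimal sketch of checking such a value before passing it on, assuming the YYYYMMDD layout implied by the 20120830 example; the helper name parse_dct is hypothetical and not part of the toolkit:

```python
from datetime import date, datetime


def parse_dct(value):
    """Parse a normalized date string such as '20120830' into a date.

    Hypothetical helper: assumes the YYYYMMDD layout from the --dct
    example. Raises ValueError for anything else.
    """
    if len(value) != 8 or not value.isdigit():
        raise ValueError("expected an 8-digit date, got %r" % value)
    return datetime.strptime(value, "%Y%m%d").date()


dct = parse_dct("20120830")
print(dct == date(2012, 8, 30))
```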
VARIABLES:
TTK_ROOT - the TTK directory
CONFIG_FILE - file with user settings
COMPONENTS - dictionary with all Tarsqi components
USE_PROFILER - a boolean determining whether the profiler is used
PROFILER_OUTPUT - file that profiler statistics are written to
class Options
Inherits from: object
A class to keep track of all the options. Options can be accessed with
the getopt() method, but standard options are also accessible directly
through the following instance variables: source, dct, pipeline, pipe,
loglevel, trap_errors, import_event_tags, perl, mallet, treetagger,
classifier, ee_model and et_model. There is no instance variable access for
user-defined options in the config.txt file.
Public Functions
__getitem__(self, key)
__init__(self, options)
Initialize options from the config file and the options handed in to
the tarsqi script. Put known options in instance variables.
__str__(self)
getopt(self, option_name, default=None)
Return the option, use None as default.
items(self)
Simplistic way to do dictionary emulation.
pp(self)
set_option(self, opt, value)
Sets the value of opt in self._options to value. If opt is also
expressed as an instance variable then change that one as well.
set_source_format(self, value)
Sets the source value, both in the dictionary and the instance
variable.
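The public interface above amounts to a dictionary with a few mirrored attributes. The sketch below is a hypothetical stand-in, not the toolkit's class: the name OptionsSketch and the KNOWN tuple are assumptions, and only options whose attribute name is the option name with dashes turned into underscores are mirrored.

```python
class OptionsSketch(object):
    """Hypothetical stand-in for the Options interface described above."""

    # Options mirrored as instance variables in this sketch (assumption).
    KNOWN = ("dct", "pipeline", "loglevel", "trap-errors", "ee-model", "et-model")

    def __init__(self, options):
        self._options = {}
        for opt, value in options.items():
            self.set_option(opt, value)

    def __getitem__(self, key):
        return self._options[key]

    def getopt(self, option_name, default=None):
        # Return the option, or the default if it was never set.
        return self._options.get(option_name, default)

    def items(self):
        # Simple dictionary emulation.
        return self._options.items()

    def set_option(self, opt, value):
        self._options[opt] = value
        if opt in self.KNOWN:
            # Attribute names cannot contain dashes.
            setattr(self, opt.replace("-", "_"), value)


opts = OptionsSketch({"trap-errors": True, "loglevel": 2, "my-setting": "x"})
opts.set_option("loglevel", 4)
print(opts.getopt("loglevel"), opts.loglevel, opts.getopt("missing"))
```

In this sketch a user-defined option such as my-setting is reachable through getopt() and indexing only, which mirrors the statement that such options get no instance variable.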
Private Functions
_initialize_options(self, command_line_options)
Reads options from the config file and the command line. Also loops
through the options dictionary and replaces some of the strings with
other objects: (1) replaces 'True', 'False' and 'None', with True, False
and None respectively, (2) replaces strings indicating an integer with
that integer (but not for the dct), (3) replaces the empty string with
True for the --pipe and --import-events options, and (4) replaces the
value of the --mallet and --treetagger options, which are known to be
paths, with the absolute path.
_initialize_properties(self)
Put options in instance variables for convenience. This is done for
those options that are defined for the command line and not for options
from config.txt that are user-specific. Note that due to naming rules
for attributes (no dashes allowed), options with a dash are spelled with
an underscore when they are instance variables.
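The four replacements listed for _initialize_options can be sketched as a single pass over a dictionary. This is an illustration under assumptions, not the toolkit's code: the function name normalize_options is hypothetical, option names are assumed to be stored without leading dashes, and only non-negative integers are recognized.

```python
import os


def normalize_options(raw):
    """Return a copy of raw with string values converted (hypothetical)."""
    constants = {"True": True, "False": False, "None": None}
    options = {}
    for name, value in raw.items():
        if not isinstance(value, str):
            pass                                   # already converted
        elif value in constants:
            value = constants[value]               # (1) booleans and None
        elif name in ("mallet", "treetagger"):
            value = os.path.abspath(value)         # (4) absolute paths
        elif value == "" and name in ("pipe", "import-events"):
            value = True                           # (3) bare flags
        elif name != "dct" and value.isdigit():
            value = int(value)                     # (2) integers, not the dct
        options[name] = value
    return options


opts = normalize_options({"pipe": "", "loglevel": "3", "dct": "20120830",
                          "trap-errors": "False", "mallet": "tools/mallet"})
print(opts["loglevel"], opts["dct"], opts["trap-errors"], opts["pipe"])
```

The path check runs before the integer check so that a directory with a purely numeric name is still treated as a path; that ordering is a choice of the sketch.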
class Tarsqi
Inherits from: object
Main Tarsqi class that drives all processing.
Instance variables:
input - absolute path
output - absolute path
basename - basename of input file
options - an instance of Options with processing options
tarsqidoc - an instance of TarsqiDocument
source_parser - a source-specific parser for the source
metadata_parser - a source-specific metadata parser
docstructure_parser - a document structure parser
pipeline - list of name-wrapper pairs
components - dictionary of Tarsqi components
document - instance of TarsqiDocument
tmp_data - path to directory for temporary files
The first nine instance variables are initialized using the arguments
provided by the user; the document variable is initialized and changed
during processing.
Public Functions
__init__(self, opts, infile, outfile)
Initialize the Tarsqi object according to the data source identifier and
the processing options. Does not set the instance variables related to the
document model and the metadata. The opts argument has a list of
command line options and the infile and outfile arguments are typically
absolute paths, but they can be None when we are processing strings.
process_document(self)
Parse the source with the source parser, the metadata parser and the
document structure parser, apply all components and write the results to
a file. The actual processing itself is driven using the processing
options set at initialization. Components are given the TarsqiDocument
and update it.
process_string(self, input_string)
Similar to process_document(), except that it runs on an input string
rather than a file, does not write the output to a file, and returns the
TarsqiDocument.
Private Functions
_apply_component(self, name, wrapper, tarsqidocument)
Apply a component by taking the TarsqiDocument, which includes the
options from the Tarsqi instance, and passing it to the component
wrapper. Component-level errors are trapped here if --trap-errors is
True. If errors are trapped, it is still possible that partial results
were written to the TagRepositories in the TarsqiDocument.
_cleanup_directories(self)
Remove all fragments from the temporary data directory.
_create_pipeline(self)
Return the pipeline as a list of pairs with the component name and
wrapper.
_initialize_parsers(self)
_update_processing_history(self)
_write_output(self)
Write the TarsqiDocument to the output file.
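How _create_pipeline and _apply_component fit together can be sketched as follows. Everything here is a simplified assumption: wrappers are modeled as plain callables, the document is a dictionary, and the component keys and stand-in functions are placeholders rather than the toolkit's own objects.

```python
def create_pipeline(pipeline_option, components):
    """Turn a comma-separated list into (name, wrapper) pairs."""
    names = [n.strip() for n in pipeline_option.split(",") if n.strip()]
    return [(name, components[name]) for name in names]


def apply_component(name, wrapper, document, trap_errors=True):
    """Run one wrapper on the document, trapping errors if requested."""
    try:
        wrapper(document)
    except Exception as error:
        if not trap_errors:
            raise
        # Partial results written before the failure stay in the document.
        document["errors"].append("%s: %s" % (name, error))


def run_pipeline(pipeline_option, components, document, trap_errors=True):
    for name, wrapper in create_pipeline(pipeline_option, components):
        apply_component(name, wrapper, document, trap_errors)
    return document


# Stand-in components, for illustration only.
def tokenize(document):
    document["tags"].append("token")


def broken(document):
    document["tags"].append("partial")
    raise RuntimeError("component failed")


COMPONENTS = {"PREPROCESSOR": tokenize, "EVITA": broken}
doc = run_pipeline("PREPROCESSOR,EVITA", COMPONENTS, {"tags": [], "errors": []})
print(doc["tags"], doc["errors"])
```

With trap_errors set to False the same call would raise at the failing component instead of recording the error.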
class TarsqiError
Inherits from: Exception
Tarsqi Exception class, so far only used in this file.
class TarsqiWrapper
Inherits from: object
Class that wraps the Tarsqi class, taking care of some of the IO aspects.
Public Functions
__init__(self, args)
pp(self)
run(self)
Main method that is called when the script is executed from the command
line. It creates a Tarsqi instance and lets it process the input. If the
input is a directory, this method will iterate over the contents, setting up
Tarsqi instances for all files in the directory. The arguments are the list
of arguments given by the user on the command line.
Private Functions
_run_tarsqi_on_directory(self)
Run Tarsqi on all files in a directory.
_run_tarsqi_on_file(self)
_run_tarsqi_on_pipe(self)
Read text from standard input and run tarsqi over it, then print the result
to standard out.
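The choice between the three private run methods can be sketched as a small dispatch on the arguments. The function name choose_run_mode and the use of ValueError are assumptions of this sketch; the actual class may signal problems differently, for example with TarsqiError.

```python
import os
import tempfile


def choose_run_mode(pipe, inpath=None, outpath=None):
    """Return 'pipe', 'directory' or 'file' (hypothetical helper)."""
    if pipe:
        # INPUT and OUTPUT are ignored when reading from standard input.
        return "pipe"
    if inpath is None or outpath is None:
        raise ValueError("INPUT and OUTPUT are required without --pipe")
    if os.path.isdir(inpath):
        # For directory input the output directory has to exist already.
        if not os.path.isdir(outpath):
            raise ValueError("output directory does not exist: %s" % outpath)
        return "directory"
    if os.path.isfile(inpath):
        return "file"
    raise ValueError("input does not exist: %s" % inpath)


with tempfile.TemporaryDirectory() as indir, tempfile.TemporaryDirectory() as outdir:
    print(choose_run_mode(False, indir, outdir))
```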
module functions
load_ttk_document(fname, loglevel=2, trap_errors=False)
Load a TTK document with all its Tarsqi tags and return the Tarsqi instance
and the TarsqiDocument instance. Do not run the pipeline, but run the source
parser, metadata parser and the document structure parser. Used by the
evaluation code.
process_string(text, pipeline='PREPROCESSOR', loglevel=2, trap_errors=False)
Run tarsqi on a bare string without any XML tags, handing in pipeline,
loglevel and error trapping options.
run_profiler(args)
Wrap running Tarsqi in the profiler.