module docmodel.metadata_parser
MetadataParser
MetadataParserATEE
MetadataParserDB
MetadataParserRTE3
MetadataParserTTK
MetadataParserText
MetadataParserTimebank
Metadata Parsers.

This module contains metadata parsers, that is, parsers that pull out the
metadata and add it to a TarsqiDocument. The only requirements on each parser
are that it defines an __init__() method that takes a dictionary of options and
a parse() method that takes a TarsqiDocument instance.

Current parsers only deal with the DCT.
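
As a minimal sketch of that contract, the class below defines only the two required methods. FakeTarsqiDocument is a hypothetical stand-in that holds nothing but a metadata dictionary, and the 'dct' option key and the fixed fallback date are assumptions made for illustration; none of this is taken from the module itself.

```python
class FakeTarsqiDocument:
    """Hypothetical stand-in for TarsqiDocument: only a metadata dictionary."""

    def __init__(self):
        self.metadata = {}


class FixedDateParser:
    """Illustrative parser that satisfies the two requirements above."""

    def __init__(self, options):
        # options is a plain dictionary; the 'dct' key is an assumption
        self.dct = options.get('dct', '19700101')

    def parse(self, tarsqidoc):
        # Current parsers only deal with the DCT, so that is all we add
        tarsqidoc.metadata['dct'] = self.dct


doc = FakeTarsqiDocument()
FixedDateParser({'dct': '19991231'}).parse(doc)
print(doc.metadata)
```

Note that the parser is created before any document exists and only sees the document when parse() is called.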
class MetadataParser
Inherits from: object

This is the minimal metadata parser that is used as a default. It collects
the DCTs from all available sources and picks one of them, or it uses today's
date if no DCTs are available. Subclasses should override the get_dct()
method to define specific DCT extraction methods for the document source.

Public Functions

__init__(self, options)
At the moment, initialization only uses the --dct option if it is present, but this could change. Note that the TarsqiDocument does not exist yet when the MetadataParser is initialized.
get_dct(self)
parse(self, tarsqidoc)
Adds metadata to the TarsqiDocument. The only thing it adds to the metadata dictionary is the DCT, which falls back to today's date if no DCT is available.

Private Functions

_get_source(self)
A convenience method to lift the SourceDoc out of the tarsqi instance.
_get_tag_content(self, tagname)
Return the text content of the first tag with name tagname, or None if there is no such tag.
_import_processing_steps(self)
The processing steps were parsed by the metadata parser for the TTK format; here we just import them.
_moderate_dct_vals(self)
There are five places where a DCT can be expressed: the DCT handed in with the --dct option or defined in the config file, the DCT from the metadata on the TarsqiDocument, the DCT from the metadata on the SourceDoc, DCTs from the TagRepository on the TarsqiDocument and DCTs from the TagRepository on the SourceDoc. The first three are single values or None, the other two are lists of any length. The order of these five is significant in that a DCT earlier on the list is given precedence over a DCT later on the list. This method collects all the DCT values and picks the very first one, or today's date if no DCTs are available. It logs a warning if the DCTs do not all have the same value.
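
The precedence rule can be sketched as a standalone function. This is an illustration under stated assumptions, not the method itself: the function name and arguments are hypothetical, the five sources are passed in directly instead of being read from the documents, today's date is formatted as YYYYMMDD by assumption, and a boolean flag stands in for the logged warning.

```python
import datetime


def moderate_dct_vals(option_dct, doc_dct, source_dct,
                      doc_tag_dcts, source_tag_dcts):
    """Pick a DCT from five sources, given in order of precedence.

    The first three arguments are single values or None, the last two are
    lists of any length. Returns the chosen DCT and a flag that is True
    when the available values disagree.
    """
    candidates = [option_dct, doc_dct, source_dct]
    candidates.extend(doc_tag_dcts)
    candidates.extend(source_tag_dcts)
    # Drop the sources that did not supply a value
    dcts = [dct for dct in candidates if dct is not None]
    if not dcts:
        # No DCT anywhere: fall back on today's date (format is assumed)
        return datetime.date.today().strftime('%Y%m%d'), False
    conflict = len(set(dcts)) > 1
    return dcts[0], conflict


chosen, conflict = moderate_dct_vals(
    None, '19980213', None, ['19980213', '19980214'], [])
print(chosen, conflict)
```

Here the DCT from the document metadata wins because no --dct option was given, and the flag is set because one of the tags carries a different value.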
class MetadataParserATEE
Inherits from: MetadataParser

The parser for ATEE documents.

Public Functions

get_dct(self)
All ATEE documents have a DATE tag with a value attribute; the value of that attribute is returned.
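
The idea can be illustrated on a made-up fragment. The sketch below parses XML with the standard library and does not reproduce how the toolkit accesses its tags; the document layout, the date value and the standalone get_dct() function are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical document fragment; real ATEE files may look different
text = '<DOC><DATE value="20050314"/><TEXT>Some content.</TEXT></DOC>'


def get_dct(xml_string):
    """Return the value attribute of the first DATE tag, or None."""
    root = ET.fromstring(xml_string)
    date = root.find('.//DATE')
    return None if date is None else date.get('value')


print(get_dct(text))
```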
class MetadataParserDB
Inherits from: MetadataParser

A minimal example parser for cases where the DCT is retrieved from a
database. It is identical to MetadataParser except for how it gets the
DCT, which is looked up in a database. This is the simplest possible
case, and it is quite inefficient. It assumes there is an sqlite
database at 'TTK_ROOT/data/in/va/dct.sqlite' which was created as
follows:

   $ sqlite3 dct.sqlite
   sqlite> create table dct (filename TEXT, dct TEXT);
   sqlite> insert into dct values ("test.xml", "1999-12-31");

The get_dct() method uses this database and the location of the database is
specified in the config.txt file. The first use case for this was VA
documents where the DCT was stored externally. To see this in action run:

   $ python tarsqi.py --source-format=db data/in/va/test.xml out.xml

Public Functions

get_dct(self)
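
A lookup of this kind might look roughly as follows. This is a sketch and not the method's actual code: lookup_dct() is a hypothetical helper, and an in-memory database with the table layout described above replaces the file whose location comes from config.txt.

```python
import sqlite3


def lookup_dct(connection, filename):
    """Return the DCT stored for filename, or None if there is no row."""
    cursor = connection.execute(
        'SELECT dct FROM dct WHERE filename = ?', (filename,))
    row = cursor.fetchone()
    return None if row is None else row[0]


# In-memory database with the same table layout as dct.sqlite
connection = sqlite3.connect(':memory:')
connection.execute('CREATE TABLE dct (filename TEXT, dct TEXT)')
connection.execute("INSERT INTO dct VALUES ('test.xml', '1999-12-31')")

print(lookup_dct(connection, 'test.xml'))
```

The query uses a placeholder for the file name so that the value is passed to sqlite separately from the SQL text.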
class MetadataParserRTE3
Inherits from: MetadataParser

The parser for RTE3 documents. It does not differ from the default parser.
class MetadataParserTTK
Inherits from: MetadataParser

The metadata parser for the ttk format. For now this one adds
nothing to the default metadata parser.
class MetadataParserText
Inherits from: MetadataParser

The metadata parser for the text format. For now this one adds
nothing to the default metadata parser.
class MetadataParserTimebank
Inherits from: MetadataParser

The parser for Timebank documents. All it does is override the
get_dct() method.

Public Functions

get_dct(self)
Extracts the document creation time, and returns it as a string of the form YYYYMMDD. Depending on the source, the DCT can be found in one of the following tags: DOCNO, DATE_TIME, PUBDATE or FILEID.

Private Functions

_get_doc_source(self)
Return the name of the content provider as well as the content of the DOCNO tag that has that information.
_parse_tag_content(self, regexpr, tagname)
Return the DCT part of the tag content of tagname; requires a regular expression as one of the arguments.
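
The regular expression step can be sketched in isolation. The helper below is hypothetical and takes the tag content directly instead of a tag name; the sample DOCNO content, the provider prefix and the pattern are assumptions chosen for illustration.

```python
import re


def parse_tag_content(regexpr, content):
    """Return the first group matched by regexpr in content, or None."""
    match = re.search(regexpr, content)
    return match.group(1) if match else None


# Hypothetical DOCNO content; the provider prefix and layout are assumed
docno = ' APW19980213.1310 '
print(parse_tag_content(r'APW(\d{8})', docno))
```

The captured group is already in the YYYYMMDD form that get_dct() returns, so no further conversion is needed in this case.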
module functions