[Erp5-report] r27931 - /erp5/trunk/products/ERP5/interfaces/document.py

Fri Jul 3 19:04:14 CEST 2009

Author: nicolas
Date: Fri Jul  3 19:04:14 2009
New Revision: 27931

URL: http://svn.erp5.org?rev=27931&view=rev
Log:
Add IDocument interface

Added:
    erp5/trunk/products/ERP5/interfaces/document.py

Added: erp5/trunk/products/ERP5/interfaces/document.py
URL: http://svn.erp5.org/erp5/trunk/products/ERP5/interfaces/document.py?rev=27931&view=auto
==============================================================================

--- erp5/trunk/products/ERP5/interfaces/document.py (added)
+++ erp5/trunk/products/ERP5/interfaces/document.py [utf8] Fri Jul  3 19:04:14 2009
@@ -1,0 +1,268 @@
+# -*- coding: utf-8 -*-
+##############################################################################
+#
+# Copyright (c) 2009 Nexedi SA and Contributors. All Rights Reserved.
+#                    Nicolas Delaby <nicolas at nexedi.com>
+#
+# WARNING: This program as such is intended to be used by professional
+# programmers who take the whole responsability of assessing all potential
+# consequences resulting from its eventual inadequacies and bugs
+# End users who are looking for a ready-to-use solution with commercial
+# garantees and support are strongly adviced to contract a Free Software
+# Service Company
+#
+# This program is Free Software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License
+# as published by the Free Software Foundation; either version 2
+# of the License, or (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software
+# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+#
+##############################################################################
+
+from zope.interface import Interface
+
+class IDocument(Interface):
+  """Document is an abstract class with all methods related to document
+  management in ERP5. This includes searchable text, explicit relations,
+  implicit relations, metadata, versions, languages, etc.
+
+  Documents may either store their content directly or cache content
+  which is retrieved from a specified URL. The second case if often
+  referred as "External Document". Standalone "External Documents" may
+  be created by specifying a URL to the contribution tool which is in
+  charge of initiating the download process and selecting the appropriate
+  document type. Groups of "External Documents" may also be generated from
+  so-called "External Source" (refer to ExternalSource class for more
+  information).
+
+  External Documents may be downloaded once or updated at regular interval.
+  The later can be useful to update the content of an external source.
+  Previous versions may be stored in place or kept in a separate file.
+  This feature is known as the crawling API. It is mostly implemented
+  in ContributionTool with wrappers in the Document class. It can be useful
+  for create a small search engine.
+
+  There are currently two types of Document subclasses:
+
+  * File for binary file based documents. File has subclasses such as Image,
+    OOoDocument, PDFDocument, etc. to implement specific conversion methods.
+
+  * TextDocument for text based documents. TextDocument has subclasses such
+    as Wiki to implement specific methods. 
+    TextDocument itself has a subclass (XSLTDocument) which provides
+    XSLT based analysis and transformation of XML content based on XSLT
+    templates. 
+
+  Conversion should be achieved through the convert method and other methods
+  of the conversion API (convertToBaseFormat, etc.).
+  Moreover, any Document subclass must ne able to convert documents to text
+  (asText method) and HTML (asHTML method). Text is required for full text
+  indexing. HTML is required for crawling.
+
+  Instances can be created directly, or via portal_contributions tool which
+  manages document ingestion process whereby a file can be uploaded by http
+  or sent in by email or dropped in by webdav or in some other way as yet
+  unknown. The ingestion process has the following steps:
+
+  (1) portal type detection
+  (2) object creation and upload of data
+  (3) metadata discovery (optionally with conversion of data to another format)
+  (4) other possible actions to finalise the ingestion (ex. by assigning
+      a reference)
+
+  This class handles (3) and calls a ZMI script to do (4).
+
+  Metadata can be drawn from various sources:
+
+  input      -   data supplied with http request or set on the object during (2) (e.g.
+                 discovered from email text)
+  file_name  -   data which might be encoded in file name
+  user_login -   information about user who is contributing the file
+  content    -   data which might be derived from document content
+
+  If a certain property is defined in more than one source, it is set according to
+  preference order returned by a script 
+     Document_getPreferredDocumentMetadataDiscoveryOrderList
+     (or any type-based version since discovery is type dependent)
+
+  Methods for discovering metadata are:
+
+    getPropertyDictFromInput
+    getPropertyDictFromFileName
+    getPropertyDictFromUserLogin
+    getPropertyDictFromContent
+
+  Methods for processing content are implemented either in Document class
+  or in Base class:
+
+    getSearchableReferenceList (Base)
+    getSearchableText (Base)
+    index_html (overriden in Document subclasses)
+
+  Methods for handling relations are implemented either in Document class
+  or in Base class:
+
+    getImplicitSuccessorValueList (Base)
+    getImplicitPredecessorValueList (Base)
+    getImplicitSimilarValueList (Base)
+    getSimilarCloudValueList (Document)
+
+  Implicit relations consist in finding document references inside
+  searchable text (ex. INV-23456) and deducting relations from that.
+  Two customisable methods required. One to find a list of implicit references
+  inside the content (getSearchableReferenceList) and one to convert a given
+  document reference into a list of reference strings which could be present
+  in other content (asSearchableReferenceList).
+
+  document.getSearchableReferenceList() returns
+    [
+     {'reference':' INV-12367'},
+     {'reference': 'INV-1112', 'version':'012}', 
+     {'reference': 'AB-CC-DRK', 'version':'011', 'language': 'en'}
+    ]
+
+  The Document class behaviour can be extended / customized through scripts
+  (which are type-based so can be adjusted per portal type).
+
+  * Document_getPropertyDictFromUserLogin - finds a user (by user_login or
+    from session) and returns properties which should be set on the document
+
+  * Document_getPropertyDictFromContent - analyzes document content and returns
+    properties which should be set on the document
+
+  * Base_getImplicitSuccessorValueList - finds appropriate all documents
+    referenced in the current content
+
+  * Base_getImplicitPredecessorValueList - finds document predecessors based on
+    the document coordinates (can use only complete coordinates, or also partial)
+
+  * Document_getPreferredDocumentMetadataDiscoveryOrderList - returns an order
+    in which metadata should be set/overwritten
+
+  * Document_finishIngestion - called by portal_activities after all the ingestion
+    is completed (and after document has been converted, so text_content
+    is available if the document has it)
+
+  * Document_getNewRevision - calculates revision number which should be set
+    on this document. Implementation depends on revision numbering policy which
+    can be very different. Interaction workflow should call setNewRevision method.
+
+  * Document_populateContent - analyses the document content and produces
+    subcontent based on it (ex. images, news, etc.). This scripts can
+    involve for example an XSLT transformation to process XML.
+
+  Subcontent: documents may include subcontent (files, images, etc.)
+  so that publication of rich content can be path independent. Subcontent
+  can also be used to help the rendering in HTML of complex documents
+  such as ODF documents.
+
+  Consistency checking:
+    Default implementation uses DocumentReferenceConstraint to check if the 
+    reference/language/version triplet is unique. Additional constraints
+    can be added if necessary.
+
+  NOTE: Document.py supports a notion of revision which is very specific.
+  The underlying concept is that, as soon as a document has a reference,
+  the association of (reference, version, language) must be unique accross
+  the whole system. This means that a given document in a given version in a
+  given language is unique. The underlying idea is similar to the one in a Wiki
+  system in which each page is unique and acts the the atom of collaboration.
+  In the case of ERP5, if a team collaborates on a Text document written with
+  an offline word processor, all updates should be placed inside the same object.
+  A Contribution will thus modify an existing document, if allowed from security
+  point of view, and increase the revision number. Same goes for properties
+  (title). Each change generates a new revision.
+  """
+
+  def convert(format, **kw):
+    """Call a wrapped function with CachingMethod and
+    return always converted result.
+
+    format - the format specied in the form of an extension
+    string (ex. jpeg, html, text, txt, etc.)
+    **kw can be various things - e.g. resolution
+    """
+
+  def generateCacheId(format):
+    """Return string to identify Document in cache pool with
+    format - reference - version - language
+    """
+
+  def convertToBaseFormat(**kw):
+    """Converts the content of the document to a base format
+    which is later used for all conversions. This method
+    is common to all kinds of documents and handles
+    exceptions in a unified way.
+
+    Implementation is delegated to _convertToBaseFormat which
+    must be overloaded by subclasses of Document which
+    need a base format.
+
+    convertToBaseFormat is called upon file upload, document
+    ingestion by the processing_status_workflow.
+
+    NOTE: the data of the base format conversion should be stored
+    using the base_data property. Refer to Document.py propertysheet.
+    Use accessors (getBaseData, setBaseData, hasBaseData, etc.)
+    """
+
+  def _convertToBaseFormat():
+    """Placeholder method. Must be subclassed by classes
+    which need a base format. Refer to OOoDocument
+    for an example of ODF base format which is used
+    as a way to convert about any file format into
+    about any file format.
+
+    Other possible applications: conversion of HTML
+    text to tiddy HTML such as described here:
+    http://www.xml.com/pub/a/2004/09/08/pyxml.html
+    so that resulting text can be processed more
+    easily by XSLT parsers. Conversion of internal
+    links to images of an HTML document to local
+    links (in combindation with populate).
+    """
+
+  def clearConversionCache():
+    """Clear cache (invoked by interaction workflow upon file upload
+    needed here to overwrite class attribute with instance attrs
+    """
+
+  def updateConversionCache():
+    """Initialise Caches
+    """
+
+  def hasConversion(**kw):
+    """Return a boolean if conversion is cached
+    """
+
+  def setConversion(data, mime=None, calculation_time=None, **kw):
+    """Saves a version of the document in a given format; records mime type
+    and conversion time (which is right now).
+    """
+
+  def getConversion(**kw):
+    """Returns version of the document in a given format, if it has it; otherwise
+    returns empty string (the caller should check hasConversion before calling
+    this function.
+    """
+
+  def getConversionSize(**kw):
+    """Returns the size of the converted document.
+    """
+
+  def _getCacheFactory():
+    """Return list of cache plugins
+    """
+
+  def isSupportBaseDataConversion():
+    """This is a public interface to check a document that is support conversion
+    to base format and can be overridden in subclasses.
+    """