alfresco-community-repo/source/java/org/alfresco/repo/content/metadata/AbstractMappingMetadataExtracter.java
Dave Ward a5f31cd37e Merged V3.3 to HEAD
20167: Merged HEAD to BRANCHES/V3.3: (RECORD ONLY)
      20166: Fix ALF-2765: Renditions created via 3.3 RenditionService are not exposed via OpenCMIS rendition API
   20232: Fix problem opening AVM web project folders via FTP. ALF-2738.
   20234: ALF-2352: Cannot create folders in Share doclib without admin user in authentication chain
   20235: Fix for unable to create folders in web project via CIFS. ALF-2736.
   20258: Reverse-merged rev 20254: 'When dropping the mysql database ...'
   20262: Merged V3.3-BUG-FIX to V3.3
      20251: Fix for ALF-2804 - Unable to browse into folders in Share Site in certain situations.
              - Browser history filter object in incorrect state after page refresh.
   20264: Updated Oracle build support (to fix grants)
   20282: Merged PATCHES/V3.2.0 to V3.3
      20266: Test reproduction of ALF-2839 failure: Node pre-loading generates needless resultset rows
      20280: Fixed ALF-2839: Node pre-loading generates needless resultset rows
   20283: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20194: AVMTestSuite - scale down unit tests (slightly)
      20247: AVMServiceTest.testVersionByDate - build (add delay)
   20290: Fixed ALF-2851 "Drag n Drop issues in IE6 & IE7"
      - Reordering rules-list with drag and drop didn't work at all because each rule was created using a template that had the "id"-attribute set, which made IE confused after using HTMLELement.clone() even though the id was resetted
      - Both customise-dashlets & rules-list got an error when "throwing" away the dashlet or rule instead of releasing it "carefully", reason was becuuase IE didnt capture the x:y-position which made the animation fail. Now no animation is done if x:y isn't found.
   20296: Merged PATCHES/V3.1.0 to V3.3 (RECORD ONLY)
      20249: Merged V3.1 to PATCHES/V3.1.0
         14565: Updated version to include revision number (x.y.z)
      20246: Merged V3.1 to PATCHES/V3.1.0
         13841: Build fix
      20245: Merged V3.1 to PATCHES/V3.1.0
         16185: AbstractLuceneIndexerAndSearcherFactory.getTransactionId() must return null when there is no transaction
      20241: Merged V3.1 to PATCHES/V3.1.0
         14187: Fix for ETHREEOH-2023: LDAP import must lower case the local name of the association to person.
         16167: ETHREEOH-2475: Fixed nested transaction handling in AbstractLuceneIndexerAndSearcherFactory to allow duplicate user processing in PersonServiceImpl to actually work
         16168: ETHREEOH-2797: Force patch.db-V2.2-Person to apply one more time to fix up corrupt users created by LDAP Import
            - Problem due to ETHREEOH-2023, fixed in 3.1.1
            - Also corrects ldap.synchronisation.defaultHomeFolderProvider to be userHomesHomeFolderProvider
            - Also requires fix to ETHREEOH-2475 to fix up duplicate users
      20221:Merged PATCHES/V3.1.2 to PATCHES/V3.1.0
         20217: Merged PATCHES/V3.2.0 to PATCHES/V3.1.2
            19793: Merged HEAD to V3.2.0
               19786: Refactor of previous test fix. I have pushed down the OOo-specific parts of the change from AbstractContentTransformerTest to OpenOfficeContentTransformerTest leaving an extension point in the base class should other transformations need to be excluded in the future.
               19785: Fix for failing test OpenOfficeContentTransformerTest.testAllConversions.
                  Various OOo-related transformations are returned as available but fail on our test server with OOo on it.
                  Pending further work on these failings, I am disabling those transformations in test code whilst leaving them available in the product code. This is because in the wild a different OOo version may succeed with these transformations.
                  I had previously explicitly disabled 3 transformations in the product and I am moving that restriction from product to test code for the same reason.
               19707: Return value from isTransformationBlocked was inverted. Fixed now.
               19705: Refinement of previous check-in re OOo transformations.
                  I have pulled up the code that handles blocked transformations into a superclass so that the JodConverter-based transformer worker can inherit the same list of blocked transformations. To reiterate, blocked transformations are those that the OOo integration code believes should work but which are broken in practice. These are blocked by the transformers and will always be unavailable regardless of the OOo connection state.
               19702: Fix for HEAD builds running on panda build server.
                  OOo was recently installed on panda which has activated various OOo-related transformations/extractions in the test code.
                  It appears that OOo does not support some transformations from Office 97 to Office 2007. Specifically doc to docx and xls to xlsx. These transformations have now been marked as unavailable.
      20220: Created hotfix branch off TAGS/ENTERPRISE/V3.1.0
   20297: Merged PATCHES/V3.1.2 to V3.3 (RECORD ONLY)
      20268: Increment version number
      20267: ALF-550: Merged V3.2 to PATCHES/V3.1.2
         17768: Merged DEV/BELARUS/V3.2-2009_11_24 to V3.2
            17758: ETHREEOH-3757: Oracle upgrade issue: failed "inviteEmailTemplate" patch - also causes subsequent patches to not be applied
      20217: Merged PATCHES/V3.2.0 to PATCHES/V3.1.2
         19793: Merged HEAD to V3.2.0
            19786: Refactor of previous test fix. I have pushed down the OOo-specific parts of the change from AbstractContentTransformerTest to OpenOfficeContentTransformerTest leaving an extension point in the base class should other transformations need to be excluded in the future.
            19785: Fix for failing test OpenOfficeContentTransformerTest.testAllConversions.
               Various OOo-related transformations are returned as available but fail on our test server with OOo on it.
               Pending further work on these failings, I am disabling those transformations in test code whilst leaving them available in the product code. This is because in the wild a different OOo version may succeed with these transformations.
               I had previously explicitly disabled 3 transformations in the product and I am moving that restriction from product to test code for the same reason.
            19707: Return value from isTransformationBlocked was inverted. Fixed now.
            19705: Refinement of previous check-in re OOo transformations.
               I have pulled up the code that handles blocked transformations into a superclass so that the JodConverter-based transformer worker can inherit the same list of blocked transformations. To reiterate, blocked transformations are those that the OOo integration code believes should work but which are broken in practice. These are blocked by the transformers and will always be unavailable regardless of the OOo connection state.
            19702: Fix for HEAD builds running on panda build server.
               OOo was recently installed on panda which has activated various OOo-related transformations/extractions in the test code.
               It appears that OOo does not support some transformations from Office 97 to Office 2007. Specifically doc to docx and xls to xlsx. These transformations have now been marked as unavailable.
      20204: Moved version label to '.6'
   20298: Merged PATCHES/V3.2.0 to V3.3 (RECORD ONLY)
      20281: Incremented version number to '10'
      20272: Backports to help fix ALF-2839: Node pre-loading generates needless resultset rows
         Merged BRANCHES/V3.2 to PATCHES/V3.2.0:
            18490: Added cache for alf_content_data
         Merged BRANCHES/DEV/V3.3-BUG-FIX to PATCHES/V3.2.0:
            20231: Fixed ALF-2784: Degradation of performance between 3.1.1 and 3.2x (observed in JSF)
   20299: Merged PATCHES/V3.2.1 to V3.3 (RECORD ONLY)
      20279: Incremented version label
      20211: Reinstated patch 'patch.convertContentUrls' (reversed rev 20205 ALF-2719)
      20210: Incremented version label to '.3'
      20206: Bumped version label to '.2'
      20205: Workaround for ALF-2719 by disabling patch.convertContentUrls and ContentStoreCleaner
      20149: Incremented version label
      20101: Created hotfix branch off ENTERPRISE/V3.2.1
   20300: Merged BRANCHES/DEV/BELARUS/HEAD-2010_04_28 to BRANCHES/V3.3:
      20293: ALF-767: remove-AVM-issuer.sql upgrade does not account for column (mis-)order - fixed for MySQL, PostgreSQL and Oracle (DB2 & MS SQL Server already OK)
   20301: Merged PATCHES/V3.2.1 to V3.3
      20278: ALF-206: Make it possible to follow hyperlinks to document JSF client URLs from MS Office
         - A request parameter rather than a (potentially forgotten) session attribute is used to propagate the URL to redirect to after successful login
   20303: Fixed ALF-2855: FixAuthorityCrcValuesPatch reports NPE during upgrade from 2.1.7 to 3.3E
      - Auto-unbox NPE on Long->long: Just used the Long directly for reporting
   20319: Fixed ALF-2854: User Usage Queries use read-write methods on QNameDAO
   20322: Fixed ALF-1998: contentStoreCleanerJob leads to foreign key exception
      - Possible concurrent modification of alf_content_url.orphan_time led to false orphan detection
      - Fixed queries to check for dereferencing AND use the indexed orphan_time column
      - More robust use of EagerContentStoreCleaner: On eager cleanup, ensure that URLs are deleted
      - Added optimistic lock checks on updates and deletes of alf_content_url
   20335: Merged DEV/V3.3-BUG-FIX to V3.3
      20334: ALF-2473: Changes for clean startup and shutdown of subsystems on Spring 3
         - Removed previous SafeEventPublisher workaround for startup errors and associated changes
         - Replaced with SafeApplicationEventMulticaster which queues up events while an application context isn't started
         - Now all subsystems shut down cleanly
         - Fixes problem with FileContentStore visibility in JMX too!
   20341: ALF-2517 Quick fix which means rules which compare the creation/modification date of content should now correctly be applied when content is uploaded to a folder.
   20346: ALF-2839: Node pre-loading generates needless resultset rows
      - Added missing Criteria.list() call
   20347: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20231: Fixed ALF-2784: Degradation of performance between 3.1.1 and 3.2x (observed in JSF)
   20356: Merged DEV/BELARUS/HEAD-2010_03_30 to V3.3 (with corrections)
      19735: ALF-686: Alfresco cannot start if read/write mode in Sysadmin subsystem is configured
         1. org.alfresco.repo.module.ModuleComponentHelper was modified to allow “System” user run write operations in read-only system.
         2. Startup of “Synchronization” subsystem failed with the same error as was occurred in issue during modules start. org.alfresco.repo.security.sync.ChainingUserRegistrySynchronizer was also modified to allow “System” user run write operations in read-only mode.
   20361: Merged HEAD to BRANCHES/V3.3: (RECORD ONLY)
      20345: Fix ALF-2319: CMIS 'current' version mapping is not compliant with spec
      20354: Update test to reflect changes to CMIS version mapping.
   20363: Merge from V3.2 to V3.2 (all record-only)
      c. 19448 OOoJodConverter worker bean correctly handles isAvailable() when subsystem is disabled.
      c. 19484 JodConverter-backed thumbnailing test now explicitly sets OOoDirect and OOoJodconverter enabled-ness back to default settings in tearDown
      c. 20175 Fix for ALF-2773 JMX configuration of enterprise logging broken
   20376: Altered URL of online help to point at http://www.alfresco.com/help/33/enterprise/webeditor/
   20395: set google docs off
   20398: Fixed ALF-2890: Upgrade removes content if transaction retries are triggered
      - Setting ContentData that was derived outside of the current transaction opened up a window
        for the post-rollback code to delete the underlying binary. The binaries are only registered
        for writers fetched via the ContentService now; the low-level DAO no longer does management
        because it can't assume that a new content URL indicates a new underlying binary.
      - The contentUrlConverter was creating new URLs and thus the low-level DAO cleaned up
        live content when retrying collisions took place. The cleanup is no longer on the stack
        for the patch.
      - Removes the ALF-558 changes around ContentData.reference()
   20399: Remove googledocs aspect option
   20400: PurgeTestP (AVM) - increase wait cycles
   20422: Added ooo converter properties
   20425: Merge V3.3-BUG-FIX to V3.3
      20392 : ALF-2716 - imap mail metadata extraction fails when alfresco server locale is non English
      20365 : Merge DEV to V3.3-BUG_FIX     
         18011 : ETHREEOH-3804 - IMAP message body doesn't appears in IMAP folder when message subject is equal to the attachment name
      20332 : Build fix - rework to the ImapServiceUnit tests.
      20325 : build fix
      20318 : MERGE DEV TO V3.3-BUG-FIX    
         20287 : ALF-2754: Alfresco IMAP and Zimbra Desktop Client.
      20317 : ALF-2716 - imap mail metadata extraction fails when alfresco server locale is non English   This change reworks the received date metadata extraction.
      20316 : ALF-1912 : Problem with IMAP Sites visibility   Now only IMAP favouries are shown.   Also major rework to the way that this service uses the FileFolderService.
      20315 : ALF-1912 Updates to the FileFolderService to support the Imap Service    - add listDeepFolders    - remove "makeFolders" which moves to its own Utility class.    - update to JavaDoc
   20429: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20171: 3.3SP1 bug fix branch
      20174: Fix for ALF-960 and ALFCOM-1980: WCM - File Picker Restriction relative to folder not web project
      20179: ALF-2629 Now when a workflow timer signals a transition it also ends the associated task.
   20433: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20184: ALF-2772: Added new test case to RepoTransferReceiverImplTest and fixed the fault in the primary manifest processor.
      20196: Temporary fix to SandboxServiceImplTest, which reverses the fix to ALF-2529.
   20434: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3: (RECORD ONLY)
      20213: (RECORD ONLY) Merge from V3.3 to V3.3-BUG-FIX
         r20176 Merge from V3.2 to V3.3.
             r20175. JMX configuration of enterprise logging broken (fix).
      20215: (RECORD ONLY) Merge from V3.3 to V3.3-BUG-FIX
         r20178 JodConverter loggers are now exposed in JMX.
      20218: (RECORD ONLY) Merged BRANCHES/V3.3 to BRANCHES/DEV/V3.3-BUG-FIX:
         20195: Form fields for numbers are now rendered much smaller that ...
      20248: (RECORD ONLY) Merging HEAD into V3.3
      20284: (RECORD ONLY) Merged BRANCHES/V3.3 to BRANCHES/DEV/V3.3-BUG-FIX:
         20177: Add 'MaxPermSize' setting for DOD JUnit tests
      20305: (RECORD ONLY) Merged BRANCHES/V3.3 to BRANCHES/DEV/V3.3-BUG-FIX:
         20236: Add Oracle support for creating/dropping "databases" (users) in continuous.xml
         20264: Updated Oracle build support (to fix grants)
   20435: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20233: Part fix for ALF-2811: DOD5015 module breaks CMIS tck
      20239: Final part of fix for ALF-2811: DOD5015 module breaks CMIS tck
      20250: Merge from DEV/BELARUS/HEAD-2010_04_28 to V3.3-BUG-FIX
         20230 ALF-2450: latin/utf-8 HTML file cannot be text-extracted.
      20253: ALF-2629 Now tasks should correctly be ended when an associated timer is triggered. Should no longer cause WCM workflows to fail.
      20254: ALF-2579 Changed teh status code on incorrect password to '401' to reflect that it is an authorisation error.
      20263: Fix for ALF-2500: query with a ! in contains search make it strange
      20265: Fix for ALF-1495. Reindexing of OOo-transformed content after OOo crash.
   20436: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20292: (RECORD ONLY) Latest SpringSurf libs:
      20308: (RECORD ONLY) Latest SpringSurf libs:
      20366: (RECORD ONLY) Latest SpringSurf libs:
      20415: Latest SpringSurf libs:
   20437: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20270: Build times: SearchTestSuite
      20273: Fix for ALF-2125 - Accessing a deleted page in Share does not return an error page, instead the document-details page breaks
      20274: Fix for ALF-2518: It's impossible to find user by user name in Add User or Group window at Manage permissions page (also allows users to be found by username in the Share Admin Console).
      20277: Fix for ALF-2417: Create Web Content Wizard if cancelling/aborting Step Two - Author Web Content, any asset being uploaded gets locked
      20291: Reduce build time: Added security test suite to cover 17 security tests 
   20439: Merged BRANCHES/DEV/V3.3-BUG-FIX to BRANCHES/V3.3:
      20302: Fixed ALF-727:  Oracle iBatis fails on PropertyValueDAOTest Double.MAX_VALUE
      20307: VersionStore - minor fixes if running deprecated V1 
      20310: Fixed a bug in UIContentSelector which was building lucene search queries incorrectly.
      20314: Fix for ALF-2789 - DispatcherServlet not correctly retrieving Object ID from request parameters
      20320: Merged DEV/TEMPORARY to V3.3-BUG-FIX
         20313: ALF-2507: Not able to email space users even if the user owns the space 
      20324: Fixed ALF-2078 "Content doesn't make checked in after applying 'Check-in' rule in Share"
      20327: Fix Quickr project to compile in Eclipse
      20367: ALF-2829: Avoid reading entire result set into memory in FixNameCrcValuesPatch
      20368: Work-around for ALF-2366: patch.updateDmPermissions takes too long to complete
      20369: Part 1 of fix for ALF-2943: Update incorrect mimetypes (Excel and Powerpoint)
      20370: Version Migrator (ALF-1000) - use common batch processor to enable multiple workers
      20373: Version Migrator (ALF-1000) - resolve runtime conflict (w/ r20334)
      20378: Merged BRANCHES/DEV/BELARUS/HEAD-2010_04_28 to BRANCHES/DEV/V3.3-BUG-FIX:
         20312: ALF-2162: Error processing WCM form: XFormsBindingException: property 'constraint' already present at model item
      20381: Fixed ALF-2943: Update incorrect mimetypes (Excel and Powerpoint)


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20571 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-06-09 14:01:07 +00:00

1030 lines
40 KiB
Java

/*
* Copyright (C) 2005-2010 Alfresco Software Limited.
*
* This file is part of Alfresco
*
* Alfresco is free software: you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Alfresco is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with Alfresco. If not, see <http://www.gnu.org/licenses/>.
*/
package org.alfresco.repo.content.metadata;
import java.io.InputStream;
import java.io.Serializable;
import java.lang.reflect.Array;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Date;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.StringTokenizer;
import org.alfresco.error.AlfrescoRuntimeException;
import org.alfresco.service.cmr.dictionary.DataTypeDefinition;
import org.alfresco.service.cmr.dictionary.DictionaryService;
import org.alfresco.service.cmr.dictionary.PropertyDefinition;
import org.alfresco.service.cmr.repository.ContentReader;
import org.alfresco.service.cmr.repository.MimetypeService;
import org.alfresco.service.cmr.repository.datatype.DefaultTypeConverter;
import org.alfresco.service.cmr.repository.datatype.TypeConversionException;
import org.alfresco.service.namespace.InvalidQNameException;
import org.alfresco.service.namespace.QName;
import org.springframework.extensions.surf.util.ISO8601DateFormat;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
/**
* Support class for metadata extracters that support dynamic and config-driven
* mapping between extracted values and model properties. Extraction is broken
* up into two phases:
* <ul>
* <li>Extract ALL available metadata from the document.</li>
* <li>Translate the metadata into system properties.</li>
* </ul>
* <p>
* Migrating an existing extracter to use this class is straightforward:
* <ul>
* <li>
* Construct the extracter providing a default set of supported mimetypes to this
* implementation. This can be overwritten with configurations.
* </li>
* <li>
* Implement the {@link extractInternal} method. This now returns a raw map of extracted
* values keyed by document-specific property names. The <b>trimPut</b> method has
* been replaced with an equivalent {@link #putRawValue(String, Serializable, Map)}.
* </li>
* <li>
* Provide the default mapping of the document-specific properties to system-specific
* properties as describe by the {@link #getDefaultMapping()} method. The simplest
* is to provide the default mapping in a correlated <i>.properties</i> file.
* </li>
* <li>
* Document, in the class-level javadoc, all the available properties that are extracted
* along with their approximate meanings. Add to this, the default mappings.
* </li>
* </ul>
*
* @see #getDefaultMapping()
* @see #extractRaw(ContentReader)
* @see #setMapping(Map)
*
* @since 2.1
*
* @author Jesper Steen Møller
* @author Derek Hulley
*/
abstract public class AbstractMappingMetadataExtracter implements MetadataExtracter
{
public static final String NAMESPACE_PROPERTY_PREFIX = "namespace.prefix.";
private static final String ERR_TYPE_CONVERSION = "metadata.extraction.err.type_conversion";
protected static Log logger = LogFactory.getLog(AbstractMappingMetadataExtracter.class);
private MetadataExtracterRegistry registry;
private MimetypeService mimetypeService;
private DictionaryService dictionaryService;
private boolean initialized;
private Set<String> supportedMimetypes;
private OverwritePolicy overwritePolicy;
private boolean failOnTypeConversion;
private Set<DateFormat> supportedDateFormats = new HashSet<DateFormat>(0);
private Map<String, Set<QName>> mapping;
private boolean inheritDefaultMapping;
/**
* Default constructor. If this is called, then {@link #isSupported(String)} should
* be implemented. This is useful when the list of supported mimetypes is not known
* when the instance is constructed. Alternatively, once the set becomes known, call
* {@link #setSupportedMimetypes(Collection)}.
*
* @see #isSupported(String)
* @see #setSupportedMimetypes(Collection)
*/
protected AbstractMappingMetadataExtracter()
{
this(Collections.<String>emptySet());
}
/**
* Constructor that can be used when the list of supported mimetypes is known up front.
*
* @param supportedMimetypes the set of mimetypes supported by default
*/
protected AbstractMappingMetadataExtracter(Set<String> supportedMimetypes)
{
this.supportedMimetypes = supportedMimetypes;
// Set defaults
overwritePolicy = OverwritePolicy.PRAGMATIC;
failOnTypeConversion = true;
mapping = null; // The default will be fetched
inheritDefaultMapping = false; // Any overrides are complete
initialized = false;
}
/**
* Set the registry to register with. If this is not set, then the default
* initialization will not auto-register the extracter for general use. It
* can still be used directly.
*
* @param registry a metadata extracter registry
*/
public void setRegistry(MetadataExtracterRegistry registry)
{
this.registry = registry;
}
/**
* @param mimetypeService the mimetype service. Set this if required.
*/
public void setMimetypeService(MimetypeService mimetypeService)
{
this.mimetypeService = mimetypeService;
}
/**
* @return Returns the mimetype helper
*/
protected MimetypeService getMimetypeService()
{
return mimetypeService;
}
/**
* @param dictionaryService the dictionary service to determine which data conversions are necessary
*/
public void setDictionaryService(DictionaryService dictionaryService)
{
this.dictionaryService = dictionaryService;
}
/**
* Set the mimetypes that are supported by the extracter.
*
* @param supportedMimetypes
*/
public void setSupportedMimetypes(Collection<String> supportedMimetypes)
{
this.supportedMimetypes.clear();
this.supportedMimetypes.addAll(supportedMimetypes);
}
/**
* {@inheritDoc}
*
* @see #setSupportedMimetypes(Collection)
*/
public boolean isSupported(String sourceMimetype)
{
return supportedMimetypes.contains(sourceMimetype);
}
/**
* @return Returns <code>1.0</code> if the mimetype is supported, otherwise <tt>0.0</tt>
*
* @see #isSupported(String)
*/
public double getReliability(String mimetype)
{
return isSupported(mimetype) ? 1.0D : 0.0D;
}
/**
* Set the policy to use when existing values are encountered. Depending on how the extracer
* is called, this may not be relevant, i.e an empty map of existing properties may be passed
* in by the client code, which may follow its own overwrite strategy.
*
* @param overwritePolicy the policy to apply when there are existing system properties
*/
public void setOverwritePolicy(OverwritePolicy overwritePolicy)
{
this.overwritePolicy = overwritePolicy;
}
/**
* Set the policy to use when existing values are encountered. Depending on how the extracer
* is called, this may not be relevant, i.e an empty map of existing properties may be passed
* in by the client code, which may follow its own overwrite strategy.
*
* @param overwritePolicyStr the policy to apply when there are existing system properties
*/
public void setOverwritePolicy(String overwritePolicyStr)
{
this.overwritePolicy = OverwritePolicy.valueOf(overwritePolicyStr);
}
/**
* Set whether the extractor should discard metadata that fails to convert to the target type
* defined in the data dictionary model. This is <tt>true</tt> by default i.e. if the data
* extracted is not compatible with the target model then the extraction will fail. If this is
* <tt>false<tt> then any extracted data that fails to convert will be discarded.
*
* @param failOnTypeConversion <tt>false</tt> to discard properties that can't get converted
* to the dictionary-defined type, or <tt>true</tt> (default)
* to fail the extraction if the type doesn't convert
*/
public void setFailOnTypeConversion(boolean failOnTypeConversion)
{
this.failOnTypeConversion = failOnTypeConversion;
}
/**
* Set the date formats, over and above the {@link ISO8601DateFormat ISO8601 format}, that will
* be supported for string to date conversions. The supported syntax is described by the
* {@link http://java.sun.com/j2se/1.5.0/docs/api/java/text/SimpleDateFormat.html SimpleDateFormat Javadocs}.
*
* @param supportedDateFormats a list of supported date formats.
*/
public void setSupportedDateFormats(List<String> supportedDateFormats)
{
this.supportedDateFormats = new HashSet<DateFormat>(5);
for (String dateFormatStr : supportedDateFormats)
{
try
{
/**
* Regional date format
*/
DateFormat df = new SimpleDateFormat(dateFormatStr);
this.supportedDateFormats.add(df);
/**
* Date format can be locale specific - make sure English format always works
*/
/*
* TODO MER 25 May 2010 - Added this as a quick fix for IMAP date parsing which is always
* English regardless of Locale. Some more thought and/or code is required to configure
* the relationship between properties, format and locale.
*/
DateFormat englishFormat = new SimpleDateFormat(dateFormatStr, Locale.US);
this.supportedDateFormats.add(englishFormat);
}
catch (Throwable e)
{
// No good
throw new AlfrescoRuntimeException("Unable to set supported date format: " + dateFormatStr, e);
}
}
}
/**
* Set if the property mappings augment or override the mapping generically provided by the
* extracter implementation. The default is <tt>false</tt>, i.e. any mapping set completely
* replaces the {@link #getDefaultMapping() default mappings}.
*
* @param inheritDefaultMapping <tt>true</tt> to add the configured mapping
* to the list of default mappings.
*
* @see #getDefaultMapping()
* @see #setMapping(Map)
* @see #setMappingProperties(Properties)
*/
public void setInheritDefaultMapping(boolean inheritDefaultMapping)
{
this.inheritDefaultMapping = inheritDefaultMapping;
}
/**
* Set the mapping from document metadata to system metadata. It is possible to direct
* an extracted document property to several system properties. The conversion between
* the document property types and the system property types will be done by the
* {@link org.alfresco.service.cmr.repository.datatype.DefaultTypeConverter default converter}.
*
* @param mapping a mapping from document metadata to system metadata
*/
public void setMapping(Map<String, Set<QName>> mapping)
{
this.mapping = mapping;
}
/**
* Set the properties that contain the mapping from document metadata to system metadata.
* This is an alternative to the {@link #setMapping(Map)} method. Any mappings already
* present will be cleared out.
*
* The property mapping is of the form:
* <pre>
* # Namespaces prefixes
* namespace.prefix.cm=http://www.alfresco.org/model/content/1.0
* namespace.prefix.my=http://www....com/alfresco/1.0
*
* # Mapping
* editor=cm:author, my:editor
* title=cm:title
* user1=cm:summary
* user2=cm:description
* </pre>
* The mapping can therefore be from a single document property onto several system properties.
*
* @param mappingProperties the properties that map document properties to system properties
*/
public void setMappingProperties(Properties mappingProperties)
{
mapping = readMappingProperties(mappingProperties);
}
/**
* Helper method for derived classes to obtain the mappings that will be applied to raw
* values. This should be called after initialization in order to guarantee the complete
* map is given.
* <p>
* Normally, the list of properties that can be extracted from a document is fixed and
* well-known - in that case, just extract everything. But Some implementations may have
* an extra, indeterminate set of values available for extraction. If the extraction of
* these runtime parameters is expensive, then the keys provided by the return value can
* be used to extract values from the documents. The metadata extraction becomes fully
* configuration-driven, i.e. declaring further mappings will result in more values being
* extracted from the documents.
* <p>
* Most extractors will not be using this method. For an example of its use, see the
* {@linkplain OpenDocumentMetadataExtracter OpenDocument extractor}, which uses the mapping
* to select specific user properties from a document.
*/
protected final Map<String, Set<QName>> getMapping()
{
if (!initialized)
{
throw new UnsupportedOperationException("The complete mapping is only available after initialization.");
}
return Collections.unmodifiableMap(mapping);
}
/**
* A utility method to read mapping properties from a resource file and convert to the map form.
*
* @param propertiesUrl A standard Properties file URL location
*
* @see #setMappingProperties(Properties)
*/
protected Map<String, Set<QName>> readMappingProperties(String propertiesUrl)
{
InputStream is = null;
try
{
is = getClass().getClassLoader().getResourceAsStream(propertiesUrl);
if(is == null)
{
throw new AlfrescoRuntimeException(
"Metadata Extracter mapping properties not found: \n" +
" Extracter: " + this + "\n" +
" Bundle: " + propertiesUrl);
}
Properties props = new Properties();
props.load(is);
// Process it
Map<String, Set<QName>> map = readMappingProperties(props);
// Done
if (logger.isDebugEnabled())
{
logger.debug("Loaded mapping properties from resource: " + propertiesUrl);
}
return map;
}
catch (Throwable e)
{
throw new AlfrescoRuntimeException(
"Unable to load properties file to read extracter mapping properties: \n" +
" Extracter: " + this + "\n" +
" Bundle: " + propertiesUrl,
e);
}
finally
{
if (is != null)
{
try { is.close(); } catch (Throwable e) {}
}
}
}
/**
* A utility method to convert mapping properties to the Map form.
*
* @see #setMappingProperties(Properties)
*/
@SuppressWarnings("unchecked")
protected Map<String, Set<QName>> readMappingProperties(Properties mappingProperties)
{
Map<String, String> namespacesByPrefix = new HashMap<String, String>(5);
// Get the namespaces
for (Map.Entry entry : mappingProperties.entrySet())
{
String propertyName = (String) entry.getKey();
if (propertyName.startsWith(NAMESPACE_PROPERTY_PREFIX))
{
String prefix = propertyName.substring(17);
String namespace = (String) entry.getValue();
namespacesByPrefix.put(prefix, namespace);
}
}
// Create the mapping
Map<String, Set<QName>> convertedMapping = new HashMap<String, Set<QName>>(17);
for (Map.Entry entry : mappingProperties.entrySet())
{
String documentProperty = (String) entry.getKey();
String qnamesStr = (String) entry.getValue();
if (documentProperty.startsWith(NAMESPACE_PROPERTY_PREFIX))
{
// Ignore these now
continue;
}
// Create the entry
Set<QName> qnames = new HashSet<QName>(3);
convertedMapping.put(documentProperty, qnames);
// The to value can be a list of QNames
StringTokenizer tokenizer = new StringTokenizer(qnamesStr, ",");
while (tokenizer.hasMoreTokens())
{
String qnameStr = tokenizer.nextToken().trim();
// Check if we need to resolve a namespace reference
int index = qnameStr.indexOf(QName.NAMESPACE_PREFIX);
if (index > -1 && qnameStr.charAt(0) != QName.NAMESPACE_BEGIN)
{
String prefix = qnameStr.substring(0, index);
String suffix = qnameStr.substring(index + 1);
// It is prefixed
String uri = namespacesByPrefix.get(prefix);
if (uri == null)
{
throw new AlfrescoRuntimeException(
"No prefix mapping for extracter property mapping: \n" +
" Extracter: " + this + "\n" +
" Mapping: " + entry);
}
qnameStr = QName.NAMESPACE_BEGIN + uri + QName.NAMESPACE_END + suffix;
}
try
{
QName qname = QName.createQName(qnameStr);
// Add it to the mapping
qnames.add(qname);
}
catch (InvalidQNameException e)
{
throw new AlfrescoRuntimeException(
"Can't create metadata extracter property mapping: \n" +
" Extracter: " + this + "\n" +
" Mapping: " + entry);
}
}
if (logger.isDebugEnabled())
{
logger.debug("Added mapping from " + documentProperty + " to " + qnames);
}
}
// Done
return convertedMapping;
}
/**
* Registers this instance of the extracter with the registry. This will call the
* {@link #init()} method and then register if the registry is available.
*
* @see #setRegistry(MetadataExtracterRegistry)
* @see #init()
*/
public final void register()
{
init();
// Register the extracter, if necessary
if (registry != null)
{
registry.register(this);
}
}
/**
* Provides a hook point for implementations to perform initialization. The base
* implementation must be invoked or the extracter will fail during extraction.
* The {@link #getDefaultMapping() default mappings} will be requested during
* initialization.
*/
protected void init()
{
Map<String, Set<QName>> defaultMapping = getDefaultMapping();
if (defaultMapping == null)
{
throw new AlfrescoRuntimeException("The metadata extracter must provide a default mapping: " + this);
}
// Was a mapping explicitly provided
if (mapping == null)
{
// No mapping, so use the default
mapping = defaultMapping;
}
else if (inheritDefaultMapping)
{
// Merge the default mapping into the configured mapping
for (String documentKey : defaultMapping.keySet())
{
Set<QName> systemQNames = mapping.get(documentKey);
if (systemQNames == null)
{
systemQNames = new HashSet<QName>(3);
mapping.put(documentKey, systemQNames);
}
Set<QName> defaultQNames = defaultMapping.get(documentKey);
systemQNames.addAll(defaultQNames);
}
}
// The configured mappings are empty, but there were default mappings
if (mapping.size() == 0 && defaultMapping.size() > 0)
{
logger.warn(
"There are no property mappings for the metadata extracter.\n" +
" Nothing will be extracted by: " + this);
}
// Done
initialized = true;
}
/** {@inheritDoc} */
public long getExtractionTime()
{
return 1000L;
}
/**
* Checks if the mimetype is supported.
*
* @param reader the reader to check
* @throws AlfrescoRuntimeException if the mimetype is not supported
*/
protected void checkIsSupported(ContentReader reader)
{
String mimetype = reader.getMimetype();
if (!isSupported(mimetype))
{
throw new AlfrescoRuntimeException(
"Metadata extracter does not support mimetype: \n" +
" reader: " + reader + "\n" +
" supported: " + supportedMimetypes + "\n" +
" extracter: " + this);
}
}
/**
* {@inheritDoc}
*/
public final Map<QName, Serializable> extract(ContentReader reader, Map<QName, Serializable> destination)
{
return extract(reader, this.overwritePolicy, destination, this.mapping);
}
/**
* {@inheritDoc}
*/
public final Map<QName, Serializable> extract(
ContentReader reader,
OverwritePolicy overwritePolicy,
Map<QName, Serializable> destination)
{
return extract(reader, overwritePolicy, destination, this.mapping);
}
/**
* {@inheritDoc}
*/
public Map<QName, Serializable> extract(
ContentReader reader,
OverwritePolicy overwritePolicy,
Map<QName, Serializable> destination,
Map<String, Set<QName>> mapping)
{
// Done
if (logger.isDebugEnabled())
{
logger.debug("Starting metadata extraction: \n" +
" reader: " + reader + "\n" +
" extracter: " + this);
}
if (!initialized)
{
throw new AlfrescoRuntimeException(
"Metadata extracter not initialized.\n" +
" Call the 'register' method on: " + this + "\n" +
" Implementations of the 'init' method must call the base implementation.");
}
// check the reliability
checkIsSupported(reader);
Map<QName, Serializable> changedProperties = null;
try
{
Map<String, Serializable> rawMetadata = null;
// Check that the content has some meat
if (reader.getSize() > 0 && reader.exists())
{
rawMetadata = extractRaw(reader);
}
else
{
rawMetadata = new HashMap<String, Serializable>(1);
}
// Convert to system properties (standalone)
Map<QName, Serializable> systemProperties = mapRawToSystem(rawMetadata);
// Convert the properties according to the dictionary types
systemProperties = convertSystemPropertyValues(systemProperties);
// Now use the proper overwrite policy
changedProperties = overwritePolicy.applyProperties(systemProperties, destination);
}
catch (Throwable e)
{
if (logger.isDebugEnabled())
{
logger.debug(
"Metadata extraction failed: \n" +
" Extracter: " + this + "\n" +
" Content: " + reader,
e);
}
else
{
logger.warn(
"Metadata extraction failed (turn on DEBUG for full error): \n" +
" Extracter: " + this + "\n" +
" Content: " + reader + "\n" +
" Failure: " + e.getMessage());
}
}
finally
{
// check that the reader was closed (if used)
if (reader.isChannelOpen())
{
logger.error("Content reader not closed by metadata extracter: \n" +
" reader: " + reader + "\n" +
" extracter: " + this);
}
// Make sure that we have something to return
if (changedProperties == null)
{
changedProperties = new HashMap<QName, Serializable>(0);
}
}
// Done
if (logger.isDebugEnabled())
{
logger.debug("Completed metadata extraction: \n" +
" reader: " + reader + "\n" +
" extracter: " + this + "\n" +
" changed: " + changedProperties);
}
return changedProperties;
}
/**
*
* @param rawMetadata Metadata keyed by document properties
* @return Returns the metadata keyed by the system properties
*/
private Map<QName, Serializable> mapRawToSystem(Map<String, Serializable> rawMetadata)
{
Map<QName, Serializable> systemProperties = new HashMap<QName, Serializable>(rawMetadata.size() * 2 + 1);
for (Map.Entry<String, Serializable> entry : rawMetadata.entrySet())
{
String documentKey = entry.getKey();
// Check if there is a mapping for this
if (!mapping.containsKey(documentKey))
{
// No mapping - ignore
continue;
}
Serializable documentValue = entry.getValue();
Set<QName> systemQNames = mapping.get(documentKey);
for (QName systemQName : systemQNames)
{
systemProperties.put(systemQName, documentValue);
}
}
// Done
if (logger.isDebugEnabled())
{
logger.debug(
"Converted extracted raw values to system values: \n" +
" Raw Properties: " + rawMetadata + "\n" +
" System Properties: " + systemProperties);
}
return systemProperties;
}
/**
* Converts all values according to their dictionary-defined type. This uses the
* {@link #setFailOnTypeConversion(boolean) failOnTypeConversion flag} to determine how failures
* are handled i.e. if values fail to convert, the process may discard the property.
*
* @param systemProperties the values keyed to system property names
* @return Returns a modified map of properties that have been converted.
*/
@SuppressWarnings("unchecked")
private Map<QName, Serializable> convertSystemPropertyValues(Map<QName, Serializable> systemProperties)
{
Map<QName, Serializable> convertedProperties = new HashMap<QName, Serializable>(systemProperties.size() + 7);
for (Map.Entry<QName, Serializable> entry : systemProperties.entrySet())
{
QName propertyQName = entry.getKey();
Serializable propertyValue = entry.getValue();
// Get the property definition
PropertyDefinition propertyDef = (dictionaryService == null) ? null : dictionaryService.getProperty(propertyQName);
if (propertyDef == null)
{
// There is nothing in the DD about this so just transfer it
convertedProperties.put(propertyQName, propertyValue);
continue;
}
// It is in the DD, so attempt the conversion
DataTypeDefinition propertyTypeDef = propertyDef.getDataType();
Serializable convertedPropertyValue = null;
try
{
// Attempt to make any date conversions
if (propertyTypeDef.getName().equals(DataTypeDefinition.DATE) || propertyTypeDef.getName().equals(DataTypeDefinition.DATETIME))
{
if (propertyValue instanceof Date)
{
convertedPropertyValue = propertyValue;
}
else if (propertyValue instanceof Collection)
{
convertedPropertyValue = (Serializable) makeDates((Collection) propertyValue);
}
else if (propertyValue instanceof String)
{
convertedPropertyValue = makeDate((String) propertyValue);
}
else
{
if (logger.isWarnEnabled())
{
StringBuilder mesg = new StringBuilder();
mesg.append("Unable to convert Date property: ").append(propertyQName)
.append(", value: ").append(propertyValue).append(", type: ").append(propertyTypeDef.getName());
logger.warn(mesg.toString());
}
}
}
else
{
if (propertyValue instanceof Collection)
{
convertedPropertyValue = (Serializable) DefaultTypeConverter.INSTANCE.convert(
propertyTypeDef,
(Collection) propertyValue);
}
else if (propertyValue instanceof Object[])
{
convertedPropertyValue = (Serializable) DefaultTypeConverter.INSTANCE.convert(
propertyTypeDef,
(Object[]) propertyValue);
}
else
{
convertedPropertyValue = (Serializable) DefaultTypeConverter.INSTANCE.convert(
propertyTypeDef,
propertyValue);
}
}
convertedProperties.put(propertyQName, convertedPropertyValue);
}
catch (TypeConversionException e)
{
// Do we just absorb this or is it a problem?
if (failOnTypeConversion)
{
throw AlfrescoRuntimeException.create(
e,
ERR_TYPE_CONVERSION,
this,
propertyQName,
propertyTypeDef.getName(),
propertyValue);
}
}
}
// Done
return convertedProperties;
}
/**
* Convert a collection of date <tt>String</tt> to <tt>Date</tt> objects
*/
private Collection<Date> makeDates(Collection<String> dateStrs)
{
List<Date> dates = new ArrayList<Date>(dateStrs.size());
for (String dateStr : dateStrs)
{
Date date = makeDate(dateStr);
dates.add(date);
}
return dates;
}
/**
* Convert a date <tt>String</tt> to a <tt>Date</tt> object
*/
private Date makeDate(String dateStr)
{
Date date = null;
try
{
date = DefaultTypeConverter.INSTANCE.convert(Date.class, dateStr);
}
catch (TypeConversionException e)
{
// Try one of the other formats
for (DateFormat df : this.supportedDateFormats)
{
try
{
date = df.parse(dateStr);
}
catch (ParseException ee)
{
// Didn't work
}
}
if (date == null)
{
// Still no luck
throw new TypeConversionException("Unable to convert string to date: " + dateStr);
}
}
return date;
}
/**
* Adds a value to the map if it is non-trivial. A value is trivial if
* <ul>
* <li>it is null</li>
* <li>it is an empty string value after trimming</li>
* <li>it is an empty collection</li>
* <li>it is an empty array</li>
* </ul>
* String values are trimmed before being put into the map.
* Otherwise, it is up to the extracter to ensure that the value is a <tt>Serializable</tt>.
* It is not appropriate to implicitly convert values in order to make them <tt>Serializable</tt>
* - the best conversion method will depend on the value's specific meaning.
*
* @param key the destination key
* @param value the serializable value
* @param destination the map to put values into
* @return Returns <tt>true</tt> if set, otherwise <tt>false</tt>
*/
@SuppressWarnings("unchecked")
protected boolean putRawValue(String key, Serializable value, Map<String, Serializable> destination)
{
if (value == null)
{
return false;
}
if (value instanceof String)
{
String valueStr = ((String) value).trim();
if (valueStr.length() == 0)
{
return false;
}
else
{
// Keep the trimmed value
value = valueStr;
}
}
else if (value instanceof Collection)
{
Collection valueCollection = (Collection) value;
if (valueCollection.isEmpty())
{
return false;
}
}
else if (value.getClass().isArray())
{
if (Array.getLength(value) == 0)
{
return false;
}
}
// It passed all the tests
destination.put(key, value);
return true;
}
/**
* Helper method to fetch a clean map into which raw values can be dumped.
*
* @return Returns an empty map
*/
protected final Map<String, Serializable> newRawMap()
{
return new HashMap<String, Serializable>(17);
}
/**
* This method provides a <i>best guess</i> of where to store the values extracted
* from the documents. The list of properties mapped by default need <b>not</b>
* include all properties extracted from the document; just the obvious set of mappings
* need be supplied.
* Implementations must either provide the default mapping properties in the expected
* location or override the method to provide the default mapping.
* <p>
* The default implementation looks for the default mapping file in the location
* given by the class name and <i>.properties</i>. If the extracter's class is
* <b>x.y.z.MyExtracter</b> then the default properties will be picked up at
* <b>classpath:/x/y/z/MyExtracter.properties</b>.
* Inner classes are supported, but the '$' in the class name is replaced with '-', so
* default properties for <b>x.y.z.MyStuff$MyExtracter</b> will be located using
* <b>x.y.z.MyStuff-MyExtracter.properties</b>.
* <p>
* The default mapping implementation should include thorough Javadocs so that the
* system administrators can accurately determine how to best enhance or override the
* default mapping.
* <p>
* If the default mapping is declared in a properties file other than the one named after
* the class, then the {@link #readMappingProperties(String)} method can be used to quickly
* generate the return value:
* <pre><code>
* protected Map<<String, Set<QName>> getDefaultMapping()
* {
* return readMappingProperties(DEFAULT_MAPPING);
* }
* </code></pre>
* The map can also be created in code either statically or during the call.
*
* @return Returns the default, static mapping. It may not be null.
*
* @see #setInheritDefaultMapping(boolean inherit)
*/
protected Map<String, Set<QName>> getDefaultMapping()
{
String className = this.getClass().getName();
// Replace $
className = className.replace('$', '-');
// Replace .
className = className.replace('.', '/');
// Append .properties
String propertiesUrl = className + ".properties";
// Attempt to load the properties
return readMappingProperties(propertiesUrl);
}
/**
* Override to provide the raw extracted metadata values. An extracter should extract
* as many of the available properties as is realistically possible. Even if the
* {@link #getDefaultMapping() default mapping} doesn't handle all properties, it is
* possible for each instance of the extracter to be configured differently and more or
* less of the properties may be used in different installations.
* <p>
* Raw values must not be trimmed or removed for any reason. Null values and empty
* strings are
* <ul>
* <li><b>Null:</b> Removed</li>
* <li><b>Empty String:</b> Passed to the {@link OverwritePolicy}</li>
* <li><b>Non Serializable:</b> Converted to String or fails if that is not possible</li>
* </ul>
* <p>
* Properties extracted and their meanings and types should be thoroughly described in
* the class-level javadocs of the extracter implementation, for example:
* <pre>
* <b>editor:</b> - the document editor --> cm:author
* <b>title:</b> - the document title --> cm:title
* <b>user1:</b> - the document summary
* <b>user2:</b> - the document description --> cm:description
* <b>user3:</b> -
* <b>user4:</b> -
* </pre>
*
* @param reader the document to extract the values from. This stream provided by
* the reader must be closed if accessed directly.
* @return Returns a map of document property values keyed by property name.
* @throws All exception conditions can be handled.
*
* @see #getDefaultMapping()
*/
protected abstract Map<String, Serializable> extractRaw(ContentReader reader) throws Throwable;
}