Enable explicit Tika content transform for OOXML files
Allow the Excel transformer (which produces CSV as well as text/html) to handle .xlsx as well as .xls
Also update the .doc parser test to ensure that the older Word 6 and Word 95 files are handled correctly too
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20781 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
Add a new general POI converter for formats other than Excel, and convert the PDF converter too.
The POI Excel converter now produces CSV properly, and the Text mining converter now has notes on the Tika features needed before it can be replaced.
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20780 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
The POI HSSF transformer has been updated to use Tika. A Tika auto-detect
transformer has also been added, which caters for a large number of
previously unhandled cases. Unit tests check this.
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20769 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
The code for extracting .dwg files has been contributed to Apache Tika, and the Alfresco metadata extractor now calls into Tika to have the work done. We retain our own tests of this, however.
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@19927 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
Added a new metadata extractor based on POI for the docx, xlsx and pptx MIME types.
Changed OpenOfficeMetadataExtracter so that it no longer supports these MIME types.
Added the new test code to ContentMinimalContextTestSuite.
Tidied up AbstractMetadataExtracterTest and OpenOfficeMetadataExtracter to reflect that this extractor no longer handles these MIME types.
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@19792 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261