6 Commits

Author SHA1 Message Date
Nick Burch
325f8e7923 Tika content transformer support for OOXML office
Enable explicit Tika content transform for OOXML files
Allow the Excel transformer (which does CSV as well as text/html) to handle .xlsx as well as .xls
Also update the .doc parser test to ensure that the older Word 6 and Word 95 files are correctly handled too


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20781 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-06-23 15:51:03 +00:00
Nick Burch
228d111c56 More Tika content transform updates
Add a new general POI converter for formats other than Excel, and convert the PDF converter too.
The POI Excel converter now does CSV properly, and notes have been added to the text mining converter on the Tika pieces needed before it can be replaced.


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20780 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-06-23 14:27:10 +00:00
Nick Burch
f3a7a0aa7c Initial Tika support for Text content transforms
The POI HSSF transformer has been updated to use Tika. A Tika auto-detect transformer has also been added, which caters for a large number of previously unhandled cases. Unit tests check this.


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@20769 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-06-23 11:40:17 +00:00
Nick Burch
45c757fee8 Add metadata extractor support for .dwg files (ALF-2262)
The code for extracting .dwg files has been contributed to Apache Tika, and the Alfresco metadata extractor now calls into Tika to have the work done. We retain our own tests of this, however.


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@19927 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-04-21 10:17:11 +00:00
Neil McErlean
de612572d9 Proper fix for unreported issue with OOo-based extraction of Office 2007 metadata.
Added a new metadata extractor based on POI for docx, xlsx and pptx mime types.
Changed OpenOfficeMetadataExtracter so that it no longer supports these mime types.
Added the new test code to ContentMinimalContextTestSuite

Some tidying up of code in AbstractMetadataExtracterTest and OpenOfficeMetadataExtracter to reflect the fact that this extractor does not handle these mime types any more.


git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@19792 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-04-09 12:10:06 +00:00
Nick Burch
21b6c8cf10 Tweak the minimal context to hopefully work on the build machine too, and then re-enable tests + combine one suite
git-svn-id: https://svn.alfresco.com/repos/alfresco-enterprise/alfresco/HEAD/root@19122 c4b6b30b-aa2e-2d43-bbcb-ca4b014f7261
2010-03-08 14:23:51 +00:00