HXENG-64 refactor ATS (#657)

Refactor to clean up packages in the t-model and to introduce a simpler-to-implement t-engine base.

The new t-engines (tika, imagemagick, libreoffice, pdfrenderer, misc, aio, aspose) and t-router may be used in combination with older components, as the API between the content Repo and the transform components, and between the components themselves, has not changed. As far as possible the same artifacts are created (the -boot projects no longer exist). They may be used with older ACS repo versions.

The main changes to look for are:
* The introduction of TransformEngine and CustomTransformer interfaces to be implemented.
* The removal in t-engines and t-router of the Controller, Application, test template page, Controller tests and application config, as this is all now done by the t-engine base package.
* The t-router now extends the t-engine base, which also reduced the amount of duplicate code.
* The t-engine base provides the test page, which includes drop downs of known transform options. The t-router is able to use pipeline and failover transformers. This was not possible to do previously as the router had no test UI.
* Resources including licenses are automatically included in the all-in-one t-engine, from the individual t-engines. They just need to be added as dependencies in the pom. 
* The ugly code in the all-in-one t-engine and misc t-engine to pick transformers has gone, as they are now just selected by the transformRegistry.
* The way t-engines respond to http or message queue transform requests has been combined (eliminates the similar but different code that existed before).
* The t-engine base now uses InputStream and OutputStream rather than Files by default. As a result it will be simpler to avoid writing content to a temporary location.
* A number of the Tika and Misc CustomTransforms no longer use Files.
* The original t-engine base still exists so customers can continue to create custom t-engines the way they have done previously. The project has just been moved into a folder called deprecated.
* The folder structure has changed. The long "alfresco-transform-..." names have given way to shorter easier to read and type names.
* The t-engine project structure now has a single project rather than two. 
* The previous config values still exist, but there is now a new set of config values, in files with names that don't misleadingly imply they only contain pipeline or routing information.
* The concept of 'routing' has much less emphasis in class names as the code just uses the transformRegistry. 
* TransformerConfig may now be read as json or yaml. The restrictions about what could be specified in yaml have gone.
* T-engines and t-router may use transform config from files. Previously it was just the t-router.
* The POC code to do with graphs of possible routes has been removed.
* All master branch changes have been merged in.
* The concept of a single transform request which results in multiple responses (e.g. images from a video) has been added to the core processing of requests in the t-engine base.
* Many SonarCloud linter fixes.
Alan Davis
2022-09-14 13:40:19 +01:00
committed by GitHub
parent ea83ef9ebc
commit babe26b0ba
652 changed files with 19479 additions and 18195 deletions


@@ -1,224 +0,0 @@
swagger: '2.0'
info:
description: |
**Alfresco Transform Engines REST API**
Transform Request & Response API to allow a source file to be transformed into a
target file, given a set of transform options.
The new JSON-based Transform Engines API is used by the Alfresco Transform Service (ATS).
ATS provides an independently-scalable transform service, initially used by ACS
Content Repository, as part of the overall Alfresco Digital Business Platform (DBP).
Note: Each kind of Transform Engine implements this Transform Engines API, including:
* ImageMagick
* LibreOffice
* PDF Renderer
* Tika
In the future, this Transform Engines API may form the basis for adding custom Transform Engines.
version: '1'
title: Alfresco Transform Engines REST API
basePath: /alfresco/api/-default-/private/transformer/versions/1
tags:
- name: Transform
description: Transform Engine Request / Response
paths:
'/transform':
post:
x-alfresco-since: "2.0"
tags:
- Transform
summary: Transform Engines API
description: |
**Note:** available with Alfresco Transform Engines 2.0 and newer versions.
This endpoint supports both JSON and Multipart. The JSON API is used within the
Alfresco Transform Service (e.g. ACS 6.1). The Multipart API remains for backwards
compatibility (e.g. ACS 6.0).
**Using JSON (application/json -> application/json)**
The ACS Content Repository 6.1 (or higher) provides the option to offload
supported transformations to the Alfresco Transform Service.
The JSON API is used within the Alfresco Transform Service. It relies on the
source and target files being stored and retrieved via the Alfresco Shared File
Store (see also [alfresco-sfs.yaml](https://github.com/Alfresco/alfresco-shared-file-store/blob/master/docs/api-definitions/alfresco-sfs.yaml)).
Here's a pseudo-example transform request:
```JSON
{
"schema": 1,
"requestId": "0aead31c-e3ca-42c9-8e16-c1938ff64c3a",
"clientData": "opaque-client-specific-data-123xyz",
"sourceReference": "598387b8-d85d-4557-816e-50f44c969e04",
"sourceSize": 32713,
"sourceMediaType": "image/jpeg",
"sourceExtension": "jpeg",
"targetMediaType": "image/png",
"targetExtension": "png",
"transformRequestOptions": {
"resizeWidth": "25",
"resizePercentage": "true",
"maintainAspectRatio": "true"
}
}
```
Here's a pseudo-example response of a successful transform:
```JSON
{
"schema": 1,
"status": 201,
"requestId": "0aead31c-e3ca-42c9-8e16-c1938ff64c3a",
"clientData": "opaque-client-specific-data-123xyz",
"sourceReference": "598387b8-d85d-4557-816e-50f44c969e04",
"targetReference": "5bc81e48-e17a-4727-bd1c-3a279aa6b421"
}
```
Here's a pseudo-example response of a failed transform:
```JSON
{
"schema": 1,
"status": 400,
"errorDetails": "Lorem ipsum dolor sit amet, ...",
"requestId": "0aead31c-e3ca-42c9-8e16-c1938ff64c3a",
"clientData": "opaque-client-specific-data-123xyz",
"sourceReference": "598387b8-d85d-4557-816e-50f44c969e04"
}
```
**Using Multipart (multipart/form-data -> application/octet-stream)**
The Multipart API remains for backwards compatibility (e.g. ACS 6.0). It requires
the source file to be uploaded via multipart/form-data (along with transformation
options). The target file is returned as a binary response (application/octet-stream).
operationId: transformOperation
parameters:
- in: body
name: transformRequest
description: The Transform Request including source reference and transform options
required: true
schema:
$ref: '#/definitions/transformRequest'
consumes:
- application/json
- multipart/form-data
produces:
- application/json
- application/octet-stream
responses:
'201':
description: Successful response
schema:
$ref: '#/definitions/transformReply'
default:
description: Unexpected error
schema:
$ref: '#/definitions/Error'
'/transformer/options':
get:
tags:
- Transform
description: List transform options
operationId: transformOptions
produces:
- application/json
responses:
200:
description: Successful response
schema:
type: array
xml:
name: transformOptions
wrapped: true
items:
$ref: '#/definitions/transformOption'
definitions:
Error:
type: object
required:
- error
properties:
error:
type: object
required:
- statusCode
- briefSummary
- stackTrace
- descriptionURL
properties:
errorKey:
type: string
statusCode:
type: integer
format: int32
briefSummary:
type: string
stackTrace:
type: string
descriptionURL:
type: string
logId:
type: string
transformRequest:
type: object
properties:
requestId:
type: string
sourceReference:
type: string
sourceMediaType:
type: string
sourceSize:
type: integer
format: int64
sourceExtension:
type: string
targetMediaType:
type: string
targetExtension:
type: string
clientData:
type: string
schema:
type: integer
transformRequestOptions:
type: object
additionalProperties:
type: string
transformReply:
type: object
properties:
status:
type: integer
requestId:
type: string
sourceReference:
type: string
targetReference:
type: string
clientData:
type: string
schema:
type: integer
errorDetails:
type: string
transformOption:
type: object
required:
- required
- name
properties:
required:
type: boolean
name:
type: string


@@ -1,168 +0,0 @@
## T-Engine configuration
T-Engines provide a */transform/config* endpoint for clients (e.g. Transform-Router or
Alfresco-Repository) that indicates what is supported. T-Engines store this
configuration as a JSON resource file named *engine_config.json*.
The config can be found under
`alfresco-transform-core\<t-engine-name>\src\main\resources\engine_config.json`;
current configuration files are:
* [Pdf-Renderer T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/alfresco-docker-alfresco-pdf-renderer/src/main/resources/engine_config.json).
* [ImageMagick T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/alfresco-docker-imagemagick/src/main/resources/engine_config.json).
* [Libreoffice T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/alfresco-docker-libreoffice/src/main/resources/engine_config.json).
* [Tika T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/alfresco-docker-tika/src/main/resources/engine_config.json).
* [Misc T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/alfresco-docker-transform-misc/src/main/resources/engine_config.json).
*Snippet from Tika T-engine configuration:*
```json
{
"transformOptions": {
"tikaOptions": [
{"value": {"name": "targetEncoding"}}
],
"pdfboxOptions": [
{"value": {"name": "notExtractBookmarksText"}},
{"value": {"name": "targetEncoding"}}
]
},
"transformers": [
{
"transformerName": "PdfBox",
"supportedSourceAndTargetList": [
{"sourceMediaType": "application/pdf", "targetMediaType": "text/html"},
{"sourceMediaType": "application/pdf", "maxSourceSizeBytes": 26214400, "targetMediaType": "text/plain"}
],
"transformOptions": [
"pdfboxOptions"
]
},
{
"transformerName": "TikaAuto",
"supportedSourceAndTargetList": [
{"sourceMediaType": "application/msword", "priority": 55, "targetMediaType": "text/xml"}
],
"transformOptions": [
"tikaOptions"
]
},
{
"transformerName": "TextMining",
"supportedSourceAndTargetList": [
{"sourceMediaType": "application/msword", "targetMediaType": "text/xml"}
],
"transformOptions": [
"tikaOptions"
]
}
]
}
```
### Transform Options
* **transformOptions** provides a list of transform options that may be
referenced for use in different transformers. This way common options
don't need to be repeated for each transformer and can be shared between
T-Engines. In this example there are two groups of options called **tikaOptions**
and **pdfboxOptions**, which between them contain the options **targetEncoding** and
**notExtractBookmarksText**. Unless an option has a **"required": true** field it is
considered to be optional. You don't need to specify *sourceMimetype*,
*targetMimetype*, *sourceExtension* or *targetExtension* as options as
these are automatically added.
*Snippet from ImageMagick T-engine configuration:*
```json
"transformOptions": {
"imageMagickOptions": [
{"value": {"name": "alphaRemove"}},
{"group": {"transformOptions": [
{"value": {"name": "cropGravity"}},
{"value": {"name": "cropWidth"}},
{"value": {"name": "cropHeight"}},
{"value": {"name": "cropPercentage"}},
{"value": {"name": "cropXOffset"}},
{"value": {"name": "cropYOffset"}}
]}}
]
},
```
* There are two types of transformOptions, *transformOptionsValue* and *transformOptionsGroup*:
* _transformOptionsValue_ represents a single transformation option. It is defined
by a **name** and an optional **required** field.
* _transformOptionsGroup_ represents a group of one or more options that together define a
characteristic. In the above snippet all the options for crop are defined under a group. This
approach is recommended as it is easier to read. A transformOptionsGroup can contain one or more
transformOptionsValue and transformOptionsGroup elements.
**Limitations**:
* For a transformOptions to be referenced in a different T-engine, another transformer
with the complete definition of the transformOptions needs to return the config to the client.
* In a transformOptions definition it is not allowed to use a reference to another transformOption.
### Transformers
* **transformers** - A list of transformer definitions.
Each transformer definition should have a unique **transformerName**,
specify a **supportedSourceAndTargetList** and indicate which
options it supports. As it is shown in the Tika snippet, an *engine_config*
can describe one or more transformers, as a T-engine can have
multiple transformers (e.g. Tika, Misc). A transformer configuration may
specify references to 0 or more transformOptions.
### Supported Source and Target List
* **supportedSourceAndTargetList** is simply a list of source and target
Media Types that may be transformed, optionally specifying a
**maxSourceSizeBytes** and a **priority** value.
* *maxSourceSizeBytes* is used to set the upper size limit of a transformation.
* If not specified, the default value for maxSourceSizeBytes is **unlimited**.
* *priority* is used by clients to determine which transformer to call, or by T-engines
with multiple transformers to determine which one to use. In the above Tika snippet,
both *TikaAuto* and *TextMining* can transform *"application/msword"*
into *"text/xml"*. The transformer whose source-target media type entry has the higher priority is chosen by the
T-engine to execute the transformation. In this case it will be *TextMining*, because:
* If not specified, the default value for priority is **50**, whereas *TikaAuto* specifies **55**.
* Note: priority values are like the order in a queue, the **lower** the number the **higher the
priority** is.
## Transformer selection strategy
The ACS repository will use the T-Engine configuration to choose which T-Engine will perform a transform.
A transformer definition contains a supported list of source and target Media Types. This is used for the
most basic selection. This is further refined by checking that the definition also supports transform options
(parameters) that have been supplied in a transform request or a Rendition Definition used in a rendition request.
Order for selection is:
1. Source->Target Media Types
2. transformOptions
3. maxSourceSizeBytes
4. priority
#### Case 1:
```
Transformer 1 defines options: Op1, Op2
Transformer 2 defines options: Op1, Op2, Op3, Op4
```
```
Rendition provides values for options: Op2, Op3
```
If we assume both transformers support the required source and target Media Types, Transformer 2 will be selected
because it knows about all the supplied options. The definition may also specify that some options are required or grouped.
#### Case 2:
```
Transformer 1 defines options: Op1, Op2, maxSize
Transformer 2 defines options: Op1, Op2, Op3
```
```
Rendition provides values for options: Op1, Op2
```
If we assume both transformers support the required source and target Media Types, and the file size is greater than *maxSize*,
Transformer 2 will be selected because it can handle the source size (it has no *maxSourceSizeBytes* limit) for this transformation.
#### Case 3:
```
Transformer 1 defines options: Op1, Op2, priority1
Transformer 2 defines options: Op1, Op2, Op3, priority2
```
```
Rendition provides values for options: Op1, Op2
```
If we assume both transformers support the required source and target Media Types and
*priority1* < *priority2*, Transformer 1 will be selected because its priority is higher.
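
The selection order and the three cases above can be sketched in a few lines of Java. This is only an illustration of the described order (media types, then options, then size, then priority), not the repository's actual code. The `Candidate` class, the `select` method and the use of `-1` for "no limit" are assumptions made for this sketch.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class TransformerSelectionSketch {

    /** Hypothetical, flattened view of one supported source -> target entry. */
    static final class Candidate {
        final String transformerName;
        final String sourceMediaType;
        final String targetMediaType;
        final Set<String> knownOptions;
        final long maxSourceSizeBytes; // assumption: -1 means "no limit"
        final int priority;            // lower number = higher priority

        Candidate(String transformerName, String sourceMediaType, String targetMediaType,
                  Set<String> knownOptions, long maxSourceSizeBytes, int priority) {
            this.transformerName = transformerName;
            this.sourceMediaType = sourceMediaType;
            this.targetMediaType = targetMediaType;
            this.knownOptions = knownOptions;
            this.maxSourceSizeBytes = maxSourceSizeBytes;
            this.priority = priority;
        }
    }

    /** Applies the four selection steps in order and returns the chosen transformer name. */
    static Optional<String> select(List<Candidate> candidates, String sourceMediaType,
                                   String targetMediaType, Set<String> suppliedOptions,
                                   long sourceSizeBytes) {
        return candidates.stream()
            // 1. Source->Target Media Types
            .filter(c -> c.sourceMediaType.equals(sourceMediaType)
                      && c.targetMediaType.equals(targetMediaType))
            // 2. transformOptions: the transformer must know every supplied option
            .filter(c -> c.knownOptions.containsAll(suppliedOptions))
            // 3. maxSourceSizeBytes
            .filter(c -> c.maxSourceSizeBytes < 0 || sourceSizeBytes <= c.maxSourceSizeBytes)
            // 4. priority: the lowest number wins
            .min(Comparator.comparingInt((Candidate c) -> c.priority))
            .map(c -> c.transformerName);
    }

    /** Small helper to build an option set. */
    static Set<String> options(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }
}
```

Required and grouped options, which the definition may also specify, are left out of the sketch to keep it short.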

docs/t-engines.md Normal file

@@ -0,0 +1,60 @@
# T-Engines
The t-engines provide the basic transform operations. The Transform Service
provides a common base for the communication with other components. It is
this base that is described in this section. The base is a Spring Boot
application to which transform specific code is added and then wrapped
in a Docker image with any programs that the transforms need. The base
does not need to be used as long as there appears to be a process responding
endpoints and messages.
A t-engine groups together one of more Transformers. Each Transformer
(provided by transform specific code) knows how to perform a set of
transformations from one MIME Type to another with a common set of
t-options.
~~~yaml
0010 my-t-engine
Transformer 1
mimetype A -> mimetype B
mimetype A -> mimetype C
mimetype B -> mimetype C
option1
option2
Transformer 2
mimetype A -> mimetype B
mimetype D -> mimetype C
option2
option3
0020 another-t-engine
...
0030 yet-another-t-engine
...
~~~
## Endpoints
* `POST /transform` to perform a transform. There are two forms:
* For asynchronous transforms: Perform a transform using a
`TransformRequest` received from the t-router via a message queue. The
`TransformReply` is sent back via the queue.
* For synchronous transforms: Performs a transform on content uploaded as
a Multipart File and provides the resulting content as a download.
Transform options are extracted from the request properties. The
following are not added as transform options, but are used to select the
transformer: `sourceMimetype` & `targetMimetype`.
* `GET /transform/config` to obtain t-config about what the t-engine supports.
It has a parameter `configVersion` to allow a caller and the t-engine to
negotiate down to a common format. The value is an integer which indicates
which elements may be added to the config. These elements reflect
functionality supported by the base (such as pre-signed URLs). The
`CoreVersionDecorator` adds to the Config returned by the transform
specific code.
* `GET /` provides an html test page to upload a source file, enter transform
options and issue a synchronous transform request. Useful in testing.
* `GET /log` provides a page with basic log information. Useful in testing.
* `GET /error` provides an error page when testing.
* `GET /version` provides a String message to be included in client debug
messages.
* `GET /ready` used by Kubernetes as a readiness probe.
* `GET /live` used by Kubernetes as a liveness probe.
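
As an illustration of the `GET /transform/config` endpoint, the following sketch builds and sends the request with the JDK's `java.net.http` client. It assumes that `configVersion` is passed as a query parameter and that the t-engine is reachable at a base URL supplied by the caller. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class TransformConfigClientSketch {

    /** Builds the request only; assumes configVersion is a query parameter. */
    static HttpRequest configRequest(String baseUrl, int configVersion) {
        URI uri = URI.create(baseUrl + "/transform/config?configVersion=" + configVersion);
        return HttpRequest.newBuilder(uri).GET().build();
    }

    /** Sends the request and returns the raw t-config body as text. */
    static String fetchConfig(String baseUrl, int configVersion)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(configRequest(baseUrl, configVersion), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A caller might use `fetchConfig("http://localhost:8090", 1)` against a locally running t-engine; the host and port here are placeholders.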

docs/transform-config.md Normal file

@@ -0,0 +1,310 @@
# T-Engine configuration
Each t-engine provides an endpoint that returns t-config that defines what
it supports. The t-router and t-engines may also have external t-config files.
These are combined in name order. As sorting is alphanumeric, you may wish to
consider using a fixed length numeric prefix in filenames and t-engine names. As will be seen,
t-config may reference elements from other components or modify elements
from earlier t-config.
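
The reason for a fixed length prefix can be seen with a small sorting sketch. It assumes the name ordering behaves like Java's natural `String` order, which may differ in detail from the real implementation. The helper names and the `%04d-` prefix format are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ConfigNameOrderSketch {

    /** Returns the names in the order they would be read, assuming plain String ordering. */
    static List<String> inReadOrder(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    /** Adds a zero padded, fixed length numeric prefix (hypothetical naming scheme). */
    static String withPrefix(int order, String name) {
        return String.format("%04d-%s", order, name);
    }

    public static void main(String[] args) {
        // Without padding, "9-a" sorts after "10-b" and "100-c".
        System.out.println(inReadOrder(List.of("9-a", "10-b", "100-c")));
        // With padding, the numeric order is preserved.
        System.out.println(inReadOrder(List.of(
            withPrefix(100, "c"), withPrefix(9, "a"), withPrefix(10, "b"))));
    }
}
```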
Current configuration files are:
* [Pdf-Renderer T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/engines/pdfrenderer/src/main/resources/pdfrenderer_engine_config.json).
* [ImageMagick T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/engines/imagemagick/src/main/resources/imagemagick_engine_config.json).
* [Libreoffice T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/engines/libreoffice/src/main/resources/libreoffice_engine_config.json).
* [Tika T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/engines/tika/src/main/resources/tika_engine_config.json).
* [Misc T-Engine configuration](https://github.com/Alfresco/alfresco-transform-core/blob/master/engines/misc/src/main/resources/misc_engine_config.json).
Additional config files (which may be resources on the classpath or external
files) are specified in Spring Boot properties such as
`transform.config.file.<filename>` or environment variables like
`TRANSFORM_CONFIG_FILE_<filename>`.
The following is a simple t-config file from an example Hello World
t-engine.
~~~json
{
"transformOptions":
{
"helloWorldOptions":
[
{"value": {"name": "language"}}
]
},
"transformers":
[
{
"transformerName": "helloWorld",
"supportedSourceAndTargetList":
[
{"sourceMediaType": "text/plain", "maxSourceSizeBytes": 50, "targetMediaType": "text/html" }
],
"transformOptions":
[
"helloWorldOptions"
]
}
]
}
~~~
* **transformOptions** provides a list of transform options (each with its own
name) that may be referenced for use in different transformers. This way
common options don't need to be repeated for each transformer. They can
even be shared between T-Engines. In this example there is only one group
of options called `helloWorldOptions`, which has just one option,
`language`. Unless an option has a `"required": true` field it is considered
to be optional. You don't need to specify _sourceMimetype, sourceExtension,
sourceEncoding, targetMimetype, targetExtension_ or _timeout_ as options as
these are available to all transformers.
* **transformers** is a list of transformer definitions. Each transformer
definition should have a unique `transformerName`, specify a
`supportedSourceAndTargetList` and indicate which options it supports.
In this case there is only one transformer called `helloWorld` and it
accepts `helloWorldOptions`. A transformer may specify references to 0
or more transformOptions.
* **supportedSourceAndTargetList** is simply a list of source and target
Media Types that may be transformed, optionally specifying
`maxSourceSizeBytes` and `priority` values. In this case there is only one
from text to HTML and we have limited the source file size, to avoid
transforming files that clearly don't contain names.
## Transform pipelines
Transforms may be combined in a pipeline to form a new transformer, where
the output from one becomes the input to the next and so on. The t-config
defines the sequence of transform steps and intermediate Media Types. Like
any other transformer, it specifies a list of supported source and target
Media Types. If you don't supply any, all possible combinations are assumed
to be available. The definition may reuse the `transformOptions` of
transformers in the pipeline, but typically will define its own subset
of these.
The following example begins with the `helloWorld` Transformer, which takes a
text file containing a name and produces an HTML file with `Hello <name>`
message in the body. This is then transformed back into a text file. This
example contains just one pipeline transformer, but many may be defined
in the same file.
~~~json
{
"transformers": [
{
"transformerName": "helloWorldText",
"transformerPipeline" : [
{"transformerName": "helloWorld", "targetMediaType": "text/html"},
{"transformerName": "html"}
],
"supportedSourceAndTargetList": [
{"sourceMediaType": "text/plain", "priority": 45, "targetMediaType": "text/plain" }
],
"transformOptions": [
"helloWorldOptions"
]
}
]
}
~~~
* **transformerName** Try to create a unique name for the transform.
* **transformerPipeline** A list of transformers in the pipeline. The
`targetMediaType` specifies the intermediate Media Types between
transformers. There is no final `targetMediaType` as this comes from the
`supportedSourceAndTargetList`. The `transformerName` may reference a
transformer that has not been defined yet. A warning is issued if
it remains undefined after all t-config has been combined. Generally
it is better for a t-engine rather than the t-router to define pipeline
transformers as this limits the number of places that have to be changed.
Normally it is obvious which t-engine should contain the definition.
* **supportedSourceAndTargetList** The supported source and target Media
Types, which refer to the Media Types this pipeline transformer can
transform from and to. Additionally you can set the `priority` and the
`maxSourceSizeBytes`. If blank, this indicates that all possible
combinations are supported. This is the cartesian product of all source
types to the first intermediate type and all target types from the last
intermediate type. Any combinations supported by the first transformer
are excluded. They will also have the priority from the first transform.
* **transformOptions** A list of references to options required by the
pipeline transformer.
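
One reading of the blank `supportedSourceAndTargetList` rule is sketched below: take every source type the first step can turn into the first intermediate type, pair it with every target type the last step can produce from the last intermediate type, and drop the pairs the first transformer already supports directly. The method and parameter names are hypothetical, and the real base may compute this differently.

```java
import java.util.Set;
import java.util.TreeSet;

final class PipelineDefaultsSketch {

    /**
     * Builds "source -> target" pairs as the cartesian product of the two sets,
     * excluding pairs the first transformer supports on its own.
     */
    static Set<String> defaultSupported(Set<String> sourcesIntoFirstIntermediate,
                                        Set<String> targetsFromLastIntermediate,
                                        Set<String> firstTransformerPairs) {
        Set<String> pairs = new TreeSet<>(); // sorted, so the result is deterministic
        for (String source : sourcesIntoFirstIntermediate) {
            for (String target : targetsFromLastIntermediate) {
                String pair = source + " -> " + target;
                if (!firstTransformerPairs.contains(pair)) {
                    pairs.add(pair);
                }
            }
        }
        return pairs;
    }
}
```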
## Failover transforms
A failover transform simply provides a list of transforms to be attempted
one after another until one succeeds. For example, you may have a fast
transform that is able to handle a limited set of transforms and another
that is slower but handles all cases.
~~~json
{
"transformers": [
{
"transformerName": "imgExtractOrImgCreate",
"transformerFailover" : [ "imgExtract", "imgCreate" ],
"supportedSourceAndTargetList": [
{"sourceMediaType": "application/vnd.oasis.opendocument.graphics", "priority": 150, "targetMediaType": "image/png" },
...
{"sourceMediaType": "application/vnd.sun.xml.calc.template", "priority": 150, "targetMediaType": "image/png" }
]
}
]
}
~~~
* **transformerName** Try to create a unique name for the transform.
* **transformerFailover** A list of transformers to try. This may include
references to transformers that have not been defined yet. Generally it
is better for the t-engine rather than the t-router to define failover
transformers as this limits the number of places that have to be changed.
Normally it is obvious which t-engine should contain the definition.
* **supportedSourceAndTargetList** The supported source and target Media
Types, which refer to the Media Types this failover transformer can
transform from and to. Additionally you can set the `priority` and the
`maxSourceSizeBytes`. Unlike pipelines, it must not be blank.
* **transformOptions** A list of references to options required by the
failover transformer.
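
The failover behaviour described in this section (try each transformer in turn until one succeeds) can be sketched as a simple loop. The `Step` interface and the method name are invented for this illustration and are not part of the t-engine base.

```java
import java.util.List;

final class FailoverSketch {

    /** Hypothetical stand-in for one transformer in the failover list. */
    interface Step {
        String transform(String source) throws Exception;
    }

    /** Tries each step in order and returns the first successful result. */
    static String transformWithFailover(List<Step> steps, String source) {
        Exception lastFailure = null;
        for (Step step : steps) {
            try {
                return step.transform(source);
            } catch (Exception e) {
                lastFailure = e; // remember the failure and fall through to the next step
            }
        }
        throw new IllegalStateException("All failover transformers failed", lastFailure);
    }
}
```

With a list such as `[ "imgExtract", "imgCreate" ]`, the second transformer is only attempted if the first throws.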
## Overriding transforms
It is possible to override a previously defined transform definition. The
following example removes most of the supported source to target media
types from the standard `"libreoffice"` transform. It also changes the
max size and priority of others. This is not something you would normally
want to do.
~~~json
{
"transformers": [
{
"transformerName": "libreoffice",
"supportedSourceAndTargetList": [
{"sourceMediaType": "text/csv", "maxSourceSizeBytes": 1000, "targetMediaType": "text/html" },
{"sourceMediaType": "text/csv", "targetMediaType": "application/vnd.oasis.opendocument.spreadsheet" },
{"sourceMediaType": "text/csv", "targetMediaType": "application/vnd.oasis.opendocument.spreadsheet-template" },
{"sourceMediaType": "text/csv", "targetMediaType": "text/tab-separated-values" },
{"sourceMediaType": "text/csv", "priority": 45, "targetMediaType": "application/vnd.ms-excel" },
{"sourceMediaType": "text/csv", "priority": 155, "targetMediaType": "application/pdf" }
]
}
]
}
~~~
## Removing a transformer
To discard a previous transformer definition include its name in the
optional `"removeTransformers"` list. You might want to do this if you
have a replacement and wish to keep the overall configuration simple (so it
contains no alternatives), or you wish to temporarily remove it. The
following example removes two transformers before processing any other
configuration in the same T-Engine or pipeline file.
~~~json
{
"removeTransformers" : [
"libreoffice",
"Archive"
]
...
}
~~~
## Overriding the supportedSourceAndTargetList
Rather than totally override an existing transform definition, it is
generally simpler to modify the `"supportedSourceAndTargetList"` by adding
elements to the optional `"addSupported"`, `"removeSupported"` and
`"overrideSupported"` lists. You will need to specify the
`"transformerName"` but you will not need to repeat all the other
`"supportedSourceAndTargetList"` values, which means if there are changes
in the original, the same change is not needed in a second place. The
following example adds one transform, removes two others and changes
the `"priority"` and `"maxSourceSizeBytes"` of another. This is done before
processing any other configuration in the same T-Engine or pipeline file.
~~~json
{
"addSupported": [
{
"transformerName": "Archive",
"sourceMediaType": "application/zip",
"targetMediaType": "text/csv",
"priority": 60,
"maxSourceSizeBytes": 18874368
}
],
"removeSupported": [
{
"transformerName": "Archive",
"sourceMediaType": "application/zip",
"targetMediaType": "text/xml"
},
{
"transformerName": "Archive",
"sourceMediaType": "application/zip",
"targetMediaType": "text/plain"
}
],
"overrideSupported": [
{
"transformerName": "Archive",
"sourceMediaType": "application/zip",
"targetMediaType": "text/html",
"priority": 60,
"maxSourceSizeBytes": 18874368
}
]
...
}
~~~
## Default maxSourceSizeBytes and priority values
When defining `"supportedSourceAndTargetList"` elements the `"priority"`
and `"maxSourceSizeBytes"` are optional and normally have the default
values of 50 and -1 (no limit). It is possible to change those defaults.
In precedence order from most specific to most general these are defined
by combinations of `"transformerName"` and `"sourceMediaType"`.
* **transformer and source media type default** both are specified
* **transformer default** only the transformer name is specified
* **source media type default** only the source media type is specified
* **system wide default** neither is specified
Both `"priority"` and `"maxSourceSizeBytes"` may be specified in an element,
but if only one is specified it is only that value that is being defaulted.
Being able to change the defaults is particularly useful once a T-Engine
has been developed as it allows a system administrator to handle
limitations that are only found later. The `system wide defaults` are
generally not used but are included for completeness. The following
example says that the `"Office"` transformer by default should only handle
zip files up to 18 MB and by default the maximum size of a `.doc` file to be
transformed is 4 MB. The third element sets the system wide default priority
to `60`, possibly allowing another transformer that has specified a priority
of say `50` to be used in preference.
Default values are only applied after all t-config has been read.
~~~json
{
"supportedDefaults": [
{
"transformerName": "Office", // default for a source type within a transformer
"sourceMediaType": "application/zip",
"maxSourceSizeBytes": 18874368
},
{
"sourceMediaType": "application/msword", // defaults for a source type
"maxSourceSizeBytes": 4194304,
"priority": 45
},
{
"priority": 60 // system wide default
},
{
"maxSourceSizeBytes": -1 // system wide default
}
]
...
}
~~~
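
The precedence rules for defaults can be sketched as a lookup that picks the most specific matching element. This is a simplified illustration under the assumption that a missing `transformerName` or `sourceMediaType` matches anything. The class and method names are hypothetical.

```java
import java.util.List;

final class SupportedDefaultsSketch {

    /** One "supportedDefaults" element; null means "not specified". */
    static final class Default {
        final String transformerName;
        final String sourceMediaType;
        final Long maxSourceSizeBytes;
        final Integer priority;

        Default(String transformerName, String sourceMediaType,
                Long maxSourceSizeBytes, Integer priority) {
            this.transformerName = transformerName;
            this.sourceMediaType = sourceMediaType;
            this.maxSourceSizeBytes = maxSourceSizeBytes;
            this.priority = priority;
        }

        /** 3 = transformer and source type, 2 = transformer, 1 = source type, 0 = system wide. */
        int specificity() {
            return (transformerName != null ? 2 : 0) + (sourceMediaType != null ? 1 : 0);
        }

        boolean appliesTo(String transformer, String source) {
            return (transformerName == null || transformerName.equals(transformer))
                && (sourceMediaType == null || sourceMediaType.equals(source));
        }
    }

    /** Most specific default priority, or 50 if none applies. */
    static int priorityFor(List<Default> defaults, String transformer, String source) {
        int best = -1;
        int value = 50;
        for (Default d : defaults) {
            if (d.priority != null && d.appliesTo(transformer, source) && d.specificity() > best) {
                best = d.specificity();
                value = d.priority;
            }
        }
        return value;
    }

    /** Most specific default maxSourceSizeBytes, or -1 (no limit) if none applies. */
    static long maxSourceSizeBytesFor(List<Default> defaults, String transformer, String source) {
        int best = -1;
        long value = -1;
        for (Default d : defaults) {
            if (d.maxSourceSizeBytes != null && d.appliesTo(transformer, source)
                    && d.specificity() > best) {
                best = d.specificity();
                value = d.maxSourceSizeBytes;
            }
        }
        return value;
    }
}
```

Because each value is looked up separately, an element that sets only one of the two values leaves the other to be resolved from a less specific element.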


@@ -0,0 +1,140 @@
# Transform specific code
To create a new t-engine an author uses a base t-engine (a Spring Boot
application) and implements the following interfaces. An implementation of
the `CustomTransformer` provides the actual transformation code and the
implementation of the `TransformEngine` says what it is capable of
transforming. The `TransformConfig` is normally read from a json file on the
classpath. Multiple `CustomTransformer` implementations may be in a single
t-engine. As a result the author can concentrate on the code that transforms
one format to another without really worrying about all the plumbing.
Typically, the transform specific code uses a 3rd party library or an
external executable which needs to be added to the Docker image.
~~~java
package org.alfresco.transform;

import org.alfresco.transform.config.TransformConfig;
import org.alfresco.transformer.probes.ProbeTestTransform;

import java.util.Set;

/**
 * Interface to be implemented by transform specific code. Provides information
 * about the t-engine as a whole. So that it is automatically picked up, it must
 * exist in a package under {@code org.alfresco.transform} and have the Spring
 * {@code @Component} annotation.
 */
public interface TransformEngine
{
    /**
     * @return the name of the t-engine. The t-router reads config from t-engines
     *         in name order.
     */
    String getTransformEngineName();

    /**
     * @return a definition of what the t-engine supports. Normally read from a json
     *         Resource on the classpath.
     */
    TransformConfig getTransformConfig();

    /**
     * @return a ProbeTestTransform (will do a quick transform) for k8s liveness and
     *         readiness probes.
     */
    ProbeTestTransform getProbeTransform();
}
~~~
Implementations of the following interface provide the actual transform code.
~~~java
package org.alfresco.transform;

import org.alfresco.transform.base.TransformManager;
import org.alfresco.transform.config.TransformConfig;

import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

/**
 * Interface to be implemented by transform specific code. The
 * {@code transformerName} should match the transformerName in the
 * {@link TransformConfig} returned by the {@link TransformEngine}. So that it is
 * automatically picked up, it must exist in a package under
 * {@code org.alfresco.transform} and have the Spring {@code @Component} annotation.
 *
 * Implementations may also use the {@link TransformManager} if they wish to
 * interact with the base t-engine.
 */
public interface CustomTransformer
{
    String getTransformerName();

    void transform(String sourceMimetype, InputStream inputStream,
                   String targetMimetype, OutputStream outputStream,
                   Map<String, String> transformOptions,
                   TransformManager transformManager) throws Exception;
}
~~~
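As an illustration of how little code a transformer needs, the following is a minimal sketch of a `CustomTransformer` implementation that works purely on the supplied streams. It is not taken from any of the real t-engines: `UppercaseTransformer`, the transformer name `uppercase` and the `encoding` option are all hypothetical. The two interfaces are re-declared as local stand-ins so that the sketch compiles without the t-engine base on the classpath. A real implementation would live in a package under `org.alfresco.transform` and carry the Spring `@Component` annotation.

~~~java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

/** Local stand-in for the interface provided by the base t-engine (no methods needed here). */
interface TransformManager
{
}

/** Local stand-in with the same shape as the interface shown above. */
interface CustomTransformer
{
    String getTransformerName();

    void transform(String sourceMimetype, InputStream inputStream,
                   String targetMimetype, OutputStream outputStream,
                   Map<String, String> transformOptions,
                   TransformManager transformManager) throws Exception;
}

/** Hypothetical transformer that upper-cases plain text without touching the file system. */
final class UppercaseTransformer implements CustomTransformer
{
    @Override
    public String getTransformerName()
    {
        // Should match a transformerName in the TransformConfig of the t-engine.
        return "uppercase";
    }

    @Override
    public void transform(String sourceMimetype, InputStream inputStream,
                          String targetMimetype, OutputStream outputStream,
                          Map<String, String> transformOptions,
                          TransformManager transformManager) throws IOException
    {
        // "encoding" is an assumed option name, used only for this illustration.
        Charset charset = Charset.forName(transformOptions.getOrDefault("encoding", "UTF-8"));
        String text = new String(inputStream.readAllBytes(), charset);
        outputStream.write(text.toUpperCase(Locale.ROOT).getBytes(charset));
    }

    /** Runs the transformer against in-memory streams, without a running t-engine. */
    static String demo(String text)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try
        {
            new UppercaseTransformer().transform("text/plain",
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    "text/plain", out, Map.of(), null);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
~~~

Because the transformer only reads from the `InputStream` and writes to the `OutputStream`, it can be exercised in a unit test with in-memory streams, as `demo` does.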
The implementation of the following interface is provided by the t-base and
allows the `CustomTransformer` to interact with the base t-engine. The
creation of Files is discouraged, as it is better not to leave files on disk.
~~~java
package org.alfresco.transform.base;

import org.alfresco.transform.CustomTransformer;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;

/**
 * Allows {@link CustomTransformer} implementations to interact with the base
 * t-engine.
 */
public interface TransformManager
{
    /**
     * Allows a CustomTransformer to use a local source File rather than the
     * supplied InputStream. To avoid creating extra files, if a File has already
     * been created by the base t-engine, it is returned.
     */
    File createSourceFile();

    /**
     * Allows a CustomTransformer to use a local target File rather than the
     * supplied OutputStream. To avoid creating extra files, if a File has already
     * been created by the base t-engine, it is returned.
     */
    File createTargetFile();

    /**
     * Allows a single transform request to have multiple transform responses. For
     * example, images from a video at different time offsets or different pages of
     * a document. Following a call to this method a transform response is made with
     * the data sent to the current {@code OutputStream}. If this method has been
     * called, there will not be another response when
     * {@link CustomTransformer#transform(String, InputStream, String, OutputStream, Map, TransformManager)}
     * returns and any data written to the final {@code OutputStream} will be
     * ignored.
     * @param index returned with the response, so that the fragment may be
     *              distinguished from other responses. Renditions use the index
     *              as an offset into elements. A {@code null} value indicates
     *              that there is no more output and any data sent to the current
     *              {@code outputStream} will be ignored.
     * @param finished indicates this is the final fragment. {@code False} indicates
     *                 that it is expected there will be more fragments. There need
     *                 not be a call with this parameter set to {@code true}.
     * @return a new {@code OutputStream} for the next fragment. A {@code null} will
     *         be returned if {@code index} was {@code null} or {@code
     *         finished} was {@code true}.
     * @throws TransformException if a synchronous (http) request has been made as
     *                            this only works with requests on queues, or the first call to
     *                            this method indicated there was no output, or another call is
     *                            made after it has been indicated that there should be no more
     *                            fragments.
     * @throws IOException if there was a problem sending the response.
     */
    OutputStream respondWithFragment(Integer index, boolean finished) throws IOException;
}
~~~
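The fragment mechanism is easier to follow with a small example. The sketch below is an assumption-laden illustration, not the base t-engine's implementation: `FragmentResponder` is a hypothetical stand-in for the single `TransformManager` method it uses (with the `index` and `finished` parameters described in the Javadoc), `RecordingResponder` is a hypothetical in-memory fake that records each response instead of sending it to a queue, and `LineSplitter` treats each input line as one fragment.

~~~java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the fragment method described in the Javadoc above. */
interface FragmentResponder
{
    OutputStream respondWithFragment(Integer index, boolean finished) throws IOException;
}

/** Hypothetical in-memory fake: records each fragment instead of sending a response. */
final class RecordingResponder implements FragmentResponder
{
    final List<String> responses = new ArrayList<>();
    private ByteArrayOutputStream current = new ByteArrayOutputStream();

    OutputStream currentOutputStream()
    {
        return current;
    }

    @Override
    public OutputStream respondWithFragment(Integer index, boolean finished)
    {
        if (current == null)
        {
            throw new IllegalStateException("no more fragments were expected");
        }
        if (index != null)
        {
            // One response per call, carrying whatever was written to the current stream.
            responses.add(index + ":" + new String(current.toByteArray(), StandardCharsets.UTF_8));
        }
        // A null index or finished == true ends the sequence, so no new stream is handed out.
        current = (index == null || finished) ? null : new ByteArrayOutputStream();
        return current;
    }
}

final class LineSplitter
{
    /** Sends each input line as its own fragment, using the line number as the index. */
    static void transform(InputStream in, OutputStream out, FragmentResponder responder)
            throws IOException
    {
        String[] lines = new String(in.readAllBytes(), StandardCharsets.UTF_8).split("\n");
        for (int i = 0; i < lines.length; i++)
        {
            out.write(lines[i].getBytes(StandardCharsets.UTF_8));
            // Switch to the stream for the next fragment (null after the final one).
            out = responder.respondWithFragment(i, i == lines.length - 1);
        }
    }

    /** Runs the splitter against the in-memory fake and returns the recorded responses. */
    static List<String> demo(String text)
    {
        RecordingResponder responder = new RecordingResponder();
        try
        {
            transform(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                      responder.currentOutputStream(), responder);
        }
        catch (IOException e)
        {
            throw new UncheckedIOException(e);
        }
        return responder.responses;
    }
}
~~~

The point to note is that the transformer always writes to the stream most recently returned by `respondWithFragment`, and stops writing once that method returns `null`.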

@@ -0,0 +1,28 @@
# Transformer selection strategy
The TransformRegistry uses t-config to choose which Transformer will be
used. A transformer definition contains a list of supported source and
target Media Types. This is used for the most basic selection. It is further
refined by checking that the definition also supports transform options (the
parameters) that have been supplied in a transform request.
~~~text
Transformer 1 defines options: Op1, Op2
Transformer 2 defines options: Op1, Op2, Op3, Op4
Transform request provides values for options: Op2, Op3
~~~
If we assume both transformers support the required source and target Media
Types, Transformer 2 will be selected because it knows about all the supplied
options. The definition may also specify that some options are required or
grouped. If any members of an optional group are supplied, all required
members of that group become required.
The configuration may impose a source file size limit resulting in the
selection of a different transformer. Size limits are normally added to avoid
the transforms consuming too many resources.
The configuration may also specify a priority which will be used in
Transformer selection if there are a number of possible transformers. The
highest priority is the one with the lowest number.
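
The selection rules above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the `TransformRegistry` code: `SelectionSketch` and `Candidate` are hypothetical names, each candidate is flattened to a single source and target Media Type, a `maxSourceSizeBytes` of `-1` is assumed to mean no limit, and required or grouped options are not modelled.

~~~java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class SelectionSketch
{
    /** Hypothetical, flattened view of one source/target entry in a transformer definition. */
    static final class Candidate
    {
        final String name;
        final String sourceMediaType;
        final String targetMediaType;
        final Set<String> options;
        final long maxSourceSizeBytes; // assumption: -1 means no limit
        final int priority;            // lower number means higher priority

        Candidate(String name, String sourceMediaType, String targetMediaType,
                  Set<String> options, long maxSourceSizeBytes, int priority)
        {
            this.name = name;
            this.sourceMediaType = sourceMediaType;
            this.targetMediaType = targetMediaType;
            this.options = options;
            this.maxSourceSizeBytes = maxSourceSizeBytes;
            this.priority = priority;
        }

        boolean supports(String source, String target, Set<String> suppliedOptions,
                         long sourceSizeBytes)
        {
            return sourceMediaType.equals(source)
                && targetMediaType.equals(target)
                // The definition must know about every option supplied in the request.
                && options.containsAll(suppliedOptions)
                && (maxSourceSizeBytes < 0 || sourceSizeBytes <= maxSourceSizeBytes);
        }
    }

    /** Returns the name of the supporting candidate with the lowest priority number. */
    static Optional<String> select(List<Candidate> candidates, String source, String target,
                                   Set<String> suppliedOptions, long sourceSizeBytes)
    {
        return candidates.stream()
            .filter(c -> c.supports(source, target, suppliedOptions, sourceSizeBytes))
            .min(Comparator.comparingInt((Candidate c) -> c.priority))
            .map(c -> c.name);
    }

    /** The Op1..Op4 example from the text, with illustrative Media Types. */
    static Optional<String> demo()
    {
        List<Candidate> candidates = List.of(
            new Candidate("Transformer 1", "application/msword", "application/pdf",
                          Set.of("Op1", "Op2"), -1, 50),
            new Candidate("Transformer 2", "application/msword", "application/pdf",
                          Set.of("Op1", "Op2", "Op3", "Op4"), -1, 50));
        return select(candidates, "application/msword", "application/pdf",
                      Set.of("Op2", "Op3"), 1024);
    }
}
~~~

In `demo`, Transformer 1 is filtered out because it does not define `Op3`, which leaves Transformer 2 as the only candidate. When several candidates survive the filters, the one with the lowest priority number is returned.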

docs/transformerDebug.md Normal file

@@ -0,0 +1,46 @@
# TransformerDebug
In addition to any normal logging, the t-engines, t-router and t-client also
use the `TransformerDebug` class to provide request based logging. The
following is an example from Alfresco after the upload of a `docx` file.
~~~text
163 docx json AGM 2016 - Masters report.docx 14.8 KB -- metadataExtract -- TransformService
163 workspace://SpacesStore/0db3a665-328d-4437-85ed-56b753cf19c8 1563306426
163 docx json 14.8 KB -- metadataExtract -- PoiMetadataExtractor
163 cm:title=
163 cm:author=James Dobinson
163 Finished in 664 ms
...
164 docx png AGM 2016 - Masters report.docx 14.8 KB -- doclib -- TransformService
164 workspace://SpacesStore/0db3a665-328d-4437-85ed-56b753cf19c8 1563306426
164 docx png 14.8 KB -- doclib -- officeToImageViaPdf
164.1 docx pdf libreoffice
164.2 pdf png pdfToImageViaPng
164.2.1 pdf png pdfrenderer
164.2.2 png png imagemagick
164.2.2 endPage="0"
164.2.2 resizeHeight="100"
164.2.2 thumbnail="true"
164.2.2 startPage="0"
164.2.2 resizeWidth="100"
164.2.2 autoOrient="true"
164.2.2 allowEnlargement="false"
164.2.2 maintainAspectRatio="true"
164 Finished in 725 ms
~~~
This log happens to be from the t-client, but similar log lines exist in the
t-router and individual t-engines.
All lines start with a reference, which begins with the client's request
number (`163` or `164` above, if known) followed by a nested pipeline or
failover structure. The first request extracts metadata and the second creates a
thumbnail rendition (called `doclib`). The second request is handled by a
pipeline called `officeToImageViaPdf` which uses `libreoffice` to transform
to `pdf` and then another pipeline to convert to `png`. The last step
(`164.2.2`) in the process resizes the `png` using a number of transform
options.
If requested, log information is passed back in the TransformReply's
clientData.
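
When reading or post-processing these logs, it can help to split the reference into its request number and nested steps. The following is a small sketch for doing that. `DebugReference` is a hypothetical helper, not part of `TransformerDebug`, and it assumes the reference is the first whitespace-delimited token on the line and that the request number is known and numeric.

~~~java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a reference such as 164.2.2 into its parts. */
final class DebugReference
{
    final int requestNumber;   // the client's request number, e.g. 164
    final List<Integer> steps; // nested pipeline or failover steps, e.g. [2, 2]

    private DebugReference(int requestNumber, List<Integer> steps)
    {
        this.requestNumber = requestNumber;
        this.steps = steps;
    }

    static DebugReference parse(String logLine)
    {
        // The reference is assumed to be the first token on the line.
        String reference = logLine.trim().split("\\s+", 2)[0];
        String[] parts = reference.split("\\.");
        List<Integer> steps = new ArrayList<>();
        for (int i = 1; i < parts.length; i++)
        {
            steps.add(Integer.parseInt(parts[i]));
        }
        return new DebugReference(Integer.parseInt(parts[0]), steps);
    }

    /** Nesting depth: 0 for a top level line, 2 for a line such as 164.2.2. */
    int depth()
    {
        return steps.size();
    }
}
~~~

For example, `DebugReference.parse("164.2.2 png png imagemagick")` yields request number `164` at depth `2`, which is enough to group lines by request or indent them by nesting level.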