An Alfresco T-Engine that converts PDF, Office documents, HTML, and images to Markdown using MarkItDown and pymupdf4llm, and converts Markdown to PDF using md2pdf.
- Features
- Supported Transformations
- Getting Started
- API Reference
- Docker Deployment
- Testing
- Contributing
- Security
- License
- Bidirectional conversion: PDF/Office/HTML/Images to Markdown and Markdown to PDF
- Dockerized: Ready-to-use Docker container with all dependencies
- Alfresco integration: Native T-Engine for Alfresco Content Services
- Lightweight: Uses MarkItDown and pymupdf4llm instead of heavy ML-based pipelines
- Image support: Convert images to Markdown descriptions
- Extensible: Easy to add new transformation capabilities
This project provides an Alfresco Content Services (ACS) transformer that converts a wide variety of document types into Markdown format using three lightweight Python libraries:
- MarkItDown (MIT License) - Converts Office documents, HTML, CSV, and images to Markdown
- pymupdf4llm (AGPL-3.0) - Converts PDF documents to Markdown
- md2pdf (MIT License) - Converts Markdown to PDF
The transformer runs inside a Docker container and can be integrated into Alfresco as a local transformer.
The following source formats are converted to text/markdown (or text/x-markdown):
Document Types
| Source | MIME Type |
|---|---|
| Word (.docx) | application/vnd.openxmlformats-officedocument.wordprocessingml.document |
| Excel (.xlsx) | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet |
| PowerPoint (.pptx) | application/vnd.openxmlformats-officedocument.presentationml.presentation |
| HTML | text/html |
| XHTML | application/xhtml+xml |
| CSV | text/csv |
Image Types
| Source | MIME Type |
|---|---|
| PNG | image/png |
| JPEG | image/jpeg |
| TIFF | image/tiff |
| BMP | image/bmp |

| Source | Target | Description |
|---|---|---|
| application/pdf | text/markdown | Convert PDF documents to Markdown |

| Source | Target | Description |
|---|---|---|
| text/markdown | application/pdf | Convert Markdown documents to PDF |
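
Taken together, the tables above and the library descriptions suggest a simple dispatch by MIME type. The sketch below is an illustration inferred from this README, not the engine's actual routing code: `pick_converter` is a hypothetical helper, and accepting text/x-markdown as a source for PDF output is an assumption.

```python
# Hypothetical routing table inferred from the README tables; the engine's
# real configuration may differ.
MARKDOWN_TYPES = {"text/markdown", "text/x-markdown"}

# Source types the README attributes to MarkItDown (documents and images).
MARKITDOWN_SOURCES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "text/csv",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/bmp",
}


def pick_converter(source_mimetype: str, target_mimetype: str) -> str:
    """Return the name of the library expected to handle the pair."""
    if target_mimetype in MARKDOWN_TYPES:
        if source_mimetype == "application/pdf":
            return "pymupdf4llm"
        if source_mimetype in MARKITDOWN_SOURCES:
            return "markitdown"
    # Assumption: both Markdown MIME types are accepted as PDF sources.
    if source_mimetype in MARKDOWN_TYPES and target_mimetype == "application/pdf":
        return "md2pdf"
    raise ValueError(
        f"unsupported transformation: {source_mimetype} -> {target_mimetype}"
    )


print(pick_converter("application/pdf", "text/markdown"))
print(pick_converter("text/markdown", "application/pdf"))
```

Unsupported pairs raise `ValueError`, so a caller can reject a request before invoking any converter.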
To create the transformer Docker image, run:
./run.sh build
This uses alfresco-base-java as the base image and installs Python 3.10 with MarkItDown, pymupdf4llm, and md2pdf via pip.
To run the image:
./run.sh start
- Port 8090 is for transformations
To enable remote debugging locally, start the container with:
docker compose -f target/docker-compose.yml run -p 8099:8099 -e JAVA_OPTS="-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=*:8099" becpg-transform-markdown
Example request to transform a PDF file to Markdown:
curl --location --request POST 'http://localhost:8090/transform' \
--form 'file=@"/path/to/sample.pdf"' \
--form 'sourceMimetype="application/pdf"' \
--form 'targetMimetype="text/markdown"'
You can declare the Docker service as follows in a docker-compose.yml file:
becpg-transform-markdown:
  image: becpg-transform-markdown:1.0.0
  ports:
    - "8090:8090"
  environment:
    - SERVER_PORT=8090
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8090/live"]
    interval: 30s
    timeout: 10s
    retries: 5
Add the following JVM property to your Alfresco instance:
-DlocalTransform.becpg-transform-markdown.url=http://localhost:8090/
This allows Alfresco to discover and use the transformer.
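
Before pointing Alfresco at the engine, it can help to confirm that the container is accepting requests. The following is a minimal sketch using only the Python standard library. It assumes the /live endpoint used by the compose healthcheck answers with HTTP 200 once the engine is ready. The function name `wait_until_live` and its parameters are hypothetical and not part of this project.

```python
import time
import urllib.error
import urllib.request


def wait_until_live(url="http://localhost:8090/live", attempts=5, delay=1.0,
                    opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll the liveness endpoint; return True once it answers with HTTP 200.

    `opener` and `sleep` are injectable so the loop can be exercised
    without a running container.
    """
    for attempt in range(attempts):
        try:
            with opener(url, timeout=10) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # container not accepting connections yet, or non-2xx reply
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_live()` with the defaults probes the local container up to five times, mirroring the `retries: 5` setting of the compose healthcheck.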
GET http://localhost:8090/live
POST http://localhost:8090/transform
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Source file to transform |
| sourceMimetype | String | Yes | MIME type of source file |
| targetMimetype | String | Yes | Desired output MIME type |
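
The three form fields above map onto a multipart/form-data request. As a rough sketch using only the Python standard library, the request could be assembled as follows. The helper name `build_transform_request` is hypothetical, and the default URL assumes the engine is running locally on port 8090.

```python
import uuid
import urllib.request


def build_transform_request(file_name, file_bytes, source_mimetype,
                            target_mimetype,
                            url="http://localhost:8090/transform"):
    """Build a multipart/form-data POST carrying the three documented fields."""
    boundary = uuid.uuid4().hex
    parts = []

    # Plain text fields: sourceMimetype and targetMimetype.
    for name, value in (("sourceMimetype", source_mimetype),
                        ("targetMimetype", target_mimetype)):
        field = (f"--{boundary}\r\n"
                 f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                 f"{value}\r\n")
        parts.append(field.encode("utf-8"))

    # File field: headers, then the raw bytes, then a line break.
    file_header = (f"--{boundary}\r\n"
                   f'Content-Disposition: form-data; name="file"; '
                   f'filename="{file_name}"\r\n'
                   f"Content-Type: {source_mimetype}\r\n\r\n")
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")

    # Closing boundary marks the end of the body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        url,
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. The response body is then presumably the transformed document, as with the curl call.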
Example:
curl --location --request POST 'http://localhost:8090/transform' \
--form 'file=@"/path/to/document.pdf"' \
--form 'sourceMimetype="application/pdf"' \
--form 'targetMimetype="text/markdown"'
Run integration tests (requires MarkItDown and pymupdf4llm installed locally):
mvn test
Run a specific test:
mvn test -Dtest=MarkitdownTransformerIT
DockerTransformIT builds a Docker image, starts a container, and verifies every supported transformation end-to-end. It is a parameterized test covering all 24 source/target combinations from the config: 12 unique transformations, counting PDF-to-Markdown and Markdown-to-PDF, each run with both the text/markdown and text/x-markdown MIME types.
mvn package -DskipTests
mvn test -Dtest=DockerTransformIT
Output files are written to src/test/resources/output/ for manual inspection.
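
The count of 24 combinations can be reproduced from the supported formats listed in this README. The sketch below is only an arithmetic check, not the test's actual parameter source. It assumes that Markdown-to-PDF accepts both text/markdown and text/x-markdown as the source type.

```python
from itertools import product

MARKDOWN_TYPES = ["text/markdown", "text/x-markdown"]

# Every source type the README lists as convertible to Markdown.
TO_MARKDOWN_SOURCES = [
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/html",
    "application/xhtml+xml",
    "text/csv",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/bmp",
    "application/pdf",
]

# 11 sources x 2 Markdown targets.
cases = list(product(TO_MARKDOWN_SOURCES, MARKDOWN_TYPES))
# Assumption: Markdown-to-PDF is exercised once per Markdown MIME type.
cases += [(markdown_type, "application/pdf") for markdown_type in MARKDOWN_TYPES]

print(len(cases))  # 24
```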
We welcome contributions! Please see our Contributing Guidelines and Code of Conduct for details.
For security-related issues, please see our Security Policy.
- This project is licensed under the GNU Lesser General Public License v3.0 - see the LICENSE file for details.
- This project uses MarkItDown and md2pdf, licensed under the MIT License.
- This project uses pymupdf4llm, licensed under the AGPL-3.0 License.
- Base image from Alfresco Docker Base Java
- beCPG - The open source PLM solution
- MarkItDown - Microsoft's lightweight document-to-Markdown converter
- pymupdf4llm - PDF to Markdown conversion
- md2pdf - Markdown to PDF conversion
- Alfresco - The open content services platform
Made with care by the beCPG team