performance: image processing optimizations #638
Merged
Avoid calling python-docx per-image by generating a CT_Inline-based XML template once and using str.format() to fill sentinels (keeping compatibility with installed python-docx). Add caching of generated image XML per (part, descriptor, width, height) to skip repeated I/O, SHA1 work and header parsing. Use package.get_or_add_image_part and relate_to with RT.IMAGE, compute scaled_dimensions, assign shape_id from docx_ids_index, and xml-escape filenames. Also add a _image_cache dict on DocxTemplate and adjust hyperlink handling to use the local part variable.
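A minimal sketch of the template-once, format-per-image idea described above. The template is a heavily simplified, hypothetical stand-in for the PR's _INLINE_IMAGE_XML (the real one is derived from CT_Inline and is much larger), and render_inline is an illustrative name, not a function from the PR.

```python
from xml.sax.saxutils import escape

# Hypothetical, simplified stand-in for a template built once at import time.
# The placeholders play the role of the sentinels filled by str.format().
INLINE_TEMPLATE = (
    "<wp:inline>"
    '<wp:extent cx="{cx}" cy="{cy}"/>'
    '<wp:docPr id="{shape_id}" name="Picture {shape_id}"/>'
    '<pic:cNvPr id="0" name="{filename}"/>'
    '<a:blip r:embed="{rId}"/>'
    "</wp:inline>"
)


def render_inline(rId, cx, cy, shape_id, filename):
    """Fill the prebuilt template; no per-image XML object construction."""
    return INLINE_TEMPLATE.format(
        rId=rId,
        cx=int(cx),
        cy=int(cy),
        shape_id=shape_id,
        # Escape the filename because it lands inside an XML attribute.
        filename=escape(filename or "", {'"': "&quot;"}),
    )
```

The per-image cost is then a single string format call, which is presumably where much of the saving over building the element tree each time comes from.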
Add an O(1) SHA1 index for image parts and a fast _get_or_add_image_part helper on DocxTemplate to avoid python-docx's O(n) linear scan and repeated SHA1 recomputation. Initialize the index in the constructor (_init_image_parts_index), seed it from existing image parts, and maintain a sequential partname counter to prevent partname collisions. Update InlineImage to call tpl._get_or_add_image_part (which returns (image_part, image)) instead of package.get_or_add_image_part, and use the returned Image object. This improves performance and reduces redundant SHA1 work when inserting/looking up images.
Replace the SHA1-based image-part index with a descriptor-keyed cache (_image_descriptor_index) to deduplicate images by file-path (O(1)) and avoid expensive SHA1 hashing. For string path descriptors the cache is used to return existing (image_part, image) tuples; non-string descriptors (e.g. file-like objects) fall back to always creating a new part. Keeps sequential partname assignment and appends new ImagePart to the package; caches the result for string descriptors. This improves performance when adding many images (e.g. large photos) by eliminating repeated SHA1 computation.
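The descriptor-keyed lookup can be sketched roughly as follows. ImagePartIndex and its loader callable are hypothetical names used for illustration only; in the PR the logic lives on DocxTemplate and creates real ImagePart objects.

```python
class ImagePartIndex:
    """Hypothetical stand-in for a descriptor-keyed image part cache."""

    def __init__(self, loader):
        self._loader = loader        # callable that builds a part from a descriptor
        self._by_descriptor = {}     # str path -> cached result
        self.loads = 0               # counts how often the expensive path runs

    def get_or_add(self, descriptor):
        # Only string paths are used as cache keys.
        if isinstance(descriptor, str):
            cached = self._by_descriptor.get(descriptor)
            if cached is not None:
                return cached
        # Non-string descriptors (e.g. file-like objects) always create a new part.
        self.loads += 1
        result = self._loader(descriptor)
        if isinstance(descriptor, str):
            self._by_descriptor[descriptor] = result
        return result
```

One trade-off of keying on the path rather than on content: two different paths that point at identical bytes are stored as two parts in this sketch.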
Cache only the expensive image metadata (rId, dimensions, filename) per (part, descriptor, width, height) instead of the full inline XML. A fresh shape_id is now assigned for every insertion so drawing IDs remain unique (important for headers/footers/footnotes which aren't renumbered by fix_docpr_ids()). This preserves performance benefits (avoids repeated image part lookup, hashing and header parsing) while preventing duplicate drawing IDs; cx/cy are stored as ints and filename is xml-escaped when cached.
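A small sketch of caching metadata while still issuing a fresh shape id on every insertion. The function name, the load_meta callable, and the starting counter value are assumptions made for this example; they are not taken from the PR's code.

```python
import itertools

_shape_ids = itertools.count(1000)   # assumed starting value for the sketch
_meta_cache = {}                     # (part, descriptor, width, height) -> metadata


def insert_image(part_key, descriptor, width, height, load_meta):
    """Reuse cached metadata, but never reuse a drawing id."""
    key = (part_key, descriptor, width, height)
    meta = _meta_cache.get(key)
    if meta is None:
        # Expensive path: part lookup, header parsing, dimension scaling.
        meta = load_meta(descriptor, width, height)
        _meta_cache[key] = meta
    # The shape id is deliberately outside the cache so ids stay unique.
    return {"shape_id": next(_shape_ids), **meta}
```

Keeping the id out of the cached value is what lets the same image appear many times without producing duplicate drawing ids.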
Use id() for non-hashable image descriptors (e.g. file-like objects) when building the image cache key to avoid TypeError on dict lookup. Also escape double quotes in image filenames for XML attribute usage by passing a mapping to xml_escape so quotes become &quot;. Cache semantics and per-insertion shape_id assignment are otherwise unchanged.
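For reference, the quote-escaping behaviour can be reproduced with the standard library alone. This sketch assumes xml_escape refers to xml.sax.saxutils.escape; escape_attr is an illustrative helper name.

```python
from xml.sax.saxutils import escape as xml_escape


def escape_attr(value):
    """Escape &, <, > and additionally double quotes for attribute values."""
    return xml_escape(value, {'"': "&quot;"})


print(escape_attr('my "best" <photo> & more.png'))
# my &quot;best&quot; &lt;photo&gt; &amp; more.png
```

Without the extra mapping, escape() leaves double quotes untouched, which would terminate a double-quoted attribute early.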
Avoid using len() of image parts to pick the next image partname index, which could collide when numbering is non-contiguous. Instead scan existing image partnames (using partname.baseURI when available, otherwise str(partname)), extract numeric suffixes with a regex (/image(\d+)\.), track the maximum index, and set the image part counter to that max. This ensures new image partnames won't reuse an already-present index.
Replace conditional use of partname.baseURI with a direct str(partname) conversion when iterating image parts. This makes the code rely on a consistent string representation for part names (used by the /imageN.ext regex) and avoids depending on the presence of a baseURI attribute across different part implementations.
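The partname scan described in the two commits above might look roughly like this. The function name is hypothetical; the regex is the /image(\d+)\. pattern mentioned in the commit message.

```python
import re

_IMAGE_INDEX_RE = re.compile(r"/image(\d+)\.")


def max_image_index(partnames):
    """Return the highest N found in names like /word/media/imageN.ext."""
    highest = 0
    for name in partnames:
        match = _IMAGE_INDEX_RE.search(str(name))
        if match:
            highest = max(highest, int(match.group(1)))
    return highest
```

With existing parts image1.png and image3.png, a len()-based scheme would pick index 3 next and collide, whereas starting from the maximum yields 4 once the counter is incremented.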
Replace the hardcoded docx_ids_index initialization with a routine that scans all package parts (body, headers, footers, footnotes) for wp:docPr elements and sets the counter above the maximum found id (minimum 1000). This prevents id collisions when inserting new drawings into parts that were not renumbered by fix_docpr_ids. The new method is called during initialization and safely skips non-XML or unreadable parts.
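A standalone sketch of the wp:docPr scan, operating on raw XML blobs rather than on python-docx parts. The function name, the floor parameter, and the reading of "above the maximum (minimum 1000)" as max(1000, highest + 1) are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML drawing namespace bound to the wp: prefix.
WP_NS = "http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"


def next_docpr_id(xml_blobs, floor=1000):
    """Return a starting id above every wp:docPr id found in the given blobs."""
    highest = 0
    for blob in xml_blobs:
        try:
            root = ET.fromstring(blob)
        except (ET.ParseError, TypeError, ValueError):
            continue  # skip non-XML or unreadable parts
        for element in root.iter("{%s}docPr" % WP_NS):
            try:
                highest = max(highest, int(element.get("id", "0")))
            except ValueError:
                pass  # ignore malformed id attributes
    return max(floor, highest + 1)
```

Scanning every part, not just the body, matters because headers, footers, and footnotes keep their original ids.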
Treat image.filename == None (e.g., BytesIO/file-like descriptors) as an empty string before calling xml_escape so XML attribute generation matches python-docx behavior. Added a clarifying comment and ensure the escaped filename is stored in the cache to avoid None-related issues when rendering.
Only build and use a cache key when the image_descriptor is hashable. Previously id() was used for non-hashable descriptors (e.g. file-like objects), which could risk aliasing after GC and lead to incorrect deduplication. Now the code attempts to construct a cache key with the descriptor and falls back to skipping caching for unhashable descriptors; cache entries are only read/written when a valid cache_key exists. Filename normalization and per-insertion shape_id behavior are unchanged.
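The hashability check can be sketched as below. make_cache_key is a hypothetical helper name, and bytearray is used only as a convenient unhashable stand-in; which descriptors are unhashable in practice depends on the object passed in.

```python
def make_cache_key(part_key, descriptor, width, height):
    """Return a dict-usable key, or None when the descriptor is unhashable."""
    key = (part_key, descriptor, width, height)
    try:
        hash(key)  # a tuple is hashable only if all of its items are
    except TypeError:
        return None
    return key


key = make_cache_key("body", bytearray(b"raw image bytes"), None, None)
if key is None:
    pass  # caller skips the cache entirely and processes the image directly
```

Skipping the cache costs some speed for such descriptors but avoids the id() aliasing risk, where a freed object's id is reused by an unrelated one.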
Summary
This pull request introduces substantial performance optimizations for inline image handling within the docxtpl library.
The changes focus on reducing redundant XML generation, file I/O, hashing, and image processing during template rendering. Together, these improvements dramatically reduce rendering times for image-heavy documents.
In a real-world example containing approximately 850 images, rendering time was reduced from 45–50 seconds to approximately 2–3 seconds.
Key Improvements
Inline Image XML Generation Optimizations
- A CT_Inline-based XML template (_INLINE_IMAGE_XML) generated once at module load time.
- str.format() operations fill the template instead of repeatedly invoking CT_Inline.new_pic_inline().
Inline Image Caching
- InlineImage._insert_image() updated to cache generated image XML and related processing.
This prevents repeated file I/O, SHA1 hashing, and image header parsing for images reused throughout a document.
Internal Image Part Deduplication
Fast Image Lookup & Reuse
Added:
- _image_cache
- _init_image_parts_index()
- _get_or_add_image_part()
to support fast, O(1) image deduplication and retrieval.
Improvements over Default python-docx Behaviour
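The new implementation bypasses the default python-docx image deduplication mechanism, which relies heavily on content hashing and repeated package inspection.
Instead, as the commits above describe, image parts for string path descriptors are looked up in a descriptor-keyed cache, and new parts receive sequentially assigned partnames.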
This significantly improves rendering performance for templates containing many images.
Reduced File I/O and Processing Overhead
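The _get_or_add_image_part() implementation ensures that an image referenced by the same string path is added to the package only once; later insertions reuse the cached (image_part, image) tuple instead of re-reading and re-hashing the file.
This results in substantially lower CPU and I/O overhead during rendering.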
Real-World Performance Impact
These optimizations provide major performance improvements for image-heavy templates while preserving existing rendering behaviour and compatibility.