This issue has been created
 
 
XWiki Platform / XWIKI-23271 Open

Cache attachment context extracted by Tika

 
 


 
Michael Hamann created this issue on 04/Jun/25 17:30
 
Summary: Cache attachment context extracted by Tika
Issue Type: Improvement
Affects Versions: 16.10.0
Assignee: Unassigned
Components: Search - Solr
Created: 04/Jun/25 17:30
Labels: performance
Priority: Major
Reporter: Michael Hamann
Description:

The Solr indexer currently extracts the textual content of attachments using Tika whenever an attachment is indexed. The same attachment is usually indexed at least twice: once on the document entry and once for a separate attachment entry. If the document has translations, it is indexed once more for every translation. Whenever a document is changed, its attachments are re-parsed even if the attachments themselves did not change. As parsing attachments with Tika can be slow and resource-consuming, we should introduce a cache for the extracted text. This cache should ideally be stored in the same store as the attachment itself. There should be a way to manually or automatically clear the cache, e.g., on Tika upgrades.
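
To illustrate the idea, here is a minimal, language-neutral sketch in Python (XWiki itself is Java, and this is not the actual indexer code). It assumes the cache is keyed by a hash of the attachment content plus an extractor version string, so that an unchanged attachment is parsed only once and an extractor upgrade naturally misses the old entries. The names `ExtractedTextCache`, `fake_extractor`, and `parser-v1` are hypothetical, and the in-memory dictionary only stands in for the attachment store mentioned above.

```python
import hashlib
from typing import Callable, Dict, Tuple


class ExtractedTextCache:
    """Hypothetical cache for text extracted from attachments."""

    def __init__(self, extractor: Callable[[bytes], str], extractor_version: str):
        # The extractor stands in for the Tika call; it is an assumption of this sketch.
        self._extractor = extractor
        self._extractor_version = extractor_version
        # In-memory stand-in for a persistent store next to the attachment.
        self._entries: Dict[Tuple[str, str], str] = {}
        # Counts real extractions so the effect of the cache is observable.
        self.extractions = 0

    def get_text(self, content: bytes) -> str:
        # Key on the content hash and the extractor version: an unchanged
        # attachment hits the cache, a new extractor version does not.
        key = (hashlib.sha256(content).hexdigest(), self._extractor_version)
        if key not in self._entries:
            self.extractions += 1
            self._entries[key] = self._extractor(content)
        return self._entries[key]

    def clear(self) -> None:
        # Manual invalidation, e.g. after an extractor upgrade.
        self._entries.clear()


def fake_extractor(content: bytes) -> str:
    # Placeholder for an expensive parse; not a real Tika call.
    return content.decode("utf-8").upper()


cache = ExtractedTextCache(fake_extractor, extractor_version="parser-v1")
attachment = b"quarterly report"

# The same attachment is requested for several index entries.
for _ in ("document entry", "attachment entry", "translation fr", "translation de"):
    text = cache.get_text(attachment)
```

With this keying, the four lookups above trigger a single extraction. Whether the real cache should be keyed by content hash, attachment version, or something else is an open design question for the issue.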