Add Claude Code and Gemini transcript support for context importer #159
Ankit-Kotnala wants to merge 2 commits into XortexAI:main from
Conversation
@ishaanxgupta Please review this.
Code Review
This pull request centralizes transcript parsing logic into a new utility module, src/utils/transcripts.py, replacing duplicate implementations in server.py and src/api/routes/memory.py. The new shared parser adds support for Claude Code (JSONL), Gemini CLI, and Claude-style markdown exports while implementing logic to filter out tool calls and thinking blocks. Feedback identified several instances where consecutive user messages would be overwritten rather than concatenated, leading to potential data loss. Additionally, a bug was found in the text cleaning utility that would incorrectly strip markdown list markers.
```python
def _clean_text(text: str) -> str:
    return text.strip().strip("-").strip()
```
The `strip("-")` call in `_clean_text` is problematic for Markdown content. It will remove leading bullet points from list items (e.g., `- Item` becomes `Item`) and can strip horizontal rules or other intentional formatting. It should be removed to preserve the integrity of the message content.
Suggested change:

```diff
 def _clean_text(text: str) -> str:
-    return text.strip().strip("-").strip()
+    return text.strip()
```
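To make the bug concrete, here is a small reproduction (the `_buggy`/`_fixed` function names are illustrative, not from the PR):

```python
def clean_text_buggy(text: str) -> str:
    # Original behavior: also strips leading/trailing hyphens,
    # which mangles Markdown list markers and horizontal rules.
    return text.strip().strip("-").strip()

def clean_text_fixed(text: str) -> str:
    # Suggested fix: whitespace-only strip preserves Markdown syntax.
    return text.strip()

print(clean_text_buggy("- Item"))  # "Item" -- bullet marker lost
print(clean_text_buggy("---"))     # ""     -- horizontal rule erased
print(clean_text_fixed("- Item"))  # "- Item"
```

Note that `str.strip("-")` treats its argument as a set of characters to remove, not a prefix, which is why the entire `---` rule disappears.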
```python
if role in _USER_ROLES:
    flush_pair()
    current_user_query = text
elif current_user_query:
    assistant_chunks.append(text)
```
This logic discards previous user messages if multiple user turns occur consecutively without an intervening assistant response. In many chat transcripts (especially from CLI tools or when users send multiple fragments), it's better to concatenate consecutive user messages to ensure no context is lost during import.
Suggested change:

```diff
 if role in _USER_ROLES:
-    flush_pair()
-    current_user_query = text
+    if assistant_chunks:
+        flush_pair()
+        current_user_query = text
+    else:
+        current_user_query = (current_user_query + "\n\n" + text) if current_user_query else text
 elif current_user_query:
     assistant_chunks.append(text)
```
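The suggested pairing logic can be sketched as a self-contained function (the role set and helper structure here are illustrative stand-ins for the PR's `_USER_ROLES` and `flush_pair`, not its actual code):

```python
USER_ROLES = {"user", "human"}  # illustrative stand-in for _USER_ROLES

def pair_turns(messages):
    """Fold a flat list of (role, text) tuples into (user_query, assistant_reply) pairs."""
    pairs = []
    current_user_query = ""
    assistant_chunks = []

    def flush_pair():
        nonlocal current_user_query, assistant_chunks
        if current_user_query and assistant_chunks:
            pairs.append((current_user_query, "\n\n".join(assistant_chunks)))
        current_user_query = ""
        assistant_chunks = []

    for role, text in messages:
        if role in USER_ROLES:
            if assistant_chunks:
                # Previous pair is complete; flush it and start a new query.
                flush_pair()
                current_user_query = text
            else:
                # Consecutive user turns: concatenate instead of overwrite.
                current_user_query = (
                    current_user_query + "\n\n" + text if current_user_query else text
                )
        elif current_user_query:
            assistant_chunks.append(text)

    flush_pair()
    return pairs
```

With this shape, `[("user", "a"), ("user", "b"), ("assistant", "c")]` yields one pair whose query is `"a\n\nb"`, so the first fragment is no longer lost.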
```python
if role in _USER_ROLES:
    flush_pair()
    current_user_query = content
elif role in assistant_roles and current_user_query:
    assistant_chunks.append(content)
```
Similar to the JSON parser, this role-heading parser also overwrites the current_user_query if multiple user headings are encountered before an assistant response. Concatenating them would prevent data loss.
Suggested change:

```diff
 if role in _USER_ROLES:
-    flush_pair()
-    current_user_query = content
+    if assistant_chunks:
+        flush_pair()
+        current_user_query = content
+    else:
+        current_user_query = (current_user_query + "\n\n" + content) if current_user_query else content
 elif role in assistant_roles and current_user_query:
     assistant_chunks.append(content)
```
```python
if section.startswith("**User**"):
    current_user_query = section.replace("**User**", "", 1).strip()
```
In the Cursor transcript parser, consecutive user sections will result in the earlier sections being lost. Consider concatenating them to preserve all user input.
```python
if section.startswith("**User**"):
    content = section.replace("**User**", "", 1).strip()
    if current_user_query:
        current_user_query += "\n\n" + content
    else:
        current_user_query = content
```
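The same accumulation pattern can be exercised end to end with a minimal section parser (a sketch under assumptions: the `parse_cursor_sections` name, the `**Assistant**` heading, and the flat list-of-sections input are illustrative, not the PR's actual interface):

```python
def parse_cursor_sections(sections):
    """Pair **User** sections with the following **Assistant** section,
    accumulating consecutive user sections instead of overwriting them."""
    pairs = []
    current_user_query = ""
    for section in sections:
        if section.startswith("**User**"):
            content = section.replace("**User**", "", 1).strip()
            # Concatenate consecutive user sections to avoid data loss.
            current_user_query = (
                current_user_query + "\n\n" + content if current_user_query else content
            )
        elif section.startswith("**Assistant**") and current_user_query:
            reply = section.replace("**Assistant**", "", 1).strip()
            pairs.append((current_user_query, reply))
            current_user_query = ""
    return pairs
```

Two back-to-back user sections followed by one assistant section now produce a single pair containing both user fragments.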
Summary
Fixes #155.
Adds deterministic transcript parsing support for additional `/context upload` formats:
- `/chat share` JSON exports
- `/chat share` Markdown exports

Also keeps existing Cursor and Antigravity behavior by moving transcript parsing into a shared helper used by both the production memory route and the legacy server entrypoint.
Changes
- Added `src/utils/transcripts.py` as the shared transcript parser module.
- Updated `/v1/memory/parse_transcript` to use the shared parser.
- Updated the `server.py` parsing wrapper to use the same shared parser.