Add Claude Code and Gemini transcript support for context importer. #160

Open
Ankit-Kotnala wants to merge 2 commits into XortexAI:main from Ankit-Kotnala:fix/issue-155-context-transcripts

Conversation

@Ankit-Kotnala (Contributor)

Summary

Fixes #155.

Adds deterministic transcript parsing support for additional /context upload formats:

  • Claude Code JSONL session transcripts
  • Claude/Claude Code role-heading exports
  • Gemini CLI /chat share JSON exports
  • Gemini CLI /chat share Markdown exports

Also keeps existing Cursor and Antigravity behavior by moving transcript parsing into a shared helper used by both the production memory route and the legacy server entrypoint.
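
To illustrate the new JSONL path, here is a minimal sketch of pairing user and assistant turns from a Claude Code-style JSONL session transcript. The record shape assumed below (`type`, `message.content`, `text` vs. tool/thinking blocks) is an illustrative assumption, not the PR's exact schema:

```python
import json

def parse_jsonl_transcript(text: str) -> list[dict]:
    """Sketch: pair user/assistant turns from a JSONL transcript.

    Assumes each line is a JSON record with a "type" field ("user" or
    "assistant") and a "message" payload whose "content" is either a plain
    string or a list of typed blocks. Tool and thinking blocks are dropped.
    """
    pairs: list[dict] = []
    current_user: str | None = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines deterministically
        role = record.get("type")
        content = record.get("message", {}).get("content", "")
        if isinstance(content, list):
            # Keep only plain text blocks; drop tool_use/tool_result/thinking.
            content = "\n".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            ).strip()
        if not content:
            continue
        if role == "user":
            current_user = content
        elif role == "assistant" and current_user:
            pairs.append({"user_query": current_user, "agent_response": content})
            current_user = None
    return pairs
```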

Changes

  • Added src/utils/transcripts.py as the shared transcript parser module.
  • Updated /v1/memory/parse_transcript to use the shared parser.
  • Updated legacy server.py parsing wrapper to use the same shared parser.
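
The shared-helper move above can be pictured as a single dispatch entry point that both the production route and the legacy server call. The probe heuristics and function names below are assumptions for illustration, not the PR's actual code:

```python
# Sketch of a shared dispatcher both callers could use. The format probes and
# parser names are illustrative assumptions, not the PR's implementation.

def _parse_json_transcript(text: str) -> list[dict]:
    return []  # placeholder: real parser lives in src/utils/transcripts.py

def _parse_antigravity_transcript(text: str) -> list[dict]:
    return []  # placeholder for illustration

def _parse_cursor_transcript(text: str) -> list[dict]:
    return []  # placeholder for illustration

def detect_format(text: str) -> str:
    """Route a transcript to a parser using cheap content probes."""
    stripped = text.lstrip()
    if stripped.startswith(("{", "[")):
        return "json"          # JSON / JSONL exports (Claude Code, Gemini share)
    if "### User Input" in text:
        return "antigravity"   # Antigravity role headings
    return "cursor"            # default: Cursor-style markdown

def parse_transcript(text: str) -> list[dict]:
    parser = {
        "json": _parse_json_transcript,
        "antigravity": _parse_antigravity_transcript,
        "cursor": _parse_cursor_transcript,
    }[detect_format(text)]
    return parser(text)
```

Keeping the probes inside one module means both entrypoints stay in sync automatically when a new format is added.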


@gemini-code-assist (Bot) left a comment

Code Review

This pull request centralizes transcript parsing logic into a new utility module, src/utils/transcripts.py, and removes redundant implementations from server.py and src/api/routes/memory.py. The new shared parser adds support for Claude Code (JSONL) and Gemini formats while improving the filtering of non-message content like tool calls and thinking blocks. Feedback includes suggestions to improve the Cursor parser by accumulating multiple assistant blocks, removing a redundant check in the JSON record pairing logic to better handle mixed messages, and ensuring tool markdown is stripped from Antigravity transcripts for consistency.

Comment thread: src/utils/transcripts.py
Comment on lines +83 to +117
```python
def _parse_cursor_transcript(text: str) -> list[ParsedMessagePair]:
    """Parse a Cursor-exported markdown transcript into message pairs."""
    pairs: list[ParsedMessagePair] = []
    sections = text.split("---")

    start_idx = 0
    if sections and "Exported on" in sections[0]:
        start_idx = 1

    current_user_query: str | None = None

    for section in sections[start_idx:]:
        section = section.strip()
        if not section:
            continue

        if section.startswith("**User**"):
            content = section.replace("**User**", "", 1).strip()
            current_user_query = _append_user_text(current_user_query, content)
        elif section.startswith("**Cursor**") or section.startswith("**Assistant**"):
            content = (
                section.replace("**Cursor**", "", 1)
                .replace("**Assistant**", "", 1)
                .strip()
            )
            if current_user_query:
                pairs.append(
                    {
                        "user_query": current_user_query,
                        "agent_response": content,
                    }
                )
                current_user_query = None

    return pairs
```

Severity: medium

The Cursor transcript parser currently only captures the first assistant block per turn. For consistency with the Antigravity and JSON parsers, it should accumulate multiple assistant blocks (if they exist) and flush them as a single response when a new user turn begins.

```python
def _parse_cursor_transcript(text: str) -> list[ParsedMessagePair]:
    """Parse a Cursor-exported markdown transcript into message pairs."""
    pairs: list[ParsedMessagePair] = []
    sections = text.split("---")

    start_idx = 1 if sections and "Exported on" in sections[0] else 0
    current_user_query: str | None = None
    assistant_chunks: list[str] = []

    def flush_pair() -> None:
        nonlocal current_user_query, assistant_chunks
        if current_user_query and assistant_chunks:
            pairs.append(
                {
                    "user_query": current_user_query,
                    "agent_response": "\n\n".join(assistant_chunks).strip(),
                }
            )
        assistant_chunks = []

    for section in sections[start_idx:]:
        section = section.strip()
        if not section:
            continue

        if section.startswith("**User**"):
            if assistant_chunks:
                flush_pair()
                current_user_query = None
            content = section.replace("**User**", "", 1).strip()
            current_user_query = _append_user_text(current_user_query, content)
        elif section.startswith("**Cursor**") or section.startswith("**Assistant**"):
            content = (
                section.replace("**Cursor**", "", 1)
                .replace("**Assistant**", "", 1)
                .strip()
            )
            if current_user_query:
                assistant_chunks.append(content)

    flush_pair()
    return pairs
```

Comment thread: src/utils/transcripts.py
Comment on lines +276 to +277
```python
if role in _USER_ROLES and _record_has_tool_result(record):
    continue
```

Severity: medium

This check is redundant because _record_text (via _extract_text) already filters out tool-related blocks. If a message consists entirely of tool results, text will be empty and the record will be skipped at line 280. Removing this allows mixed messages (text + tool results) to still contribute their text content to the transcript.
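
The redundancy the reviewer describes can be illustrated with a hypothetical text extractor that already filters tool blocks; the block shapes and the `extract_text` name are assumptions for illustration, not the PR's actual helpers:

```python
def extract_text(record: dict) -> str:
    """Sketch of an extractor that keeps only plain text blocks,
    so a separate "has tool result" guard is unnecessary."""
    blocks = record.get("message", {}).get("content", [])
    if isinstance(blocks, str):
        return blocks.strip()
    return "\n".join(
        b.get("text", "") for b in blocks if b.get("type") == "text"
    ).strip()

# A record that is *only* tool results yields empty text and gets skipped:
tool_only = {"message": {"content": [{"type": "tool_result", "content": "..."}]}}

# A mixed record still contributes its text portion to the transcript:
mixed = {"message": {"content": [
    {"type": "tool_result", "content": "..."},
    {"type": "text", "text": "Here is the summary."},
]}}
```

With the extra guard in place, the `mixed` record above would be dropped entirely even though it carries useful text, which is the behavior the review wants to avoid.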

Comment thread: src/utils/transcripts.py
Comment on lines +120 to +174
```python
def _parse_antigravity_transcript(text: str) -> list[ParsedMessagePair]:
    """Parse an Antigravity-exported markdown transcript into message pairs."""
    pairs: list[ParsedMessagePair] = []
    blocks = re.split(r"(?m)^(###\s+.+)$", text)

    current_user_query: str | None = None
    planner_chunks: list[str] = []

    for i, block in enumerate(blocks):
        block = block.strip()
        if not block:
            continue

        if re.match(r"###\s+User Input", block, re.IGNORECASE):
            if current_user_query and planner_chunks:
                pairs.append(
                    {
                        "user_query": current_user_query,
                        "agent_response": "\n\n".join(planner_chunks).strip(),
                    }
                )
                planner_chunks = []
                current_user_query = None

        elif re.match(r"###\s+Planner Response", block, re.IGNORECASE):
            continue

        elif i > 0:
            prev_heading = blocks[i - 1].strip() if i >= 1 else ""
            if re.match(r"###\s+User Input", prev_heading, re.IGNORECASE):
                if current_user_query and planner_chunks:
                    pairs.append(
                        {
                            "user_query": current_user_query,
                            "agent_response": "\n\n".join(planner_chunks).strip(),
                        }
                    )
                    planner_chunks = []
                    current_user_query = block
                else:
                    current_user_query = _append_user_text(current_user_query, block)

            elif re.match(r"###\s+Planner Response", prev_heading, re.IGNORECASE):
                if block:
                    planner_chunks.append(block)

    if current_user_query and planner_chunks:
        pairs.append(
            {
                "user_query": current_user_query,
                "agent_response": "\n\n".join(planner_chunks).strip(),
            }
        )

    return pairs
```

Severity: medium

The Antigravity parser should also apply _strip_tool_markdown to its content blocks. While Antigravity exports usually separate tool steps into different Planner Response blocks, those blocks themselves might still contain the raw tool markdown that should be cleaned before ingestion.

```python
def _parse_antigravity_transcript(text: str) -> list[ParsedMessagePair]:
    """Parse an Antigravity-exported markdown transcript into message pairs."""
    pairs: list[ParsedMessagePair] = []
    blocks = re.split(r"(?m)^(###\s+.+)$", text)

    current_user_query: str | None = None
    planner_chunks: list[str] = []

    for i, block in enumerate(blocks):
        block = block.strip()
        if not block:
            continue

        if re.match(r"###\s+User Input", block, re.IGNORECASE):
            if current_user_query and planner_chunks:
                pairs.append(
                    {
                        "user_query": current_user_query,
                        "agent_response": "\n\n".join(planner_chunks).strip(),
                    }
                )
                planner_chunks = []
                current_user_query = None

        elif re.match(r"###\s+Planner Response", block, re.IGNORECASE):
            continue

        elif i > 0:
            prev_heading = blocks[i - 1].strip() if i >= 1 else ""
            content = _strip_tool_markdown(block)
            if not content:
                continue

            if re.match(r"###\s+User Input", prev_heading, re.IGNORECASE):
                if current_user_query and planner_chunks:
                    pairs.append(
                        {
                            "user_query": current_user_query,
                            "agent_response": "\n\n".join(planner_chunks).strip(),
                        }
                    )
                    planner_chunks = []
                    current_user_query = content
                else:
                    current_user_query = _append_user_text(current_user_query, content)

            elif re.match(r"###\s+Planner Response", prev_heading, re.IGNORECASE):
                planner_chunks.append(content)

    if current_user_query and planner_chunks:
        pairs.append(
            {
                "user_query": current_user_query,
                "agent_response": "\n\n".join(planner_chunks).strip(),
            }
        )

    return pairs
```



Development

Successfully merging this pull request may close these issues.

add support of gemini, claude in /context route
