llm-in-text/backend/prompt.py
ydy0615 03bb21d5c6 feat(copilot): enhance OCR handling with inline tags and document serializer
- Replace HTML comment OCR metadata with inline `<OCR:...>` tags
- Implement serializer-based markdown conversion for prefix/suffix content
- Add extractTextFromOCR utility function for text extraction
- Enable Table, Diagram, and ListCheck features in MilkdownEditor
- Add periodic debug logging for document state analysis
2026-02-14 23:53:26 +08:00


from typing import Tuple

# Character budgets for the context sent to the model.
MAX_PREFIX_CHARS = 12000
MAX_SUFFIX_CHARS = 4000


def _sanitize_language_id(language_id: str) -> str:
    """Reduce language_id to at most 32 safe characters; default to "markdown"."""
    if not language_id:
        return "markdown"

    # Keep only letters, digits, and "-_+." so the id is safe to embed in the prompt.
    allowed = []
    for ch in language_id.strip():
        if ch.isalnum() or ch in "-_+.":
            allowed.append(ch)

    value = "".join(allowed)[:32]
    return value or "markdown"


def _prepare_context(prefix: str, suffix: str) -> Tuple[str, str]:
    """
    Prepare prefix/suffix for model completion context.
    Keep the historical one-char lookahead behavior to reduce boundary drift.
    """
    if suffix:
        # Move the first suffix character to the end of the prefix.
        prefix = prefix + suffix[0]
        suffix = suffix[1:]
    # Keep the tail of the prefix and the head of the suffix (text nearest the cursor).
    return prefix[-MAX_PREFIX_CHARS:], suffix[:MAX_SUFFIX_CHARS]


def build_prompt(prefix: str, suffix: str, language_id: str = "markdown") -> str:
    safe_language_id = _sanitize_language_id(language_id)
    recent_prefix, recent_suffix = _prepare_context(prefix, suffix)

    prompt = f"""You are an inline completion engine for a {safe_language_id} editor with ghost-text suggestions.
Your job:
- Return ONLY the text that should be inserted at the cursor between PREFIX and SUFFIX.
- Prefer a meaningful, non-empty insertion with moderate length.
- Avoid overly short outputs with little information value.
Important context:
- PREFIX may contain OCR metadata inline after images, e.g. ![alt](url) <OCR:description>.
- The <OCR:...> is hidden context describing image content.
- Never copy, rewrite, or emit OCR tags in output.
- Never output the literal <OCR: marker or the > that closes it.
Hard rules:
1. Seamless join:
PREFIX + OUTPUT + SUFFIX must read naturally as one continuous document.
2. No suffix repetition:
Do NOT repeat text that already appears at the start of SUFFIX.
3. Balanced length:
Prefer concise but meaningful continuation, not ultra-short fragments.
Default target is 20-120 characters and 1-3 lines.
You may go shorter only when syntax requires it.
4. Avoid trivial output:
Do not output only punctuation or filler such as ".", ",", ";", ":".
Do not output just one token unless it is structurally necessary.
5. Preserve local style:
Match nearby language, tone, punctuation, spacing, and indentation.
6. Markdown awareness:
Continue active list/checkbox/ordered-list patterns when applicable.
Preserve indentation in nested list/code contexts.
Close obvious unclosed inline markdown markers only when needed to bridge.
7. Strict output format:
Output insertion text only.
No explanations, labels, quotes, or code fences.
Decision policy:
- If PREFIX already connects naturally to SUFFIX, add a brief but useful continuation when possible.
- If uncertain, prefer a complete short phrase or sentence with clear meaning.
Examples:
<PREFIX>The quick brown fox </PREFIX>
<SUFFIX>jumps over the lazy dog.</SUFFIX>
Output: "moved quietly and then "
<PREFIX>## TODO\\n- [ ] Buy milk\\n- [ ] </PREFIX>
<SUFFIX></SUFFIX>
Output: "Write release notes and share draft with team"
Now produce the insertion.
<PREFIX>
{recent_prefix}
</PREFIX>
<SUFFIX>
{recent_suffix}
</SUFFIX>
Output:"""
    return prompt.strip()