Prompt Caching is an operational lever worth checking before you reach for "switch models to make it faster." Especially on teams with a lot of repetitive work, how reliably the shared leading portion of the prompt (the prefix) is reused directly affects latency and input cost.

According to the official OpenAI guide, caching works automatically (when the conditions are met), but the real performance gain grows only when you manage prefix design, key strategy, and a measurement loop together.

Note: This document was drafted with generative AI, then reviewed and corrected by a human against publicly available official materials.

Sources for this post

Key takeaways

  • Cache hits occur only on an exactly matching prefix.
  • Caching only starts to pay off for requests of 1024 tokens or more.
  • The core operating axes are a fixed prefix, separated prompt_cache_key values, and measurement based on cached_tokens.
  • If traffic concentrates on a single key/prefix combination (the guide mentions roughly 15 requests per minute), cache efficiency can drop.
flowchart LR
    A["Request input"] --> B["Align fixed prefix"]
    B --> C["Apply prompt_cache_key"]
    C --> D["Choose retention policy\n(in_memory / 24h)"]
    D --> E["Measure cache hits\n(cached_tokens)"]
    E --> F["Tune template/key"]
    F --> B

Visualization (Excalidraw)

Agent/OpenAI 비즈니스 활용/images/openai-biz-15-cache-routing.png

  • Source file for editing: Agent/OpenAI 비즈니스 활용/images/openai-biz-15-cache-routing.excalidraw

🧠 Whiteboard cheat sheet

  • Fixed rules, examples, and tools go first; per-user variable values go last
  • Separate prompt_cache_key per workflow
  • If you do not track cached_tokens as a weekly metric, you are running on gut feeling, not optimizing
  • Treat the cache retention policy as a data-policy decision as well as a performance one

Five things you must know, per the official documentation

1) Caching is automatic, but performance is not

As the official guide states, Prompt Caching works automatically with no additional charge. But "automatic" does not mean the optimization is automatic too. When the prefix shifts, the cache hit rate drops right away.

2) A cache hit requires an exact prefix match

These are the points people most often miss.

  • Even with the same meaning, a different sentence order can be treated as a different prefix
  • If tools, the structured output schema, or image detail change, the hit rate drops
  • Putting variable values such as user identifiers or dates at the front causes a miss on nearly every request

Practical rule:

  • Front (fixed): system rules, output format, tools, schema, shared examples
  • Back (variable): user_id, date, request payload

3) prompt_cache_key is a routing lever

According to the official documentation, prompt_cache_key is a variable that influences routing together with the prefix hash. Using a consistent key within the same workflow helps raise cache efficiency.

Recommended naming:

  • ops_followup_v1
  • ops_weekly_report_v1
  • ops_helpdesk_v1

4) Choose the retention policy to fit the operational goal

In the official guide the default is in_memory, and you can specify 24h extended retention when needed.

  • Default (in_memory): suits repeated requests at short intervals
  • 24h: suits workloads with many requests that repeat on a daily cycle

However, the choice must be weighed together with the project's data policy (for example, a ZDR operating context).
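
As a rough illustration of that decision, the sketch below picks a retention value from request spacing and a policy flag. Everything here is an assumption for illustration: `choose_retention` is a hypothetical helper, `allow_extended` stands in for the outcome of your own data-policy review, and the 5-minute default is a placeholder to tune from measurements, not a documented limit.

```python
def choose_retention(
    typical_gap_minutes: float,
    allow_extended: bool,
    in_memory_window_minutes: float = 5.0,  # assumed placeholder, tune from your data
) -> str:
    """Pick a prompt_cache_retention value from request spacing and policy."""
    if not allow_extended:
        # Data policy rules out extended retention regardless of traffic shape.
        return "in_memory"
    if typical_gap_minutes <= in_memory_window_minutes:
        # Requests repeat quickly enough that the default is likely sufficient.
        return "in_memory"
    return "24h"


print(choose_retention(2, allow_extended=True))    # in_memory
print(choose_retention(90, allow_extended=True))   # 24h
print(choose_retention(90, allow_extended=False))  # in_memory
```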

5) Settle the final call with cached_tokens

Caching operations should be settled by numbers, not by feel.

  • usage.prompt_tokens_details.cached_tokens (Chat Completions; the Responses API reports usage.input_tokens_details.cached_tokens)
  • p95 latency
  • input cost

In particular, requests under 1024 tokens can report cached_tokens=0, so look at the request size distribution as well to avoid misjudging the results.
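
To keep small requests from distorting the picture, the cached-token ratio can be computed both over all requests and over only those at or above the 1024-token threshold. This is a minimal sketch; the record layout (`prompt_tokens`, `cached_tokens`) and the `cached_ratio` helper are assumptions about how you log usage, not an API response format.

```python
MIN_CACHEABLE_TOKENS = 1024  # threshold stated in the text


def cached_ratio(records: list[dict]) -> dict:
    """Compute the cached-token ratio over all requests and over eligible ones."""
    eligible = [r for r in records if r["prompt_tokens"] >= MIN_CACHEABLE_TOKENS]

    def ratio(rows: list[dict]) -> float:
        total = sum(r["prompt_tokens"] for r in rows)
        cached = sum(r["cached_tokens"] for r in rows)
        return round(cached / total, 4) if total else 0.0

    return {
        "overall": ratio(records),
        "eligible_only": ratio(eligible),
        # Share of requests large enough to be cached at all.
        "eligible_share": round(len(eligible) / len(records), 4) if records else 0.0,
    }


records = [
    {"prompt_tokens": 2048, "cached_tokens": 1024},
    {"prompt_tokens": 512, "cached_tokens": 0},
    {"prompt_tokens": 1536, "cached_tokens": 1024},
]
print(cached_ratio(records))
# {'overall': 0.5, 'eligible_only': 0.5714, 'eligible_share': 0.6667}
```

A low overall ratio paired with a low eligible share points at request size rather than prefix design.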

Copy-and-paste code

1) Basic Responses API example (Python)

from openai import OpenAI

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()

response = client.responses.create(
    model="gpt-5.1",
    input=[
        {
            "role": "system",
            "content": "You are an operations automation assistant. Always respond in JSON."
        },
        {
            "role": "user",
            "content": "Draft three follow-up emails for this week."
        }
    ],
    prompt_cache_key="ops_followup_v1",
    prompt_cache_retention="in_memory"  # or "24h"
)

# The Responses API reports cached tokens under usage.input_tokens_details
# (Chat Completions uses usage.prompt_tokens_details instead).
# This short prompt is under 1024 tokens, so expect 0 here.
usage = getattr(response, "usage", None)
input_details = getattr(usage, "input_tokens_details", None)
cached_tokens = getattr(input_details, "cached_tokens", 0)
print("cached_tokens:", cached_tokens)

2) Fixed-prefix template pattern

[System rules - fixed]
[Output schema - fixed]
[Tool definitions - fixed]
[Few-shot examples - fixed]

[Dynamic block - variable]
- tenant_id:
- user_id:
- date:
- request_payload:
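
To notice when the fixed blocks in this template drift between deployments, one option is to log a fingerprint of them alongside each request. The sketch below is a local diagnostic only: `prefix_fingerprint` is a hypothetical helper and says nothing about how the provider hashes prefixes internally.

```python
import hashlib


def prefix_fingerprint(*fixed_blocks: str) -> str:
    """Return a short SHA-256 fingerprint of the fixed blocks, in order."""
    digest = hashlib.sha256()
    for block in fixed_blocks:
        digest.update(block.encode("utf-8"))
        digest.update(b"\x00")  # separator so block boundaries affect the hash
    return digest.hexdigest()[:12]


rules = "Answer in JSON."
schema = '{"subject": "string", "body": "string"}'

v1 = prefix_fingerprint(rules, schema)
v2 = prefix_fingerprint(rules, schema)
v3 = prefix_fingerprint(rules + " ", schema)  # a single extra space

# Identical blocks give the same fingerprint; any edit changes it.
print(v1 == v2, v1 == v3)  # True False
```

If the fingerprint changes in the logs while cached_tokens falls, the template edit is the first thing to check.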

3) Weekly cost/latency check code (conceptual example)

import math

# logs: [{cached_tokens, prompt_tokens, latency_ms, req_count}, ...]

def summarize(logs):
    """Summarize one week of request logs."""
    if not logs:
        return {"cached_ratio": 0.0, "p95_latency_ms": None, "requests": 0}

    total_prompt = sum(x["prompt_tokens"] for x in logs)
    total_cached = sum(x["cached_tokens"] for x in logs)
    hit_ratio = (total_cached / total_prompt) if total_prompt else 0

    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    latencies = sorted(x["latency_ms"] for x in logs)
    p95_latency = latencies[math.ceil(len(latencies) * 0.95) - 1]
    return {
        "cached_ratio": round(hit_ratio, 4),
        "p95_latency_ms": p95_latency,
        "requests": sum(x["req_count"] for x in logs),
    }

Three mini cases

Case A) Follow-up automation (success)

At first, customer and contact names sat at the top of the prompt, so the cache barely hit. After moving the shared rules and examples to the front and the customer-specific values to the back, cached_tokens rose noticeably.

Case B) Weekly report (failure → recovery)

All traffic was funneled through a single prompt_cache_key, and latency variance grew during traffic spikes. After splitting keys per workflow and separating template versions, latency variance stabilized.

Case C) Internal QA bot (quality improvement)

The FAQ rules and output format changed on every request, so the response tone wavered. After fixing the prefix and using the key consistently, response consistency and cost predictability rose together.

30-minute adoption routine

  1. 10 min: Split the current prompt into fixed and variable blocks
  2. 8 min: Finalize the prompt_cache_key naming rule per workflow
  3. 7 min: Require cached_tokens, prompt_tokens, and latency fields in logs
  4. 5 min: Schedule one experiment for next week (template alignment or key split)

Done criteria:

  • cached_tokens ratio goes up
  • p95 latency goes down
  • Monthly input-cost forecast error shrinks

Adoption checklist

  • System rules, examples, and tool schemas are fixed at the front of the prefix
  • Per-user variable values are moved into a trailing block
  • prompt_cache_key is separated per workflow
  • The prompt_cache_retention policy is chosen to fit the operational goal
  • cached_tokens, p95, and input cost are tracked in a weekly report

Next reading