LLM Tracing & Agent Tuning: LangFuse

Summary #

LangFuse is the recommended solution for capturing LLM traces and enabling agent tuning in this project. It provides a lightweight TypeScript SDK that integrates cleanly with @google/genai without requiring framework lock-in.

Its standout feature for "agent tuning" is the Datasets workflow: you can tag specific high-quality traces (or user-rated interactions) and export them as JSONL/CSV formatted for Gemini fine-tuning. It has gained significant popularity (2.5k+ GitHub stars) and takes a "store everything" approach that is critical for debugging complex agentic loops[1].

Philosophy & Mental Model #

Treat observability as your dataset pipeline. In traditional software, logs are for debugging errors. In AI engineering, logs (traces) are the source code for your next model iteration.

Setup #

Install the SDK:

pnpm add langfuse

Configure environment variables (get these from cloud.langfuse.com or your self-hosted instance):

# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com # or your host
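
The same settings can also be passed explicitly to the constructor, which is handy in tests or multi-tenant setups. A minimal sketch; anything omitted falls back to the environment variables above:

import { Langfuse } from "langfuse";

// Explicit configuration; env vars are picked up automatically if omitted
const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
  baseUrl: "https://cloud.langfuse.com" // or your self-hosted host
});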

Core Usage Patterns #

Pattern 1: Manual Instrumentation #

Since @google/genai is a newer SDK, manual instrumentation provides the most reliable data capture.

import { Langfuse } from "langfuse";
import { GoogleGenAI } from "@google/genai";

const langfuse = new Langfuse();
const genai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function runAgent(prompt: string) {
  // 1. Start a trace
  const trace = langfuse.trace({
    name: "cli-agent-run",
    input: { prompt },
    metadata: { env: "dev" }
  });

  // 2. Create a generation span for the LLM call
  // (declared outside the try so the catch block can end it on failure)
  const generation = trace.generation({
    name: "gemini-pro-call",
    model: "gemini-1.5-pro",
    input: prompt
  });

  try {
    const result = await genai.models.generateContent({
      model: "gemini-1.5-pro",
      contents: [{ role: "user", parts: [{ text: prompt }] }]
    });

    // @google/genai exposes the text as a property, not response.text()
    const outputText = result.text ?? "";

    // 3. End generation with usage stats
    generation.end({
      output: outputText,
      usage: {
        input: result.usageMetadata?.promptTokenCount,
        output: result.usageMetadata?.candidatesTokenCount
      }
    });

    // 4. Update the trace with the final output
    trace.update({ output: outputText });
    return outputText;
  } catch (error) {
    // level/statusMessage belong to observations, not the trace itself
    generation.end({ level: "ERROR", statusMessage: String(error) });
    throw error;
  } finally {
    // 5. FLUSH - Critical for CLI tools!
    await langfuse.shutdownAsync();
  }
}
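
A CLI entrypoint calling it might look like this (a sketch; the argv handling is illustrative):

// Hypothetical entrypoint: read the prompt from the first CLI argument
const prompt = process.argv[2] ?? "Explain this repository";
runAgent(prompt)
  .then((output) => console.log(output))
  .catch(() => {
    process.exitCode = 1; // the error is already recorded on the trace
  });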

Pattern 2: The "Observe" Decorator #

For simpler functions, you can wrap them to auto-create spans.

import { observe } from "langfuse";

const cleanOutput = observe(async (text: string) => {
  return text.trim();
}, { name: "clean-output" }); // Creates a span named "clean-output"

// Usage inside a trace context.
// Note: This requires AsyncLocalStorage context propagation,
// which LangFuse handles if you use their `observe` API correctly.

Pattern 3: Dataset Creation for Tuning #

This is the key workflow for agent tuning. You programmatically add good examples to a dataset.

async function markForFineTuning(traceId: string, correction: string) {
  // Link the item back to its source trace so the example stays auditable
  await langfuse.createDatasetItem({
    datasetName: "gemini-tuning-v1",
    sourceTraceId: traceId,
    input: { role: "user", content: "..." }, // data from trace
    expectedOutput: { role: "model", content: correction }
  });
}
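
Once the dataset has enough items, you can pull it back down and serialize it for tuning. A sketch, assuming the { role, content } item shapes above and a Gemini-style contents JSONL format; exportForTuning is a hypothetical helper, so adjust the schema to whatever your tuning pipeline expects:

import { writeFileSync } from "node:fs";

async function exportForTuning(datasetName: string, outPath: string) {
  const dataset = await langfuse.getDataset(datasetName);
  const lines = dataset.items.map((item) => {
    // Assumes items were stored with { role, content } as above
    const input = item.input as { role: string; content: string };
    const expected = item.expectedOutput as { role: string; content: string };
    return JSON.stringify({
      contents: [
        { role: "user", parts: [{ text: input.content }] },
        { role: "model", parts: [{ text: expected.content }] }
      ]
    });
  });
  writeFileSync(outPath, lines.join("\n"));
}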

Anti-Patterns & Pitfalls #

❌ Don't: Forget to Flush in CLI #

In long-running servers, LangFuse batches events in the background. In a CLI tool (like this project), the process might exit before logs are sent.

// ❌ WRONG
await runAgent("...");
process.exit(0); // Logs likely lost

✅ Instead: Always Shutdown #

// ✅ CORRECT
await runAgent("...");
await langfuse.shutdownAsync(); // Forces flush
process.exit(0);

❌ Don't: Rely on Auto-Instrumentation for Bleeding Edge SDKs #

While packages like opentelemetry-instrumentation-google-genai exist, they often lag behind the official @google/genai releases. Manual `trace.generation()` calls are robust and future-proof.


References #

[1] LangFuse Documentation - Core concepts and SDK reference
[2] LangFuse Datasets - Guide to using traces for evaluation and fine-tuning
[3] Gemini SDK Instrumentation - Specific examples for Google models
