Summary #
LangFuse is the recommended solution for capturing LLM traces and enabling agent tuning in this project. It provides a lightweight TypeScript SDK that integrates cleanly with @google/genai without requiring framework lock-in.
Its standout feature for "agent tuning" is the Datasets workflow: you can tag specific high-quality traces (or user-rated interactions) and export them directly as JSONL/CSV formatted for Gemini fine-tuning. It has gained significant popularity (2.5k+ GitHub stars) and offers a "store everything" approach that is critical for debugging complex agentic loops[1].
Philosophy & Mental Model #
Treat observability as your dataset pipeline. In traditional software, logs are for debugging errors. In AI engineering, logs (traces) are the source code for your next model iteration.
- Trace: The root object representing a single execution (e.g., a user request).
- Span: A unit of work within a trace (e.g., "retrieve_context", "execute_tool").
- Generation: A specialized span for LLM calls that captures token counts, model names, and prompt/completion pairs.
- Score: A quality metric attached to a trace (e.g., a user thumbs-up/down or a model-based eval), crucial for filtering data for tuning[2]; see the sketch below.
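A minimal sketch of how these objects relate, assuming the standard `langfuse` JS SDK client (the span name and score values are illustrative); generations appear in Pattern 1 below:

```typescript
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

// Trace: the root object for one user request
const trace = langfuse.trace({ name: "demo-request", input: { prompt: "hi" } });

// Span: a unit of non-LLM work inside the trace
const span = trace.span({ name: "retrieve_context" });
span.end({ output: { docsRetrieved: 3 } });

// Score: attach a quality signal, later used to filter traces for tuning
trace.score({ name: "user-feedback", value: 1, comment: "thumbs up" });

await langfuse.flushAsync();
```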
Setup #
Install the SDK:
```bash
pnpm add langfuse
```
Configure environment variables (get these from cloud.langfuse.com or your self-hosted instance):
```bash
# .env
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com # or your host
```
Core Usage Patterns #
Pattern 1: Manual Instrumentation (Recommended) #
Since @google/genai is a newer SDK, manual instrumentation provides the most reliable data capture.
```typescript
import { Langfuse } from "langfuse";
import { GoogleGenAI } from "@google/genai";

const langfuse = new Langfuse();
const genai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

async function runAgent(prompt: string) {
  // 1. Start a trace
  const trace = langfuse.trace({
    name: "cli-agent-run",
    input: { prompt },
    metadata: { env: "dev" }
  });

  // 2. Create a generation span for the LLM call
  const generation = trace.generation({
    name: "gemini-pro-call",
    model: "gemini-1.5-pro",
    input: prompt
  });

  try {
    const result = await genai.models.generateContent({
      model: "gemini-1.5-pro",
      contents: [{ role: "user", parts: [{ text: prompt }] }]
    });

    // In @google/genai, text and usage metadata live on the response itself
    const outputText = result.text;

    // 3. End the generation with usage stats
    generation.end({
      output: outputText,
      usage: {
        input: result.usageMetadata?.promptTokenCount,
        output: result.usageMetadata?.candidatesTokenCount
      }
    });

    // 4. Record the final output on the trace
    trace.update({ output: outputText });
    return outputText;
  } catch (error) {
    // level/statusMessage belong to observations (spans/generations), not traces
    generation.end({ level: "ERROR", statusMessage: String(error) });
    throw error;
  } finally {
    // 5. FLUSH - critical for CLI tools!
    await langfuse.shutdownAsync();
  }
}
```
Pattern 2: The "Observe" Decorator #
For simpler functions, you can wrap them to auto-create spans.
```typescript
import { observe } from "langfuse";

const cleanOutput = observe(async (text: string) => {
  return text.trim();
}, { name: "clean-output" }); // Creates a span named "clean-output"

// Usage inside a trace context.
// Note: this requires AsyncLocalStorage context propagation,
// which LangFuse handles if you use their `observe` API correctly.
```
Pattern 3: Dataset Creation for Tuning #
This is the key workflow for agent tuning. You programmatically add good examples to a dataset.
```typescript
async function markForFineTuning(traceId: string, correction: string) {
  await langfuse.createDatasetItem({
    datasetName: "gemini-tuning-v1",
    input: { role: "user", content: "..." }, // data pulled from the trace
    expectedOutput: { role: "model", content: correction },
    sourceTraceId: traceId // links the item back to the original trace
  });
}
```
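To close the loop, items can be pulled back out and serialized for tuning. A sketch assuming the JS SDK's `getDataset` helper and a simplified `contents`-style JSONL format; verify the exact schema Gemini fine-tuning expects against Google's docs:

```typescript
import { writeFileSync } from "node:fs";
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

async function exportForTuning(datasetName: string, outFile: string) {
  const dataset = await langfuse.getDataset(datasetName);

  // One JSON object per line: a user turn and the expected model turn
  const lines = dataset.items.map((item) =>
    JSON.stringify({
      contents: [
        { role: "user", parts: [{ text: JSON.stringify(item.input) }] },
        { role: "model", parts: [{ text: JSON.stringify(item.expectedOutput) }] }
      ]
    })
  );

  writeFileSync(outFile, lines.join("\n"));
}
```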
Anti-Patterns & Pitfalls #
❌ Don't: Forget to Flush in CLI #
In long-running servers, LangFuse batches events in the background. In a CLI tool (like this project), the process might exit before logs are sent.
```typescript
// ❌ WRONG
await runAgent("...");
process.exit(0); // Logs likely lost
```
✅ Instead: Always Shutdown #
```typescript
// ✅ CORRECT
await runAgent("...");
await langfuse.shutdownAsync(); // Forces flush
process.exit(0);
```
❌ Don't: Rely on Auto-Instrumentation for Bleeding Edge SDKs #
While packages like opentelemetry-instrumentation-google-genai exist, they often lag behind the official @google/genai releases. Manual generation() calls are robust and future-proof.
Caveats #
- Latency: Adding `await` to tracing calls (if not using the async background queue) can slow down the agent. The LangFuse SDK is async by default, but ensuring reliability in serverless/CLI environments requires care (see the sketch below).
- Data Privacy: If using LangFuse Cloud, you are sending prompts to their servers (US/EU). For strict enterprise data boundaries, use the Docker self-hosted version.
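For short-lived CLI processes, the batching behavior can be tightened at construction time. A sketch, assuming the JS SDK's standard client options:

```typescript
import { Langfuse } from "langfuse";

// Flush each event immediately rather than batching - trades a little
// latency for reliability in short-lived CLI runs
const langfuse = new Langfuse({
  flushAt: 1,          // batch size before a flush is triggered
  flushInterval: 1000  // ms between background flushes
});
```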
References #
[1] LangFuse Documentation - Core concepts and SDK reference
[2] LangFuse Datasets - Guide to using traces for evaluation and fine-tuning
[3] Gemini SDK Instrumentation - Specific examples for Google models