Pydantic Output Parser in LangChain
https://python.langchain.com.cn/docs/modules/model_io/output_parsers/pydantic
This content is based on LangChain’s official documentation (langchain.com.cn) and explains PydanticOutputParser, a tool that parses LLM outputs into structured Pydantic models (objects that conform to a JSON schema), in simplified terms. The source code and examples follow the original documentation.
Key Note: Large language models are imperfect abstractions! Use an LLM with sufficient capacity (e.g., OpenAI’s DaVinci) to generate valid JSON—smaller models like Curie may fail to produce correctly formatted outputs.
1. What is PydanticOutputParser?
PydanticOutputParser converts unstructured LLM responses into structured Pydantic model instances.
- Pydantic’s BaseModel acts as a “data schema”: it defines expected fields, types, and validation rules (like Python dataclasses, but with runtime type validation and coercion).
- The parser injects auto-generated format_instructions into the prompt, guiding the LLM to output JSON that matches the Pydantic model.
- It supports custom validation logic (e.g., “a joke’s setup must end with a question mark”) and complex types (e.g., lists of strings).
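The coercion and validation behavior described above can be seen with Pydantic alone, without any LLM call. The following is a minimal sketch; the Recipe model and its fields are hypothetical and do not appear in the original documentation.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class Recipe(BaseModel):  # hypothetical model, used only for illustration
    title: str = Field(description="name of the dish")
    servings: int = Field(description="number of portions")
    ingredients: List[str] = Field(description="ingredient names")


# Compatible input is coerced: the string "4" becomes the integer 4.
recipe = Recipe(title="Pancakes", servings="4", ingredients=["flour", "milk", "eggs"])
print(recipe.servings + 1)

# Incompatible input is rejected with a ValidationError.
try:
    Recipe(title="Pancakes", servings="a few", ingredients=["flour"])
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

The same mechanism is what turns the JSON produced by an LLM into typed Python values, or rejects it when a field does not fit.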
2. Step 1: Import Required Modules
The code below imports all necessary classes—exactly as in the original documentation:
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI # Included as in original import
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field, validator
from typing import List
3. Step 2: Configure the LLM
Use a capable LLM (e.g., text-davinci-003) to ensure valid JSON output. The code is identical to the original:
model_name = "text-davinci-003"
temperature = 0.0 # Fixed temperature for consistent results
model = OpenAI(model_name=model_name, temperature=temperature)
4. Example 1: Parse a Joke into a Pydantic Model
Define a Pydantic model for a joke (with custom validation) and use the parser to extract structured data.
Step 4.1: Define the Pydantic Model
# Define the desired data structure (schema)
class Joke(BaseModel):
    setup: str = Field(description="question to set up a joke")  # Joke's setup (question)
    punchline: str = Field(description="answer to resolve the joke")  # Joke's punchline

    # Custom validation: Ensure the setup ends with a question mark
    @validator("setup")
    def question_ends_with_question_mark(cls, field):
        if field[-1] != "?":
            raise ValueError("Badly formed question!")
        return field
Step 4.2: Initialize Parser and Prompt Template
joke_query = "Tell me a joke."  # User query

# Initialize parser with the Pydantic model
parser = PydanticOutputParser(pydantic_object=Joke)

# Create prompt template with auto-generated format instructions
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
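To make the role of partial_variables more concrete, here is a standard-library sketch of how a template with one pre-filled slot and one user-supplied slot could be assembled. It is not LangChain's implementation: the schema dictionary is hand-written, and the wording of format_instructions is a hypothetical stand-in for whatever parser.get_format_instructions() actually returns.

```python
import json
from functools import partial

# Hand-written stand-in for the schema derived from the Joke model (assumption).
joke_schema = {
    "properties": {
        "setup": {"type": "string", "description": "question to set up a joke"},
        "punchline": {"type": "string", "description": "answer to resolve the joke"},
    },
    "required": ["setup", "punchline"],
}

# Hypothetical wording; the real text comes from parser.get_format_instructions().
format_instructions = (
    "Return a JSON object that conforms to this JSON schema:\n" + json.dumps(joke_schema)
)

template = "Answer the user query.\n{format_instructions}\n{query}\n"

# Pre-fill format_instructions once, similar in spirit to partial_variables.
fill_prompt = partial(template.format, format_instructions=format_instructions)

final_prompt = fill_prompt(query="Tell me a joke.")
print(final_prompt)
```

Only the query changes between calls, which is why the template declares query as its single input variable.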
Step 4.3: Generate and Parse LLM Output
# Format the prompt (inject query and format instructions)
_input = prompt.format_prompt(query=joke_query)

# Get LLM response (convert prompt to string for compatibility)
output = model(_input.to_string())

# Parse LLM output into the Joke model
parsed_joke = parser.parse(output)
Parsed output (exactly as in the original):
Joke(setup='Why did the chicken cross the road?', punchline='To get to the other side!')
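The parsing step can be approximated in a few lines: find the JSON object in the raw completion, decode it, and hand the result to the Pydantic model. The sketch below is an assumption about the general approach, not a copy of the library's code; JokeSketch, parse_sketch, and the hard-coded raw_output are hypothetical names introduced for illustration.

```python
import json
import re

from pydantic import BaseModel


class JokeSketch(BaseModel):  # hypothetical stand-in for Joke, without the validator
    setup: str
    punchline: str


def parse_sketch(text: str) -> JokeSketch:
    """Pull the outermost {...} span out of raw LLM text and validate it."""
    match = re.search(r"\{.*\}", text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the LLM output")
    data = json.loads(match.group())
    return JokeSketch(**data)


# Hard-coded stand-in for an LLM completion, so no API call is needed.
raw_output = (
    'Sure! {"setup": "Why did the chicken cross the road?", '
    '"punchline": "To get to the other side!"}'
)
joke = parse_sketch(raw_output)
print(joke.setup)
print(joke.punchline)
```

Searching for the braces first makes the sketch tolerant of extra text around the JSON, which smaller models tend to produce.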
5. Example 2: Parse Compound Types (List) into a Pydantic Model
Define a model with a List field (for an actor’s filmography) to demonstrate support for complex types.
Step 5.1: Define the Pydantic Model
class Actor(BaseModel):
    name: str = Field(description="name of an actor")  # Actor's name
    film_names: List[str] = Field(description="list of names of films they starred in")  # List of films
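To see what the List[str] annotation contributes, it can help to inspect the JSON schema Pydantic derives from such a model. The sketch below uses a hypothetical copy of the model named ActorSketch and calls Pydantic directly, with no LangChain involved.

```python
from typing import List

from pydantic import BaseModel, Field


class ActorSketch(BaseModel):  # hypothetical copy of the Actor model, for illustration
    name: str = Field(description="name of an actor")
    film_names: List[str] = Field(description="list of names of films they starred in")


# .schema() is the Pydantic 1 spelling; Pydantic 2 still accepts it
# but prefers model_json_schema().
film_schema = ActorSketch.schema()["properties"]["film_names"]
print(film_schema["type"])
print(film_schema["items"])

# A JSON array decoded into a Python list passes validation and stays a list.
actor = ActorSketch(name="Tom Hanks", film_names=["Forrest Gump", "Cast Away"])
print(len(actor.film_names))
```

The array-of-strings entry in the schema is the part that signals to the LLM that a list is expected for this field.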
Step 5.2: Initialize Parser and Prompt Template
actor_query = "Generate the filmography for a random actor."  # User query

# Initialize parser with the Actor model
parser = PydanticOutputParser(pydantic_object=Actor)

# Reuse the same prompt template (inject new format instructions)
prompt = PromptTemplate(
    template="Answer the user query.\n{format_instructions}\n{query}\n",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
Step 5.3: Generate and Parse LLM Output
# Format the prompt
_input = prompt.format_prompt(query=actor_query)

# Get LLM response
output = model(_input.to_string())

# Parse into the Actor model
parsed_actor = parser.parse(output)
Parsed output (exactly as in the original):
Actor(name='Tom Hanks', film_names=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Cast Away', 'Toy Story'])
6. Key Details Explained
- Format Instructions: parser.get_format_instructions() auto-generates text that asks the LLM to return a JSON object conforming to the model's JSON schema. The schema is embedded in the instructions and lists each field with its type and description (for Joke: setup and punchline, both strings). Custom validators are not part of the JSON schema, so a rule such as "the setup must end with a question mark" is not sent to the LLM and is checked only when the output is parsed.
- Custom Validation: The @validator decorator in the Joke model enforces business rules (e.g., the question mark check). If the LLM's output violates the rule, Pydantic raises a ValidationError during parsing, which the parser surfaces as a parsing error (an OutputParserException in LangChain's implementation).
- Compound Types: The List[str] type in the Actor model appears in the schema as an array of strings, which tells the LLM to return a list of film names; the parser then converts the JSON array into a Python list.
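Because a validation failure is an expected outcome rather than a rare accident, calling code usually wraps the parsing step in a try/except. The sketch below exercises only the Pydantic layer, with no LangChain call; the Question model and the check helper are hypothetical names that mirror the question mark rule from the Joke model.

```python
from pydantic import BaseModel, ValidationError, validator


class Question(BaseModel):  # hypothetical model mirroring the Joke setup rule
    text: str

    # Pydantic 1 style, as in the examples above; Pydantic 2 offers field_validator.
    @validator("text")
    def must_end_with_question_mark(cls, value):
        if not value.endswith("?"):
            raise ValueError("Badly formed question!")
        return value


def check(text: str) -> str:
    """Return 'ok' if the text passes validation, otherwise a short error summary."""
    try:
        Question(text=text)
        return "ok"
    except ValidationError as exc:
        return f"rejected: {len(exc.errors())} error(s)"


print(check("Why did the chicken cross the road?"))
print(check("The chicken crossed the road."))
```

In a real pipeline the except branch is the natural place to retry the LLM call or log the malformed output.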
Key Takeaways
- PydanticOutputParser links LLM outputs to structured Pydantic models using auto-generated format instructions.
- Define data schemas with
BaseModel, add context withField, and enforce rules with@validator. - Use capable LLMs (e.g., DaVinci) to ensure valid JSON output—smaller models may fail.
- Supports complex types (lists, nested models) for versatile structured data extraction.
