The Contents API enables you to extract clean, structured content from web pages with optional AI-powered processing, including summarization and structured data extraction.

Basic Usage

from valyu import Valyu

valyu = Valyu()

response = valyu.contents([
    "https://en.wikipedia.org/wiki/Machine_learning"
])

print(f"Processed {response.urls_processed} of {response.urls_requested} URLs")
if response.results:
    for result in response.results:
        print(f"Title: {result.title}")
        print(f"Content length: {result.length} characters")
        print(f"Content preview: {result.content[:200]}...")

Parameters

URLs (Required)

| Parameter | Type | Description |
| --- | --- | --- |
| urls | List[str] | Array of URLs to process (max 10 sync, max 50 async) |

Options (Optional)

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| summary | bool, str, or dict | AI processing configuration: False (none), True (auto), string (custom instructions), or dict (JSON schema) | False |
| extract_effort | "normal", "high", or "auto" | Processing effort level for content extraction | "normal" |
| response_length | str or int | Content length per URL: "short" (25k), "medium" (50k), "large" (100k), "max" (no limit), or a custom character count | "short" |
| screenshot | bool | Request page screenshots. When True, results include a screenshot_url field | False |
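
The response_length presets map to character limits as listed in the table. The sketch below shows that mapping as a small client-side helper. The name resolve_response_length is hypothetical and not part of the SDK, and rejecting non-positive custom values is an assumption, since the table does not state a minimum.

```python
from typing import Optional, Union

# Character limits for the named presets, as documented in the options table.
# None stands for "max", which applies no limit.
PRESET_LIMITS = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}


def resolve_response_length(value: Union[str, int]) -> Optional[int]:
    """Return the character limit for a response_length value (None = no limit).

    Hypothetical helper for local validation; the SDK does not require it.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("response_length must be a preset name or an int")
    if isinstance(value, int):
        # Assumption: a custom limit only makes sense when it is positive.
        if value <= 0:
            raise ValueError("custom response_length must be positive")
        return value
    if value in PRESET_LIMITS:
        return PRESET_LIMITS[value]
    raise ValueError(f"unknown response_length preset: {value!r}")
```

Checking the value locally can surface a typo such as "larg" before a request is sent.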

Response Format

class ContentsResponse:
    success: bool
    error: Optional[str]
    tx_id: str
    urls_requested: int
    urls_processed: int
    urls_failed: int
    results: List[ContentsResult]
    total_cost_dollars: float
    total_characters: int

class ContentsResult:
    url: str
    title: str
    content: Union[str, dict]  # string for raw content, dict for structured
    description: Optional[str]
    length: int
    price: float
    source: str
    summary_success: Optional[bool]
    data_type: Optional[str]
    image_url: Optional[Dict[str, str]]
    screenshot_url: Optional[str]  # Only present when screenshot=True
    citation: Optional[str]

Parameter Examples

Basic Content Extraction

Extract clean content without AI processing:
response = valyu.contents([
    "https://www.python.org",
    "https://nodejs.org"
])

if response.results:
    for result in response.results:
        print(f"{result.title}: {result.length} characters")

AI Summary (Boolean)

Get automatic AI summaries of the extracted content:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], summary=True, response_length="medium")

if response.results and response.results[0].content:
    print("AI Summary:", response.results[0].content)

Custom Summary Instructions

Provide specific instructions for AI summarization:
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], 
summary="Summarize the main AI trends mentioned in exactly 3 bullet points",
response_length="medium",
extract_effort="high")

Structured Data Extraction

Extract specific data points using JSON schema:
response = valyu.contents([
    "https://www.openai.com"
], 
extract_effort="high",
response_length="large",
summary={
    "type": "object",
    "properties": {
        "company_name": { 
            "type": "string",
            "description": "The name of the company"
        },
        "industry": { 
            "type": "string",
            "enum": ["tech", "finance", "healthcare", "retail", "other"],
            "description": "Primary industry sector"
        },
        "key_products": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
            "description": "Main products or services"
        },
        "founded_year": {
            "type": "number",
            "description": "Year the company was founded"
        }
    },
    "required": ["company_name", "industry"]
})

if response.results and response.results[0].content:
    print("Extracted data:", response.results[0].content)
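
This page does not state whether the API guarantees that every field listed under "required" is present in the extracted data, so a defensive check before indexing into the result may be useful. The sketch below only inspects top-level keys; missing_required is a hypothetical helper, and a full JSON Schema validator would be the more thorough option.

```python
from typing import Any, Dict, List


def missing_required(schema: Dict[str, Any], data: Any) -> List[str]:
    """List the schema's top-level required keys that are absent from data."""
    required = list(schema.get("required", []))
    if not isinstance(data, dict):
        # A non-dict value means structured extraction did not yield an object.
        return required
    return [key for key in required if key not in data]


# Example with the same "required" list as the schema above.
schema = {"type": "object", "required": ["company_name", "industry"]}
print(missing_required(schema, {"company_name": "Example Corp"}))
```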

Response Length Control

Control the amount of content extracted per URL:
response = valyu.contents([
    "https://arxiv.org/abs/2301.00001",
    "https://arxiv.org/abs/1706.03762",
    "https://www.science.org/doi/10.1126/science.1234567"
], 
response_length="large",  # More content for academic papers
summary="Extract the main research findings and methodology",
extract_effort="high")

Extract Effort Levels

Control the extraction quality and processing intensity:
# Normal (default) - Fast
normal_response = valyu.contents(urls, extract_effort="normal")

# High - Enhanced quality for complex layouts and JS heavy pages
high_quality_response = valyu.contents(urls, extract_effort="high")

# Auto - Intelligent effort selection
auto_response = valyu.contents(urls, extract_effort="auto")

Response Length Options

Control content length with predefined or custom limits:
# Predefined lengths
short_response = valyu.contents(urls, response_length="short")    # 25k characters
medium_response = valyu.contents(urls, response_length="medium")  # 50k characters  
large_response = valyu.contents(urls, response_length="large")    # 100k characters
full_response = valyu.contents(urls, response_length="max")       # No limit

# Custom length
custom_response = valyu.contents(urls, response_length=15000)     # Custom character limit

Use Case Examples

Research Paper Analysis

Build an AI-powered academic research assistant that extracts and analyzes research papers:
def analyze_research_paper(paper_url: str):
    response = valyu.contents([paper_url], 
    summary={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "authors": { 
                "type": "array", 
                "items": {"type": "string"} 
            },
            "abstract": {"type": "string"},
            "key_contributions": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Main contributions of the research"
            },
            "methodology": { 
                "type": "string",
                "description": "Research methodology and approach"
            },
            "results_summary": { 
                "type": "string",
                "description": "Summary of key findings and results"
            },
            "implications": {
                "type": "string",
                "description": "Broader implications and significance"
            },
            "citations_count": {"type": "number"},
            "publication_date": {"type": "string"}
        },
        "required": ["title", "abstract", "key_contributions", "methodology"]
    },
    response_length="max",
    extract_effort="high")

    # Structured output is returned in the content field (see Response Format)
    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Research Paper Analysis ===")
        print(f"Title: {analysis['title']}")
        print(f"Authors: {', '.join(analysis.get('authors', []))}")
        print(f"\nAbstract: {analysis['abstract']}")
        
        print("\nKey Contributions:")
        for i, contrib in enumerate(analysis.get('key_contributions', []), 1):
            print(f"{i}. {contrib}")
        
        print(f"\nMethodology: {analysis['methodology']}")
        # results_summary and implications are not in the schema's required list
        print(f"\nResults: {analysis.get('results_summary', 'n/a')}")
        print(f"\nImplications: {analysis.get('implications', 'n/a')}")
        
        return analysis
    
    return None

# Usage
paper_analysis = analyze_research_paper(
    "https://arxiv.org/abs/2024.01234"
)

E-commerce Product Intelligence

Create a product research tool that extracts comprehensive product data:
from typing import List

def analyze_products(product_urls: List[str]):
    response = valyu.contents(product_urls, 
    summary={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_name": {"type": "string"},
                        "brand": {"type": "string"},
                        "price": {"type": "string"},
                        "original_price": {"type": "string"},
                        "discount_percentage": {"type": "string"},
                        "description": {"type": "string"},
                        "key_features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "maxItems": 8
                        },
                        "specifications": {
                            "type": "object",
                            "description": "Technical specifications"
                        },
                        "customer_rating": {"type": "number"},
                        "review_count": {"type": "number"},
                        "availability": { 
                            "type": "string",
                            "enum": ["in_stock", "out_of_stock", "limited", "pre_order"]
                        },
                        "shipping_info": {"type": "string"},
                        "warranty_info": {"type": "string"}
                    },
                    "required": ["product_name", "price", "description"]
                }
            },
            "comparison_summary": {
                "type": "string",
                "description": "Overall comparison of the products"
            }
        }
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Product Analysis ===")
        for i, product in enumerate(analysis.get('products', []), 1):
            print(f"\n{i}. {product['product_name']}")
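            # Only product_name, price, and description are required by the schema
            print(f"   Brand: {product.get('brand', 'n/a')}")
            print(f"   Price: {product['price']}")
            print(f"   Rating: {product.get('customer_rating', 'n/a')}/5 ({product.get('review_count', 0)} reviews)")
            print(f"   Availability: {product.get('availability', 'unknown')}")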
            
            if product.get('key_features'):
                print("   Key Features:")
                for feature in product['key_features']:
                    print(f"     • {feature}")
        
        print("\n=== Comparison Summary ===")
        print(analysis.get('comparison_summary', ''))
        
        return analysis
    
    return None

# Usage
product_comparison = analyze_products([
    "https://amazon.com/product1",
    "https://bestbuy.com/product2",
    "https://target.com/product3"
])

Technical Documentation Processor

Build a documentation analysis tool that extracts API information and technical details:
from typing import List

def process_documentation(doc_urls: List[str]):
    response = valyu.contents(doc_urls, 
    summary={
        "type": "object",
        "properties": {
            "documentation_overview": {
                "type": "string",
                "description": "Overview of what the documentation covers"
            },
            "api_endpoints": {
                "type": "array",
                "items": {
                    "type": "object", 
                    "properties": {
                        "method": {"type": "string"},
                        "path": {"type": "string"},
                        "description": {"type": "string"},
                        "parameters": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                    "required": {"type": "boolean"},
                                    "description": {"type": "string"}
                                }
                            }
                        },
                        "response_format": {"type": "string"}
                    }
                }
            },
            "authentication": {
                "type": "object",
                "properties": {
                    "method": {"type": "string"},
                    "description": {"type": "string"},
                    "example": {"type": "string"}
                }
            },
            "rate_limits": {"type": "string"},
            "code_examples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "language": {"type": "string"},
                        "example": {"type": "string"},
                        "description": {"type": "string"}
                    }
                }
            },
            "common_errors": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["documentation_overview", "api_endpoints", "authentication"]
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        docs = response.results[0].content
        
        print("=== API Documentation Analysis ===")
        print(f"\nOverview: {docs['documentation_overview']}")
        
        print("\n=== Authentication ===")
        auth = docs.get('authentication', {})
        print(f"Method: {auth.get('method')}")
        print(f"Description: {auth.get('description')}")
        
        print("\n=== API Endpoints ===")
        for i, endpoint in enumerate(docs.get('api_endpoints', []), 1):
            print(f"\n{i}. {endpoint['method']} {endpoint['path']}")
            print(f"   Description: {endpoint['description']}")
            
            if endpoint.get('parameters'):
                print("   Parameters:")
                for param in endpoint['parameters']:
                    # Parameter fields are optional in the schema, so read them defensively
                    required = "(required)" if param.get('required') else "(optional)"
                    print(f"     • {param.get('name')} ({param.get('type')}) {required}: {param.get('description')}")
        
        if docs.get('rate_limits'):
            print(f"\n=== Rate Limits ===")
            print(docs['rate_limits'])
        
        return docs
    
    return None

# Usage
api_docs = process_documentation([
    "https://docs.example.com/api-reference",
    "https://developers.service.com/guide"
])

Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode.
Async mode is required when submitting more than 10 URLs. Each request accepts at most 50 URLs, which are processed in batches of 5 with a 120-second timeout per URL (compared with 25 seconds in sync mode). Jobs expire after 7 days.
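
The limits above can be turned into a small planning helper. This is a sketch under stated assumptions: the function names are hypothetical, and expected_batches assumes the server's batch count is simply the URL count divided by 5 and rounded up, which this page implies but does not state outright.

```python
import math
from typing import List

SYNC_MAX_URLS = 10    # documented limit for synchronous requests
ASYNC_MAX_URLS = 50   # documented limit per async request
ASYNC_BATCH_SIZE = 5  # documented server-side batch size


def needs_async(urls: List[str]) -> bool:
    """True when the list is too long for a synchronous request."""
    return len(urls) > SYNC_MAX_URLS


def split_for_async(urls: List[str]) -> List[List[str]]:
    """Split a URL list into chunks that each fit in one async request."""
    return [urls[i:i + ASYNC_MAX_URLS] for i in range(0, len(urls), ASYNC_MAX_URLS)]


def expected_batches(url_count: int) -> int:
    """Assumed number of server-side batches for one async job."""
    return math.ceil(url_count / ASYNC_BATCH_SIZE)
```

A list of 120 URLs, for example, would be split into three jobs of 50, 50, and 20 URLs, each submitted with async_mode=True.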

Submit and wait

The simplest approach — pass wait=True to block until the job completes:
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    wait=True,              # blocks until job completes
    poll_interval=5,        # seconds between polls (default: 5)
    max_wait_time=3600,     # max seconds to wait (default: 3600)
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")

Submit, then wait separately

For more control, submit first, then call wait_for_contents_job() with a progress callback:
# Submit — returns immediately
job = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Store the webhook_secret immediately — it is ONLY returned here
if job.get("webhook_secret"):
    save_webhook_secret(job["job_id"], job["webhook_secret"])

# Wait with progress tracking
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,
    max_wait_time=3600,
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

if result["status"] in ("completed", "partial"):
    for r in result["results"]:
        print(f"{r['title']}: {r['length']} characters")

Manual polling

If you prefer full control over the polling loop:
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']}")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)
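
The loop above runs until the job reaches a terminal status and has no upper bound on waiting time. The sketch below adds a deadline in the spirit of max_wait_time. The helper poll_until_done is hypothetical (the SDK's wait_for_contents_job already covers this case), and the status fetcher is injected so the logic can be exercised without network access.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"completed", "partial", "failed"}


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 5.0,
    max_wait_time: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call get_status() until a terminal status appears or the deadline passes."""
    deadline = clock() + max_wait_time
    while True:
        status = get_status()
        if status["status"] in TERMINAL_STATUSES:
            return status
        # Give up if the next poll would land after the deadline.
        if clock() + poll_interval > deadline:
            raise TimeoutError(
                f"job still {status['status']!r} after {max_wait_time} seconds"
            )
        sleep(poll_interval)
```

With the names used earlier on this page, a call might look like poll_until_done(lambda: valyu.get_contents_job(job["job_id"])).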

Async parameters

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| async_mode | bool | Process URLs asynchronously. Required for more than 10 URLs. | False |
| webhook_url | str | HTTPS URL to receive results via webhook POST. | None |
| wait | bool | Block until the job completes (SDK handles polling). | False |
| poll_interval | int | Seconds between polls when wait=True or using wait_for_contents_job. | 5 |
| max_wait_time | int | Max seconds to wait before timing out. | 3600 |
| on_progress | Callable | Callback invoked on each poll with the current status dict. | None |

Webhook verification

Webhooks are signed using HMAC-SHA256 with format "{timestamp}.{json_body}". See the Content Extraction guide for full verification examples.
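
As a rough illustration of the signing format, the sketch below computes an HMAC-SHA256 over "{timestamp}.{json_body}" and compares it in constant time. Several details are assumptions not confirmed on this page: that the signature is hex-encoded, that it carries no prefix, and how the timestamp and signature are delivered with the request. The function names are hypothetical; the Content Extraction guide remains the authoritative reference.

```python
import hashlib
import hmac


def sign_payload(secret: str, timestamp: str, raw_body: str) -> str:
    """Compute HMAC-SHA256 over '{timestamp}.{raw_body}' (hex output is an assumption)."""
    message = f"{timestamp}.{raw_body}".encode("utf-8")
    return hmac.new(secret.encode("utf-8"), message, hashlib.sha256).hexdigest()


def verify_webhook(secret: str, timestamp: str, raw_body: str, signature: str) -> bool:
    """Return True when the received signature matches the expected one."""
    expected = sign_payload(secret, timestamp, raw_body)
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

Verification should use the raw request body exactly as received, since re-serializing the JSON can change whitespace or key order and alter the digest.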

Async response types

# Initial response (HTTP 202)
class ContentsAsyncResponse:
    success: bool
    job_id: str
    status: str               # Always "pending"
    urls_total: int
    poll_url: str
    tx_id: str
    webhook_secret: Optional[str]  # ONLY returned here — store immediately

# Job status response (polling / wait result)
class ContentsJobResponse:
    success: bool
    job_id: str
    status: str               # "pending" | "processing" | "completed" | "partial" | "failed"
    urls_total: int
    urls_processed: int
    urls_failed: int
    created_at: int           # Milliseconds since epoch
    updated_at: int
    current_batch: Optional[int]           # Present during "processing"
    total_batches: Optional[int]           # Present during "processing"
    results: Optional[List[ContentsResult]]  # Present when completed/partial
    actual_cost_dollars: Optional[float]     # Present when completed/partial
    error: Optional[str]                     # Present when partial/failed
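
Since created_at is expressed in milliseconds since the epoch, it has to be divided by 1000 before being passed to Python's datetime functions. A minimal sketch follows; the helper names are hypothetical, and treating updated_at as milliseconds too is an assumption, because only created_at is annotated above.

```python
from datetime import datetime, timedelta, timezone


def ms_to_datetime(ms: int) -> datetime:
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def job_duration(created_at: int, updated_at: int) -> timedelta:
    """Time between creation and last update (assumes both are in milliseconds)."""
    return timedelta(milliseconds=updated_at - created_at)
```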

Async Client (asyncio)

This section covers the AsyncValyu Python client — calling the Contents API with async/await inside an event loop. That’s different from the server-side async jobs described under Async Processing above, which use async_mode=True to submit a long-running job server-side. You can combine both: use AsyncValyu to submit and poll a server-side async job without blocking your event loop. The last example below shows how.
AsyncValyu.contents accepts the exact same arguments and returns the same response types (ContentsResponse, ContentsJobCreateResponse, ContentsJobStatus) as the synchronous contents method: summary, extract_effort, response_length, max_price_dollars, screenshot, async_mode, webhook_url, wait, and the polling knobs all behave identically. The only difference is that the call is awaited.
import asyncio
from valyu import AsyncValyu

async def main():
    async with AsyncValyu() as valyu:
        response = await valyu.contents(
            urls=["https://arxiv.org/abs/1706.03762"],
        )
        for r in response.results:
            print(r.title, "—", r.length, "chars")

asyncio.run(main())

Fan out single-URL extractions

The natural fit for async contents is many single-URL extractions running in parallel — for example, expanding a list of search hits into full-text, or warming a cache from a feed of URLs:
import asyncio
from valyu import AsyncValyu

urls = [
    "https://arxiv.org/abs/1706.03762",
    "https://arxiv.org/abs/2005.14165",
    "https://arxiv.org/abs/2106.09685",
]

async def main():
    async with AsyncValyu() as valyu:
        responses = await asyncio.gather(*[
            valyu.contents(urls=[u]) for u in urls
        ])
        for url, r in zip(urls, responses):
            print(url, "→", r.urls_processed, "/", r.urls_requested)

asyncio.run(main())
Each request extracts one URL; running them through asyncio.gather means total wall time is the slowest single URL, not the sum. Bound concurrency with asyncio.Semaphore when your list grows past a few dozen — see the Async Usage section of the Python SDK overview for that pattern.
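
The bounded-concurrency idea mentioned above can be sketched with asyncio.Semaphore as follows. This is an illustration, not the pattern from the SDK overview: gather_bounded and fake_extract are hypothetical names, and fake_extract stands in for a real call such as valyu.contents(urls=[u]) so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_bounded(
    items: List[str],
    worker: Callable[[str], Awaitable[T]],
    limit: int = 10,
) -> List[T]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: str) -> T:
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather preserves input order, so results line up with items.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_extract(url: str) -> str:
    """Stand-in for a single-URL extraction call."""
    await asyncio.sleep(0.01)
    return f"extracted {url}"


if __name__ == "__main__":
    urls = [f"https://example.com/page{i}" for i in range(40)]
    results = asyncio.run(gather_bounded(urls, fake_extract, limit=5))
    print(len(results), "results")
```

Inside an AsyncValyu context, the worker could be a lambda such as lambda u: valyu.contents(urls=[u]), with the limit chosen to suit your rate limits.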

Server-side async jobs without blocking the event loop

When a single job spans 11-50 URLs, you have to use async_mode=True (see Async Processing above). Calling that through AsyncValyu means polling the job status without parking a thread: wait_for_contents_job is awaitable and sleeps between polls using asyncio.sleep, leaving the event loop free to do other work.
import asyncio
from valyu import AsyncValyu

urls = [f"https://example.com/page{i}" for i in range(1, 31)]

async def main():
    async with AsyncValyu() as valyu:
        job = await valyu.contents(urls=urls, async_mode=True)

        result = await valyu.wait_for_contents_job(
            job.job_id,
            poll_interval=5,
            max_wait_time=3600,
        )
        print(f"Completed {result.urls_processed}/{result.urls_total}")

asyncio.run(main())
You can also submit several server-side async jobs concurrently and wait on all of them:
async def main():
    async with AsyncValyu() as valyu:
        batches = [
            [f"https://source-a.com/page{i}" for i in range(30)],
            [f"https://source-b.com/page{i}" for i in range(30)],
        ]
        jobs = await asyncio.gather(*[
            valyu.contents(urls=b, async_mode=True) for b in batches
        ])
        results = await asyncio.gather(*[
            valyu.wait_for_contents_job(j.job_id) for j in jobs
        ])
        for b, r in zip(batches, results):
            print(f"batch of {len(b)}: {r.urls_processed} processed")

Wait-inline form

If you prefer a single call, pass wait=True and AsyncValyu will submit the job and await its terminal state for you, the same way the sync client does with blocking I/O:
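async def main():
    # async with is only valid inside a coroutine
    async with AsyncValyu() as valyu:
        result = await valyu.contents(
            urls=urls,
            async_mode=True,
            wait=True,
            poll_interval=5,
        )
        print(f"Completed {result.urls_processed}/{result.urls_total}")

asyncio.run(main())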
See the Python SDK overview for all AsyncValyu constructor options and lifecycle patterns.

Error Handling

def extract_contents(urls, **options):
    response = valyu.contents(urls, **options)

    if not response.success:
        print("Contents extraction failed:", response.error)
        return None

    # Check for partial failures
    if response.urls_failed:
        print(f"{response.urls_failed} of {response.urls_requested} URLs failed")

    # Process successful results
    for index, result in enumerate(response.results or [], start=1):
        print(f"Result {index}:")
        print(f"  Title: {result.title}")
        print(f"  URL: {result.url}")
        print(f"  Length: {result.length} characters")

        # Print the AI output only when summarization succeeded
        if result.summary_success:
            print(f"  Summary: {result.content}")

    return response