# Contents API

> Extract and process content from URLs with AI using the Valyu Python SDK

The Contents API enables you to extract clean, structured content from web pages with optional AI-powered processing, including summarization and structured data extraction.

## Basic Usage

```python theme={null}
from valyu import Valyu

valyu = Valyu()

response = valyu.contents([
    "https://en.wikipedia.org/wiki/Machine_learning"
])

print(f"Processed {response.urls_processed} of {response.urls_requested} URLs")
if response.results:
    for result in response.results:
        print(f"Title: {result.title}")
        print(f"Content length: {result.length} characters")
        print(f"Content preview: {result.content[:200]}...")
```

## Parameters

### URLs (Required)

| Parameter | Type       | Description                                          |
| --------- | ---------- | ---------------------------------------------------- |
| `urls`    | List\[str] | List of URLs to process (max 10 sync, max 50 async)  |

### Options (Optional)

| Parameter         | Type                               | Description                                                                                     | Default  |
| ----------------- | ---------------------------------- | ----------------------------------------------------------------------------------------------- | -------- |
| `summary`         | bool \| str \| dict                | AI processing configuration: `False` (none), `True` (auto), string (custom), or JSON schema     | False    |
| `extract_effort`  | `"normal"` \| `"high"` \| `"auto"` | Processing effort level for content extraction                                                  | "normal" |
| `response_length` | str \| int                         | Content length per URL: `"short"` (25k), `"medium"` (50k), `"large"` (100k), `"max"`, or custom | "short"  |
| `screenshot`      | bool                               | Request page screenshots. When `True`, results include `screenshot_url` field                   | False    |
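The `response_length` presets above map to character budgets that can be checked client-side before a request is sent. The following is a minimal sketch under stated assumptions: `resolve_response_length` and `PRESET_LIMITS` are hypothetical names rather than SDK members, and rejecting non-positive custom values is an assumption, not documented API behavior.

```python
from typing import Optional, Union

# Character budgets for the named presets listed in the table above
PRESET_LIMITS = {"short": 25_000, "medium": 50_000, "large": 100_000}


def resolve_response_length(value: Union[str, int]) -> Optional[int]:
    """Return the character budget for a response_length value.

    None means "no limit" (the "max" preset). Hypothetical helper,
    not part of the Valyu SDK.
    """
    if isinstance(value, bool):
        # bool is a subclass of int, so reject it explicitly
        raise ValueError("response_length must be a preset name or an integer")
    if isinstance(value, int):
        if value <= 0:
            raise ValueError("custom response_length must be positive")
        return value
    if value == "max":
        return None
    if value in PRESET_LIMITS:
        return PRESET_LIMITS[value]
    raise ValueError(f"unknown response_length: {value!r}")
```

Validating locally like this can surface a typo such as `"larg"` before any request is made.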

## Response Format

```python theme={null}
class ContentsResponse:
    success: bool
    error: Optional[str]
    tx_id: str
    urls_requested: int
    urls_processed: int
    urls_failed: int
    results: List[ContentsResult]
    total_cost_dollars: float
    total_characters: int

class ContentsResult:
    url: str
    title: str
    content: Union[str, dict]  # string for raw content, dict for structured
    description: Optional[str]
    length: int
    price: float
    source: str
    summary_success: Optional[bool]
    data_type: Optional[str]
    image_url: Optional[Dict[str, str]]
    screenshot_url: Optional[str]  # Only present when screenshot=True
    citation: Optional[str]
```
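Because `content` may be either a string or a dict, downstream code usually has to branch on its type. The following is a minimal sketch of one way to do that. `content_to_text` is a hypothetical helper name, not part of the SDK, and serializing structured output as indented JSON is just one reasonable choice.

```python
import json
from typing import Union


def content_to_text(content: Union[str, dict]) -> str:
    """Return a printable string for either raw or structured content.

    Hypothetical helper, not part of the Valyu SDK.
    """
    if isinstance(content, dict):
        # Structured output (a JSON schema was passed as `summary`)
        return json.dumps(content, indent=2, ensure_ascii=False)
    # Raw content or a plain-text summary
    return content
```

With a helper like this, `print(content_to_text(result.content))` behaves the same whether or not a schema was supplied.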

## Parameter Examples

### Basic Content Extraction

Extract clean content without AI processing:

```python theme={null}
response = valyu.contents([
    "https://www.python.org",
    "https://nodejs.org"
])

if response.results:
    for result in response.results:
        print(f"{result.title}: {result.length} characters")
```

### AI Summary (Boolean)

Get automatic AI summaries of the extracted content:

```python theme={null}
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], summary=True, response_length="medium")

if response.results and response.results[0].content:
    print("AI Summary:", response.results[0].content)
```

### Custom Summary Instructions

Provide specific instructions for AI summarization:

```python theme={null}
response = valyu.contents([
    "https://en.wikipedia.org/wiki/Artificial_intelligence"
], 
summary="Summarize the main AI trends mentioned in exactly 3 bullet points",
response_length="medium",
extract_effort="high")
```

### Structured Data Extraction

Extract specific data points using JSON schema:

```python theme={null}
response = valyu.contents([
    "https://www.openai.com"
], 
extract_effort="high",
response_length="large",
summary={
    "type": "object",
    "properties": {
        "company_name": { 
            "type": "string",
            "description": "The name of the company"
        },
        "industry": { 
            "type": "string",
            "enum": ["tech", "finance", "healthcare", "retail", "other"],
            "description": "Primary industry sector"
        },
        "key_products": {
            "type": "array",
            "items": {"type": "string"},
            "maxItems": 5,
            "description": "Main products or services"
        },
        "founded_year": {
            "type": "number",
            "description": "Year the company was founded"
        }
    },
    "required": ["company_name", "industry"]
})

if response.results and response.results[0].content:
    print("Extracted data:", response.results[0].content)
```

### Response Length Control

Control the amount of content extracted per URL:

```python theme={null}
response = valyu.contents([
    "https://arxiv.org/abs/2301.00001",
    "https://arxiv.org/abs/1706.03762",
    "https://www.science.org/doi/10.1126/science.1234567"
], 
response_length="large",  # More content for academic papers
summary="Extract the main research findings and methodology",
extract_effort="high")
```

### Extract Effort Levels

Control the extraction quality and processing intensity:

```python theme={null}
# Normal (default) - Fast
normal_response = valyu.contents(urls, extract_effort="normal")

# High - Enhanced quality for complex layouts and JS heavy pages
high_quality_response = valyu.contents(urls, extract_effort="high")

# Auto - Intelligent effort selection
auto_response = valyu.contents(urls, extract_effort="auto")
```

### Response Length Options

Control content length with predefined or custom limits:

```python theme={null}
# Predefined lengths
short_response = valyu.contents(urls, response_length="short")    # 25k characters
medium_response = valyu.contents(urls, response_length="medium")  # 50k characters  
large_response = valyu.contents(urls, response_length="large")    # 100k characters
full_response = valyu.contents(urls, response_length="max")       # No limit

# Custom length
custom_response = valyu.contents(urls, response_length=15000)     # Custom character limit
```

## Use Case Examples

### Research Paper Analysis

Build an AI-powered academic research assistant that extracts and analyzes research papers:

```python theme={null}
def analyze_research_paper(paper_url: str):
    response = valyu.contents([paper_url], 
    summary={
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "authors": { 
                "type": "array", 
                "items": {"type": "string"} 
            },
            "abstract": {"type": "string"},
            "key_contributions": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Main contributions of the research"
            },
            "methodology": { 
                "type": "string",
                "description": "Research methodology and approach"
            },
            "results_summary": { 
                "type": "string",
                "description": "Summary of key findings and results"
            },
            "implications": {
                "type": "string",
                "description": "Broader implications and significance"
            },
            "citations_count": {"type": "number"},
            "publication_date": {"type": "string"}
        },
        "required": ["title", "abstract", "key_contributions", "methodology"]
    },
    response_length="max",
    extract_effort="high")

    if response.success and response.results and response.results[0].content:
        # Structured output is returned in `content` (a dict when a JSON schema is passed)
        analysis = response.results[0].content
        
        print("=== Research Paper Analysis ===")
        print(f"Title: {analysis['title']}")
        print(f"Authors: {', '.join(analysis.get('authors', []))}")
        print(f"\nAbstract: {analysis['abstract']}")
        
        print("\nKey Contributions:")
        for i, contrib in enumerate(analysis.get('key_contributions', []), 1):
            print(f"{i}. {contrib}")
        
        print(f"\nMethodology: {analysis['methodology']}")
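        # These fields are not in the schema's "required" list, so read them defensively
        print(f"\nResults: {analysis.get('results_summary', 'N/A')}")
        print(f"\nImplications: {analysis.get('implications', 'N/A')}")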
        
        return analysis
    
    return None

# Usage
paper_analysis = analyze_research_paper(
    "https://arxiv.org/abs/2024.01234"
)
```

### E-commerce Product Intelligence

Create a product research tool that extracts comprehensive product data:

```python theme={null}
from typing import List

def analyze_products(product_urls: List[str]):
    response = valyu.contents(product_urls, 
    summary={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "product_name": {"type": "string"},
                        "brand": {"type": "string"},
                        "price": {"type": "string"},
                        "original_price": {"type": "string"},
                        "discount_percentage": {"type": "string"},
                        "description": {"type": "string"},
                        "key_features": {
                            "type": "array",
                            "items": {"type": "string"},
                            "maxItems": 8
                        },
                        "specifications": {
                            "type": "object",
                            "description": "Technical specifications"
                        },
                        "customer_rating": {"type": "number"},
                        "review_count": {"type": "number"},
                        "availability": { 
                            "type": "string",
                            "enum": ["in_stock", "out_of_stock", "limited", "pre_order"]
                        },
                        "shipping_info": {"type": "string"},
                        "warranty_info": {"type": "string"}
                    },
                    "required": ["product_name", "price", "description"]
                }
            },
            "comparison_summary": {
                "type": "string",
                "description": "Overall comparison of the products"
            }
        }
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        analysis = response.results[0].content
        
        print("=== Product Analysis ===")
        for i, product in enumerate(analysis.get('products', []), 1):
            print(f"\n{i}. {product['product_name']}")
            # Only product_name, price, and description are required by the schema
            print(f"   Brand: {product.get('brand', 'N/A')}")
            print(f"   Price: {product['price']}")
            print(f"   Rating: {product.get('customer_rating', 'N/A')}/5 ({product.get('review_count', 'N/A')} reviews)")
            print(f"   Availability: {product.get('availability', 'N/A')}")
            
            if product.get('key_features'):
                print("   Key Features:")
                for feature in product['key_features']:
                    print(f"     • {feature}")
        
        print("\n=== Comparison Summary ===")
        print(analysis.get('comparison_summary', 'N/A'))
        
        return analysis
    
    return None

# Usage
product_comparison = analyze_products([
    "https://amazon.com/product1",
    "https://bestbuy.com/product2",
    "https://target.com/product3"
])
```

### Technical Documentation Processor

Build a documentation analysis tool that extracts API information and technical details:

```python theme={null}
from typing import List

def process_documentation(doc_urls: List[str]):
    response = valyu.contents(doc_urls, 
    summary={
        "type": "object",
        "properties": {
            "documentation_overview": {
                "type": "string",
                "description": "Overview of what the documentation covers"
            },
            "api_endpoints": {
                "type": "array",
                "items": {
                    "type": "object", 
                    "properties": {
                        "method": {"type": "string"},
                        "path": {"type": "string"},
                        "description": {"type": "string"},
                        "parameters": {
                            "type": "array",
                            "items": {
                                "type": "object",
                                "properties": {
                                    "name": {"type": "string"},
                                    "type": {"type": "string"},
                                    "required": {"type": "boolean"},
                                    "description": {"type": "string"}
                                }
                            }
                        },
                        "response_format": {"type": "string"}
                    }
                }
            },
            "authentication": {
                "type": "object",
                "properties": {
                    "method": {"type": "string"},
                    "description": {"type": "string"},
                    "example": {"type": "string"}
                }
            },
            "rate_limits": {"type": "string"},
            "code_examples": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "language": {"type": "string"},
                        "example": {"type": "string"},
                        "description": {"type": "string"}
                    }
                }
            },
            "common_errors": {
                "type": "array",
                "items": {"type": "string"}
            }
        },
        "required": ["documentation_overview", "api_endpoints", "authentication"]
    },
    extract_effort="high",
    response_length="large")

    if response.success and response.results and response.results[0].content:
        docs = response.results[0].content
        
        print("=== API Documentation Analysis ===")
        print(f"\nOverview: {docs['documentation_overview']}")
        
        print("\n=== Authentication ===")
        auth = docs.get('authentication', {})
        print(f"Method: {auth.get('method')}")
        print(f"Description: {auth.get('description')}")
        
        print("\n=== API Endpoints ===")
        for i, endpoint in enumerate(docs.get('api_endpoints', []), 1):
            # Nested fields are not marked as required in the schema, so use .get()
            print(f"\n{i}. {endpoint.get('method', '?')} {endpoint.get('path', '?')}")
            print(f"   Description: {endpoint.get('description', '')}")
            
            if endpoint.get('parameters'):
                print("   Parameters:")
                for param in endpoint['parameters']:
                    required = "(required)" if param.get('required') else "(optional)"
                    print(f"     • {param.get('name')} ({param.get('type')}) {required}: {param.get('description')}")
        
        if docs.get('rate_limits'):
            print(f"\n=== Rate Limits ===")
            print(docs['rate_limits'])
        
        return docs
    
    return None

# Usage
api_docs = process_documentation([
    "https://docs.example.com/api-reference",
    "https://developers.service.com/guide"
])
```

## Async Processing

For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode.

<Note>
  Async mode is **required** when submitting more than 10 URLs. Each request accepts at most 50 URLs, which are processed in batches of 5 with a 120-second timeout per URL (versus 25 seconds in sync mode). Jobs expire after 7 days.
</Note>
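The limits in the note above imply a simple client-side rule: up to 10 URLs can go through a sync call, 11 to 50 need `async_mode=True`, and anything larger has to be split across several requests. A minimal sketch of that rule follows. `plan_requests` and its return shape are hypothetical and not part of the SDK.

```python
from typing import List, Tuple

SYNC_MAX_URLS = 10   # sync limit stated in the note above
ASYNC_MAX_URLS = 50  # async limit stated in the note above


def plan_requests(urls: List[str]) -> List[Tuple[List[str], bool]]:
    """Split urls into (chunk, async_mode) pairs that respect the limits.

    Hypothetical helper, not part of the Valyu SDK.
    """
    plan = []
    for start in range(0, len(urls), ASYNC_MAX_URLS):
        chunk = urls[start:start + ASYNC_MAX_URLS]
        # Chunks above the sync limit must be submitted with async_mode=True
        plan.append((chunk, len(chunk) > SYNC_MAX_URLS))
    return plan
```

Each pair could then be submitted as `valyu.contents(urls=chunk, async_mode=use_async)`.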

### Submit and wait

The simplest approach — pass `wait=True` to block until the job completes:

```python theme={null}
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    wait=True,              # blocks until job completes
    poll_interval=5,        # seconds between polls (default: 5)
    max_wait_time=3600,     # max seconds to wait (default: 3600)
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")
```

### Submit, then wait separately

For more control, submit first, then call `wait_for_contents_job()` with a progress callback:

```python theme={null}
# Submit — returns immediately
job = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2", ...],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Store the webhook_secret immediately — it is ONLY returned here
if job.get("webhook_secret"):
    save_webhook_secret(job["job_id"], job["webhook_secret"])

# Wait with progress tracking
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,
    max_wait_time=3600,
    on_progress=lambda s: print(f"  {s['status']} — batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"),
)

if result["status"] in ("completed", "partial"):
    for r in result["results"]:
        print(f"{r['title']}: {r['length']} characters")
```

### Manual polling

If you prefer full control over the polling loop:

```python theme={null}
import time

while True:
    status = valyu.get_contents_job(job["job_id"])
    print(f"Status: {status['status']}")

    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)
```
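A hand-written loop like the one above runs forever if the job never reaches a terminal state. The sketch below adds a deadline. It takes the status lookup as a callable so it can run without network access. `poll_until_done` is a hypothetical helper, not an SDK function, and the set of terminal statuses is taken from the loop above.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"completed", "partial", "failed"}


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    poll_interval: float = 5.0,
    max_wait_time: float = 3600.0,
) -> Dict[str, Any]:
    """Call get_status until it reports a terminal status or time runs out."""
    deadline = time.monotonic() + max_wait_time
    while True:
        status = get_status()
        if status["status"] in TERMINAL_STATUSES:
            return status
        # Give up if the next poll would land after the deadline
        if time.monotonic() + poll_interval > deadline:
            raise TimeoutError(
                f"job still {status['status']!r} after {max_wait_time}s"
            )
        time.sleep(poll_interval)
```

Assuming `job` comes from an earlier submit call, usage would look like `poll_until_done(lambda: valyu.get_contents_job(job["job_id"]))`.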

### Async parameters

| Parameter       | Type     | Description                                                              | Default |
| --------------- | -------- | ------------------------------------------------------------------------ | ------- |
| `async_mode`    | bool     | Process URLs asynchronously. Required for more than 10 URLs.             | False   |
| `webhook_url`   | str      | HTTPS URL to receive results via webhook POST.                           | None    |
| `wait`          | bool     | Block until the job completes (SDK handles polling).                     | False   |
| `poll_interval` | int      | Seconds between polls when `wait=True` or using `wait_for_contents_job`. | 5       |
| `max_wait_time` | int      | Max seconds to wait before timing out.                                   | 3600    |
| `on_progress`   | Callable | Callback invoked on each poll with the current status dict.              | None    |

### Webhook verification

Webhooks are signed with HMAC-SHA256 over a string of the form `"{timestamp}.{json_body}"`. See the [Content Extraction guide](/guides/content-extraction#verifying-webhook-signatures) for full verification examples.
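As a rough illustration of the signing scheme, the sketch below recomputes the HMAC over `"{timestamp}.{json_body}"` and compares it in constant time. It assumes the signature is the hex-encoded digest and that the timestamp and signature arrive alongside the raw request body. How they are transported and the exact encoding are assumptions here, so treat the linked guide as authoritative. `verify_webhook_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_webhook_signature(
    secret: str, timestamp: str, raw_body: str, signature: str
) -> bool:
    """Check an HMAC-SHA256 signature over "{timestamp}.{json_body}".

    Assumes a hex-encoded signature (an assumption, not confirmed by
    this page). Hypothetical helper, not part of the Valyu SDK.
    """
    signed_payload = f"{timestamp}.{raw_body}".encode("utf-8")
    expected = hmac.new(
        secret.encode("utf-8"), signed_payload, hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature)
```

Verify against the raw request body exactly as received. Re-serializing parsed JSON can change whitespace or key order and break the comparison.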

### Async response types

```python theme={null}
# Initial response (HTTP 202)
class ContentsAsyncResponse:
    success: bool
    job_id: str
    status: str               # Always "pending"
    urls_total: int
    poll_url: str
    tx_id: str
    webhook_secret: Optional[str]  # ONLY returned here — store immediately

# Job status response (polling / wait result)
class ContentsJobResponse:
    success: bool
    job_id: str
    status: str               # "pending" | "processing" | "completed" | "partial" | "failed"
    urls_total: int
    urls_processed: int
    urls_failed: int
    created_at: int           # Milliseconds since epoch
    updated_at: int
    current_batch: Optional[int]           # Present during "processing"
    total_batches: Optional[int]           # Present during "processing"
    results: Optional[List[ContentsResult]]  # Present when completed/partial
    actual_cost_dollars: Optional[float]     # Present when completed/partial
    error: Optional[str]                     # Present when partial/failed
```

## Async Client (asyncio)

<Note>
  This section covers the **`AsyncValyu` Python client** — calling
  the Contents API with `async`/`await` inside an event loop. That's
  different from the **server-side async jobs** described under
  [Async Processing](#async-processing) above, which use
  `async_mode=True` to submit a long-running job server-side. You
  can combine both: use `AsyncValyu` to submit and poll a server-side
  async job without blocking your event loop. The last example below
  shows how.
</Note>

`AsyncValyu.contents` accepts the exact same arguments and returns
the same response types (`ContentsResponse`,
`ContentsJobCreateResponse`, `ContentsJobStatus`) as the synchronous
`contents` — `summary`, `extract_effort`, `response_length`,
`max_price_dollars`, `screenshot`, `async_mode`, `webhook_url`,
`wait`, and the polling knobs all behave identically. The only
difference is that the call is `await`ed.

```python theme={null}
import asyncio
from valyu import AsyncValyu

async def main():
    async with AsyncValyu() as valyu:
        response = await valyu.contents(
            urls=["https://arxiv.org/abs/1706.03762"],
        )
        for r in response.results:
            print(r.title, "—", r.length, "chars")

asyncio.run(main())
```

### Fan out single-URL extractions

The natural fit for async contents is **many single-URL extractions
running in parallel** — for example, expanding a list of search
hits into full-text, or warming a cache from a feed of URLs:

```python theme={null}
import asyncio
from valyu import AsyncValyu

urls = [
    "https://arxiv.org/abs/1706.03762",
    "https://arxiv.org/abs/2005.14165",
    "https://arxiv.org/abs/2106.09685",
]

async def main():
    async with AsyncValyu() as valyu:
        responses = await asyncio.gather(*[
            valyu.contents(urls=[u]) for u in urls
        ])
        for url, r in zip(urls, responses):
            print(url, "→", r.urls_processed, "/", r.urls_requested)

asyncio.run(main())
```

Each request extracts one URL, and running them through
`asyncio.gather` means total wall time is roughly that of the slowest
single URL rather than the sum. Bound concurrency with
`asyncio.Semaphore` when your list grows past a few dozen. See the
[Async Usage section of the Python SDK overview](../python-sdk#async-usage)
for that pattern.
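A minimal sketch of bounding concurrency with `asyncio.Semaphore`, written against a generic coroutine function so it runs without network access. `gather_bounded` is a hypothetical helper name, not part of the SDK. In real use, `worker` would be something like `lambda u: valyu.contents(urls=[u])`.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def gather_bounded(
    worker: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    limit: int = 10,
) -> List[R]:
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item: T) -> R:
        # Each task waits here until one of the `limit` slots is free
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items
    return await asyncio.gather(*(run_one(item) for item in items))
```

The default `limit` of 10 is arbitrary. Pick a value that fits your rate limits and latency budget.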

### Server-side async jobs without blocking the event loop

When a single job spans 11-50 URLs, you have to use `async_mode=True`
(see [Async Processing](#async-processing) above). Calling that
through `AsyncValyu` means polling the job status without parking a
thread: `wait_for_contents_job` is awaitable and sleeps between polls
using `asyncio.sleep`, leaving the event loop free to do other work.

```python theme={null}
import asyncio
from valyu import AsyncValyu

urls = [f"https://example.com/page{i}" for i in range(1, 31)]

async def main():
    async with AsyncValyu() as valyu:
        job = await valyu.contents(urls=urls, async_mode=True)

        result = await valyu.wait_for_contents_job(
            job.job_id,
            poll_interval=5,
            max_wait_time=3600,
        )
        print(f"Completed {result.urls_processed}/{result.urls_total}")

asyncio.run(main())
```

You can also submit several server-side async jobs concurrently and
wait on all of them:

```python theme={null}
async def main():
    async with AsyncValyu() as valyu:
        batches = [
            [f"https://source-a.com/page{i}" for i in range(30)],
            [f"https://source-b.com/page{i}" for i in range(30)],
        ]
        jobs = await asyncio.gather(*[
            valyu.contents(urls=b, async_mode=True) for b in batches
        ])
        results = await asyncio.gather(*[
            valyu.wait_for_contents_job(j.job_id) for j in jobs
        ])
        for b, r in zip(batches, results):
            print(f"batch of {len(b)}: {r.urls_processed} processed")
```

### Wait-inline form

If you prefer a single call, pass `wait=True` and `AsyncValyu` will
submit the job and await its terminal state for you, the same way the
sync client does with blocking I/O:

```python theme={null}
async with AsyncValyu() as valyu:
    result = await valyu.contents(
        urls=urls,
        async_mode=True,
        wait=True,
        poll_interval=5,
    )
```

See the [Python SDK overview](../python-sdk#async-usage) for all
`AsyncValyu` constructor options and lifecycle patterns.

## Error Handling

```python theme={null}
def extract_contents(urls, **options):
    response = valyu.contents(urls, **options)

    if not response.success:
        print("Contents extraction failed:", response.error)
        return None

    # Check for partial failures
    if response.urls_failed and response.urls_failed > 0:
        print(f"{response.urls_failed} of {response.urls_requested} URLs failed")

    # Process successful results
    if response.results:
        for index, result in enumerate(response.results):
            print(f"Result {index + 1}:")
            print(f"  Title: {result.title}")
            print(f"  URL: {result.url}")
            print(f"  Length: {result.length} characters")

            if result.summary_success:
                print(f"  Summary: {result.content}")

    return response
