Turn any web page into clean, structured data. The Contents API extracts content from URLs with batch processing, AI-powered summaries, and structured outputs.
What You Can Do
Feed your AI - Clean data without noise
Aggregate content - Extract structured data from multiple sources
Transform content - Convert web pages into usable formats
Automate research - Pull key information from articles, papers, and reports
Features
Batch Processing: Submit up to 10 URLs synchronously, or up to 50 URLs with async mode.
AI-Powered Structuring: Use JSON schemas to extract specific data points.
Smart Summarisation: Generate tailored summaries with custom instructions.
Pay-per-Success: Only pay for URLs that are successfully processed.
Getting Started
from valyu import Valyu

valyu = Valyu()  # Uses VALYU_API_KEY from env

data = valyu.contents(
    urls=[
        "https://en.wikipedia.org/wiki/Artificial_intelligence",
    ],
    response_length="medium",
    extract_effort="auto",
)

print(data["results"][0]["content"][:500])
Returns clean markdown content for each URL.
Response Length
| Length | Characters | Use for |
| --- | --- | --- |
| short | 25,000 | Summaries, key points |
| medium | 50,000 | Articles, blog posts |
| large | 100,000 | Academic papers, long-form content |
| max | Unlimited | Full document extraction |
| Custom | Integer 1,000-1,000,000 | Specific requirements |
Extract Effort
| Effort | Description |
| --- | --- |
| normal | Standard speed and quality (default) |
| high | Better quality, slower |
| auto | Automatically chooses the right level |
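The response length presets can also be resolved client-side, for example to validate a custom integer before sending a request. The following is a minimal sketch that assumes the values in the Response Length table; `resolve_response_length` and `PRESET_LIMITS` are hypothetical names, not part of the SDK.

```python
from typing import Optional, Union

# Character limits copied from the Response Length table; None means unlimited.
PRESET_LIMITS = {"short": 25_000, "medium": 50_000, "large": 100_000, "max": None}


def resolve_response_length(value: Union[str, int]) -> Optional[int]:
    """Hypothetical helper: map a response_length value to a character limit."""
    if isinstance(value, str):
        if value not in PRESET_LIMITS:
            raise ValueError(f"unknown preset: {value!r}")
        return PRESET_LIMITS[value]
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        if not 1_000 <= value <= 1_000_000:
            raise ValueError("custom length must be between 1,000 and 1,000,000")
        return value
    raise TypeError("response_length must be a preset name or an integer")


print(resolve_response_length("large"))  # 100000
print(resolve_response_length(12_000))   # 12000
```

Checking locally like this only catches obvious mistakes early; the API remains the authority on what it accepts.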
Screenshot Capture
Capture visual screenshots of pages alongside content extraction:
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    screenshot=True,
)

print(data["results"][0]["screenshot_url"])
Screenshots are captured during page rendering and returned as pre-signed URLs. PDF files do not support screenshots.
Advanced Features
Summary Options
The summary field accepts four types of values:
No AI Processing (false)
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="normal",
    summary=False,
)

print(data["results"][0]["content"][:300])
Basic Summary (true)
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/article"],
    extract_effort="auto",
    summary=True,
)

print(data["results"][0]["content"])
Custom Instructions (string)
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/research-paper"],
    extract_effort="auto",
    summary="Summarise the methodology, key findings, and practical applications in 2-3 paragraphs",
)

print(data["results"][0]["content"])
Structured Extraction (object)
from valyu import Valyu

valyu = Valyu()

data = valyu.contents(
    urls=["https://example.com/product-page"],
    extract_effort="auto",
    # Passing a JSON Schema object returns structured content
    summary={
        "type": "object",
        "properties": {
            "product_name": {"type": "string", "description": "Name of the product"},
            "price": {"type": "number", "description": "Product price in USD"},
            "features": {
                "type": "array",
                "items": {"type": "string"},
                "maxItems": 5,
                "description": "Key product features",
            },
            "availability": {
                "type": "string",
                "enum": ["in_stock", "out_of_stock", "preorder"],
                "description": "Product availability status",
            },
        },
        "required": ["product_name", "price"],
    },
)

print(data["results"][0]["content"])
JSON Schema Reference
For structured extraction, you can use any valid JSON Schema. See the JSON Schema Type Reference for details.
Limits:
5,000 characters max
3 levels deep max
20 properties per object max
Common types:
string - Text with optional format validation
number / integer - Numbers with optional min/max
boolean - True/false
array - Lists with optional size limits
object - Nested structures
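The limits above can be checked before a request is sent. The following is a rough sketch under stated assumptions: that the character limit applies to the serialised schema, and that each nested object counts as one level. Neither detail is confirmed by this page, and `check_schema` is a hypothetical helper, not an SDK function.

```python
import json
from typing import List

# Limits copied from the list above.
MAX_CHARS = 5_000
MAX_DEPTH = 3
MAX_PROPERTIES = 20


def check_schema(schema: dict) -> List[str]:
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems: List[str] = []
    # Assumption: the character limit applies to the serialised schema.
    if len(json.dumps(schema)) > MAX_CHARS:
        problems.append(f"schema exceeds {MAX_CHARS} characters")

    def walk(node: dict, depth: int) -> None:
        props = node.get("properties")
        if isinstance(props, dict):
            # Assumption: each nested object counts as one level.
            if depth > MAX_DEPTH:
                problems.append(f"object nested deeper than {MAX_DEPTH} levels")
            if len(props) > MAX_PROPERTIES:
                problems.append(f"object has more than {MAX_PROPERTIES} properties")
            for child in props.values():
                if isinstance(child, dict):
                    walk(child, depth + 1)
        items = node.get("items")
        if isinstance(items, dict):
            walk(items, depth)  # array items stay at the same object depth

    walk(schema, 1)
    return problems


schema = {
    "type": "object",
    "properties": {"headline": {"type": "string"}},
    "required": ["headline"],
}
print(check_schema(schema))  # []
```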
Async Processing
For large-scale extraction (11-50 URLs) or non-blocking workflows, use async mode. Submit URLs, get a job_id back immediately, then either poll for results or receive them via webhook.
When to use async
More than 10 URLs — required for batches of 11-50 URLs
Non-blocking workflows — submit and continue processing while extraction runs in the background
Webhook-driven architectures — receive results via webhook instead of polling
Extended processing — async mode provides 120s timeout per URL vs 25s for sync
Async mode is required when submitting more than 10 URLs. For 1-10 URLs, you can optionally use async mode for non-blocking workflows.
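One way to apply these rules is to plan requests up front: split a long URL list into chunks of at most 50 and mark each chunk as sync or async. This is only a sketch; `plan_requests` is a hypothetical helper, and each plan would still be passed to `valyu.contents(...)` by the caller.

```python
from typing import Dict, List

SYNC_MAX_URLS = 10   # sync limit stated above
ASYNC_MAX_URLS = 50  # async limit stated above


def plan_requests(urls: List[str]) -> List[Dict]:
    """Hypothetical helper: chunk URLs and flag which chunks need async mode."""
    plans = []
    for start in range(0, len(urls), ASYNC_MAX_URLS):
        chunk = urls[start:start + ASYNC_MAX_URLS]
        plans.append({
            "urls": chunk,
            # Async mode is required above 10 URLs and optional otherwise.
            "async_mode": len(chunk) > SYNC_MAX_URLS,
        })
    return plans


for plan in plan_requests([f"https://example.com/page{i}" for i in range(60)]):
    print(len(plan["urls"]), plan["async_mode"])
# 50 True
# 10 False
```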
Quick start
from valyu import Valyu

valyu = Valyu()

# Submit async job; returns immediately with a job_id
job = valyu.contents(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        # ... up to 50 URLs
    ],
    async_mode=True,
    webhook_url="https://your-app.com/webhooks/valyu",  # optional
)
print(f"Job ID: {job['job_id']}")

# Option 1: Block until complete (SDK handles polling)
result = valyu.wait_for_contents_job(
    job["job_id"],
    poll_interval=5,  # seconds between polls
    max_wait_time=3600,  # max seconds to wait
    on_progress=lambda s: print(
        f"{s['status']} - batch {s.get('current_batch', '?')}/{s.get('total_batches', '?')}"
    ),
)

# Option 2: Or pass wait=True when submitting to auto-poll
result = valyu.contents(
    urls=["https://example.com/page1", "https://example.com/page2"],
    async_mode=True,
    wait=True,  # blocks until job completes
)

for r in result["results"]:
    print(f"{r['title']}: {r['length']} characters")
print(f"Total cost: ${result['actual_cost_dollars']}")
The webhook_secret is only returned in the initial 202 response. Store it immediately — you cannot retrieve it later.
Initial response (HTTP 202)
{
  "success": true,
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "pending",
  "urls_total": 25,
  "poll_url": "/contents/jobs/cj_a1b2c3d4e5f6g7h8",
  "tx_id": "tx_async-1234-5678-abcd-ef0123456789",
  "webhook_secret": "a1b2c3d4e5f6..."
}
Manual polling
If you prefer to poll manually instead of using the SDK convenience methods:
import time

# Poll until the job reaches a terminal status
while True:
    status = valyu.get_contents_job(job["job_id"])
    print(
        f"Status: {status['status']} "
        f"({status.get('current_batch', '?')}/{status.get('total_batches', '?')} batches)"
    )
    if status["status"] in ("completed", "partial", "failed"):
        break
    time.sleep(2)

if status["status"] in ("completed", "partial"):
    for result in status["results"]:
        print(f"{result['title']}: {result['length']} characters")

if status["status"] in ("partial", "failed"):
    print(f"Error: {status.get('error', 'Unknown')}")
Job lifecycle
Jobs progress through these statuses:
| Status | Description |
| --- | --- |
| pending | Job created, not yet started |
| processing | URLs being processed in batches of 5 |
| completed | All URLs processed successfully |
| partial | Finished with some URL failures |
| failed | All URLs failed |
Job status fields
| Field | Type | Description | Present |
| --- | --- | --- | --- |
| job_id | string | Job identifier (prefixed cj_) | Always |
| status | string | Current job status (see above) | Always |
| urls_total | number | Total URLs submitted | Always |
| urls_processed | number | URLs successfully processed | Always |
| urls_failed | number | URLs that failed processing | Always |
| created_at | number | Job creation time (ms since epoch) | Always |
| updated_at | number | Last update time (ms since epoch) | Always |
| current_batch | number | Current batch being processed | processing only |
| total_batches | number | Total number of batches | processing only |
| results | array | Extraction results | completed/partial |
| actual_cost_dollars | number | Total cost in dollars | completed/partial |
| error | string | Error description | partial/failed only |
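The status fields can be turned into a readable progress line. The sketch below assumes a payload shaped like the fields listed above; `describe_job` and `is_finished` are hypothetical helpers, and the sample values are made up for illustration.

```python
from datetime import datetime, timezone

# Statuses after which a job no longer changes, per the job lifecycle.
TERMINAL_STATUSES = {"completed", "partial", "failed"}


def describe_job(status: dict) -> str:
    """Hypothetical helper: one-line summary of a job status payload."""
    # created_at is in milliseconds since the epoch, so divide by 1000.
    created = datetime.fromtimestamp(status["created_at"] / 1000, tz=timezone.utc)
    done = status["urls_processed"] + status["urls_failed"]
    line = (
        f"{status['job_id']} [{status['status']}] "
        f"{done}/{status['urls_total']} URLs, created {created:%Y-%m-%d %H:%M} UTC"
    )
    # Batch counters are only present while the job is processing.
    if "current_batch" in status and "total_batches" in status:
        line += f", batch {status['current_batch']}/{status['total_batches']}"
    return line


def is_finished(status: dict) -> bool:
    """True once the job has reached a terminal status."""
    return status["status"] in TERMINAL_STATUSES


sample = {  # illustrative values only
    "job_id": "cj_a1b2c3d4e5f6g7h8",
    "status": "processing",
    "urls_total": 25,
    "urls_processed": 9,
    "urls_failed": 1,
    "created_at": 1700000000000,
    "updated_at": 1700000030000,
    "current_batch": 3,
    "total_batches": 5,
}
print(describe_job(sample), is_finished(sample))
```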
Webhooks
When you provide a webhook_url, Valyu sends a POST request to your endpoint when the job completes (or partially completes/fails).
Headers:
| Header | Value |
| --- | --- |
| Content-Type | application/json |
| User-Agent | Valyu-Contents/1.0 |
| X-Webhook-Signature | sha256={hex_digest} |
| X-Webhook-Timestamp | Unix timestamp in seconds |
Payload:
{
  "job_id": "cj_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "urls_total": 25,
  "urls_processed": 25,
  "urls_failed": 0,
  "results": [ ... ],
  "actual_cost_dollars": 0.025,
  "error": null
}
Retries: Max 5 attempts with exponential backoff (1s, 2s, 4s, 8s). No retry on 4xx. 5s connect timeout, 15s read timeout.
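To reason about how long a failing endpoint keeps receiving deliveries, the retry schedule can be modelled as follows. This illustrates the policy as described and is not Valyu's implementation. `retry_delays` and `should_retry` are hypothetical names, and the sketch assumes that a 2xx response counts as success and that any response outside 2xx and 4xx is retried.

```python
from typing import List


def retry_delays(max_attempts: int = 5, base_seconds: float = 1.0) -> List[float]:
    """Delays between attempts: one fewer than the number of attempts."""
    return [base_seconds * 2 ** i for i in range(max_attempts - 1)]


def should_retry(status_code: int, attempt: int, max_attempts: int = 5) -> bool:
    """Whether another delivery attempt would follow (attempt is 1-based)."""
    if attempt >= max_attempts:
        return False  # attempts exhausted
    if 200 <= status_code < 300:
        return False  # assumption: 2xx means delivered
    if 400 <= status_code < 500:
        return False  # no retry on 4xx
    return True


print(retry_delays())       # [1.0, 2.0, 4.0, 8.0]
print(sum(retry_delays()))  # 15.0
```

Under this model an endpoint that returns 5xx may receive the same job up to five times, so keying webhook handling on `job_id` keeps processing idempotent.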
Verifying webhook signatures
The payload is signed using HMAC-SHA256. The signed payload is the timestamp and JSON body joined by a period: "{timestamp}.{json_payload}".
import hmac
import hashlib


def verify_webhook(
    payload: bytes,
    signature_header: str,
    timestamp_header: str,
    webhook_secret: str,
) -> bool:
    # The signed payload is "{timestamp}.{json_payload}"
    signed_payload = f"{timestamp_header}.{payload.decode('utf-8')}"
    expected = hmac.new(
        webhook_secret.encode("utf-8"),
        signed_payload.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()
    # Strip the "sha256=" prefix, then compare in constant time
    received = signature_header.removeprefix("sha256=")
    return hmac.compare_digest(expected, received)
The TypeScript SDK exports a verifyContentsWebhookSignature() helper that handles this for you.
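For local testing it can help to generate a signature yourself and confirm that verification accepts it and rejects tampered bodies. The sketch below assumes the signing scheme described above. `sign` and `is_valid` are hypothetical names, and the 300-second freshness window is an added assumption, since this page does not specify a timestamp tolerance.

```python
import hashlib
import hmac
import time
from typing import Optional


def sign(payload: bytes, timestamp: str, secret: str) -> str:
    """Build the header value for "{timestamp}.{json_payload}" (test helper)."""
    signed = f"{timestamp}.{payload.decode('utf-8')}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), signed, hashlib.sha256).hexdigest()
    return f"sha256={digest}"


def is_valid(
    payload: bytes,
    signature_header: str,
    timestamp_header: str,
    secret: str,
    tolerance_seconds: int = 300,  # assumption: not specified by the API docs
    now: Optional[float] = None,
) -> bool:
    """Check the signature, then reject timestamps outside the tolerance window."""
    expected = sign(payload, timestamp_header, secret)
    # Compare as bytes in constant time.
    if not hmac.compare_digest(expected.encode("utf-8"), signature_header.encode("utf-8")):
        return False
    try:
        sent_at = int(timestamp_header)
    except ValueError:
        return False
    current = time.time() if now is None else now
    return abs(current - sent_at) <= tolerance_seconds


body = b'{"job_id": "cj_test", "status": "completed"}'
header = sign(body, "1700000000", "example-secret")
print(is_valid(body, header, "1700000000", "example-secret", now=1700000010))  # True
```

Verification must run on the raw request bytes. Re-serialising parsed JSON can change whitespace or key order and invalidate the signature.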
Limits and pricing
| Detail | Value |
| --- | --- |
| Max URLs (sync) | 10 |
| Max URLs (async) | 50 |
| Batch size | 5 URLs per batch |
| Timeout per URL (sync) | 25 seconds |
| Timeout per URL (async) | 120 seconds |
| Base pricing | $0.001 per URL |
| AI features (summary/schema) | +$0.001 per URL |
| Job expiry (TTL) | 7 days |
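Because only successfully processed URLs are billed, the prices above give an upper bound on the cost of a request. A minimal sketch of that arithmetic follows; `estimate_max_cost` is a hypothetical helper, and `Decimal` is used to avoid floating-point rounding.

```python
from decimal import Decimal

BASE_PRICE = Decimal("0.001")    # per URL, from the table above
AI_SURCHARGE = Decimal("0.001")  # per URL when summary or schema is used


def estimate_max_cost(url_count: int, uses_ai: bool = False) -> Decimal:
    """Hypothetical helper: upper bound on cost, assuming every URL succeeds."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    per_url = BASE_PRICE + (AI_SURCHARGE if uses_ai else Decimal("0"))
    return per_url * url_count


print(estimate_max_cost(25))                # 0.025
print(estimate_max_cost(50, uses_ai=True))  # 0.100
```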
Examples
News Aggregator
Extract structured article data:
{
  "urls": [
    "https://en.wikipedia.org/wiki/Artificial_intelligence",
    "https://archive.org/details/texts",
    "https://www.gov.uk/search/news"
  ],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "headline": { "type": "string" },
      "summary_text": { "type": "string" },
      "category": { "type": "string" },
      "tags": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["headline"]
  }
}
Research Paper
Extract structured academic data:
{
  "urls": ["https://arxiv.org/paper/example"],
  "response_length": "max",
  "extract_effort": "high",
  "summary": {
    "type": "object",
    "properties": {
      "title": { "type": "string" },
      "abstract": { "type": "string" },
      "methodology": { "type": "string" },
      "key_findings": {
        "type": "array",
        "items": { "type": "string" }
      },
      "limitations": { "type": "string" }
    },
    "required": ["title"]
  }
}
Product Info
Extract product data:
{
  "urls": ["https://company.com/product-A", "https://company.com/product-B"],
  "extract_effort": "auto",
  "summary": {
    "type": "object",
    "properties": {
      "product_name": { "type": "string" },
      "features": {
        "type": "array",
        "items": { "type": "string" }
      },
      "pricing": { "type": "string" },
      "target_audience": { "type": "string" }
    },
    "required": ["product_name"]
  }
}
Response Format
Raw Content (summary: false)
{
  "success": true,
  "error": null,
  "tx_id": "tx_12345678-1234-1234-1234-123456789abc",
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "url": "https://example.com/article?utm_source=valyu",
      "content": "# AI Breakthrough in Natural Language Processing\n\nPage content in markdown...",
      "description": "Latest AI developments",
      "source": "web",
      "price": 0.001,
      "length": 12840,
      "data_type": "unstructured",
      "image_url": {
        "main": "https://example.com/hero-image.jpg"
      }
    }
  ],
  "urls_requested": 1,
  "urls_processed": 1,
  "urls_failed": 0,
  "total_cost_dollars": 0.001,
  "total_characters": 12840
}
Summary (summary: true or string)
{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": "This article discusses a breakthrough in AI...",
      "summary_success": true,
      "price": 0.002,
      "data_type": "unstructured"
    }
  ]
}
Structured (JSON Schema)
{
  "success": true,
  "results": [
    {
      "title": "AI Breakthrough in Natural Language Processing",
      "content": {
        "title": "AI Breakthrough in Natural Language Processing",
        "author": "John Doe",
        "category": "technology",
        "key_points": [
          "New AI model achieves 95% accuracy",
          "Reduces computational requirements by 40%"
        ]
      },
      "summary_success": true,
      "price": 0.002,
      "data_type": "structured"
    }
  ]
}
Response Fields
| Field | Description |
| --- | --- |
| title | Extracted page title |
| url | Original URL with UTM tracking parameters |
| content | Extracted content (markdown or JSON) |
| description | Page meta description or excerpt |
| source | Source type; always "web" for URL processing |
| price | Cost for processing this URL in dollars |
| length | Character count of extracted content |
| data_type | "unstructured" or "structured" |
| summary_success | Whether AI processing succeeded (only when the summary parameter is used) |
| image_url | Dictionary of extracted image URLs |
| screenshot_url | Pre-signed URL to page screenshot (only when screenshot=true was requested) |
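Since `content` is markdown text for unstructured results and a JSON object for structured ones, downstream code usually needs to branch on `data_type`. The sketch below assumes result dictionaries shaped like the response examples on this page; `content_as_text` is a hypothetical helper.

```python
import json


def content_as_text(result: dict) -> str:
    """Hypothetical helper: return content as a string whatever its data_type."""
    content = result.get("content")
    # Structured results carry a JSON object; serialise it for display or logging.
    if result.get("data_type") == "structured" and isinstance(content, (dict, list)):
        return json.dumps(content, indent=2, ensure_ascii=False)
    # Unstructured results are already markdown text.
    return content if isinstance(content, str) else ""


raw = {"data_type": "unstructured", "content": "# Title\n\nBody text"}
structured = {"data_type": "structured", "content": {"title": "Title", "key_points": ["a", "b"]}}

print(content_as_text(raw))
print(content_as_text(structured))
```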
Best Practices
Choosing Summary Type
false: Fastest and cheapest—no AI
true: Basic summary for overviews
"string": Custom instructions for specific needs
{object}: Structured extraction for data processing
JSON Schema Tips
Use clear descriptions to guide extraction
Use enums for consistent categorisation
Keep schemas under 3 levels deep
Mark essential fields as required
Batch Processing
Group similar content types together
Choose appropriate response length
Check summary_success for AI status
Track total_cost_dollars
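The last two points can be folded into a small post-processing step. The sketch below assumes a synchronous response shaped like the examples earlier on this page; `summarise_batch` is a hypothetical helper, and the sample response is made up for illustration.

```python
def summarise_batch(response: dict) -> dict:
    """Hypothetical helper: collect AI failures and cost from a sync response."""
    results = response.get("results", [])
    # summary_success is only present when the summary parameter was used.
    ai_failures = [r.get("url") for r in results if r.get("summary_success") is False]
    return {
        "processed": response.get("urls_processed", len(results)),
        "failed": response.get("urls_failed", 0),
        "ai_failures": ai_failures,
        "total_cost_dollars": response.get("total_cost_dollars"),
    }


sample_response = {  # illustrative values only
    "success": True,
    "results": [
        {"url": "https://example.com/a", "summary_success": True, "price": 0.002},
        {"url": "https://example.com/b", "summary_success": False, "price": 0.002},
    ],
    "urls_requested": 3,
    "urls_processed": 2,
    "urls_failed": 1,
    "total_cost_dollars": 0.004,
}
print(summarise_batch(sample_response))
```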
Error Handling
# `response` is the raw HTTP response from a direct call to the Contents API
body = response.json()

# Check for partial failures (HTTP 206)
if response.status_code == 206:
    successful_results = body["results"]
    failed_count = body["urls_failed"]

# Check AI processing success
for result in body.get("results", []):
    # summary_success is only present when the summary parameter was used
    if result.get("summary_success") is False:
        print(f"AI processing failed for {result['url']}")

# Handle complete failures (HTTP 422)
if response.status_code == 422:
    error_message = body["error"]
Try the Contents API: Full API reference with interactive examples
Next Steps
API Reference: Complete parameter documentation
Python SDK: Python integration
TypeScript SDK: TypeScript integration
Integrations: LangChain, LlamaIndex, and more