Contents Extraction

Pull clean Markdown or HTML from any URL — ideal for LLM ingestion.


Hand the Contents API a list of URLs and get back clean Markdown (or HTML) plus page metadata — no headless browser, no parsing. It’s the fastest way to turn arbitrary web pages into LLM-ready text without standing up your own crawler.


What You’ll Build

A one-call extractor that turns any URL into clean Markdown — title, body, and page metadata in a single response. The formats parameter lets you ask for Markdown, raw HTML, or both, and crawl_timeout caps the wait per page.
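As a rough sketch of how those two parameters could map onto a raw HTTP request, the snippet below builds (but does not send) a POST to the endpoint shown in the Try It Live section. The helper name build_contents_request is hypothetical, the "markdown" format string is inferred from the SDK's MARKDOWN enum, and the crawl_timeout JSON field is assumed to share the SDK keyword's name; check the API reference before relying on either.

```python
import json
import urllib.request

# Endpoint and header names are taken from the curl request in Try It Live.
CONTENTS_URL = "https://ydc-index.io/v1/contents"


def build_contents_request(urls, api_key, formats=("markdown",), crawl_timeout=None):
    """Build, but do not send, a POST request for the Contents API (hypothetical helper)."""
    payload = {"urls": list(urls), "formats": list(formats)}
    if crawl_timeout is not None:
        # Assumption: the JSON body field has the same name as the SDK keyword.
        payload["crawl_timeout"] = crawl_timeout
    return urllib.request.Request(
        CONTENTS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


# Ask for both Markdown and HTML, waiting at most 15 seconds per page.
req = build_contents_request(
    ["https://en.wikipedia.org/wiki/Main_Page"],
    api_key="<apiKey>",
    formats=["markdown", "html"],  # "markdown" is an assumed format string
    crawl_timeout=15,
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the resulting object to urllib.request.urlopen(req) would send it. Keeping construction separate from sending makes the payload easy to inspect before spending an API call.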


Try It Live

Run a real Contents API request with no setup beyond an API key: replace <apiKey> in the request below, swap in any URL, and send it to the live endpoint.

POST /v1/contents

curl -X POST https://ydc-index.io/v1/contents \
  -H "X-API-Key: <apiKey>" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://en.wikipedia.org/wiki/Main_Page"
    ],
    "formats": [
      "html",
      "metadata"
    ]
  }'

Response
[
  {
    "url": "https://en.wikipedia.org/wiki/Main_Page",
    "title": "Wikipedia, the free encyclopedia",
    "html": "Wikipedia was just a dream.\ndiv class=\"frb-subheader\">\n<span class=\"frb-replaced\">December 2</span>: Readers <span class=\"frb-replaced\">in the United States</span> deserve an explanation.\n</div>\n</div>\n<div class=\"frb-message-content\">\n<p>\nPlease don't skip this 1-minute read. It's <span class=\"frb-replaced\">Tuesday</span>, <span class=\"frb-replaced\">December 2</span>, and if you're like us, you've used Wikipedia countless times. To settle an argument with a friend. To satisfy a curiosity. Whether it's 3 in the morning or afternoon, Wikipedia is useful in your life. Please give <span class=\"frb-replaced\">$2.75</span>.\n</p>\n<p>\nWikipedia's been around since 2001. Back then, it was just a wildly ambitious, probably impossible dream. But it came together piece by piece—created by people, not machines. Wikipedia's not perfect, but it's always been free thanks to everyday readers.\n</p>\n<p>\nOnly 2% ever donate. But that small group makes a big difference. When you support Wikipedia, you're standing up for something simple",
    "metadata": {
      "site_name": "Wikipedia",
      "favicon_url": "https://api.ydc-index.io/favicon?domain=en.wikipedia.org&size=128"
    }
  }
]
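The response is a JSON array with one object per requested URL. A minimal sketch of pulling the useful fields out of that shape follows; summarize_pages is a hypothetical helper, the sample payload is a shortened stand-in for the response above, and the "markdown" key is assumed to appear when Markdown is requested.

```python
import json


def summarize_pages(response_text):
    """Return url, title, site name, and body length for each page in a Contents response."""
    pages = json.loads(response_text)
    return [
        {
            "url": page.get("url"),
            "title": page.get("title"),
            # metadata is only present when it was requested in "formats".
            "site_name": (page.get("metadata") or {}).get("site_name"),
            # Prefer Markdown (assumed key) and fall back to HTML.
            "body_chars": len(page.get("markdown") or page.get("html") or ""),
        }
        for page in pages
    ]


# Shortened stand-in for the response shown above.
sample = json.dumps([
    {
        "url": "https://en.wikipedia.org/wiki/Main_Page",
        "title": "Wikipedia, the free encyclopedia",
        "html": "<p>Hello</p>",
        "metadata": {"site_name": "Wikipedia"},
    }
])

summary = summarize_pages(sample)
print(summary)
```

Using .get() throughout keeps the helper tolerant of pages where a requested format is missing, for example when a crawl times out.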

Prerequisites

$ pip install youdotcom          # Python ≥ 3.10
$ npm install @youdotcom-oss/sdk # Node ≥ 20

Walkthrough

contents.py
"""Contents — fetch clean Markdown from any URL via the You.com Contents API."""

import os
import sys

from youdotcom import You, models

# take URL from command line, or use a default
url = sys.argv[1] if len(sys.argv) > 1 else "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"

# initialize the client with your API key
with You(api_key_auth=os.environ["YDC_API_KEY"]) as you:
    pages = you.contents.generate(
        urls=[url],
        formats=[models.ContentsFormats.MARKDOWN],
        crawl_timeout=15,
    )

# print the title and the first 500 chars of the markdown body
for page in pages:
    print(page.title)
    print(page.url)
    print()
    print((page.markdown or "")[:500] + "...")

$ export YDC_API_KEY="your-api-key-here"
$ python contents.py "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"
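The script's final print appends "..." even when the body is shorter than 500 characters. One possible refinement, sketched here with a hypothetical preview helper that is not part of the SDK, adds the ellipsis only when text was actually cut:

```python
def preview(text, limit=500):
    """Return at most `limit` characters of text, marking truncation with '...'."""
    text = text or ""  # treat a missing body (None) as empty
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."


print(preview("# Short page"))        # printed unchanged
print(preview("abcdefgh", limit=5))   # cut to five characters plus "..."
```

In the walkthrough this would replace the last line of the loop with print(preview(page.markdown)).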

Example Output

# Retrieval-augmented generation

Retrieval-augmented generation (RAG) is a technique that grants generative
artificial intelligence models information retrieval capabilities. It modifies
interactions with a large language model so that the model responds to user
queries with reference to a specified set of documents…

(Returned alongside title, url, and metadata.site_name = "Wikipedia". Full
Markdown body is ~12,000 chars.)
