Contents Extraction | You.com

Hand the Contents API a list of URLs and get back clean Markdown (or HTML) plus page metadata — no headless browser, no parsing. It’s the fastest way to turn arbitrary web pages into LLM-ready text without standing up your own crawler.

What You’ll Build

A one-call extractor that turns any URL into clean Markdown — title, body, and page metadata in a single response. The formats parameter lets you ask for Markdown, raw HTML, or both, and crawl_timeout caps the wait per page.
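As a rough illustration of how these two parameters fit into a request, the sketch below assembles a JSON body by hand. The `urls` and `formats` keys and the `html` and `metadata` values appear in the curl example on this page. The `markdown` value and a top-level `crawl_timeout` key (presumably in seconds) are assumptions inferred from the SDK call, and `build_contents_payload` is a hypothetical helper, not part of any SDK.

```python
import json

# "html" and "metadata" appear in the curl example on this page;
# "markdown" is assumed from the SDK's ContentsFormats.MARKDOWN.
KNOWN_FORMATS = ("markdown", "html", "metadata")


def build_contents_payload(urls, formats=("markdown",), crawl_timeout=None):
    """Assemble a request body for POST /v1/contents (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one URL is required")
    unknown = [f for f in formats if f not in KNOWN_FORMATS]
    if unknown:
        raise ValueError(f"unknown formats: {unknown}")
    payload = {"urls": list(urls), "formats": list(formats)}
    if crawl_timeout is not None:
        # Assumption: the REST body uses the same name as the SDK keyword.
        payload["crawl_timeout"] = crawl_timeout
    return payload


payload = build_contents_payload(
    ["https://en.wikipedia.org/wiki/Main_Page"],
    formats=("markdown", "html"),
    crawl_timeout=15,
)
print(json.dumps(payload, indent=2))
```

Validating format names locally is a design choice for this sketch only; the API itself may accept values not listed here.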

Try It Live

The request below posts a single URL to the live /v1/contents endpoint and asks for the html and metadata formats. Replace <apiKey> with your own API key.

POST /v1/contents

curl -X POST https://ydc-index.io/v1/contents \
     -H "X-API-Key: <apiKey>" \
     -H "Content-Type: application/json" \
     -d '{
  "urls": [
    "https://en.wikipedia.org/wiki/Main_Page"
  ],
  "formats": [
    "html",
    "metadata"
  ]
}'
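For readers who prefer Python's standard library to curl, the same request can be sketched with `urllib.request`. The endpoint, headers, and body mirror the curl call above. `build_request` and `fetch_contents` are hypothetical helper names, and the sketch assumes the endpoint answers with a JSON body.

```python
import json
import urllib.request

API_URL = "https://ydc-index.io/v1/contents"


def build_request(api_key, urls, formats):
    """Build (but do not send) the POST request from the curl example."""
    body = json.dumps({"urls": urls, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def fetch_contents(api_key, urls, formats, timeout=30):
    """Send the request and decode the JSON response (needs network access)."""
    request = build_request(api_key, urls, formats)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request is offline-safe; only fetch_contents touches the network.
request = build_request(
    "<apiKey>",
    ["https://en.wikipedia.org/wiki/Main_Page"],
    ["html", "metadata"],
)
print(request.get_method(), request.full_url)
```

Splitting request construction from sending keeps the first half inspectable without an API key or network access.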

Response

[
  {
    "url": "https://en.wikipedia.org/wiki/Main_Page",
    "title": "Wikipedia, the free encyclopedia",
    "html": "Wikipedia was just a dream.\ndiv class=\"frb-subheader\">\n<span class=\"frb-replaced\">December 2</span>: Readers <span class=\"frb-replaced\">in the United States</span> deserve an explanation.\n</div>\n</div>\n<div class=\"frb-message-content\">\n<p>\nPlease don't skip this 1-minute read. It's <span class=\"frb-replaced\">Tuesday</span>, <span class=\"frb-replaced\">December 2</span>, and if you're like us, you've used Wikipedia countless times. To settle an argument with a friend. To satisfy a curiosity. Whether it's 3 in the morning or afternoon, Wikipedia is useful in your life. Please give <span class=\"frb-replaced\">$2.75</span>.\n</p>\n<p>\nWikipedia's been around since 2001. Back then, it was just a wildly ambitious, probably impossible dream. But it came together piece by piece—created by people, not machines. Wikipedia's not perfect, but it's always been free thanks to everyday readers.\n</p>\n<p>\nOnly 2% ever donate. But that small group makes a big difference. When you support Wikipedia, you're standing up for something simple",
    "metadata": {
      "site_name": "Wikipedia",
      "favicon_url": "https://api.ydc-index.io/favicon?domain=en.wikipedia.org&size=128"
    }
  }
]
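To show how the fields in this response might be consumed, the sketch below summarizes each page from an already-decoded response list. It relies only on the keys visible in the sample above (`url`, `title`, `html`, `metadata.site_name`). `summarize_pages` is a hypothetical helper, the sample data is abbreviated, and treating any of these keys as optional is a defensive assumption rather than documented behavior.

```python
def summarize_pages(pages):
    """Return one small dict per page, tolerating missing keys."""
    summaries = []
    for page in pages:
        metadata = page.get("metadata") or {}
        html = page.get("html") or ""
        summaries.append(
            {
                "url": page.get("url"),
                "title": page.get("title"),
                "site_name": metadata.get("site_name"),
                "html_chars": len(html),
            }
        )
    return summaries


# Abbreviated stand-in for the sample response shown above.
sample = [
    {
        "url": "https://en.wikipedia.org/wiki/Main_Page",
        "title": "Wikipedia, the free encyclopedia",
        "html": "<p>Wikipedia's been around since 2001.</p>",
        "metadata": {"site_name": "Wikipedia"},
    }
]

for summary in summarize_pages(sample):
    print(summary["title"], "|", summary["site_name"], "|", summary["html_chars"])
```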

Prerequisites

Get a You.com API key

$ pip install youdotcom        # Python ≥ 3.10
$ npm install @youdotcom-oss/sdk   # Node ≥ 20

Walkthrough

Python

contents.py

"""Contents: fetch clean Markdown from any URL via the You.com Contents API."""

import os
import sys

from youdotcom import You, models

# take URL from command line, or use a default
url = sys.argv[1] if len(sys.argv) > 1 else "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"

# initialize the client with your API key (read from the YDC_API_KEY env var)
with You(api_key_auth=os.environ["YDC_API_KEY"]) as you:
    pages = you.contents.generate(
        urls=[url],
        formats=[models.ContentsFormats.MARKDOWN],
        crawl_timeout=15,
    )

# print the title, the URL, and the first 500 chars of the markdown body
for page in pages:
    print(page.title)
    print(page.url)
    print()
    body = page.markdown or ""
    # append an ellipsis only when the body was actually truncated
    print(body[:500] + ("..." if len(body) > 500 else ""))

$ export YDC_API_KEY="your-api-key-here"
$ python contents.py "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"

Example Output

# Retrieval-augmented generation

Retrieval-augmented generation (RAG) is a technique that grants generative
artificial intelligence models information retrieval capabilities. It modifies
interactions with a large language model so that the model responds to user
queries with reference to a specified set of documents…

(Returned alongside title, url, and metadata.site_name = "Wikipedia". Full
Markdown body is ~12,000 chars.)

Next Steps

Simple Search

Find the URLs to extract, then feed them to Contents.

Research Agent

Let the Research API search, read, and synthesize for you.

Contents API Reference

Full docs for formats, crawl timeouts, and metadata.