TL;DR: We built an AI-driven system for fundamental stock research that mimics how real investment teams work. It ingests data from filings, earnings calls, and real-time news via the You.com Search API. Specialized agents analyze documents, build financial models, and develop an investment thesis. A Portfolio Manager agent then challenges that thesis to ensure analytical depth and defensible insights. Unlike traditional approaches, the system surfaces signals missed by surface-level analysis, prioritizes numerical precision, and integrates structured disagreement to avoid mediocre conclusions. The result: AI-generated research with genuine conviction and market-gap identification.
An analyst at a fund does not sit down with a blank page and write a stock report from memory. They read filings, build models, form a thesis, and then defend that thesis to a portfolio manager (PM) who is trying to poke holes in it. The structure of the team is what produces good research, not just the skill of any individual analyst.
We built a system of agents to perform fundamental research on stocks. It ingests earnings calls, SEC filings, financial data, and real-time news and market discourse via the You.com Search API. It produces investment memos backed by qualitative analysis and financial models. It is not one model answering a prompt. It is a team of agents, each with a distinct role: a document analyzer, an analyst, and a PM that challenges the research before it ships.
We designed it this way because that is how real research teams work. Most AI-generated research skips all of that. It reads the same documents everyone else reads and produces the same surface-level conclusions. The output looks like research, but it contains no real insight.
In a domain as competitive as financial markets, that kind of analysis is worthless.
The Challenge of Building This Agent
Financial markets are one of the most competitive domains in our economy. The system is built on the assumption that markets are mostly efficient. If we tell an agent to read the news about a company and summarize it, that summary is worthless because all of that information is already priced into the stock. To build something useful, we need a system that can reason over structured and unstructured data to surface insights the market has not yet absorbed.
This is harder than it sounds. In financial documents, the edge is rarely in the headline numbers. It lives in the footnotes of an SEC filing, in the shift in tone from a CEO on an earnings call, in the gap between what management says and what the numbers show. A system that only retrieves relevant text will miss these signals—it needs to actually understand what it has read.
On top of that, LLMs also struggle with numerical reasoning. Financial analysis requires building models, running projections, and comparing valuations. Asking a language model to do arithmetic in its head is unreliable. Any system that produces financial research needs a way to offload computation to something more precise.
Finally, there is the evaluation problem. AI models are trained on historical data from the internet. A model already knows how a stock moved over the past ten years even if you don’t explicitly tell it. This makes traditional backtesting meaningless: you can’t feed a model last year’s data and test its predictions, because the model has already seen the answer.
Each of these problems shaped a specific design decision in the system we built.
From Data to Thesis
The system works in two stages. First, it ingests and analyzes raw data. Then, the analyst builds an investment thesis on top of it.
Data Ingestion
The system ingests data from multiple sources: quarterly financials, price history, SEC filings, and earnings call transcripts. These are the foundation of any fundamental analysis, but they are backward-looking by nature. A 10-Q tells you what happened last quarter, and an earnings call tells you what management thought about it. To understand what is happening right now, we use the You.com Search API to pull in the freshest news, analyst commentary, and market discourse. This gives the system a layer of real-time context that static filings can’t provide.
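In code, the real-time layer is just a search call at ingestion time. Here is a minimal sketch assuming a generic JSON search endpoint; the exact You.com Search API URL, query parameters, and auth header differ from these placeholders, so consult the official API docs for the real contract.

```python
import json
import urllib.parse
import urllib.request


def build_news_query(company: str, num_results: int = 10) -> dict:
    """Search parameters for recent market coverage of a company."""
    return {
        "query": f"{company} stock news analyst commentary",
        "count": num_results,  # parameter name is an assumption
    }


def fetch_recent_news(company: str, api_url: str, api_key: str) -> list[dict]:
    """Pull fresh news and commentary to layer on top of static filings."""
    url = api_url + "?" + urllib.parse.urlencode(build_news_query(company))
    # Auth header name is illustrative, not the documented one.
    req = urllib.request.Request(url, headers={"X-API-Key": api_key})
    with urllib.request.urlopen(req, timeout=10) as resp:
        payload = json.load(resp)
    return [
        {"title": r.get("title"), "url": r.get("url"), "snippet": r.get("snippet")}
        for r in payload.get("results", [])
    ]
```

Each result is stored alongside the filings and transcripts so the downstream agents see one unified corpus per ticker.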
You may be thinking, “Why not just use a traditional RAG pipeline?” We considered it: chunk the documents, embed them, retrieve relevant sections at query time. But ultimately we decided against it, because the most valuable information in financial documents is not in the obvious places. It’s in footnote disclosures, shifts in management tone across quarters, and gaps between what a CEO says on an earnings call and what the numbers show. Chunked embeddings will not preserve those signals.
Instead, we built a Document Analyzer agent that reads every new document at ingestion:
- Each source type gets its own analysis instructions
- The financials analyzer looks for margin trends, pricing power signals, and earnings quality
- The earnings call analyzer detects hedging language, tracks management tone, and pulls direct quotes
- The SEC filing analyzer surfaces buried disclosures and accounting changes
By the time a document reaches the database, it has already been interrogated.
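The routing itself is simple; the leverage is in the per-source instructions. A minimal sketch, with `run_llm` standing in for whatever model call the pipeline uses and the prompts heavily abbreviated:

```python
# Per-source-type analysis instructions (abbreviated; the real prompts
# are much longer and more prescriptive).
ANALYSIS_PROMPTS = {
    "financials": (
        "Analyze margin trends, pricing power signals, and earnings "
        "quality. Cite specific figures."
    ),
    "earnings_call": (
        "Detect hedging language, track management tone versus prior "
        "quarters, and pull direct quotes."
    ),
    "sec_filing": (
        "Surface buried disclosures, footnote changes, and accounting "
        "policy shifts."
    ),
}


def analyze_document(doc_type: str, text: str, run_llm) -> dict:
    """Interrogate a document at ingestion time, before storage."""
    prompt = ANALYSIS_PROMPTS[doc_type]
    analysis = run_llm(f"{prompt}\n\n---\n{text}")
    # Store both levels of depth: the distilled analysis for skimming,
    # and the full source text for drilling down later.
    return {"type": doc_type, "analysis": analysis, "full_text": text}
```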
We then store each document alongside its analysis, allowing the analyst to see the key insights at a glance but also drill into the full source text when it needs to. This gives the system two levels of depth rather than forcing it to work from summaries alone.
We also pre-compute derived financial metrics before the LLM ever sees the data. Margins, growth rates, balance sheet ratios, free cash flow. This directly addresses the numerical reasoning problem. The LLM is explicitly told to use these pre-computed numbers rather than doing arithmetic itself.
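A sketch of that pre-computation step, with illustrative field names rather than our actual schema:

```python
def derive_metrics(q: dict, prior_year_q: dict) -> dict:
    """Pre-compute ratios from one quarter's reported figures.

    These deterministic numbers are handed to the LLM, which is
    instructed to quote them rather than do arithmetic itself.
    """
    revenue = q["revenue"]
    return {
        "gross_margin": (revenue - q["cogs"]) / revenue,
        "operating_margin": q["operating_income"] / revenue,
        "revenue_growth_yoy": revenue / prior_year_q["revenue"] - 1,
        "free_cash_flow": q["operating_cash_flow"] - q["capex"],
        "debt_to_equity": q["total_debt"] / q["shareholder_equity"],
    }
```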
The Analyst
The analyst agent then takes everything the ingestion phase has produced and develops an investment memo. It has access to all the document analyses, the raw financial data, and the pre-computed metrics. But it is not just synthesizing text. It has tools that let it build and run financial models in code. This is how we solve the numerical reasoning problem. The LLM never does math in its head. It writes code, executes it, and reasons over the results.
We built four financial modeling tools for the analyst: a discounted cash flow (DCF) model, a reverse DCF, a comparables analysis, and a sensitivity analysis.
The most important of these is the DCF. It requires the analyst to commit to a set of assumptions about the future: revenue growth rates, target margins, discount rate, terminal growth. An LLM generating text can be vague about these numbers. An LLM that has to pass them as inputs to a function cannot. The DCF tool forces precision. The analyst can run multiple projections to build a bull case and a bear case and understand how the company should be valued across a range of outcomes.
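A minimal two-stage DCF in this spirit might look like the following. The signature is illustrative, not our production tool, but it shows the key property: assumptions are function arguments, so vagueness is impossible.

```python
def dcf_value(fcf0: float, growth_rates: list[float],
              discount_rate: float, terminal_growth: float) -> float:
    """Present value of projected free cash flows plus terminal value."""
    assert terminal_growth < discount_rate, "terminal value must converge"
    pv, fcf = 0.0, fcf0
    # Stage 1: explicit year-by-year growth assumptions.
    for year, g in enumerate(growth_rates, start=1):
        fcf *= 1 + g
        pv += fcf / (1 + discount_rate) ** year
    # Stage 2: Gordon-growth terminal value, discounted back.
    terminal = fcf * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv += terminal / (1 + discount_rate) ** len(growth_rates)
    return pv
```

Running it twice, once with a bull-case growth path and once with a bear-case path, gives the valuation range the analyst reasons over.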
The reverse DCF works backwards from the current stock price to approximate what growth rate the market is implicitly pricing in. This is where the system connects back to the efficient markets thesis. If the market is pricing in 15% revenue growth and the analyst's analysis suggests 20%, that is a specific, testable insight. If the numbers agree, there may be no edge. Without this tool, the analyst has no way to distinguish between an insight and something the market already knows.
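The reverse DCF can be sketched as a root-finding problem: bisect on a single growth assumption until the model value matches the current market cap. This is a simplified illustration of the idea, not our production tool.

```python
def dcf_value(fcf0: float, growth: float, years: int,
              discount_rate: float, terminal_growth: float) -> float:
    """Simple DCF with one flat growth rate (monotonic in growth)."""
    pv, fcf = 0.0, fcf0
    for year in range(1, years + 1):
        fcf *= 1 + growth
        pv += fcf / (1 + discount_rate) ** year
    terminal = fcf * (1 + terminal_growth) / (discount_rate - terminal_growth)
    return pv + terminal / (1 + discount_rate) ** years


def implied_growth(market_cap: float, fcf0: float, years: int = 5,
                   discount_rate: float = 0.10,
                   terminal_growth: float = 0.025) -> float:
    """Growth rate at which the DCF value equals the market cap."""
    lo, hi = -0.5, 1.0
    for _ in range(60):  # bisection: value is increasing in growth
        mid = (lo + hi) / 2
        if dcf_value(fcf0, mid, years, discount_rate, terminal_growth) < market_cap:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

If `implied_growth` returns 15% and the analyst's own projection is 20%, that gap is the candidate edge; if the two numbers agree, the thesis is probably already priced in.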
The analyst also runs a comparables analysis across stocks in the same sector and a sensitivity analysis that varies growth and discount rate assumptions across a matrix of scenarios. A single projection is brittle. These tools let the analyst stress-test its thesis and understand how sensitive its conclusion is to the assumptions it made. Together, they ground the report in actual computation rather than LLM intuition.
Every factual claim in the report is backed by an inline citation to source material. This is not optional. It is a structural requirement that feeds directly into the fact-checking and review process that follows.
The Portfolio Manager
Once the analyst produces a draft, it goes through a review process designed to catch two different kinds of failure: factual errors and weak reasoning.
The first is fact-checking. A fact-checker extracts every verifiable quantitative claim from the report and checks it programmatically against the actual data in the database: revenue figures, margins, growth rates, price movements. Each claim type has its own tolerance threshold. If a claim fails verification, it gets sent back for revision. This is not an LLM judging whether something sounds right; it is a direct comparison against the source data.
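A sketch of that verification step, with illustrative claim types and tolerances:

```python
# Relative error allowed per claim type (values are illustrative).
TOLERANCES = {
    "revenue": 0.005,
    "margin": 0.01,
    "growth_rate": 0.01,
    "price_move": 0.02,
}


def verify_claim(claim_type: str, claimed: float, actual: float) -> bool:
    """True if the claimed figure is within tolerance of the source data."""
    tol = TOLERANCES[claim_type]
    if actual == 0:
        return abs(claimed) <= tol
    return abs(claimed - actual) / abs(actual) <= tol


def failed_claims(claims: list[dict]) -> list[dict]:
    """Claims that must be sent back to the analyst for revision."""
    return [c for c in claims
            if not verify_claim(c["type"], c["claimed"], c["actual"])]
```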
The second, weak reasoning, is harder. We built an agent that acts as a PM. Its job is to challenge the investment thesis and poke holes in the reasoning.
The report grades each stock across several dimensions:
- Growth
- Quality
- Risk
- Valuation
- Momentum
For each dimension, the PM argues for a different grade with data-backed reasoning. It identifies the weakest assumptions and the strongest counter-arguments. Then it either approves the report or sends it back with specific critiques.
This is where we ran into one of the most interesting problems in the entire project. When we first implemented this loop, the analyst would receive feedback from the PM and immediately cave. Every critique was accepted, every grade was softened. The result was a set of reports with no conviction about anything—every stock looked roughly the same. The system was producing exactly the kind of mediocre, hedged analysis we had set out to avoid.
The fix came from thinking about how real investment teams operate. The best analysts have strong opinions, loosely held. They form a thesis and defend it. If someone surfaces new information they missed, they update. But they do not abandon their position just because someone pushed back.
We engineered this behavior into the revision process. When the analyst receives a critique, it must explicitly choose to accept, reject, or partially accept each point, with reasoning for each decision. It is instructed to defend its analysis rather than reflexively defer to the PM. The goal is not for the analyst to win every argument. It is for the final report to reflect genuine analytical reasoning rather than a series of concessions.
By default, LLMs are agreeable. That is a useful property in most applications. In this one, it was the primary failure mode. Getting the system to produce good research meant engineering disagreement into it.
Evaluation
Evaluating agent performance in finance is uniquely difficult. As mentioned previously, AI models are trained on historical data from the internet, meaning a model already knows how a stock moved over the past ten years even if you don’t tell it. Therefore, if you ran this system on 2023 data and asked it to predict what happens in 2024, you would have no way to know whether the agent produced a genuine insight or just recalled a memorized outcome. Traditional backtesting, the standard evaluation method in finance, is meaningless here.
So, we went back to our original mental model: a real research team. When a PM evaluates an analyst, they don’t know whether the analyst can predict the future. Nobody can. What a PM can assess is how well the analyst understands the business. How detailed is the research? How well do they understand the margin profile, the competitive position, the growth trajectory? A PM acts on a thesis based on their confidence in the depth of the research behind it.
We evaluate our agent the same way. We score reports on analytical depth, data accuracy, and the specificity of the insights produced.
The most important criterion is market gap identification. Each insight must explain not just what the agent found, but what the market is missing. This connects directly to the thesis that drives the entire system. If the agent cannot articulate what is not priced into the stock, the analysis has no value regardless of how polished the report looks.
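A weighted rubric makes this concrete. The dimension names and weights below are illustrative, but market-gap identification deliberately carries the most weight:

```python
# Scoring weights are illustrative assumptions, not our exact values.
WEIGHTS = {
    "analytical_depth": 0.25,
    "data_accuracy": 0.25,
    "insight_specificity": 0.20,
    "market_gap_identification": 0.30,
}


def score_report(scores: dict[str, float]) -> float:
    """Weighted 0-10 score for a report.

    A report that cannot articulate what the market is missing cannot
    score well, no matter how polished it looks.
    """
    assert set(scores) == set(WEIGHTS), "must score every dimension"
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```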
This evaluation framework proved its worth early. When we were still considering a traditional RAG pipeline for data ingestion, the evaluation scores told us what we needed to know. Reports generated from RAG-retrieved context were not producing higher quality insights. The retrieval was surfacing relevant text, but the analysis built on top of it lacked the depth we were seeing from the document analyzer approach. Once we had the numbers in front of us, the decision was straightforward. We dropped RAG entirely.
Surprising Challenges, Better Outcomes
We started this project expecting the hard part to be the financial analysis. Getting the models to understand balance sheets, build accurate projections, interpret management tone. Those problems were real, but they were solvable with the right tools and the right data pipeline.
The harder problem was one we did not anticipate: getting the system to have conviction. By default, an LLM will agree with whatever feedback it receives. Engineering an agent that defends its own thesis, updates when the evidence warrants it, and does not cave under pressure turned out to be the difference between a system that produces average research and one that produces something worth reading.
In a follow-up post, we will walk through how we iteratively refined the prompts and agent behaviors that made this possible.