Field Lab 01 · Boston · with Apify

A Reliability Layer for Web-Grounded Agents

Build a design-partner sourcing agent on Apify, then prove which leads are safe to act on.

  • In person, Boston
  • Hands-on build
  • Bring a laptop
  • Date to be announced

You will not leave with a lead-gen toy. You will leave with a reusable trust layer for web-grounded agents and an eval that proves whether it works.

The founder problem

A founder has a product and not enough customers, so they point an agent at the web to find who to sell to. The agent returns fifty clean leads. The founder emails all fifty. A third bounce, several were never a fit, and two of the companies shut down last year. The list looked done. It was not.

That gap, between a list that looks done and a list you can act on, is what this Field Lab is about. The web data part is easy. Knowing which of it to trust is the job.

What this actually is

This is not a lead-generation session. It is about the reliability layer around a web-grounded agent: where the data came from, whether the evidence holds, and what to do when it does not. Lead generation is just the first use case. The trust layer is the artifact, and it works for any agent that makes claims off the web.

Two layers, and only one is the product. The trust core is source-agnostic: it takes a claim and its evidence and decides whether the claim is true and supported right now. The policy filter sits on top and decides whether a true claim fits what the founder asked for. The core is the reusable piece. The fit logic stays in the policy filter so it never contaminates the core.
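A minimal sketch of that split, under stated assumptions: the names Claim, Evidence, trust_core, and make_policy_filter are hypothetical and not taken from the lab repo, and the substring match is only a placeholder for a real support check. The point it illustrates is the dependency direction: the policy filter calls the core, and the core knows nothing about fit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Evidence:
    url: str
    snippet: str  # text pulled from the source page


@dataclass(frozen=True)
class Claim:
    field: str  # e.g. "headcount"
    value: str
    evidence: Optional[Evidence]


def trust_core(claim: Claim) -> bool:
    """Source-agnostic check: does the evidence back the claim?

    Placeholder logic (an assumption): the value must appear in the snippet.
    Nothing here knows what the founder is looking for.
    """
    if claim.evidence is None:
        return False
    return claim.value.lower() in claim.evidence.snippet.lower()


def make_policy_filter(wanted: Dict[str, Callable[[str], bool]]):
    """Build a fit check that only ever sees claims the core already trusts."""

    def policy(claims: List[Claim]) -> bool:
        trusted = {c.field: c.value for c in claims if trust_core(c)}
        return all(f in trusted and ok(trusted[f]) for f, ok in wanted.items())

    return policy


# Usage: the fit rule lives in the policy layer, not in the core.
fits = make_policy_filter({"headcount": lambda v: 50 <= int(v) <= 300})
```

Swapping the fit rules for another use case means passing a different dictionary to make_policy_filter; trust_core stays untouched.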

The shared scenario

The whole room works the same problem: a seed to Series A B2B SaaS founder who sells a developer-facing product, has no growth hire, and needs a weekly list of qualified design-partner candidates.

The fit profile, which feeds the policy layer: US-based B2B SaaS, 50 to 300 people, a developer-facing product, a contact who is an engineering leader, and a real reason to reach out now.
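The fit profile above could be written down as a small policy check. This is a sketch, not the lab's implementation: the Prospect fields, the title list, and the function name fit_failures are all assumptions made for illustration, and it presumes every field has already passed the trust core.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Prospect:
    country: str
    segment: str
    headcount: int
    developer_facing: bool
    contact_title: str
    trigger: str  # the reason to reach out now; empty if none was found


# Assumption: a rough list of titles that count as "engineering leader".
ENGINEERING_LEADER_TITLES = (
    "cto",
    "vp engineering",
    "head of engineering",
    "director of engineering",
)


def fit_failures(p: Prospect) -> List[str]:
    """Return every way the prospect misses the fit profile (empty means fit)."""
    failures = []
    if p.country != "US":
        failures.append("not US-based")
    if p.segment != "B2B SaaS":
        failures.append("not B2B SaaS")
    if not 50 <= p.headcount <= 300:
        failures.append("headcount outside 50 to 300")
    if not p.developer_facing:
        failures.append("product is not developer-facing")
    if not any(t in p.contact_title.lower() for t in ENGINEERING_LEADER_TITLES):
        failures.append("contact is not an engineering leader")
    if not p.trigger.strip():
        failures.append("no reason to reach out now")
    return failures
```

Returning the list of misses, rather than a bare yes or no, keeps the policy layer explainable in the same way the trust layer is.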

You build an agent that returns a prospect list where every field traces to a source that currently supports it. Plausible is not verified.

The four ways web data breaks

  • Stale

    true once, not anymore, and the source is cached.

  • Unsupported

    the evidence is missing, walled, or does not actually back the claim.

  • Conflicting

    two sources disagree on a field that matters.

  • Extraction-drift

    the agent grabbed the wrong field, or the wrong company.
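One way to make the four breaks concrete is to classify a single field claim against the observations that back it. The sketch below is an assumption-heavy illustration: the Observation and FieldClaim shapes, the 30-day freshness window, the use of fetch age as a proxy for staleness, and the order of the checks are all hypothetical choices, not the lab's definitions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import List, Optional


class Break(Enum):
    STALE = "stale"
    UNSUPPORTED = "unsupported"
    CONFLICTING = "conflicting"
    EXTRACTION_DRIFT = "extraction-drift"


@dataclass(frozen=True)
class Observation:
    entity: str       # company the source page is actually about
    value: str        # value the source states for the field
    fetched_on: date  # when the page was fetched


@dataclass(frozen=True)
class FieldClaim:
    entity: str
    value: str
    observations: List[Observation]


MAX_AGE = timedelta(days=30)  # assumption: freshness window


def classify(claim: FieldClaim, today: date) -> Optional[Break]:
    """Return the first break found, or None if the claim looks clean."""
    obs = claim.observations
    if not obs:
        return Break.UNSUPPORTED       # no evidence at all
    if any(o.entity != claim.entity for o in obs):
        return Break.EXTRACTION_DRIFT  # evidence is about another company
    if len({o.value for o in obs}) > 1:
        return Break.CONFLICTING       # sources disagree with each other
    if obs[0].value != claim.value:
        return Break.UNSUPPORTED       # evidence does not back the claim
    if all(today - o.fetched_on > MAX_AGE for o in obs):
        return Break.STALE             # every source is older than the window
    return None
```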

What you build, the trust layer

  • Guard

    check each claim is fresh, supported, and the right shape before it ships.

  • Degrade

    hand over the records you can stand behind and name the gaps, instead of padding the list with junk.

  • Observe

    emit a trace per record: where it came from, what was checked, what was rejected and why.
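The three parts can be sketched together in a few lines. Everything named here is hypothetical: the record shape (a plain dict with id, source, and evidence keys), the Trace fields, and the two toy checks stand in for whatever the lab repo actually defines.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
# A check returns a rejection reason, or None when the record passes.
Check = Callable[[Record], Optional[str]]


@dataclass
class Trace:
    record_id: str
    source: str
    checks_run: List[str]
    accepted: bool
    reason: Optional[str]


def guard(record: Record, checks: Dict[str, Check]) -> Trace:
    """Guard + observe: run checks in order, stop at the first failure."""
    run: List[str] = []
    for name, check in checks.items():
        run.append(name)
        reason = check(record)
        if reason is not None:
            return Trace(record["id"], record.get("source", ""), run, False, reason)
    return Trace(record["id"], record.get("source", ""), run, True, None)


def degrade(records: List[Record], checks: Dict[str, Check]) -> dict:
    """Degrade: return only the records that passed, and name the gaps."""
    traces = [guard(r, checks) for r in records]
    accepted = [r for r, t in zip(records, traces) if t.accepted]
    gaps = [f"{t.record_id}: {t.reason}" for t in traces if not t.accepted]
    return {"accepted": accepted, "gaps": gaps, "traces": [asdict(t) for t in traces]}


# Toy checks, for illustration only.
CHECKS: Dict[str, Check] = {
    "has_source": lambda r: None if r.get("source") else "no source URL",
    "has_evidence": lambda r: None if r.get("evidence") else "no evidence snippet",
}
```

Calling degrade(records, CHECKS) yields a shorter list plus a named gap for each dropped record, which is the opposite of padding the list to a target length.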

The bar, the Dirty Thirty

Thirty frozen records with known answers: ten clean, and twenty planted with one specific break each. Your guard is graded on two numbers, not one: did it catch the bad records, and did it keep the good ones. An agent that rejects everything scores perfectly on the first and fails the second, so it fails. That is the point.

Group             Count  Verdict           Tests
Clean             10     accept            keeps good leads
Stale             5      reject or review  outdated evidence
Unsupported       5      reject or review  missing or weak evidence
Conflicting       5      review            source disagreement
Extraction-drift  5      reject            wrong field or entity

Pass bar:

  • bad-record recall at or above 85 percent
  • clean-record precision at or above 80 percent
  • evidence coverage 100 percent on accepted records
  • zero unsupported accepted claims
  • a reason on every rejection
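A sketch of how the two headline numbers might be computed, under loudly stated assumptions. The Result shape and the score function are hypothetical. Bad-record recall is taken as the share of planted records that were not accepted. Clean-record precision is read here as the share of clean records that were accepted, following "did it keep the good ones"; the lab's harness may define it differently. Any non-accept verdict counts as a catch, which ignores the per-group verdicts in the table, and the evidence-coverage and unsupported-claim bars are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Result:
    group: str    # "clean", "stale", "unsupported", "conflicting", "extraction-drift"
    verdict: str  # "accept", "review", or "reject"
    reason: Optional[str] = None


def score(results: List[Result]) -> dict:
    bad = [r for r in results if r.group != "clean"]
    clean = [r for r in results if r.group == "clean"]

    caught = sum(r.verdict != "accept" for r in bad)   # bad records flagged
    kept = sum(r.verdict == "accept" for r in clean)   # good records kept

    bad_recall = caught / len(bad) if bad else 0.0
    clean_keep_rate = kept / len(clean) if clean else 0.0
    # Assumption: "a reason on every rejection" covers review verdicts too.
    reasons_ok = all(r.reason for r in results if r.verdict != "accept")

    return {
        "bad_recall": bad_recall,
        "clean_keep_rate": clean_keep_rate,
        "reasons_ok": reasons_ok,
        "passed": bad_recall >= 0.85 and clean_keep_rate >= 0.80 and reasons_ok,
    }


# A guard that rejects everything: perfect on the first number, zero on the second.
reject_all = (
    [Result("clean", "reject", "too cautious")] * 10
    + [Result("stale", "reject", "outdated evidence")] * 20
)
```

With these definitions, score(reject_all) reports a bad-record recall of 1.0 and a keep rate of 0.0, so the run does not pass.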

What you leave with

A working sourcing agent. A reusable, source-agnostic trust core. A thin growth policy filter. The thirty-case eval harness. Your own measured precision and recall. A founder-facing trust report. And a teardown that says, not “I built an agent,” but “I built a reusable reliability layer for web-grounded agents, benchmarked it against a planted trap set, and produced a safe-to-act-on number.” That is a portfolio artifact, not a demo.

How it runs

  1. Customer frame
  2. Apify source layer
  3. Improve the trust core against the Dirty Thirty
  4. Eval checkpoint
  5. Add the policy filter and run live
  6. Founder report and teardown
  7. Show-and-tell

Bring any coding assistant you already use. Claude Code, Copilot, Cursor, and Codex are all fine. The repo is plain Python on the Apify SDK, so the tool you write with does not change the build. You need a working Python runtime and a model key. A thirty-second smoke test in the pre-work confirms both before you arrive.

Run with

Run with Apify, who provide the data plane: the Actors and datasets that feed the agent. The AI Runtime Field Lab owns the scenario, the reliability bar, the trap set, and the published write-up.
