Will Leeney · 8 min
Building Semantic Search with Enhanced Embeddings

StackOne has over 10,000 actions across all our connectors, and growing. Some connectors have 2,000 actions alone because the underlying API is massive. Customers were asking for better search, and rightfully so: keyword matching doesn’t work when someone searches “onboard new hire” but the action is called “Create Employee.”

We needed semantic search, and I want to walk through how I built this feature using Claude Code.

For context on the scale of AI assistance: while Claude runs, I’m typically doing other work. As I write this, I have six other Claude Code sessions running in parallel. Compared to even four months ago, when I had to hand-hold through POCs, the level of autonomy has changed significantly.

The development lifecycle

The phases I went through:

  1. Stakeholder conversations (CTO + product managers)
  2. POC and benchmarking
  3. Local development
  4. API integration
  5. Testing and iteration

This lifecycle hasn’t really changed. AI is involved throughout, but at the top level of the hierarchy the development lifecycle looks the same as it did before.

Phase 1: Specification

The spec remains essential. A significant part of a developer’s job is still specification: understanding the actual problem, defining the solution format, surfacing requirements, and identifying edge cases.

Through stakeholder conversations, we identified clear requirements:

  • Problem: 10k actions, keyword search fails for natural language queries
  • Solution: Semantic search
  • Constraints:
    • Storage must be minimal. We can’t add significant overhead to our API
    • Search latency must remain fast
    • Must handle account-specific custom connectors

These constraints naturally defined our benchmark dimensions: accuracy, storage size, and latency. Without these upfront, we’d have no way to evaluate whether different approaches were actually better.

The deeper investigation revealed important architectural details: connectors live in S3, account-specific connectors need special handling, builds trigger every 24 hours plus a manual endpoint, and Turbopuffer would serve as our vector store. Each of these details shaped the eventual solution.

If we wanted to replace large portions of this process with AI, we would need to give it access to essentially all of the knowledge of the business: Slack channels, market positioning, every git repo, all documentation and design decisions. It’s feasible, and we’ve actively had conversations about it, but I still think there are some decision points that need human input: there’s nearly always a Pareto frontier of solutions given your specification. Would it be quicker to have AI do this? Maybe, but the specification only took about two total human hours (split across three people) to design.

Phase 2: POC and benchmarking

With requirements gathered, I outlined everything to Claude and set it running. The key was being explicit about what success looked like: make a benchmark dataset, define the metrics I care about, test BM25 against semantic search, and iterate on the model. I also knew reranking might help, so I asked it to include that in the comparison.

Methods tested

| Method | Description |
| --- | --- |
| BM25 Only | Keyword matching via Orama |
| Semantic Only | Local embeddings using all-MiniLM-L6-v2 |
| Enhanced Embeddings | Enriches text with synonyms before embedding |
| Enhanced + Rerank | Enhanced embeddings + cross-encoder reranking |
| Semantic + Rerank | Standard semantic + cross-encoder reranking |

The enhanced embeddings approach enriches action text with synonyms before generating embeddings:

"Create Employee" → "Create Employee add new make insert worker staff team member..."

This bridges the vocabulary gap between how users think about actions and how actions are actually named.
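The enrichment step can be sketched in a few lines. The synonym map below is hypothetical (the post doesn’t show the real enrichment rules), and the enriched string would then be fed to all-MiniLM-L6-v2 to produce the embedding:

```python
# Hypothetical synonym map; the actual enrichment rules aren't shown here.
SYNONYMS = {
    "create": ["add", "new", "make", "insert"],
    "employee": ["worker", "staff", "team member"],
}

def enrich(text: str) -> str:
    """Append synonyms for each recognised word so the downstream
    embedding (all-MiniLM-L6-v2) also covers the vocabulary users
    actually type, not just the action's official name."""
    extra = [syn for word in text.lower().split()
             for syn in SYNONYMS.get(word, [])]
    return f"{text} {' '.join(extra)}" if extra else text

print(enrich("Create Employee"))
# → Create Employee add new make insert worker staff team member
```

Because the synonyms are baked in at index time, the query path stays a plain embedding lookup with no extra latency.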

Benchmark results

We tested against 103 semantically challenging queries: ones that use synonyms and natural language rather than exact keywords.

All Connectors (9,340 actions)

| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
| --- | --- | --- | --- | --- |
| Enhanced Embeddings | 56% | 81% | 84% | 6ms |
| Enhanced + Rerank | 56% | 81% | 84% | 36ms |
| Semantic Only | 42% | 58% | 70% | 8ms |
| BM25 Only | 9% | 12% | 21% | 19ms |

Per Connector (filtered search)

| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
| --- | --- | --- | --- | --- |
| Enhanced Embeddings | 67% | 80% | 90% | 0.9ms |
| Semantic Only | 62% | 75% | 80% | 0.9ms |
| BM25 Only | 47% | 60% | 65% | 0.2ms |

Four findings stood out:

  1. Enhanced embeddings wins. 84-90% Hit@5 with ~6ms latency across the full index
  2. Reranking provides no benefit. Adds 30ms with zero accuracy improvement
  3. BM25 fails on natural language. Only 21% Hit@5. It can’t match “onboard” to “create”
  4. Synonym enrichment adds +14 percentage points over standard semantic search

Storage: ~206 MB total (BM25 index 39 MB, enhanced embeddings 72 MB, model 23 MB).

The results validated the approach. Enhanced embeddings met our accuracy targets with acceptable latency and storage. Could we improve further? Probably. But the goal was getting something deployed and collecting real user feedback, not perfecting a system in isolation. For a comparison of lighter-weight approaches (BM25 and TF-IDF) that work without embeddings, see our MCP tool search benchmarks.

Phase 3: API integration

Having deployed the resolutions feature previously, I knew the process. The key insight was pointing Claude Code at both the unified-api and ai-generation repos simultaneously, giving it the full context needed to wire everything together.

Build flow

GitHub Actions ─────► Lambda ◄───── S3 Configs
(cron/manual)         │              (connectors)

unified-api ────► Transformer ────► Turbopuffer
POST /build       + Embeddings      (vector DB)
                  (MiniLM-L6)

Search flow

Client ────► unified-api ────► Lambda
             POST /search      │
             + project_id      ▼
                            Embed Query
                               │
                               ▼
             Turbopuffer ◄── Vector Search
             (cosine sim)     + filters
                   │
                   ▼
             Results

The build process pulls connector configs from S3, transforms them into indexed actions with enhanced text, generates embeddings, and upserts to Turbopuffer. Search embeds the query, applies connector and project filters, and returns ranked results.
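The search step can be sketched as a simplified pure-Python stand-in. In production the vector search and filtering are delegated to Turbopuffer, and the field names here are hypothetical:

```python
import math

# Simplified stand-in for the search flow: filter to the caller's
# project, then rank by cosine similarity against the query embedding.
# Field names ("project_id", "vec") are hypothetical.

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def search(query_vec, actions, project_id, top_k=5):
    # Keep shared actions plus this account's custom connectors.
    candidates = [a for a in actions
                  if a["project_id"] in (None, project_id)]
    return sorted(candidates,
                  key=lambda a: cosine(query_vec, a["vec"]),
                  reverse=True)[:top_k]
```

Filtering before ranking is what makes the per-connector numbers above so fast: the candidate set shrinks from ~10k actions to a few hundred.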

Deployment surfaced the usual issues: a missing IAM permission for listBuckets, which required a redeploy. Then I found tests passing silently because there were no custom connectors in the staging bucket; passing tests don’t guarantee correctness. After transferring test data and re-validating locally, everything worked as expected.

The new development pattern

The pattern that’s emerging:

  1. Specification — understand the problem and solution (unchanged from pre-AI)
  2. Context handoff — provide Claude with the spec, relevant code locations, test criteria, benchmark requirements, and decision points where it should pause for input
  3. Verification — review the output and validate it works

I’m also finding value in longer sessions with more context compaction. The CLAUDE.md file that Claude uses, which I initially found annoying, has become really useful as a project tracker and decision point log.

The human role is still specification, verification, and knowing when something is good enough to ship. But execution is increasingly delegated.
