Building Semantic Search with Enhanced Embeddings
Will Leeney · 8 min read
StackOne has over 10,000 actions across all our connectors, and growing. Some connectors have 2,000 actions alone because the underlying API is massive. Customers were asking for better search, and rightfully so: keyword matching doesn’t work when someone searches “onboard new hire” but the action is called “Create Employee.”
We needed semantic search, and I want to walk through how I built this feature using Claude Code.
For context on the scale of AI assistance: while Claude runs, I’m typically doing other work. As I write this, I have six other Claude Code sessions running in parallel. Compared to even four months ago, when I had to hand-hold through POCs, the level of autonomy has changed significantly.
The development lifecycle
The phases I went through:
- Stakeholder conversations (CTO + product managers)
- POC and benchmarking
- Local development
- API integration
- Testing and iteration
This lifecycle hasn’t really changed. AI is involved throughout the process, but at the top level of the hierarchy the development lifecycle looks the same as it always has.
Phase 1: Specification
The spec remains essential. A significant part of a developer’s job is still specification: understanding the actual problem, defining the solution format, surfacing requirements, and identifying edge cases.
Through stakeholder conversations, we identified clear requirements:
- Problem: 10k actions, keyword search fails for natural language queries
- Solution: Semantic search
- Constraints:
- Storage must be minimal. We can’t add significant overhead to our API
- Search latency must remain fast
- Must handle account-specific custom connectors
These constraints naturally defined our benchmark dimensions: accuracy, storage size, and latency. Without these upfront, we’d have no way to evaluate whether different approaches were actually better.
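The accuracy dimension in the benchmarks below is Hit@k: the fraction of queries whose expected action appears in the top k results. A minimal sketch of the metric, where the dataset and `fake_search` function are toy stand-ins rather than the actual benchmark harness:

```python
# Illustrative Hit@k computation for a retrieval benchmark.
# The dataset and search function below are hypothetical stand-ins,
# not the production benchmark.

def hit_at_k(dataset, search_fn, k):
    """Fraction of queries whose expected action appears in the top-k results."""
    hits = sum(
        1 for query, expected in dataset
        if expected in search_fn(query)[:k]
    )
    return hits / len(dataset)

# Toy dataset: natural-language query -> canonical action name
dataset = [
    ("onboard new hire", "Create Employee"),
    ("fire someone", "Terminate Employee"),
]

# A fake ranked search, good enough to exercise the metric
def fake_search(query):
    if "hire" in query:
        return ["Create Employee", "List Employees"]
    return ["List Employees", "Terminate Employee"]

print(hit_at_k(dataset, fake_search, 1))  # 0.5: only the first query hits at rank 1
print(hit_at_k(dataset, fake_search, 3))  # 1.0: both expected actions appear in the top 3
```

Reporting Hit@1, Hit@3, and Hit@5 together shows not just whether the right action is found, but how far down the list a user (or an agent) has to look.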
The deeper investigation revealed important architectural details: connectors live in S3, account-specific connectors need special handling, builds trigger every 24 hours plus a manual endpoint, and Turbopuffer would serve as our vector store. Each of these details shaped the eventual solution.
If we wanted to replace large portions of this process with AI, we would need to give it access to literally all of the knowledge of the business: Slack channels, market positioning info, every git repo, all the documentation and design decisions. It’s feasible, and we’ve actively had conversations about it. But I still think there are some human-input decision points: there’s nearly always a Pareto-optimal set of solutions given your specification. Would it be quicker to have AI do this? Maybe, but in practice the specification only took about two total human hours to design, split across three people.
Phase 2: POC and benchmarking
With requirements gathered, I outlined everything to Claude and set it running. The key was being explicit about what success looked like: make a benchmark dataset, define the metrics I care about, test BM25 against semantic search, and iterate on the model. I also knew reranking might help, so I asked it to include that in the comparison.
Methods tested
| Method | Description |
|---|---|
| BM25 Only | Keyword matching via Orama |
| Semantic Only | Local embeddings using all-MiniLM-L6-v2 |
| Enhanced Embeddings | Enriches text with synonyms before embedding |
| Enhanced + Rerank | Enhanced embeddings + cross-encoder reranking |
| Semantic + Rerank | Standard semantic + cross-encoder reranking |
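To see why pure keyword matching struggles on these queries, consider a stripped-down token-overlap scorer. This is a toy stand-in for illustration, not Orama's actual BM25 implementation, but it shares the core failure mode: when query and action name have no tokens in common, the score is zero no matter how related they are.

```python
# Toy token-overlap scorer illustrating the keyword-matching failure mode.
# Real BM25 adds term-frequency and document-length weighting, but still
# scores zero when the query and document share no tokens.

def keyword_score(query, action_name):
    q_tokens = set(query.lower().split())
    a_tokens = set(action_name.lower().split())
    return len(q_tokens & a_tokens)

print(keyword_score("onboard new hire", "Create Employee"))       # 0: no shared tokens
print(keyword_score("create employee record", "Create Employee")) # 2
```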
The enhanced embeddings approach enriches action text with synonyms before generating embeddings:
"Create Employee" → "Create Employee add new make insert worker staff team member..."
This bridges the vocabulary gap between how users think about actions and how actions are actually named.
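The enrichment step itself is simple string expansion before embedding. A sketch under stated assumptions: the synonym table and `enrich()` helper here are illustrative, and the production mapping is richer than this.

```python
# Sketch of synonym enrichment before embedding. The synonym table and
# enrich() helper are illustrative, not the production mapping.

SYNONYMS = {
    "create": ["add", "new", "make", "insert"],
    "employee": ["worker", "staff", "team member", "hire"],
    "delete": ["remove", "destroy"],
}

def enrich(action_name):
    """Append synonyms for each token so the embedding covers user vocabulary."""
    tokens = action_name.lower().split()
    expansions = [syn for t in tokens for syn in SYNONYMS.get(t, [])]
    return action_name + " " + " ".join(expansions)

print(enrich("Create Employee"))
# "Create Employee add new make insert worker staff team member hire"
```

The enriched string, not the bare action name, is what gets embedded, so a query like "onboard new hire" now shares vocabulary with "Create Employee" in embedding space.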
Benchmark results
We tested against 103 semantically challenging queries: ones that use synonyms and natural language rather than exact keywords.
All Connectors (9,340 actions)
| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
|---|---|---|---|---|
| Enhanced Embeddings | 56% | 81% | 84% | 6ms |
| Enhanced + Rerank | 56% | 81% | 84% | 36ms |
| Semantic Only | 42% | 58% | 70% | 8ms |
| BM25 Only | 9% | 12% | 21% | 19ms |
Per Connector (filtered search)
| Approach | Hit@1 | Hit@3 | Hit@5 | Latency |
|---|---|---|---|---|
| Enhanced Embeddings | 67% | 80% | 90% | 0.9ms |
| Semantic Only | 62% | 75% | 80% | 0.9ms |
| BM25 Only | 47% | 60% | 65% | 0.2ms |
Four findings stood out:
- Enhanced embeddings wins. 84-90% Hit@5 with ~6ms latency across the full index
- Reranking provides no benefit. Adds 30ms with zero accuracy improvement
- BM25 fails on natural language. Only 21% Hit@5. It can’t match “onboard” to “create”
- Synonym enrichment adds +14 percentage points over standard semantic search
Storage: ~206 MB total (BM25 index 39 MB, enhanced embeddings 72 MB, model 23 MB).
The results validated the approach. Enhanced embeddings met our accuracy targets with acceptable latency and storage. Could we improve further? Probably. But the goal was getting something deployed and collecting real user feedback, not perfecting a system in isolation. For a comparison of lighter-weight approaches (BM25 and TF-IDF) that work without embeddings, see our MCP tool search benchmarks.
Phase 3: API integration
Having deployed the resolutions feature previously, I knew the process. The key insight was pointing Claude Code at both the unified-api and ai-generation repos simultaneously, giving it the full context needed to wire everything together.
Build flow
```
GitHub Actions ─────► Lambda ◄───── S3 Configs
 (cron/manual)          │           (connectors)
                        ▼
unified-api ────► Transformer ────► Turbopuffer
POST /build      + Embeddings       (vector DB)
                  (MiniLM-L6)
```
Search flow
```
Client ────► unified-api ────► Lambda
POST /search                     │
+ project_id                     ▼
                            Embed Query
                                 │
                                 ▼
            Turbopuffer ◄── Vector Search
            (cosine sim)     + filters
                  │
                  ▼
               Results
```
The build process pulls connector configs from S3, transforms them into indexed actions with enhanced text, generates embeddings, and upserts to Turbopuffer. Search embeds the query, applies connector and project filters, and returns ranked results.
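The search path can be sketched in a few lines. This is an in-memory illustration only: Turbopuffer does the real filtering and approximate-nearest-neighbour search, and the query vector here is a hand-picked stub rather than a MiniLM embedding.

```python
import math

# Minimal sketch of the search path: take an embedded query, apply a
# connector filter, rank by cosine similarity. In production Turbopuffer
# handles the filtering and vector search; this in-memory index and the
# hand-picked vectors are illustrative only.

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, connector=None, top_k=5):
    """index: list of dicts with 'action', 'connector', and 'vector' keys."""
    candidates = [
        row for row in index
        if connector is None or row["connector"] == connector
    ]
    ranked = sorted(
        candidates,
        key=lambda row: cosine_sim(query_vec, row["vector"]),
        reverse=True,
    )
    return [row["action"] for row in ranked[:top_k]]

index = [
    {"action": "Create Employee", "connector": "hris",       "vector": [0.9, 0.1]},
    {"action": "List Invoices",   "connector": "accounting", "vector": [0.1, 0.9]},
]

print(search([0.8, 0.2], index))                          # "Create Employee" ranks first
print(search([0.8, 0.2], index, connector="accounting"))  # filter applies before ranking
```

Applying the connector and project filters before similarity ranking is what makes the per-connector numbers in the benchmark both faster and more accurate: the candidate set is far smaller.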
Deployment surfaced the usual issues: missing IAM permissions for listBuckets, which required a redeploy. Then I found tests passing silently because there were no custom connectors in the staging bucket; passing tests don’t guarantee correctness. After transferring test data and re-validating locally, everything worked as expected.
The new development pattern
The pattern that’s emerging:
- Specification — understand the problem and solution (unchanged from pre-AI)
- Context handoff — provide Claude with the spec, relevant code locations, test criteria, benchmark requirements, and decision points where it should pause for input
- Verification — review the output and validate it works
I’m also finding value in longer sessions with more context compaction. The CLAUDE.md file that Claude uses, which I initially found annoying, has become really useful as a project tracker and decision point log.
The human role is still specification, verification, and knowing when something is good enough to ship. But execution is increasingly delegated.