Turning 25 Years of Project History Into an AI-Searchable Knowledge Base

How Softjourn's R&D turned team case studies, surveys, and PM logs into a single AI-searchable knowledge base and built a lead intelligence pipeline on top of it.

About the Client:

Industry:Technology Services

Headquarters:Fremont, California

Technologies:TypeScript,AWS,JavaScript,Leading LLM

Services Used:

Check All Services

The Challenge

Softjourn had 25+ years of project experience spread across case studies, PM logs, and team memory, but no structured way to search it. Any company trying to apply AI to its own institutional knowledge hits the same wall: the documentation was never built for it.

The Solution

The R&D team built three separate AI-powered ingestion pipelines to structure institutional knowledge from case studies, team surveys, and PM feature logs into a single searchable vector database. On top of that knowledge base, they built a lead generation pipeline accessible through a custom Slack bot that scores prospects against Softjourn's real delivery experience and returns prioritized recommendations with supporting evidence.

The Results

The team created a structured, AI-searchable knowledge base from 25 years of delivery history, a reusable serverless platform architecture, and gained hands-on experience building production-grade AI systems. The project produced three ingestion pipelines, three Slack bot commands, 398 commits, and 12 releases.

The Challenge: Cleaning Up a Scattered Project History

Like many companies that have been around for decades, Softjourn has built up a substantial library of documented project experience. After 25 years, that knowledge spans published case studies, internal project logs, and the institutional memory of long-tenured team members across ticketing, fintech, expense management, and media.

That experience is valuable, but it isn't easy to search. When the business development team needed to connect a prospective lead to relevant past work, the process was informal: check the blog, ask the marketing team, or rely on an account manager's recollection. Off-the-shelf lead generation tools could surface similar companies or filter by industry and size, but none of them could answer a more specific question: which of these companies might actually need the kind of work we've done before, and why?

This is a problem that extends well beyond lead generation. Any company trying to apply AI to its own institutional knowledge hits the same wall: the documentation wasn't built for it. Content lives in different formats, across different systems, written for different purposes. Before you can build anything useful with AI on top of that knowledge, you have to solve the data organization problem first.

The R&D team saw an opportunity to tackle both a real internal problem and an ambitious technical learning goal. Rather than building a quick prototype, they set out to create a production-grade AI system that could structure Softjourn's project history into a searchable knowledge base, then build a practical application on top of it.

The Solution: Structuring the Knowledge Base

Before anything else, the team needed to solve the foundational problem: turning Softjourn's scattered experience into structured, AI-searchable data. This turned out to be the most complex and valuable part of the entire project.

Each source of institutional knowledge had a completely different shape: case studies were written as narratives, survey responses were freeform text, and project feature logs tracked what was delivered but not in a problem-and-solution format.

Each source needed its own ingestion pipeline with its own AI processing logic to extract structured, searchable data from materials that were never designed to be queried by a machine.

Case Study and Blog Pipeline: The team's published case studies and blog posts contained valuable project history, but in narrative form. The pipeline needed to read through marketing-oriented content and extract structured problem-and-solution pairs: what challenge did the client face, what did Softjourn build, and what was the outcome. It also had to filter out noise and retain only the information relevant for matching against future prospects. The extracted data was then chunked, embedded, and stored in the vector database.
Team Survey Pipeline: To capture knowledge that never made it into published content, the team created an internal survey where engineers and project managers could describe the problems they had solved and how. These responses were freeform and varied widely in detail and format. The pipeline had to normalize those responses into a consistent structure that could be stored alongside case study data in the same knowledge base, so that both sources could be searched together.
Project Feature Log Pipeline: This was the most ambitious of the three. Softjourn's project managers maintain logs of features delivered on each project, but these logs track what was built, not why. There is no mention of the client's original problem or the reasoning behind the solution. The pipeline had to work backwards from a list of shipped features to infer what business problems the client was likely facing and what solutions those features represented. This required more sophisticated prompt design and multiple processing steps compared to the other two pipelines.

All three pipelines fed their output into a shared vector database, creating a single, unified knowledge base from sources that had never been connected before. Getting each pipeline right required iterative testing and close attention to how the outputs from one processing step affected everything downstream.

A Lead Generation Pipeline Built on Top

With a structured knowledge base in place, the team built a lead generation pipeline as a practical application on top of it. A user types a natural language query into a custom Slack bot (for example, "find employee expense management companies in the US with 10-50 employees who received investments last year"). The system translates that into structured search criteria, pulls matching companies from a data provider, enriches each one, then runs a series of AI-generated hypotheses about what problems that company might be facing. Those hypotheses are cross-referenced against the knowledge base, and each company receives a multi-dimensional score based on experience match, hypothesis confidence, and how well it fit the original search criteria.

The top results come back with a company snapshot, a rationale for why the lead is relevant, specific proof from past Softjourn projects, items to verify before outreach, and a suggested first contact angle.

Production-Grade by Design

The team deliberately chose to build reusable infrastructure rather than a disposable prototype. The entire system runs on a serverless architecture using AWS Lambda and Step Functions, organized in a monorepo where each component (ingestion, enrichment, scoring, retrieval, Slack delivery) lives as its own independently testable module. Over six months of part-time work, the project accumulated 398 commits and 12 releases. The goal was a platform baseline that could be picked up and applied to future AI projects, not just a one-off demo.

More Than Lead Search

The Slack bot also supports searching past Softjourn solutions by keyword and brainstorming potential approaches to propose to a prospect. Together with lead generation, the three commands turn the knowledge base into a practical on-demand reference tool for business development conversations.

The Benefits

Internal Knowledge and a Cross-Industry Tool

The project achieved what it set out to do. The R&D team came away with hands-on experience building production-grade AI systems, including multi-step agentic pipelines, vector database design, prompt engineering at scale, and serverless orchestration on AWS. Just as importantly, they gained a realistic understanding of where these systems are genuinely difficult.

The structured knowledge base is itself a lasting asset, independent of the lead generation tool built on top of it. The same foundation could power an internal search tool for project managers, a client-facing experience browser, a support chatbot, or any future application that needs to draw on Softjourn's delivery history. The harder work of structuring 25 years of scattered data into something AI can actually use is already done.

For any services company with a deep project history, a tool like this could significantly reduce the time involved in qualifying leads and connecting prospects to relevant experience. Instead of manually cross-referencing a company's profile against years of past projects, a team could receive scored, prioritized recommendations with supporting evidence and a suggested outreach angle in minutes.

Results at a Glance

Metric	Result
Ingestion pipelines built	3 (case studies, team surveys, PM feature logs)
Slack bot commands	3 (lead search, solution lookup, brainstorm)
Codebase	398 commits, 12 releases
Architecture	Serverless monorepo, modular microservices
Cost per query	~$0.50 per 100 companies processed
Project duration	~6 months, part-time

What We Learned from Making a Searchable Knowledge Base

1. Multi-step AI pipelines multiply errors, they don't hide them. Even when each individual step in the pipeline performed at 70-90% accuracy, the compounded output was noticeably worse than expected. The math is simple: 0.9 x 0.9 x 0.9 does not equal magic. Every step in the chain needs careful attention, not just to its own performance, but to how it connects to the steps before and after it. Building one good prompt is straightforward. Building seven that work together reliably is a different challenge entirely.

2. The AI is about 10% of the work. The team estimated that the actual AI components (prompts, LLM calls, vector search) made up roughly a tenth of the total project effort. The other 90% went to data preparation, infrastructure, serverless orchestration, error handling, deployment, and testing. For any organization planning a similar initiative, that ratio is worth factoring into timelines and budgets early.

3. Transparency should be designed in, not added later. The team deliberately built the system to flag its own limitations. When a query included criteria the data source couldn't verify, the tool returned a caution notice explaining what it could and couldn't confirm, rather than guessing or staying silent. Results were framed as qualified starting points for human review, not finished recommendations. That honesty makes the tool more useful, not less, because the person acting on the results knows exactly where to focus their own judgment.

Conclusion: A Foundation for Future AI Search Tools

The AI Sales Assistant started as an R&D experiment and ended as something more lasting: a production-grade platform, a structured knowledge base built from 25 years of project history, and a team with real, hands-on experience building the kind of AI systems that clients are increasingly asking about.

The platform, parsing pipelines, and architecture patterns all remain available as a foundation for future work, whether that means expanding the tool internally, adapting it for a different use case, or applying the same patterns to a client project. For Softjourn's R&D team, the most valuable outcome was not just a working MVP. It was knowing exactly what it takes to build one well.

Whether you are looking to build an agentic data pipeline, structure your company's institutional knowledge for AI search, or just want to understand what a project like this really involves, our team has been through it firsthand. Contact Softjourn to start the conversation.

Want to Know More?

Fill out the form to discuss your idea with us!