
How We Built This: Company Classifier

5 min
4/9/2024

One way Ansa uses OpenAI and open-source tools internally

As a firm, we constantly reinvent the spaces we cover, and we believe that is necessary to be successful. In our industry, across a long enough time horizon, every theme degrades. To keep up with the pace of change, we’ve been exploring scaling via software (not just people). While many firms have analyst armies tasked with tagging companies, we wanted to find a way for software to quickly and accurately identify companies within specific subsectors or themes. The reality is that existing industry tagging from data providers isn’t granular or accurate enough, especially when examining product overlap and developing competitive landscapes.

So, we decided to leverage some of the latest LLM advances to build a bespoke company classification engine tailored to our sourcing approach. 

Building a Bespoke Solution

How? We combined OpenAI with some of our favorite open-source developer tools to build our classification engine: Unstructured, LangChain, Qwak, and Reflex. We thought it’d be fun to share what we’ve built and provide some context on why we chose these tools and how they work together.

The core stack includes: 

  • Unstructured.io: Unstructured transforms data from a variety of formats into LLM-compatible JSON files. We leverage their HTML parsing capabilities to help us ingest text and links from company websites. Compared to standard scraping tools like Beautiful Soup, Unstructured preserves the structure and context of the content, which is crucial for identifying product positioning.
  • Langchain.com: LangChain is used throughout the process to streamline our LLM calls across website navigation and summarization. LangChain Expression Language (LCEL) enables more elegant LLM calls that pack in additional functionality while requiring fewer lines of code. We use their stateless agents to navigate websites and summarize content, and ultimately use a standard Chain to force our classifications into structured outputs.
  • Qwak.com: Qwak has a vector database we use to efficiently analyze and compare massive amounts of web content. Our favorite aspect of Qwak’s database is its integration of metadata with vectors. When searching for specific business or pricing models, metadata filtering can more efficiently guide us to the relevant subset of vectors. That said, the core reason to leverage Qwak is their end-to-end MLOps capabilities, so we have even kicked around migrating from Databricks to them!
  • Reflex.dev: Finally, we built our sourcing web app on top of Reflex, which lets us host the application ourselves and provide self-serve access to our company database. Reflex allows us to build a modern frontend with a high-performance backend, all in Python. The flexibility of code has allowed us to add features like natural-language-to-SQL search and run complex workflows that would be too difficult to build, maintain, and manage in no-code / low-code app builders.
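To make the metadata-plus-vectors idea concrete, here is a minimal, library-free sketch of filtered similarity search. It is not Qwak's API or our production code: the Record class, the search function, the "pricing_model" metadata key, and the toy three-dimensional vectors are all assumptions made up for illustration. The point is only the order of operations: filter on metadata first, then rank the smaller candidate set by cosine similarity.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    """One embedded chunk of website text plus its metadata (hypothetical shape)."""
    company: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(records, query, filters, top_k=3):
    """Keep records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in filters.items())
    ]
    scored = [(cosine_similarity(r.vector, query), r.company) for r in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: company names, vectors, and metadata values are invented.
records = [
    Record("Acme Billing", [0.9, 0.1, 0.0], {"pricing_model": "usage_based"}),
    Record("Globex Pay", [0.8, 0.2, 0.1], {"pricing_model": "subscription"}),
    Record("Initech Meter", [0.7, 0.3, 0.0], {"pricing_model": "usage_based"}),
]

results = search(records, [1.0, 0.0, 0.0], {"pricing_model": "usage_based"}, top_k=2)
for score, company in results:
    print(f"{company}: {score:.3f}")
```

In this toy run, the subscription-priced company never reaches the similarity step, which is the efficiency gain the metadata filter is meant to provide. A real vector store would typically apply the filter inside its index rather than in a Python list comprehension.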

A natural question you may be asking is, “Why bother stitching these tools together? Can't you just use Crunchbase?” In short, we chose this route for 3 reasons: 

  1. Accuracy and Granularity: We can categorize companies within specific subsectors and identify similar product offerings, giving us a more precise understanding of the competitive landscape.
  2. Flexibility: The system adapts to our evolving needs. As our research expands and industries evolve, we can quickly add new categories or refine existing ones.
  3. Ownership: We control the data and the process, ensuring the quality and consistency of our classifications. Similar-company suggestions and classifications from data providers are often a black box. By controlling the parameters, we can ensure the results match our needs.
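The flexibility and ownership points can be illustrated with a small sketch in which the taxonomy is plain data and every model reply is validated against it. This is an assumption-laden illustration, not our actual pipeline: the category names, the prompt wording, and the helpers build_prompt and parse_classification are hypothetical, and the LLM call itself is replaced by a hard-coded JSON string.

```python
import json
from dataclasses import dataclass

# Hypothetical taxonomy: category name -> short description shown to the model.
CATEGORIES = {
    "usage_based_billing": "Metering and billing for consumption-priced products",
    "payments_infrastructure": "APIs for moving or accepting money",
}


@dataclass(frozen=True)
class Classification:
    company: str
    category: str
    rationale: str


def build_prompt(company: str, summary: str, categories: dict[str, str]) -> str:
    """Render the current taxonomy into a classification prompt."""
    options = "\n".join(f"- {name}: {desc}" for name, desc in categories.items())
    return (
        f"Classify {company} into exactly one category.\n"
        f"Categories:\n{options}\n"
        f"Website summary: {summary}\n"
        'Reply with JSON: {"category": "...", "rationale": "..."}'
    )


def parse_classification(company: str, raw: str, categories: dict[str, str]) -> Classification:
    """Parse a JSON reply and reject any category outside the taxonomy."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    return Classification(company, category, str(data.get("rationale", "")))


# Adding a category is a one-line data change; the prompt and validation follow.
CATEGORIES["fraud_detection"] = "Tools for detecting fraudulent transactions"

prompt = build_prompt("Hypothetical Co", "Sells risk scoring APIs.", CATEGORIES)

# Stand-in for a model reply; a real system would get this string from an LLM call.
raw_reply = '{"category": "fraud_detection", "rationale": "Sells risk scoring APIs"}'
result = parse_classification("Hypothetical Co", raw_reply, CATEGORIES)
print(result)
```

Because the categories live in one dictionary, refining a description or adding a theme changes both what the model is asked and what replies are accepted, with no other code edits.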

Ultimately, the resulting classifications have provided us with deeper insights into new industries, exposed us to companies before our peers, and given us an edge. We're constantly iterating on this tool, and others like it, and we'd love to connect with others who are building software that we can learn from, back, or look to adopt. Reach out to us at marco@ansa.co and ryan@ansa.co, and please let us know if you’d like to see more of How We Built This covering other portions of our stack.
