
Implementing Retrieval Augmented Generation in Coginiti

Matthew Mullins
March 15, 2024

Coginiti has implemented Retrieval Augmented Generation (RAG) for Coginiti Team and Enterprise customers to improve the quality of AI Assistant interactions. Retrieval Augmented Generation is a way to enhance the quality of responses from a large language model by supplying it with relevant domain knowledge that was not part of its original training data. What follows are some of the technical details about what Retrieval Augmented Generation is, how it works, and how we went about implementing it in Coginiti. 

Coginiti’s AI Assistant

Coginiti’s AI Assistant today is a direct integration with the model services of a number of large language models, combined with some proprietary prompting. Coginiti ships with support for the public APIs from OpenAI and Anthropic, along with cloud model services such as AWS Bedrock and Azure. Coginiti’s strategy is to support customer model choices the same way we support their choices around data platforms.

When a user submits a question to Coginiti’s AI Assistant, we enrich the session with default prompts that add context to the user’s question. These prompts include the name of the connected data platform, such as AWS Redshift or Databricks, which helps the connected language model use the appropriate syntax for that platform when generating SQL. Coginiti also provides a map of the connected schema, including table names, column names, and primary key relationships. These prompts are included to increase the accuracy of the model’s responses.

Basic LLM Query Flow
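To make that flow concrete, here is a minimal sketch of the kind of prompt enrichment described above. Coginiti’s actual prompts are proprietary, so the function and field names below are hypothetical and purely illustrative.

```python
# Minimal, hypothetical sketch of enriching an AI Assistant session with
# default context before the user's question reaches the model service.
# Not Coginiti's actual prompting; names and structure are illustrative only.

def build_system_context(platform: str, schema_map: dict) -> str:
    """Assemble default context: the target platform plus a map of the connected schema."""
    tables = "\n".join(
        f"- {table}({', '.join(columns)})" for table, columns in schema_map.items()
    )
    return (
        f"You are a SQL assistant. Generate SQL using {platform} syntax.\n"
        f"The connected schema contains these tables:\n{tables}"
    )

schema_map = {
    "orders": ["order_id (PK)", "customer_id (FK -> customers)", "order_date", "total"],
    "customers": ["customer_id (PK)", "name", "region"],
}

# This context is sent alongside the user's question to the chosen model service.
system_context = build_system_context("AWS Redshift", schema_map)
print(system_context)
```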

We test these prompts and various database schemas across each of our supported model services for consistent results, though language models are nondeterministic, so performance can vary. Coginiti also lets users create static custom prompts, so they can tailor the interaction with the model to their own needs. (Yes, you can make it talk like a pirate.)

Using Coginiti’s AI Assistant gives consistently good results for text-to-SQL generation, error explanations, performance recommendations, and general data guidance. However, we wanted the AI Assistant to be able to handle more complex questions about the business. To do that, the language model needs more exposure to information about the business. One approach would be to fine-tune a model on information from the business, but this is time-consuming, expensive, and somewhat brittle. (If you want help with that, though, reach out.) Retrieval Augmented Generation presents a less expensive and more flexible way to get contextually relevant information to the model during the user interaction.

What is Retrieval Augmented Generation

Retrieval Augmented Generation could just as well have been called Search Augmented Generation, because it involves searching a collection of data for relevant results and then injecting those results into the context of the user’s chat session. The Coginiti product was well positioned for this kind of implementation because it ships with an embedded repository of relevant data in the form of our Analytics Catalog. The Analytics Catalog stores all of your data and analytics assets, from critical transformation code and cleansing routines to query logic, along with relevant metadata in the form of documentation, comments, and tags. (Hint: if you are not enhancing your catalog assets with metadata today, you will want to start so you can fully leverage the RAG capabilities.)

Coginiti could have easily enabled keyword search across catalog assets, since it already ships with a search engine, but keyword search is limiting. Traditional keyword search is like using a flashlight in the dark to find something specific: you might find things that are labeled correctly but miss related items. Semantic search, in contrast, uses a broader beam that illuminates not just the exact words but also associated concepts and meanings, improving search accuracy by capturing the meaning of a query rather than just its terms. For example, an analyst searching for “holiday sales trends” with keyword search would receive a mix of all queries mentioning “holiday,” “sales,” or “trends,” which could include irrelevant information. With semantic search, the system understands the context of “holiday sales trends” and recognizes the analyst’s likely intent to analyze customer purchasing behavior during holiday seasons.

Semantic search algorithms create a vector space composed of embeddings of every entry in a context corpus, in our case, the Analytics Catalog. That’s a complicated way of saying that embeddings translate words or phrases into a language that computers can understand better. Just as each word in a dictionary has a definition, in the world of AI each word or phrase gets a unique numeric ‘code’ that captures its meaning based on how it’s used in the real world. When a user submits a question, the question is embedded into the same vector space using the same model. The most similar catalog entries sit closest to the question in that space, so the closest embedding (or the ‘k’ closest embeddings) is returned, giving the user the best results for their question. (For a deep read on embeddings, see Vicki Boykis on What are Embeddings.)
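As a rough illustration of that retrieval step (not Coginiti’s production code), the sketch below uses the open-source sentence-transformers library with the small all-MiniLM-L6-v2 model, which also appears in the comparison table further down:

```python
# Rough sketch of embedding-based semantic search over catalog-style assets.
# Uses sentence-transformers with all-MiniLM-L6-v2; the sample "assets" are made up.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny stand-in for catalog assets; real assets would be SQL plus metadata.
catalog_assets = [
    "Query: total revenue by month for the retail business unit",
    "Cleansing routine: deduplicate customer records by email",
    "Query: customer purchasing behavior during the December holiday season",
]

# Embed the corpus once, then embed each incoming question into the same space.
corpus_embeddings = model.encode(catalog_assets, normalize_embeddings=True)
question_embedding = model.encode("holiday sales trends", normalize_embeddings=True)

# Cosine similarity finds the k closest catalog entries to the question.
hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(catalog_assets[hit["corpus_id"]], round(hit["score"], 3))
```

The closest matches would then be injected into the chat session’s context before the question is sent to the language model.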

Coginiti AI Assistant + RAG

Coginiti’s architecture presents a number of design constraints when it comes to implementing new services. Coginiti is deployed software, so each of its services needs to be small enough to run containerized on a single server or scale out to serve thousands of users. Many of Coginiti’s customers deploy into highly secure environments, limiting our ability to call outside services. We performed our initial testing and proof of concept using OpenAI’s embedding API, but as a practical matter Coginiti cannot rely on such services in production because of their public nature. We needed an embeddings model that could run in a container and be relatively fast using CPUs rather than GPUs for compute. We ended up testing six model families:

| Model name | Token limit | Model size (GB) | Embedding dimensions (output vector size) | Time to embed 1024 tokens (s) | Time to process Analytics Catalog (s) | MTEB leaderboard rank (Retrieval) |
| --- | --- | --- | --- | --- | --- | --- |
| all-MiniLM-L6-v2 | 256 | 0.09 | 384 | 0.16 | ~6.5 | 51 |
| UAE-Large-V1 | 512 | 1.34 (0.33 quantized) | 1024 | 1.08 | 161 | 3 |
| bge-large-en-v1.5 | 512 | 1.34 | 1024 | 1.09 | ~200 | 4 |
| bge-base-en-v1.5 | 512 | 0.44 | 768 | 0.34 | 49.9 | 6 |
| bge-small-en-v1.5 | 512 | 0.11 | 384 | 0.24 | 34 | 10 |
| gte-large | 512 | 0.67 | 1024 | 1.08 | 155 | 7 |
| gte-base | 512 | 0.22 | 768 | 0.24 | 54 | 13 |
| gte-small | 512 | 0.07 | 384 | 0.20 | 37 | 20 |
| gte-tiny | 512 | 0.05 | 384 | 0.12 | 14.7 | 41 |
| udever-bloom-7b1 | 2048 | 28.8 | 4096 | N/A (couldn’t run it on CPU) | N/A | 21 |
| voyage-lite-01-instruct | 4096 | N/A (cloud service) | 1024 | N/A | N/A | 1 |

To create our embeddings, Coginiti selected the BGE-M3 embedding model series, a state-of-the-art multi-lingual and cross-lingual model. That multi-lingual and cross-lingual support is important because our embeddings consist of specialized SQL as well as metadata that might be written in a variety of languages. The model’s small size, acceptable token limit, and speed were also critical factors. This embedding model runs as a standalone service within the Coginiti stack, so no data is sent outside the application stack, and it gives us a consistent embeddings tool across all of our clients.
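The sketch below illustrates the general pattern of a small, CPU-only embedding service. It is not Coginiti’s actual service: the framework choice (FastAPI), the endpoint, and the payload shape are assumptions for the example, and it assumes the BAAI/bge-m3 checkpoint can be loaded through sentence-transformers.

```python
# Hypothetical sketch of a containerized, CPU-only embedding service.
# Framework, endpoint, and payload shape are assumptions; Coginiti's internals may differ.
from fastapi import FastAPI
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

app = FastAPI()
# Load the multi-lingual BGE-M3 model once at startup, on CPU.
model = SentenceTransformer("BAAI/bge-m3", device="cpu")

class EmbedRequest(BaseModel):
    texts: list[str]

@app.post("/embed")
def embed(request: EmbedRequest) -> dict:
    # Returns one dense vector per input text; no data leaves the application stack.
    vectors = model.encode(request.texts, normalize_embeddings=True)
    return {"embeddings": vectors.tolist()}
```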

The embeddings for each catalog asset also need to be stored in a way that lets Coginiti easily perform a vector similarity search. There are a number of databases purpose-built as vector stores, Pinecone and Weaviate being two of the leading candidates. Using a standalone vector store, however, would have meant adding yet another service to our stack, adding management complexity and growing the resource demands to run the product. Fortunately, Coginiti already uses Postgres as a backend storage layer, and the pgvector extension enables storing embeddings and performing vector similarity search directly within Postgres. Our testing showed that pgvector was more than adequate for our vector storage and search needs. As of this writing, pgvector is available with the managed Postgres service from every major cloud provider, and it is easily installable for self-managed deployments.
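Here is a minimal sketch of what storing and searching embeddings with pgvector looks like, assuming a local Postgres instance; the table and column names are hypothetical, and a 3-dimensional vector stands in for a real embedding.

```python
# Sketch of storing and searching embeddings with the pgvector extension.
# Connection details, table, and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=coginiti_repo user=postgres")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS catalog_embeddings (
        asset_id   bigint PRIMARY KEY,
        asset_text text,
        embedding  vector(3)  -- in practice, match the embedding model's output size (e.g. 1024)
    );
""")

# Store one asset's embedding; pgvector accepts the '[x, y, z]' text form.
cur.execute(
    "INSERT INTO catalog_embeddings (asset_id, asset_text, embedding) VALUES (%s, %s, %s)",
    (1, "Query: revenue by month", "[0.12, 0.05, 0.91]"),
)

# Retrieve the nearest assets to a query embedding by cosine distance (the <=> operator).
cur.execute(
    "SELECT asset_text FROM catalog_embeddings ORDER BY embedding <=> %s LIMIT 5",
    ("[0.10, 0.07, 0.88]",),
)
print(cur.fetchall())
conn.commit()
```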

Conclusion

We are excited to release our implementation of retrieval augmented generation for Coginiti Team and Enterprise customers! Combining domain-specific customer data with the existing power of generative large language models is a powerful combination. We look forward to learning with our customers how this improves their workflows and accuracy when working with Coginiti’s AI Assistant. If you would like to give it a try or see a demo, reach out to schedule a call.