Vector Space: How the Next Technological Evolution is Evolving Data Science

Global data production is expected to touch 180 zettabytes by 2025. Of this, around 80–90% is unstructured data in the form of images, videos, documents, and plain text that is quite difficult to decipher. While organizations would no doubt benefit immensely from exploring these vast untapped treasure troves of data, the fact is that traditional databases are often incapable of interpreting them correctly.

Thankfully, advanced machine learning (ML) techniques can convert unstructured data into an easily analyzable numerical format called vector embeddings. Vector platforms are custom-built to store, index, and query millions of vectors and rank them accurately in real time. However, building these advanced databases is difficult, as they are highly labor-intensive and demand specialized skill sets.

Jacky Koh, Co-founder, Relevance AI

Relevance AI is building a developer-centric “vector” platform that enables companies to develop artificial intelligence (AI) applications that quickly extract insights from unstructured data, with next to no labeling or algorithm training required. The company also recently raised $3 million to further develop the platform.

With its products already in use by over three million end users and 100 million weekly requests across various sectors, including SaaS, e-commerce, and gaming, Relevance AI is shaking up the vector database industry.

We caught up with Jacky Koh, co-founder of Relevance AI, to discuss how vector-based technology can help organizations derive valuable insights from unstructured data.

Earlier, Koh led the machine learning team at IAG satellite, focusing on bringing ML to the insurance industry. Koh also won EY’s first Data Science Challenge and created several mobile applications that reached No. 1 on the app store, with a combined 7.5 million installs.

Tell us how and why you started Relevance AI.

Jacky Koh: During my time in the corporate industry and across the different companies/clients I worked with, I saw large amounts of unstructured data that wasn’t really being used. Many of the core use cases to improve the business KPIs, such as customer feedback, satisfaction, fraud, etc., produced a lot of unstructured data.

Deep learning embeddings (more commonly known as vectors) are one of the most effective ways to analyze unstructured data. Vectors represent the data in a numerical format that lets algorithms compare and analyze it. It’s a technique that all the top tech companies use. However, it is challenging to apply unless you run a sophisticated data science team with experience handling unstructured data.
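To make the idea of comparing vectors concrete, here is a minimal sketch using cosine similarity, one common way to measure how close two embeddings are. The three-dimensional vectors and their names are made up for illustration; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (assumed values, not produced by a real model)
feedback_a = [0.9, 0.1, 0.2]   # a complaint about a feature
feedback_b = [0.8, 0.2, 0.3]   # a differently worded complaint about the same feature
unrelated = [0.1, 0.9, 0.1]    # a comment on a different topic

sim_close = cosine_similarity(feedback_a, feedback_b)
sim_far = cosine_similarity(feedback_a, unrelated)
print(sim_close > sim_far)  # the two related items score higher
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they have little in common.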

So, I started Relevance AI to democratize this access and allow developers and data teams of any skill level to use this technology and tap their data’s potential for their business, while ensuring that the insights and applications are produced collaboratively with stakeholders.

Walk us through the recent advances in data science thanks to vector-based technology.

JK: Typically, data science has centered on analyzing structured data, such as demographic and numeric data. Free-text, image-attachment, and URL fields were often ignored. Thanks to vectors, that data can now be properly analyzed and combined with structured data to improve models and gain more visibility into the qualitative factors behind the insights.

Additionally, a vector database is designed to store and query vectors at low latency and high volume. Traditional databases do not have these features, as they typically deal with structured, tabular data. However, vector databases rarely provide features for storing traditional structured data.

The unique thing about the vector database we provide is that we built it with a traditional database in mind, because unstructured data is best analyzed when it is combined with structured data (e.g., linking common natural-language topics surfaced in the text to the relevant KPIs and user demographics).
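As a rough illustration of combining structured fields with vectors, the sketch below filters records on a structured field and then ranks the survivors by vector similarity. The record layout, the `plan` field, and the `search` function are hypothetical and are not Relevance AI's API. It uses brute-force comparison, whereas production vector databases typically rely on approximate nearest-neighbor indexes.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical records: structured fields stored next to an embedding
records = [
    {"id": 1, "plan": "pro", "vector": [0.9, 0.1]},
    {"id": 2, "plan": "free", "vector": [0.85, 0.2]},
    {"id": 3, "plan": "pro", "vector": [0.1, 0.95]},
]

def search(query_vector, records, top_k=2, **filters):
    """Keep records matching every structured filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r.get(field) == value for field, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query_vector, r["vector"]),
        reverse=True,
    )
    return ranked[:top_k]

results = search([1.0, 0.0], records, top_k=2, plan="pro")
print([r["id"] for r in results])
```

Only the two "pro" records are considered, and the one whose vector is closest to the query is ranked first.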

Share with our readers a few details about your Clustering engine.

JK: Our clustering engine is our flagship capability to help businesses find trends and themes in unstructured data. It gives developers a toolset to get the best clusters from their unstructured data and then deploy them to a front-end dashboard where the business can see trends and give feedback.

What clustering does is group your unstructured data by the similarity of their vectors, which encapsulate the semantic features we would associate with the data.

For example, in a sample of text that includes one million or more words, two sentences such as “I really disliked the pop-up on the sign-up page” and “Not a fan of the random window that appeared when I tried to join” are worded differently but are semantically very similar. Thus, the two sentences end up in the same cluster. This works not only on sentences but also at the word, paragraph, and document levels.
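The grouping described above can be sketched with k-means, a widely used clustering algorithm. This is an illustrative stand-in, not necessarily the algorithm Relevance AI uses, and the two-dimensional vectors are invented placeholders for real sentence embeddings.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def kmeans(points, k, iterations=10):
    # Farthest-point initialization keeps this sketch deterministic
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Move each centroid to the mean of its members
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = mean(members)
    return labels, centroids

# Hypothetical embeddings: two pop-up complaints, two delivery complaints
points = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.8]]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The first two vectors land in one cluster and the last two in another, mirroring how differently worded but semantically similar sentences end up grouped together.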

Our clustering application provides not only this grouping but also KPIs for each cluster, so you can see how the clusters correlate with business outcomes.

Case in point: if you want to find the most effective marketing message for your car, feed your messages through our clustering engine. The resulting clustering application would show you all the clusters and highlight insights, for example that Cluster-29 out of 60 clusters has the highest click-through rate and lowest cost. That cluster mainly talks about “how sustainable the car is.”
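A minimal sketch of attaching KPIs to clusters, assuming each marketing message has already been assigned a cluster ID. The rows, field order, and numbers below are invented for illustration; only the cluster number 29 echoes the example above.

```python
from collections import defaultdict

# Hypothetical rows: (cluster_id, impressions, clicks, spend in dollars)
ads = [
    (29, 1000, 80, 50.0),
    (29, 2000, 150, 90.0),
    (12, 1500, 45, 120.0),
    (12, 500, 10, 60.0),
]

def summarize(ads):
    """Aggregate click-through rate and cost per click for each cluster."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "spend": 0.0})
    for cluster_id, impressions, clicks, spend in ads:
        totals[cluster_id]["impressions"] += impressions
        totals[cluster_id]["clicks"] += clicks
        totals[cluster_id]["spend"] += spend
    return {
        cluster_id: {
            "ctr": t["clicks"] / t["impressions"],
            "cost_per_click": t["spend"] / t["clicks"],
        }
        for cluster_id, t in totals.items()
    }

summary = summarize(ads)
best = max(summary, key=lambda cluster_id: summary[cluster_id]["ctr"])
print(best, summary[best])
```

In this toy data, cluster 29 has both the higher click-through rate and the lower cost per click, which is the kind of insight the dashboard would surface.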

Our app allows you to drill down and verify that the cluster is accurate before acting on that insight.

What are the common problems that data scientists and practitioners face with clustering?

JK: The evaluation and interpretation side of it. Applying clustering as an algorithm isn’t difficult, but evaluating it requires deep statistical knowledge to check whether the clustering is quantitatively sound.

And once you do that, you have to see if these clusters make sense to domain experts, which means visualizing these clusters for the non-technical domain expert to view and interpret them appropriately.
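One common quantitative check of cluster quality is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster. The sketch below is a plain-Python version for illustration, run on made-up vectors; it is one of several possible metrics and is not claimed to be what Relevance AI uses.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    cluster_ids = set(labels)
    scores = []
    for i, p in enumerate(points):
        # Mean distance from p to every cluster, excluding p itself
        mean_dist = {}
        for cid in cluster_ids:
            dists = [
                euclidean(p, q)
                for j, q in enumerate(points)
                if labels[j] == cid and j != i
            ]
            if dists:
                mean_dist[cid] = sum(dists) / len(dists)
        own = labels[i]
        if own not in mean_dist:
            scores.append(0.0)  # singleton cluster scores 0 by convention
            continue
        a = mean_dist[own]                                        # cohesion
        b = min(d for cid, d in mean_dist.items() if cid != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical embeddings with one sensible and one poor labeling
points = [[0.9, 0.1], [0.85, 0.15], [0.1, 0.9], [0.15, 0.8]]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
print(good > bad)
```

Scores range from -1 to 1. Values near 1 indicate tight, well-separated clusters, while negative values suggest points have been assigned to the wrong cluster. A good score still says nothing about whether the clusters make sense to a domain expert.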

Give us an overview of the use-cases regarding vector applications on your platform.

JK: The use cases are mainly related to analyzing unstructured text and image data. They are:

Semantic Search

Analyze text data based on customer feedback/inquiries to determine a set number of cluster topics.

Suppose you have a sample of textual feedback on a product with over 100K words, out of which you get 30 cluster topics. Cluster-1 could contain sentences like “stopped charging,” “battery isn’t working,” or “power stays at 0.” While the words in this cluster are entirely different, the meaning is semantically similar.

Our vector clustering engine can not only group these sentences but also surface the common theme. So, in this case, the cluster topic theme would be “battery charging not working.”
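One simple way to approximate a cluster's theme is to pick the member closest to the cluster centroid as a representative. This is an assumed technique shown for illustration; it returns an existing sentence rather than generating a summary label, and the vectors are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def representative(texts, vectors):
    """Return the text whose vector lies closest to the cluster centroid."""
    center = centroid(vectors)
    closest = min(
        range(len(texts)),
        key=lambda i: squared_distance(vectors[i], center),
    )
    return texts[closest]

# Cluster members from the example above, with hypothetical embeddings
texts = ["stopped charging", "battery isn't working", "power stays at 0"]
vectors = [[0.8, 0.2], [0.7, 0.3], [0.6, 0.35]]

theme = representative(texts, vectors)
print(theme)
```

With these assumed vectors, the most central sentence is "battery isn't working", which gives a reviewer a quick handle on what the cluster is about.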

Image Search

Analyze unstructured marketing data to see which images and visual features are working well. Say you have an Instagram influencer campaign going for your wine brand, and you extract 60 visual clustering topics.

  • Cluster-6 could contain your product images being displayed in front of a family at a dinner table. It has an average of 20K likes, $2M in sales, and costs $25K across 15 posts. 
  • Cluster-7 could be of your product images displayed with two people at a hotel, with an average of 50K likes, $1M in sales, and cost of $75K across 10 posts.

By analyzing the visual features, you can now tell that your product is most effective for sales when it’s displayed in front of a family. 
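The comparison above comes down to simple arithmetic. The sketch below computes sales per dollar of cost for the two clusters, using only the figures quoted in the example; the dictionary layout and function name are illustrative.

```python
# Figures taken from the Cluster-6 and Cluster-7 example above
clusters = {
    "Cluster-6 (family at a dinner table)": {
        "avg_likes": 20_000, "sales": 2_000_000, "cost": 25_000, "posts": 15,
    },
    "Cluster-7 (two people at a hotel)": {
        "avg_likes": 50_000, "sales": 1_000_000, "cost": 75_000, "posts": 10,
    },
}

def sales_per_dollar(cluster):
    """Sales generated for every dollar spent on the cluster's posts."""
    return cluster["sales"] / cluster["cost"]

ranked = sorted(
    clusters,
    key=lambda name: sales_per_dollar(clusters[name]),
    reverse=True,
)
for name in ranked:
    print(name, round(sales_per_dollar(clusters[name]), 1))
```

Cluster-6 returns $80 in sales per dollar spent, against roughly $13 for Cluster-7, even though Cluster-7 attracts more likes. Likes alone would have pointed to the wrong creative.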

Personalized search

You can offer better and more personalized discovery through search and recommendations based on the product’s description and visual features.

There have been use cases outside of this too, notably with unstructured data about user interactions in the gaming industry. You can vectorize every action in the game and then use clustering to determine the different playstyles formed from a combination of actions. For instance, Cluster-1 is an all-rounder playstyle where the player attacks first, rests, and then defends. In contrast, Cluster-2 is a super aggressive playstyle where the player attacks, rests, and attacks. By finding these playstyles, we can figure out which one has the highest win rate, match new players, and help them receive a more personalized experience in improving their gameplay.
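The gaming example can be sketched by turning each action sequence into a vector of transition counts and then computing a win rate per playstyle. The encoding, the match data, and the grouping of identical vectors (a stand-in for a real clustering step) are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

ACTIONS = ["attack", "rest", "defend"]
TRANSITIONS = list(product(ACTIONS, repeat=2))  # 9 ordered action pairs

def vectorize(sequence):
    """Count how often each action is followed by each other action."""
    counts = {pair: 0 for pair in TRANSITIONS}
    for pair in zip(sequence, sequence[1:]):
        counts[pair] += 1
    return [counts[pair] for pair in TRANSITIONS]

# Hypothetical matches: (action sequence, whether the player won)
matches = [
    (["attack", "rest", "defend"], True),
    (["attack", "rest", "attack"], False),
    (["attack", "rest", "defend"], True),
    (["attack", "rest", "attack"], True),
]

# Group identical vectors; a real pipeline would cluster similar ones
results_by_style = defaultdict(list)
for sequence, won in matches:
    results_by_style[tuple(vectorize(sequence))].append(won)

win_rates = {
    style: sum(results) / len(results)
    for style, results in results_by_style.items()
}
all_rounder = tuple(vectorize(["attack", "rest", "defend"]))
aggressive = tuple(vectorize(["attack", "rest", "attack"]))
print(win_rates[all_rounder], win_rates[aggressive])
```

In this toy data the all-rounder style wins every match and the aggressive style wins half, which is the kind of comparison that could guide matchmaking or coaching tips.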

Is the average SMB in a position to adopt vector-based technology?

JK: Absolutely, yes! One of the great things about vectors is that they can be standardized for the data type you input, and they work well across multiple domains out of the box. So, most businesses with the right supporting tools can get 80% of the value from the standardized process; the other 20% is where more sophisticated teams optimize.

What makes Relevance AI unique?

JK: We are end-to-end, with a strong focus on the experimentation and collaboration phase. When we say end-to-end, we mean going from data to data practitioner to domain expert to insights that a business can act on.

We allow data practitioners to turn unstructured data into vectors and then deploy it into a shareable dashboard in a few lines of code. Once shared, the dashboard captures feedback from domain experts, so the data practitioner can take that feedback and optimize and adjust the analysis to be actually insightful to the business.

Once that experimentation phase finishes, the data practitioner can push to production with a single line, app.go_live(). Again, all this is done within our platform.

Looking ahead, what hopes do you have for Relevance AI?

JK: We hope Relevance AI becomes synonymous with the term unstructured data. Like how Salesforce is synonymous with CRM and Slack with team messaging, we want to be synonymous with unstructured data.

For us, that will mean providing as many utilities for our users around unstructured data as possible. Although we are starting with vector clustering for self-serve, we are already looking at many other features in our roadmap, such as vector-based discovery and vector-based model report cards for interpretation, anomaly detection, etc.

Susnigdha Tripathy
Susnigdha Tripathy is a full-time writer and editor who presently lives in Singapore. She has over ten years of experience writing, editing, and delivering exceptional content for her clients. She currently writes for Virtasant, a cloud technology company, and Krista Software, a provider of intelligent automation solutions. Her work also gets published on several high-ranking tech websites.
