Winning the Gen AI Race with Your Custom Data Strategy

--

The landscape of Generative AI (Gen AI) is evolving rapidly, presenting Product Managers with the formidable challenge of devising winning strategies for their products in this fiercely competitive market. While much attention is drawn to the emergence of larger and more advanced Foundational Models (FM) such as Llama, Mistral, Claude, Cohere, Gemma, GPT-4, and Granite, it’s essential to recognize that the FM serves merely as the foundational element of the solution.

Ultimately, success hinges on strategically using business or custom data to tailor FMs within Gen AI products.

Drawing from my experience leading various Data & AI products, I aim to provide fellow Product Managers (PMs) with a concise guide to crafting a Data Strategy for their Gen AI product initiatives, with insights that improve their chances of success.

Opportunity — Custom Data as Strategic Edge in the Gen AI Era

Market Trend — Democratization of Gen AI with Open Source FM

The open-source movement undoubtedly fosters innovation while establishing a fair competitive landscape for startups vis-à-vis monopolistic corporations. Similar dynamics are unfolding within Gen AI, which traces its roots back to Google’s seminal paper “Attention Is All You Need.” Since then, open-source Foundational Models (FMs) have proliferated almost daily, vigorously competing with their proprietary counterparts. While offering comparable quality and performance, open-source models gain a competitive edge over commercial ones through their cost-effectiveness. Presented here is a benchmarking analysis comparing open-source and closed-source FMs to illustrate this phenomenon.

Credit — Artificial Analysis AI

In the recent past, the advent of open source, SaaS models, and cloud platforms has undeniably democratized the technological landscape, and Gen AI is following suit. Consequently, it’s plausible to assert that Gen AI Foundational Models (FMs) are poised to become a commodity, a foundational capability universally accessible to all and offering no strategic advantage to any one particular entity.

Opportunity — Leveraging Custom Data to Succeed in the Gen AI Race

Closed and Open Foundational Models (FMs) excel in generalized behavior but lack business context and domain nuances.

However, customizing these FMs with proprietary business data offers a competitive edge. Leading the race in Gen AI requires leveraging organization-specific data to tailor FMs, creating unique value propositions.

This presents a significant opportunity for PMs to integrate Gen AI capabilities and gain a difficult-to-replicate competitive advantage. To seize this opportunity, PMs must collaborate with technical experts to develop a Custom Data Strategy aligned with their organization’s goals.

Strategy — Custom Data Strategy for Gen AI Product

In the context of Gen AI, data is the cornerstone of the model’s success. As a PM venturing into building such a product, understanding that the model’s effectiveness relies on timely access, quality, and governance of data is crucial. Timely data access ensures relevance and adaptability, while data quality directly impacts output excellence. Robust data governance practices address ethical concerns, like privacy and bias. Prioritizing these aspects in product development lays the foundation for a Gen AI solution that excels in performance while upholding ethical standards and user trust.

Data Requirement Analysis for Gen AI

Before outlining a data strategy, PMs need to examine the primary data needs of a Gen AI product. The diagram below illustrates the interaction between users, the application, and the LLM, along with four distinct types of data inputs that shape the product’s output; a brief sketch after the list shows how these inputs might be assembled into a single prompt.

  • Behavioral Context — Detailed prompts guide the AI toward the desired output. Prompts should be diverse, covering tasks like customer service responses, emails, product descriptions, and creative content.
  • Situational Context — A comprehensive knowledge base includes business domain information and the scenarios the AI operates within. Data covers products, services, policies, industry standards, and customer preferences, sourced from historical records, transactions, feedback, and market research.
  • Semantic Context — Semantic data aids the AI in understanding meaning and context. It includes ontologies, taxonomies, and relationships between entities, such as customer-product and employee-job roles.
  • Knowledge Base — Conversational data from sources like chat logs and social media, along with internal documents and training materials, helps the AI understand language usage and organizational operations, aligning content with goals and standards.
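To make these four inputs concrete, here is a minimal sketch in Python, assuming a chat-style FM API; the template, helper names, and field contents are hypothetical, and a real application would populate them from the data fabric described later.

```python
# Minimal sketch: assembling the four data inputs into one prompt.
# All names (SYSTEM_TEMPLATE, build_prompt, field contents) are hypothetical.

SYSTEM_TEMPLATE = """You are an assistant for {company}.
Behavioral context (how to respond): {behavioral_context}
Situational context (business domain): {situational_context}
Semantic context (entities and relationships): {semantic_context}
Knowledge base excerpts:
{knowledge_snippets}
"""

def build_prompt(user_query: str, company: str, behavioral_context: str,
                 situational_context: str, semantic_context: str,
                 knowledge_snippets: list[str]) -> list[dict]:
    """Combine behavioral, situational, and semantic context with retrieved
    knowledge-base snippets into a chat-style prompt for the FM."""
    system = SYSTEM_TEMPLATE.format(
        company=company,
        behavioral_context=behavioral_context,
        situational_context=situational_context,
        semantic_context=semantic_context,
        knowledge_snippets="\n".join(f"- {s}" for s in knowledge_snippets),
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_query},
    ]
```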

Custom Data Strategy for Gen AI Products:

Essentially, the Data Fabric Strategy is a comprehensive plan to seamlessly integrate diverse data sources, processing capabilities, and AI algorithms to enable the creation, training, and deployment of generative AI models. It provides a unified platform approach to Collect, Organize, and Govern data, facilitating the development of winning AI products.

The Product Manager establishes the North Star Metrics (NSM) for the product according to the business context, with the most prevalent and crucial NSM being User Experience, which hinges on three pivotal factors.

  • Latency — The time taken for the generative AI system to process input data and produce output.
  • Accuracy — The degree to which the generative AI models produce high-quality outputs that closely resemble the desired content.
  • Ethics — Whether generated content is safe, fair, transparent, and explainable (a minimal instrumentation sketch follows this list).
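As a minimal illustration of how these factors can be instrumented, the sketch below assumes only a generate_fn callable that wraps whichever FM is in use; it measures latency directly and logs prompt/response samples so that accuracy and ethics can be reviewed offline. The function and logger names are hypothetical.

```python
import logging
import time

logger = logging.getLogger("genai_ux_metrics")

def timed_generation(generate_fn, prompt: str) -> str:
    """Wrap an FM call: capture per-request latency and log a response sample
    for later accuracy and ethics (safety, fairness) review."""
    start = time.perf_counter()
    response = generate_fn(prompt)  # generate_fn: any callable that invokes the FM
    latency_ms = (time.perf_counter() - start) * 1000
    logger.info("latency_ms=%.1f prompt_chars=%d", latency_ms, len(prompt))
    logger.info("response_sample=%r", str(response)[:200])  # sampled for review
    return response
```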

With these User Experience criteria for Gen AI Product, the Product Manager can now craft a Data Fabric Strategy (Collect, Organize, and Govern) with their respective North Star Metrics.

Data Collection: Gathering data from diverse sources with seamless integration and efficient preprocessing.

Strategy:

  • Facilitate swift retrieval with seamless integration from diverse data sources using Zero ETL Integrations, ensuring minimal latency in accessing data for Gen AI models.
  • Aggregate and pre-process extensive volumes of structured and unstructured data in the Data Lake / Warehouse, optimizing for accuracy and efficiency in data processing to support robust Gen AI model training.

Metrics:

  • Data Retrieval Index (DRI): A composite index measuring how quickly Gen AI models can access required data without delays. It captures the efficiency of data retrieval and integration without ETL, indicating the readiness of data for model training and inference (a simple composite-index sketch follows).
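One simple way to operationalize such a composite index, sketched below with hypothetical component metrics and weights, is a weighted average of normalized (0–1) components; the same pattern can be reused for the DCI and ECI introduced in the following sections.

```python
def composite_index(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of component metrics, each already normalized to 0..1."""
    return sum(metrics[name] * weights[name] for name in weights) / sum(weights.values())

# Illustrative DRI built from three hypothetical components
dri = composite_index(
    metrics={"retrieval_speed": 0.92, "integration_coverage": 0.80, "data_freshness": 0.75},
    weights={"retrieval_speed": 0.5, "integration_coverage": 0.3, "data_freshness": 0.2},
)
print(f"Data Retrieval Index: {dri:.2f}")
```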

Data Organization: Establishing a structured data catalog and refining data for better comprehension, contextualization and analysis.

Strategy:

  • Establish a comprehensive data catalog and graph to build situational context for FMs, enabling better understanding and utilization of data for Gen AI tasks.
  • Refine and prepare the data into native and vectorized embeddings to enhance semantic comprehension for FMs, improving the accuracy and interpretability of generative outputs.

Metrics:

  • Data Contextualisation Index (DCI): A composite index measuring the improvement in generative model performance attributed to enhanced semantic comprehension of input data. It reflects the impact of data refinement and contextual understanding on the quality of generated outputs.

Data Governance: Ensuring data security, compliance, and superior quality aligned with ethical AI principles.

Strategy:

  • Elevate data security framework and compliance adherence across data, models, and prompts to prevent FM hacking attempts, prioritizing data integrity and confidentiality in Gen AI processes.
  • Ensure superior data quality coverage in alignment with ethical AI principles (Explainability, Fairness, Robustness, Privacy, Transparency) for the FM’s generative nature, fostering trustworthiness and responsible use of AI-generated content.

Metrics:

  • Ethical AI Compliance Index (ECI): A composite index assessing the adherence to ethical AI principles (Explainability, Fairness, Robustness, Privacy, Transparency) across data governance and model development practices. It assures ethical compliance and trustworthiness of the Gen AI product.

Design — Data Fabric Architecture for Gen AI Product

Data Fabric Capabilities for Gen AI Product:

At this stage, the Product Manager possesses a thorough understanding of the essential capabilities needed to fulfill the data requirements for Generative AI (such as situational and semantic context, and access to a vast knowledge base) sourced from various data repositories (including vectorized data, graph data, data lakes, etc.). With a defined Data Fabric strategy, the PM is well-equipped to delineate the Data Fabric Capability Framework and articulate the necessary functionalities.

Data Collection:

  • Zero ETL Integration: This enables swift and seamless retrieval of data from diverse sources without the need for ETL processes, ensuring minimal latency in accessing data for generative AI models. By adopting a unified data integration approach, it simplifies data ingestion and enhances real-time data availability, supporting timely responses to generative AI tasks.
  • Data Lake/Warehouse Aggregation and Pre-processing: Through scalable data aggregation, this component centralizes extensive volumes of structured and unstructured data in a Data Lake or Warehouse, providing a reliable storage foundation for generative AI model training. Efficient pre-processing pipelines optimize data quality and readiness, streamlining preparation and making model training smoother and more effective (a minimal aggregation and pre-processing sketch follows this list).
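Zero ETL integration is typically a platform feature rather than application code, but the aggregation and pre-processing step can be sketched simply. The example below assumes CSV extracts and a Parquet-based lake, with hypothetical paths, and performs basic deduplication and cleaning before the data is used for model training.

```python
import pandas as pd

def ingest_to_lake(source_paths: list[str], lake_path: str) -> pd.DataFrame:
    """Aggregate raw CSV extracts into a single Parquet dataset in the data lake,
    with basic pre-processing (deduplication, dropping empty rows)."""
    frames = [pd.read_csv(path) for path in source_paths]
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates().dropna(how="all")
    combined.to_parquet(lake_path, index=False)  # lake_path is hypothetical, e.g. "lake/customers.parquet"
    return combined
```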

Data Organization:

  • Data Catalog and Graph DB: This component centralizes metadata and lineage information in a comprehensive data catalog, enabling FMs to better understand and utilize data for generative AI tasks, while a graph-based situational context enhances FMs’ awareness of data relationships and dependencies, facilitating more informed decision-making during generation.
  • Data Refinement and Vector Embedding: Through a refinement pipeline, raw data is processed into standardized formats, ensuring quality and consistency for generative AI tasks, while native and vectorized embeddings capture semantic meaning and context, enhancing FMs’ comprehension and ability to produce accurate and interpretable generative outputs (see the embedding sketch after this list).
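A minimal sketch of the refinement and embedding step is shown below; it assumes the sentence-transformers library and the all-MiniLM-L6-v2 model purely as an example, and uses naive fixed-size chunking, whereas a production pipeline would choose chunking, models, and a vector store suited to the domain.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model (assumption)

def chunk(text: str, size: int = 500) -> list[str]:
    """Split a refined document into fixed-size character chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_documents(documents: list[str]) -> tuple[list[str], np.ndarray]:
    """Embed every chunk; the vectors would then be loaded into the vector store
    that the Data Fabric exposes to the FM for retrieval."""
    chunks = [c for doc in documents for c in chunk(doc)]
    vectors = model.encode(chunks, normalize_embeddings=True)
    return chunks, np.asarray(vectors)
```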

Data Governance:

  • Data Security Management: This ensures data security by implementing robust encryption and access controls, safeguarding sensitive data from unauthorized access or tampering. Additionally, it maintains compliance with regulatory standards through continuous monitoring and auditing processes, minimizing the risk of hacking attempts and ensuring data integrity and confidentiality.
  • Ethical AI Compliance Management: This promotes trust in AI-generated content by integrating explainability and transparency tools, allowing stakeholders to understand how content is generated and make informed decisions about its use. Furthermore, it mitigates biases and ensures fairness in content generation through the adoption of fairness-aware machine learning techniques, fostering responsible and ethical AI practices (a minimal governance guardrail sketch follows this list).
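As a small illustration of governance guardrails at the application layer, the sketch below redacts obvious PII and applies a role-based access check before data or prompts reach the FM; the patterns and policy structure are hypothetical, and a real deployment would rely on the platform’s DLP, IAM, and AI-governance services.

```python
import re

# Hypothetical patterns; real deployments would use platform DLP/PII services.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Mask obvious PII before prompts or training data reach the FM."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}_REDACTED]", text)
    return text

def check_access(user_role: str, dataset_tag: str, policy: dict[str, set[str]]) -> bool:
    """Simple role-based access check against a governance policy table."""
    return dataset_tag in policy.get(user_role, set())
```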

Data Fabric Architecture for Gen AI Product:

Embarking on the implementation of a Data Fabric Strategy, the final stage is designing the Solution Architecture tailored to the Gen AI product. While accountability rests with the Product Manager, the creation of this vital blueprint falls under the purview of the Architect.

In dissecting the intricacies of Data Fabric solutions, we encounter two fundamental components: the user-facing interactions and the robust Data Processing Pipeline.

  • Transactional User Interactions: In this aspect, the Gen AI App orchestrates interactions with users, employing Prompt Templates for processing conversations effectively.
  • Batch or Streaming Processes: On the other front, the Data Fabric operates by managing Batch or Streaming data, undertaking tasks such as processing, organizing, storing, and feeding data into the LLM model to enable customized behavior.

The PM faces a crucial decision in choosing a cloud platform vendor for constructing a Gen AI solution. Most hyperscalers offer Data Fabric Capabilities for Gen AI, but the PM must choose wisely based on present setup and future needs. Briefly, Azure, AWS, IBM, and Google Cloud offer Data Fabric capabilities for Gen AI. However, the final decision lies with the PM, considering factors like ease of use, intuitiveness, and alignment with organizational preferences.

Impact — Product Performance for Gen AI Products

Product development is undeniably important, yet the real key lies in a successful launch and the ability to adapt strategies over time. In this process, the role of the Product Manager is indispensable, as they must constantly monitor both Leading and Lagging Indicators to effectively scale the product. Integrating Product-Led Growth (PLG) instrumentation enables monitoring at the user-journey level, but it’s essential to correlate this data with the underlying technical metrics to improve the North Star Metric of User Experience.
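As a toy illustration of correlating PLG signals with technical metrics, the sketch below joins hypothetical per-session task-completion events with per-session latency and computes their correlation; the data and column names are invented for illustration only.

```python
import pandas as pd

# Hypothetical exports: PLG events per session and technical metrics per session
journeys = pd.DataFrame({"session_id": [1, 2, 3, 4],
                         "task_completed": [1, 0, 1, 1]})        # leading indicator
tech = pd.DataFrame({"session_id": [1, 2, 3, 4],
                     "p95_latency_ms": [850, 2400, 900, 1100]})  # technical metric

merged = journeys.merge(tech, on="session_id")
print(merged["task_completed"].corr(merged["p95_latency_ms"]))   # a strongly negative value suggests latency hurts completion
```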

In my own research, I’ve examined over five Gen AI solutions where PMs adjusted Data Fabric Strategies for their AI Products. The results were overwhelmingly positive.

In conclusion, here are three key takeaways for Product Managers aiming to harness custom data for success in the Gen AI race:

  1. Custom Data Advantage: Utilize your organization’s data to tailor models, creating distinct value propositions that competitors struggle to match. Develop a Custom Data Strategy in collaboration with senior technical staff.
  2. Data Fabric Strategy: Implement a robust Data Fabric Strategy encompassing Data Collection, Organization, and Governance. Ensure efficient data utilization and ethical standards while prioritizing user experience.
  3. Cloud Platform Selection: Choose a cloud platform vendor offering cost-effective and future-proof solutions for Gen AI. Evaluate options such as AWS, IBM, Azure, and Google Cloud based on their tailored Data Fabric capabilities.

Disclaimer: The views expressed in this article are my own and don’t necessarily represent IBM’s positions, strategies, or opinions.

Saurabh Kaushik: Data and AI Product Management Leader for 24 years. From web 1.0 to cutting-edge AI solutions, he’s pioneered tech products across industries, from startups to enterprises. Saurabh is a renowned thought leader and speaker at global tech forums, and his tech blogs span over a decade. His relentless innovation continues to shape Data and AI solutions worldwide.

Connect with him on LinkedIn.
