Data Virtualization

Data virtualization is a technology that provides unified query access to data across heterogeneous sources without first copying that data into a central location.

Data virtualization abstracts the physical location and format of data, allowing users to query as if all data lived in a single system. The virtual layer routes each query to the appropriate sources, combines the results, and returns them through a unified interface. This removes the need to copy data before analysis: an analyst can query Oracle tables, PostgreSQL tables, and files in S3 in a single statement without extracting and loading them first. The virtual layer also acts as a schema mapping layer, translating business-friendly definitions into the schemas of the underlying systems.
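The routing and schema mapping described above can be sketched in a few lines. This is a minimal illustration, not how any particular product works: two in-memory SQLite databases stand in for separate source systems, and the names SCHEMA_MAP, fetch, and revenue_by_customer, along with all table and column names, are assumptions made up for the example.

```python
import sqlite3

# Two in-memory SQLite databases stand in for heterogeneous sources.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE ord_hdr (ord_id INTEGER, cust_no INTEGER, amt_usd REAL)")
orders_db.executemany(
    "INSERT INTO ord_hdr VALUES (?, ?, ?)",
    [(1, 10, 250.0), (2, 11, 40.0), (3, 10, 75.5)],
)

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE accounts (account_id INTEGER, display_name TEXT)")
crm_db.executemany("INSERT INTO accounts VALUES (?, ?)", [(10, "Acme"), (11, "Globex")])

# Hypothetical schema map: logical entity -> (source, physical table, column map).
SCHEMA_MAP = {
    "orders": (orders_db, "ord_hdr",
               {"order_id": "ord_id", "customer_id": "cust_no", "amount": "amt_usd"}),
    "customers": (crm_db, "accounts",
                  {"customer_id": "account_id", "name": "display_name"}),
}

def fetch(entity, columns):
    """Translate logical names to physical ones and run the query at the owning source."""
    conn, table, colmap = SCHEMA_MAP[entity]
    select_list = ", ".join(f"{colmap[c]} AS {c}" for c in columns)
    rows = conn.execute(f"SELECT {select_list} FROM {table}").fetchall()
    return [dict(zip(columns, row)) for row in rows]

def revenue_by_customer():
    """Combine partial results from both sources inside the virtual layer."""
    names = {c["customer_id"]: c["name"]
             for c in fetch("customers", ["customer_id", "name"])}
    totals = {}
    for order in fetch("orders", ["customer_id", "amount"]):
        key = names[order["customer_id"]]
        totals[key] = totals.get(key, 0.0) + order["amount"]
    return totals
```

The caller only ever sees the logical names (customer_id, amount, name); the physical names (cust_no, amt_usd, display_name) stay hidden behind the map, which is the point of the schema mapping layer.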

Data virtualization became practical through advances in query federation: engines that push computation down to the sources (for example, predicate pushdown) so that massive datasets do not have to be moved. Trade-offs remain. Queries may be slower than querying a single warehouse directly, because data movement happens at query time instead of being pre-materialized. Virtual layers also add operational complexity, because every query depends on connectivity to the source systems.
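To make the effect of predicate pushdown concrete, the sketch below compares how many rows cross the boundary between a source and the virtual layer with and without it. An in-memory SQLite database plays the role of the remote source; the events table and both function names are illustrative assumptions.

```python
import sqlite3

# In-memory SQLite stands in for a remote source with 1,000 rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER, region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i, "EU" if i % 10 == 0 else "US", float(i)) for i in range(1, 1001)],
)

def without_pushdown(region):
    """Pull every row out of the source, then filter in the virtual layer."""
    transferred = source.execute("SELECT id, region, amount FROM events").fetchall()
    result = [row for row in transferred if row[1] == region]
    return result, len(transferred)

def with_pushdown(region):
    """Send the predicate to the source so only matching rows are transferred."""
    transferred = source.execute(
        "SELECT id, region, amount FROM events WHERE region = ?", (region,)
    ).fetchall()
    return transferred, len(transferred)

slow_rows, slow_moved = without_pushdown("EU")
fast_rows, fast_moved = with_pushdown("EU")
print(f"rows moved without pushdown: {slow_moved}, with pushdown: {fast_moved}")
```

Both functions return the same rows; only the amount of data moved at query time differs. A real federation engine would make this choice in its query planner, and not every source can evaluate every predicate.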

Organizations use data virtualization in specific scenarios: accessing rarely used data that doesn't justify an ETL pipeline, connecting to source systems that can't be replicated, or giving users access to data without copying sensitive information. Some modern data platforms (Snowflake, BigQuery) include federated query capabilities, reducing the need for separate virtualization tools.

Key Characteristics

  • Provides a unified interface to heterogeneous data sources
  • Routes queries to the appropriate sources without copying data
  • Includes a schema mapping layer that translates logical schemas to physical schemas
  • Implements query pushdown to minimize data movement
  • Supports caching to improve the performance of repeated queries
  • Simplifies data governance by managing access through the virtual layer
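The caching characteristic listed above can be illustrated with a small time-to-live cache in front of a source. This is a sketch under assumptions: the class name CachingVirtualLayer, its parameters, and the source_calls counter are hypothetical and exist only to show the idea.

```python
import time

class CachingVirtualLayer:
    """Caches query results for a fixed time-to-live (TTL) before asking the source again."""

    def __init__(self, run_at_source, ttl_seconds=60.0, clock=time.monotonic):
        self._run = run_at_source   # callable that executes a query at the source
        self._ttl = ttl_seconds
        self._clock = clock         # injectable so the behavior can be tested
        self._cache = {}            # query text -> (expiry time, rows)
        self.source_calls = 0       # how often the source was actually hit

    def query(self, sql):
        now = self._clock()
        hit = self._cache.get(sql)
        if hit is not None and hit[0] > now:
            return hit[1]           # still fresh: serve from cache
        self.source_calls += 1
        rows = self._run(sql)
        self._cache[sql] = (now + self._ttl, rows)
        return rows
```

The TTL is the trade-off knob: a longer value spares the source systems more repeated work, but results can be up to that many seconds stale, which matters when the virtual layer is used for real-time queries on operational systems.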

Why It Matters

  • Reduces the delay and cost of reaching data that is not in the central warehouse
  • Enables access to sensitive data without copying it to a central location
  • Supports real-time queries on operational systems without replication lag
  • Shortens time-to-analytics for new data sources because no ETL development is required
  • Improves security by centralizing access control in the virtual layer
  • Reduces total cost of ownership by avoiding unnecessary data movement

Example

A healthcare provider uses data virtualization to query across siloed systems: patient records in an on-premises legacy database, insurance claims in a SaaS platform, and genomic data in a research cloud environment. An analyst writes a single query for "patients with diabetes, their medications, and insurance coverage" against the virtual layer, which routes parts of the query to each relevant source, retrieves the results, and combines them. No persistent copy of the data is created outside the source systems, which reduces compliance risk.
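A rough sketch of how the virtual layer might execute that query follows. Everything in it is assumed for illustration: the data is made up, an in-memory SQLite database stands in for the legacy patient database, and claims_api_coverage is a hypothetical stand-in for the SaaS claims platform. The genomic source is left out because the example query does not touch it.

```python
import sqlite3

# Stand-in for the on-premises legacy database (all data here is made up).
ehr = sqlite3.connect(":memory:")
ehr.executescript("""
    CREATE TABLE patients (patient_id INTEGER, name TEXT);
    CREATE TABLE diagnoses (patient_id INTEGER, condition TEXT);
    CREATE TABLE medications (patient_id INTEGER, drug TEXT);
    INSERT INTO patients VALUES (1, 'Patient A'), (2, 'Patient B'), (3, 'Patient C');
    INSERT INTO diagnoses VALUES (1, 'diabetes'), (2, 'asthma'), (3, 'diabetes');
    INSERT INTO medications VALUES (1, 'metformin'), (2, 'albuterol'), (3, 'insulin');
""")

# Stand-in for the SaaS claims platform, reachable only through an API-style call.
_COVERAGE = {1: "Plan Gold", 2: "Plan Silver", 3: "Plan Bronze"}

def claims_api_coverage(patient_ids):
    """Hypothetical API: returns coverage for the requested patients only."""
    return {pid: _COVERAGE[pid] for pid in patient_ids if pid in _COVERAGE}

def diabetic_patients_with_coverage():
    # Step 1: the condition filter and the medication join run inside the patient database.
    rows = ehr.execute("""
        SELECT p.patient_id, p.name, m.drug
        FROM patients p
        JOIN diagnoses d ON d.patient_id = p.patient_id
        JOIN medications m ON m.patient_id = p.patient_id
        WHERE d.condition = 'diabetes'
        ORDER BY p.patient_id
    """).fetchall()
    # Step 2: only the matching patient IDs are sent to the claims source.
    coverage = claims_api_coverage([pid for pid, _, _ in rows])
    # Step 3: the virtual layer combines both partial results in memory.
    return [(name, drug, coverage.get(pid)) for pid, name, drug in rows]
```

The combined result exists only in memory for the duration of the query; nothing is written to a separate store, which is the property the compliance argument rests on.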

Coginiti Perspective

Coginiti's semantic layer and 21+ native connectors provide a form of practical virtualization grounded in governed definitions rather than just federated access. Analysts interact with consistent business concepts in the semantic layer while Coginiti routes queries to the appropriate underlying platform. This approach complements ELT patterns: data remains in the platforms best suited for its workload, while the semantic layer ensures that users experience a unified, governed view regardless of where the data physically resides.
