Data Modeling and Architecture Terminology Guide
Welcome to our data architecture and modeling encyclopedia! This reference guide aims to explain key concepts and terminology related to managing, integrating, modeling, and architecting data. Whether you are new to data or a seasoned practitioner, use this as your go-to dictionary to look up definitions and descriptions of important data architecture, modeling, governance, integration, and format topics.
Apache Avro is a compact, row-oriented binary data serialization format that supports schema evolution; because the schema is stored with the data, records require less parsing overhead. Avro optimizes storage and data transfer.
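Avro schemas are JSON documents, and one common evolution rule is that a new field must carry a default so records written with the old schema can still be read with the new one. A minimal sketch (the `User` record and its fields are illustrative, and the schemas are shown as plain Python dicts rather than through an Avro library):

```python
# An Avro schema is a JSON document stored alongside the data.
# v1 of an illustrative "User" record:
user_v1 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}

# v2 adds an optional field with a default, so data written with
# v1 can still be decoded by readers using v2 (schema evolution).
user_v2 = {
    "type": "record",
    "name": "User",
    "fields": user_v1["fields"] + [
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}
```

A library such as `fastavro` or the official `avro` package would serialize records against these schema documents.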
Change Data Capture (CDC) is a process that captures changes made to data in source systems and databases and streams those changes to a destination for further processing and integration. CDC enables incremental, continuous data extraction for efficient data integration.
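The core idea can be sketched by diffing two snapshots of a table keyed by primary key and emitting insert, update, and delete events. This is illustrative only; production CDC tools typically read the database's transaction log rather than comparing snapshots, and the row data below is made up:

```python
# A minimal change-data-capture sketch: compare two snapshots of a
# table (dicts keyed by primary key) and emit change events.
def diff_snapshots(old, new):
    events = []
    for key, row in new.items():
        if key not in old:
            events.append(("insert", key, row))
        elif old[key] != row:
            events.append(("update", key, row))
    for key, row in old.items():
        if key not in new:
            events.append(("delete", key, row))
    return events

before = {1: {"name": "Ada"}, 2: {"name": "Bob"}}
after = {1: {"name": "Ada L."}, 3: {"name": "Cy"}}
events = diff_snapshots(before, after)
```

A downstream consumer would apply these events incrementally instead of reloading the full table.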
CSV (Comma Separated Values) is a simple tabular data format that uses commas to separate values. CSV files are human-readable but less efficient than binary formats for large datasets.
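Writing and reading CSV is a one-step job with Python's standard library; a small sketch (the column names and values are illustrative):

```python
import csv
import io

# Write a tiny table to CSV, then read it back as dictionaries.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerow({"id": "1", "name": "widget"})

buf.seek(0)
rows = list(csv.DictReader(buf))  # every value comes back as a string
```

Note that CSV carries no type information: numbers, dates, and booleans all round-trip as strings unless the consumer converts them.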
A data catalog is a metadata repository that enables discovery, inventory, lineage and utilization of data assets across an organization. It centralizes technical, business and operational metadata to provide context and connectivity for data consumers. Data catalogs enable self-service access to data.
A data dictionary is a centralized repository of metadata that provides definitions for data elements, attributes, business terms and rules across an organization. It documents structures, derivations, origins, usage and meaning to promote consistency and enable self-service understanding of data assets.
Data federation is an approach to data integration that leaves data in place while providing a unified virtual view and access across multiple data sources. It relies on a federated query engine to retrieve and combine data on demand from different systems during query execution.
A data lake is a centralized repository that stores large amounts of raw data in its native format until it is needed. It enables storing structured, semi-structured and unstructured data from multiple sources ready for analytics, AI and machine learning use cases.
Data lineage refers to the process of understanding the origins and movement of data from its creation through usage across technology systems over time. Data lineage enables tracing data elements to their sources and destinations, providing visibility into data processing and transformation.
Data literacy refers to the ability to read, understand, question and work with data. It includes the skills needed to transform data into business value. Data literacy enables asking the right questions, interpreting results correctly and challenging inaccurate assumptions based on the data.
A data mart is a repository of data designed for access, analysis and reporting by a specific business unit or function. Data marts are typically subsets of an enterprise data warehouse aligned to the needs of that unit.
A data mesh is a decentralized data architecture organized around business domains rather than technology. It enables domain-oriented autonomy, governance and data products. Data meshes aim to democratize data and empower domain teams through a self-service data infrastructure.
Data orchestration is the automated coordination of data movement, transformation and processing across different systems such as cloud, on-prem and applications. It allows defining data workflows to efficiently integrate and prepare data for delivery to end users and systems.
A data pipeline is an automated flow of processes that move and transform data from sources to destinations such as data warehouses, lakes and marts. Data pipelines include data extraction, validation, transformation, and loading.
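Those stages can be sketched as plain functions composed in sequence. Everything here is illustrative (the field names, validation rule, and in-memory "warehouse" stand in for real connectors):

```python
# A toy pipeline: extract -> validate -> transform -> load.
def extract():
    # Stand-in for reading from a source system.
    return [{"sku": "A1", "price": "9.50"}, {"sku": "", "price": "x"}]

def validate(rows):
    # Drop rows with a missing key or a non-numeric price.
    return [r for r in rows
            if r["sku"] and r["price"].replace(".", "", 1).isdigit()]

def transform(rows):
    # Cast prices from strings to floats.
    return [{**r, "price": float(r["price"])} for r in rows]

def load(rows, target):
    # Stand-in for writing to a warehouse, lake, or mart.
    target.extend(rows)
    return target

warehouse = load(transform(validate(extract())), target=[])
```

Real pipelines add scheduling, retries, and monitoring around the same extract/validate/transform/load skeleton.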
Data profiling is the examination and assessment of the content, structure, and quality of source data as part of data integration and preparation. Profiling provides insights to craft data transformations.
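A minimal profiling pass might count nulls, distinct values, and the most common value per column. A sketch with made-up data:

```python
from collections import Counter

# Profile one column of a row-oriented dataset: null count,
# distinct count, and the most frequent value.
def profile(rows, column):
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "top": Counter(non_null).most_common(1),
    }

data = [{"country": "DE"}, {"country": "DE"},
        {"country": None}, {"country": "FR"}]
stats = profile(data, "country")
```

Statistics like these reveal, for example, that a "required" column is frequently null before a transformation is built on top of it.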
Data quality refers to the state of completeness, validity, consistency, accuracy and timeliness of data. High quality data meets requirements for use in business operations and analytics. Data quality processes profile, cleanse, transform, match and monitor data to detect and correct errors.
A data vault is a database modeling approach optimized for flexibility and scalability when integrating data from multiple sources. It uses a hub and spoke model with business keys linking normalized tables to track historical changes over time. The data vault model supports traceability and auditability.
Data virtualization provides a single, integrated view of data from multiple systems and sources without moving or copying data. It uses an abstraction layer to separate the data consumer from underlying systems and structures. This enables flexible access and delivery of data on demand.
A data warehouse is a centralized repository of integrated data from different sources, optimized for reporting, analysis and data insights. Data warehouses transform, organize, and structure data for business users and analysts to gain business intelligence for informed decision making.
Delta Lake is an open format storage layer that brings reliability, consistency and performance to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing.
A dimension table contains descriptive attributes or categories that provide context to measurements in the fact table. Dimension tables are connected to the fact table via foreign keys and are optimized for querying and segmenting data. Common examples include customer, product, and time.
A domain in a data mesh represents a business capability or area of expertise that manages its own data needs. Domains own data products, enforce governance, and enable decentralized data ownership at the business level.
ELT (Extract, Load, Transform) is a data integration approach that extracts data from sources, loads it into a target data store, then performs transformations and cleansing within that data store. ELT shifts data processing to the target system rather than a separate transformation engine.
ETL (Extract, Transform, Load) is a data integration process that extracts data from sources, transforms and cleanses it, then loads it into a target data store. ETL performs data processing and cleansing in a specialized engine before loading into the target database or warehouse.
A fact table is a central table in a star schema that contains measurements and metrics, such as sales amounts, units sold, etc. It connects to multiple dimensional tables via foreign keys and is optimized for analysis and reporting.
A foreign key is a column or set of columns in a database table that establishes a link between data in two tables. It refers to the primary key in another table, enforcing referential integrity and allowing navigation between tables. Foreign keys help maintain data accuracy and consistency in related tables.
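Referential integrity can be demonstrated with SQLite, which ships with Python; the `customers`/`orders` schema below is illustrative. Note that SQLite only enforces foreign keys when the `foreign_keys` pragma is enabled:

```python
import sqlite3

# A foreign key in `orders` must reference a primary key in `customers`.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")

con.execute("INSERT INTO customers VALUES (1, 'Ada')")
con.execute("INSERT INTO orders VALUES (10, 1)")        # valid reference

try:
    con.execute("INSERT INTO orders VALUES (11, 999)")  # no such customer
    violated = False
except sqlite3.IntegrityError:
    violated = True  # the database rejected the dangling reference
```

The failed insert shows the database itself, not application code, guarding the relationship between the two tables.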
An HTAP (Hybrid Transactional/Analytical Processing) database combines the capabilities of OLTP and OLAP systems in a single product, allowing transaction processing while supporting real-time analytics and reporting on the same database. This eliminates the latency of moving data between separate systems.
Apache Hudi manages the storage of large analytical datasets on distributed file systems, providing stream-processing primitives such as upserts, atomic commits, and incremental pulls. Hudi optimizes storage for data lake use cases.
Apache Iceberg is an open table format optimized for large, slow-moving tabular data. It uses metadata tables to allow querying snapshots efficiently across huge datasets.
The Kimball model is a widely adopted dimensional modeling approach optimized for data warehousing and business intelligence. It utilizes a star schema with conformed dimensions to build an enterprise data warehouse that supports business processes for reporting and analysis.
Master data management (MDM) is a set of processes, governance and tools for establishing and maintaining consistent, accurate and current master data or “single version of the truth” across an organization’s systems and functions. MDM optimizes data quality and data-driven operations and decisions.
Modeling is the process of creating conceptual and logical data models that describe the structure, relationships, and constraints required to support business processes and objectives. Data models provide blueprints for building databases and data management solutions.
Normalization is the process of structuring a relational database to minimize redundancy and dependency of data. It involves dividing large tables into smaller tables and defining relationships between them. Normalization improves data integrity and reduces the risk of inconsistencies and anomalies in query results.
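The table-splitting step can be sketched in a few lines: a denormalized orders table repeats each customer's name, and normalization moves that attribute into its own table keyed by customer ID (the data is illustrative):

```python
# A denormalized table: the customer name is repeated on every order.
orders = [
    {"order_id": 1, "customer_id": 7, "customer_name": "Ada", "total": 30},
    {"order_id": 2, "customer_id": 7, "customer_name": "Ada", "total": 15},
]

# Normalize: store each name once in a customers table...
customers = {r["customer_id"]: r["customer_name"] for r in orders}

# ...and keep only the foreign key in the orders table.
orders_norm = [{k: v for k, v in r.items() if k != "customer_name"}
               for r in orders]
```

After the split, renaming a customer is a single update instead of one per order, which is exactly the anomaly normalization removes.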
OLAP (Online Analytical Processing) databases are optimized for analysis and reporting of aggregated, historical data. OLAP systems support multidimensional, summarized views of business data for management reporting, analysis and decision making. They serve analytical workloads.
OLTP (Online Transaction Processing) databases are optimized for fast query processing and maintaining data integrity while supporting high concurrency levels. OLTP systems serve transactional workloads in applications such as banking, order processing, and retail.
Optimized Row Columnar (ORC) is a columnar file format developed in the Apache Hive project for Hadoop that is highly optimized for large-scale data analytics. ORC provides compression, performance, and metadata enhancements.
Apache Parquet is a columnar storage format optimized for large datasets and analytics use cases. It provides high compression and efficient data access and query performance.
A primary key is a unique identifier for each record in a database table. It enforces entity integrity by uniquely identifying each row. A primary key column cannot contain null values, and its values should be immutable, remaining unchanged once inserted.
A data schema describes the structure and organization of data including field names, data types, constraints, keys and relationships between tables. Schemas ensure consistency and integrity.
A snowflake schema is a database schema variant optimized for data warehousing. It normalizes dimensional tables in a star schema into multiple tables to reduce data redundancy. The resulting structure of central fact tables connected to multiple dimensions resembles a snowflake shape.
A star schema is a database schema organized around a central fact table whose foreign keys reference multiple dimension tables, allowing for dimensional modeling and simplified queries. It is commonly used in data warehouses and business intelligence systems.
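A tiny star schema can be built with SQLite to show the characteristic fact-to-dimension join; the product dimension, sales fact table, and numbers are all illustrative:

```python
import sqlite3

# One fact table (sales) joined to one dimension (product).
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY, category TEXT)""")
con.execute("""CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    units INTEGER, amount REAL)""")

con.executemany("INSERT INTO dim_product VALUES (?, ?)",
                [(1, "toys"), (2, "books")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, 3, 30.0), (1, 1, 10.0), (2, 2, 24.0)])

# The typical star-schema query: aggregate facts, slice by a dimension.
report = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category ORDER BY d.category
""").fetchall()
```

A full star would join several dimensions (customer, time, store) to the same fact table, but the query shape stays the same.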
A subject area represents a business focus or data category within a data warehouse design. It may cover business functions like sales, inventory or procurement. Subject areas group related data to simplify reporting and analysis.