A Delta table is a table stored in the Delta Lake format, an open storage layer that extends Parquet with ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and data versioning (time travel) for data lakes.

Files Involved in a Delta Table

| File Type | Description |
| --- | --- |
| Parquet Files | Store the actual data in columnar format. |
| JSON Files in _delta_log | Track each transaction with details on data and metadata changes (add, remove, modify). |
| Checkpoint Parquet Files | Summarize the state of the Delta table up to a specific transaction for faster log reading. |
| _last_checkpoint | References the latest checkpoint file to quickly locate the most recent table snapshot. |

How These Files Work Together

  1. Data Storage: Data is stored in Parquet files within the Delta table's directory.
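  2. Transaction Tracking: Each committed operation on the Delta table (e.g., insert, update, delete) is logged as a new JSON file in the _delta_log directory. The file is named after the table version it creates, zero-padded to 20 digits (for example, 00000000000000000001.json).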
  3. Checkpointing: Periodically, checkpoint files are created to consolidate the state of the Delta table. This reduces the need to read through all JSON transaction logs from the beginning.
  4. Table Snapshot: The _last_checkpoint file records the version of the latest checkpoint. A reader loads that checkpoint and then replays only the JSON commits written after it, so it can reconstruct the current state of the table without reading the whole transaction log.
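
The steps above can be sketched as a small lookup routine. This is a minimal illustration using only the Python standard library, not Delta Lake's actual reader. The names plan_snapshot_read and build_demo_log are hypothetical, the demo log is fabricated in a temporary directory, and the checkpoint file is an empty placeholder. The sketch assumes commit files are named by zero-padded version number and that _last_checkpoint is a small JSON document with a "version" field.

```python
import json
import tempfile
from pathlib import Path


def plan_snapshot_read(log_dir: Path):
    """Return (checkpoint_version, names of commit files to replay after it)."""
    pointer = log_dir / "_last_checkpoint"
    checkpoint_version = None
    if pointer.exists():
        # _last_checkpoint is a small JSON document with a "version" field.
        checkpoint_version = json.loads(pointer.read_text())["version"]

    commits = []
    for path in sorted(log_dir.glob("*.json")):
        # Commit files are named by version, e.g. 00000000000000000011.json
        version = int(path.stem)
        if checkpoint_version is None or version > checkpoint_version:
            commits.append(path.name)
    return checkpoint_version, commits


def build_demo_log(root: Path) -> Path:
    """Create a fake _delta_log with 13 commits and a checkpoint at version 10."""
    log_dir = root / "_delta_log"
    log_dir.mkdir()
    for version in range(13):
        (log_dir / f"{version:020d}.json").write_text("{}\n")
    # Empty placeholder; a real checkpoint is a Parquet file holding table state.
    (log_dir / f"{10:020d}.checkpoint.parquet").write_bytes(b"")
    (log_dir / "_last_checkpoint").write_text(json.dumps({"version": 10, "size": 1}))
    return log_dir


with tempfile.TemporaryDirectory() as tmp:
    demo_checkpoint, demo_commits = plan_snapshot_read(build_demo_log(Path(tmp)))
    print(demo_checkpoint, demo_commits)
```

With a checkpoint at version 10, only the commits for versions 11 and 12 are left to replay, instead of all 13 JSON files.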

Let’s have a look at the details of each file.

1. Parquet Files

These files store the actual data in a columnar format. Delta Lake uses Parquet as the underlying file format to benefit from its efficient compression and performance advantages.

Parquet is a columnar storage format, meaning data is stored column by column rather than row by row. This allows for efficient retrieval of specific columns, making it ideal for analytical queries. Parquet files are stored in a binary format, which is more efficient in terms of storage space and I/O operations. Parquet files contain rich metadata, which helps in optimizing queries and understanding the schema without reading the entire file.
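
To make the point about metadata concrete: a Parquet file starts and ends with the 4-byte magic string PAR1, and the file metadata (schema, row-group statistics) sits in a footer at the end of the file, preceded by nothing but column data and followed by a 4-byte little-endian footer length. The sketch below locates that footer in a byte string. It is only an illustration under stated assumptions: locate_footer is a hypothetical helper, the blob is a fabricated stand-in rather than a real Parquet file, and the footer of a real file is Thrift-encoded, which this example does not decode.

```python
import struct

MAGIC = b"PAR1"


def locate_footer(data: bytes):
    """Return (offset, length) of the footer metadata in a Parquet-style layout."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # The 4 bytes before the trailing magic hold the footer length (little-endian).
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    footer_start = len(data) - 8 - footer_len
    if footer_start < 4:
        raise ValueError("footer length is larger than the file")
    return footer_start, footer_len


# Fabricated stand-in: magic + column data + footer + footer length + magic.
column_data = b"\x00" * 1000
fake_footer = b"schema-and-row-group-stats"
blob = MAGIC + column_data + fake_footer + struct.pack("<I", len(fake_footer)) + MAGIC

footer_start, footer_len = locate_footer(blob)
footer_bytes = blob[footer_start:footer_start + footer_len]
print(footer_start, footer_len)
```

Only the last few bytes are inspected to find the metadata, which is why a reader can learn the schema without scanning the column data. In practice a library such as PyArrow or Spark does this for you.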

Differences between Parquet and CSV:

| Aspect | Parquet | CSV |
| --- | --- | --- |
| File Structure | Columnar Storage | Row-based Storage |
| Format | Binary | Text |
| Metadata | Rich Metadata | Minimal Metadata |
| Compression | Efficient Compression | Typically No Compression |
| Read Efficiency | Reads Specific Columns (Column Pruning) | Reads Entire Rows |
| Encoding | Uses Advanced Encoding Techniques | No Encoding |
| Performance | High Performance for Analytical Queries | Less Efficient for Large Datasets |
| Use Cases | Big Data and Analytics, OLAP | Data Exchange, Small to Medium Datasets |
| Complexity | More Complex, Requires Libraries/Tools | Simple, Readable by Basic Text Editors |
| Usability | Requires Advanced Technical Knowledge | User-Friendly, Accessible to All |
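
The Read Efficiency row can be illustrated with a toy comparison. The sketch below does not use Parquet itself: a dictionary of Python lists stands in for Parquet's column chunks, and the sample records and function names are made up for the example. It counts how many values each layout has to touch to sum a single column.

```python
import csv
import io

FIELDS = ["id", "country", "amount"]
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.5},
    {"id": 3, "country": "DE", "amount": 5.0},
    {"id": 4, "country": "US", "amount": 4.5},
]

# Row-based layout: serialize the records as CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Column-based layout: one list of values per column.
columns = {name: [row[name] for row in rows] for name in FIELDS}


def sum_amount_row_based(text):
    """Sum 'amount' from CSV; every field of every row has to be parsed."""
    total, fields_parsed = 0.0, 0
    for record in csv.DictReader(io.StringIO(text)):
        fields_parsed += len(record)
        total += float(record["amount"])
    return total, fields_parsed


def sum_amount_columnar(cols):
    """Sum 'amount' from the columnar layout; only one column is touched."""
    values = cols["amount"]
    return sum(values), len(values)


row_total, row_fields = sum_amount_row_based(csv_text)
col_total, col_values = sum_amount_columnar(columns)
print(row_total, row_fields)
print(col_total, col_values)
```

Both approaches compute the same total, but the row-based reader parses all 12 fields while the columnar one touches only the 4 values of the amount column. The gap grows with the number of columns in the table.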


2. Delta Log Files

The Delta log (also known as the transaction log) is a critical component that records every change (transaction) made to the Delta table. This ordered record of commits is what provides the ACID guarantees and version history. The log is stored in the _delta_log directory, which contains the JSON commit files and the checkpoint files that track changes to the Delta table.
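
As a rough sketch of how a reader turns the log into a table version: each JSON commit file holds one action per line, and replaying the add and remove actions in version order yields the set of Parquet data files that make up the table. The example below is a simplified illustration, not Delta Lake's implementation. The function replay_actions and the commit contents are hypothetical, and real actions carry more fields (such as size and partition values) than shown here.

```python
import json


def replay_actions(commits, active_files=None):
    """Apply add/remove actions from commits (in version order) to a set of file paths."""
    active = set(active_files or ())
    for commit_text in commits:
        for line in commit_text.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
            # Other actions (commitInfo, metaData, protocol) leave the file set unchanged.
    return active


def as_commit(*actions):
    """Serialize actions as newline-delimited JSON, one action per line."""
    return "\n".join(json.dumps(action) for action in actions)


# Hypothetical commits: version 0 writes two files, version 1 rewrites one
# of them (an update), version 2 appends a new file.
commits = [
    as_commit({"commitInfo": {"operation": "WRITE"}},
              {"add": {"path": "part-0000.parquet", "dataChange": True}},
              {"add": {"path": "part-0001.parquet", "dataChange": True}}),
    as_commit({"commitInfo": {"operation": "UPDATE"}},
              {"remove": {"path": "part-0000.parquet", "dataChange": True}},
              {"add": {"path": "part-0002.parquet", "dataChange": True}}),
    as_commit({"commitInfo": {"operation": "WRITE"}},
              {"add": {"path": "part-0003.parquet", "dataChange": True}}),
]

latest_files = replay_actions(commits)       # current table state
version_0_files = replay_actions(commits[:1])  # table as of version 0
print(sorted(latest_files))
print(sorted(version_0_files))
```

Replaying only a prefix of the commits gives the table as it looked at an earlier version, which is the idea behind versioned reads. The update in version 1 does not modify part-0000.parquet in place; it removes that file from the table state and adds a rewritten one.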