A Delta table, which is part of the Delta Lake storage layer, extends the functionality of Parquet by adding ACID (Atomicity, Consistency, Isolation, Durability) transactions, scalable metadata handling, and version control to data lakes.
| File Type | Description |
|---|---|
| Parquet Files | Store the actual data in columnar format. |
| JSON Files in _delta_log | Track each transaction with details on data and metadata changes (add, remove, modify). |
| Checkpoint Parquet Files | Summarize the state of the Delta table up to a specific transaction for faster log reading. |
| _last_checkpoint | References the latest checkpoint file to quickly locate the most recent table snapshot. |
Let’s have a look at the details of each file.
Parquet files store the actual table data in a columnar format. Delta Lake uses Parquet as the underlying file format to benefit from its efficient compression and read performance.
Parquet is a columnar storage format, meaning data is stored column by column rather than row by row. A query can therefore read only the columns it needs, which makes the format well suited to analytical queries. Parquet files are binary, which typically takes less storage space and less I/O than an equivalent text format. Each file also carries metadata, such as the schema and per-column statistics, which helps engines optimize queries and understand the schema without reading the entire file.
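
To make the row versus column distinction concrete, here is a minimal pure-Python sketch. It is a toy model and not how Parquet is actually implemented: real Parquet files add row groups, pages, encodings and compression. The sample records and the names `to_columnar` and `column_stats` are made up for illustration.

```python
# Toy illustration only: the sample data and helper names are assumptions.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 20.5},
    {"id": 3, "country": "DE", "amount": 7.25},
]


def to_columnar(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = to_columnar(rows)

# Row layout: every full row is visited even though only one field is needed.
total_from_rows = sum(row["amount"] for row in rows)

# Columnar layout: only the "amount" column is touched (column pruning).
total_from_columns = sum(columns["amount"])

# Per-column min/max, similar in spirit to the statistics kept in Parquet metadata.
column_stats = {name: (min(values), max(values)) for name, values in columns.items()}
```

Both totals are the same, but the columnar version never looks at `id` or `country`. The min/max pairs hint at how an engine can skip data whose value range cannot match a filter.
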
Differences between Parquet and CSV:
| Aspect | Parquet | CSV |
|---|---|---|
| File Structure | Columnar Storage | Row-based Storage |
| Format | Binary | Text |
| Metadata | Rich Metadata | Minimal Metadata |
| Compression | Efficient Compression | Typically No Compression |
| Read Efficiency | Reads Specific Columns (Column Pruning) | Reads Entire Rows |
| Encoding | Uses Advanced Encoding Techniques | No Encoding |
| Performance | High Performance for Analytical Queries | Less Efficient for Large Datasets |
| Use Cases | Big Data and Analytics, OLAP | Data Exchange, Small to Medium Datasets |
| Complexity | More Complex, Requires Libraries/Tools | Simple, Readable by Basic Text Editors |
| Usability | Requires Advanced Technical Knowledge | User-Friendly, Accessible to All |
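
The "Encoding" row in the table above can be illustrated with two techniques that Parquet is known to use, dictionary encoding and run-length encoding. The sketch below is a simplified pure-Python version under our own assumptions (function names and sample column are hypothetical). It does not reproduce Parquet's actual on-disk layout.

```python
def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []
    index_of = {}
    codes = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index_of[value])
    return dictionary, codes


def run_length_encode(codes):
    """Collapse consecutive repeats into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def decode(dictionary, runs):
    """Expand the runs and map the codes back to the original values."""
    return [dictionary[code] for code, length in runs for _ in range(length)]


# Hypothetical low-cardinality column, the case where these encodings help most.
country = ["DE", "DE", "DE", "FR", "FR", "DE"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
```

Because a column holds values of a single type that often repeat, encodings like these tend to shrink it far more than they could shrink a row of mixed fields. This is one reason columnar files usually compress well.
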

The Delta log (also known as the transaction log) is a critical component that records every change (transaction) made to the Delta table. This log is what provides the ACID guarantees and version control. It is stored in the _delta_log directory, which contains JSON files and checkpoint files that track changes to the Delta table.
JSON files (e.g., 00000000000000000010.json). These files contain information about data operations (e.g., add, remove, modify) and metadata changes (e.g., schema changes).
Checkpoint files (e.g., 00000000000000000010.checkpoint.parquet). These files summarize the state of the Delta table up to a specific transaction. Checkpoints make it faster to read the transaction log by reducing the number of JSON files that need to be read.
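
The way a reader reconstructs the table state from the JSON files can be sketched with the Python standard library alone. This is a simplified model, not a real Delta reader. It assumes each commit file holds one JSON action per line and uses only the `path` field of `add` and `remove` actions (real actions carry more fields). The helper names and the `part-*.parquet` file names are hypothetical. Checkpoint Parquet files are not read here. A real reader would start from the latest checkpoint rather than from version 0.

```python
import json
import os
import tempfile


def commit_file_name(version):
    """Commit files are named by version, zero-padded to 20 digits."""
    return f"{version:020d}.json"


def write_commit(log_dir, version, actions):
    """Write one commit as newline-delimited JSON, one action per line."""
    with open(os.path.join(log_dir, commit_file_name(version)), "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def live_files(log_dir, up_to_version=None):
    """Replay commits in version order and return the data files in the table."""
    files = set()
    names = sorted(
        n for n in os.listdir(log_dir)
        if len(n) == 25 and n[:20].isdigit() and n.endswith(".json")
    )
    for name in names:
        if up_to_version is not None and int(name[:20]) > up_to_version:
            break
        with open(os.path.join(log_dir, name)) as f:
            for line in f:
                action = json.loads(line)
                if "add" in action:
                    files.add(action["add"]["path"])
                elif "remove" in action:
                    files.discard(action["remove"]["path"])
    return files


# Build a tiny, made-up log with three commits in a temporary directory.
log_dir = os.path.join(tempfile.mkdtemp(), "_delta_log")
os.makedirs(log_dir)
write_commit(log_dir, 0, [{"add": {"path": "part-0000.parquet"}}])
write_commit(log_dir, 1, [{"add": {"path": "part-0001.parquet"}}])
write_commit(log_dir, 2, [
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
])

latest = live_files(log_dir)                    # state after all commits
as_of_v1 = live_files(log_dir, up_to_version=1)  # state as of version 1
```

Replaying up to an earlier version is what makes version control possible: `as_of_v1` still contains the file that commit 2 removed. A checkpoint plays the role of a precomputed `live_files` result, so a reader only needs to replay the JSON files written after it.
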