Columnar Databases:
Columnar databases, or column-oriented databases, store data by column rather than by row. They are well-suited for analytical processing and large-scale data warehousing. Their structure allows for efficient reading and aggregation of column data but can be less efficient for write-heavy workloads. Columnar Storage vs. Row Storage In a traditional row-oriented database, the data for each row is stored together. This means that all data for a given record (e.g., name, age, address) is stored contiguously on the disk. This makes it easy to read an entire record at once, but it can be inefficient if you only need a subset of the columns for a query. In a columnar database, all the data for a given column (e.g., all the names) is stored together. This allows for more efficient compression and reading when only a subset of columns is needed.Note: The linkage between the data from different columns that belong to the same row is handled through the order (or other indexing), allowing the database to reconstruct the original rows when needed. How do they work? 1. Column Storage: Each column is stored in a separate data structure or file. If you have a "name" column, all the names would be stored together in that file or structure. 2. Mapping to rows: To know which pieces of data correspond to the same record, a columnar database typically uses some form of mapping or indexing, for example, the address in the file. 3. Compression: Since all the data in a column is of the same type and often has similar or repeated values, it's more conducive to compression. 4. Query Execution: When a query is run that requires data from multiple columns, the database reads the necessary data from each relevant column file and uses the mapping to reconstruct the necessary rows for processing the query. 5. Joining Columns: To reconstruct a complete row or join multiple columns, the database engine refers to the mapping or indexing that links the data from different columns that belong to the same row. Pros: 1. Analytical Queries: They excel at performing analytical queries and aggregations on large datasets. This is because reading a single column's worth of data often requires reading fewer disk blocks. 2. Compression: Similar data within a column can be compressed efficiently. By grouping the same type of data together, algorithms can use patterns in that data to reduce storage space. 3. Query Performance: As you only need to read the columns involved in the query, I/O is reduced, leading to faster query performance. Cons: 1. Write Performance: Inserting, updating, or deleting rows can be slower in a columnar database because changes to a single row touch multiple columns. 2. Transactional Queries: They can be inefficient for queries that retrieve an entire record or perform frequent small writes. 3. Complex Queries: While they are good for simple aggregations, they can sometimes struggle with more complex queries that involve multiple tables and joins. Columnar databases excel in read operations but can become a challenge in write operations, particularly updates. Because the data for a single row is spread across different column files, updating a single row requires touching more places in storage. Optimizing Table Reads: Partitioned Table: Partitioning tables is a technique used to divide a large table into smaller, more manageable partitions. Each partition contains a subset of data, typically based on values in one or more columns (such as the insertion date). Querying a partitioned table can be much faster on large datasets because it allows the database to skip over partitions that don't contain relevant data. Materialized View: A materialized view is a database object that contains the results of a query and can be updated as data changes. Unlike a standard view, which is just a saved query that is executed each time the view is accessed, a materialized view actually stores the result set in physical storage. This allows for faster access since the database can return the stored result directly without having to execute the underlying query again. Materialized views can be indexed for performance but must be refreshed to stay in sync with the underlying data.