Selecting all columns in a table can significantly impact performance due to network costs and the deserialization and compression of large columns.
In a row-store database, selecting all columns can lead to additional IO operations and overhead when accessing external tables and handling large columns.
Deep dives
Select Star is Slow
Select Star is a universal SQL syntax that returns all columns in a table. Although it has been said to be slow and should be avoided, it is important to understand why. One reason is that returning multiple columns is slower than returning a single column. This is especially true when dealing with network aspects, programming logic, memory management, and how the OS works. Another reason is that when using Select Star, index-only scans are no longer feasible. Index-only scans allow the database to directly access the columns specified in the index, without going through the table data. However, when using Select Star, the database has to fetch all columns, including those that are not specified in any index. This leads to additional random reads and increased IO. Additionally, the deserialization cost of retrieving and parsing all columns can add up, especially when dealing with large text, blob, or JSON fields. Lastly, network costs must be considered, as the data needs to be serialized, transmitted across the network, and deserialized by the client app. Overall, Select Star can significantly impact performance and should be used with caution.
Not All Columns are Inline
In a row-store database, where rows are stored one after another, all columns within a row are considered inline. However, if a column is too large to fit within a single page, it cannot be stored inline. In such cases, a pointer is assigned to the column, which refers to an external table where the large column data is stored. This introduces additional IO operations, as accessing the external table can require fetching data from disk. Moreover, the deserialization and compression of these large columns further add to the overhead, as the database needs to allocate memory and perform additional operations to handle these columns effectively. The limitation on the size of columns based on fixed page sizes is also a factor to consider, as it determines the number of columns that can be stored inline.
Network Costs
Selecting all columns in a table incurs network costs, especially when the database and the client app are not running in the same location. The latency introduced by network communication can be significant and affects the overall performance. Furthermore, the size of the data transmitted can impact the network transmission. If the response size is large, it may require multiple IP packets or TCP segments to transmit the data, leading to higher transmission times. The congestion control algorithm for network transmission plays a role as well, starting with small initial transmission sizes and incrementally increasing them. As a result, selecting all columns can add to the network latency and overhead.
Client Deserialization
Once the data is transmitted over the network, the client app needs to deserialize it, converting raw bytes into usable data structures. The client-side library used for database communication plays a role in this process. Depending on its implementation, the library may choose to lazily parse the data, only deserializing and handling parts of it as needed. This can help reduce the overhead of client-side deserialization and memory allocation. However, when selecting all columns, the client may end up deserializing and handling unnecessary data, leading to additional processing time and memory usage. It is important to consider the serialization and deserialization costs on both the server and client sides when selecting all columns in a table.
Fundamentals of Database Engineering udemy course (link redirects to udemy with coupon)
https://database.husseinnasser.com
In a row-store database engine, rows are stored in units called pages. Each page has a fixed header and contains multiple rows, with each row having a record header followed by its respective columns. When the database fetches a page and places it in the shared buffer pool, we gain access to all rows and columns within that page. So, the question arises: if we have all the columns readily available in memory, why would SELECT * be slow and costly? Is it really as slow as people claim it to be? And if so why is it so? In this post, we will explore these questions and more.
0:00 Intro
1:49 Database Page Layout
5:00 How SELECT Works
10:49 No Index-Only Scans
18:00 Deserialization Cost
21:00 Not All Columns are Inline
28:00 Network Cost
36:00 Client Deserialization
https://medium.com/@hnasr/how-slow-is-select-8d4308ca1f0c
Remember Everything You Learn from Podcasts
Save insights instantly, chat with episodes, and build lasting knowledge - all powered by AI.