For a number of core database query optimization techniques, such as cardinality estimation, it is implicitly assumed that the columns of a table are independent of each other. This assumption is extremely useful: if columns A and B are independent, the database can predict their joint behavior using only statistics on the individual columns. It is also known that real-world data can be correlated, but exactly how widespread such correlation is in practice is not well understood. In this project, we seek to evaluate the validity of the independence assumption on real-world data.
For a variety of real-world datasets, and across all possible combinations of columns, how often does the independence assumption hold? What are efficient and accurate methods for measuring column independence?
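One simple way to quantify (in)dependence is to compare the observed joint distribution of two columns against the product of their marginals, which is exactly the quantity the optimizer's independence assumption substitutes in. The sketch below is illustrative; the function name and the choice of maximum absolute deviation as the error metric are our own assumptions, not a fixed methodology.

```python
import random
from collections import Counter

def independence_error(rows):
    """Measure how far columns A and B deviate from independence.

    For each observed (a, b) pair, compare the joint frequency
    P(A=a, B=b) against the product of marginals P(A=a) * P(B=b).
    Returns the maximum absolute deviation (0.0 = perfectly independent).
    """
    n = len(rows)
    joint = Counter(rows)
    a_marg = Counter(a for a, _ in rows)
    b_marg = Counter(b for _, b in rows)
    worst = 0.0
    for (a, b), c in joint.items():
        expected = (a_marg[a] / n) * (b_marg[b] / n)
        worst = max(worst, abs(c / n - expected))
    return worst

random.seed(0)
# Two independently drawn columns: deviation should be near zero.
indep = [(random.randint(0, 3), random.randint(0, 3)) for _ in range(20000)]
# Perfectly correlated columns (B copies A): deviation should be large.
corr = [(x, x) for x, _ in indep]
print(independence_error(indep), independence_error(corr))
```

A production-quality measurement would likely use a statistical test such as chi-squared and operate on samples rather than full columns, but this captures the quantity of interest.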
Wouldn’t it be great if we could answer a query without looking at the data itself? This is possible in some scenarios with the help of metadata. For example, if a query asks for quantities smaller than the minimum of the dataset, we can return the empty answer right away. In this project, we seek to define the space of SQL queries that can be computed or approximated from metadata alone. We are especially interested in the “knowing when you are right” scenario: queries that we can answer correctly with high confidence.
What kind of metadata should we store to help answer queries? Metadata can take the form of summary statistics, such as column independence and the mean/standard deviation of values, or precomputed structures such as data cubes, but it should be limited in size. Given the metadata, what is the set of queries it can answer with high precision or with some error guarantee?
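The minimum-of-the-dataset example above can be made concrete with a few lines of code. The following is a hypothetical sketch, assuming only min/max/count are stored per column; it answers a range-count query exactly when the predicate falls outside the column's value range, and explicitly reports "must scan" otherwise.

```python
class ColumnStats:
    """Minimal per-column metadata: min, max, and row count (illustrative)."""

    def __init__(self, values):
        self.min = min(values)
        self.max = max(values)
        self.count = len(values)

    def count_less_than(self, v):
        """Answer SELECT COUNT(*) WHERE col < v from metadata alone.

        Returns an exact count when the predicate falls outside the
        column's [min, max] range, and None ("must scan") otherwise.
        """
        if v <= self.min:
            return 0           # no value can be below the minimum
        if v > self.max:
            return self.count  # every value satisfies the predicate
        return None            # metadata is insufficient: fall back to a scan

stats = ColumnStats([5, 9, 12, 30])
print(stats.count_less_than(3))    # 0 -- answered without scanning
print(stats.count_less_than(100))  # 4 -- answered without scanning
print(stats.count_less_than(10))   # None -- requires the data
```

The interesting research question is precisely how far this pattern extends: which richer metadata (histograms, cubes, independence statistics) widens the set of queries answerable without the None branch, and at what storage cost.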
Traditional sketches (lightweight data summaries) typically focus on either update time or strict memory requirements. However, these are not necessarily the best metrics to optimize given how sketches are used in end-to-end scenarios. For example, serialization costs, hashing/RNG costs, and slack room to spill to disk or exceed a memory budget can be significant in practice. By looking at how sketches are used in emerging distributed systems such as Ray, we seek to pinpoint new bottlenecks and potentially evolve current sketch designs.
In end-to-end systems, which metrics are important to optimize for sketches? How do the sketches in use today perform on these metrics, and how can we improve them?
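To make the "metrics beyond memory" point concrete, here is a minimal Count-Min sketch alongside a measurement of one end-to-end cost, serialization size, that classical sketch analyses rarely optimize. The implementation and parameters are illustrative assumptions, not a proposed design.

```python
import pickle
import random

class CountMinSketch:
    """Minimal Count-Min sketch (parameters are arbitrary, for illustration)."""

    def __init__(self, width=1024, depth=4, seed=42):
        self.width, self.depth = width, depth
        rng = random.Random(seed)
        self.seeds = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        # One bucket per row, derived from a per-row seed.
        return [hash((s, item)) % self.width for s in self.seeds]

    def add(self, item):
        for row, b in zip(self.table, self._buckets(item)):
            row[b] += 1

    def estimate(self, item):
        # Count-Min never underestimates: the minimum over rows is an
        # upper bound on the true count.
        return min(row[b] for row, b in zip(self.table, self._buckets(item)))

sk = CountMinSketch()
for _ in range(100):
    sk.add("x")
sk.add("y")
print(sk.estimate("x"))  # at least 100

# Serialization cost: one of the end-to-end metrics discussed above.
# Shipping sketches between workers (e.g., in a system like Ray) pays
# this cost on every merge, regardless of the in-memory footprint.
blob = pickle.dumps(sk.table)
print(len(blob), "bytes to ship this sketch")
```

Measuring hashing time per update and (de)serialization time per shuffle round, rather than just table size, is the kind of evaluation this project would systematize.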
As data sizes rapidly increase, it is important for data analysts to still be able to perform exploratory tasks at interactive speed. Hillview is a system that provides this high degree of interactivity via visualization sketches, which combine ideas from sampling and approximation with principles of effective rendering. In this project, we will extend the idea of visualization sketches to additional tasks.
Can we build a sketch for pie charts? Can we build a sketch for time-series visualizations that supports zooming?
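As a starting point for the pie-chart question, the sketch below estimates slice fractions from a uniform sample and attaches a single error bound (via Hoeffding's inequality) that holds for each slice with high probability. This is a hypothetical illustration of the sampling side of a visualization sketch; a real one, as in Hillview, would also need to be mergeable across data partitions.

```python
import math
import random

def pie_chart_sketch(data, sample_size=2000, confidence=0.95, seed=0):
    """Approximate pie-chart slices from a uniform sample.

    Returns ({category: estimated_fraction}, eps), where each slice's
    estimate is within eps of its true fraction with the requested
    confidence, per slice, by Hoeffding's inequality.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    n = len(sample)
    fractions = {}
    for cat in sample:
        fractions[cat] = fractions.get(cat, 0.0) + 1.0 / n
    # Hoeffding: P(|estimate - truth| > eps) <= 2 * exp(-2 * n * eps^2)
    eps = math.sqrt(math.log(2 / (1 - confidence)) / (2 * n))
    return fractions, eps

# A 70/20/10 split: the sample-based slices should land close to that.
data = ["a"] * 7000 + ["b"] * 2000 + ["c"] * 1000
fractions, eps = pie_chart_sketch(data)
print({k: round(v, 3) for k, v in sorted(fractions.items())}, "±", round(eps, 3))
```

The rendering insight of visualization sketches is that eps only needs to be smaller than what the human eye can distinguish in a pie slice, which keeps the required sample size independent of the data size.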
Hash tables are one of the most commonly used data structures for lookups. Previous work has shown that, depending on the data distribution, dataset size, read/write ratio, and so on, the choice of hashing scheme and hash function can lead to significant performance differences. However, the picture is incomplete without a quantitative method for comparing the performance of these hash table variants on a given workload.
How does the skewness of the data distribution affect the performance of different hashing schemes and hash functions? Given a workload, can we build a cost model that predicts the performance of different hashing schemes and hash functions?
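A basic input to such a cost model is the probe count per operation as a function of load factor and hashing scheme. The toy benchmark below measures average insert probes for linear probing; it uses Python's built-in integer hash and is purely illustrative of the kind of measurement a cost model would be fit to, not a realistic implementation.

```python
import random

def avg_insert_probes_linear_probing(keys, table_size):
    """Insert keys into a linear-probing table and return the average
    number of slots probed per insert. Probe count is a hardware-neutral
    proxy for lookup cost that a cost model could predict."""
    table = [None] * table_size
    total = 0
    for k in keys:
        i = hash(k) % table_size
        probes = 1
        while table[i] is not None:   # linear probing: scan to next free slot
            i = (i + 1) % table_size
            probes += 1
        table[i] = k
        total += probes
    return total / len(keys)

rng = random.Random(1)
keys = [rng.getrandbits(64) for _ in range(700)]
low_load = avg_insert_probes_linear_probing(keys, 2048)   # ~34% load factor
high_load = avg_insert_probes_linear_probing(keys, 1024)  # ~68% load factor
print(low_load, high_load)  # probes grow with load due to clustering
```

A full study would vary the hashing scheme (chaining, Robin Hood, cuckoo), the hash function, and the key skew (e.g., Zipf-distributed lookups), then check how well a fitted model predicts wall-clock performance rather than just probe counts.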