Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
|
#Execution
#Parallel
|
Mosel is a small fragments of input data
Many-Core archtecture should take NUMA local processing into account, i.e. NUMA locality
Machine-dependent number of threads
Threads are pinned to the cores, avoid thread moving across different cores
Keep pipeline with homogeneously sized morsels (exchange between to adjacent pipelines) to avoid skewed data distribution
Scheduling goals
- Preserving (NUMA-)locality by assigning data morsels to cores on which the morsels are allocated
- Full elasticity concerning the level of parallelism of a particular query
- Load balancing requires that all cores participating in a query pipeline finish their work at the same time in order to prevent (fast) cores from waiting for other (slow) cores
Work stealing
The morsel size is not very critical for performance, it only needs to be large enough to amortize scheduling overhead while providing good response times
|
✅ |
★★★★★ |
Push vs. Pull-Based Loop Fusion in Query Engines
|
#Execution
|
|
✅ |
★★★ |
Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask
|
#Execution
|
Vectorization(pull base) and data-centric code generation(push base) are both good
Data-centric code generation is better when executing calculation-heavy queries, while vectorization is good at hiding cache miss latency
Two constraints of vectorization: It can (i) only work on one data type2 and it (ii) must process multiple tuples
Data-centric code generation can avoid materialize intermediate result in some degree
Data-centric code generation perform more complex loop, which leads to more expensive penalty of branch miss and cache miss
Observation: SIMD is only beneficial when all data fits into the cache
|
✅ |
★★★★★ |
Implementing Database Operations Using SIMD Instructions
|
#Execution
|
Branch Misprediction
Tres struct with SIMD
B+ Tree’s leaf node do not need to be stored in order if sequential search is choosed
Fully use of SIMD by mapping non-fixed size datatype to fixed size datatype
Fully speed up SIMD by mapping large size to smaller size which means more data in one instruction, and need a second check by the original datatype due to false positive
|
✅ |
★★★★★ |
MonetDB/X100: Hyper-Pipelining Query Execution
|
#Execution
#Vector Processing
|
How CPU works
- Pipelining, branch, cache, memory access
- Super-scalar CPU can take multiple instructions into execution in parallel if they are independent, because CPU have more than one pipeline
- Radix-partitioned hash-join has cache-friendly memory access pattern
Microbenchmark: TPC-H Query 1
- Binary Association Table, BAT
- Full materialization may offset the benefit from the vectorization, because memory access becomes a bottleneck when bandwidth is full
Vectorized Query Processor
- Chunk at a time, cache-resident data, vector size, range from 1k to 8k, works well
- Data in vertically fragmented form
- MonetDB Instruction Language, MIL
|
✅ |
★★★★★ |
Interleaved Multi-Vectorizing
|
#Execution
#Vector Processing
|
[Feature] improve hash join performance by coroutine-based interleaving
Pointer chasing applications:
- traversing skip lists
- lookup hash table
- searching trees
SIMD eliminate branches by compressing and expanding vector
Miss Status Holding Registers, MSHR, and Line Fill Buffers, LFB
Prefetching
- prefetching distance
- prefetching-optimized techniques:
- Group Prefetching, GP
- Software Pipelined Prefetching, SPP
- Memory Access Chaining, AMAC
Interleaved Multi-Vectorizing, IMV
- Coroutine
- Splitting states for a vectorized program
- Residual vectorized states, RVS, with divergent vectorized states, DVS
|
✅ |
★★★★★ |
Self-Tuning Query Scheduling for Analytical Workloads
|
#Execution
#Scheduling
|
Context
- It's difficult to retain competitive query performance when the system is under a high load. In contrast, system tends to become less responsive, taking longer to provide the desired insights, and the query performance tends to become unpredictable
- Performance should degrade as gracefully as possible
- Database systems like PostgreSQL transfer scheduling responsibilities to the operating system (OS). They create individual OS threads or processes for every new connection and execute the incoming requests in an isolated fashion
- Many modern database systems deviate from this classic approach. Their parallelization scheme closely resembles task-based parallelism. Each query is split into a set of independent tasks, which can be executed in parallel by different OS threads
- Other systems like HyPer and Umbra relieve the OS from almost all scheduling decisions. On startup, they spawn as many OS threads as there are CPU cores. Since these threads do not block when executing tasks, this maximizes performance. The system is not oversubscribed and context switches are kept to a minimum. The task-based parallelization scheme is realized through so-called morsels. A morsel represents a fixed set of tuples in the context of an executable pipeline and is the smallest unit of work during query execution
Scalable task scheduling
- Stride scheduling gives smaller stride on bigger priority
- Tasks are executed by OS threads. On startup, Umbra creates as many OS threads as there are CPU cores. We also call these threads worker threads. The worker threads are only responsible for executing scheduler tasks. This design minimizes context switches and prevents oversubscription
- Worker threads should not block
|
👀/Chap2.3 |
★★★★★ |
Parallel Merge Sort
|
#Execution
#Sort
|
Sort models, circuits and PRAM(parallel random access memory)
- CRCW(concurrent read and concurrent write) PRAM
- CREW(concurrent read and exclusive write) PRAM
- EREW(exclusive read and exclusive write) PRAM
Tree-based merge sort, CREW algorighm
- Think about the tree-based merge sort, items are merged level by level from bottom up. For merge operation, different level processes different amount of items, and we use
log(N) stages to merge N items. And here comes the key observation that the merges at the different levels of the tree can be pipelined
- Runs in
O(log(N)) time on N processors
- Definitions:
L(u) denotes the final sorted array of the subtree rooted at node u
UP(u) denotes the subset of L(u) , which will become a more accurate approximation of L(u) as stage goes forward
SUP(u) denotes the subset of UP(u) , which is also sorted
0 < |UP(u)| < |L(u)| , then node u is inside node
|UP(u)| = |L(u)| , then node u is external node
L is a c -cover of J if each interval induced by an item in L ([e, g] , where e and g are two adjacent items in L ) contains at most c items from J . And, usually |L| < |J|
OLDSUP(v) is a 3-cover of SUP(v)
- As
UP(u) = OLDSUP(v) ∪ OLDSUP(w) , we can deduce that UP(v) is a 3-cover of SUP(v) and UP(v) is a 3-cover of SUP(w) , why???
- For every node
u at every stage, we do the following two steps:
- Formulate
SUP(u)
- For inside node:
SUP(u) = every forth item of UP(u)
- For external node:
- First stage as it becomes an external node:
SUP(u) = every forth item of UP(u)
- Second stage as it becomes an external node:
SUP(u) = every second item of UP(u)
- Third or later stage as it becomes an external node:
SUP(u) = UP(u)
- Formulate
NEWUP(u)
- For inside node:
NEWUP(u) = SUP(v) ∪ SUP(w) , v and w are child nodes of node u
- For external node:
NEWUP(u) = UP(u)
- Merge process:
- Assume
UP(u) -> SUP(v) , UP(u) -> SUP(w)
- Step1: compute
NEWUP(u) . If SUP(v) <--> SUP(w) , then we can know rank of any item of SUP(v) or SUP(w) in NEWUP(u) by adding its ranks in SUP(v) and SUP(w)
- Substep1: For each item in
SUP(v) we compute its rank in UP(u) , SUP(v) <--> UP(u) , SUP(w) <--> UP(u)
- Substep2: For each item in
SUP(v) we compute its rank in SUP(w) by making use of SUP(v) <--> UP(u) , SUP(w) <--> UP(u)
- Step2: compute
NEWUP(u) -> NEWSUP(v) , NEWUP(u) -> NEWSUP(w)
|
✅ |
★★★★★ |
On the Nature of Merge: External Merge, Internal Merge, and Parallel Merge
|
#Execution
#Sort
|
External merge takes two distinct rooted structures and joins them into one
Internal merge takes a subpart of an existing structure as one of two objects
Parallel merge combines the properties of both
This article is extremely obscure
|
🤷🏻 |
★ |
Optimizing parallel bitonic sort
|
#Execution
#Sort
|
Given that this article was published in 1997 when CPU had only one core in most cases, so parallel algorithm need to take network communication into consideration, which no longer exists in the multi-core parallelism
Most of the research on parallel algorithm design in the 70s and 80s has focused on fine-grain models of parallel computation, such as PRAM or network-based models
N keys will be sorted in log(N) stages, for each bitonic sequence of i-th stage contains 2^i items, and it can be Butterfly merged in i steps
Naive and improved data layouts
- Blocked layout: mapping
N items on P processors, with n = N / P . The first log(n) stages can be exected locally. For the subsequent stages log(n) + k , the first k steps require remote communication whil the last log(n) steps are completely local
- Cyclic layout: mapping
N items on P processors by assigning the i-th item to the (i % n) % P processor, with n = N / P . The first log(n) stages require remote communication. For the subsequent stages log(n) + k , the first k steps are completely local while last log(n) steps require remote communication
- Cyclic-blocked layout: by periodically remapping the data from a blocked layout to a cyclic layout and vice verse can reduce the communication overhead
Optimization communication
- Absolute address(
log(N) bits long): represents the row number of the node in the bitonic storing network
- Relative address: the first
log(P) bits represent the processor number, and the last log(n) bits represent the local address of the node after the remap
- Key observation: After the first
log(n) stages (which can be entirely executed locally under a blocked layout) the maximum number of successive steps of the bitonic sorting network that can be executed locally, under any data layout, is log(n) (where n = N / P , N is data size, P is number of processors)
- We can thus reformulate the problem as: Given the tuple (stage, step), which uniquely identifies a column of the bitonic sorting network, how to remap the elements at this point in such a way that the next
log(n) steps of the bitonic sorting network are executed locally
log(n) successive steps may cross stage
- The purpose of the first
log(n) stages is to form a monotonically increasing or decreasing sequence of n keys on each processor, thus we can replace all these stages with a single, highly optimized local sort
|
✅ |
★★★★★ |
Parallel Merge Sort with Load Balancing
|
#Execution
#Sort
|
Main contribution: find a way to merge two lists, each of which is distributed in multiple processors, rather than store them on a single processor
Items are merged level by level from bottom up:
- Group is a set of processors that are in charge of one sorted list, the higher level, the more processors there will be
- Each merge comprises two groups(partner groups), each group will maintain a histogram
- Before merging, each group exchanges histograms to form a new one to cover both, and each processor then divideds the intervals of the merged histogram into
2 * |group| parts so that the lower indexed processors will keep the smaller half, and the higher will keep the larger half. And each processor sends out the half intervals that belongs to the other processors for further merge
- Key observation: for each non-overlap interval, which means amoung all the processors only one at most data set exists, no merge operation required. And for each overlap interval, k-merge or cascaded merge is required to merge the 2 or more data sets
- Questions:
- How to form a histogram if there are mutilply sort keys?
- How is the balance when there is data skew?
- How to make sure that each processor maintain a histogram with same intervals
|
✅ |
★★★★★ |
Parallelizing Fundamental Algorithms such as Sorting on Multi-core Processors for EDA Acceleration
|
#Execution
#Sort
|
Granularity vs. Load Balancing. Fine grain has better load balancing while coarse grain has smaller overhead
Parallel quick sort
Parallel merge sort
Parallel map sort
|
✅ |
★★★★★ |
A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000
|
#Execution
#Sort
|
Parallel quick sort has four phases:
- The parallel partition of the data phase
- The primary operation unit is a block, which holds a consecutive of keys
- Pivot(a key) is selected by one of the processors, assuming
P0 is the selected processor
- Each processor indenpendly partition its local data with the above pivot using a function called
neutralize which takes two blocks from left and and right end as input, the process is similiar to the generic quick sort's partition in the high level. But there may be at most one block remains after partition, because neutralize always requires two blocks as input
- The sequential partition of the data phase
P0 works on the remaing blocks of all the processors in the similar way(neutralize )
- Finally, we get two subarray stride across the pivot with left side small or equal to pivot, right side large or equal to pivot
- The process partition phase
- Partition all processors into two groups base on the size of the subarray
- Repeat the process from phase one to phase three, until every group contains only one processor
- The sequential sorting in parallel with helping phase
|
✅ |
★★★★★ |
Parallelization of Modified Merge Sort Algorithm
|
#Execution
#Sort
|
Good summary of previous work
|
👀 |
|
Distributed Top-k Query Processing by Exploiting Skyline Summaries
|
#Execution
#Sort
|
Optimize TopN when N is large
|
✅ |
★★★ |
[Efficient-External-Sorting-in-DuckDB]
|
#Execution
#Sort
|
The cost of sorting is dominated by comparing values and moving data around
Two main ways of implementing merge sort: K-way merge and cascade merge
|
👀 |
★★★ |
Merge Path - A Visually Intuitive Approach to Parallel Merging
|
#Execution
#Sort
|
Key idea: use diagonal to cut the merge path
Cache Efficiency
- Three types of cache miss, compulsory, capacity, contention
- Three types of associativity, full associative, direct-mapped, group-mapped
Cache-efficient parallel merge
- Key idea: ensure that all elements that may be active at any given time can co-reside in cache
- Multi-processors process cache-size data at one iteration
|
✅ |
★★★★★ |
Efficient Parallel Merge Sort for Fixed and Variable Length Keys
|
#Execution
#Sort
|
Sort algorithm classification:
- Radix sort, rely on a binary representation of the sort key
- Comparison sort, allow user-specified comparison function. Including quick sort, merge sort
Three stages merge sort with GPU
- Block sort, which can be further divided into two stages. First, each thread loads eight elements in registers and sorts them using bitonic sort. Second, merge these 8-element segments together
- Merge sort-simple, two moving windows, one in register and the other in memory
- Merge sort-multiple, allow CUDA blocks to cooperate in merging two sequences
|
✅ |
★★★ |
Optimization of Analytic Window Functions
|
#Execution
#Window Function
|
Full Sort, Hashed Sort, Segmented Sort
Segment Relation
Reorderable
SS-reorderable, only reorder in segment level to match expected order property, which not degenerating to full sort
Cover Set-based Evaluation
|
✅ |
★★★★★ |
Efficient Processing of Window Functions in Analytical SQL Queries
|
#Execution
#Window Function
|
Basic Concepts:
- Partitioning
- Ordering
- Framing
- Window Expression:
- ranking, rank/dense_rank/row_number/ntile
- distribution, percent_rank/cume_dist
- navigation in partition, lead/lag
- distinct aggregates, min/max/sum
- navigation in frame, first_expr/last_expr
Pre-Partitioning into Hash Groups
Aggregation Algorithms:
- Naive Aggregation
- Cumulative Aggregation
- Removable Cumulative Aggregation
- Segment Tree Aggregation
|
✅ |
★★★★★ |
Analytic Functions in Oracle 8i
|
#Execution
#Window Function
|
Minimization of number of sorts
Predicate Pushdown
|
✅ |
★★★★★ |
Incremental Computation of Common Windowed Holistic Aggregates
|
#Execution
#Window Function
|
Function classification, including tuple-functions, aggregate-functions, window functions
Aggregate-functions can be subdivided into distributive aggregates, algebraic aggregates, holistic aggregates
|
👀 |
|
An Overview of Query Optimization in Relational Systems
|
#Optimizer
#Overview
|
query optimization can be viewed as a difficult search problem:
- A space of plans
- A cost estimation technique
- An enumeration algorithm
System R's enumeration algorithm comprises two important parts:
- Dynamic programming to produce the optimal plan
- Interesting orders, the idea of which is later generalized to physical properties
Search space
- Join sequences, including outerjoin and join, while outerjoin has more limitations
- SPJ with group-by, group-by push down
- Reducing multi-block queries to single-block:
- Merging views,
{Q=R⋈V|V=S⋈T} -> {Q=R⋈S⋈T} , then may be freely reordered
- Merging nested subqueries, the generic way to process the correlated subqueries
- Using semijoin like techniques for optimizating multi-block queries
Statistics and cost estimatio
- Statistical summary is a logical property but the cost of a plan is a physical property
- Histogram
- Equi-depth(height) is effective for either high or low skew data
- Histogram is working on single column, do not provide information on the correlations among columns. One option is to consider 2-dimensional histograms, which will cost much more space
- Sampling
- Sampling is estimation of statistics, the key challenge is to limit the error in estimation
- The task of estimating distinct values is provably error prone
- Propagation of statistical information through operators
- Cost computation, CPU/Memory/IO/Paralleliasm
Enumeration architectures
- Starburst
- Query Graph Model(QGM)
- query rewrite phase, without cost info
- plan optimization phase, with estimated cost and physical properties, properties are propagated as plans are built bottom-up
- Colcano/Cascades
- Including two kinds of rules, transformation rules and implementation rules. And there is no clearly boundary between the two kinds of rules
- Logical properties, physical properties and cost are used during enumeration
- Use dynamic programming in a top-down way, memorization
- Goal-driven
Beyound the fundamentals
- Distributed and parallel databases, replication for physical distribution and parallelism for scale-up
- User defined functions, UDF. Bring problems to cost estimation
- Materialized Views
- Defer generation of complete plans subject to availability of runtime information
|
✅ |
★★★★★ |
The Cascades Framework for Query Optimization
|
#Optimizer
#Framework
|
Framework Concepts and Components
Sketchily
|
✅ |
★★★ |
Efficiency in the Columbia Database Query Optimizer
|
#Optimizer
#Framework
|
Framework Concepts and Components
- Logical/Physical operators
- Expression/ExpressionGroup
- Search space
- Rules
- Tasks
Optimize tasks
- Group optimization task is for finding the cheapest plan in this group
- Group exploration task is for expanding the search space
- Expression optimization task
- Rule application task
- Input optimization task is for property enforcing, cost calculating and pruning
|
✅ |
★★★★★ |
How Good Are Query Optimizers, Really?
|
#Optimizer
|
Cardinality Estimation is more important than Cost Model
|
✅ |
★★★★★ |
Orca: A Modular Query Optimizer Architecture for Big Data
|
#Optimizer
|
Orca Architecture
- Ability to run outside the database system as a stand-alone optimizer through Data eXchange Language(DXL)
- Memo
- Search and Job Scheduler, including three main steps: exploration, implementation, optimization
- Transformations
- Property Enforcement, including logical property(output columns) and physical property(sort order, distribution)
- Metadata Cache
- GPOS
Optimization workflow
- Exploration, possibly creating new group expressions and new groups into memo, such as Join Commutativity Rule/
- Statistics Derivation, used to derive estimates for cardinality and data skew
- Implementation, shifting from logical to physical
- Optimization, properties enforcement and cost computation
Parallel query optimization, task dependency
|
✅ |
★★★★★ |
Orthogonal Optimization of Subqueries and Aggregation
|
#Optimizer
#Subquery
|
Scalar Aggregate means without group by colums, which always returns one row
Subquery classification:
- boolean-valued subquery, including exist/in/quantified comparisons
- scalar subquery, which need
Max1row operator
Correlated subquery have three execution strategies:
- correlated execution
- outerjoin then aggregate
- aggregate then join
Algebraic representation
Remove Correlations, which means the recursive calls between scalar and relational execution are removed, and typically results in outerjoins
Pushing down Apply
For (not) exist/in subquery, semijoin for exists, antisemijoin for not exist
Subquery classes based on different processing strategies:
- Class 1. Subqueries that can be removed with no ad- ditional common subexpressions
- Class 2. Subqueries that are removed by introducing additional common subexpressions
- Class 3. Exception subqueries
GroupBy reorder conditions:
- case groupBy with filter
- case groupBy with join/outerjoin
Local Aggregate, Local GroupBy
SegmentApply, dividing table to indenpendent parts, and perform Apply operator on each parts indenpendently
|
✅ |
★★★★★ |
Of Nests aud Trees: A Untied Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers
|
#Optimizer
#Subquery
|
|
|
|
Complex Query Decorrelation
|
#Optimizer
#Subquery
|
The concept of correlation in SQL is similar to the use of non-local variables in block-structured programming languages
Set-oriented
Count bug, every missing groupBy key is expected to have a zero output
Magic Decorrelation:
- FEED Stage, during which feeding the correlation to its children
- ABSORB Stage, during which absorbing the correlation and resulting in a decorrelated query
- non-SPJ Box, SPJ is the abbreviation of Select-Project-Join
- SPJ Box
Experiment Comparisons:
- Nested Iteratio, NI
- Kim's method
- Dayal's method
- Magic Decorrelation
|
✅ |
★★★★★ |
Enhanced-Subquery-Optimizations-in-Oracle
|
#Optimizer
#Subquery
|
Subquery coalesce
- Of same type, e.g. two exist subqueries
- Of different type, e.g. exist and not exist subqueries
Execution enhancements:
- Cost-based parallel group-by pushdown, GPD. Short circuit by HAVING clause
Subquery removal using window functions:
- Correlated subsumed subquery, TPCH-Q2/Q17
- Uncorrelated subsumed subquery, TPCH-Q15
- Subsumed subquery in HAVING clause, TPCH-Q11
- Subquery Producing a Multi-Set, MAX(MAX), MIN(MIN)
Scalable parallel execution:
- Grand-total window function optimization, which can be extended to low cardinality partition-by keys
Null-aware anti join, NAAJ
|
✅ |
★★★★★ |
WinMagic : Subquery Elimination Using Window Aggregation
|
#Optimizer
#Subquery
|
Key idea: for every distinct value of correlated column, if the row set of the subquery and the outer block are exactly the same, then the aggregate of subquery can be replaced with a corresponding window function of outer block
Conditions that are required to meet
- Scalar correlated aggregation subquery with only equal predicate
- Aggregate function has a corresponding version of window function
- Aggregate function DO NOT contains DISTINCT
- Tables of subquery and outer block are exactly the same, taking correlated outer table into account for subquery
- Predicates of subquery and outer block are exactly the same, except the correlated outer table only related predicates
TPCH-Q2, TPCH-Q17
|
✅ |
★★★★★ |
Outerjoin Simplification and Reordering for Query Optimization
|
#Optimizer
#Join
|
Outerjoin workloads:
- Database merging, two-sided outerjoin
- Hierarchical views
- Nested queries
- Universal quantifiers
Outerjoin optimizations:
- Outerjoin simplification: LOJ can be rewritten as regular join if later operator discards the null-padded tuples. And this can be extended to ROJ and FOJ
- Join/Outerjoin associativity
- Generalized outerjoin
|
🤷🏻 |
★★★ |
Including Group-By in Query Optimization
|
#Optimizer
#Join
#Aggregate
|
A single group-by(with multi group by columns) is replaced by multi group-by in stages, interleaved with join
Too few pictures to understand
|
🤷🏻/2 |
★★ |
Groupwise Processing of Relational Queries
|
#Optimizer
#Join
#Aggregate
|
|
|
|
Optimization of Common Table Expressions in MPP Database Systems
|
#Optimizer
#CTE
|
The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query
CTEs achieve two goals, making query more readable, making execution more efficient
CTEs follow a producer/consumer model
Chanllenges
- Deadlock hazard, operator's execution order conflicts with the CTE's execution order
- Enumerating inlining alternatives, determining which ones need to be inline (for example, inline to utilize index) and which ones don't
- Contextualized Optimization
CTE representation
- CTEProducer
- CTEConsumer
- CTEAnchor
- Sequence
- For case of CTE inline, CTEAnchor is removed and CTEConsumer is replaced with the whole CTE definition
- For case of CTE no-inline, CTEAnchor is replaced by Sequence operator which has the CTEProducer as its left child and the original child of the CTEAnchor as its second child
Plan enumeration
- Transformation rules, CTEAnchor/Sequence/NoOp in one group, CTEConsumer/CopyOfCTEDefinition in one group
- Avoid invalid plans, such as CTEProducer without CTEConsumer, Sequence without CTEConsumer, NoOp with CTEConsumer(inlined)
- Optimization, predicate push down, always inline single-use CTEs, elimination of unused CTEs
Contextualized optimization
- In ORCA, CTEs are considered later after regular optimization is done
- Enforcing physical properties, pushing the CTEConsumer's distribution requirement to the CTEProducer
- Cost Estimation
CTE based optimizations
- CTE-Generating transformations, such as two count(distinct)
- Common subexpression elimination
|
✅ |
★★★★★ |
Are We Ready For Learned Cardinality Estimation?
|
#Optimizer
#Cardinality Estimator
|
Cardinality Estimator
Cost Model
|
✅ |
★★★★★ |
NeuroCard: One Cardinality Estimator for All Tables
|
#Optimizer
#Cardinality Estimator
|
|
|
|
Sampling-Based Estimation of the Number of Distinct Values of an Attribute
|
#Optimizer
#Sampling
|
Introduce Many Estimators
Data Skewness
|
✅ |
★★★★★ |
Towards Estimation Error Guarantees for Distinct Values
|
#Optimizer
#Sampling
|
|
|
|
SQL Memory Management in Oracle 9i
|
#Memory
|
Ways of memory management:
- Static configuation
- Based on estimation
- Taking into account the used memory
- Based on demand
Oracle Memory Architecture:
- System Global Area, SGA, is shared by all the server processes
- Process Global Area, PGA, is indenpendently hold by each server process
Automatic PGA Memory Management
- Feedback
- Memory Bound is published by a background daemon which indirectly determines the size of each active work
|
👀/4 |
|
Robust and efficient memory management in Apache AsterixDB
|
#Memory
|
AsterixDB has three main memory sections, including one for in-memory components, a buffer cache, and working memory
Each operator has three memory sections, including input buffer, execution memory, output buffer
The budget for a memory-intensive operator is determined by operator-specific parameters (eg, 32 MB) and the system converts the budget into a number of pages (M) using the system's page size parameter. This means the memory-intensive operator can request M pages from the working memory at maximum and uses these pages as its execution memory
Sort operator
- Since comparing normalized binary-representations of field values is more efficient than comparing the original values, the concept of normalized key has been widely used
- Also, rather than moving an actual record during a sort, normally an array of record pointers is used in most implementations to deal with variable-length records
- An external sort consists of two phases, build and merge. In build phase, the operator gradually allocates more pages to hold incoming records until the buget is exhausted, then sorts it and writes it to a temporary run file on disk, and then continute to process the follow-up incoming data. In merge phase, the operator recursively multiway merges run files and generates the final results
- The sort operator also uses an additional data structure called the record pointer array that contains the location and the normalized key of each record in memory. This array is needed to avoid performing an in-place swap between two records during in-memory sorting since each record's length is generally different because of variable-length fields. Also, this array can improve the performance since comparing two normalized keys in the record pointer array using binary string comparison can be much faster than comparing the actual values between two records, which requires accessing the actual records in pages
Group-by operator
- Group-by operator use both data partition table abd hash table. Data partition table holds the aggregate records. Hash table holds the localtion of the aggregate records in the data partition table(can be seen as index)
Hash join operator
- In AsterixDB, the hash join and the hash group-by operators use very similar table structures
Inverted-index search
- In AsterixDB, inverted-index is used to perform key word search
|
👀/Chap6 |
★★★★★ |
An Overview of Data Warehousing and OLAP Technology
|
#Overview
#Warehousing
|
Maximizing transaction throughput is the key performance metric to OLTP
Query throughput and response times of ad hoc/complex queries is the key performance metric to OLAP
Features
- Multi dimensions
- Multi sources might contains data of varying quality, or use inconsistent representations, codes and formats
- Rollup, increasing the level of aggregation
- Drill-down, decreasing the level of aggregation or increasing detail
Extracting/Cleaning/Transforming/Loading
Refresh through replication techniques, including data shipping and transaction shipping
Database design methodology
- Star schema, consists of a single fact table and a single table for each dimension
- Snowfalke schema, based on the star schema, where the dimensional hierarchy is explicitly represented by normalizing the dimension tables
Warehouse server
- Index structures
- Materialized views
- Transformation of complex queries
- Parallel processing
Metadata and warehouse management
|
✅ |
★★★★★ |
Optimizing Queries Using Materialized Views:A Practical, Scalable Solution
|
#Materialized View
|
Three issues:
- View design, about how to store and index
- View maintenance, about how to update
- View exploitation, about when to use
An indexable view must be defined by the following conditions:
- A single-level SQL statement containing selections, (inner) joins, and an optional group-by
- The from clause cannot contain derived tables, i.e. must reference base tables, and subqueries are not allowed
- The output of an aggregation view must include all groupping columns as output columns and a count column
- Aggregation functions are limited to sum and count
View matching is a transformation rule that is invoked on select-project-join-group-by expression, SPJG
For a SPJ query expression to be computable from a view, the view must satisfy the following requirement:
- The view contains all rows needed by the query expression, i.e. checking compability of equality/range/residual predicates
- All required rows can be selected from the view, i.e. whether columns of compensating equality/range/residual predicates exist
- All output expressions can be computed from the output of the view, i.e. whether columns of output expressions exist
- All output rows occur with the correct duplication factor
Views with extra table
- Cardinality-preserving Join, a join between tables T and S is cardinality preserving if every row in T joins with exactly one row in S
Aggregation queries and view, which can be treated as SPJ query followed by a group-by operation. The following requirements should be satisfied
- The SPJ part required requirements
- All columns required by compensating predicates (if any) are available in the view output
- The view contains no aggregation or is less aggregated thanthe query
- All columns required to perform further grouping (if necessary) are available in the view output
- All columns required to compute output expressions are available in the view output
Fast filtering of views:
- Filter tree
- Lattice index
- Filter conditions including source tables, hubs, output columns, output expressions, range constraints, residual constraints, grouping columns, grouping expressions
|
✅ |
★★★★★ |