
Papers


1 OLAP

Title Category Summary Status Rating
Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
  • #Execution
  • #Parallel
  • A morsel is a small fragment of input data
  • Many-core architectures should take NUMA-local processing into account, i.e. NUMA locality
  • Machine-dependent number of threads
  • Threads are pinned to cores to avoid threads migrating across cores
  • Keep pipelines fed with homogeneously sized morsels (exchanged between two adjacent pipelines) to avoid skewed data distribution
  • Scheduling goals
    • Preserving (NUMA-)locality by assigning data morsels to cores on which the morsels are allocated
    • Full elasticity concerning the level of parallelism of a particular query
    • Load balancing requires that all cores participating in a query pipeline finish their work at the same time in order to prevent (fast) cores from waiting for other (slow) cores
  • Work stealing (see the dispatch sketch after this entry)
  • The morsel size is not very critical for performance, it only needs to be large enough to amortize scheduling overhead while providing good response times
  • ★★★★★
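
A minimal sketch of the morsel-wise dispatch idea above, assuming hypothetical names (`kMorselSize`, `run_pipeline_on`): a shared atomic cursor hands out fixed-size morsels, and idle worker threads keep pulling until the input is exhausted, which gives full elasticity and natural load balancing. NUMA-aware placement (pinning workers and preferring locally allocated morsels) is left out for brevity.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical: a "morsel" is simply a [begin, end) range over the input column.
constexpr std::size_t kMorselSize = 100000;

// Stand-in for running one pipeline over a single morsel.
void run_pipeline_on(const std::vector<int>& input, std::size_t begin, std::size_t end,
                     std::atomic<long long>& result) {
    long long local = 0;
    for (std::size_t i = begin; i < end; ++i) local += input[i];
    result.fetch_add(local, std::memory_order_relaxed);
}

int main() {
    std::vector<int> input(10000000, 1);
    std::atomic<std::size_t> cursor{0};      // shared morsel dispatcher
    std::atomic<long long> result{0};

    auto worker = [&] {
        for (;;) {
            std::size_t begin = cursor.fetch_add(kMorselSize);
            if (begin >= input.size()) break;                       // no morsels left: thread retires
            std::size_t end = std::min(begin + kMorselSize, input.size());
            run_pipeline_on(input, begin, end, result);
        }
    };

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
    return 0;
}
```

Because morsels are small, a fast core simply grabs more of them than a slow one, which is the load-balancing behaviour the paper aims for.
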
    Push vs. Pull-Based Loop Fusion in Query Engines
  • #Execution
  • ★★★
    Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask
  • #Execution
  • Vectorization (pull-based) and data-centric code generation (push-based) are both good
  • Data-centric code generation is better when executing calculation-heavy queries, while vectorization is good at hiding cache miss latency
  • Two constraints of vectorization: it can (i) only work on one data type and (ii) must process multiple tuples
  • Data-centric code generation can avoid materializing intermediate results to some degree
  • Data-centric code generation produces more complex loops, which leads to higher branch-miss and cache-miss penalties
  • Observation: SIMD is only beneficial when all data fits into the cache
  • ★★★★★
    Implementing Database Operations Using SIMD Instructions
  • #Execution
  • Branch Misprediction
  • Tree structures with SIMD
  • B+ tree leaf nodes do not need to be stored in order if sequential search is chosen
  • Make full use of SIMD by mapping variable-size data types to fixed-size data types
  • Further speed up SIMD by mapping large types to smaller ones, which packs more data into one instruction; a second check against the original data type is then needed due to false positives (see the scan sketch after this entry)
  • ★★★★★
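
A hedged sketch of the "map to a smaller type, then re-check" idea; the byte-summary layout, names, and the use of SSE/GCC builtins are my assumptions, not the paper's exact scheme. Sixteen one-byte key summaries are compared per instruction, and every candidate match is verified against the full 32-bit key to filter out false positives.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed layout: summaries[i] holds the low byte of keys[i].
std::vector<std::size_t> simd_equal_scan(const std::vector<uint32_t>& keys,
                                         const std::vector<uint8_t>& summaries,
                                         uint32_t probe) {
    std::vector<std::size_t> out;
    const __m128i probe8 = _mm_set1_epi8(static_cast<char>(probe & 0xFF));
    std::size_t i = 0;
    for (; i + 16 <= summaries.size(); i += 16) {
        __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&summaries[i]));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, probe8));
        while (mask != 0) {                                   // visit candidate lanes only
            int lane = __builtin_ctz(mask);
            mask &= mask - 1;
            if (keys[i + lane] == probe) out.push_back(i + lane);  // second check on the full key
        }
    }
    for (; i < keys.size(); ++i)                              // scalar tail
        if (keys[i] == probe) out.push_back(i);
    return out;
}
```
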
    MonetDB/X100: Hyper-Pipelining Query Execution
  • #Execution
  • #Vector Processing
  • How CPU works
    • Pipelining, branch, cache, memory access
    • A super-scalar CPU can execute multiple instructions in parallel if they are independent, because the CPU has more than one pipeline
    • Radix-partitioned hash-join has cache-friendly memory access pattern
  • Microbenchmark: TPC-H Query 1
    • Binary Association Table, BAT
    • Full materialization may offset the benefit of vectorization, because memory access becomes a bottleneck when bandwidth is saturated
  • Vectorized Query Processor
    • Chunk-at-a-time processing on cache-resident data; vector sizes in the range of 1K to 8K work well (see the primitive sketch after this entry)
    • Data in vertically fragmented form
    • MonetDB Instruction Language, MIL
  • ★★★★★
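
A sketch of an X100-style vectorized primitive (names are illustrative): it processes one cache-resident vector of values per call instead of one tuple at a time, so interpretation overhead is amortized over the whole vector and the loop is easy for the compiler to auto-vectorize.

```cpp
#include <cstddef>
#include <cstdint>

// Typical vector size in the paper's range; a tuning knob, not a fixed constant.
constexpr std::size_t kVectorSize = 1024;

// res[i] = a[i] + b[i] for the 'n' positions listed in the selection vector 'sel'
// (or for all positions when sel == nullptr). Returns the number of produced values.
std::size_t vec_add_int32(std::size_t n, const std::uint16_t* sel,
                          const std::int32_t* a, const std::int32_t* b, std::int32_t* res) {
    if (sel == nullptr) {                  // dense case: tight, branch-free loop
        for (std::size_t i = 0; i < n; ++i) res[i] = a[i] + b[i];
    } else {                               // sparse case: only the selected positions
        for (std::size_t i = 0; i < n; ++i) {
            std::uint16_t j = sel[i];
            res[j] = a[j] + b[j];
        }
    }
    return n;
}
```
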
    Interleaved Multi-Vectorizing
  • #Execution
  • #Vector Processing
  • [Feature] improve hash join performance by coroutine-based interleaving
  • Pointer chasing applications:
    • traversing skip lists
    • lookup hash table
    • searching trees
  • SIMD eliminates branches by compressing and expanding vectors
  • Miss Status Holding Registers, MSHR, and Line Fill Buffers, LFB
  • Prefetching
    • prefetching distance
    • prefetching-optimized techniques:
      • Group Prefetching, GP
      • Software Pipelined Prefetching, SPP
      • Asynchronous Memory Access Chaining, AMAC (a group-prefetching sketch follows after this entry)
  • Interleaved Multi-Vectorizing, IMV
    • Coroutine
    • Splitting states for a vectorized program
    • Residual vectorized states, RVS, with divergent vectorized states, DVS
  • ★★★★★
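
A hedged sketch of group prefetching (GP), one of the prefetching-optimized techniques listed above; the hash-table layout and names are illustrative. Probes are handled in groups: first issue a software prefetch for every bucket in the group, then do the actual lookups, so the cache misses of the whole group overlap instead of being paid one after another.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Bucket { uint64_t key; uint64_t payload; };

// Assumes an open-addressed table whose size is a power of two (toy model, no collision chain).
void probe_group_prefetch(const std::vector<Bucket>& table,
                          const std::vector<uint64_t>& probes,
                          std::vector<uint64_t>& out) {
    const uint64_t mask = table.size() - 1;
    constexpr std::size_t kGroup = 16;                        // group size ~= prefetch distance
    for (std::size_t base = 0; base < probes.size(); base += kGroup) {
        std::size_t n = std::min(kGroup, probes.size() - base);
        std::size_t slot[kGroup];
        for (std::size_t i = 0; i < n; ++i) {                 // stage 1: compute slots, prefetch
            slot[i] = probes[base + i] & mask;
            __builtin_prefetch(&table[slot[i]]);
        }
        for (std::size_t i = 0; i < n; ++i) {                 // stage 2: misses are (mostly) resolved
            const Bucket& b = table[slot[i]];
            if (b.key == probes[base + i]) out.push_back(b.payload);
        }
    }
}
```

AMAC and IMV go further by keeping per-probe state machines (or coroutine/vector states) so that probes needing different numbers of steps do not stall each other.
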
    Self-Tuning Query Scheduling for Analytical Workloads
  • #Execution
  • #Scheduling
  • Context
    • It's difficult to retain competitive query performance when the system is under high load; instead, the system tends to become less responsive, taking longer to provide the desired insights, and query performance tends to become unpredictable
    • Performance should degrade as gracefully as possible
    • Database systems like PostgreSQL transfer scheduling responsibilities to the operating system (OS). They create individual OS threads or processes for every new connection and execute the incoming requests in an isolated fashion
    • Many modern database systems deviate from this classic approach. Their parallelization scheme closely resembles task-based parallelism. Each query is split into a set of independent tasks, which can be executed in parallel by different OS threads
    • Other systems like HyPer and Umbra relieve the OS from almost all scheduling decisions. On startup, they spawn as many OS threads as there are CPU cores. Since these threads do not block when executing tasks, this maximizes performance. The system is not oversubscribed and context switches are kept to a minimum. The task-based parallelization scheme is realized through so-called morsels. A morsel represents a fixed set of tuples in the context of an executable pipeline and is the smallest unit of work during query execution
  • Scalable task scheduling
    • Stride scheduling gives a smaller stride to higher-priority tasks (see the sketch after this entry)
    • Tasks are executed by OS threads. On startup, Umbra creates as many OS threads as there are CPU cores. We also call these threads worker threads. The worker threads are only responsible for executing scheduler tasks. This design minimizes context switches and prevents oversubscription
    • Worker threads should not block
  • 👀/Chap2.3 ★★★★★
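
An illustrative stride-scheduling sketch (not Umbra's actual code): each active query holds a `pass` (virtual time) and a `stride = kStrideConstant / tickets`, where tickets grow with priority. Always picking the smallest pass and advancing it by one stride gives each query a share of task slots proportional to its tickets.

```cpp
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

constexpr uint64_t kStrideConstant = 1 << 20;

struct QueryState {
    uint64_t pass;     // virtual time of this query
    uint64_t stride;   // kStrideConstant / tickets
    int id;
    bool operator>(const QueryState& o) const { return pass > o.pass; }
};

using RunQueue = std::priority_queue<QueryState, std::vector<QueryState>, std::greater<QueryState>>;

// Returns the id of the query that should receive the next task slot.
int pick_next(RunQueue& rq) {
    QueryState q = rq.top();     // smallest pass value runs next
    rq.pop();
    int chosen = q.id;
    q.pass += q.stride;          // higher priority -> smaller stride -> picked more often
    rq.push(q);
    return chosen;
}
```
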
    Parallel Merge Sort
  • #Execution
  • #Sort
  • Sort models: circuits and PRAM (parallel random access machine)
    • CRCW(concurrent read and concurrent write) PRAM
    • CREW(concurrent read and exclusive write) PRAM
    • EREW(exclusive read and exclusive write) PRAM
  • Tree-based merge sort, a CREW algorithm
    • Think about tree-based merge sort: items are merged level by level from the bottom up. Different levels merge different numbers of items, and it takes log(N) levels to merge N items. The key observation is that the merges at the different levels of the tree can be pipelined
    • Runs in O(log(N)) time on N processors
    • Definitions:
      • L(u) denotes the final sorted array of the subtree rooted at node u
      • UP(u) denotes the subset of L(u), which becomes a more accurate approximation of L(u) as the stages progress
      • SUP(u) denotes the subset of UP(u), which is also sorted
      • If 0 < |UP(u)| < |L(u)|, then node u is an inside node
      • If |UP(u)| = |L(u)|, then node u is an external node
      • L is a c-cover of J if each interval induced by an item in L([e, g], where e and g are two adjacent items in L) contains at most c items from J. And, usually |L| < |J|
        • OLDSUP(v) is a 3-cover of SUP(v)
        • As UP(u) = OLDSUP(v) ∪ OLDSUP(w), we can deduce that UP(u) is a 3-cover of SUP(v) and of SUP(w): a sorted superset of a 3-cover is still a 3-cover, because adding items only shrinks the induced intervals
    • For every node u at every stage, we do the following two steps:
      • Formulate SUP(u)
        • For inside node: SUP(u) = every fourth item of UP(u)
        • For external node:
          • First stage after it becomes an external node: SUP(u) = every fourth item of UP(u)
          • Second stage as it becomes an external node: SUP(u) = every second item of UP(u)
          • Third or later stage as it becomes an external node: SUP(u) = UP(u)
      • Formulate NEWUP(u)
        • For inside node: NEWUP(u) = SUP(v) ∪ SUP(w), v and w are child nodes of node u
        • For external node: NEWUP(u) = UP(u)
    • Merge process:
      • Assume UP(u) -> SUP(v), UP(u) -> SUP(w)
      • Step1: compute NEWUP(u). If SUP(v) <--> SUP(w), then we can know rank of any item of SUP(v) or SUP(w) in NEWUP(u) by adding its ranks in SUP(v) and SUP(w)
        • Substep1: For each item in SUP(v) we compute its rank in UP(u), SUP(v) <--> UP(u), SUP(w) <--> UP(u)
        • Substep2: For each item in SUP(v) we compute its rank in SUP(w) by making use of SUP(v) <--> UP(u), SUP(w) <--> UP(u)
      • Step2: compute NEWUP(u) -> NEWSUP(v), NEWUP(u) -> NEWSUP(w)
  • ★★★★★
    On the Nature of Merge: External Merge, Internal Merge, and Parallel Merge
  • #Execution
  • #Sort
  • External merge takes two distinct rooted structures and joins them into one
  • Internal merge takes a subpart of an existing structure as one of two objects
  • Parallel merge combines the properties of both
  • This article is extremely obscure
  • 🤷🏻
    Optimizing parallel bitonic sort
  • #Execution
  • #Sort
  • This article was published in 1997, when CPUs mostly had a single core, so parallel algorithms had to take network communication into consideration, a concern that no longer exists in multi-core parallelism
  • Most of the research on parallel algorithm design in the 70s and 80s has focused on fine-grain models of parallel computation, such as PRAM or network-based models
  • N keys are sorted in log(N) stages; each bitonic sequence of the i-th stage contains 2^i items and can be butterfly-merged in i steps (a reference implementation sketch follows this entry)
  • Naive and improved data layouts
    • Blocked layout: mapping N items onto P processors, with n = N / P. The first log(n) stages can be executed locally. For a subsequent stage log(n) + k, the first k steps require remote communication while the last log(n) steps are completely local
    • Cyclic layout: mapping N items onto P processors by assigning the i-th item to processor i % P, with n = N / P. The first log(n) stages require remote communication. For a subsequent stage log(n) + k, the first k steps are completely local while the last log(n) steps require remote communication
    • Cyclic-blocked layout: periodically remapping the data from a blocked layout to a cyclic layout and vice versa can reduce the communication overhead
  • Optimizing communication
    • Absolute address(log(N) bits long): represents the row number of the node in the bitonic storing network
    • Relative address: the first log(P) bits represent the processor number, and the last log(n) bits represent the local address of the node after the remap
    • Key observation: After the first log(n) stages (which can be entirely executed locally under a blocked layout) the maximum number of successive steps of the bitonic sorting network that can be executed locally, under any data layout, is log(n) (where n = N / P, N is data size, P is number of processors)
    • We can thus reformulate the problem as: Given the tuple (stage, step), which uniquely identifies a column of the bitonic sorting network, how to remap the elements at this point in such a way that the next log(n) steps of the bitonic sorting network are executed locally
    • log(n) successive steps may cross stage
    • The purpose of the first log(n) stages is to form a monotonically increasing or decreasing sequence of n keys on each processor, thus we can replace all these stages with a single, highly optimized local sort
  • ★★★★★
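
For reference, a sequential form of the bitonic sorting network discussed above, for a power-of-two input: the outer loop is the "stage", the middle loop the "step" within a stage, matching the (stage, step) coordinates used in the remapping discussion. Parallel versions assign disjoint index ranges of the inner loop to different processors; variable names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

void bitonic_sort(std::vector<int>& a) {
    const std::size_t n = a.size();                       // assumed to be a power of two
    for (std::size_t k = 2; k <= n; k <<= 1) {            // stage: bitonic sequences of length k
        for (std::size_t j = k >> 1; j > 0; j >>= 1) {    // step: compare-exchange distance j
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {
                    bool ascending = (i & k) == 0;        // direction alternates per k-block
                    if (ascending ? a[i] > a[partner] : a[i] < a[partner])
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```
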
    Parallel Merge Sort with Load Balancing
  • #Execution
  • #Sort
  • Main contribution: find a way to merge two lists, each of which is distributed in multiple processors, rather than store them on a single processor
  • Items are merged level by level from bottom up:
    • A group is a set of processors in charge of one sorted list; the higher the level, the more processors there are
    • Each merge comprises two groups(partner groups), each group will maintain a histogram
    • Before merging, each group exchanges histograms to form a new one that covers both, and each processor then divides the intervals of the merged histogram into 2 * |group| parts so that the lower-indexed processors keep the smaller half and the higher-indexed ones keep the larger half. Each processor then sends out the intervals that belong to other processors for further merging
    • Key observation: for each non-overlapping interval (i.e. among all the processors at most one data set exists), no merge operation is required. For each overlapping interval, a k-way merge or cascaded merge is required to merge the 2 or more data sets
    • Questions:
      • How to form a histogram if there are multiple sort keys?
      • How well is the load balanced when there is data skew?
      • How to make sure that each processor maintains a histogram with the same intervals?
  • ★★★★★
    Parallelizing Fundamental Algorithms such as Sorting on Multi-core Processors for EDA Acceleration
  • #Execution
  • #Sort
  • Granularity vs. Load Balancing. Fine grain has better load balancing while coarse grain has smaller overhead
  • Parallel quick sort
  • Parallel merge sort
  • Parallel map sort
  • ★★★★★
    A Simple, Fast Parallel Implementation of Quicksort and its Performance Evaluation on SUN Enterprise 10000
  • #Execution
  • #Sort
  • Parallel quick sort has four phases:
    • The parallel partition of the data phase
      • The primary unit of operation is a block, which holds a consecutive range of keys
      • The pivot (a key) is selected by one of the processors; assume P0 is the selected processor
      • Each processor independently partitions its local data with the above pivot using a function called neutralize, which takes two blocks, one from the left end and one from the right end, as input; at a high level the process is similar to generic quicksort's partition (see the sketch after this entry). At most one block may remain after partitioning, because neutralize always requires two blocks as input
    • The sequential partition of the data phase
      • P0 works on the remaining blocks of all the processors in a similar way (neutralize)
      • Finally, we get two subarrays straddling the pivot: the left side is smaller than or equal to the pivot, the right side is larger than or equal to the pivot
    • The process partition phase
      • Partition all processors into two groups based on the sizes of the subarrays
      • Repeat the process from phase one to phase three, until every group contains only one processor
    • The sequential sorting in parallel with helping phase
  • ★★★★★
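
A hedged sketch of the neutralize step described above (the exact return convention differs from the paper): given one block from the left end and one from the right end of the array and a pivot, swap misplaced elements until at least one block is fully "neutralized", i.e. the left block contains only keys <= pivot or the right block only keys >= pivot.

```cpp
#include <cstddef>
#include <utility>

enum class Neutralized { Left, Right, Both };

Neutralized neutralize(int* left, int* right, std::size_t blockSize, int pivot) {
    std::size_t i = 0, j = 0;
    while (i < blockSize && j < blockSize) {
        while (i < blockSize && left[i] <= pivot) ++i;       // already on the correct side
        while (j < blockSize && right[j] >= pivot) ++j;
        if (i == blockSize || j == blockSize) break;
        std::swap(left[i], right[j]);                        // both misplaced: exchange them
        ++i; ++j;
    }
    if (i == blockSize && j == blockSize) return Neutralized::Both;
    return i == blockSize ? Neutralized::Left : Neutralized::Right;
}
```
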
    Parallelization of Modified Merge Sort Algorithm
  • #Execution
  • #Sort
  • Good summary of previous work
  • 👀
    Distributed Top-k Query Processing by Exploiting Skyline Summaries
  • #Execution
  • #Sort
  • Optimize TopN when N is large
  • ★★★
    [Efficient-External-Sorting-in-DuckDB]
  • #Execution
  • #Sort
  • The cost of sorting is dominated by comparing values and moving data around
  • Two main ways of implementing merge sort: K-way merge and cascade merge
  • 👀 ★★★
    Merge Path - A Visually Intuitive Approach to Parallel Merging
  • #Execution
  • #Sort
  • Key idea: use diagonals to cut the merge path (see the split-point sketch after this entry)
  • Cache Efficiency
    • Three types of cache miss, compulsory, capacity, contention
    • Three types of associativity, full associative, direct-mapped, group-mapped
  • Cache-efficient parallel merge
    • Key idea: ensure that all elements that may be active at any given time can co-reside in cache
    • Multi-processors process cache-size data at one iteration
  • ★★★★★
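
A sketch of the diagonal cut: the k-th "diagonal" of the merge matrix is intersected with the merge path by a binary search, giving each thread an independent (aStart, bStart) pair so the parallel merge needs no communication. Names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Split { std::size_t a, b; };

// Find the point where the merge path crosses diagonal 'diag' (0 <= diag <= |A| + |B|).
Split merge_path_split(const std::vector<int>& A, const std::vector<int>& B, std::size_t diag) {
    std::size_t lo = diag > B.size() ? diag - B.size() : 0;   // smallest feasible index into A
    std::size_t hi = std::min(diag, A.size());                // largest feasible index into A
    while (lo < hi) {
        std::size_t a = lo + (hi - lo) / 2;
        std::size_t b = diag - a;
        // If A[a] < B[b-1], the cut takes too few elements from A; move it down the diagonal.
        if (b > 0 && a < A.size() && A[a] < B[b - 1]) lo = a + 1;
        else hi = a;
    }
    return {lo, diag - lo};
}
```

With T threads, thread t computes the split for diagonal t * (|A| + |B|) / T and merges sequentially up to the next thread's split, so no thread ever touches another thread's output range.
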
    Efficient Parallel Merge Sort for Fixed and Variable Length Keys
  • #Execution
  • #Sort
  • Sort algorithm classification:
    • Radix sort, which relies on a binary representation of the sort key
    • Comparison sort, which allows a user-specified comparison function; includes quicksort and merge sort
  • Three stages merge sort with GPU
    • Block sort, which can be further divided into two stages. First, each thread loads eight elements in registers and sorts them using bitonic sort. Second, merge these 8-element segments together
    • Merge sort-simple, two moving windows, one in register and the other in memory
    • Merge sort-multiple, allow CUDA blocks to cooperate in merging two sequences
  • ★★★
    Optimization of Analytic Window Functions
  • #Execution
  • #Window Function
  • Full Sort, Hashed Sort, Segmented Sort
  • Segment Relation
  • Reorderable
  • SS-reorderable: only reorder at the segment level to match the expected order property, without degenerating to a full sort
  • Cover Set-based Evaluation
  • ★★★★★
    Efficient Processing of Window Functions in Analytical SQL Queries
  • #Execution
  • #Window Function
  • Basic Concepts:
    • Partitioning
    • Ordering
    • Framing
    • Window Expression:
      • ranking, rank/dense_rank/row_number/ntile
      • distribution, percent_rank/cume_dist
      • navigation in partition, lead/lag
      • distinct aggregates, min/max/sum
      • navigation in frame, first_expr/last_expr
  • Pre-Partitioning into Hash Groups
  • Aggregation Algorithms:
    • Naive Aggregation
    • Cumulative Aggregation
    • Removable Cumulative Aggregation
    • Segment Tree Aggregation (see the sketch after this entry)
  • ★★★★★
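
A minimal sketch of segment tree aggregation, the last of the four algorithms above: build a binary tree of partial aggregates over one partition, then answer any frame [lo, hi) in O(log n). The SUM aggregate and names are illustrative; the same structure works for MIN/MAX and similar functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct SegmentTree {
    std::size_t n;
    std::vector<int64_t> node;              // node[1] is the root, leaves live at [n, 2n)

    explicit SegmentTree(const std::vector<int64_t>& values)
        : n(values.size()), node(2 * values.size(), 0) {
        for (std::size_t i = 0; i < n; ++i) node[n + i] = values[i];
        for (std::size_t i = n; i > 1; --i)                  // build internal nodes bottom-up
            node[i - 1] = node[2 * (i - 1)] + node[2 * (i - 1) + 1];
    }

    // Aggregate over the frame [lo, hi) of the partition.
    int64_t sum(std::size_t lo, std::size_t hi) const {
        int64_t acc = 0;
        for (lo += n, hi += n; lo < hi; lo /= 2, hi /= 2) {
            if (lo & 1) acc += node[lo++];
            if (hi & 1) acc += node[--hi];
        }
        return acc;
    }
};
```
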
    Analytic Functions in Oracle 8i
  • #Execution
  • #Window Function
  • Minimization of number of sorts
  • Predicate Pushdown
  • ★★★★★
    Incremental Computation of Common Windowed Holistic Aggregates
  • #Execution
  • #Window Function
  • Function classification, including tuple-functions, aggregate-functions, window functions
  • Aggregate-functions can be subdivided into distributive aggregates, algebraic aggregates, holistic aggregates
  • 👀
    An Overview of Query Optimization in Relational Systems
  • #Optimizer
  • #Overview
  • query optimization can be viewed as a difficult search problem:
    • A space of plans
    • A cost estimation technique
    • An enumeration algorithm
  • System R's enumeration algorithm comprises two important parts:
    • Dynamic programming to produce the optimal plan
    • Interesting orders, the idea of which is later generalized to physical properties
  • Search space
    • Join sequences, including outerjoin and join, while outerjoin has more limitations
    • SPJ with group-by, group-by push down
    • Reducing multi-block queries to single-block:
      • Merging views, {Q=R⋈V | V=S⋈T} -> {Q=R⋈S⋈T}, which may then be freely reordered
      • Merging nested subqueries, the generic way to process the correlated subqueries
      • Using semijoin-like techniques for optimizing multi-block queries
  • Statistics and cost estimation
    • Statistical summary is a logical property but the cost of a plan is a physical property
    • Histogram
      • Equi-depth (equi-height) histograms are effective for both high- and low-skew data (see the sketch after this entry)
      • Histograms work on a single column and do not provide information about correlations among columns. One option is to consider 2-dimensional histograms, which cost much more space
    • Sampling
      • Sampling is estimation of statistics, the key challenge is to limit the error in estimation
      • The task of estimating distinct values is provably error prone
    • Propagation of statistical information through operators
    • Cost computation, CPU/Memory/IO/Parallelism
  • Enumeration architectures
    • Starburst
      • Query Graph Model(QGM)
      • query rewrite phase, without cost info
      • plan optimization phase, with estimated cost and physical properties, properties are propagated as plans are built bottom-up
    • Volcano/Cascades
      • Including two kinds of rules, transformation rules and implementation rules. There is no clear boundary between the two kinds of rules
      • Logical properties, physical properties and cost are used during enumeration
      • Use dynamic programming in a top-down way, with memoization
      • Goal-driven
  • Beyond the fundamentals
    • Distributed and parallel databases, replication for physical distribution and parallelism for scale-up
    • User-defined functions, UDF, which complicate cost estimation
    • Materialized Views
    • Defer generation of complete plans subject to availability of runtime information
  • ★★★★★
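
A small sketch of building an equi-depth (equi-height) histogram, mentioned above: sort a sample of the column and take every (n / buckets)-th value as a bucket boundary, so each bucket covers roughly the same number of rows. This assumes a non-empty sample with at least as many values as buckets; names are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the upper bound of each bucket.
std::vector<int> equi_depth_boundaries(std::vector<int> sample, std::size_t buckets) {
    std::sort(sample.begin(), sample.end());
    std::vector<int> bounds;
    for (std::size_t b = 1; b <= buckets; ++b) {
        std::size_t idx = std::min(sample.size() - 1,
                                   std::max<std::size_t>(1, b * sample.size() / buckets) - 1);
        bounds.push_back(sample[idx]);
    }
    return bounds;
}
// Selectivity of "col <= v" is then estimated as (#boundaries <= v) / buckets, refined by
// interpolating inside the bucket that contains v.
```
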
    The Cascades Framework for Query Optimization
  • #Optimizer
  • #Framework
  • Framework Concepts and Components
  • Sketchily
  • ★★★
    Efficiency in the Columbia Database Query Optimizer
  • #Optimizer
  • #Framework
  • Framework Concepts and Components
    • Logical/Physical operators
    • Expression/ExpressionGroup
    • Search space
    • Rules
    • Tasks
  • Optimize tasks
    • Group optimization task is for finding the cheapest plan in this group
    • Group exploration task is for expanding the search space
    • Expression optimization task
    • Rule application task
    • Input optimization task is for property enforcing, cost calculating and pruning
  • ★★★★★
    How Good Are Query Optimizers, Really?
  • #Optimizer
  • Cardinality Estimation is more important than Cost Model
  • ★★★★★
    Orca: A Modular Query Optimizer Architecture for Big Data
  • #Optimizer
  • Orca Architecture
    • Ability to run outside the database system as a stand-alone optimizer through Data eXchange Language(DXL)
    • Memo
    • Search and Job Scheduler, including three main steps: exploration, implementation, optimization
    • Transformations
    • Property Enforcement, including logical property(output columns) and physical property(sort order, distribution)
    • Metadata Cache
    • GPOS
  • Optimization workflow
    • Exploration, possibly adding new group expressions and new groups to the memo, e.g. via the Join Commutativity Rule
    • Statistics Derivation, used to derive estimates for cardinality and data skew
    • Implementation, shifting from logical to physical
    • Optimization, properties enforcement and cost computation
  • Parallel query optimization, task dependency
  • ★★★★★
    Orthogonal Optimization of Subqueries and Aggregation
  • #Optimizer
  • #Subquery
  • Scalar aggregate means aggregation without group-by columns, which always returns one row
  • Subquery classification:
    • boolean-valued subquery, including exist/in/quantified comparisons
    • scalar subquery, which need Max1row operator
  • Correlated subquery have three execution strategies:
    • correlated execution
    • outerjoin then aggregate
    • aggregate then join
  • Algebraic representation
  • Remove Correlations, which means the recursive calls between scalar and relational execution are removed, and typically results in outerjoins
  • Pushing down Apply
  • For (not) exist/in subquery, semijoin for exists, antisemijoin for not exist
  • Subquery classes based on different processing strategies:
    • Class 1. Subqueries that can be removed with no additional common subexpressions
    • Class 2. Subqueries that are removed by introducing additional common subexpressions
    • Class 3. Exception subqueries
  • GroupBy reorder conditions:
    • case groupBy with filter
    • case groupBy with join/outerjoin
  • Local Aggregate, Local GroupBy
  • SegmentApply, dividing the table into independent parts and performing the Apply operator on each part independently
  • ★★★★★
    Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers
  • #Optimizer
  • #Subquery
  • Complex Query Decorrelation
  • #Optimizer
  • #Subquery
  • The concept of correlation in SQL is similar to the use of non-local variables in block-structured programming languages
  • Set-oriented
  • Count bug, every missing groupBy key is expected to have a zero output
  • Magic Decorrelation:
    • FEED Stage, during which feeding the correlation to its children
    • ABSORB Stage, during which absorbing the correlation and resulting in a decorrelated query
      • non-SPJ Box, SPJ is the abbreviation of Select-Project-Join
      • SPJ Box
  • Experiment Comparisons:
    • Nested Iteration, NI
    • Kim's method
    • Dayal's method
    • Magic Decorrelation
  • ★★★★★
    Enhanced-Subquery-Optimizations-in-Oracle
  • #Optimizer
  • #Subquery
  • Subquery coalesce
    • Of same type, e.g. two exist subqueries
    • Of different type, e.g. exist and not exist subqueries
  • Execution enhancements:
    • Cost-based parallel group-by pushdown, GPD. Short circuit by HAVING clause
  • Subquery removal using window functions:
    • Correlated subsumed subquery, TPCH-Q2/Q17
    • Uncorrelated subsumed subquery, TPCH-Q15
    • Subsumed subquery in HAVING clause, TPCH-Q11
    • Subquery Producing a Multi-Set, MAX(MAX), MIN(MIN)
  • Scalable parallel execution:
    • Grand-total window function optimization, which can be extended to low cardinality partition-by keys
  • Null-aware anti join, NAAJ
  • ★★★★★
    WinMagic : Subquery Elimination Using Window Aggregation
  • #Optimizer
  • #Subquery
  • Key idea: for every distinct value of the correlated column, if the row sets of the subquery and the outer block are exactly the same, then the aggregate of the subquery can be replaced with a corresponding window function in the outer block
  • Conditions that are required to meet
    • Scalar correlated aggregation subquery with only equal predicate
    • Aggregate function has a corresponding version of window function
    • The aggregate function does NOT contain DISTINCT
    • Tables of the subquery and the outer block are exactly the same, taking the correlated outer table into account for the subquery
    • Predicates of the subquery and the outer block are exactly the same, except predicates that relate only to the correlated outer table
  • TPCH-Q2, TPCH-Q17
  • ★★★★★
    Outerjoin Simplification and Reordering for Query Optimization
  • #Optimizer
  • #Join
  • Outerjoin workloads:
    • Database merging, two-sided outerjoin
    • Hierarchical views
    • Nested queries
    • Universal quantifiers
  • Outerjoin optimizations:
    • Outerjoin simplification: LOJ can be rewritten as a regular join if a later operator discards the null-padded tuples. This can be extended to ROJ and FOJ
    • Join/Outerjoin associativity
    • Generalized outerjoin
  • 🤷🏻 ★★★
    Including Group-By in Query Optimization
  • #Optimizer
  • #Join
  • #Aggregate
  • A single group-by (with multiple group-by columns) is replaced by multiple group-bys performed in stages, interleaved with joins
  • Too few pictures to understand
  • 🤷🏻/2 ★★
    Groupwise Processing of Relational Queries
  • #Optimizer
  • #Join
  • #Aggregate
  • Optimization of Common Table Expressions in MPP Database Systems
  • #Optimizer
  • #CTE
  • The purpose of CTEs is to avoid re-execution of expressions referenced more than once within a query
  • CTEs achieve two goals, making query more readable, making execution more efficient
  • CTEs follow a producer/consumer model
  • Challenges
    • Deadlock hazard, operator's execution order conflicts with the CTE's execution order
    • Enumerating inlining alternatives, determining which ones need to be inlined (for example, inlined to utilize an index) and which ones don't
    • Contextualized Optimization
  • CTE representation
    • CTEProducer
    • CTEConsumer
    • CTEAnchor
    • Sequence
    • For case of CTE inline, CTEAnchor is removed and CTEConsumer is replaced with the whole CTE definition
    • For case of CTE no-inline, CTEAnchor is replaced by Sequence operator which has the CTEProducer as its left child and the original child of the CTEAnchor as its second child
  • Plan enumeration
    • Transformation rules, CTEAnchor/Sequence/NoOp in one group, CTEConsumer/CopyOfCTEDefinition in one group
    • Avoid invalid plans, such as CTEProducer without CTEConsumer, Sequence without CTEConsumer, NoOp with CTEConsumer(inlined)
    • Optimization, predicate push down, always inline single-use CTEs, elimination of unused CTEs
  • Contextualized optimization
    • In ORCA, CTEs are considered later after regular optimization is done
    • Enforcing physical properties, pushing the CTEConsumer's distribution requirement to the CTEProducer
    • Cost Estimation
  • CTE based optimizations
    • CTE-Generating transformations, such as two count(distinct)
    • Common subexpression elimination
  • ★★★★★
    Are We Ready For Learned Cardinality Estimation?
  • #Optimizer
  • #Cardinality Estimator
  • Cardinality Estimator
  • Cost Model
  • ★★★★★
    NeuroCard: One Cardinality Estimator for All Tables
  • #Optimizer
  • #Cardinality Estimator
  • Sampling-Based Estimation of the Number of Distinct Values of an Attribute
  • #Optimizer
  • #Sampling
  • Introduce Many Estimators
  • Data Skewness
  • ★★★★★
    Towards Estimation Error Guarantees for Distinct Values
  • #Optimizer
  • #Sampling
  • SQL Memory Management in Oracle 9i
  • #Memory
  • Ways of memory management:
    • Static configuration
    • Based on estimation
    • Taking into account the used memory
    • Based on demand
  • Oracle Memory Architecture:
    • System Global Area, SGA, is shared by all the server processes
    • Process Global Area, PGA, is independently held by each server process
  • Automatic PGA Memory Management
    • Feedback
    • The memory bound is published by a background daemon and indirectly determines the size of each active work area
  • 👀/4
    Robust and efficient memory management in Apache AsterixDB
  • #Memory
  • AsterixDB has three main memory sections, including one for in-memory components, a buffer cache, and working memory
  • Each operator has three memory sections, including input buffer, execution memory, output buffer
  • The budget for a memory-intensive operator is determined by operator-specific parameters (e.g., 32 MB) and the system converts the budget into a number of pages (M) using the system's page size parameter. This means the memory-intensive operator can request at most M pages from the working memory and uses these pages as its execution memory
  • Sort operator
    • Since comparing normalized binary-representations of field values is more efficient than comparing the original values, the concept of normalized key has been widely used
    • Also, rather than moving an actual record during a sort, an array of record pointers is normally used in most implementations to deal with variable-length records (see the sketch after this entry)
    • An external sort consists of two phases, build and merge. In the build phase, the operator gradually allocates more pages to hold incoming records until the budget is exhausted, then sorts the data and writes it to a temporary run file on disk, and then continues to process the follow-up incoming data. In the merge phase, the operator recursively multiway-merges run files and generates the final results
    • The sort operator also uses an additional data structure called the record pointer array that contains the location and the normalized key of each record in memory. This array is needed to avoid performing an in-place swap between two records during in-memory sorting since each record's length is generally different because of variable-length fields. Also, this array can improve the performance since comparing two normalized keys in the record pointer array using binary string comparison can be much faster than comparing the actual values between two records, which requires accessing the actual records in pages
  • Group-by operator
    • The group-by operator uses both a data partition table and a hash table. The data partition table holds the aggregate records. The hash table holds the locations of the aggregate records in the data partition table (it can be seen as an index)
  • Hash join operator
    • In AsterixDB, the hash join and the hash group-by operators use very similar table structures
  • Inverted-index search
    • In AsterixDB, an inverted index is used to perform keyword search
  • 👀/Chap6 ★★★★★
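
A sketch of the "record pointer array + normalized key" idea described above; the field names and the use of the offset as a record id are assumptions for this toy example. Instead of moving variable-length records, an array of small entries is sorted, each holding a fixed-size normalized key prefix and the record's location; most comparisons are resolved on the prefix with a cheap integer compare, and only ties touch the actual records.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct PtrEntry {
    uint32_t normKey;    // big-endian prefix of the sort key, so integer order == byte order
    uint32_t pageId;     // record location
    uint32_t offset;     // (ab)used as a record id below
};

uint32_t normalize_prefix(const std::string& key) {
    unsigned char buf[4] = {0, 0, 0, 0};
    std::memcpy(buf, key.data(), std::min<std::size_t>(4, key.size()));
    return (uint32_t(buf[0]) << 24) | (uint32_t(buf[1]) << 16) |
           (uint32_t(buf[2]) << 8)  |  uint32_t(buf[3]);
}

void sort_pointer_array(std::vector<PtrEntry>& entries,
                        const std::vector<std::string>& records /* full keys, by record id */) {
    std::sort(entries.begin(), entries.end(), [&](const PtrEntry& a, const PtrEntry& b) {
        if (a.normKey != b.normKey) return a.normKey < b.normKey;   // fast path: prefix decides
        return records[a.offset] < records[b.offset];               // tie: compare full keys
    });
}
```
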
    An Overview of Data Warehousing and OLAP Technology
  • #Overview
  • #Warehousing
  • Maximizing transaction throughput is the key performance metric for OLTP
  • Query throughput and response times of ad hoc/complex queries are the key performance metrics for OLAP
  • Features
    • Multiple dimensions
    • Multiple sources might contain data of varying quality, or use inconsistent representations, codes and formats
    • Rollup, increasing the level of aggregation
    • Drill-down, decreasing the level of aggregation or increasing detail
  • Extracting/Cleaning/Transforming/Loading
  • Refresh through replication techniques, including data shipping and transaction shipping
  • Database design methodology
    • Star schema, consists of a single fact table and a single table for each dimension
    • Snowflake schema, based on the star schema, where the dimensional hierarchy is explicitly represented by normalizing the dimension tables
  • Warehouse server
    • Index structures
    • Materialized views
    • Transformation of complex queries
    • Parallel processing
  • Metadata and warehouse management
  • ★★★★★
    Optimizing Queries Using Materialized Views: A Practical, Scalable Solution
  • #Materialized View
  • Three issues:
    • View design, about how to store and index
    • View maintenance, about how to update
    • View exploitation, about when to use
  • An indexable view must be defined by the following conditions:
    • A single-level SQL statement containing selections, (inner) joins, and an optional group-by
    • The from clause cannot contain derived tables, i.e. must reference base tables, and subqueries are not allowed
    • The output of an aggregation view must include all grouping columns as output columns and a count column
    • Aggregation functions are limited to sum and count
  • View matching is a transformation rule that is invoked on select-project-join-group-by expression, SPJG
  • For a SPJ query expression to be computable from a view, the view must satisfy the following requirement:
    • The view contains all rows needed by the query expression, i.e. checking compatibility of equality/range/residual predicates
    • All required rows can be selected from the view, i.e. whether columns of compensating equality/range/residual predicates exist
    • All output expressions can be computed from the output of the view, i.e. whether columns of output expressions exist
    • All output rows occur with the correct duplication factor
  • Views with extra table
    • Cardinality-preserving Join, a join between tables T and S is cardinality preserving if every row in T joins with exactly one row in S
  • Aggregation queries and views, which can be treated as an SPJ query followed by a group-by operation. The following requirements should be satisfied
    • The SPJ part satisfies the SPJ requirements above
    • All columns required by compensating predicates (if any) are available in the view output
    • The view contains no aggregation or is less aggregated than the query
    • All columns required to perform further grouping (if necessary) are available in the view output
    • All columns required to compute output expressions are available in the view output
  • Fast filtering of views:
    • Filter tree
    • Lattice index
    • Filter conditions including source tables, hubs, output columns, output expressions, range constraints, residual constraints, grouping columns, grouping expressions
  • ★★★★★

    2 DSPS

    Data Stream Processing System, DSPS

    Title Category Summary Status Rating
    A Survey of Distributed Data Stream Processing Frameworks
  • #Survey
  • Batch Processing
    • Latency, cannot process until all data is loaded
  • Architecture of Data Stream Processing System(DSPS)
    • Data stream ingestion layer, responsible for accepting streams of data into the DSPS
      • Scalable
      • Resilient
      • Fault-tolerant
    • Data stream processing layer, i.e. the data stream processing engine (DSPE), which pre-processes and analyses data in one or more steps
      • Data stream management engines, DSME; the following requirements should be met
        • Process continuous data on-the-fly without any requirement to store them
        • Support high-level languages such as SQL
        • Handle imperfections such as delayed, missing and out-of-order data
        • Guarantee predictable and repeatable outcomes
        • Efficiently store, access, modify, and combine (with live streaming data) state information
        • Ensure that the integrity of the data is maintained at all times and relevant applications are up and available despite failures
        • Automatically and transparently distribute the data processing load across multiple processors and machines
        • Respond to high-volume data processing applications in real-time using a highly optimized execution path
      • Complex event processing engines, CEPE
      • General-purpose DSPEs, GDSPE
    • Storage layer, which stores, indexes and manages the data and the generated knowledge
      • Organized
      • Indexed
      • Metadata
    • Resource management layer, which manages and coordinates the functions of distributed compute and storage resources
      • Resource allocation
      • Resource scheduling
      • Server as a coordinator
    • Output layer, which directs the output data stream and knowledge to other systems or visualization tools
  • Key features of DSPEs
    • Programming models, given the unbounded nature of streaming data
      • A window is usually defined by either a time duration or a record count. There are many window types: Fixed/Sliding/Session Window
      • Stateless transformations: Map, Filter, FlatMap
      • Stateful transformations: Aggregation, Group-and-aggregate, Join, Sort
    • Data source interaction model
      • Push model where a daemon process of a data stream engine keeps listening to an input channel
      • Pull model, A challenge here is that the frequency of pulling and the speed of processing the data by the DSPEs should match the rate of data generation at the source to avoid data loss
    • Data partitioning strategy
      • Horizontal method divides data into disjoint sets of rows
        • Round-robin
        • Range, which is the most popular approach, especially when there is periodic loading of new data
        • Hash
      • The vertical method divides data into vertical and disjoint sets of columns and can be categorized further into cost-based and procedural approaches
    • State management
      • Operators can be stateless or stateful
      • In traditional DSPEs, the state information was stored in a centralized database management system to be shared among applications
      • The state management facilities in various DSPEs naturally fall along a complexity continuum from naive in-memory-only choice to a persistent state that can be queried and replicated
    • Message processing guarantee, such as at-most-once, at-least-once, exactly once
    • Fault tolerance and recovery
      • Passive, such as checkpoint, upstream buffer, source replay
      • Active, such as replicas
    • Deployment, such as local, cluster, cloud
    • Community support, such as forums, meetups and others
    • Support for high level languages, such as java, scala, python, r, sql
    • Support for advanced input sources, such as local file systems, socket connections, databases, queuing tools
    • Support for storage systems
    • Support for analytics
  • Popular products:
    • Flink
    • Samza
    • Apex
    • Storm
    • Spark Streaming
    • StreamBase
    • IBM Streams
    • Kafka Streams
    • Google Dataflow
    • Beam
  • ★★★★★
    A Survey of State Management in Big Data Processing Systems
  • #Survey
  • First proposals for parallel batch-oriented data processing, BDP, was MapReduce
    • Advantages: flexibility, fault-tolerance, programming ease, scalability
    • Disadvantages: a low-level programming model, a lack of support for iterations, inability to deal with data streams
  • State: a sequence of values in time that contain the intermediate results of a desired computation
  • Concepts of State Management
    • Operations
      • Purge
      • Update
      • Store
        • On disk
        • In-memory
      • Migrate
      • Expose
    • Incremental Maintenance
      • Transformation
      • Differential Computation
      • View Maintenance
    • State Sharing
      • Shared Operator State
      • Shared Window
      • Shared Queue
    • Load Balancing & Elasticity
      • Scalability
      • Load Balancing
      • Elasticity
    • Performance
      • Optimal Placement
      • Optimal Assignment
      • Checkpoint Interval
  • Applications of State
    • Stateful Computation
      • Operator Level
      • Application Level
    • Iterative Processing
      • Incremental Iteration
      • Bulk Iteration
    • Fault Tolerance
      • Correlated Checkpoint
      • Incremental Checkpoint
      • Independent Checkpoint
  • State in different views
    • System View: Configuration State, Computation State
    • Application View: Query State, Program State
    • Programming View: Window State, Variable State
    • Operator View: Processing State, Routing State, Buffer State
  • 👀/3.1.2p ★★★★★
    A Survey on the Evolution of Stream Processing Systems
  • #Survey
    3 Serverless

    Title Category Summary Status Rating
    Cloud Programming Simplified: A Berkeley View on Serverless Computing
  • #Survey
  • ★★★★★
    Amazon Redshift Re-invented
  • #Execution
  • #Amazon
  • Architecture
  • MPP
  • Code Generation & Compilation Service
  • Prefetching
  • AZ64 Encoding
  • AQUA & Computational Storage, doing simple computation at the storage layer
  • Automatic Table Optimization(for a given workloads)
  • Supports the SUPER value type: typeless, can hold anything (int, double, array, JSON, etc.)
  • ★★★★★

    4 Compile Tech

    Title Category Summary Status Rating
    Adaptive LL(*) parsing: the power of dynamic analysis
  • #LL
    5 Cpp

    Title Category Summary Status Rating
    Shared Memory Consistency Models: A Tutorial
  • #Memory Model
  • Effectively, the consistency model places restrictions on the values that can be returned by a read in a shared-memory program execution
  • Memory consistency model will affect programmability, performance, and portability at several different levels
  • Sequential consistency
    • A multiprocessor system is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program
    • Seems like each processor issues memory operations in program order and the switch provides the global serialization among all memory operations
  • Implementing sequential consistency
    • Architectures without caches
      • Write buffers with bypassing capability (hardware optimization), which are used in uniprocessors to effectively hide the latency of write operations; this hardware optimization may violate sequential consistency
      • Overlapping write operations (hardware optimization), which allows multiple write operations issued by the same processor to be simultaneously serviced by different memory modules (and may let write operations be reordered). This kind of hardware optimization may violate sequential consistency
      • Non-blocking read operations(hardware optimization)
    • Architectures with caches
      • Cache coherence and sequential consistency
        • Several definitions for cache coherence (also referred to as cache consistency) exist in the literature. The strongest definitions treat the term virtually as a synonym for sequential consistency
        • Specifically, one set of conditions commonly associated with a cache coherence protocol are
          • A write is eventually made visible to all processors
          • Writes to the same location appear to be seen in the same order by all processors (also referred to as serialization of writes to the same location)
      • Detecting the completion of write operations
        • Maintaining the program order from a write to a following operation typically requires an acknowledgement response to signal the completion of the write
        • This problem can be avoided by letting the following operation wait until the current write operation finishes its invalidation or update processing
      • Maintaining the illusion of atomicity for writes
        • While sequential consistency requires memory operations to appear atomic or instantaneous, propagating changes to multiple cache copies is inherently a non-atomic operation
        • Sequential consistency can easily be violated if two write operations reach other processors in different orders (when delivering update messages from the writing processor to others, the timing may differ due to different paths). The violation can be avoided by imposing the condition that writes to the same location be serialized; i.e., all processors see writes to the same location in the same order. Such serialization can be achieved if all updates or invalidates for a given location originate from a single point (e.g., the directory) and the ordering of these messages between a given source and destination is preserved by the network. An alternative is to delay an update or invalidate from being sent out until any updates or invalidates that have been issued on behalf of a previous write to the same location are acknowledged
        • Another problem arises: one processor may see the new value of write A, but not that of write B. One possible restriction that prevents such a violation is to prohibit a read from returning a newly written value until all cached copies have acknowledged the receipt of the invalidation or update messages generated by the write
        • Update-based protocols are more challenging because unlike invalidations, updates directly supply new values to other processors. One solution is to employ a two phase update scheme
          • The first phase involves sending updates to the processor caches and receiving acknowledgements for these updates. In this phase, no processor is allowed to read the value of the updated location
          • In the second phase, a confirmation message is sent to the updated processor caches to confirm the receipt of all acknowledgements. A processor can use the updated value from its cache once it receives the confirmation message from the second phase
    • Compilers
      • The interaction of the program order aspect of sequential consistency with the compiler is analogous to that with the hardware
      • Compiler optimizations can also violate sequential consistency. For example, keeping a value in a register (e.g. across loop iterations) can prevent the processor from ever observing a newly written value in shared memory
      • The compiler needs information to apply all the optimizations properly without unexpectedly violating sequential consistency
    • Summary for sequential consistency
      • From the above discussion, it is clear that sequential consistency constrains many common hardware and compiler optimizations. Straightforward hardware implementations of sequential consistency typically need to satisfy the following two requirements
        • First, a processor must ensure that its previous memory operation is complete before proceeding with its next memory operation in program order. We call this requirement the program order requirement. Determining the completion of a write typically requires an explicit acknowledgement message from memory. Additionally, in a cache-based system, a write must generate invalidate or update messages for all cached copies, and the write can be considered complete only when the generated invalidates and updates are acknowledged by the target caches
        • The second requirement pertains only to cache-based systems and concerns write atomicity. It requires that writes to the same location be serialized (i.e., writes to the same location be made visible in the same order to all processors) and that the value of a write not be returned by a read until all invalidates or updates generated by the write are acknowledged (i.e., until the write becomes visible to all processors). We call this the write atomicity requirement
  • Relaxed memory models
    • Characterizing different memory consistency models
      • We categorize relaxed memory consistency models based on two key characteristics:
        • how they relax the program order requirement, we distinguish models based on whether they relax the order from read-read, read-write, write-read, write-write. In all cases, relaxation only applies to operation pairs with different addresses
        • how they relax the write atomicity requirement, we distinguish models based on whether they allow a read to return the value of another processor’s write before all cached copies of the accessed location receive the invalidation or update messages generated by the write; i.e., before the write is made visible to all other processors
      • Relaxations:
        • Relax Write to Read program order
        • Relax Write to Write program order
        • Relax Read to Read and Read to Write program orders
        • Read others’ write early
        • Read own write early
        • Relaxed models (Figure 8): SC; IBM 370; TSO (Total Store Ordering); PC (Processor Consistency); PSO (Partial Store Ordering); WO (Weak Ordering); RCsc and RCpc (Release Consistency); Alpha; RMO (Relaxed Memory Order); PowerPC
    • Relaxing the write to read program order
      • Relaxing the program order constraint for a write followed by a read to a different location; for the same location, reordering is not allowed (a litmus-test sketch follows at the end of this entry)
      • Models, including IBM 370, TSO, PC, offer this kind of relaxation. More specifically, allowing a read to be reordered with respect to previous writes from the same processor. But there exists some differences between the above three models when it comes to same location:
        • IBM 370 model prohibits a read from returning the value of a write to the same location before the write is made visible to all processors
        • TSO model allows a read to return the value of its own processor’s write even before the write is serialized with respect to other writes to the same location
        • The PC model allows a read to return the value of any write before the write is serialized or made visible to other processors
    • Relaxing the write to read and write to write program orders
      • This kind further relaxes the program order requirement by eliminating ordering constraints between writes to different locations
      • PSO offers this kind of relaxation
    • Relaxing all program orders
      • Further, a read or write operation may be reordered with respect to a following read or write to a different location
      • Models, including WO, RCsc/RCpc, Alpha, RMO, PowerPC, offer this kind of relaxation
      • In hardware, this flexibility provides the possibility of hiding the latency of read operations by implementing true non-blocking reads in the context of either static (in-order) or dynamic (out-of-order) scheduling processors
  • An alternate abstraction for relaxed memory models
    • The higher performance is accompanied by a higher level of complexity for programmers. Furthermore, the wide range of models supported by different systems requires programmers to deal with various semantics that differ in subtle ways and complicates the task of porting programs across these systems
    • Instead of exposing performance-enhancing optimizations directly to the programmer as is done by a systemcentric specification, a programmer-centric specification requires the programmer to provide certain information about the program. This information is then used by the system to determine whether a certain optimization can be applied without violating the correctness of the program
    • An operation must be defined as a synchronization operation if it forms a race with another operation in any sequentially consistent execution; other operations can be defined as data. An operation may be conservatively distinguished as a synchronization operation if the programmer is not sure whether the particular operation is involved in a race or not
    • Conveying information at the programming language level (through grammar, paradigm or library). The information conveyed at the programming language level must ultimately be provided to the underlying hardware. Therefore, the compiler is often responsible for appropriately translating the higher-level information into a form that is supported by the hardware
  • ★★★★★
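
A standard store-buffer litmus test for the "write to read" relaxation discussed above (this is a textbook example, not code from the paper). Under sequential consistency at least one of r1, r2 must be 1; with relaxed ordering (as on TSO hardware with plain stores and loads), both may be 0, because each read may effectively bypass the other thread's buffered write.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void t1() {
    x.store(1, std::memory_order_relaxed);    // W x
    r1 = y.load(std::memory_order_relaxed);   // R y -- may effectively move before the write
}

void t2() {
    y.store(1, std::memory_order_relaxed);    // W y
    r2 = x.load(std::memory_order_relaxed);   // R x
}

int main() {
    std::thread a(t1), b(t2);
    a.join(); b.join();
    // Possible outcomes: (1,1), (1,0), (0,1) and, under relaxed ordering, (0,0).
    // Switching the stores and loads to std::memory_order_seq_cst rules out (0,0).
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```
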
    What Every Programmer Should Know About Memory
  • #Memory Model
  • Commodity hardware
    • All CPUs are connected via a common bus (the Front Side Bus, FSB) to the Northbridge
    • The Northbridge contains, among other things, the memory controller, and its implementation determines the type of RAM chips used for the computer
    • To reach all other system devices, the Northbridge must communicate with the Southbridge. The Southbridge, often referred to as the I/O bridge, handles communication with devices through a variety of different buses(PCI/PCI Express, SATA, USB buses, etc.)
    • Such a system structure has a number of noteworthy consequences
      • All data communication from one CPU to another must travel over the same bus used to communicate with the Northbridge
      • All communication with RAM must pass through the Northbridge
      • Communication between a CPU and a device attached to the Southbridge is routed through the Northbridge
    • A couple of bottlenecks are immediately apparent in this design
      • One such bottleneck involves access to RAM for devices. In the earliest days of the PC, all communication with devices on either bridge had to pass through the CPU, negatively impacting overall system performance. To work around this problem some devices became capable of direct memory access (DMA). DMA allows devices, with the help of the Northbridge, to store and receive data in RAM directly without the intervention of the CPU (and its inherent performance cost). Today all high-performance devices attached to any of the buses can utilize DMA
      • A second bottleneck involves the bus from the Northbridge to the RAM. The exact details of the bus depend on the memory types deployed. On older systems there is only one bus to all the RAM chips, so parallel access is not possible. Recent RAM types require two separate buses (or channels as they are called for DDR2) which doubles the available bandwidth
    • RAM types
      • Static RAM
      • Dynamic RAM
  • CPU caches
    • So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors
    • In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data has temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused
    • Realizing that locality exists is key to the concept of CPU caches as we use them today (a small locality experiment is sketched after this entry)
    • CPU caches in the big picture
      • The CPU core is no longer directly connected to the main memory if cache system is deployed. All loads and stores have to go through the cache
      • It is of advantage to separate the caches used for code and for data. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution
    • Cache operation at high level
      • Exclusive cache (AMD/VIA), where a line in L1d is never also present in L2. Inclusive cache (Intel), where each line in L1d is also present in L2, which makes evicting from L1d much faster
      • In symmetric multi-processor (SMP) systems the caches of the CPUs cannot work independently from each other. All processors are supposed to see the same memory content at all times. The maintenance of this uniform view of memory is called “cache coherency”
      • Today’s processors all use internal pipelines of different lengths where the instructions are decoded and prepared for execution. Part of the preparation is loading values from memory (or cache) if they are transferred to a register. If the memory load operation can be started early enough in the pipeline, it may happen in parallel with other operations and the entire cost of the load might be hidden
    • CPU cache implementation details
      • Cache implementers have the problem that each cell in the huge main memory potentially has to be cached
        • Associativity
          • It would be possible to implement a cache where each cache line can hold a copy of any memory location. This is called a fully associative cache. To access a cache line the processor core would have to compare the tags of each and every cache line with the tag for the requested address. Fully associative caches are practical for small caches(for instance, the TLB caches on some Intel processors are fully associative)
          • A direct-mapped cache is fast and relatively easy to implement. But it has a drawback: it only works well if the addresses used by the program are evenly distributed with respect to the bits used for the direct mapping; otherwise some cache entries are heavily used and therefore repeatedly evicted while others are hardly used at all or remain empty
          • A set-associative cache combines the good features of the full associative and direct-mapped caches to largely avoid the weaknesses of those designs
  • 👀/P20 ★★★★★
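
A small experiment illustrating the spatial-locality point above: the same matrix is summed once in row-major order (sequential, cache- and prefetcher-friendly) and once in column-major order (one cache line touched per element). On typical hardware the second loop is several times slower; exact numbers depend on the cache hierarchy.

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = 4096;
    std::vector<int> m(n * n, 1);
    auto time = [&](auto&& body) {
        auto t0 = std::chrono::steady_clock::now();
        long long sum = body();
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%lld  %lld ms\n", sum,
                    (long long)std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count());
    };
    // Row-major traversal: consecutive addresses, whole cache lines are used.
    time([&] { long long s = 0;
               for (std::size_t i = 0; i < n; ++i)
                   for (std::size_t j = 0; j < n; ++j) s += m[i * n + j];
               return s; });
    // Column-major traversal: stride of n ints, one element per touched cache line.
    time([&] { long long s = 0;
               for (std::size_t j = 0; j < n; ++j)
                   for (std::size_t i = 0; i < n; ++i) s += m[i * n + j];
               return s; });
    return 0;
}
```
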

    6 TODO