Patch set 53 #659

Merged: 25 commits merged into adb-6.x from arenadata53 on Dec 8, 2023

Conversation

KnightMurloc and others added 25 commits October 13, 2023 12:13
#620)

Disabling the prefetch_inner mechanism for HashJoin when the join is on a
bottleneck could lead to a deadlock if there is a SharedScan in both parts of
the join. The deadlock occurred because the SharedScan producer was in the
inner part of the join and the consumer in the outer part, so the consumer
couldn't perform the shared scan without prefetching the inner part.

The SharedScan producer for HashJoin is always created under the inner part,
because the shareinput_walker function always processes a HashJoin node from
the Hash part. For CTEs prefetching was enabled in all cases except a
bottleneck join, because a CTE path is marked as motionHazard. Therefore, if
prefetch_inner is disabled, the executor can start performing the join from
the outer part while the SharedScan producer is inside the inner part, which
will likely lead to a deadlock if the SharedScan is cross-slice.

This patch fixes the problem by enabling prefetch_inner for a HashJoin on a
bottleneck when traversing the plan in shareinput_mutator_dag_to_tree, if
there are shared scans in the plan.
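For illustration, a minimal sketch of the rule described above (not the actual shareinput_mutator_dag_to_tree code; the two condition flags are hypothetical stand-ins for the checks the patch performs):

```c
#include "postgres.h"
#include "nodes/plannodes.h"

/*
 * Hypothetical helper: force prefetching of the inner (Hash) side of a
 * HashJoin that sits on a bottleneck slice when the plan contains shared
 * scans, so the cross-slice SharedScan producer is always executed.
 */
static void
maybe_force_prefetch_inner(Plan *plan, bool plan_has_shared_scans,
                           bool join_on_bottleneck)
{
	if (IsA(plan, HashJoin) && plan_has_shared_scans && join_on_bottleneck)
		((Join *) plan)->prefetch_inner = true;
}
```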
Previously the time range filtering of log files was incorrect: there
was a check of whether a log file belongs to the specified time range which
worked like a logical `AND` (if time < begin AND time > end - skip
file), but something like a logical `OR` is needed here. Thus file
filtering never worked. Also, this validation was executed after the
file was opened, so in case of failed validation there were redundant
`open/close` operations. This patch fixes the behavior by:
- moving the logic that checks whether a log file name belongs to the
  user-specified range into a separate function (this also helps cover
  corner cases)
- skipping file processing at the beginning of the loop if the file
  doesn't belong to the range, to prevent redundant `open/close`
  operations (see the sketch below)
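A minimal sketch of the corrected check (a hypothetical helper, not the actual code): a file is skipped only when it lies entirely outside the requested range, i.e. the two comparisons are combined with OR rather than AND:

```c
#include <stdbool.h>
#include <time.h>

/* Return true when the log file's timestamp falls inside [begin, end]. */
static bool
logfile_in_range(time_t file_time, time_t begin, time_t end)
{
	/*
	 * Old, broken filter: skip if (file_time < begin && file_time > end),
	 * which is never true, so nothing was ever filtered out.
	 */
	return !(file_time < begin || file_time > end);
}
```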
The function 'adb_collect_table_stats' used to fetch data from 'pg_partitions'.
The 'pg_partitions' view is built from a lot of joins ('pg_class', 'pg_namespace',
'pg_partition_rule', etc.). It takes a lot of time and resources (the query plan
for selecting all data from 'pg_partitions' contains ~60 rows).

'pg_partitions' was replaced with only the joins that 'adb_collect_table_stats'
actually needs. The resulting query plans are half the size of the ones that used
'pg_partitions'.

A new test, 'adb_collect_table_stat_test', was added. It checks that the function
'adb_collect_table_stats' generates new partitions for 'db_files_history' correctly.
…urrent (#629)

At "adb_collect_table_stats" temporary table "db_files_current" is used for collecting
data from view "arenadata_toolkit.__db_files_current". Schema "arenadata_toolkit" contains
not temporary table "db_files_current". If schema "arenadata_toolkit" is set by default
it may be get collison at table names.

Temporary table "db_files_current" was renamed to "pg_temp.db_files_current".
…es_now_fast external table (#636)

gpperfmon writes a string to queries_now.dat in the expectation that it will be
processed by gpmon_catqrynow.py, but the queries_now_fast external table reads
the file directly. This leads to an error, because queries_now.dat contains
the query_hash column, which is not declared in the queries_now_fast table. Also,
queries_now.dat contains empty columns at the end that should be filled by the
python script; these columns are not declared in the table either.

This patch fixes the issue by removing the extra columns from queries_now.dat and
adding them directly in the python script and in the write_qlog_full function
(which writes into queries_tail.dat).

Co-authored-by: Andrey Sokolov <[email protected]>
Split the existing CatalogUpdateIndexes into two different routines,
CatalogTupleInsert and CatalogTupleUpdate, which do both the heap
insert/update plus the index update.  This removes over 300 lines of
boilerplate code all over src/backend/catalog/ and src/backend/commands.
The resulting code is much more pleasing to the eye.

Also, by encapsulating what happens in detail during an UPDATE, this
facilitates the upcoming WARM patch, which is going to add a few more
lines to the update case making the boilerplate even more boring.

The original CatalogUpdateIndexes is removed; there was only one use
left, and since it's just three lines, we can as well expand it in place
there.  We could keep it, but WARM is going to break all the UPDATE
out-of-core callsites anyway, so there seems to be no benefit in doing
so.

This is backport of 2f5c9d9
CatalogTupleInsertFrozen is taken from 19cd1cf
This patch replaces simple_heap_insert plus CatalogUpdateIndexes with CatalogTupleInsert,
and simple_heap_update plus CatalogUpdateIndexes with CatalogTupleUpdate

Author: Pavan Deolasee
Discussion: https://www.postgr.es/m/CABOikdOcFYSZ4vA2gYfs=M2cdXzXX4qGHeEiW3fu9PCfkHLa2A@mail.gmail.com
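A sketch of the calling pattern this backport switches to (illustrative only; it assumes the GPDB 6 backport keeps the upstream pre-v12 signatures, where CatalogTupleInsert returns the new row's OID):

```c
#include "postgres.h"
#include "access/heapam.h"
#include "catalog/indexing.h"
#include "utils/rel.h"

/* "rel" is an open catalog relation, "tup" an already-formed tuple. */
static Oid
insert_catalog_row(Relation rel, HeapTuple tup)
{
	/*
	 * Old call-site coding replaced by this patch:
	 *     Oid oid = simple_heap_insert(rel, tup);
	 *     CatalogUpdateIndexes(rel, tup);
	 */
	return CatalogTupleInsert(rel, tup);	/* heap insert + index update */
}

static void
update_catalog_row(Relation rel, HeapTuple tup)
{
	/*
	 * Old call-site coding replaced by this patch:
	 *     simple_heap_update(rel, &tup->t_self, tup);
	 *     CatalogUpdateIndexes(rel, tup);
	 */
	CatalogTupleUpdate(rel, &tup->t_self, tup);	/* heap update + index update */
}
```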
This extends the work done in commit 2f5c9d9 to provide a more nearly
complete abstraction layer hiding the details of index updating for catalog
changes.  That commit only invented abstractions for catalog inserts and
updates, leaving nearby code for catalog deletes still calling the
heap-level routines directly.  That seems rather ugly from here, and it
does little to help if we ever want to shift to a storage system in which
indexing work is needed at delete time.

Hence, create a wrapper function CatalogTupleDelete(), and replace calls
of simple_heap_delete() on catalog tuples with it.  There are now very
few direct calls of [simple_]heap_delete remaining in the tree.

This is backport of ab02896
This patch replaces simple_heap_delete with CatalogTupleDelete

Discussion: https://postgr.es/m/[email protected]
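And the corresponding sketch for deletes (again illustrative, not the patch itself):

```c
#include "postgres.h"
#include "access/heapam.h"
#include "catalog/indexing.h"
#include "utils/rel.h"

static void
delete_catalog_row(Relation rel, HeapTuple tup)
{
	/* Old call-site coding: simple_heap_delete(rel, &tup->t_self); */
	CatalogTupleDelete(rel, &tup->t_self);	/* goes through the abstraction layer */
}
```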
Add CatalogTupleInsertWithInfo and CatalogTupleUpdateWithInfo to let
callers use the CatalogTupleXXX abstraction layer even in cases where
we want to share the results of CatalogOpenIndexes across multiple
inserts/updates for efficiency.  This finishes the job begun in commit
2f5c9d9, by allowing some remaining simple_heap_insert/update
calls to be replaced.  The abstraction layer is now complete enough
that we don't have to export CatalogIndexInsert at all anymore.

Also, this fixes several places in which 2f5c9d9 introduced performance
regressions by using retail CatalogTupleInsert or CatalogTupleUpdate even
though the previous coding had been able to amortize CatalogOpenIndexes
work across multiple tuples.

A possible future improvement is to arrange for the indexing.c functions
to cache the CatalogIndexState somewhere, maybe in the relcache, in which
case we could get rid of CatalogTupleInsertWithInfo and
CatalogTupleUpdateWithInfo again.  But that's a task for another day.

This is backport of aedd554

Discussion: https://postgr.es/m/[email protected]
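A sketch of the amortized pattern this commit enables (illustrative only, assuming the upstream signatures):

```c
#include "postgres.h"
#include "access/htup.h"
#include "catalog/indexing.h"
#include "utils/rel.h"

/*
 * Open the catalog's indexes once and reuse the CatalogIndexState for a
 * whole batch of inserts, instead of paying for CatalogOpenIndexes on
 * every tuple as the retail CatalogTupleInsert would.
 */
static void
insert_catalog_rows(Relation rel, HeapTuple *tuples, int ntuples)
{
	CatalogIndexState indstate = CatalogOpenIndexes(rel);
	int			i;

	for (i = 0; i < ntuples; i++)
		CatalogTupleInsertWithInfo(rel, tuples[i], indstate);

	CatalogCloseIndexes(indstate);
}
```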
This is backport of fa42b20

Tomas Vondra
When we insert into gp_fastsequence, we insert a frozen row
(xmin=FrozenTransactionId) and then update the index. That creates a
possibility of inconsistency in presence of server crash: if the WAL
for the frozen insert is flushed to persistent storage but the index
is not, after crash recovery we would be able to see the frozen row
in seqscan, but not in indexscan.

Now we fix this problem by doing a regular heap_insert (normal MVCC) and
updating the index first. Only after that do we mark the row as frozen.
That way we remove the inconsistency: if we crashed before we marked the
row frozen, the row follows regular MVCC and we won't be able to see it;
or if we crashed after we marked it frozen (and it also made it to disk),
we would replay both the table row and the index entry, so we would see the
row in both seqscan and indexscan.

Note that we now use the upstream-aligned way to mark the tuple frozen,
which is infomask |= HEAP_XMIN_FROZEN (commit 37484ad), instead of
setting xmin = FrozenTransactionId. The freeze effect should be the same,
but it helps leave behind a trail as to which transaction did the
freezing.

In addition, we use the same WAL type XLOG_HEAP2_FREEZE_PAGE as vacuum,
so we also align with upstream in how to replay the frozen tuple.
The alternative was to use the WAL type for heap_inplace_update, but then
we would need to change it to update the header portion as well as data
portion, which would be an unpleasant difference in WAL semantics.

The only other thing that has been changed from the previous way of doing
frozen_heap_insert is that now we don't set XLogRecord.xl_xid to
FrozenTransactionId anymore. However, this should be OK because:
(1) we are not setting xmin=FrozenTransactionId anymore, so the WAL record
should remain consistent.
(2) the change where we were setting this field in Greenplum seems to be
outdated itself. Upstream doesn't even have the capability to do that: it
never passes an arbitrary XID to that field, it's always
GetCurrentTransactionIdIfAny() (see XLogRecordAssemble()), even if you
insert a tuple w/ xmin=FrozenTransactionId (e.g. for PG sequence, see
fill_seq_with_data()). GPDB doesn't have a reason to differ.

Finally, we need to make a few adjustments in the regress test output where
xmin is queried for gp_fastsequence.

This is backport of 961de2d

Co-authored-by: Ashwin Agrawal <[email protected]>
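A minimal sketch of the new ordering described above (illustrative only; freeze_tuple_like_vacuum is a hypothetical placeholder for the code that sets HEAP_XMIN_FROZEN and emits the XLOG_HEAP2_FREEZE_PAGE record):

```c
#include "postgres.h"
#include "access/htup_details.h"
#include "catalog/indexing.h"
#include "utils/rel.h"

/* Hypothetical placeholder, not an actual GPDB function. */
static void freeze_tuple_like_vacuum(Relation rel, HeapTuple tup);

static void
insert_fastsequence_row(Relation rel, Datum *values, bool *nulls)
{
	HeapTuple	tup = heap_form_tuple(RelationGetDescr(rel), values, nulls);

	/* 1. Regular MVCC insert plus index update, WAL-logged as usual. */
	CatalogTupleInsert(rel, tup);

	/*
	 * 2. Only afterwards mark the row frozen.  If we crash before this
	 * step the row is simply invisible under MVCC, so seqscan and
	 * indexscan can never disagree.
	 */
	freeze_tuple_like_vacuum(rel, tup);

	heap_freetuple(tup);
}
```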
Relations such as temporary tables or unlogged tables should not have
WAL records created.

This is backport of 5245dcd
Similar to gp_fastsequence, there was a potential inconsistency between
the bitmap LOV table and its index due to the frozen insert. We now fix
the inconsistency with a method similar to the one in 961de2d

This is backport of 8accc02
SIMPLE_FAULT_INJECTOR("qe_exec_finished") was taken from e30909f for the test.
The fault injectors for insert_bmlov_before_freeze and insert_bmlov_after_freeze
were changed because the test does not work with the originals.
Now using a method similar to the one used in 961de2d
to mark the new aoseg/aocsseg tuples frozen: first insert the tuple
using a normal heap insert, then update it to be frozen using the same
mechanism as vacuum freeze.

Note that the inconsistency issue targeted by the above commit does not exist
in aoseg/aocsseg tables, because they do not have indexes. However, we still want
to be consistent in how we do this kind of thing with both Greenplum itself
and the upstream (which never does "insert a new frozen tuple").

With that, we also get rid of CatalogTupleFrozenInsert.

This is backport of 1306d47
frozen_heap_insert was removed from InsertInitialAOCSFileSegInfo by a5e28ec
The previous commits have replaced the logic for "frozen insert" with the
safer approach of "insert & freeze". Now we remove the remnants of that
concept from the code base. GPDB should now be aligned with the upstream
w.r.t. the interfaces concerned here.

Note that the frozen insert into a table also does frozen insert into its
toast tables (i.e. toast_insert_or_update() called with isFrozen=true).
One may question why we had the code to do that, given that none of
gp_fastsequence, bitmap indexes and ao(cs)seg tables has a toast table.
Here's some context regarding why we had that logic in the first place:
it was added to support an edge case with "error tables" for external tables,
which have been removed from GPDB since 3fd4bd9.
For example, the old syntax for the mentioned feature:
```
copy foo from 'foo.csv' csv LOG ERRORS INTO errtable SEGMENT REJECT LIMIT 10 ROWS;
```
So we should remove the related code for toast tables too.

This is backport of 998f4c2
The patch introduces a hash map (reloid -> PgStat_TableStatus) which improves
performance in the case of a large number of tables/partitions.

Author: Aleksander Alekseev
Reviewed-by: Andres Freund, Anastasia Lubennikova, Tels, me

https://commitfest.postgresql.org/13/1058/

greenplum 6 specific changes: using HASH_FUNCTION instead of HASH_BLOBS

(cherry picked from commit 090010f)
Code review for commit 090010f.

Fix cases where an elog(ERROR) partway through a function would leave the
persistent data structures in a corrupt state.  pgstat_report_stat got this
wrong by invalidating PgStat_TableEntry structs before removing hashtable
entries pointing to them, and get_tabstat_entry got it wrong by ignoring
the possibility of palloc failure after it had already created a hashtable
entry.

Also, avoid leaking a memory context per transaction, which the previous
code did through misunderstanding hash_create's API.  We do not need to
create a context to hold the hash table; hash_create will do that.
(The leak wasn't that large, amounting to only a memory context header
per iteration, but it's still surprising that nobody noticed it yet.)

greenplum 6 specific changes: using HASH_FUNCTION instead of HASH_BLOBS

(cherry picked from commit 5d00b76)
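A sketch of the lookup table these two commits end up with (illustrative, based on the upstream commits; the struct and field names are assumptions). Note, per the second commit, that hash_create builds its own memory context, so none is created explicitly:

```c
#include "postgres.h"
#include "pgstat.h"
#include "utils/hsearch.h"

/* Maps a relation oid to its PgStat_TableStatus entry. */
typedef struct TabStatHashEntry
{
	Oid					t_id;		/* hash key: relation oid */
	PgStat_TableStatus *tsa_entry;	/* pointer into the tabstat arrays */
} TabStatHashEntry;

static HTAB *pgStatTabHash = NULL;

static void
create_tabstat_hash(void)
{
	HASHCTL		ctl;

	MemSet(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(TabStatHashEntry);
	ctl.hash = oid_hash;			/* greenplum 6: HASH_FUNCTION instead of HASH_BLOBS */

	/* hash_create makes its own memory context; no need to create one here */
	pgStatTabHash = hash_create("pgstat TabStatusArray lookup hash table",
								128, &ctl,
								HASH_ELEM | HASH_FUNCTION);
}
```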
This patch extends the functionality of the arenadata_toolkit extension
by adding two functions to obtain different table orders for vacuum/analyze.
These two functions are planned to be used in the vacuum command in
the operation.py script. By default, the bundle will use the current approach,
in which new tables are placed at the beginning of the list. But users will
be able to reconfigure it to use a second function that places new tables
at the end of the list, or even use a custom query to determine the order.
…eral locuses (#633)

Plan nodes with the General or SegmentGeneral locus type should return the same
result on each of the segments. But a volatile function does not guarantee the
same result for different calls, and therefore, if the target list of a plan node
with General or SegmentGeneral locus type contains a volatile function, the
locus of such a plan node should be brought to SingleQE. In the current
implementation, a plan node with a volatile function in its target list is not
brought to SingleQE for all subqueries. If the volatile function is in the target
list of a CTE, a multiset function or a union/except/intersect, these constructs
can hide the volatile function in the underlying nodes, and the current
implementation can't detect it. The volatile function can then be executed on
more than one segment, and each of them will work with a different result set,
which is not correct.

This patch fixes the error by moving the processing of volatile functions in the
target list of plan nodes with the General and SegmentGeneral locus types to
subquery_planner, so that the processing is performed for all subqueries.
An attempt to run DML over a table with a trigger may give an assertion error in
the `CFunctionProp` constructor. The `func_stability` passed to the constructor
may be greater than the maximum value (`IMDFunction::EfsSentinel`). This happens
because the `CLogicalRowTrigger` class has no default value for `m_efs` in its
constructor. This means that `m_efs` may contain any value (for example
2139062143). Later, when `InitFunctionProperties()` tries to calculate the less
strict stability level, `m_efs` is not overwritten, because it already has a
greater value. The patch fixes the error by adding a default `m_efs` value to the
constructor. Planning may fail on any type of trigger for any DML operation, so
you may test it with any trigger and `optimizer_enable_dml_triggers` enabled.
…ser.

Cherry picked from 3f71222

This PR helps resolve #12748. When the access privilege on a protocol
is granted to some user, the 'DROP OWNED BY' clause
fails with an error message saying:
'ERROR: unexpected object class XXXX (aclchk.c:XXX)'.

Simple SQL to reproduce:

```sql
-- Create read, write functions.
CREATE OR REPLACE FUNCTION dummy_read() RETURNS INTEGER
  AS $$ SELECT 0 $$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION dummy_write() RETURNS INTEGER
  AS $$ SELECT 0 $$ LANGUAGE SQL;
-- Create protocol.
CREATE TRUSTED PROTOCOL dummy_proto (
  readfunc = dummy_read,
  writefunc=dummy_write
);
-- Create a user.
CREATE ROLE test_role1;
-- Grant access privilege.
GRANT ALL ON PROTOCOL dummy_proto TO test_role1;
DROP OWNED BY test_role1;
```
… present (#634)

The current partitioning design has a problem connected with differences in the
physical attribute layouts of a child partition and the parent relation, in the
case when the parent relation has dropped columns and the newly added child
partition has not, and vice versa.

When executing modifying DML operations on a leaf partition of a partitioned
table, the executor performs partition selection anyway: in case of INSERT and
UPDATE for the legacy planner, and in case of INSERT, UPDATE and DELETE for the
ORCA optimizer. This is done in the selectPartition function, which is called
from the slot_get_partition function (in case of modifying DML commands). This
procedure is based on 1) the attribute values of the tuple, which have been
retrieved from the respective subplan of the ModifyTable (DML) node and whose
attribute structure corresponds to the leaf partition's; 2) the root partition's
attribute numbers and partition rules, which are based on the parent (root)
relation's tuple descriptor. Thus, if the parent relation's tuple descriptor
contained dropped columns and the leaf partition's did not (or vice versa, the
leaf partition had dropped columns and the parent had not), the partition
selection could go wrong and lead to various execution errors.

Let's consider the following case. A partitioned table was created, and at
some point it was decided to drop several irrelevant columns from it. After that,
some new partitions were added or some were exchanged with new relations. In
this situation these newly added (exchanged) partitions don't have dropped
columns in their structure, but the original parent relation has such dropped
attributes. Then an INSERT command into the new leaf partition is executed. In
this case for both planners the ExecInsert function is called, which then calls
the slot_get_partition function in order to validate that the tuple values
correspond to the specified leaf partition. And here the tuple descriptor does
not have dropped attributes, but the partition selection is performed from the
parent's view, which has those attributes, and the parent's partitioning
attribute numbers are used to validate the tuple. Because of that, the partition
selection procedure could either select the wrong target partition (the wrong
attribute was taken from the slot values to check the partition rule) or fail to
find the partition at all, because none of the partition rules were satisfied.
The same issue occurred when we didn't drop anything from the parent relation,
but exchanged a partition with one that has dropped columns. When inserting into
the exchanged leaf partition, the partition selection could also go wrong for
the same reasons (selection is performed from the parent's view).

A similar issue occurred for both planners in other commands where partition
selection was performed. In order to prevent the potentially wrong partition
selection when modifying a leaf partition, this patch proposes the
following solution.

The legacy planner has the checkPartitionUpdate function, which is called in
UPDATE cases and which checks that the tuple does not switch its partition at
update. And to our astonishment this function is able to handle the case when
the partition being modified has physically different attribute numbers from the
root partition's due to dropped columns. In this case the function creates the
ri_PartCheckMap map in the target (child) ResultRelInfo, which stores the
correspondence between the target partition's attributes and the parent's. And
because of that map, the partition selection goes smoothly inside the function.

The proposed solution is to build ri_PartCheckMap for all the cases when a
leaf partition is modified and the parent relation has dropped columns (or the
child has and the parent doesn't), and when the partition selection is expected.
The logic, which checks whether the leaf's and parent's relation descriptors
match and builds the ri_PartCheckMap, is moved to a separate function,
makePartitionCheckMap. This new function is called inside ExecModifyTable
and ExecDML under different conditions.

For the legacy planner this function is called for any command except DELETE, no
matter which relation is initially specified as the target (it can be either
root or leaf). For example, when executing UPDATE (either split or common
update) on the root relation, the planner builds the plan via
inheritance_planner, and, therefore, each subplan of the ModifyTable node will
return the tuple slot with the attribute structure of the specific partition
(some subplans will contain dropped attributes in the targetList, some won't).
Here a call to makePartitionCheckMap is also required. For the INSERT command it
will always be called as well, because makePartitionCheckMap validates whether
the oid of the result relation is equal to the parent partition's oid, and this
check allows us to call the function on the root relation too (if it's true, the
map is not built, because it's safe to perform the operation through the root
relation).

For the ORCA optimizer, makePartitionCheckMap is called inside the ExecDML
function only when a leaf partition is being modified. Otherwise, if
the command is called on the root relation, the tuple will always have the
parent's structure, which does not break the partition selection procedure.

This patch also changes the slot_get_partition function, which performs the
partition selection. The function is extended with an alternative way of
extracting the attribute values from the given slot. When the
makePartitionCheckMap function has created the ri_PartCheckMap of the child
ResultRelInfo, the attribute values array is built via that map
(inside reconstructTupleValues), which allows the partition selection to be
performed from the parent's view.

Finally, this patch additionally solves one more issue that is not directly
related to the partition selection problem described above. Let's consider the
case when the parent relation has two partitions, and one of them has an
attribute structure incompatible with the parent's due to dropped columns. The
issue occurred when a split UPDATE on the parent partition was executed, and its
plan was built by the legacy planner. That means that the ModifyTable node
contains several subplans and several result relations inside
estate->es_result_relations (the inheritance planner creates them and the update
is executed for each of them). And during execution of this plan, when the
update is executed on the partition which has the same relation descriptor as
the parent's, invalid behaviour could occur when preparing the tuple for
insertion (see the last test case in the tests section). In this case the
partition selection at the insert stage works correctly, because the partition
does not differ from the parent relation. However, inside the get_part function,
after the partition was selected correctly by the selectPartition function, the
targetid_get_partition function could wrongly form the final ResultRelInfo by
creating an unnecessary ri_partInsertMap map, which is used in the
reconstructMatchingTupleSlot function (inside ExecInsert) to adjust the tuple to
the selected partition's descriptor if the tuple does not fit. The
targetid_get_partition function makes the decision on building that map based on
a check that compares two relations' descriptors: one which is considered to be
the parent's (parentInfo = estate->es_result_relations), and one which
corresponds to the selected child partition (childInfo). But when the UPDATE is
formed by the inheritance planner, the parentInfo does not correspond to the
true parent relation. It corresponds to one of the other partitions. And in the
case when the selected partition is valid (it has the same relation descriptor
as the parent's), and parentInfo corresponds to an invalid partition (it does
not match the parent descriptor), ri_partInsertMap could be built by mistake,
which led to unnecessary tuple reformatting and invalid results.

This patch solves this last issue by adding a check inside the
targetid_get_partition function, which ensures that
estate->es_result_relations corresponds to the true root relation. That
is true for all the cases when the DML is executed on the parent relation,
except the UPDATE by the legacy planner. If the check passes, the
ri_partInsertMap is built, otherwise it's not. Moreover, during such an UPDATE
there is no need to reconstruct the tuple at all, because each ModifyTable
subplan will already produce a valid tuple.
The gpdb test suite requires ipv6 to be enabled since 7f3c91f. But there is at
least one src/test/ssl test that expects name resolution to an ipv4 address and
fails in the ipv6 environment with:

```
psql: FATAL:  no pg_hba.conf entry for host "2001:db8:1::242:ac11:2",
user "ssltestuser", database "postgres", SSL on
pg_regress: cannot read the result
(using postmaster on 9f17c91f4c76, port 6000)
```

so we, at least temporarily, try to change the precedence of name resolution in
favor of ipv4.
Sync 6.26.0 changes to develop branch
@BenderArenadata

Allure report https://allure-ee.adsw.io/launch/59970

@BenderArenadata

Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863102

@BenderArenadata

Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863103

@BenderArenadata

Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863096

Stolb27 changed the title Arenadata53 → Patchset 53 on Dec 7, 2023
Stolb27 changed the title Patchset 53 → Patch set 53 on Dec 7, 2023
Stolb27 requested a review from a team on December 7, 2023 19:59
Stolb27 marked this pull request as ready for review on December 7, 2023 20:00
Stolb27 merged commit ac00af7 into adb-6.x on Dec 8, 2023
2 of 4 checks passed
Stolb27 deleted the arenadata53 branch on December 8, 2023 14:21