Patch set 53 #659
Conversation
Stolb27 commented Dec 7, 2023 (edited)
- ADBDEV-4318: Fix deadlock in case of the bottleneck HashJoin with shared scan below #620 (ADBDEV-4318)
- ADBDEV-4349: gplogfilter: fix timerange validation #627 (ADBDEV-4349)
- arenadata_toolkit: replace using pg_partitions #629 (ADBDEV-4378)
- ADBDEV-4488: Fix inconsistency between the format of queries_now.dat and the queries_now_fast external table #636 (ADBDEV-4488)
- ADBDEV-4081: Fix inconsistency between gp_fastsequence row and index after crash #585, together with the required backports from ADBDEV-4081-old: some need backports #610 (ADBDEV-4081)
- ADBDEV-4406: Backport patches that optimize get_tabstat_entry performance #638 (ADBDEV-4406)
- ADBDEV-4548: Implement vacuum/analyze ordering functions for arenadata_toolkit #641 (ADBDEV-4548)
- ADBDEV-4411: Fix CTEs with volatile target list for case of SegmentGeneral and General locuses #633 (ADBDEV-4411)
- Fix ORCA's triggers on debug build #650 (ADBDEV-4679)
- ADBDEV-4599: Handle external protocols for DROP OWNED BY statement #653 (ADBDEV-4599)
- ADBDEV-4207: Fix partition selection during DML execution when dropped columns are present #634 (ADBDEV-4207)
… (#620) Disabling the prefetch_inner mechanism for HashJoin, when the join is on the bottleneck, could lead to a deadlock if there is a SharedScan in both parts of the join. The deadlock occurred because the SharedScan producer was in the inner part of the join and the consumer in the outer part, so the consumer couldn't perform the shared scan without prefetching the inner part. The SharedScan producer for HashJoin is always created under the inner part, because the shareinput_walker function always processes a HashJoin node starting from the Hash part. For CTEs, prefetching was enabled in all cases except the bottleneck join, because a CTE path is marked as motionHazard. Therefore, if prefetch_inner is disabled, the executor can start performing the join from the outer part while the SharedScan producer is inside the inner part, which can lead to a deadlock if the SharedScan is cross-slice. This patch fixes the problem by enabling prefetch_inner for a HashJoin on the bottleneck while traversing the plan in shareinput_mutator_dag_to_tree, if there are shared scans in the plan and the join is on the bottleneck.
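For illustration only (the table and CTE names below are ours, not from #620): the problematic shape is the same CTE referenced on both sides of a hash join, which makes the SharedScan producer/consumer pair span the join.

```sql
-- Hypothetical sketch (assuming CTE sharing is enabled, so both references use
-- a SharedScan): if this join is planned on the bottleneck and prefetch_inner
-- is disabled, the consumer in the outer side can wait on a producer buried in
-- the inner side.
CREATE TABLE events (id int, parent_id int) DISTRIBUTED BY (id);

WITH cte AS (
    SELECT id, parent_id FROM events
)
SELECT a.id, b.id
FROM cte a
JOIN cte b ON a.parent_id = b.id;
```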
Previously the time range filtering of log files was incorrect: the check for whether a log file belongs to the specified time range worked like a binary `AND` (if time < begin AND time > end, skip the file), whereas something like a binary `OR` is needed here. Thus file filtering never worked. Also, this validation was executed after the file was opened, so when validation failed there were redundant `open/close` operations. This patch fixes that behavior by:
- moving the logic that checks whether a log file name belongs to the user-specified range into a separate function (which also helps cover corner cases);
- skipping the file at the beginning of the loop if it doesn't belong to the range, to prevent redundant `open/close` operations.
The function 'adb_collect_table_stats' was getting data from 'pg_partitions'. 'pg_partitions' is generated from many joins ('pg_class', 'pg_namespace', 'pg_partition_rule', etc.), which takes a lot of time and resources (the query plan for getting all data from 'pg_partitions' contains ~60 rows). 'pg_partitions' was replaced with only the joins that 'adb_collect_table_stats' needs; the resulting query plans are half the size of the query using 'pg_partitions'. A new test, 'adb_collect_table_stat_test', was added; it checks that the function 'adb_collect_table_stats' generates new partitions for 'db_files_history' correctly.
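A minimal sketch of the idea, assuming the standard GPDB 6 partitioning catalogs (the exact query used in #629 may differ):

```sql
-- Sketch only: list parent/child partition names straight from the catalogs
-- instead of going through the wide pg_partitions view.
SELECT n.nspname      AS schemaname,
       parent.relname AS tablename,
       child.relname  AS partitiontablename
FROM pg_partition p
JOIN pg_partition_rule r ON r.paroid = p.oid
JOIN pg_class child      ON child.oid = r.parchildrelid
JOIN pg_class parent     ON parent.oid = p.parrelid
JOIN pg_namespace n      ON n.oid = parent.relnamespace
WHERE NOT p.paristemplate;
```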
…urrent (#629) In "adb_collect_table_stats" the temporary table "db_files_current" is used for collecting data from the view "arenadata_toolkit.__db_files_current". The schema "arenadata_toolkit" also contains a non-temporary table "db_files_current", so if the schema "arenadata_toolkit" is in the default search path, the table names may collide. The temporary table "db_files_current" was therefore renamed to "pg_temp.db_files_current".
…es_now_fast external table (#636) gpperfmon writes a string to queries_now.dat in the expectation that it will be processed by gpmon_catqrynow.py, but the queries_now_fast external table reads the file directly. This leads to an error, because queries_now.dat contains the query_hash column, which is not declared in the queries_now_fast table. queries_now.dat also contains empty columns at the end that are supposed to be filled by the Python script; these columns are not declared in the table either. This patch fixes the issue by removing the extra columns from queries_now.dat and adding them directly in the Python script and in the write_qlog_full function (which writes into queries_tail.dat). Co-authored-by: Andrey Sokolov <[email protected]>
Split the existing CatalogUpdateIndexes into two different routines, CatalogTupleInsert and CatalogTupleUpdate, which do both the heap insert/update plus the index update. This removes over 300 lines of boilerplate code all over src/backend/catalog/ and src/backend/commands. The resulting code is much more pleasing to the eye. Also, by encapsulating what happens in detail during an UPDATE, this facilitates the upcoming WARM patch, which is going to add a few more lines to the update case making the boilerplate even more boring. The original CatalogUpdateIndexes is removed; there was only one use left, and since it's just three lines, we can as well expand it in place there. We could keep it, but WARM is going to break all the UPDATE out-of-core callsites anyway, so there seems to be no benefit in doing so. This is a backport of 2f5c9d9; CatalogTupleInsertFrozen is taken from 19cd1cf. This patch replaces simple_heap_insert plus CatalogUpdateIndexes with CatalogTupleInsert, and simple_heap_update plus CatalogUpdateIndexes with CatalogTupleUpdate. Author: Pavan Deolasee Discussion: https://www.postgr.es/m/CABOikdOcFYSZ4vA2gYfs=M2cdXzXX4qGHeEiW3fu9PCfkHLa2A@mail.gmail.com
This extends the work done in commit 2f5c9d9 to provide a more nearly complete abstraction layer hiding the details of index updating for catalog changes. That commit only invented abstractions for catalog inserts and updates, leaving nearby code for catalog deletes still calling the heap-level routines directly. That seems rather ugly from here, and it does little to help if we ever want to shift to a storage system in which indexing work is needed at delete time. Hence, create a wrapper function CatalogTupleDelete(), and replace calls of simple_heap_delete() on catalog tuples with it. There are now very few direct calls of [simple_]heap_delete remaining in the tree. This is a backport of ab02896. This patch replaces simple_heap_delete with CatalogTupleDelete. Discussion: https://postgr.es/m/[email protected]
Add CatalogTupleInsertWithInfo and CatalogTupleUpdateWithInfo to let callers use the CatalogTupleXXX abstraction layer even in cases where we want to share the results of CatalogOpenIndexes across multiple inserts/updates for efficiency. This finishes the job begun in commit 2f5c9d9, by allowing some remaining simple_heap_insert/update calls to be replaced. The abstraction layer is now complete enough that we don't have to export CatalogIndexInsert at all anymore. Also, this fixes several places in which 2f5c9d9 introduced performance regressions by using retail CatalogTupleInsert or CatalogTupleUpdate even though the previous coding had been able to amortize CatalogOpenIndexes work across multiple tuples. A possible future improvement is to arrange for the indexing.c functions to cache the CatalogIndexState somewhere, maybe in the relcache, in which case we could get rid of CatalogTupleInsertWithInfo and CatalogTupleUpdateWithInfo again. But that's a task for another day. This is a backport of aedd554. Discussion: https://postgr.es/m/[email protected]
This is a backport of fa42b20 (Tomas Vondra).
When we insert into gp_fastsequence, we insert a frozen row (xmin=FrozenTransactionXid) and then update the index. That creates a possibility of inconsistency in the presence of a server crash: if the WAL for the frozen insert is flushed to persistent storage but the index is not, after crash recovery we would be able to see the frozen row in a seqscan, but not in an indexscan. We now fix this problem by doing a regular heap_insert (normal MVCC) and updating the index first; only after that do we mark the row as frozen. That way we remove the inconsistency: if we crash before marking the row frozen, the row follows regular MVCC and we won't be able to see it; if we crash after marking it frozen (and it also made it to disk), we would replay both the table row and the index, so we would see the row in both seqscan and indexscan. Note that we now use the upstream-aligned way to mark the tuple frozen, which is infomask |= HEAP_XMIN_FROZEN (commit 37484ad), instead of setting xmin = FrozenTransactionId. The freeze effect should be the same, but it helps leave behind a trail as to which transaction did the freezing. In addition, we use the same WAL type XLOG_HEAP2_FREEZE_PAGE as vacuum, so we also align with upstream in how the frozen tuple is replayed. The alternative was to use the WAL type for heap_inplace_update, but then we would need to change it to update the header portion as well as the data portion, which would be an unpleasant difference in WAL semantics. The only other thing changed from the previous way of doing frozen_heap_insert is that we no longer set XLogRecord.xl_xid to FrozenTransactionId. However, this should be OK because: (1) since we are not setting xmin=FrozenTransactionId anymore, the WAL record remains consistent; (2) the change where we set this field in Greenplum seems to be outdated itself. Upstream doesn't even have the capability to do that: it never passes an arbitrary XID to that field, it's always GetCurrentTransactionIdIfAny() (see XLogRecordAssemble()), even if you insert a tuple with xmin=FrozenTransactionId (e.g. for a PG sequence, see fill_seq_with_data()). GPDB doesn't have a reason to differ. Finally, we need to make a few adjustments in the regress test output where xmin is queried for gp_fastsequence. This is a backport of 961de2d Co-authored-by: Ashwin Agrawal <[email protected]>
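As a hedged illustration of the test-output change mentioned above: since freezing is now recorded in the infomask, querying xmin for gp_fastsequence shows the inserting transaction id rather than the literal frozen xid.

```sql
-- Sketch: a frozen gp_fastsequence row no longer shows xmin = 2
-- (FrozenTransactionId); the freeze is carried by HEAP_XMIN_FROZEN instead.
SELECT objid, objmod, last_sequence, xmin
FROM pg_catalog.gp_fastsequence;
```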
WAL records should not be created for relations such as temporary or unlogged tables. This is a backport of 5245dcd
Similar to gp_fastsequence, there was a potential inconsistency between the bitmap LOV table and its index due to the frozen insert. We now fix the inconsistency by a method similar to the one used in 961de2d. This is a backport of 8accc02. SIMPLE_FAULT_INJECTOR("qe_exec_finished") was taken from e30909f for the test. The fault injectors for insert_bmlov_before_freeze and insert_bmlov_after_freeze were changed because the test does not work with the originals.
We now use a method similar to the one used in 961de2d to mark the new aoseg/aocsseg tuples frozen: first insert the tuple using a normal heap insert, and then update it to be frozen using the same mechanism as vacuum freeze. Note that the inconsistency issue targeted by the above commit does not exist for aoseg/aocsseg tables, because they do not have an index. However, we still want to be consistent about how we do this kind of thing, with both Greenplum itself and the upstream (which never does "insert a new frozen tuple"). With that, we also get rid of CatalogTupleFrozenInsert. This is a backport of 1306d47. frozen_heap_insert was removed from InsertInitialAOCSFileSegInfo by a5e28ec
The previous commits have replaced the logic for "frozen insert" with the safer "insert & freeze" approach. Now we remove the remnants of that concept from the code base. GPDB should now be aligned with the upstream with respect to the interfaces concerned here. Note that a frozen insert into a table also does a frozen insert into its toast tables (i.e. toast_insert_or_update() called with isFrozen=true). One may question why we had the code to do that, given that none of gp_fastsequence, bitmap indexes or ao(cs)seg tables has a toast table. Here's some context on why we had that logic in the first place: it was added to support an edge case with "error tables" for external tables, which have been removed from GPDB since 3fd4bd9. For example, the old syntax for the mentioned feature:
```
copy foo from 'foo.csv' csv LOG ERRORS INTO errtable SEGMENT REJECT LIMIT 10 ROWS;
```
So we should remove the related code for toast tables too. This is a backport of 998f4c2
The patch introduces a hash map (reloid -> PgStat_TableStatus), which improves performance in the case of a large number of tables/partitions. Author: Aleksander Alekseev Reviewed-by: Andres Freund, Anastasia Lubennikova, Tels, me https://commitfest.postgresql.org/13/1058/ Greenplum 6 specific changes: using HASH_FUNCTION instead of HASH_BLOBS (cherry picked from commit 090010f)
Code review for commit 090010f. Fix cases where an elog(ERROR) partway through a function would leave the persistent data structures in a corrupt state. pgstat_report_stat got this wrong by invalidating PgStat_TableEntry structs before removing hashtable entries pointing to them, and get_tabstat_entry got it wrong by ignoring the possibility of palloc failure after it had already created a hashtable entry. Also, avoid leaking a memory context per transaction, which the previous code did through misunderstanding hash_create's API. We do not need to create a context to hold the hash table; hash_create will do that. (The leak wasn't that large, amounting to only a memory context header per iteration, but it's still surprising that nobody noticed it yet.) greenplum 6 specific changes: using HASH_FUNCTION instead of HASH_BLOBS (cherry picked from commit 5d00b76)
This patch extends the functionality of the arenadata_toolkit extension by adding two functions to obtain different table orders for vacuum/analyze. These two functions are planned to be used in the vacuum command in the operation.py script. By default, the bundle will use the current approach, in which new tables are placed at the beginning of the list. But users will be able to reconfigure it to use a second function that places new tables at the end of the list, or even use a custom query to determine the order.
…eral locuses (#633) Plan nodes with the General or SegmentGeneral locus type should return the same result on each of the segments. But a volatile function does not guarantee the same result across calls, and therefore, if the target list of a plan node with the General or SegmentGeneral locus type contains a volatile function, the locus of such a plan node should be brought to SingleQE. In the current implementation, a plan node with a volatile function in its target list is brought to SingleQE not for all subqueries: if the volatile function is in the target list of a CTE, a multiset function, or a union/except/intersect, these can hide the volatile function in the underlying nodes and the current implementation can't detect them. The volatile function can then be executed on more than one segment, and each of them will work with a different result set, which is not correct. This patch fixes the error by moving the processing of volatile functions in the target lists of plan nodes with the General and SegmentGeneral locus types to subquery_planner, so that the processing is performed for all subqueries.
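An illustrative example of the pattern described above (table names are ours): a volatile function hidden inside a CTE over a replicated table, whose scan has the SegmentGeneral locus.

```sql
-- Sketch: without forcing this subquery to SingleQE, random() could be
-- evaluated independently on each segment, so each segment would see a
-- different version of the CTE's result set.
CREATE TABLE repl_t (a int) DISTRIBUTED REPLICATED;
INSERT INTO repl_t SELECT generate_series(1, 100);

WITH cte AS (
    SELECT a, random() AS r FROM repl_t
)
SELECT count(*) FROM cte WHERE r < 0.5;
```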
An attempt to run DML over a table with a trigger may produce an assertion error in the `CFunctionProp` constructor: the `func_stability` passed to the constructor may exceed the maximum value (`IMDFunction::EfsSentinel`). This happens because the `CLogicalRowTrigger` class has no default value for `m_efs` in its constructor, which means `m_efs` may contain any value (for example 2139062143). Later, when `InitFunctionProperties()` tries to calculate a less strict stability level, `m_efs` is not overwritten, because it already holds a greater value. The patch fixes the error by adding a default `m_efs` value to the constructor. The planning may fail on any type of trigger with any DML operation, so you may test it with any trigger and `optimizer_enable_dml_triggers` enabled.
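A reproduction sketch following the note above (table and function names are ours; the GUCs are those named in the description):

```sql
-- Sketch: on an assert-enabled (debug) build, planning this DML through ORCA
-- could previously hit the CFunctionProp assertion.
SET optimizer = on;
SET optimizer_enable_dml_triggers = on;

CREATE TABLE trg_tab (a int, b int) DISTRIBUTED BY (a);

CREATE OR REPLACE FUNCTION trg_fn() RETURNS trigger AS $$
BEGIN
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_before_ins
    BEFORE INSERT ON trg_tab
    FOR EACH ROW EXECUTE PROCEDURE trg_fn();

INSERT INTO trg_tab VALUES (1, 1);
```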
…ser. Cherry-picked from 3f71222. This PR helps resolve #12748. When an access privilege has been granted to some user, the 'DROP OWNED BY' clause cannot be executed and fails with the error message: 'ERROR: unexpected object class XXXX (aclchk.c:XXX)'. Simple SQL to reproduce:
```sql
-- Create read and write functions.
CREATE OR REPLACE FUNCTION dummy_read() RETURNS INTEGER AS $$ SELECT 0 $$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION dummy_write() RETURNS INTEGER AS $$ SELECT 0 $$ LANGUAGE SQL;

-- Create the protocol.
CREATE TRUSTED PROTOCOL dummy_proto (
    readfunc = dummy_read,
    writefunc = dummy_write
);

-- Create a user.
CREATE ROLE test_role1;

-- Grant the access privilege.
GRANT ALL ON PROTOCOL dummy_proto TO test_role1;

DROP OWNED BY test_role1;
```
… present (#634) The current partitioning design has a problem connected with differences in the physical attribute layouts of a child partition and the parent relation, when the parent relation has dropped columns and a newly added child partition does not, or vice versa. When executing modifying DML operations on a leaf partition of a partitioned table, the executor performs partition selection anyway: for INSERT and UPDATE with the legacy planner, and for INSERT, UPDATE and DELETE with the ORCA optimizer. This is done in the selectPartition function, which is called from the slot_get_partition function (in the case of modifying DML commands). The selection is based on 1) the attribute values of the tuple, which have been retrieved from the respective subplan of the ModifyTable (DML) node and whose attribute structure corresponds to the leaf partition's; 2) the root partition's attribute numbers and partition rules, which are based on the parent (root) relation's tuple descriptor. Thus, if the parent relation's tuple descriptor contains dropped columns and the leaf partition's does not (or vice versa), the partition selection can go wrong and lead to various execution errors.

Consider the following case. A partitioned table was created, and at some point several irrelevant columns were dropped from it. After that some new partitions were added, or some were exchanged with new relations. These newly added (exchanged) partitions don't have dropped columns in their structure, but the original parent relation has such dropped attributes. Then an INSERT into the new leaf partition is executed. For both planners the ExecInsert function is called, which then calls slot_get_partition in order to validate that the tuple values correspond to the specified leaf partition. Here the tuple descriptor does not have dropped attributes, but the partition selection is performed from the parent's point of view, which does have those attributes, and the parent's partitioning attribute numbers are used to validate the tuple. Because of that, the partition selection procedure could either select the wrong target partition (the wrong attribute was taken from the slot values to check the partition rule) or fail to find the partition at all, because none of the partition rules were satisfied. The same issue occurred when nothing was dropped from the parent relation but a partition was exchanged with one that has dropped columns: when inserting into the exchanged leaf partition, the partition selection could also go wrong for the same reasons (selection is performed from the parent's point of view). Similar issues occurred for both planners in other commands whenever partition selection was performed.

To prevent potentially wrong partition selection when modifying a leaf partition, this patch proposes the following solution. The legacy planner has the checkPartitionUpdate function, which is called for UPDATE and checks that the tuple does not switch its partition on update. This function is already able to handle the case when the partition being modified has physically different attribute numbers from the root partition's due to dropped columns: it creates the ri_PartCheckMap map in the target (child) ResultRelInfo, which stores the correspondence between the target partition's attributes and the parent's, and thanks to that map the partition selection inside the function goes smoothly.
The proposed solution is to build ri_PartCheckMap for all cases when a leaf partition is modified and the parent relation has dropped columns (or the child has and the parent doesn't), and partition selection is expected. The logic that checks whether the leaf's and parent's relation descriptors match and builds the ri_PartCheckMap is moved to a separate function, makePartitionCheckMap. This new function is called inside the ExecModifyTable and ExecDML functions under different conditions. For the legacy planner the function is called for any command except DELETE, no matter which relation is initially specified as the target (it can be either the root or a leaf). For example, when executing UPDATE (either split or common update) on the root relation, the planner builds the plan via inheritance_planner, and therefore each subplan of the ModifyTable node will return a tuple slot with the attribute structure of the specific partition (some subplans will contain dropped attributes in the targetList, some won't); here the call to makePartitionCheckMap is also required. For the INSERT command it is always called as well, because makePartitionCheckMap checks whether the oid of the result relation equals the parent partition's oid, and this check allows us to call the function on the root relation too (if it's true, the map is not built, because it's safe to perform the operation through the root relation). For the ORCA optimizer, makePartitionCheckMap is called inside ExecDML only when a leaf partition is being modified; otherwise, if the command is run on the root relation, the tuple always has the parent's structure, which does not break the partition selection procedure.

This patch also changes the slot_get_partition function, which performs the partition selection. The function is extended with an alternative way of extracting attribute values from the given slot: when makePartitionCheckMap has created the ri_PartCheckMap of the child ResultRelInfo, the attribute values array is built via that map (inside reconstructTupleValues), which allows the partition selection to be performed from the parent's point of view.

Finally, this patch additionally solves one more issue, which is not directly related to the partition selection problem described above. Consider the case when the parent relation has two partitions, and one of them has an attribute structure incompatible with the parent's due to dropped columns. The issue occurred when a split UPDATE on the parent partition was executed and its plan was built by the legacy planner. That means the ModifyTable node contains several subplans and several result relations inside estate->es_result_relations (the inheritance planner creates them, and the update is executed for each of them). During execution of this plan, when the update is executed on the partition that has the same relation descriptor as the parent, invalid behaviour could occur when preparing the tuple for insertion (see the last test case in the tests section). In this case the partition selection at the insert stage works correctly, because the partition does not differ from the parent relation. However, inside the get_part function, after the partition was selected correctly by selectPartition, the targetid_get_partition function could wrongly form the final ResultRelInfo by creating an unnecessary ri_partInsertMap map, which is used in the reconstructMatchingTupleSlot function (inside ExecInsert) to adjust the tuple to the selected partition's descriptor if the tuple does not fit.
The targetid_get_partition function decides whether to build that map based on a check that compares two relation descriptors: one that is considered the parent's (parentInfo = estate->es_result_relations) and one that corresponds to the selected child partition (childInfo). But when the UPDATE is formed by the inheritance planner, parentInfo does not correspond to the true parent relation; it corresponds to one of the other partitions. And when the selected partition is valid (it has the same relation descriptor as the parent) and parentInfo corresponds to an invalid partition (it does not match the parent descriptor), the ri_partInsertMap could be built by mistake, which led to unnecessary tuple reformatting and invalid results. This patch solves this last issue by adding a check inside the targetid_get_partition function, which ensures that estate->es_result_relations corresponds to the true root relation. That holds for all cases when the DML is executed on the parent relation, except UPDATE by the legacy planner. If the check passes, the ri_partInsertMap is built; otherwise it is not. Moreover, during such an UPDATE there is no need to reconstruct the tuple at all, because each ModifyTable subplan already yields a valid tuple.
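A condensed reproduction sketch of the first scenario described above (column names and partition bounds are ours):

```sql
-- Sketch: the parent carries a dropped-column "hole", the new leaf does not,
-- so their physical attribute numbers differ.
CREATE TABLE part_tab (a int, dropme int, b int)
DISTRIBUTED BY (a)
PARTITION BY RANGE (b) (START (0) END (10) EVERY (5));

ALTER TABLE part_tab DROP COLUMN dropme;

-- The newly added partition has no dropped columns in its layout.
ALTER TABLE part_tab ADD PARTITION extra START (10) END (20);

-- Before the fix, partition selection for this leaf-level INSERT was driven by
-- the parent's attribute numbers and could pick the wrong partition or none.
INSERT INTO part_tab_1_prt_extra VALUES (1, 15);
```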
The gpdb test suite requires enabled IPv6 since 7f3c91f. But there is at least one src/test/ssl test that expects name resolution to an IPv4 address and fails in the IPv6 environment with:
```
psql: FATAL:  no pg_hba.conf entry for host "2001:db8:1::242:ac11:2", user "ssltestuser", database "postgres", SSL on
pg_regress: cannot read the result (using postmaster on 9f17c91f4c76, port 6000)
```
So, at least temporarily, we try to change the precedence of name resolution in favor of IPv4.
Sync 6.26.0 changes to develop branch
Allure report: https://allure-ee.adsw.io/launch/59970
Failed job Resource group isolation tests on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863102
Failed job Resource group isolation tests on ppc64le: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863103
Failed job Regression tests with ORCA on x86_64: https://gitlab.adsw.io/arenadata/github_mirroring/gpdb/-/jobs/863096