-
Notifications
You must be signed in to change notification settings - Fork 275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Data Liberation] Topological sorter, entities remapping and add missing imports #2030
base: trunk
Are you sure you want to change the base?
Conversation
c51d9c4
to
7778714
Compare
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Topological_Sorter.php
Outdated
Show resolved
Hide resolved
85c850b
to
78f1cb3
Compare
78f1cb3
to
5af5722
Compare
* Quicksort performs badly on already sorted arrays, O(n^2) is the worst case. | ||
* Let's consider using a different sorting algorithm. | ||
*/ | ||
uksort( $elements, $sort_callback ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires fitting all $elements
into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner than later. Is it possible to perform topological sorting with at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can save the data in a custom DB table (Jetpack have a custom wp_jetpack_sync_queue
table) instead of a simple array in memory. I didn't disturb the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. At the end of collecting the byte offsets, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above
, and we are done, even with our new reentrancy model.
When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
Agree. It's reasonable, let's proceed this way. 👍
495be8b
to
e95618e
Compare
@@ -93,7 +93,7 @@ public function __construct( $options = array() ) { | |||
$this->mapping['term_id'] = array(); | |||
$this->requires_remapping = $empty_types; | |||
$this->exists = $empty_types; | |||
$this->logger = new Logger(); | |||
$this->logger = isset( $options['logger'] ) ? $options['logger'] : new WP_Logger(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm actually not sure we need a logger beyond _doing_it_wrong()
. All the progress information are now exposed as numbers and the API consumer may implements its own logging using any technique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, thanks.
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
if ( true === $this->topological_sort_next_entity( $count ) ) { | ||
return true; | ||
} | ||
|
||
// We indexed all the entities. Now sort them topologically. | ||
$this->topological_sorter->sort_topologically(); | ||
$this->topological_sorter = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reads confusing to me:
- Sort every entity we encounter and short circuit while we have entities
- Once we run out of entities to sort, sort them
What would be another way to call these operations?
packages/playground/data-liberation/src/import/WP_Stream_Importer.php
Outdated
Show resolved
Hide resolved
packages/playground/data-liberation/src/import/WP_Topological_Sorter.php
Outdated
Show resolved
Hide resolved
/** | ||
* The base name of the table. | ||
*/ | ||
const TABLE_NAME = 'data_liberation_index'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is called index, but it's not used in the index_next_entities()
stage, only in the topological sort stage. How about data_liberation_topological_index
or something to that effect?
* Run in the 'plugins_loaded' action. | ||
*/ | ||
public static function load() { | ||
if ( self::DB_VERSION !== (int) get_site_option( self::OPTION_NAME ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused about the distinction between load()
and activate()
– could we either have one or document them a bit more to explain the usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The activation/deactivation is when a user clicks "activate" and "deactivate" link. The load action is usually used if you need to change schema:
The best practices are here:
- Installation, activation steps: https://developer.wordpress.org/plugins/plugin-basics/activation-deactivation-hooks/
- Custom tables: https://learn.wordpress.org/lesson/custom-database-tables/
- Schema change: https://developer.wordpress.org/plugins/creating-tables-with-plugins/
Names can be misleading. Let me add more docs.
$table_name = self::get_table_name(); | ||
|
||
// Create the table if it doesn't exist. | ||
// @TODO: remove this custom SQLite declaration after first phase of unit tests is done. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the CREATE TABLE statement doesn't work with the legacy SQLite integration we're using now? cc @JanJakes this could be useful for testing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JanJakes it is, thanks! Removed the special SQL.
/** | ||
* Run by register_deactivation_hook. | ||
*/ | ||
public static function deactivate() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would there be any advantage to running this when the plugin is uninstalled and not just deactivated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've read this here https://learn.wordpress.org/lesson/custom-database-tables/:
If the users of your plugin do not need the data in this table if they deactivate the plugin, you could trigger this on the plugin deactivation hook.
register_deactivation_hook( __FILE__, 'wp_learn_delete_table' );
However, if the data in that table is important, and your users might want to keep it, even if the plugin is deactivated, you could delete the table using one of the two uninstall methods available to plugins.
Our data is not "important" in the strict sense of the term. It can be recreated again; it's just bulky. So, removing it when the user deactivates the plugin is okay. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good 👍
* @param int $byte_offset The byte offset of the category. | ||
* @param array $data The category data. | ||
*/ | ||
public function map_category( $byte_offset, $data ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map makes me think of map/reduce and a mapper callback. How about insert_category
or add_category
or index_category
? Ditto for the other map_ method
'element_id' => (string) $data['term_id'], | ||
'parent_id' => $category_parent, | ||
'byte_offset' => $byte_offset, | ||
// Items with a parent has at least a sort order of 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's document this more. What's the logic behind the sort_order of 1, 2 etc? Should it be more for deeply nested categories or not?
* | ||
* @var int | ||
*/ | ||
protected $orphan_post_counter = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to restore this (or any other) value after stopping and resuming the import?
* | ||
* @return int|bool The byte offset of the category, or false if the category is not found. | ||
*/ | ||
public function get_category_byte_offset( $session_id, $slug ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many methods can we make private
? Every public method makes a public API and gets covered by the WordPress indefinite BC policy upon merge to WordPress core.
|
||
return $wpdb->get_var( | ||
$wpdb->prepare( | ||
'SELECT byte_offset FROM %i WHERE element_id = %s AND element_type = %d AND session_id = %d LIMIT 1', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to fetch these in chunks of 100 or 1000 to avoid hitting the database every time? Also, noting the docstring says "and remove it from the list" while the method doesn't perform the removal.
|
||
// MySQL version - update sort_order using a subquery | ||
return $wpdb->query( | ||
$wpdb->prepare( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How well does this perform on large datasets? Running a single, table-wide UPDATE that would potentially affect a few hundred million rows is making me anxious, although that's just a gut feeling.
ecc7336
to
e3ba973
Compare
* We create a custom table that contains the IDs and the new IDs created in the | ||
* target system sorted in the parent-child order. | ||
* | ||
* This class extends the WP_WXR_Entity_Reader class and overrides the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neat idea! I initially thought about ingraining the topological sorting knowledge in the streaming importer to apply it to all possible data sources and worried about disabling it for, say, Markdown imports.
Thinking about that more, WXR may be the only source we'll see in the near future that yields data in a child-first order. I kind of like keeping and maturing this logic in the entity reader class. Even if we need to move it to the importer one day, the API will mostly stay the same. Cool!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see, the logic is split between this reader the Importer. Hm. Couldn't we contain it to just one of these places?
* read_next_entity function to emit the entities in the correct order. | ||
* | ||
* List of entities Sort order | ||
* entity 1 entity 1 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does the number 3
stand for? Is it "max depth of descendants"?
* The base name of the table used to store the IDs, the new IDs and the | ||
* sort order. | ||
*/ | ||
const TABLE_NAME = 'data_liberation_map'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If sorting is WXR-specific, let's give this table a WXR-specific name that also mentions sorting. Maybe data_liberation_wxr_import_topologically_sorted_entities
? It's long, sure, but it's clear.
* The current session ID. | ||
*/ | ||
protected $current_session = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's clarify the word "session" as it can have multiple meanings in a PHP app
* The current session ID. | |
*/ | |
protected $current_session = null; | |
* The current import session ID. | |
*/ | |
protected $import_session_id = null; |
// Initialize WP_WXR_Entity_Reader. | ||
$reader = parent::create( $upstream, $cursor, $options ); | ||
|
||
if ( array_key_exists( 'post_id', $options ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the names consistent – if it's a session ID, let's call the option session_id
if ( ! empty( $next_cursor ) ) { | ||
$next_cursor = json_decode( $next_cursor, true ); | ||
|
||
/*if ( ! empty( $next_cursor ) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be restored or removed? I remember you've mentioned an infinite loop – was it here?
* Run during the register_activation_hook or similar actions. It creates | ||
* the table if it doesn't exist. | ||
*/ | ||
public static function create_or_update_db() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's clarify this is about a table
public static function create_or_update_db() { | |
public static function create_or_update_db_table() { |
* Run by register_deactivation_hook or similar. It drops the table and | ||
* deletes the option. | ||
*/ | ||
public static function delete_db() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: Let's clarify this is about a table
* @param array $data The data to map. | ||
* @param mixed $cursor_id The stream cursor ID. | ||
*/ | ||
public function add_next_entity( $entity = null ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this logic be moved to the stream importer? I'm confused seeing a Reader that's also a Writer. Also, the method name add_next_entity
suggests I can append an entity that's not in the original WXR file, which is also confusing.
return false; | ||
} | ||
|
||
$entity = $entity ?? $this->current(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but for posterity:
I'm having second thoughts about the "every entity reader is an iterator" idea.
There's get_entity()
and current()
. The latter has a name that suggests it's the same as the former, but it doesn't actually do the same thing. The PHP iterator design assumes a newly created iterator is already initialized and current()
comes before next()
, which is in direct opposition to the Reader that expects next_entity()
to be called before get_entity()
.
I'm thinking we could either remove the iterator-ness completely, or create a single new Entity_Reader_Iterator( $reader )
class to have a clearly separated single point of entry.
|
||
// Default sort order is 1. | ||
$sort_order = 1; | ||
$cursor_id = $this->get_reentrancy_cursor(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or just $cursor
?
$parent_id_type = null; | ||
$check_existing = true; | ||
|
||
// Map the parent ID if the entity has one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we mapping IDs while sorting? I think this already came up in one of the previous versions of the sorter. Mapping seems like a separate problem we'll need to handle for every entity source, not just WXR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may be confused here. It seems like "mapping parent ID" refers to assigning the parent
value to $new_entity
and not the "reading a post with one ID but inserting it with another parent ID." Also, entity
refers to the thing we import, $new_entity
refers to a row in the database table we use for sorting.
As I read, I find myself slowing down frequently to understand the ideas behind the words. I'd like to ask you to revisit the terminology used in this PR, especially in comments and variable and function names. Let's make the words more precise than they need to be. My goal is to make all the concepts unambiguous so that a junior WordPress dev could read it without getting confused about any part of the codebase. It will make the next review much easier for me.
You can find a discussion here with a more detailed explanation: #2090
Motivation for the change, related issues
Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR.
Implementation details
Entities preloading/loading/mapping
The topological sort happens during a
STAGE_TOPOLOGICAL_SORT
phase runt before theSTAGE_FRONTLOAD_ASSETS
.This PR removes the
WP_Entity_Importer::mapping
andWP_Entity_Importer::exists
arrays and all memory preload. It is slow and memory-consuming, with thousands of entries and not support sessions. It adds a new table with import IDs and mapped IDs. During the first phase, it is prefilled. During theWP_Entity_Importer
import it maps the imported IDs by using thewxr_importer_*
filters and actions.New WP-CLI scriptThis PR also introduces the new CLI script and moves the logger there.moved to #2104New unit tests
Added all the 130 WordPress core unit tests. Updated the WXRs to latest version.Added aPlaygroundTestCase
base class that cleanup the WordPress database after a test, as_delete_all_data
doesMoved to [Data Liberation] Add support for terms meta and new unit tests #2105
Missing imports and hierarchies
Added support for:
Term metaNew PHPUnit filterThis PR adds aPHPUNIT_FILTER
constant topackages/playground/data-liberation/tests/import/blueprint-import.json.
If the value is not falsy it will be passed to PHPUnit when callingnpx nx run playground-data-liberation:test:wp-phpunit
. So, for example, you can set"PHPUNIT_FILTER": "WPRewriteUrlsTests".
It will be the same as runningphpunit --filter WPRewriteUrlsTests
.Moved to #2105
Testing Instructions (or ideally a Blueprint)
Unit tests
Many of the tests can be run only in a real WordPress environment, that's why
wp-phpunit
exists.Data test
.xml
you find inpackages/playground/data-liberation/tests/wxr/*
About to add and pass all the tests found here https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.
New scriptor:It accepts a file, a URL, or a folder. Runmoved to #2104wp help data-liberation import
to see all options.