Unexpected content length doc_length and duplicate_ngram_chr_fraction #357

smileBeda · 2024-07-10T06:23:24Z


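import textdescriptives as td

def evaluate_content_quality(text):
    # preprocess_text is my own helper; it strips all HTML with Beautiful Soup
    processed_text = preprocess_text(text)
    # metrics=None extracts all available metrics
    df = td.extract_metrics(text=processed_text, spacy_model="en_core_web_lg", metrics=None)
    return df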

The text is pre-processed with Beautiful Soup to remove all HTML.
I print the results in a table.
duplicate_ngram_chr_fraction_5 is 0.159919, so the evaluation does not pass: passed_quality_check is False.

Additionally, the results indicate a doc_length of 733, which cannot be right (as long as the docs are correct and it truly counts characters and nothing else).

This is the source text, exactly as used:

Introduction to WordPress Caching 
Caching stores copies of files or data, allowing faster access to frequently requested content. For WordPress, caching can greatly improve your website's performance. It reduces server load and speeds up page load times, enhancing the overall user experience. 
By using caching, your server doesn't need to generate content from scratch each time someone visits your site. FastCGI Page Cache and APCu Object Cache are two powerful tools for achieving this. In this guide, you'll learn how to set up both caching methods in an Nginx and PHP 8 environment. 
Why Use FastCGI Page Cache and APCu Object Cache? 
FastCGI Page Cache stores static copies of your dynamic pages. This reduces the need for PHP processes and database queries on every request, greatly enhancing your site's performance. 
APCu Object Cache stores objects like database query results in memory. This reduces the need for frequent database access, resulting in faster load times for your website. Both caching methods not only improve the user experience but also positively impact SEO by reducing server response times and enhancing load speeds. 
Setting Up FastCGI Page Cache on Nginx 
Let's dive into setting up FastCGI Page Cache on your WordPress website to take advantage of these performance enhancements. 
Step 1: Modify Nginx Configuration 
First, open your Nginx configuration file, usually located at . 
Add the following configuration within the server block: 
You can create a cache zone and learn how to store and retrieve cached content with this setup. Next, ensure the cache directory is created and properly set up in the following step. 
Step 2: Create Cache Directory 
Create the cache directory by running the following command: 
Then, set the appropriate permissions with: 
Setting the correct permissions ensures that Nginx can read from and write to the cache directory. Once done, reload Nginx to apply these changes. 
Step 3: Reload Nginx 
Reload Nginx to apply the new configuration: 
Reloading Nginx activates your new cache settings, allowing FastCGI Page Cache to start working immediately. 
By setting up the FastCGI Page Cache, you'll significantly reduce server load and improve page load times, making your site faster for users. 
Setting Up APCu Object Cache 
APCu Object Cache can take your website's performance to the next level by storing frequently accessed objects in memory. 
Step 1: Install APCu 
To install the APCu extension for PHP 8, run the following command: 
Once APCu is installed, you need to configure WordPress to utilize this caching tool. 
Step 2: Configure WordPress 
Add the following lines to your  file to enable APCu caching in WordPress: 
This configuration ensures that WordPress uses APCu for object caching. Next, install a caching plugin to fully integrate APCu with WordPress. 
Step 3: Install a Caching Plugin 
To fully integrate APCu with WordPress, install a caching plugin like W3 Total Cache or WP Super Cache. These plugins offer options to use APCu for object caching, enhancing the performance benefits. 
By setting up APCu Object Cache, you can reduce the load on your database, leading to faster page load times and a smoother user experience. 
Verifying Your Configuration 
To ensure that your caching setup is functioning correctly, use tools like Google PageSpeed Insights or GTmetrix to analyze your website's performance. 
Look for improvements in load times and reduced server response times. These tools will help you verify that your caching configurations are working as expected. 
Reap the Benefits of Enhanced Performance 
By setting up FastCGI Page Cache and APCu Object Cache, you're not only improving your website's speed and performance but also enhancing the overall user experience and potentially boosting your search engine rankings. 
Take advantage of these caching techniques to make your WordPress site faster and more efficient. 
Happy caching!

I am at a loss to understand why the duplicate check does not pass, and why doc_length reports just 700+ when the text is really about 3000+ characters.
The issue is reproducible at https://huggingface.co/spaces/HLasse/textdescriptives as well, using the exact text above and choosing not to split at newlines.

smileBeda · 2024-07-10T06:46:39Z

Author

... I can solve the first issue with something like:

import re
text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)  # Replace anything that is not a letter, digit, or whitespace with a space
text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace into a single space

Apparently it has trouble with some invisible characters or newlines?
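For reference, the cleanup above can be wrapped in a small function to see what it does to a string. This is only a sketch: `clean_text` is a hypothetical helper name, and the final `.strip()` is an addition that is not part of the two lines above. Note the side effect on contractions, which lose their apostrophe and split into two words.

```python
import re


def clean_text(text: str) -> str:
    """Apply the two-step cleanup: drop special characters, then collapse whitespace."""
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace (spaces, newlines, non-breaking spaces) into one space.
    text = re.sub(r"\s+", " ", text)
    # Assumption: trim leading/trailing spaces; the original snippet does not do this.
    return text.strip()


# A non-breaking space, a double space, and a newline all end up as single spaces.
print(clean_text("Step 1:\u00a0Modify  Nginx\nConfiguration!"))
# The apostrophe is replaced too, so the contraction becomes two words.
print(clean_text("you'll"))
```

Because punctuation is removed as well, the cleaned text no longer has sentence boundaries, which may matter for other metrics.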

However, the second issue persists even after the above cleanup, which results in a single-line string.
Another example:
Introduction to WordPress Caching Caching is a crucial aspect when it comes to improving the performance of your WordPress website By using caching mechanisms like FastCGI Page Cache and APCu Object Cache you can significantly reduce server load and increase page load speeds In this guide we will walk you through the process of setting up FastCGI Page Cache and APCu Object Cache on your WordPress website specifically tailored for an Nginx server environment Why Use FastCGI Page Cache and APCu Object Cache FastCGI Page Cache and APCu Object Cache are powerful tools for enhancing WordPress performance Here are some reasons why you should consider using them Improved Performance Both caching mechanisms can dramatically speed up your website Reduced Server Load Caches reduce the number of database queries and PHP executions Better User Experience Faster load times lead to happier visitors Setting Up FastCGI Page Cache FastCGI Page Cache stores static HTML versions of your pages reducing the need for dynamic processing Here s how to set it up Install Nginx Ensure Nginx is installed on your server You can install it using Configure Nginx for FastCGI Cache Edit your Nginx configuration file usually located at Add the following directives Restart Nginx Apply the changes by restarting Nginx Setting Up APCu Object Cache APCu Object Cache stores frequently accessed data objects in memory speeding up database operations Follow these steps to set it up Install APCu Install APCu via PECL Enable APCu Add APCu to your PHP configuration by creating or editing the file Configure APCu Edit the file to include APCu settings Restart PHP FPM Apply the changes by restarting PHP FPM Configuring WordPress to Use APCu Object Cache To fully utilize APCu Object Cache you ll need to configure WordPress to use it Here s how Install APCu WordPress Plugin Install a plugin like APCu Object Cache from the WordPress Plugin Repository Configure the Plugin Once installed activate the plugin and follow 
its instructions for configuration Most plugins will automatically detect APCu and start using it Final Thoughts on WordPress Cache Settings Setting up FastCGI Page Cache and APCu Object Cache on your WordPress website can significantly improve its speed and performance By following the steps outlined in this guide you can create a more responsive and efficient site for your visitors Remember website caching is a vital aspect of WordPress performance optimization so keep your cache settings up to date for the best results
This string has over 2000 characters, yet doc_length reports 407, which is the word count!
So the docs are probably inaccurate, since they state:

field doc_length: Tuple[Optional[float], Optional[float]] = (10, 100000)
A Range for the document length. Default: (10, 100_000), i.e. between 10 and 100_000 **characters**.
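A quick way to see how far apart the two counts are is to compare the character count with a whitespace word count. This is only a rough sketch: `str.split()` stands in for the library's tokenizer here (an assumption for illustration; real token counts will differ, for example around punctuation), and `length_report` is a hypothetical helper name.

```python
def length_report(text: str) -> dict:
    """Return the character count and a whitespace-based word count for a text."""
    return {
        "characters": len(text),
        # Assumption: whitespace splitting as a stand-in for real tokenization.
        "whitespace_words": len(text.split()),
    }


sample = "Introduction to WordPress Caching Caching is a crucial aspect"
report = length_report(sample)
print(report)
```

If doc_length tracks the second number rather than the first, a range documented in characters will be off by roughly the average word length.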


KennethEnevoldsen · 2024-07-10T11:03:08Z

Collaborator

Additionally, the results indicate a doc_length of 733, which cannot be right (as long as the docs are correct and it truly counts characters and nothing else).

I believe it is the number of spaCy tokens. So yes, the docs are wrong.

For the other error, how does duplicate_ngram_chr_fraction_5 change when applying the fix? The n-grams used are spaCy n-grams, and spaCy sometimes treats e.g. double white spaces as a separate token. I could imagine it doing something similar with special tokens.
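To illustrate how sensitive such a metric is to tokenization, here is a simplified sketch of a duplicate n-gram character fraction. It is not the TextDescriptives implementation: the function name `duplicate_ngram_char_fraction` is hypothetical, tokens come from plain whitespace splitting rather than spaCy, and the fraction is computed over token characters only.

```python
from collections import Counter


def duplicate_ngram_char_fraction(tokens: list[str], n: int) -> float:
    """Share of token characters covered by n-grams that occur more than once."""
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Mark every token that lies inside at least one repeated n-gram.
    covered = [False] * len(tokens)
    for start, gram in enumerate(ngrams):
        if counts[gram] > 1:
            for i in range(start, start + n):
                covered[i] = True
    total_chars = sum(len(tok) for tok in tokens)
    duplicate_chars = sum(len(tok) for tok, is_dup in zip(tokens, covered) if is_dup)
    return duplicate_chars / total_chars if total_chars else 0.0


# With ":" as its own token, "run the following command :" is a repeated 5-gram.
with_colon = "run the following command : then run the following command : done".split()
# Without the extra token, no 5-gram repeats in this short example.
without_colon = "run the following command then run the following command done".split()

print(duplicate_ngram_char_fraction(with_colon, 5))
print(duplicate_ngram_char_fraction(without_colon, 5))
```

In this toy example a single extra token per phrase is enough to move the 5-gram fraction from zero to most of the text, which is why stray whitespace or special-character tokens can shift the reported value.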


smileBeda · 2024-07-12T05:15:45Z

Author

Yes, I can solve the n-gram issue with the "fix" I mentioned (basically just stripping out whatever is not expected in a "normal" text). "Fix" means the value becomes much lower, as expected.
Understood about the token length. (It is odd that it counts tokens, since in that specific example it also matched the exact word count, which is rarely identical to any sort of token count, but then I am not entirely sure how spaCy actually tokenizes content.)

I guess this is solved then, pending the doc update!
Thanks.

2 replies

KennethEnevoldsen Jul 12, 2024
Collaborator

spaCy tokens are approximately words (but the definition of a word varies wildly depending on the language). Even within English you have words like "don't" (is it 2 or 1?).

KennethEnevoldsen Jul 12, 2024
Collaborator

(Added a PR for the doc update: #358)
