Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tie::File ruins UTF-8 encoded files [rt.cpan.org #96014] #17494

Open
toddr opened this issue Jan 19, 2020 · 6 comments
Open

Tie::File ruins UTF-8 encoded files [rt.cpan.org #96014] #17494

toddr opened this issue Jan 19, 2020 · 6 comments
Labels
dist-Tie-File issues in the dual-life blead-first Tie-File distribution Feature Request

Comments

@toddr
Copy link
Member

toddr commented Jan 19, 2020

Migrated from rt.cpan.org#96014 (status was 'new')

Requestors:

From [email protected] on 2014-05-27 21:10:00
:

It is not safe to use Tie::File to operate on a file with a multibyte encoding, such as UTF-8. This has been a defect in Tie::File since it was first written. I was aware at the time that this would be a problem, but at the time it seemed too hard to fix. I know a little bit more now, and the problem has become considerably more urgent since 2005. It should be fixed. I would like to fix this, but I need some help.

The essential problem is as follows. Tie::File holds an array that records the tell() position at which each record in the file begins, as reported by seek(). Tie::File makes the not-completely-warranted assumption that these offsets are actually byte counts, but I think this is a minor issue.

The major issue this this: Suppose that record 3 begins at tell() offset 100, and record 4 begins at tell() offset 150. In some places Tie::File then concludes from this that record 3 is 50 bytes long, which I think is generally correct.

But in other places Tie::File will calculate the length of record 3 by reading it in and calling length() on the result. Since what is wanted is the length in bytes, this is wrong except for legacy encodings where each character is one byte.

Now suppose Tie::File is asked to replace record 3 with some string $s. Tie::File uses length() on $s to find out how many bytes $s is, which is wrong, and it uses length() on record 3 similarly wrongly. Then it compares these two wrong lengths to decide whether record 3 can be overwritten with $s in-place, or whether the tail of the file needs to be copied upwards (if $s is shorter than the old record 3) or downwards (if it is longer). All these calculations are being done with character lengths, but they should be done with byte lengths. Since they are done wrongly, Tie::File mangles the data file when it modifies the record.

On review, I see the following relevant ten-year-old comment:

 # length($oldrec) here is not consistent with text mode  TODO XXX BUG   

The enclosed program demonstrates the problem. It writes a correct UTF-8 encoded file, copies it to stdout, then uses Tie::File to modify a record in the middle, then copies the resulting carbled file to stdout.

What I think Tie::File needs to do is to find out how many bytes $s will occupy once written to the file, and use that in place of length($s) in its length calculations. When I wrote Tie::File in 2005 there was no way to do this, Encode not having been invented. But I think now I can use Encode to transform $s to a suitably-encoded byte string, and then use the existing Tie::File machinery to write the byte string to the file. But I'm not sure this is correct; I need someone with some domain knowledge to help me make the right changes.

Also it seems to me that to choose the correct encoding, Tie::File needs some way to interrogate the filehandle it is given, if it is given one, and it needs to allow an "encoding => ...." option to be supplied in the tie() call when it is given a filename. There may be other encoding-related options it should support that I have no thought of.


#!/usr/bin/perl

use Tie::File;
use Fcntl;

{ open my($fh), ">:raw", "tf-test-data" or die $!;
  print $fh "Fl\303\274ghafen
Chinese:\344\270\255\345\234\213\345\223\262\345\255\270\346\233\270\351\233\273\345\255\220\345\214\226\350\250\210\345\212\203
Potat\303\270es\n";
}

binmode(STDOUT, ":utf8");
{ open my($fh), "+<:utf8", "tf-test-data"
  or die $!;
  print while <$fh>;
  print "-------\n";
}

{
  open my($fh), "+<:utf8", "tf-test-data";
  my @A;
  tie @A, Tie::File => $fh or die;
  $A[1] = "octopus";
}

{ open my($fh), "+<:utf8", "tf-test-data"
  or die $!;
  print while <$fh>;
  print "-------\n";
}
__DATA__
@toddr
Copy link
Member Author

toddr commented Jan 19, 2020

@khwilliamson any ideas?

@khwilliamson
Copy link
Contributor

I think @tonycoz and @Leont may be better for this than I. I don't know if github will send them this, as it didn't autocomplete their ids.

This might actually be a legitimate use of the bytes pragrma. One could write

sub byte_length {
use bytes;
return length shift;
}

@toddr
Copy link
Member Author

toddr commented Jan 27, 2020

I'm transferring this issue to https://github.com/Perl/perl5 since it is issue with a module in dist.

@toddr toddr transferred this issue from Dual-Life/Tie-File Jan 27, 2020
@toddr toddr added dist-Tie-File issues in the dual-life blead-first Tie-File distribution Needs Triage Feature Request labels Jan 27, 2020
@khwilliamson
Copy link
Contributor

@tonycoz @Leont any thoughts

@Leont
Copy link
Contributor

Leont commented Apr 17, 2022

I don't think this really can work. :encoding makes sense on a stream, for random access you'd want Tie::File to do the encoding itself.

brainbuz added a commit to brainbuz/perl5 that referenced this issue Jan 23, 2024
…ire 5.005.

This partially addresses Perl#17494 but does resolve the issue.
@brainbuz
Copy link
Contributor

I've made an attempt at fixing this issue, https://github.com/brainbuz/perl5/tree/trytofix17494.
Where previously the file could be badly mangled, I'm getting much smaller errors, and seeing weird behavior of an additional character at the beginning of the following line.

brainbuz added a commit to brainbuz/perl5 that referenced this issue Jan 26, 2024
…ire 5.005.

This partially addresses Perl#17494 but does resolve the issue.
brainbuz added a commit to brainbuz/perl5 that referenced this issue Jan 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dist-Tie-File issues in the dual-life blead-first Tie-File distribution Feature Request
Projects
None yet
Development

No branches or pull requests

5 participants