ilib-loctool-regex: Add escaping support #75

Open
wants to merge 5 commits into
base: main
9 changes: 9 additions & 0 deletions .changeset/mighty-foxes-breathe.md
@@ -0,0 +1,9 @@
---
"ilib-loctool-regex": minor
---

- Added the ability to specify the escaping style for
strings that are extracted by the regular expressions
- supports all escaping styles published by
ilib-tools-common
- supports extra "none" style to turn off unescaping
4 changes: 2 additions & 2 deletions git-hooks/pre-push
@@ -1,9 +1,9 @@
#!/bin/bash

files=$(git diff --cached --name-only --diff-filter=ACM main)
files=$(git diff --cached --name-only --diff-filter=ACM main | grep '\.js$')

# Check for debugger statements in JavaScript files
lines=$(grep -n 'debugger' $files)
lines=$(grep -n 'debugger;' $files)
Contributor Author

The git hook wouldn't let me push this branch because the README.md mentions "debugger". So I made the check apply only to .js files, and only to the statement form "debugger;". That's why this change had to be included here.

if [ "$lines" != "" ]
then
echo "Debugger statement found in:"
53 changes: 41 additions & 12 deletions packages/ilib-loctool-regex/README.md
@@ -114,6 +114,14 @@ used within the `regex` property:
quite long, but it is always unique.
- "truncate" - use the first 32 characters of the source
string as the key. This fixed-length key is usually unique.
- escapeStyle - the style of unescaping to use when the regular expression
matches. The valid styles include
"csharp" and "js" (the default), as well as many others. The
full list of styles available is given in the documentation
for the [ilib-tools-common library](https://github.com/iLib-js/ilib-mono/blob/main/packages/ilib-tools-common/docs/ilibToolsCommon.md#escaperFactory).
In addition to the styles listed there, the escapeStyle setting
can also be set to "none" to disable escaping altogether for strings
that are extracted using this regular expression.
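
To illustrate why the style is chosen per expression, here is a minimal sketch of how the same extracted text comes out under three styles. The functions below are hypothetical stand-ins written for this example; they are not the `ilib-tools-common` implementation, and the double-quoted version covers only an assumed subset of PHP escape sequences.

```javascript
// Hypothetical stand-ins for illustration only. The real unescapers come from
// escaperFactory() in ilib-tools-common and handle many more cases.

// PHP single-quoted strings: only \\ and \' are escape sequences.
function unescapePhpSingle(str) {
    return str.replace(/\\([\\'])/g, "$1");
}

// PHP double-quoted strings: a few common sequences (assumed subset).
function unescapePhpDouble(str) {
    var map = { n: "\n", t: "\t", r: "\r", '"': '"', "\\": "\\", "$": "$" };
    return str.replace(/\\([ntr"\\$])/g, function(match, ch) {
        return map[ch];
    });
}

// "none" leaves the extracted text exactly as it appeared in the source file.
function identity(str) {
    return str;
}

var unescapers = {
    "php-single": unescapePhpSingle,
    "php-double": unescapePhpDouble,
    "none": identity
};

// The text as it would appear between the quotes in a PHP file:
// a backslash followed by the letter n, not a real newline.
var raw = "Line one\\nLine two";

console.log(JSON.stringify(unescapers["php-double"](raw))); // backslash-n becomes a newline
console.log(JSON.stringify(unescapers["php-single"](raw))); // backslash-n is kept as two characters
console.log(JSON.stringify(unescapers["none"](raw)));       // text is untouched
```

With a style that does not match the quoting used in the source file, the extracted source string may differ from the string the program actually holds in memory at run time, so translations may fail to match.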

### Example Configuration

@@ -135,29 +143,41 @@ Example configuration for a web project with PHP and JavaScript files:
"sourceLocale": "en-US",
"expressions": [
{
"expression": "translate\\s*(\\s*['\"](?<source>[^'\"]*)['\"]\\s*\\)",
"expression": "translate\\s*\\(\\s*\"(?<source>[^\"]*)\"\\s*\\)",
"flags": "g",
"datatype": "php",
"resourceType": "string",
"keyStrategy": "source"
"keyStrategy": "source",
"escapeStyle": "php-double"
},
{
"expression": "translate\\s*\\(\\s*['\"](?<source>[^'\"]*)['\"]\\s*,\\s*['\"](?<key>[^'\"]*)['\"]\\s*\\)",
"expression": "translate\\s*\\(\\s*'(?<source>[^']*)'\\s*\\)",
"flags": "g",
"datatype": "php",
"resourceType": "string"
"resourceType": "string",
"keyStrategy": "source",
"escapeStyle": "php-single"
},
{
"expression": "translateArray\\s*\\(\\s*\\[\\s*(?<source>['\"][^'\"]*['\"](\\s*,\\s*['\"][^'\"]*['\"])*)\\s*\\]\\s*\\)",
"expression": "translate\\s*\\(\\s*\"(?<source>[^\"]*)\"\\s*,\\s*\"(?<key>[^\"]*)\"\\s*\\)",
"flags": "g",
"datatype": "php",
"resourceType": "array"
"resourceType": "string",
"escapeStyle": "php-double"
},
{
"expression": "translatePlural\\s*\\(\\s*['\"](?<source>[^'\"]*)['\"]\\s*,\\s*['\"](?<sourcePlural>[^'\"]*)['\"]",
"expression": "translateArray\\s*\\(\\s*\\[\\s*(?<source>\"[^\"]*\"(\\s*,\\s*\"[^\"]*\")*)\\s*\\]\\s*\\)",
"flags": "g",
"datatype": "php",
"resourceType": "plural"
"resourceType": "array",
"escapeStyle": "php-double"
},
{
"expression": "translatePlural\\s*\\(\\s*\"(?<source>[^\"]*)\"\\s*,\\s*\"(?<sourcePlural>[^\"]*)\"",
"flags": "g",
"datatype": "php",
"resourceType": "plural",
"escapeStyle": "php-double"
}
]
},
@@ -188,15 +208,21 @@ given regular expressions. Explanation of the above regexes:
that are passed as the first parameter to the `translate` function. It will match
a string like `translate("string to translate")`. Since the string does not have
a unique id, one is generated using the `source` strategy. That is, the source
string itself is re-used as its own unique id.
1. The second regular expression extracts strings that are passed as the first
string itself is re-used as its own unique id. Note that this regular expression
extracts strings with double quotes around them. The `escapeStyle` setting is
used to specify that the `php-double` style should be used to unescape the string.
1. The second regular expression is similar to the first, but extracts strings
that use single quotes instead of double quotes. The `escapeStyle` setting is
used to specify that the `php-single` style should be used to unescape the string.
(Unescaping is different between single and double quoted strings in PHP.)
1. The third regular expression extracts strings that are passed as the first
parameter to the `translate` function and the second parameter is the
key of the string. It will match a string like `translate("string to translate", "unique.id")`.
1. The third regular expression is an example of an array translation. The
1. The fourth regular expression is an example of an array translation. The
`source` capturing group will have a value like `"a", "b", "c"` which this plugin
will transform into an array of 3 strings. This will match a string like
`translateArray(["a", "b", "c"])`.
1. The fourth regular expression is an example of a plural translation. The
1. The fifth regular expression is an example of a plural translation. The
first parameter to the `translatePlural` function is the singular string and is
assigned to the `source` capturing group. The second parameter is the plural
string and is assigned to the `sourcePlural` capturing group. This creates a plural
@@ -225,6 +251,9 @@ the `hash` strategy. That is, the hash of the source string is calculated
and prepended with an "r" for "resource" (eg. "r34523234") and that is used
as the unique id for that string.

Note that when the `escapeStyle` setting is not given, the default escape style
`js` is used, which is why it is not specified in the last mapping example.
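
The mappings described in this section rely on named capturing groups such as `source` and `key`. As a rough sketch of how such an expression behaves once the JSON config has been decoded, the following runs an expression shaped like the `translate("string", "unique.id")` mapping against a made-up PHP snippet. The sample text and variable names are assumptions for this example, and the plugin's own matching loop may differ in detail.

```javascript
// Made-up PHP source text, for illustration only.
var php = [
    'echo translate("Hello", "greeting.hello");',
    'echo translate("Goodbye", "greeting.bye");'
].join("\n");

// Same shape as the two-parameter translate() expression in the config.
// JSON and JavaScript string literals use the same escapes here, so the
// pattern can be copied as-is.
var expression = "translate\\s*\\(\\s*\"(?<source>[^\"]*)\"\\s*,\\s*\"(?<key>[^\"]*)\"\\s*\\)";
var regex = new RegExp(expression, "g");

// With the "g" flag, exec() resumes from lastIndex on each call, so a loop
// visits every match in the file.
var found = [];
var result;
regex.lastIndex = 0;
while ((result = regex.exec(php)) !== null) {
    found.push({ key: result.groups.key, source: result.groups.source });
}

console.log(found);
```

The raw text captured by each group is what the configured escape style is then applied to.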

### Resource Type Field Mapping

The `resourceType` setting for each mapping specifies the type of the
100 changes: 42 additions & 58 deletions packages/ilib-loctool-regex/RegexFile.js
@@ -20,7 +20,17 @@
var fs = require("fs");
var path = require("path");
var Locale = require("ilib-locale");
var IString = require("ilib-istring");
var escaperFactory = require("ilib-tools-common").escaperFactory;

// fake escaper for the identity escaper
var identity = {
escape: function(str) {
return str;
},
unescape: function(str) {
return str;
}
};

/**
* Create a new Regex file with the given path name and within
@@ -53,47 +63,19 @@ var RegexFile = function(props) {
exp.regex = new RegExp(exp.expression, exp.flags);
}
exp.regex.lastIndex = 0;

// set up the unescaper to use after we have found the strings. The same unescaper
// is used for all strings that match this expression. Escapers vary by expression
// because different types of strings might have different escaping rules, even within
// the same programming language
// (e.g. double quoted strings in PHP are escaped differently than single quoted strings)
var escapeStyle = exp.escapeStyle || "js";
exp.escaper = escapeStyle !== "none" ? escaperFactory(escapeStyle) : identity;
});
}
this.resourceIndex = 0;
};

var reUnicodeChar = /\\u([a-fA-F0-9]{1,4})/g;

/**
* Unescape the string to make the same string that would be
* in memory in the target programming language.
*
* @static
* @param {String} string the string to unescape
* @returns {String} the unescaped string
*/
function unescapeString(string) {
if (!string) return string;
var unescaped = string;

// first, unescape unicode characters
while ((match = reUnicodeChar.exec(unescaped))) {
if (match && match.length > 1) {
var value = parseInt(match[1], 16);
unescaped = unescaped.replace(match[0], IString.fromCodePoint(value));
reUnicodeChar.lastIndex = 0;
}
}

unescaped = unescaped.
replace(/\\\\n/g, ""). // line continuation
replace(/\\\n/g, ""). // line continuation
replace(/^\\\\/, "\\"). // unescape backslashes
replace(/([^\\])\\\\/g, "$1\\").
replace(/^\\'/, "'"). // unescape quotes
replace(/([^\\])\\'/g, "$1'").
replace(/^\\"/, '"').
replace(/([^\\])\\"/g, '$1"');

return unescaped;
};

/**
* If the given string is surrounded by quotes, remove the quotes.
* Otherwise, return the string unchanged.
@@ -119,13 +101,13 @@ function stripQuotes(str) {
* the string from what it looks like in the source
* code but increases matching.
*
* @static
* @param {String} string the string to clean
* @param {Escaper} escaper the escaper to use to unescape
* @returns {String} the cleaned string
*/
function cleanString(string) {
RegexFile.prototype.cleanString = function(string, escaper) {
if (!string) return string;
var unescaped = unescapeString(string);
var unescaped = escaper.unescape(string);

unescaped = unescaped.
replace(/\\[btnfr]/g, " ").
@@ -136,11 +118,10 @@ function cleanString(string) {
};

/**
* Make a new key for the given string. This must correspond
* exactly with the code in htglob jar file so that the
* resources match up. See the class IResourceBundle in
* this project under the java directory for the corresponding
* code.
* Make a new key for the given source string. This key is a
* hash of the source string that has a high probability of being
* unique for this source string and can be used to identify the
* resource.
*
* @private
* @param {String} source the source string to make a resource
@@ -156,16 +137,17 @@ RegexFile.prototype.makeKey = function(source) {
* the array of strings as an actual array.
*
* @param {String} data the string to parse
* @param {Escaper} escaper the escaper to use to unescape the strings
* @returns {Array.<String>} the array of strings
*/
function parseArray(data) {
RegexFile.prototype.parseArray = function(data, escaper) {
var arr;

if (data) {
arr = data.split(",");
arr = arr.map(function(item) {
return cleanString(item);
});
return stripQuotes(escaper.unescape(item).trim());
}.bind(this));
}

return arr;
@@ -218,24 +200,26 @@ RegexFile.prototype.matchExpression = function(data, exp, cb) {

if (result.groups) {
if (result.groups.sourcePlural) {
sourcePlural = cleanString(result.groups.sourcePlural);
sourcePlural = exp.escaper.unescape(result.groups.sourcePlural);
}
if (result.groups.comment) {
comment = cleanString(result.groups.comment);
comment = exp.escaper.unescape(result.groups.comment);
}
if (result.groups.context) {
context = cleanString(result.groups.context);
context = exp.escaper.unescape(result.groups.context);
}
if (result.groups.flavor) {
flavor = cleanString(result.groups.flavor);
flavor = exp.escaper.unescape(result.groups.flavor);
}
if (result.groups.key) {
key = cleanString(result.groups.key);
// cleanString unescapes the key and also normalizes it to
// foster greater matching, e.g. by compressing white space
key = this.cleanString(result.groups.key, exp.escaper);
}
}

if (exp.resourceType === "array") {
array = parseArray(source);
array = this.parseArray(source, exp.escaper);
}

if (!key) {
@@ -244,10 +228,10 @@ RegexFile.prototype.matchExpression = function(data, exp, cb) {
switch (exp.resourceType) {
default:
case "string":
src = cleanString(source);
src = exp.escaper.unescape(source);
break;
case "plural":
src = sourcePlural;
src = exp.escaper.unescape(sourcePlural);
break;
case "array":
src = array.join("");
@@ -270,7 +254,7 @@ RegexFile.prototype.matchExpression = function(data, exp, cb) {

switch (exp.resourceType) {
case "string":
source = cleanString(source);
source = exp.escaper.unescape(source);
r = this.API.newResource({
resType: exp.resourceType,
project: this.project.getProjectId(),
@@ -295,7 +279,7 @@ RegexFile.prototype.matchExpression = function(data, exp, cb) {
sourceLocale: this.project.sourceLocale,
source: source,
sourcePlurals: {
one: cleanString(source),
one: exp.escaper.unescape(source),
other: sourcePlural
},
pathName: this.pathName,
1 change: 1 addition & 0 deletions packages/ilib-loctool-regex/package.json
@@ -68,6 +68,7 @@
"dependencies": {
"ilib-istring": "workspace:^",
"ilib-locale": "workspace:^",
"ilib-tools-common": "workspace:^",
"micromatch": "^4.0.8"
}
}