Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We should have a string array data type #57

Open
stevevanhooser opened this issue Apr 28, 2023 · 6 comments
Open

We should have a string array data type #57

stevevanhooser opened this issue Apr 28, 2023 · 6 comments
Assignees
Labels

Comments

@stevevanhooser
Copy link
Contributor

stevevanhooser commented Apr 28, 2023

We should have a string array type. I've added new parameters to the schema.

There should be a few new search options that allow searching these string arrays with did.query.

  • 'has_exact_string' should take 1 character/string parameter and returns true if one string in the array matches exactly
  • 'has_contains_string' should take 1 characer/string parameter and returns true if one string in the array contains the provided string
  • 'numel is' should take 1 numeric parameter N and return true if the number of array elements is exactly N.
  • 'numel >' should take 1 numeric parameter N and return true if the number of array elements is greater than N.
  • 'numel <' should take 1 numeric parameter N and return true if the number of array elements is less than N.
  • 'numel >=' should take 1 numeric parameter N and return true if the number of array elements is greater than or equal to N.
  • 'numel <=' should take 1 numeric parameter N and return true if the number of array elements is less than or equal to N.
@stevevanhooser
Copy link
Contributor Author

Hi Yair - I assigned this to you. Do you still have some hours left? Please work on the sql branch. Thanks, Steve

@altmany
Copy link
Collaborator

altmany commented May 2, 2023

What will the values of such string-array fields look like?

  • comma-separated list of sub-strings e.g. "a,b,c" ?
  • or perhaps Matlab format e.g. "{'a', 'b', 'c'}" ?
  • or something else?

@stevevanhooser
Copy link
Contributor Author

They will come in from Matlab as a cell array of strings. I don't know how they should be represented in sqlite but maybe as an array of strings?

@altmany
Copy link
Collaborator

altmany commented May 4, 2023

I believe that this is now done, sort of:

  • Matlab strings are now converted into char arrays, and string arrays into cellstrs, before being added to the database
  • In the schema (I did not update the schema):
    • "cellstr" and "stringarray" data types are now synonyms
    • "char" and "string" are now synonyms
    • "struct" and "structure" are now synonyms (neither is documented in the schema)
    • "cell" is not documented in the schema
  • In processing user queries:
    • 'has_exact_string' is now supported
    • 'has_contains_string' is also supported, but it's essentially just a synonym for 'contains_string'
    • 'numel' queries are also supported, but the results will be incorrect if any of the strings contains a comma, or if the query is applied to a field which is not a string array

@altmany altmany assigned stevevanhooser and unassigned altmany May 4, 2023
@stevevanhooser
Copy link
Contributor Author

Oh, I learned something here. sqlite doesn't really have a string array type, but instead it concatenates the strings together apparently. I misunderstood the documentation, I thought sql had a true string array type.

I think we need some real analog of a string array type. I think a comma isn't a good choice for a delimiter for our application here, because someone might input paragraphs of text as one of the strings. Perhaps the ASCII bell (character with number 7) might be a good choice for a delimiter? Or a sequence of [char(10) char(7) char(10)]? When the database receives a cell array of strings as input, it could create a single character string with the delimiter, say, char(7) in between the input strings of the cell array. I don't think anyone uses the bell these days. Then, when it is queried using 'has_exact_string', the system could use a regular expression to implement the search: ['(' search_string char(7) ')'], and 'has_contains_string' would be ['((.*)' search_string '(.*)' char(7) ')'], or something close to that?

I think we should keep our old string type because we use a lot of single string entries, and add a stringarray or cellstr as a new type.

Thanks Yair!
Steve

@altmany
Copy link
Collaborator

altmany commented May 12, 2023

The conversion of Matlab string arrays into an SQL string of the format "{ 'abc', 'def' }" is done (by did.implementations.doc2sql) in vlt.data.cell2str, which is in a different code repo (vhlab-toolbox-matlab); changing it to use a BEL delimiter might break other parts of your codebase that I am not aware of.

So instead, I now use a custom cell2str function within did.implementations.doc2sql, thereby limiting the possible side-effects only to DID-Matlab. I think this new function should replace vlt.data.cell2str, but I didn't want to update code in other repos, so I'll leave this task to you. It should be replaceable as-is: The delimiter is an optional 2nd input arg with default of ', ' so the two functions behave exactly the same when the delimiter is not specified. In the case of doc2sql, I simply call it with a non-default char(7)=BEL delimiter.

Please retry now.

@altmany altmany assigned stevevanhooser and unassigned altmany May 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants