From 4560a3772af66d17a63440419653e4f084e630c7 Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Thu, 4 May 2023 20:49:40 +1200
Subject: [PATCH 1/2] Allow for UTF-8 field values in header regular expression

Use `[:print:]` in the header regex and note that for ASCII it is
equivalent to `[ -~]` and that the aim is to forbid control characters.
Fixes #719.
---
 SAMv1.tex | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/SAMv1.tex b/SAMv1.tex
index 7b0b4c7a..9125af76 100644
--- a/SAMv1.tex
+++ b/SAMv1.tex
@@ -81,6 +81,7 @@ \section{The SAM Format Specification}
 For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
 
 The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
+For brevity, named character classes are written as~{\tt [\cclass{class}]} without an additional pair of brackets.
 
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
@@ -227,8 +228,10 @@ \subsection{The header section}
 each data field follows a format `{\tt TAG:VALUE}' where {\tt TAG} is a
 two-character string that defines the format and content of {\tt
   VALUE}. Thus header lines match {\tt
-  /\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[
-  -\char126]+)+\$/} or {\tt /\char94@CO\char92t.*/}.
+  /\char94@(HD|SQ|RG|PG)(\char92t[A-Za-z][A-Za-z0-9]:[\cclass{print}]+)+\$/}
+  or {\tt /\char94@CO\char92t.*/}.%
+\footnote{{\tt [\cclass{print}]} indicates that header field values contain printable characters, i.e.,~non-control characters.
+For fields limited to~ASCII, which is the majority, this is equivalent to~{\tt [ -\char126]}.}
 Within each (non-{\tt @CO}) header line, no field tag may
 appear more than once and the order in which the fields appear is not
 significant.
From 3692643ba7ff22ba50180b777ef898aa1141056b Mon Sep 17 00:00:00 2001
From: John Marshall
Date: Wed, 29 Jan 2025 09:56:36 +1300
Subject: [PATCH 2/2] Committee has decided not to elide excess brackets in character classes

This affects the existing [[:rname:^*=]]... and the new [[:print:]].
---
 SAMv1.tex | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/SAMv1.tex b/SAMv1.tex
index 9125af76..6987501f 100644
--- a/SAMv1.tex
+++ b/SAMv1.tex
@@ -33,7 +33,8 @@
 \newcommand*{\firstbytebox}[2]{\byteboxAux{#1}{#2}{\put(0,0){\line(0,1){\bytetotalheight}}}}
 \newcommand*{\bytebox}[2]{\byteboxAux{#1}{#2}{}}
 
-\newcommand*{\cclass}[1]{{\rm\sf :#1:}}
+\newcommand*{\cclass}[1]{[{\rm\sf :#1:}]}
+\newcommand*{\cclassexcept}[2]{[{\rm\sf :#1:}\caret #2]}
 \newcommand*{\caret}{\textsuperscript{$\wedge$}}
 \newcommand*{\memlimited}{\textcolor{gray}{\footnotesize\it limited}}
 
@@ -81,7 +82,6 @@ \section{The SAM Format Specification}
 For example, floating-point values in SAM always use `{\tt .}' for the decimal-point character.
 
 The regular expressions in this specification are written using the POSIX\,/\,IEEE Std 1003.1 extended syntax.
-For brevity, named character classes are written as~{\tt [\cclass{class}]} without an additional pair of brackets.
 
 \subsection{An example}\label{sec:example}
 Suppose we have the following alignment with bases in lowercase
@@ -213,9 +213,7 @@ \subsubsection{Character set restrictions}\label{sec:charset}
 {\tt [\verb"0-9A-Za-z!#$%&+./:;?@^_|~-"][\verb"0-9A-Za-z!#$%&*+./:;=?@^_|~-"]*}
 \end{center}
 
-% Pedantically this should be [[:rname:]^*=][[:rname:]]*, but we take advantage
-% of POSIX (Issue 7) section 9.3.5/8 to elide the excess brackets for clarity.
-\newcommand*{\rnameRegexp}{[\cclass{rname}\caret*=][\cclass{rname}]*}
+\newcommand*{\rnameRegexp}{[\cclassexcept{rname}{*=}][\cclass{rname}]*}
 \noindent
 For clarity, elsewhere in this specification we write this set of allowed characters as a character class~{\tt [\cclass{rname}]}
 and extend the POSIX regular expression notation to use {\tt\caret *=} to indicate the omission of `{\tt *}' and `{\tt =}' from the character class.
@@ -305,6 +303,7 @@ \subsection{The header section}
 These alternative names are not used elsewhere within the SAM file; in particular, they must not appear in alignment records' {\sf RNAME} or~{\sf RNEXT} fields.
+\newline
 \emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*} where \emph{name} is {\tt\rnameRegexp}\\\cline{2-3}
 & {\tt AS} & Genome assembly identifier. \\\cline{2-3}
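
A rough Python sketch of how a validator might apply the header-line and reference-name expressions touched by this series. It is an illustration under stated assumptions, not part of the specification: Python's `re` module has no POSIX named classes, so `[[:print:]]` is approximated as "anything except C0 controls, DEL, and C1 controls" (the real class is locale-defined), and `[[:rname:]]` is expanded to the explicit character set shown in the diff context. The names `PRINT`, `HEADER_RE`, `RNAME_RE`, `is_valid_header_line`, and `is_valid_rname` are hypothetical.

```python
import re

# Assumption: stand-in for POSIX [[:print:]], which Python's re lacks.
# Excludes C0 controls (including TAB), DEL, and C1 controls.
PRINT = r"[^\x00-\x1f\x7f-\x9f]"

# /^@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:[[:print:]]+)+$/ or /^@CO\t.*/
# Anchoring is done with fullmatch() below instead of ^ and $.
HEADER_RE = re.compile(
    r"@(HD|SQ|RG|PG)(\t[A-Za-z][A-Za-z0-9]:" + PRINT + r"+)+"
    r"|@CO\t.*"
)

# [[:rname:^*=]][[:rname:]]* with [[:rname:]] written out explicitly:
# the first character may not be '*' or '='.
RNAME_RE = re.compile(
    r"[0-9A-Za-z!#$%&+./:;?@^_|~-][0-9A-Za-z!#$%&*+./:;=?@^_|~-]*"
)


def is_valid_header_line(line: str) -> bool:
    """Check one header line (without its trailing newline)."""
    return HEADER_RE.fullmatch(line) is not None


def is_valid_rname(name: str) -> bool:
    """Check a reference sequence name."""
    return RNAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    print(is_valid_header_line("@HD\tVN:1.6\tSO:coordinate"))
    print(is_valid_header_line("@RG\tID:x\tDS:caf\u00e9"))  # non-ASCII value
    print(is_valid_header_line("@SQ\tSN:chr1\x01"))         # control character
    print(is_valid_rname("chr1"), is_valid_rname("*"))
```

This sketch checks only the line-level pattern; constraints such as "no field tag may appear more than once" would need a separate check.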