repo-sync-2023-12-18T16:59:44+0800 (secretflow#1086)

* repo-sync-2023-12-18T16:59:44+0800
* unittest logging_level debug -> info
* revert spu version
* update docker image version
* revert docker image version & sleep time
FantZero committed on Dec 19, 2023
commit 83254ca · 1 parent fcb0694

Showing 109 changed files with 2,245 additions and 1,251 deletions.
diff --git a/.bazeliskrc b/.bazeliskrc
@@ -0,0 +1,2 @@
+USE_BAZEL_VERSION=5.1.1
+BAZELISK_BASE_URL=https://github.com/bazelbuild/bazel/releases/download
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,23 +12,37 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 `Fixed` for any bug fixes.
 `Security` in case of vulnerabilities.
 
-## [1.3.0.dev231211] - 2023-12-11
+## [1.3.0.dev231218] - 2023-12-18
 ### Added
-- Add IO component including read, write and identity.
-- Change groupby component to by-query style.
-
-## [1.3.0.dev231205] - 2023-12-05
+- Make barrier_on_shutdown optional.
+- Support SGB label holder without features.
+- Support SL Model training on file data with mutiple labels.
+- Add SL ResNet and VGG application.
+- Secretflow ic: Add package interconnection protobuf files.
+- Component: Add feature calculate component to generate new features by performing calculations on original features.
+- Component: Support SGB prediction on big dataset.
 
 ### Changed
-- Add feature selection in all model predict comps.
+- SGB optimize memory usage in prediction.
+- Component: Bump groupby statistics version.
+- Component: Improve translation.
 
 ### Fixed
-- Fix pvalue & more readable assert msg.
+- Component: Fix woe io and fillna.
+
+
+## [1.3.0.dev231211] - 2023-12-11
+### Added
+- Add IO component including read, write and identity.
+- Change groupby component to by-query style.
+
 
 ## [1.3.0.dev231128] - 2023-11-28
 
 ### Added
 - Add secretflow tuner for automl and autoattack.
+- Add IO component including read, write and identity.
+- Change groupby component to by-query style.
 
 ## [1.3.0.dev231120] - 2023-11-20
 

diff --git a/REPO_LAYOUT.md b/REPO_LAYOUT.md
@@ -1,13 +1,26 @@
 # Repository layout
 
-secretflow
-- data: horizontal, vertical and mixed DataFrame and Ndarray (like pandas and numpy)
-- device: various devices and their kernels, such as PYU, SPU, HEU, etc
-- model: federated learning and split learning algorithms
-- preprocessing: common utility functions and transformer classes (like scikit-learn)
-- security: privacy related algorithms, such as secure aggregation, differential privacy
-- util: miscellaneous utility functions
+This is a high level overview of how the repository is laid out. Some major folders are listed below:
 
-tests: unit test cases
-
-docs: documents written in reStructuredText, Markdown, Jupyter-notebook
+* [benchmark_examples/](benchmark_examples/): scripts for secretflow component benchmark.
+* [docker/](docker/): scripts to build secretflow release and dev docker images.
+* [docs/](docs/): documents written in reStructuredText, Markdown, Jupyter-notebook.
+* [examples/](examples/): examples of secretflow.
+* [secretflow/](secretflow/): the core library.
+    * [component/](secretflow/component/): secretflow components.
+    * [compute/](secretflow/compute/): wrapper for pyarrow compute functions.
+    * [data/](secretflow/data/): horizontal, vertical and mixed DataFrame and Ndarray (like pandas and numpy).
+    * [device/](secretflow/device/): various devices and their kernels, such as PYU, SPU, HEU, etc.
+    * [distributed/](secretflow/distributed/): logics related to Ray and RayFed.
+    * [ic/](secretflow/ic/): interconnection.
+    * [kuscia/](secretflow/kuscia/): adapter to kuscia.
+    * [ml/](secretflow/ml/): federated learning and split learning algorithms.
+    * [preprocessing/](secretflow/preprocessing/): preprocessing functions.
+    * [protos/](secretflow/protos/): Protocol Buffers messages.
+    * [security/](secretflow/security/): privacy related algorithms, such as secure aggregation, differential privacy.
+    * [spec/](secretflow/spec/): generated code of spec Protocol Buffers messages.
+    * [stats/](secretflow/stats/): statistics functions.
+    * [tune/](secretflow/tune/): functions related to tuners.
+    * [utils/](secretflow/utils/): miscellaneous utility functions.
+* [secretflow_lib/](secretflow_lib/): some core functions written in C++ and their Python bindings.
+* [tests/](tests/): unit tests with pytest.
diff --git a/WORKSPACE b/WORKSPACE
@@ -4,7 +4,7 @@ load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")
 
 git_repository(
     name = "yacl",
-    commit = "6ba8bd5f02035176ec4daaca1c1269195a1b1b4e",
+    commit = "2b7d8882c78f07bd9e78217b7f9ca13135781e65",
     remote = "https://github.com/secretflow/yacl.git",
 )
 

diff --git a/docker/comp_list.json b/docker/comp_list.json
@@ -252,9 +252,8 @@
       "inputs": [
         {
           "name": "input_data",
-          "desc": "Input dist data",
+          "desc": "Input data",
           "types": [
-            "sf.model.ss_glm",
             "sf.model.ss_glm",
             "sf.model.sgb",
             "sf.model.ss_xgb",
@@ -268,9 +267,8 @@
       "outputs": [
         {
           "name": "output_data",
-          "desc": "Output dist data",
+          "desc": "Output data",
           "types": [
-            "sf.model.ss_glm",
             "sf.model.ss_glm",
             "sf.model.sgb",
             "sf.model.ss_xgb",
@@ -656,6 +654,17 @@
               "b": true
             }
           }
+        },
+        {
+          "name": "batch_size",
+          "desc": "Prediction batch size",
+          "type": "AT_INT",
+          "atomic": {
+            "isOptional": true,
+            "defaultValue": {
+              "i64": "100000"
+            }
+          }
         }
       ],
       "inputs": [
@@ -2147,6 +2156,52 @@
         }
       ]
     },
+    {
+      "domain": "preprocessing",
+      "name": "feature_calculate",
+      "desc": "Generate a new feature by performing calculations on an origin feature",
+      "version": "0.0.1",
+      "attrs": [
+        {
+          "name": "rules",
+          "desc": "input CalculateOpRules rules",
+          "type": "AT_CUSTOM_PROTOBUF",
+          "customProtobufCls": "calculate_rules_pb2.CalculateOpRules"
+        }
+      ],
+      "inputs": [
+        {
+          "name": "in_ds",
+          "desc": "Input vertical table",
+          "types": [
+            "sf.table.vertical_table"
+          ],
+          "attrs": [
+            {
+              "name": "features",
+              "desc": "Feature(s) to operate on",
+              "colMinCntInclusive": "1"
+            }
+          ]
+        }
+      ],
+      "outputs": [
+        {
+          "name": "out_ds",
+          "desc": "output_dataset",
+          "types": [
+            "sf.table.vertical_table"
+          ]
+        },
+        {
+          "name": "out_rules",
+          "desc": "feature calculate rule",
+          "types": [
+            "sf.rule.preprocessing"
+          ]
+        }
+      ]
+    },
     {
       "domain": "preprocessing",
       "name": "feature_filter",
@@ -2190,7 +2245,7 @@
           "atomic": {
             "isOptional": true,
             "defaultValue": {
-              "s": "mean"
+              "s": "constant"
             },
             "allowedValues": {
               "ss": [
@@ -2209,7 +2264,7 @@
           "atomic": {
             "isOptional": true,
             "defaultValue": {
-              "s": "general_na"
+              "s": "custom_missing_value"
             }
           }
         },
@@ -2608,7 +2663,7 @@
       "domain": "stats",
       "name": "groupby_statistics",
       "desc": "Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.",
-      "version": "0.0.2",
+      "version": "0.0.3",
       "attrs": [
         {
           "name": "aggregation_config",

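Note on the `batch_size` attribute added in the hunk above (optional, default `100000`, described as "Prediction batch size"): the following is a minimal sketch of what batched prediction generally looks like, not SecretFlow's actual implementation. The names `iter_batches`, `predict_in_batches`, and `predict_fn` are hypothetical and do not appear in this commit.

```python
from typing import Callable, Iterator, List, Sequence

# Mirrors the default value "i64": "100000" shown in the diff.
DEFAULT_BATCH_SIZE = 100_000


def iter_batches(rows: Sequence, batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def predict_in_batches(
    rows: Sequence,
    predict_fn: Callable[[Sequence], List],  # hypothetical model callable
    batch_size: int = DEFAULT_BATCH_SIZE,
) -> List:
    """Run `predict_fn` one batch at a time and concatenate the results."""
    predictions: List = []
    for batch in iter_batches(rows, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions
```

Processing one batch at a time bounds peak memory by the batch size instead of the dataset size, which is consistent with the CHANGELOG entries "SGB optimize memory usage in prediction" and "Support SGB prediction on big dataset".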
diff --git a/docker/dev/Dockerfile b/docker/dev/Dockerfile
@@ -19,7 +19,7 @@ COPY --from=builder /bin/nsjail /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/bin/ /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/lib/ /usr/local/lib/
 
-RUN yum install -y protobuf libnl3 && yum clean all
+RUN yum install -y protobuf libnl3 libgomp && yum clean all
 
 RUN grep -rl '#!/root/miniconda3/envs/secretflow/bin' /usr/local/bin/ | xargs sed -i -e 's/#!\/root\/miniconda3\/envs\/secretflow/#!\/usr\/local/g'
 

diff --git a/docker/release/anolis-lite.Dockerfile b/docker/release/anolis-lite.Dockerfile
@@ -19,7 +19,7 @@ COPY --from=builder /bin/nsjail /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/bin/ /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/lib/ /usr/local/lib/
 
-RUN yum install -y protobuf libnl3 && yum clean all
+RUN yum install -y protobuf libnl3 libgomp && yum clean all
 
 RUN grep -rl '#!/root/miniconda3/envs/secretflow/bin' /usr/local/bin/ | xargs sed -i -e 's/#!\/root\/miniconda3\/envs\/secretflow/#!\/usr\/local/g'
 

diff --git a/docker/release/anolis.Dockerfile b/docker/release/anolis.Dockerfile
@@ -19,7 +19,7 @@ COPY --from=builder /bin/nsjail /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/bin/ /usr/local/bin/
 COPY --from=python /root/miniconda3/envs/secretflow/lib/ /usr/local/lib/
 
-RUN yum install -y protobuf libnl3 && yum clean all
+RUN yum install -y protobuf libnl3 libgomp && yum clean all
 
 RUN grep -rl '#!/root/miniconda3/envs/secretflow/bin' /usr/local/bin/ | xargs sed -i -e 's/#!\/root\/miniconda3\/envs\/secretflow/#!\/usr\/local/g'
 

diff --git a/docker/translation.json b/docker/translation.json
@@ -65,9 +65,9 @@
     "map any input to output": "将任何输入映射到输出",
     "0.0.1": "0.0.1",
     "input_data": "input_data",
-    "Input dist data": "输入dist数据",
+    "Input data": "输入数据",
     "output_data": "output_data",
-    "Output dist data": "输出dist 数据"
+    "Output data": "输出数据"
   },
   "io/read_data:0.0.1": {
     "io": "io",
@@ -171,6 +171,8 @@
     "Whether to save ids columns into output prediction table. If true, input feature_dataset must contain id columns, and receiver party must be id owner.": "是否将 id 列保存到输出预测表中；如果为 true，则输入feature_dataset必须包含 id 列，并且接收方必须是 id 所有者",
     "save_label": "保存标签列",
     "Whether or not to save real label columns into output pred file. If true, input feature_dataset must contain label columns and receiver party must be label owner.": "是否将真实的标签列保存到输出预测文件中；如果为 true，则输入feature_dataset必须包含标签列，并且接收方必须是标签所有者",
+    "batch_size": "batch_size",
+    "Prediction batch size": "预测批大小",
     "model": "模型",
     "feature_dataset": "特征数据集",
     "Input vertical table.": "输入联合表",
@@ -449,12 +451,12 @@
     "condition_filter": "条件筛选器",
     "Filter the table based on a single column's values and condition.\nWarning: the party responsible for condition filtering will directly send the sample distribution to other participants.\nMalicious participants can obtain the distribution of characteristics by repeatedly calling with different filtering values.\nAudit the usage of this component carefully.": "根据单个列的值和条件筛选表。\n警告：负责条件过滤的一方将直接将样本分发发送给其他参与者。\n恶意参与者可以通过使用不同的过滤值重复调用来获得特征的分布。\n仔细审核此组件的使用情况。",
     "0.0.1": "0.0.1",
-    "comparator": "比较器",
-    "Comparator to use for comparison. Must be one of '==','<','<=','>','>=','IN'": "用于比较的比较器。必须是'==='、'<'、'<='、'>'、'>='、'IN'之一",
+    "comparator": "比较条件",
+    "Comparator to use for comparison. Must be one of '==','<','<=','>','>=','IN'": "用于比较的条件。必须是'==='、'<'、'<='、'>'、'>='、'IN'之一",
     "value_type": "值类型",
     "Type of the value to compare with. Must be one of ['STRING', 'FLOAT']": "要与之进行比较的值的类型。必须是“STRING”、“FLOAT”中的一个",
-    "bound_value": "边界值",
-    "Input a str with values separated by ','. List of values to compare with. If comparator is not 'IN', we only support one element in this list.": "输入一个str，其值以“，”分隔。 表示比较的值的列表。如果comparator不是“IN”，则此列表中应该仅含一个元素。",
+    "bound_value": "条件值",
+    "Input a str with values separated by ','. List of values to compare with. If comparator is not 'IN', we only support one element in this list.": "输入一个str，其值以“，”分隔。 表示比较的值的列表。如果比较条件不是“IN”，则此列表中应该仅含一个元素。",
     "float_epsilon": "浮点数误差值",
     "Epsilon value for floating point comparison. WARNING: due to floating point representation in computers, set this number slightly larger if you want filter out the values exactly at desired boundary. for example, abs(1.001 - 1.002) is slightly larger than 0.001, and therefore may not be filter out using == and epsilson = 0.001": "用于浮点比较的Epsilon值。警告：由于计算机中的浮点表示，如果您想在所需的边界处过滤掉值，请将此数字设置得稍大一些。例如，abs（1.001-1.002）略大于0.001，因此可能无法使用==和epsilson=0.001进行过滤",
     "in_ds": "输入数据集",
@@ -466,6 +468,22 @@
     "out_ds_else": "输出数据集",
     "Output vertical table that does not satisfies the condition.": "输出不满足条件的垂直表格。"
   },
+  "preprocessing/feature_calculate:0.0.1": {
+    "preprocessing": "预处理",
+    "feature_calculate": "特征计算",
+    "Generate a new feature by performing calculations on an origin feature": "对原特征进行操作生成新特征",
+    "0.0.1": "0.0.1",
+    "rules": "规则",
+    "input CalculateOpRules rules": "输入特征计算规则",
+    "in_ds": "输入数据集",
+    "Input vertical table": "输入联合表",
+    "features": "特征列",
+    "Feature(s) to operate on": "要操作的特征列",
+    "out_ds": "输出数据集",
+    "output_dataset": "输出数据集",
+    "out_rules": "输出规则",
+    "feature calculate rule": "特征计算规则"
+  },
   "preprocessing/feature_filter:0.0.1": {
     "preprocessing": "预处理",
     "feature_filter": "特征过滤",
@@ -482,8 +500,8 @@
     "preprocessing": "预处理",
     "fillna": "异常值填充",
     "0.0.1": "0.0.1",
-    "strategy": "策略",
-    "The imputation strategy. If \"mean\", then replace missing values using the mean along each column. Can only be used with numeric data. If \"median\", then replace missing values using the median along each column. Can only be used with numeric data. If \"most_frequent\", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned. If \"constant\", then replace missing values with fill_value. Can be used with strings or numeric data.": "插补策略。如果为“平均值”，则使用每列的平均值替换缺失值。只能与数字数据一起使用。如果为“中值”，则使用每列的中值替换缺失值。只能与数字数据一起使用。如果为“most_frequency”，则使用每列中最频繁的值替换缺失的值。可以与字符串或数字数据一起使用。如果存在多个这样的值，则只返回最小的值。如果为“常量”，则用fill_value替换缺失的值。可以与字符串或数字数据一起使用。",
+    "strategy": "填充缺失值的方式",
+    "The imputation strategy. If \"mean\", then replace missing values using the mean along each column. Can only be used with numeric data. If \"median\", then replace missing values using the median along each column. Can only be used with numeric data. If \"most_frequent\", then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned. If \"constant\", then replace missing values with fill_value. Can be used with strings or numeric data.": "插补策略。如果为“平均值”，则使用每列的平均值替换缺失值。只能与数字数据一起使用。如果为“中值”，则使用每列的中值替换缺失值。只能与数字数据一起使用。如果为“众数”，则使用每列中最频繁的值替换缺失的值。可以与字符串或数字数据一起使用。如果存在多个这样的值，则只返回最小的值。如果为“自定义值”，则用fill_value替换缺失的值。可以与字符串或数字数据一起使用。",
     "missing_value": "缺失值",
     "Which value should be treat as missing_value? int, float, str, general_na (includes np.nan, None or pandas.NA which are all null in sc.table), default=general_na": "哪个值应该被视为missing_value？int、float、str、general_na（包括sc.table中全部为null的np.nan、None或pandas.na），default=general_na",
     "missing_value_type": "缺失值类型",
@@ -576,18 +594,18 @@
     "test": "测试数据子集",
     "Output test dataset.": "输出测试数据子集"
   },
-  "stats/groupby_statistics:0.0.2": {
+  "stats/groupby_statistics:0.0.3": {
     "stats": "统计",
     "groupby_statistics": "分组统计",
     "Get a groupby of statistics, like pandas groupby statistics.\nCurrently only support VDataframe.": "获取分组统计信息，参考pandas的分组统计。\n目前仅支持 VDataframe。",
-    "0.0.2": "0.0.2",
+    "0.0.3": "0.0.3",
     "aggregation_config": "聚合配置",
     "input groupby aggregation config": "输入聚合配置",
     "max_group_size": "最大组数",
     "The maximum number of groups allowed": "允许的最大组数",
     "input_data": "输入数据",
     "Input table.": "输入表",
-    "by": "组列",
+    "by": "特征列",
     "by what columns should we group the values": "我们应该按哪些列进行分组",
     "report": "报告",
     "Output groupby statistics report.": "输出分组统计信息报告"

diff --git a/docs/awesome-pets/papers/applications/ppml/ppml_crypto.md b/docs/awesome-pets/papers/applications/ppml/ppml_crypto.md
@@ -90,12 +90,12 @@ An overview of existing works is illustrated in the table below.
     Chinese Journal of Computers, [eprint in Chinese](http://cjc.ict.ac.cn/online/onlinepaper/hwl-202375100742.pdf)
 
 - When Machine Learning Meets Privacy: A Survey and Outlook.
-    *Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, and Zihuai Lin* 
+    *Bo Liu, Ming Ding, Sina Shaham, Wenny Rahayu, Farhad Farokhi, and Zihuai Lin*
     ACM Computing Surveys (CSUR), [eprint](https://arxiv.org/pdf/2011.11819.pdf)
 
 
 - Privacy-preserving machine learning: Methods, challenges and directions.
-    *Xu R, Baracaldo N, Joshi J* 
+    *Xu R, Baracaldo N, Joshi J*
     arXiv preprint arXiv, [eprint](https://arxiv.org/pdf/2108.04417.pdf)
 
 ## Two-party Computation (2PC)

diff --git a/requirements.txt b/requirements.txt
@@ -28,7 +28,7 @@ secretflow-rayfed==0.2.0a7 # FEATURE=[lite]
 setuptools>=65.5.1
 sparse>=0.14.0
 spu==0.6.0.b0  # FEATURE=[lite]
-sf-heu==0.5.0.dev20231118  # FEATURE=[lite]
+sf-heu==0.5.0.dev20231128  # FEATURE=[lite]
 tensorflow-macos==2.11.0; platform_machine == "arm64" and platform_system == "Darwin"
 tensorflow==2.11.1; platform_machine != "arm64"
 tf2onnx>=1.13.0
@@ -40,4 +40,5 @@ wheel>=0.38.1
 torch==2.1.1
 torchmetrics==0.11.4
 torchvision==0.16.1
-torchaudio==2.1.1
+torchaudio==2.1.1
+interconnection==0.1.0.dev20231204