Add multi_x_science_sum (huggingface#1003)

* Add multi_x_science_sum
* Update datasets/multi_x_science_sum/multi_x_science_sum.py

Co-authored-by: Quentin Lhoest <[email protected]>
shayne-longpre · Dec 2, 2020 · c4486a9
1 parent 4aa6824
commit c4486a9
Showing 4 changed files with 263 additions and 0 deletions.
diff --git a/datasets/multi_x_science_sum/README.md b/datasets/multi_x_science_sum/README.md
@@ -0,0 +1,156 @@
+---
+annotations_creators:
+- found
+language_creators:
+- found
+languages:
+- en
+licenses:
+- unknown
+multilinguality:
+- monolingual
+size_categories:
+- 10K<n<100K
+source_datasets:
+- original
+task_categories:
+- conditional-text-generation
+task_ids:
+- summarization
+---
+
+# Dataset Card for Multi-XScience
+
+## Table of Contents
+- [Dataset Description](#dataset-description)
+  - [Dataset Summary](#dataset-summary)
+  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
+  - [Languages](#languages)
+- [Dataset Structure](#dataset-structure)
+  - [Data Instances](#data-instances)
+  - [Data Fields](#data-fields)
+  - [Data Splits](#data-splits)
+- [Dataset Creation](#dataset-creation)
+  - [Curation Rationale](#curation-rationale)
+  - [Source Data](#source-data)
+  - [Annotations](#annotations)
+  - [Personal and Sensitive Information](#personal-and-sensitive-information)
+- [Considerations for Using the Data](#considerations-for-using-the-data)
+  - [Social Impact of Dataset](#social-impact-of-dataset)
+  - [Discussion of Biases](#discussion-of-biases)
+  - [Other Known Limitations](#other-known-limitations)
+- [Additional Information](#additional-information)
+  - [Dataset Curators](#dataset-curators)
+  - [Licensing Information](#licensing-information)
+  - [Citation Information](#citation-information)
+
+## Dataset Description
+
+- **Repository:** [Multi-XScience repository](https://github.com/yaolu/Multi-XScience)
+- **Paper:** [Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles](https://arxiv.org/abs/2010.14235)
+
+### Dataset Summary
+
+Multi-XScience is a large-scale multi-document summarization dataset created from scientific articles. It introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
+
+### Supported Tasks and Leaderboards
+
+[More Information Needed]
+
+### Languages
+
+The text in the dataset is in English.
+
+## Dataset Structure
+
+### Data Instances
+
+{'abstract': 'Author(s): Kuperberg, Greg; Thurston, Dylan P. | Abstract: We give a purely topological definition of the perturbative quantum invariants of links and 3-manifolds associated with Chern-Simons field theory. Our definition is as close as possible to one given by Kontsevich. We will also establish some basic properties of these invariants, in particular that they are universally finite type with respect to algebraically split surgery and with respect to Torelli surgery. Torelli surgery is a mutual generalization of blink surgery of Garoufalidis and Levine and clasper surgery of Habiro.',
+ 'aid': 'math9912167',
+ 'mid': '1631980677',
+ 'ref_abstract': {'abstract': ['This note is a sequel to our earlier paper of the same title [4] and describes invariants of rational homology 3-spheres associated to acyclic orthogonal local systems. Our work is in the spirit of the Axelrod–Singer papers [1], generalizes some of their results, and furnishes a new setting for the purely topological implications of their work.',
+   'Recently, Mullins calculated the Casson-Walker invariant of the 2-fold cyclic branched cover of an oriented link in S^3 in terms of its Jones polynomial and its signature, under the assumption that the 2-fold branched cover is a rational homology 3-sphere. Using elementary principles, we provide a similar calculation for the general case. In addition, we calculate the LMO invariant of the p-fold branched cover of twisted knots in S^3 in terms of the Kontsevich integral of the knot.'],
+  'cite_N': ['@cite_16', '@cite_26'],
+  'mid': ['1481005306', '1641082372']},
+ 'related_work': 'Two other generalizations that can be considered are invariants of graphs in 3-manifolds, and invariants associated to other flat connections @cite_16 . We will analyze these in future work. Among other things, there should be a general relation between flat bundles and links in 3-manifolds on the one hand and finite covers and branched covers on the other hand @cite_26 .'}
+
+### Data Fields
+
+{`abstract`: text of paper abstract \
+ `aid`: arxiv id \
+ `mid`: microsoft academic graph id \
+ `ref_abstract`: \
+   { \
+    `abstract`: text of reference paper (cite_N) abstract \
+    `cite_N`: special cite symbol, \
+    `mid`: reference paper's (cite_N) microsoft academic graph id \
+   }, \
+ `related_work`: text of paper related work \
+ }
+
+### Data Splits
+
+The data is split into training, validation, and test sets.
+
+| Train  | Valid | Test |
+| ------ | ----- | ---- |
+| 30369  |  5066 | 5093 |
+
+
+## Dataset Creation
+
+### Curation Rationale
+
+[More Information Needed]
+
+### Source Data
+
+#### Initial Data Collection and Normalization
+
+[More Information Needed]
+
+#### Who are the source language producers?
+
+[More Information Needed]
+
+### Annotations
+
+#### Annotation process
+
+[More Information Needed]
+
+#### Who are the annotators?
+
+[More Information Needed]
+
+### Personal and Sensitive Information
+
+[More Information Needed]
+
+## Considerations for Using the Data
+
+### Social Impact of Dataset
+
+[More Information Needed]
+
+### Discussion of Biases
+
+[More Information Needed]
+
+### Other Known Limitations
+
+[More Information Needed]
+
+## Additional Information
+
+### Dataset Curators
+
+[More Information Needed]
+
+### Licensing Information
+
+[More Information Needed]
+
+### Citation Information
+
+[More Information Needed]
diff --git a/datasets/multi_x_science_sum/dataset_infos.json b/datasets/multi_x_science_sum/dataset_infos.json
@@ -0,0 +1 @@
+{"default": {"description": "\nMulti-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.\n", "citation": "\n@article{lu2020multi,\n  title={Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles},\n  author={Lu, Yao and Dong, Yue and Charlin, Laurent},\n  journal={arXiv preprint arXiv:2010.14235},\n  year={2020}\n}\n", "homepage": "https://github.com/yaolu/Multi-XScience", "license": "", "features": {"aid": {"dtype": "string", "id": null, "_type": "Value"}, "mid": {"dtype": "string", "id": null, "_type": "Value"}, "abstract": {"dtype": "string", "id": null, "_type": "Value"}, "related_work": {"dtype": "string", "id": null, "_type": "Value"}, "ref_abstract": {"feature": {"cite_N": {"dtype": "string", "id": null, "_type": "Value"}, "mid": {"dtype": "string", "id": null, "_type": "Value"}, "abstract": {"dtype": "string", "id": null, "_type": "Value"}}, "length": -1, "id": null, "_type": "Sequence"}}, "post_processed": null, "supervised_keys": null, "builder_name": "multi_x_science_sum", "config_name": "default", "version": {"version_str": "1.1.0", "description": null, "major": 1, "minor": 1, "patch": 0}, "splits": {"train": {"name": "train", "num_bytes": 169364465, "num_examples": 30369, "dataset_name": "multi_x_science_sum"}, "test": {"name": "test", "num_bytes": 27965523, "num_examples": 5093, "dataset_name": "multi_x_science_sum"}, "validation": {"name": "validation", "num_bytes": 28168498, "num_examples": 5066, "dataset_name": "multi_x_science_sum"}}, "download_checksums": {"https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/train.json.gz": {"num_bytes": 46052652, "checksum": "447590bb3580b489fe388ccaf563b771ce310d4054b188e4d4055916653ce915"}, 
"https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/test.json.gz": {"num_bytes": 7601038, "checksum": "40b08749cc676c28b44e9c328dc3befa659eb2a752344e9388918e47efde3444"}, "https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/val.json.gz": {"num_bytes": 7675614, "checksum": "fc5a84edad99d9d9fe6e71567b1c198f410e34a3b1e71a56e73b77b930258a8a"}}, "download_size": 61329304, "post_processing_size": null, "dataset_size": 225498486, "size_in_bytes": 286827790}}
diff --git a/datasets/multi_x_science_sum/dummy/1.1.0/dummy_data.zip b/datasets/multi_x_science_sum/dummy/1.1.0/dummy_data.zip
diff --git a/datasets/multi_x_science_sum/multi_x_science_sum.py b/datasets/multi_x_science_sum/multi_x_science_sum.py
@@ -0,0 +1,106 @@
+# coding=utf-8
+# Copyright 2020 The TensorFlow Datasets Authors and the HuggingFace Datasets Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Lint as: python3
+"""Multi-XScience Dataset."""
+
+from __future__ import absolute_import, division, print_function
+
+import json
+
+import datasets
+
+
+_CITATION = """
+@article{lu2020multi,
+  title={Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles},
+  author={Lu, Yao and Dong, Yue and Charlin, Laurent},
+  journal={arXiv preprint arXiv:2010.14235},
+  year={2020}
+}
+"""
+
+_DESCRIPTION = """
+Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references.
+"""
+
+_URL_TRAIN = "https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/train.json.gz"
+_URL_TEST = "https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/test.json.gz"
+_URL_VAL = "https://raw.githubusercontent.com/yaolu/Multi-XScience/master/data/val.json.gz"
+
+
+class MultiXScienceSum(datasets.GeneratorBasedBuilder):
+    """Multi-XScience Dataset."""
+
+    VERSION = datasets.Version("1.1.0")
+
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features(
+                {
+                    "aid": datasets.Value("string"),
+                    "mid": datasets.Value("string"),
+                    "abstract": datasets.Value("string"),
+                    "related_work": datasets.Value("string"),
+                    "ref_abstract": datasets.Sequence(
+                        {
+                            "cite_N": datasets.Value("string"),
+                            "mid": datasets.Value("string"),
+                            "abstract": datasets.Value("string"),
+                        },
+                    ),
+                }
+            ),
+            supervised_keys=None,
+            homepage="https://github.com/yaolu/Multi-XScience",
+            citation=_CITATION,
+        )
+
+    def _split_generators(self, dl_manager):
+        """Returns SplitGenerators."""
+        train_path = dl_manager.download_and_extract(_URL_TRAIN)
+        test_path = dl_manager.download_and_extract(_URL_TEST)
+        val_path = dl_manager.download_and_extract(_URL_VAL)
+
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={"path": train_path},
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.TEST,
+                gen_kwargs={"path": test_path},
+            ),
+            datasets.SplitGenerator(
+                name=datasets.Split.VALIDATION,
+                gen_kwargs={"path": val_path},
+            ),
+        ]
+
+    def _generate_examples(self, path=None):
+        """Yields examples."""
+        with open(path, encoding="utf-8") as f:
+            # The context manager closes the file on exit; no explicit close() is needed.
+            data = json.load(f)
+
+        for idx, el in enumerate(data):
+            cite_n = list(el["ref_abstract"].keys())
+            cite_n_mid = [el["ref_abstract"][cite]["mid"] for cite in cite_n]
+            cite_n_abstract = [el["ref_abstract"][cite]["abstract"] for cite in cite_n]
+            tmp = {"cite_N": cite_n, "mid": cite_n_mid, "abstract": cite_n_abstract}
+            d = el.copy()
+            d["ref_abstract"] = tmp
+            yield idx, d
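
The reshaping that `_generate_examples` applies to `ref_abstract` can be sketched in isolation. This is a minimal sketch rather than the loader itself: `raw` is a hypothetical record with shortened text, shaped the way the loop above reads it (cite symbols as keys, each mapping to a `mid` and an `abstract`), and `flatten_refs` is an illustrative helper name that does not appear in the commit.

```python
# Hypothetical raw record with shortened text. The shape is inferred from how
# _generate_examples indexes el["ref_abstract"][cite]["mid"] / ["abstract"].
raw = {
    "aid": "math9912167",
    "mid": "1631980677",
    "abstract": "We give a purely topological definition ...",
    "related_work": "... other flat connections @cite_16 ... branched covers @cite_26 .",
    "ref_abstract": {
        "@cite_16": {"mid": "1481005306", "abstract": "This note is a sequel ..."},
        "@cite_26": {"mid": "1641082372", "abstract": "Recently, Mullins calculated ..."},
    },
}


def flatten_refs(record):
    """Return a shallow copy of record with ref_abstract as parallel lists.

    Illustrative helper (not part of the commit): it turns the dict keyed by
    cite symbol into the three aligned lists that the Sequence feature
    declared in _info expects.
    """
    refs = record["ref_abstract"]
    cite_n = list(refs)  # dicts preserve insertion order, so the lists stay aligned
    flat = dict(record)  # shallow copy; the input record is left untouched
    flat["ref_abstract"] = {
        "cite_N": cite_n,
        "mid": [refs[cite]["mid"] for cite in cite_n],
        "abstract": [refs[cite]["abstract"] for cite in cite_n],
    }
    return flat


example = flatten_refs(raw)
print(example["ref_abstract"]["cite_N"])
print(example["ref_abstract"]["mid"])
```

Entry `i` of each list describes the same reference, which matches the layout shown under Data Instances in the dataset card, where `cite_N`, `mid`, and `abstract` are aligned lists.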