the msra2src.py may have some problems #120

cooper12121 · 2022-11-23T08:49:32Z

I think your .py file has a problem processing the raw dataset. For example, the raw data looks like this:
当 O
希 O
望 O
工 O
程 O
救 O
助 O
的 O
百 O
万 O
儿 O
童 O
成 O
长 O
起 O
来 O
， O
科 O
教 O
兴 O
国 O
蔚 O
然 O
成 O
风 O
时 O
， O
今 O
天 O
有 O
收 O
藏 O
价 O
值 O
的 O
书 O
你 O
没 O
买 O
， O
明 O
日 O
就 O
叫 O
你 O
悔 O
不 O
当 O
初 O
！ O

藏 O
书 O
本 O
来 O
就 O
是 O
所 O
有 O
传 O
统 O
收 O
藏 O
门 O
类 O
中 O
的 O
第 O
一 O
大 O
户 O
， O
只 O
是 O
我 O
们 O
结 O
束 O
温 O
饱 O
的 O
时 O
间 O
太 O
短 O
而 O
已 O
。 O
So the script doesn't work on the raw dataset you provided, and its output also doesn't match the MRC-format dataset you provided.
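To make the format concrete, here is a minimal sketch of reading the layout shown above: one `char label` pair per line, with a blank line between sentences. The function name `read_column_format` and the `Sentence` alias are hypothetical and not part of the repository; this is an illustration under that assumed layout, not the project's own loader.

```python
from typing import Iterable, List, Tuple

# A sentence is a pair of parallel lists: characters and their labels.
Sentence = Tuple[List[str], List[str]]


def read_column_format(lines: Iterable[str]) -> List[Sentence]:
    """Group "char label" lines into sentences separated by blank lines."""
    sentences: List[Sentence] = []
    chars: List[str] = []
    labels: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line:
            # Each non-blank line holds exactly one character and one label.
            char, label = line.split()
            chars.append(char)
            labels.append(label)
        elif chars:
            # A blank line closes the current sentence.
            sentences.append((chars, labels))
            chars, labels = [], []
    if chars:
        # Keep the last sentence even if the input has no trailing blank line.
        sentences.append((chars, labels))
    return sentences


sample = ["当 O\n", "希 O\n", "\n", "藏 O\n", "书 O\n"]
sentences = read_column_format(sample)
```

A script that instead expects one tab-separated sentence per line would fail on this input at the split step, which would be consistent with the problem described here.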

dsuanya · 2023-03-16T13:47:10Z

Hello! I also ran into some problems when running msra2mrc.py. Do you know how to deal with the invalid-inputs problem? I would be very grateful for a reply!

cooper12121 · 2023-03-17T13:45:58Z

There are several different formats of this input data, but none of them matched the paper's data-processing script. After some attempts I gave up, so sorry, I can't help you.

dsuanya · 2023-03-18T02:12:49Z

Anyway, thank you for your reply!

algorithm007 · 2023-04-19T08:19:21Z

import json
from itertools import chain

# bmes_decode is the repository's own helper, imported as in the original msra2mrc.py
from utils.bmes_decode import bmes_decode


def convert_file(input_file, output_file, tag2query_file):
    """
    Convert MSRA raw data to MRC format.
    The raw file has one "char label" pair per line and a blank line between sentences.
    """
    origin_count = 0
    with open(tag2query_file, encoding="utf-8") as fin:
        tag2query = json.load(fin)
    mrc_samples = []
    contexts, labels = [], []
    with open(input_file, encoding="utf-8") as fin:
        # The extra "" stands in for a final blank line, so the last sentence
        # is converted even if the file does not end with one.
        for line in chain(fin, [""]):
            line = line.strip()
            if line:
                context, label = line.split()
                contexts.append(context)
                labels.append(label)
            elif contexts:
                # A blank line closes the sentence: decode its entity spans.
                tags = bmes_decode(char_label_list=list(zip(contexts, labels)))
                # One MRC sample per entity type; qas_id is "<sentence index>.<query index>".
                for new_count, (label, query) in enumerate(tag2query.items(), start=1):
                    start_position = [tag.begin for tag in tags if tag.tag == label]
                    # tag.end - 1 gives the inclusive end index of the span
                    end_position = [tag.end - 1 for tag in tags if tag.tag == label]
                    mrc_samples.append(
                        {
                            "qas_id": "{}.{}".format(origin_count, new_count),
                            "context": " ".join(contexts),
                            "start_position": start_position,
                            "end_position": end_position,
                            "query": query,
                            "impossible": not (start_position and end_position),
                            "entity_label": label,
                            "span_position": [f"{start};{end}" for start, end in zip(start_position, end_position)]
                        }
                    )
                contexts, labels = [], []
                origin_count += 1

    with open(output_file, "w", encoding="utf-8") as fout:
        json.dump(mrc_samples, fout, ensure_ascii=False, sort_keys=True, indent=2)
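
For readers who want to see where `start_position` and `end_position` come from, the following is a simplified stand-in for the span decoding step. It is not the repository's `bmes_decode`: the names `Span` and `decode_bmes` are hypothetical, and it assumes labels of the form `B-X`, `M-X`, `E-X`, `S-X`, and `O`. The tag names `NS` and `NR` in the sample are likewise only illustrative.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    text: str
    tag: str
    begin: int  # inclusive character index
    end: int    # exclusive character index


def decode_bmes(pairs: List[Tuple[str, str]]) -> List[Span]:
    """Collect entity spans from (char, label) pairs under a BMES scheme."""
    spans: List[Span] = []
    i = 0
    while i < len(pairs):
        char, label = pairs[i]
        prefix, _, tag = label.partition("-")
        if prefix == "S":
            # Single-character entity.
            spans.append(Span(char, tag, i, i + 1))
            i += 1
        elif prefix == "B":
            # Extend over any M- labels, then include the closing E- label.
            j = i + 1
            while j < len(pairs) and pairs[j][1] == f"M-{tag}":
                j += 1
            if j < len(pairs) and pairs[j][1] == f"E-{tag}":
                j += 1
            text = "".join(c for c, _ in pairs[i:j])
            spans.append(Span(text, tag, i, j))
            i = j
        else:
            # "O" and any unexpected label are skipped.
            i += 1
    return spans


pairs = [("北", "B-NS"), ("京", "E-NS"), ("的", "O"), ("李", "S-NR")]
spans = decode_bmes(pairs)
# Same conversion as in convert_file: inclusive end = exclusive end - 1.
starts = [s.begin for s in spans if s.tag == "NS"]
ends = [s.end - 1 for s in spans if s.tag == "NS"]
```

Because the context is written as space-joined characters, these indices count characters (tokens) rather than byte or string offsets.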
