msra2mrc.py may have some problems #120

Open
cooper12121 opened this issue Nov 23, 2022 · 4 comments

cooper12121 commented Nov 23, 2022

I think your .py file has a problem processing the raw dataset. For example, the raw data looks like this:
当 O
希 O
望 O
工 O
程 O
救 O
助 O
的 O
百 O
万 O
儿 O
童 O
成 O
长 O
起 O
来 O
, O
科 O
教 O
兴 O
国 O
蔚 O
然 O
成 O
风 O
时 O
, O
今 O
天 O
有 O
收 O
藏 O
价 O
值 O
的 O
书 O
你 O
没 O
买 O
, O
明 O
日 O
就 O
叫 O
你 O
悔 O
不 O
当 O
初 O
! O

藏 O
书 O
本 O
来 O
就 O
是 O
所 O
有 O
传 O
统 O
收 O
藏 O
门 O
类 O
中 O
的 O
第 O
一 O
大 O
户 O
, O
只 O
是 O
我 O
们 O
结 O
束 O
温 O
饱 O
的 O
时 O
间 O
太 O
短 O
而 O
已 O
。 O
So the script doesn't work on the raw dataset you provided, and its output doesn't match the MRC-format dataset you provided either.
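
The sample above is in a column format: one character and its label per line, with a blank line between sentences. As a minimal sketch of how such input could be grouped into sentences before any conversion, assuming whitespace-separated `char label` pairs (the function name `read_column_format` is hypothetical and not part of the repository):

```python
from typing import Iterable, List, Tuple


def read_column_format(lines: Iterable[str]) -> List[Tuple[List[str], List[str]]]:
    """Group "char label" lines into (chars, labels) sentences split on blank lines."""
    sentences = []
    chars, labels = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if chars:
                sentences.append((chars, labels))
                chars, labels = [], []
            continue
        char, label = line.split()
        chars.append(char)
        labels.append(label)
    # Flush the last sentence when the input does not end with a blank line.
    if chars:
        sentences.append((chars, labels))
    return sentences


# Shortened excerpt of the data quoted in this issue.
sample = ["当 O", "希 O", "望 O", "", "藏 O", "书 O"]
sentences = read_column_format(sample)
```

Reading the file this way makes it easy to compare the number of sentences against the MRC-format dataset and see where the two diverge.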


dsuanya commented Mar 16, 2023

Hello! I also ran into some problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? Thank you so much if you could reply!

cooper12121 (Author) commented

> Hello! I also ran into some problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? Thank you so much if you could reply!

This input comes in several different formats, but none of them matched the paper's data-processing script. After a few attempts I gave up, so sorry, I can't help you.
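
One way to narrow down an invalid-input error is to list the lines that do not fit the expected layout. The sketch below assumes the column format quoted at the top of this issue (exactly two whitespace-separated fields per non-blank line); `find_malformed_lines` is a hypothetical helper, not something from the repository:

```python
from typing import Iterable, List, Tuple


def find_malformed_lines(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, text) for non-blank lines that are not "char label"."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Blank lines are sentence separators, so they are always accepted.
        if line and len(line.split()) != 2:
            bad.append((number, raw.rstrip("\n")))
    return bad


# Made-up input with two deliberately broken lines.
sample = ["当 O", "希", "", "藏 O extra", "书 O"]
malformed = find_malformed_lines(sample)
```

If the returned list is empty, the file at least has the shape that a per-line `split` expects, and the problem is more likely in the labels themselves.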


dsuanya commented Mar 18, 2023

> > Hello! I also ran into some problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? Thank you so much if you could reply!
>
> This input comes in several different formats, but none of them matched the paper's data-processing script. After a few attempts I gave up, so sorry, I can't help you.

Anyway, thank you for your reply!

algorithm007 commented

import json
from itertools import chain

# Assumption: bmes_decode is the helper shipped with the repository
# (utils/bmes_decode.py); adjust the import if your layout differs.
from utils.bmes_decode import bmes_decode


def convert_file(input_file, output_file, tag2query_file):
    """
    Convert MSRA raw data to MRC format.

    The input is expected to hold one "char label" pair per line,
    with a blank line between sentences.
    """
    with open(tag2query_file, encoding="utf-8") as fin:
        tag2query = json.load(fin)
    mrc_samples = []
    contexts, labels = [], []
    origin_count = 0
    with open(input_file, encoding="utf-8") as fin:
        # The trailing "" is a sentinel so the last sentence is flushed
        # even when the file does not end with a blank line.
        for line in chain(fin, [""]):
            line = line.strip()
            if line:
                # split() tolerates tabs and repeated spaces between the fields
                context, label = line.split()
                contexts.append(context)
                labels.append(label)
                continue
            if not contexts:
                continue  # skip repeated blank lines
            tags = bmes_decode(char_label_list=list(zip(contexts, labels)))
            # One MRC sample per (sentence, entity label) pair
            for new_count, (label, query) in enumerate(tag2query.items(), start=1):
                start_position = [tag.begin for tag in tags if tag.tag == label]
                # tag.end is exclusive, so the last character is at end - 1
                end_position = [tag.end - 1 for tag in tags if tag.tag == label]
                mrc_samples.append(
                    {
                        "qas_id": "{}.{}".format(origin_count, new_count),
                        "context": " ".join(contexts),
                        "start_position": start_position,
                        "end_position": end_position,
                        "query": query,
                        "impossible": not (start_position and end_position),
                        "entity_label": label,
                        "span_position": [f"{start};{end}" for start, end in zip(start_position, end_position)]
                    }
                )
            contexts, labels = [], []
            origin_count += 1

    with open(output_file, "w", encoding="utf-8") as fout:
        json.dump(mrc_samples, fout, ensure_ascii=False, sort_keys=True, indent=2)
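
The function above relies on `bmes_decode`, whose implementation is not shown in this thread. To illustrate what it is used for here, the following is a simplified stand-in, not the repository's code: it assumes labels of the form `B-X`, `M-X`, `E-X`, `S-X` and `O`, and returns spans with an exclusive `end`, which is what the `tag.end - 1` in the conversion suggests. The names `Tag` and `decode_spans` and the label `NS` are illustrative assumptions.

```python
from typing import List, NamedTuple, Tuple


class Tag(NamedTuple):
    term: str   # surface text of the entity
    tag: str    # entity label, e.g. "NS"
    begin: int  # index of the first character
    end: int    # index one past the last character (exclusive)


def decode_spans(char_label_list: List[Tuple[str, str]]) -> List[Tag]:
    """Decode BMES-style character labels into entity spans (simplified sketch)."""
    tags = []
    start = None  # index where the currently open entity began
    for idx, (char, label) in enumerate(char_label_list):
        prefix, _, entity = label.partition("-")
        if prefix == "S":
            # Single-character entity
            tags.append(Tag(char, entity, idx, idx + 1))
            start = None
        elif prefix == "B":
            start = idx
        elif prefix == "E" and start is not None:
            term = "".join(c for c, _ in char_label_list[start:idx + 1])
            tags.append(Tag(term, entity, start, idx + 1))
            start = None
        elif prefix == "O":
            start = None
        # "M" simply continues the open entity
    return tags


pairs = [("我", "O"), ("在", "O"), ("北", "B-NS"), ("京", "E-NS")]
spans = decode_spans(pairs)
start_position = [t.begin for t in spans if t.tag == "NS"]
end_position = [t.end - 1 for t in spans if t.tag == "NS"]
```

Here the entity covers characters 2 and 3, so `start_position` and `end_position` hold inclusive indices, matching how the conversion fills the MRC sample. For a sentence labeled only with `O`, as in the data quoted in this issue, both lists would be empty and the sample would be marked `impossible`.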
