msra2mrc.py may have some problems #120
Comments
Hello! I also ran into problems when running msra2mrc.py. Do you know how to deal with the invalid-input problem? I would be very grateful for a reply!
The input data comes in several different formats, but none of them matched the paper's data-processing script. After some attempts I gave up, so sorry, I can't help you.
Anyway, thank you for your reply!
import json

# bmes_decode is the tag-decoding helper from the repository (its import is not shown here).


def convert_file(input_file, output_file, tag2query_file):
    """
    Convert MSRA raw data to MRC format
    """
    origin_count = 0
    new_count = 1
    with open(tag2query_file, encoding="utf-8") as fin:
        tag2query = json.load(fin)
    mrc_samples = []
    contexts, labels = [], []
    with open(input_file, encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                # Each non-blank line is expected to be "<char> <label>".
                context, label = line.split(" ")
                contexts.append(context)
                labels.append(label)
            else:
                # A blank line ends the sentence: emit one sample per entity type.
                tags = bmes_decode(char_label_list=[(char, label) for char, label in zip(contexts, labels)])
                # tags = bmes_decode(char_label_list=[(char, label) for char, label in zip(src.split(), labels.split())])
                for label, query in tag2query.items():
                    start_position = [tag.begin for tag in tags if tag.tag == label]
                    end_position = [tag.end - 1 for tag in tags if tag.tag == label]
                    mrc_samples.append(
                        {
                            "qas_id": "{}.{}".format(origin_count, new_count),
                            "context": " ".join(contexts),
                            "start_position": start_position,
                            "end_position": end_position,
                            "query": query,
                            "impossible": not (start_position and end_position),
                            "entity_label": label,
                            "span_position": [f"{start};{end}" for start, end in zip(start_position, end_position)]
                        }
                    )
                    new_count += 1
                # Reset the buffers and counters for the next sentence.
                contexts, labels = [], []
                origin_count += 1
                new_count = 1
    with open(output_file, "w", encoding="utf-8") as fout:
        json.dump(mrc_samples, fout, ensure_ascii=False, sort_keys=True, indent=2)
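One possible source of "invalid input" errors in the function above is `line.split(" ")`: the unpacking raises `ValueError` whenever a stripped line does not contain exactly one space, for example when the fields are tab-separated. The function also only emits a sentence when it reaches a blank line, so a final sentence with no trailing blank line is dropped. Below is a minimal sketch of a more tolerant reader. `read_sentences` is a hypothetical helper, not part of the repository, and it assumes one character and one label per line with blank lines between sentences.

```python
import io


def read_sentences(lines):
    """Yield (chars, labels) per sentence from 'char label' lines.

    Hypothetical helper: assumes one character and one label per line,
    with blank lines separating sentences.
    """
    chars, labels = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: sentence boundary.
            if chars:
                yield chars, labels
                chars, labels = [], []
            continue
        # Split on the last run of whitespace, so tabs or repeated spaces work.
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"cannot parse line: {line!r}")
        char, label = parts
        chars.append(char)
        labels.append(label)
    # Flush the last sentence when the input does not end with a blank line.
    if chars:
        yield chars, labels


sample = io.StringIO("当 O\n希\tO\n\n望 O\n")
sentences = list(read_sentences(sample))
print(sentences)
```

Lines that still cannot be split into two fields raise an error that includes the offending line, which should make it easier to see which format the raw file actually uses.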
I think your .py file has a problem processing the raw dataset, for example:
当 O
希 O
望 O
工 O
程 O
救 O
助 O
的 O
百 O
万 O
儿 O
童 O
成 O
长 O
起 O
来 O
, O
科 O
教 O
兴 O
国 O
蔚 O
然 O
成 O
风 O
时 O
, O
今 O
天 O
有 O
收 O
藏 O
价 O
值 O
的 O
书 O
你 O
没 O
买 O
, O
明 O
日 O
就 O
叫 O
你 O
悔 O
不 O
当 O
初 O
! O
藏 O
书 O
本 O
来 O
就 O
是 O
所 O
有 O
传 O
统 O
收 O
藏 O
门 O
类 O
中 O
的 O
第 O
一 O
大 O
户 O
, O
只 O
是 O
我 O
们 O
结 O
束 O
温 O
饱 O
的 O
时 O
间 O
太 O
短 O
而 O
已 O
。 O
So it doesn't work on the raw dataset you provided, and its output also doesn't match the MRC-format dataset you provided.
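To narrow down where the raw file and the script disagree, it may help to summarize the file's shape first. The sketch below is a hypothetical diagnostic (`summarize_format` is not part of the repository). It counts blank lines, lines that do not split into exactly two whitespace-separated fields, and the label values seen.

```python
import io
from collections import Counter


def summarize_format(lines):
    """Return counts describing a 'char label' file (hypothetical diagnostic)."""
    stats = {"blank": 0, "malformed": 0, "labels": Counter()}
    for raw in lines:
        fields = raw.split()
        if not fields:
            # No fields at all: a blank (sentence-separator) line.
            stats["blank"] += 1
        elif len(fields) != 2:
            # Anything other than "<char> <label>".
            stats["malformed"] += 1
        else:
            stats["labels"][fields[1]] += 1
    return stats


sample = io.StringIO("当 O\n希 O\n望\n\n工 O\n")
stats = summarize_format(sample)
print(stats["blank"], stats["malformed"], dict(stats["labels"]))
```

If `blank` comes back as 0, the file has no sentence separators, in which case the `convert_file` function posted in this thread would never reach its `else` branch and would emit no samples.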