33 zooma endp #1
base: main
Changes from 12 commits
ee14fed
1e23558
080e3b8
f68e098
21baa23
3b6b0f0
e4befb0
68e7c96
b77b32d
ce4e6cc
302e060
530c2bc
7bc55b3
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,2 @@ | ||
EXT_PORT=8081 | ||
FLASK_PORT=3001 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
FROM python:3.10 | ||
|
||
WORKDIR /app | ||
|
||
COPY . /app | ||
RUN pip3 install --upgrade pip | ||
RUN pip3 install -r requirements.txt | ||
|
||
EXPOSE 8080 | ||
|
||
CMD ["python3", "server.py"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,34 @@ | ||
# cohort-atlas-harmonisation | ||
Harmonisation module of the cohort atlas | ||
Harmonisation module of the Cohort Atlas project. The main aim of this project is to | ||
organize different datasets into comparable groups (or cohorts) based on common features or characteristics. | ||
|
||
This module is responsible for performing operations to harmonize or reconcile | ||
the datasets, in order to further create cohorts that can be analyzed as a single | ||
group in research. | ||
|
||
For launch and down of this module: | ||
docker-compose up --build -d | ||
docker-compose down | ||
|
||
Internal and external ports are set in the evn.txt file in the root module directory: <br> | ||
H_PORT=3000<br> | ||
EXT_PORT=8081 | ||
|
||
Another product of EBI, named ZOOMA, is used in this module. | ||
ZOOMA maps text to ontology terms based on curated mappings from selected datasources | ||
(more preferred), and by searching ontologies directly (less preferred).<br> | ||
Documentation for ZOOMA is placed here: https://www.ebi.ac.uk/spot/zooma/docs. | ||
|
||
Example of the harmonisation module endpoint: | ||
http://localhost:8081/match?path=/app/shared/sample_labels_to_annotate.csv <br> | ||
The endpoint gives you information about ontology terms matched to the labels values | ||
from the .csv file. Example of the .csv file:<br> | ||
LABELS<br> | ||
Gender<br> | ||
Birthdate<br> | ||
Year of birth<br> | ||
Agreement date<br> | ||
Age at present<br> | ||
|
||
This endpoint uses ZOOMA by this way: | ||
http://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate?propertyValue={label} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
version: '3' | ||
services: | ||
myapp: | ||
image: cohort-atlas-harmonisation | ||
build: | ||
context: . | ||
dockerfile: Dockerfile | ||
ports: | ||
- ${EXT_PORT}:${FLASK_PORT} | ||
volumes: | ||
- ./shared:/app/shared | ||
env_file: | ||
- .env |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
from harmonise.zooma import ZoomaClient | ||
|
||
|
||
class FieldMatchingService: | ||
|
||
field_dict = { | ||
'propertyValue': None, | ||
'semanticTags': None, | ||
'confidence': None | ||
} | ||
|
||
def __init__(self): | ||
pass | ||
|
||
def get_field_dict(self, url): | ||
z_cl = ZoomaClient() | ||
resp_json = z_cl.get_json(url=url) | ||
|
||
if resp_json is not None: | ||
for i, el in enumerate(resp_json): | ||
try: | ||
self.field_dict['propertyValue'] = el['annotatedProperty']['propertyValue'] | ||
self.field_dict['semanticTags'] = el['semanticTags'] | ||
self.field_dict['confidence'] = el['confidence'] | ||
except Exception as e: | ||
print(e) | ||
Review comment: Should use a proper logging library. The default Python logging module will do fine.
||
|
||
return self.field_dict | ||
|
||
|
||
def get_match(file_path: str): | ||
match_dict = dict() | ||
|
||
with open(file_path, 'r') as f: | ||
labels = list(map(lambda s: s.strip(), f.readlines())) | ||
|
||
for label in labels: | ||
if len(label) != 0: | ||
fm_cl = FieldMatchingService() | ||
Review comment: Are FieldMatchingService and ZoomaClient doing two different things?
||
field_dict = fm_cl.get_field_dict( | ||
url=f'http://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate?propertyValue={label}' | ||
) | ||
match_dict[label] = field_dict | ||
|
||
return match_dict |
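A minimal sketch of what the logging comment above could look like in practice, not the author's implementation. The function name `extract_field_dict` is hypothetical. It also builds a fresh dict on every call: in the diff, `field_dict` is a class attribute that `get_field_dict` mutates and returns, so every label in `match_dict` appears to end up pointing at the same dict object. The sketch keeps the original "last annotation wins" behaviour, which is an assumption about what is wanted.

```python
import logging

logger = logging.getLogger(__name__)


def extract_field_dict(annotations):
    """Return the fields of the last well-formed annotation (hypothetical helper)."""
    # A fresh dict per call, so results for different labels never share state.
    field_dict = {'propertyValue': None, 'semanticTags': None, 'confidence': None}

    for annotation in annotations or []:
        try:
            # Read all three fields first so a malformed item cannot half-update the result.
            candidate = {
                'propertyValue': annotation['annotatedProperty']['propertyValue'],
                'semanticTags': annotation['semanticTags'],
                'confidence': annotation['confidence'],
            }
        except (KeyError, TypeError):
            # logger.exception logs at ERROR level and appends the traceback.
            logger.exception("Skipping malformed ZOOMA annotation: %r", annotation)
            continue
        field_dict = candidate

    return field_dict
```

Catching only `KeyError` and `TypeError` keeps unrelated failures visible instead of hiding them behind a broad `except Exception`.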
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
import requests | ||
|
||
|
||
class ZoomaClient: | ||
Review comment: This class looks static; there are a few ways we can improve this to make it better OOP.
Reply: Do you mean: [...]?
||
def __init__(self): | ||
pass | ||
|
||
def get_json(self, url): | ||
resp = requests.get(url) | ||
if resp.status_code == 200: | ||
return (resp.json()) | ||
return None |
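One possible reading of the "looks static" comment, sketched under assumptions: the client owns the ZOOMA annotate URL and exposes an `annotate(label)` method, so callers no longer assemble URLs themselves. The names `ZOOMA_ANNOTATE_URL`, `fetch_json`, `build_url` and `annotate` are hypothetical. The sketch uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies, and the fetch function is injectable so tests do not need the network.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the diff; kept in one place instead of inside an f-string.
ZOOMA_ANNOTATE_URL = 'http://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate'


def fetch_json(url):
    """Default fetcher: parsed JSON on success, None on any HTTP or decoding error."""
    try:
        with urlopen(url, timeout=10) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # Non-2xx responses raise HTTPError (an OSError subclass);
        # invalid JSON raises a ValueError subclass.
        return None


class ZoomaClient:
    def __init__(self, base_url=ZOOMA_ANNOTATE_URL, fetch=fetch_json):
        self.base_url = base_url
        self._fetch = fetch

    def build_url(self, label):
        # urlencode escapes spaces, '=', parentheses and other reserved characters.
        return f"{self.base_url}?{urlencode({'propertyValue': label})}"

    def annotate(self, label):
        """Return ZOOMA's JSON response for one label, or None on failure."""
        return self._fetch(self.build_url(label))
```

With this shape, a caller would write something like `zooma_client = ZoomaClient()` followed by `zooma_client.annotate(label)`, and a unit test can pass a stub as `fetch`.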
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,18 @@ | ||
click==8.1.3 | ||
Flask==2.2.5 | ||
flask_cors==3.0.10 | ||
importlib-metadata==6.6.0 | ||
itsdangerous==2.1.2 | ||
Jinja2==3.1.2 | ||
MarkupSafe==2.1.2 | ||
nltk~=3.8.1 | ||
pandas~=1.3.5 | ||
psutil==5.9.4 | ||
pytest==7.4.0 | ||
python-dotenv==1.0.0 | ||
requests==2.30.0 | ||
scikit-learn~=1.0.2 | ||
typing_extensions==4.5.0 | ||
Werkzeug==2.2.3 | ||
zipp==3.15.0 | ||
|
||
wordninja~=2.0.0 | ||
|
||
pandas~=1.3.5 | ||
nltk~=3.8.1 | ||
scikit-learn~=1.0.2 | ||
zipp==3.15.0 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
LABELS | ||
Gender | ||
Birthdate | ||
Year of birth | ||
Agreement date | ||
Age at present | ||
Age at the agreement date | ||
Date of death | ||
Year of death | ||
Last BMI value | ||
Last weight value | ||
Last height value | ||
The date of last weight, height and BMI measurement | ||
Last bmi value source | ||
Last smoking status | ||
Date of last smoking report | ||
Last smoking status source | ||
Last status of alcohol consumption | ||
Alcohol consumption habits | ||
Daily alcohol consumption during the last year (1 unit = 10 g of pure alcohol) | ||
Date of last report of alcohol consumption | ||
Nationality | ||
Last education | ||
The date of the last education | ||
Last education source | ||
Country of residence | ||
County of residence | ||
City of residence | ||
Settlement region type |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,55 @@ | ||
import requests | ||
import os | ||
from dotenv import load_dotenv | ||
from itertools import islice | ||
import pytest | ||
|
||
load_dotenv('./env') | ||
FLASK_PORT = int(os.getenv('FLASK_PORT')) | ||
|
||
|
||
def test_labels(file_path: str): | ||
global FLASK_PORT | ||
|
||
if not os.path.exists(file_path): | ||
print(f"This file doesn't exist: {file_path}") | ||
return dict() | ||
|
||
url = f"http://localhost:{FLASK_PORT}/match" | ||
response = requests.post(url, files={'file': open(file_path, 'rb')}) | ||
|
||
if response.status_code == 200: | ||
outp_json = response.json() | ||
print(f"Response json is: {outp_json}") | ||
else: | ||
outp_json = dict() | ||
print(f"Request failed with status code: {response.status_code}; file path: {file_path}") | ||
|
||
assert len(outp_json) > 0, "Empty json" | ||
assert len(outp_json) == 29, "Wrong size json" | ||
Review comment: Test cases should be easy to understand. For example, what does 29 mean here? Does it need a comment to explain it, or would a self-explanatory constant describe it?
Reply: As I'm not going to change the CSV file for the test, I expect the result JSON to keep the same size. But if the service (f'http://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate?propertyValue={label}') changes, the result will change as well.
||
|
||
first_5_elements = dict(islice(outp_json.items(), 5)) | ||
|
||
expected_values = { | ||
Review comment: What happens if ZOOMA gains new knowledge and another mapping appears in its output?
Reply: Possibly we can check the JSON keys only, like: [...]
||
'Age at present': {'confidence': 'MEDIUM', 'propertyValue': 'mating_type_region', | ||
'semanticTags': ['http://purl.obolibrary.org/obo/SO_0001789']}, | ||
'Age at the agreement date': {'confidence': 'MEDIUM', 'propertyValue': 'mating_type_region', | ||
'semanticTags': ['http://purl.obolibrary.org/obo/SO_0001789']}, | ||
'Agreement date': {'confidence': 'MEDIUM', 'propertyValue': 'mating_type_region', | ||
'semanticTags': ['http://purl.obolibrary.org/obo/SO_0001789']}, | ||
'Alcohol consumption habits': {'confidence': 'MEDIUM', 'propertyValue': 'mating_type_region', | ||
'semanticTags': ['http://purl.obolibrary.org/obo/SO_0001789']}, | ||
'Birthdate': {'confidence': 'MEDIUM', 'propertyValue': 'mating_type_region', | ||
'semanticTags': ['http://purl.obolibrary.org/obo/SO_0001789']} | ||
} | ||
|
||
for key, value in first_5_elements.items(): | ||
assert key in expected_values, f"Unexpected key in json: {key}" | ||
Reply: Now I'm checking only the keys in the JSON.
||
assert value == expected_values[key], f"Unexpected value for key {key} in json. " \ | ||
f"Expected: {expected_values[key]}. Got: {value}" | ||
|
||
return outp_json | ||
|
||
|
||
if __name__ == '__main__': | ||
test_labels(file_path=f"sample_labels_to_annotate.csv") |
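A sketch of how the two test comments above might be addressed together, under stated assumptions. The sample file in this PR has 29 lines including the `LABELS` header, and `get_match` treats every non-empty line as a label, so the magic number 29 can be derived from the file rather than hard-coded (assuming the `/match` endpoint returns one entry per non-empty line). Checking only the keys keeps the test stable when ZOOMA's mappings change. `REQUIRED_FIELDS`, `read_labels` and `check_match_shape` are hypothetical names.

```python
REQUIRED_FIELDS = {'propertyValue', 'semanticTags', 'confidence'}


def read_labels(lines):
    """Return stripped, non-empty lines; the header line is kept, as get_match does."""
    return [line.strip() for line in lines if line.strip()]


def check_match_shape(result, labels):
    """Assert one entry per distinct label, each with exactly the required fields."""
    expected_labels = set(labels)
    assert len(result) == len(expected_labels), (
        f"Expected {len(expected_labels)} entries, got {len(result)}"
    )
    missing = expected_labels - set(result)
    assert not missing, f"Labels missing from response: {sorted(missing)}"
    for label, fields in result.items():
        assert set(fields) == REQUIRED_FIELDS, (
            f"Unexpected fields for {label!r}: {sorted(fields)}"
        )
```

In the test, `labels` would come from `read_labels(open(file_path))` and `result` from `response.json()`, so neither the size nor the mapped values are hard-coded.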
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,12 +1,29 @@ | ||
id,name,label,description,type,values,parent,annotations,tags | ||
Review comment: We are not only receiving the 'labels' but a CSV file that could contain the fields 'type', 'description', etc.
Reply: Do you suggest I add more columns there?
||
id_1,Gender,Gender,Gender,string,"MALE, FEMALE, OTHER",,, | ||
id_2,Birthdate,Birthdate,Birthdate,string,,,, | ||
id_3,Year of birth,Year of birth,Year of birth,string,,,, | ||
id_4,Agreement date,Agreement date,Agreement date,string,,,, | ||
id_5,Age at present,Age at present,Age at present,string,,,, | ||
|
||
|
||
|
||
|
||
|
||
|
||
LABELS | ||
Gender | ||
Birthdate | ||
Year of birth | ||
Agreement date | ||
Age at present | ||
Age at the agreement date | ||
Date of death | ||
Year of death | ||
Last BMI value | ||
Last weight value | ||
Last height value | ||
The date of last weight, height and BMI measurement | ||
Last bmi value source | ||
Last smoking status | ||
Date of last smoking report | ||
Last smoking status source | ||
Last status of alcohol consumption | ||
Alcohol consumption habits | ||
Daily alcohol consumption during the last year (1 unit = 10 g of pure alcohol) | ||
Date of last report of alcohol consumption | ||
Nationality | ||
Last education | ||
The date of the last education | ||
Last education source | ||
Country of residence | ||
County of residence | ||
City of residence | ||
Settlement region type |
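If the input is expected to keep the multi-column layout that this diff replaces (`id,name,label,description,type,values,parent,annotations,tags`), the labels could be read with `csv.DictReader` rather than line by line. This is a sketch under that assumption. `read_field_rows` and `SAMPLE_CSV` are hypothetical names, and the sample rows are copied from the removed lines of the file.

```python
import csv
import io

# Two rows taken from the original multi-column sample file.
SAMPLE_CSV = (
    'id,name,label,description,type,values,parent,annotations,tags\n'
    'id_1,Gender,Gender,Gender,string,"MALE, FEMALE, OTHER",,,\n'
    'id_2,Birthdate,Birthdate,Birthdate,string,,,,\n'
)


def read_field_rows(csv_text, label_column='label'):
    """Parse the CSV and return one dict per row that has a non-empty label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if label_column not in (reader.fieldnames or []):
        raise ValueError(f"CSV has no {label_column!r} column")
    # Quoted values such as "MALE, FEMALE, OTHER" stay in a single field.
    return [row for row in reader if (row[label_column] or '').strip()]
```

Each returned row keeps `type`, `description` and the other columns next to the label, so they remain available to the matching step.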
Review comment: It is always good to use self-explanatory names. Very short names are sometimes acceptable for short-lived variables. Here I would name this zooma_client rather than z_cl.