Transgene curation

See Caltech documentation on the WormBase wiki.

## Transgene objects in WormBase

We curate both integrated (Is, In) and extrachromosomal (Ex) transgene arrays. For each transgene, we extract the following information:

* public_name of the transgene (where a transgene has not been assigned a name, or has one that does not adhere to standard nomenclature, we assign a name based on the WBPaperID)
* genomic expression summary
* promoter(s)
* reporters (GFP, YFP, LacZ, etc.)
* other gene product(s)
* co-injection markers (noted as part of the genomic expression summary)
* if the transgene is integrated: the method of integration and the LG to which it was mapped, if known
* which papers use the transgene as part of an experiment
* any other names for the transgene, which might be reported by different authors

A WBTransgeneID is automatically assigned when the object is entered into the database.

## Transgene model

```
?Transgene Evidence #Evidence
           Public_name //added for WS234
           Summary UNIQUE ?Text
           Synonym ?Text
           Promoter Driven_by_gene ?Gene XREF Drives_Transgene
                    Driven_by_construct ?Text
           Reporter Reporter_product ?Text
                    Gene ?Gene XREF Transgene_product Text
                    3_UTR ?Gene
                    Reporter_type ?Text
           Construction Fragment ?Text
                        Construction_summary ?Text //Added for WS236(?), Remark data was transferred here
                        Coinjection_marker ?Text
                        Integration_method UNIQUE ?Text
                        Laboratory ?Laboratory #Lab_Location
                        Author ?Author
                        Person ?Person
           Genetic_information Extrachromosomal
                               Integrated
                               Map ?Map #Map_position
                               Map_evidence #Evidence
                               Mapping_data 2_point ?2_point_data
                                            Multi_point ?Multi_pt_data
                               Phenotype ?Phenotype XREF Transgene #Phenotype_info
                               Phenotype_not_observed ?Phenotype XREF Not_in_Transgene #Phenotype_info
           Used_for Expr_pattern ?Expr_pattern XREF Transgene
                    Marker_for ?Text #Evidence
                    Gene_regulation ?Gene_regulation XREF Transgene
                    Interactor ?Interaction
           Associated_with Marked_rearrangement ?Rearrangement XREF By_transgene
                           Clone ?Clone XREF Transgene Text
                           Strain ?Strain XREF Transgene
           Reference ?Paper XREF Transgene
           Species UNIQUE ?Species
           Remark ?Text #Evidence
```

## Automated extraction and curation

### In/Is transgenes

Arun and Wen have automated the identification of papers that contain transgenes by using Textpresso to scan the C. elegans corpus of papers for the regular expression (1-3 capital letters)Is or In(1-4 digits). This script will miss any transgenes that do not have a standard name using "Is" or "In".
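The name pattern described above could be expressed roughly as follows. This is only a sketch, not the Textpresso script itself: the function name is hypothetical, and the prefix is assumed to allow letters of either case because the names listed later on this page (for example aatIs3) use a lowercase prefix.

```python
import re

# Assumption: a prefix of 1-3 letters (either case), then "Is" or "In",
# then 1-4 digits, bounded by non-word characters.
TRANSGENE_NAME = re.compile(r"\b[A-Za-z]{1,3}(?:Is|In)\d{1,4}\b")


def find_transgene_names(text):
    """Return distinct candidate transgene names in order of first appearance."""
    seen = []
    for match in TRANSGENE_NAME.finditer(text):
        name = match.group(0)
        if name not in seen:
            seen.append(name)
    return seen
```

As with the real scan, a pattern like this will not pick up Ex arrays or any transgene with a non-standard name.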

Transgene names are extracted and used to populate the transgene postgres table; these entries do not have a summary (genotype) or remark. They are retrieved through the Transgene OA by searching for Arun in the curator field.

The output from the Textpresso search script is available on textpresso-dev and tazendra:

* http://textpresso-dev.caltech.edu/wen/transgenes_in_regular_papers.out
* /home/postgres/work/pgpopulation/textpresso/transgene/transgenes_in_regular_papers.out

"...paper.sup.1" means the transgene name was mentioned in the supplementary file.

All the transgene-paper links are entered into postgres automatically on tazendra: /home/postgres/work/pgpopulation/textpresso/transgene/update_textpreso_cur_transgene.pl

Only integrated lines (Is, Si) are entered into postgres from the Textpresso script. 'In' public names are not standard nomenclature; 'In' transgenes are not entered into Textpresso unless the transgene name already matches something in the database, meaning it has been confirmed to be a valid transgene.

### Ex transgenes

Ex transgenes are curated when they are associated with an expression pattern, phenotype, or gene regulation experiment. With a view to automating their curation, Arun R. developed a script that finds all papers in the C. elegans corpus that mention Ex transgenes, together with each Ex transgene and its associated genomic expression summary, sorted by transgene.

The output file can be viewed here (these entries are only loaded into Postgres manually): http://textpresso-dev.caltech.edu/transgene/transgenes_summary.out

### Obsolete transgene objects

The scans produce some false-positive transgene hits. Some of these are not real transgenes; others are real and should not be deleted, but need to be uncoupled from the WBPaper from which they were extracted. Flag these objects as "FAIL" through the OA so they will not be picked up again during future transgene object scans.

## Curation tools

* Curate with Phenote

## Transgene dumper

The transgene.ace dumper was written by Juancarlos and Wen to translate the transgene postgres data into .ace format for uploading into ACeDB. The script is located on tazendra. The output file is dumped into the same directory.

Test the .ace file in CitaceMinus:

* Make sure the file reads in fine.
* Look at all the transgene objects to make sure there are no strange-looking ones.
* Do a count of objects before and after the read-in to make sure the number of new objects is reasonable.
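The before-and-after count can also be approximated outside ACeDB by counting object headers in the dump. The sketch below assumes each object in the file starts with a header line of the form `Transgene : "<ID>"`, as is usual for .ace files; the function names and the idea of passing in the set of already-loaded IDs are illustrative assumptions, not part of the dumper.

```python
import re

# Assumption: every object in the dump begins with a line such as
#   Transgene : "WBTransgene00000001"
HEADER = re.compile(r'^Transgene\s*:\s*"([^"]+)"')


def transgene_ids(ace_text):
    """Collect the IDs of all Transgene objects found in .ace text."""
    ids = set()
    for line in ace_text.splitlines():
        match = HEADER.match(line)
        if match:
            ids.add(match.group(1))
    return ids


def new_object_count(ace_text, existing_ids):
    """Count objects in the dump that are not among the already-loaded IDs."""
    return len(transgene_ids(ace_text) - set(existing_ids))
```

A count far above the number of transgenes curated since the last upload would be a reason to inspect the dump before transferring it.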

A cron job set up by Juancarlos and Wen runs the Transgene .ace dumper script on Thursday mornings at 6am and deposits the output on citace at 8am. If any data have been added or changed between testing the file and the Thursday morning dump, make sure to rerun the script and transfer the new data dump to citace for upload.

## Requested changes

Changes recorded on https://bitbucket.org/kyook/ky_wbprojects/wiki/transgene_dump_ace.pl

--User:Kyook 23:26, 20 June 2012 (UTC)

* change dump cron job to Wed morning; spica still calls it at 8am on Thursday
* fix source of allele codes for obo_laboratory; URL for the lab-allele designations: http://www.cbs.umn.edu/cgc/lab-allele
  Juancarlos will parse the page and create a local copy. A cron job will compare the page to the local copy; if the page is altered in format, the table will revert to the local copy.

## Cross curation

Transgenes are used in other datatypes: Expr_pattern, Phenotype, Gene_regulation, Interaction. All curators that use transgenes will need to run a script **script name here** before each upload to make sure all their transgenes are valid objects.

#### Gene_regulation

The gene regulation curator can create new transgenes using the Transgene OA or request them from the transgene curator.

#### Expr_pattern

The expression pattern curator requires many transgene objects to be created on the fly. Rather than impeding the curation flow, when the expression pattern curator needs a transgene that has not been created, they enter the relevant information for the transgene in their Reporter gene text box.

A script (script name here), launched manually, checks for lines that have text in the reporter gene field but no value in the transgene field. The script creates a new object in the Transgene OA with the corresponding paper ID, the expression curator as curator, and the remark field populated with the reporter gene text. A temporary name is created from the expression pattern value with _Ex appended, and is deposited in the synonym field of the Transgene OA for the newly created object. See this wiki page for more information: http://wiki.wormbase.org/index.php/Expression_Pattern#Exporting_Reporter_Gene_description_from_Expr_pattern_OA_to_Transgene_OA
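The selection and field mapping described above might look something like this. It is a sketch under assumptions: the rows are taken to be plain dictionaries, and the key names (`reporter_gene`, `transgene`, `paper`, `curator`, `expr_pattern`) are hypothetical stand-ins for the actual OA fields.

```python
def transgenes_from_expr_rows(rows):
    """Build placeholder transgene records for expression rows lacking a transgene.

    Each row is a dict; all key names here are illustrative assumptions.
    """
    created = []
    for row in rows:
        reporter = (row.get("reporter_gene") or "").strip()
        transgene = (row.get("transgene") or "").strip()
        # Only rows with reporter text and no transgene value qualify.
        if reporter and not transgene:
            created.append({
                "paper": row["paper"],
                "curator": row["curator"],
                "remark": reporter,
                # Temporary name: expression pattern value with _Ex appended.
                "synonym": row["expr_pattern"] + "_Ex",
            })
    return created
```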

The transgene curator needs to:

* verify that the object created by the expression pattern curator is not a duplicate transgene. If it is a duplicate, the transgene curator will merge the new transgene into the preexisting one; this makes the new transgene invalid, so its information will not be dumped. The new transgene ID and other synonyms will be pipe-added to the synonym list of the preexisting object
* assign a public name if it exists or if needed
* fill in all other relevant information
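The merge step can be pictured as below. This is a sketch only: the record layout, the `valid` flag, and the ` | ` separator are assumptions made for illustration, and in practice the merge is done through the OA.

```python
def merge_into_existing(existing, duplicate):
    """Return updated copies of both records after merging a duplicate.

    The duplicate's ID and synonyms are pipe-added to the existing record's
    synonym list, and the duplicate is marked invalid so it is not dumped.
    """
    def split_pipe(value):
        return [part.strip() for part in value.split("|") if part.strip()]

    synonyms = split_pipe(existing.get("synonym", ""))
    for name in [duplicate["id"]] + split_pipe(duplicate.get("synonym", "")):
        if name not in synonyms:
            synonyms.append(name)

    merged = dict(existing, synonym=" | ".join(synonyms))
    retired = dict(duplicate, valid=False)
    return merged, retired
```

Returning copies rather than editing in place keeps the original records available for comparison if the merge turns out to be wrong.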

#### Interaction

The interaction curator can create new transgenes using the Transgene OA or request them from the transgene curator.

#### Strain

Strain data from the CGC has relevant remarks about the transgenes, which need to be kept in sync with the information WB extracts from papers and other sources. Other sources of transgene information:

* BC-Strain/transgene project
* modENCODE project

Transgene-Strain info from CGC

As of Feb 2013 the CGC sends to Mary Ann a file of strains with transgenes and construct summaries. The transgenes with no WBTransgeneID will need to be entered into postgres.

On tazendra: /home/acedb/karen/cgc_transgenes

Two scripts:

* compare_names.pl - takes transgene_report_latest.txt (sent by Mary Ann and renamed) and shows all transgenes not in postgres
* compare_transgenes.pl - takes CGC_transgenes.txt (sent by Mary Ann and manually modified into columns), scans postgres for matches, and reports whatever does not match

compare_transgenes.pl

When NO PG MATCH is the value in the 1st column:

- get a new pgid sequentially
- write the 2nd column to trp_name
- write the 3rd column to trp_strain (beware: this column can hold pipe-separated values, all of which should go into trp_strain)
- write the 4th column to trp_summary
- write "WBPerson712" to trp_curator
- write "CGC" to trp_location

Example output:
NO PG MATCH     (WBPaper00006525).sEx14126      BC14126 [rCesC34D1.3::GFP + pCeh361]            Strains different       Synonyms different
NO PG MATCH     (WBPaper00006525).sIs10697      BC10819 [rCesZC101.2e::GFP + pCeh361]           Strains different       Synonyms different
NO PG MATCH     Ex      NW1229 | OH439 | OH441 | OH898  [F25B3.3::GFP; dpy-20(+)] | [pBX; ser2prom2::GFP] | [unc-119::GFP] | [unc-33::GFP; unc-4(+)]            Strains different       Synonyms different
NO PG MATCH     aatIs3  TOL4            [P-unc-17::YC3.60]      Strains different       Synonyms different
NO PG MATCH     acEx102 AY102           [P-vha-6::pmk-1::GFP +rol-6(su1006)]           Strains different       Synonyms different
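The column mapping for NO PG MATCH rows can be sketched as follows. This is not the Perl script itself; it assumes the report is tab-separated, that "sequentially" means one more than the highest pgid already in use, and the function name and the `last_pgid` parameter are hypothetical.

```python
def rows_to_insert(lines, last_pgid):
    """Map NO PG MATCH report lines to trp_ field values.

    Assumptions: columns are tab-separated, and new pgids continue
    from last_pgid + 1.
    """
    records = []
    next_pgid = last_pgid + 1
    for line in lines:
        cols = [col.strip() for col in line.rstrip("\n").split("\t")]
        if len(cols) < 4 or cols[0] != "NO PG MATCH":
            continue
        # The strain column may hold several pipe-separated values;
        # all of them are kept for trp_strain.
        strains = [s.strip() for s in cols[2].split("|") if s.strip()]
        records.append({
            "pgid": next_pgid,
            "trp_name": cols[1],
            "trp_strain": strains,
            "trp_summary": cols[3],
            "trp_curator": "WBPerson712",
            "trp_location": "CGC",
        })
        next_pgid += 1
    return records
```

Rows such as the bare `Ex` entry in the example output would still need manual review, since the name column there does not identify a single transgene.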
