
Wikipedia talk:WikiProject Molecular Biology/Gene Wiki/Archive 2

From Wikipedia, the free encyclopedia
Gene Wiki – Discussion


more identifiers

Have you noticed the Catalase page? Someone has added another identifiers template above this version. This seems crazy. Do you think it's the desire for a direct link to pfam that has driven this edit? Has this been discussed anywhere else? David D. (Talk) 05:59, 5 February 2008 (UTC)

For the record, I found this discussion. David D. (Talk) 06:03, 5 February 2008 (UTC)
Yeah, I think this is one of those cases where a given term refers to both a specific gene/protein as well as a protein family. Not sure what the best solution is here... AndrewGNF (talk) 06:07, 5 February 2008 (UTC)
Maybe there is no solution, but the two boxes don't work from an aesthetic perspective. One possibility might be to add some new fields for family identifiers into the PBB template that could then be used in these rarer cases. At least then there would be only one box. Have you been in contact with Biophys (talk · contribs)? He seems to have been the one to code this box. David D. (Talk) 06:17, 5 February 2008 (UTC)
I think adding PFAM as optional fields to the PBB template as already suggested in the PBB version 2 specs (import and display protein domain information (through Uniprot/PFAM/COGs)) would solve the problem. Boghog2 (talk) 07:08, 5 February 2008 (UTC)
OK, thanks, I didn't check there, it will make a lot more sense than having two boxes. David D. (Talk) 07:09, 5 February 2008 (UTC)
Should be pretty straightforward to add new optional fields to the PBB templates. Obviously they won't get populated by PBB until V2, but that will at least allow merging of the boxes when they appear on the same page. But, I'm still not completely sure they should be merged. Check out Special:Whatlinkshere/Template:Pfam_box. These are clearly protein families, not individual protein/genes. I almost think that we should split Catalase into two articles -- one on properties of the protein family, and one on specific functions of the catalase gene which happens to share that same name. AndrewGNF (talk) 16:50, 5 February 2008 (UTC)
And just to clarify, the V2 addition will be to add pfam links -- one gene to many PFAMs. So that seems to be a slightly different goal than the Pfam_box, which is meant to give information about one and only one PFAM entry. AndrewGNF (talk) 16:54, 5 February 2008 (UTC)
Of course it would be a great improvement to include Pfam and possibly InterPro links in Protein Box, as was discussed previously. Just remember: this requires using the domain structure of the protein. Different domains of the same protein may belong to different Pfam families. Simply making a Pfam link for each domain of the protein would resolve this (if there are several identical domains in the protein, one link might be enough). The Catalase article is a perfect example of inappropriate merging of an article about a specific human protein and an article about a protein family/domain. Such articles must always remain separate but refer to each other. Biophys (talk) 18:29, 5 February 2008 (UTC)
This may be a semantic question, but I consider the subject of the catalase article to be about the enzyme of which the human variant is one example. In my opinion, information about the human enzyme is therefore appropriate to include in the article. It is also important to point out that in its present form, the PBB template already contains information on more than just the human protein/gene, in particular the EC number and HomoloGene links. While splitting articles into protein families and the specific human protein is a possibility, I would argue that this may unnecessarily fragment closely related information, especially considering the majority of these articles are presently stubs. In the case of catalase, there is probably enough material to split the article. So if someone went ahead and split the catalase article, I wouldn't object. Boghog2 (talk) 19:58, 5 February 2008 (UTC)
I agree, there are multiple right answers, and each example is probably a judgment call of the person dealing with the page. AndrewGNF (talk) 20:09, 5 February 2008 (UTC)
The Boghog2 suggestion is valid (I agree) when there is only one human protein from the family, i.e. only one human catalase. If there were several human catalases, one could include them all in one article, but I would recommend avoiding this. Biophys (talk) 20:51, 5 February 2008 (UTC)

google ranks

Just ran an analysis that I thought was pretty interesting… I took all 8352 PBB pages we currently have, and googled the corresponding gene symbols. In over 60% of cases, the Wikipedia gene page was shown on the front page! (Keep in mind that some gene symbols that match more common acronyms will never show up on the front page, e.g., CAT, AGT, LEP.) There’s still room to grow, but I think this is a pretty darn good start. (Obviously not all PBB's work, since many genes were preexisting, and many people have done significant work improving PBB pages...) Anyway, check out the histogram at right... Cheers, AndrewGNF (talk) 01:29, 6 February 2008 (UTC)

That's pretty exciting. It just shows how wikipedia can be used by scientists to get this information out there. It's not about how good the information is, we all know there are excellent science web sites out there. Most important is where the information is hosted, in this case wikipedia. David D. (Talk) 18:51, 6 February 2008 (UTC)
And if I may articulate David's thought one step further... With all the great science web sites out there, it's pretty much only one-way transfer of information. If web users see incomplete/inaccurate information, their only recourse is to yell and curse at the screen. Here at WP, of course, they can fix the error or omission. This is really the first two-way gene portal, qualitatively different than the rest. (I know, I know, I'm preaching to the choir here...) AndrewGNF (talk) 19:24, 6 February 2008 (UTC)
Congrats Andrew and Jon! I agree that this is a very impressive result. The WP pages seeded by PBB are a great start which I hope will attract more experts to flesh out the details. Already they represent a convenient first stop of useful links to other sources of information. Furthermore the PBB expression data is unique in its accessibility and breadth. Bravo Brava Bravi! Boghog2 (talk) 19:46, 6 February 2008 (UTC)

Sources

Hi! The bot currently doesn't give any sources (like here); that's problematic for verifiability. Could it be instructed to add sources (perhaps also to images it previously uploaded)? -- Lea (talk) 10:42, 8 February 2008 (UTC)

Hello, thanks for the note. This is actually the exact issue that was discussed (and recently resolved for one image) in the section above. We'll work on now making that change for all images that were previously uploaded. AndrewGNF (talk) 16:56, 8 February 2008 (UTC)
Cool, thanks! And sorry I missed the last section! -- Lea (talk) 21:49, 8 February 2008 (UTC)

We need genes here

Good day. I had a look at Tumor necrosis factor receptor, and thought it might need some articles on its genes, since they're all pretty notable in immunology. Can the bot do this? Mikael Häggström (talk) 08:39, 9 February 2008 (UTC)

The tumor necrosis factor receptor article is about a fairly large protein family so a Pfam box on this page would be appropriate. Individual articles about many of the family members with PBB content already existed but not all were linked from the parent article. Finally there were a few family member pages without PBB content. The Pfam box, missing links and PBB content have now all been added. Cheers. Boghog2 (talk) 12:07, 9 February 2008 (UTC)

Simply not Public Domain

Hi there.

I've been watching this for any specific replies from PDB although I've not really expected anything. The restrictions stated by PDB are very plain in meaning.

Online and printed resources are welcome to include PDB data and images from the RCSB PDB website, and may be sold, as long as the images and data are not for sale as commercial items themselves, and their corresponding citations are included.[1]

Breaking it down plainly:

  • Attribution is required in all instances.
  • Commercial use of the images is allowed if the images are part of a publication.
  • Commercial use is prohibited if you wish to sell the images individually.

Hordes of people use "public domain" according to a private interpretation and not a legal one. This is an instance. It is very clearly erroneous to label PDB images as public domain. Moreover, PDB images simply are not free images per the Wikimedia definition of free. For what it's worth, Citizendium tags PDB images like this.

Stephen Ewen (talk) 08:49, 6 February 2008 (UTC)

Tim, can you take the lead and follow up with the person you've been emailing at the PDB for clarification? (I'm happy to do it as well if you send me contact info...) Thanks... AndrewGNF (talk) 17:44, 6 February 2008 (UTC)

Done. Tim Vickers (talk) 18:18, 6 February 2008 (UTC)

I think the final decision on this was that we need to use the {{attribution}} tag rather than {{PD-release}}. Tim Vickers (talk) 17:34, 10 February 2008 (UTC)

Thanks for uploading Image:PBB_GE_NIPA1_gnf1h07157_at_fs.png. I noticed that the file's description page currently doesn't specify who created the content, so the copyright status is unclear. If you did not create this file yourself, then you will need to specify the owner of the copyright. If you obtained it from a website, then a link to the website from which it was taken, together with a restatement of that website's terms of use of its content, is usually sufficient information. However, if the copyright holder is different from the website's publisher, then their copyright should also be acknowledged.

As well as adding the source, please add a proper copyright licensing tag if the file doesn't have one already. If you created/took the picture, audio, or video then the {{GFDL-self}} tag can be used to release it under the GFDL. If you believe the media meets the criteria at Wikipedia:Fair use, use a tag such as {{non-free fair use in|article name}} or one of the other tags listed at Wikipedia:Image copyright tags#Fair use. See Wikipedia:Image copyright tags for the full list of copyright tags that you can use.

If you have uploaded other files, consider checking that you have specified their source and tagged them, too. You can find a list of files you have uploaded by following this link. Unsourced and untagged images may be deleted one week after they have been tagged, as described on criteria for speedy deletion. If the image is copyrighted under a non-free license (per Wikipedia:Fair use) then the image will be deleted 48 hours after 00:02, 8 February 2008 (UTC). If you have any questions please ask them at the Media copyright questions page. Thank you. Nv8200p talk 00:02, 8 February 2008 (UTC)

Why does everyone always pick on the image for the NIPA1 gene?  ;) Nv8200p, thanks for the note. Do you have a suggestion on how to appropriately tag it? The images were created by me. We originally tagged it with {{self}}, but MZMcBride pointed out that this tag is probably not appropriate for a bot-uploaded image. (See User_talk:ProteinBoxBot/Archives/Archive1#ProteinBoxBot.27s_uploads.) We then switched to {{cc-by-sa-3.0|[[Genomics Institute of the Novartis Research Foundation]]}}, where the attribution goes to my employer. If we need to specify any further "source", please suggest how we should do this. Thanks, AndrewGNF (talk) 00:13, 8 February 2008 (UTC)
Is there any publication in which the data used to create the graph was printed? A peer-reviewed journal or reliable magazine would be preferable, but a company-produced article would work. If this data has not been published then the graph might be considered original research. -Thanks Nv8200p talk 01:37, 8 February 2008 (UTC)
The data come from this article: PMID 15075390. Should the article itself then be listed as the source? AndrewGNF (talk) 02:21, 8 February 2008 (UTC)
Yes. I set it up on the image description page. Please edit it and make it right as I was just guessing at the description. -Regards Nv8200p talk 03:52, 8 February 2008 (UTC)
Super, thanks much... AndrewGNF (talk) 05:56, 8 February 2008 (UTC)

Just great, let's fight to delete a diagram of gene expression but try and save an article on Corey Worthington. The priorities seem a little skewed here. David D. (Talk) 03:16, 8 February 2008 (UTC)

;) I don't envy your job (if one can call it a "job"...) AndrewGNF (talk) 06:07, 8 February 2008 (UTC)
I do not think this is really a "job". This is more like entertainment. Consider this La Comédie humaine where you are one of the actors. Biophys (talk) 03:32, 10 February 2008 (UTC)

doi weirdness

Hi, please have a look at [2]; the Luo et al. paper in the further reading section has a wrong doi identifier. I fixed it by hand now, but you should fix your bot too. Cheers, AxelBoldt (talk) 21:55, 10 February 2008 (UTC)

Hi, the problem (I think) is not with the ProteinBoxBot, but the tool used to create the citation. As explained here, some URLs and DOIs contain special characters which cause the wiki link to break. Fortunately there is an easy fix: replace the special characters with their hexadecimal equivalents as I have done here to the Luo et al. citation. User:Diberri graciously fixed the problem with the URLs, but apparently not with the DOIs. I have asked him to fix the DOIs as well. Cheers. Boghog2 (talk) 22:23, 10 February 2008 (UTC)
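As an illustration of the hexadecimal substitution described above, here is a minimal Python sketch. The set of "special" characters, the function name `escape_for_wikilink`, and the example DOI are all assumptions made for this example; the thread does not say which characters the citation tool actually escapes.

```python
# Characters assumed (for this sketch only) to break wiki link syntax.
WIKI_SPECIAL = set('<>[]{}|"')


def escape_for_wikilink(identifier: str) -> str:
    """Replace each special character with its %XX hexadecimal form."""
    return "".join(
        f"%{ord(ch):02X}" if ch in WIKI_SPECIAL else ch
        for ch in identifier
    )


# A made-up DOI containing angle brackets, not a real identifier.
example_doi = "10.1000/abc<123>x"
print(escape_for_wikilink(example_doi))  # prints 10.1000/abc%3C123%3Ex
```

Only the listed characters are rewritten, so identifiers without them pass through unchanged.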

Not {{Attribution}}, either

Subject:
Re: citing the pdb
From:
Rachel Kramer Green <kramer@rcsb.rutgers.edu>
Date:
Tue, 12 Feb 2008 20:48:38 -0700
To:
Stephen Ewen <ewenste@bellsouth.net>
CC:
info@rcsb.org

Dear Mr. Ewen,

Thank you for your email message.

Just to clarify -- our citation information is located at http://www.rcsb.org/robohelp_f/#site_navigation/citing_the_pdb.htm

In particular:
You *may not* collect PDB images and data and just sell the images and data commercially
You *may *download the images and put them in a book (including a reference to us) and sell that commercially
You *may *download the data, and do something with it, and sell that it commercially

Please let us know if we can be of additional assistance.

Sincerely,
Rachel Green

**************************
Rachel Kramer Green, Ph.D.
RCSB PDB
kramer@rcsb.rutgers.edu
**************************



Stephen Ewen wrote:
> Hello.  I am trying to seek clarification on http://www.rcsb.org/pdb/static.do?p=general_information/about_pdb/contact/index.html
>
> Specifically the portion that states "PDB data and images from the RCSB PDB website, and may be sold, as long as the images and data are not for sale as commercial items themselves."
>
> May I or may I not collect PDB images and data and sell just the images and data commercial?
>
> Kindly advise,
>
> Stephen Ewen, M.Ed
>   

Clearly, {{Attribution}} would also be a misleading way to tag PDB images - as I thought all along. You gotta ask the questions real pointed sometimes!

Stephen Ewen (talk) 06:23, 13 February 2008 (UTC)

For whatever it's worth to you, see {{PDB}}. Stephen Ewen (talk) 06:52, 13 February 2008 (UTC)
Thanks Stephen... For simplicity, let's take all the follow-up to Wikipedia:Media_copyright_questions#Protein_Data_Bank where hopefully we can get some clarity on a game plan... Cheers, AndrewGNF (talk) 18:24, 13 February 2008 (UTC)

Integration between enzyme and gene pages

It would be good to increase the number of cross-references between the gene pages such as Choline acetyltransferase and the semi-automatic enzyme pages, such as Carnitine O-octanoyltransferase. At present we have two parallel sets of pages that overlap in places. However, the enzyme pages should be broad and cover the activity in all organisms, while the gene pages cover the enzyme's gene/protein in humans.

I suggest an EC number redirect to a general enzyme page, which can be targeted by a standard field in the PBB template. This general page would have the standard enzyme name as the title, e.g. "Carnitine O-octanoyltransferase". The human gene page would have the gene ID as its title. A "See also" section in the general enzyme page could link to "Carnitine O-octanoyltransferase in humans". Tim Vickers (talk) 19:01, 26 February 2008 (UTC)

(I moved Tim's comment from User:ProteinBoxBot/Google_Summer_of_Code_2008 for discussion here.) So just to be sure I understand... We'll adjust PBB so that EC number is properly noted in the infobox and links to the corresponding enzyme page (through a redirect of the EC number). Creation of enzyme pages is in progress (done?) by Daisy, and those will be linked back to the human gene page. Sound right? If so, then PBB (or some offshoot) will only be responsible for getting ECs to show up correctly in human gene pages? AndrewGNF (talk) 23:06, 4 March 2008 (UTC)

Newline layout issue in created articles

Could the operators of this bot please review this edit that I made to CDH1 (gene)? It improves the layout of the article, eliminating the blank space that gets rendered at the top; compare [3] and [4]. Melchoir (talk) 09:25, 6 March 2008 (UTC)

Thanks, fixing this issue for newly-created pages is definitely on our to do list. Cheers, AndrewGNF (talk) 00:54, 17 March 2008 (UTC)
Okay, great! Melchoir (talk) 17:37, 19 March 2008 (UTC)

First of all, thank you for operating this excellent bot! Would it be feasible to link some or all GO terms to Wikipedia entries, as well as to the GO reference site? Not sure if one would want to do this only where an article already exists, redlink to encourage article creation, or automatically create stubs. Pseudomonas(talk) 19:48, 10 March 2008 (UTC)

Personally, I think the current behavior (linking to geneontology.org) is the best option. I think linking to WP pages would only be good if it were done in concert with a GO term stub creation effort, and even then I'm not sure it's the best option. (For example, what additional info would you add to those pages, besides the ontology term and a link to a central resource?) Having said all that, technically speaking it would be an easy change and we'll gladly comply with the community consensus. If you'd like to pursue this further, I'd suggest taking it over to Wikipedia:WikiProject_Molecular_and_Cellular_Biology/Proposals for more discussion. Cheers, AndrewGNF (talk) 01:00, 17 March 2008 (UTC)

Protein pictures

This CDH1 (gene) example above... It has several representative PDB files but no pictures. They are supposed to be there. 1i7w is a mouse protein, but 2omv is human cadherin (chain B/2). This is probably a combination of protein chains from two different species. Could that be a reason for the missing picture? Biophys (talk) 00:01, 13 March 2008 (UTC)

This is an issue that we noticed before and I think we fixed it for future pages. Essentially PBB was uploading the images fine but not adding the appropriate wiki text. Anyway, I don't think the problem is too widespread, and the naming convention is pretty easy to add manually (e.g., [5]). (The image is always taken from the first PDB entry listed.) If you notice more than a handful of these cases, please let us know and we'll investigate further... AndrewGNF (talk) 01:07, 17 March 2008 (UTC)
Image Copyright problem

Thank you for uploading Image:PBB GE PPP4R1 201594 s at tn.png. However, it currently is missing information on its copyright status. Wikipedia takes copyright very seriously. It may be deleted soon, unless we can determine the license and the source of the image. If you know this information, then you can add a copyright tag to the image description page.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks again for your cooperation. Polly (Parrot) 21:06, 2 April 2008 (UTC)

Fixed! Thanks... AndrewGNF (talk) 22:18, 2 April 2008 (UTC)
Image Copyright problem

Thank you for uploading Image:PBB GE DMTF1 203301 s at tn.png. However, it currently is missing information on its copyright status. Wikipedia takes copyright very seriously. It may be deleted soon, unless we can determine the license and the source of the image. If you know this information, then you can add a copyright tag to the image description page.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks again for your cooperation. Polly (Parrot) 21:07, 2 April 2008 (UTC)

Fixed! Thanks... AndrewGNF (talk) 22:17, 2 April 2008 (UTC)

Fault

There is a fault with SLC47A2 -wrong name at the beginning ♦Blofeld of SPECTRE♦ $1,000,000? 17:32, 9 April 2008 (UTC)

Good catch. I've fixed it by hand for now. This error was due to the recent changes by NCBI (note the 29-March-2008 last-edit date) and the subsequent stale data in our database. Definitely hope to reduce this window in the future... Thanks for the note. Cheers, AndrewGNF (talk) 17:42, 9 April 2008 (UTC)

I stumbled upon the CYP2A13 article the other day and went to wikify some words in the article and that's how I learned about this bot (which I think is great by the way). I was about to wikify some words and realized the bot would overwrite the summary with any updates. I didn't think it was necessary to make update_summary = no and make it so the article required manual inspection from then on because it wouldn't be updated. So would it be possible for the bot to automatically wikify certain words that appear in the PBB Summary after an update?

I've looked in your talk page archives and see the issue has come up before. In November 2007 you said "I think wikilinks are more valuable than incremental revisions from NCBI." Do you still think so? Are the summaries frequently updated? I noticed another user suggesting the bot check which words are currently wikilinked in an article and putting those words into a list so they will be wikified after the bot updates a summary. Do you think that's feasible?

I suppose some people would prefer humans wikify words rather than bots (so words are not overlinked). I guess the task could be performed by another editor using AutoWikiBrowser or a similar tool, although the editor would probably have to turn summary updates off. I suppose checking a wordlist may slow the bot down, but I think it would be great if the bot could automatically wikify the first instance of words like cytochrome, cholesterol, steroids, lipids, endoplasmic reticulum, nitrosamine, tobacco, etc. I appreciate your work on the bot and I will understand if you are busy with other things or don't think the idea would be workable. Thank you for your time. --Pixelface (talk) 21:19, 10 April 2008 (UTC)

Good idea! One could automatically link only the first instance of each "biological" word in the article to avoid "overlinking", and include in the vocabulary Latin names of species, names of proteins, etc. I posted such a proposal to the "bot proposal" page, but no one liked the idea. Actually, User:Banus did linking of Latin names of species when he made a semi-prep InterPro file which I used (see wikified names in article Gamma thionin for example - that was done automatically). Biophys (talk) 21:49, 10 April 2008 (UTC)
That was easy: the names of species are enclosed in a "taxonomy" tag, and you only need to convert the tag into the familiar square brackets (an easy string substitution). A more general solution requires the construction of an (article, aliases) table automatically or from user-provided data, a pretty good parser (don't grab a locution with a comma in it, for example), greedy matching (always find the longest locution corresponding to an article) and in general a good heuristic to avoid overlinking and resolve name conflicts. This isn't an impossible task, but it requires a little more work than pattern matching and string substitutions ;) The bot developer (JonSDSUGrad (talk · contribs)) said something about it here. --Banus (talk) 22:48, 10 April 2008 (UTC)
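The "first instance only, longest match wins" idea discussed above can be sketched roughly as follows. This is not the bot's code: the function name `wikify_first` and the alias table are hypothetical, and the sketch ignores the harder parts mentioned in the thread (building the table, resolving name conflicts).

```python
import re


def wikify_first(text: str, aliases: dict[str, str]) -> str:
    """Link the first occurrence of each known term; longer terms win.

    `aliases` maps a term as it appears in text to an article title.
    Text inside existing [[...]] links is left untouched.
    """
    if not aliases:
        return text
    # Longest alias first, so "endoplasmic reticulum" beats "reticulum".
    ordered = sorted(aliases, key=len, reverse=True)
    pattern = re.compile(
        r"\[\[.*?\]\]|\b(" + "|".join(re.escape(a) for a in ordered) + r")\b",
        re.IGNORECASE,
    )
    by_lower = {alias.lower(): title for alias, title in aliases.items()}
    linked = set()  # article titles already linked by this call

    def replace(match: re.Match) -> str:
        if match.group(1) is None:
            return match.group(0)  # an existing wikilink: keep as-is
        word = match.group(1)
        title = by_lower[word.lower()]
        if title in linked:
            return word  # later occurrences stay plain
        linked.add(title)
        return f"[[{word}]]" if word == title else f"[[{title}|{word}]]"

    return pattern.sub(replace, text)
```

One limitation of this sketch: a term that is already linked by hand elsewhere in the text does not count as "linked", so it could still pick up a second link.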
First, regarding the bot overwriting. Yes, I still believe that user-contributed content (even as minor as wikilinking) should generally trump any sort of automated updates from Entrez. In fact, for people who are editing the text in the PBB_Summary template in any way, feel free to strip away the surrounding template completely. Regarding automatic overwriting, I'm thinking before our next genome-scale run, I want to look for the presence of any wikilinking in the PBB_Summary, and if found, PBB would automatically turn its own update_summary flag to "no". After all, not everyone should be expected to realize that they have to actively turn that flag off. So that addresses the issue regarding overwriting human contributions...
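A rough sketch of the check described above, assuming the summary text and the current flag value have already been extracted from the page. The function name `next_update_summary_flag` and the link pattern are assumptions for illustration, not the bot's actual implementation.

```python
import re

# Matches a simple [[...]] wikilink with no nested brackets.
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")


def next_update_summary_flag(summary_text: str, current_flag: str) -> str:
    """Return "no" once the summary contains any wikilink; else keep the flag."""
    if WIKILINK.search(summary_text):
        return "no"
    return current_flag
```

The flag only ever moves toward "no", so a summary that an editor has explicitly protected is never switched back on by this check.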
Next, the issue of auto-wikilinking. It's a reasonable idea with a few limiting factors (solutions to which may be feasible with a little discussion here).
  1. First, it's not a trivial task to create the dictionary of biological terms. It's not a project that I have any special insight on for either doing it manually or in some automated way.
  2. Second, auto-wikilinking also complicates the plan above for avoiding the stomping of human edits. I think implementing auto-wikilinking would mean (practically speaking) the end of auto-summary updates (except for the case where the summary is initially blank). I just can't think of a good way to avoid human edits if we take away the assumption that all wikilinks are human-generated.
  3. Third, we have an overabundance of good ideas and an underabundance of people to implement them. Speaking of which...
It seems appropriate at this point to publicly congratulate JonSDSUGrad on the successful defense of his Master's thesis project. Unfortunately, that also means that we may be losing his focused efforts shortly. For those who have been interested, I've been generating a list of discrete projects here for the next student (still to be found), but actually I'd really like to expand this beyond a one-developer effort. If you know of someone (or are someone) who might be interested in directly contributing to PBB, please let me know. Cheers, AndrewGNF (talk) 22:58, 10 April 2008 (UTC)
Congratulations! As an example of human gene coverage, one can look at the list of human genes in Pleckstrin homology domain. This list was generated automatically from UniProt. A few extra runs by the bot would be great. Biophys (talk) 18:39, 11 April 2008 (UTC)
I've done a few searches based (loosely) on the redlinks in Pleckstrin_homology_domain#Human_proteins_containing_PH_domain and posted the results here: User_talk:ProteinBoxBot/Pleck. First column is the rank (by number of citations in Entrez Gene), second column is Entrez Gene ID, third column is number of citations, and fourth column is gene symbol. Recall that PBB was approved for doing the top 10K ranked genes for notability reasons. There are about 800 or so "tough cases" left that I'll be dealing with soon. Although all the PH domain redlinks that I checked fall below the 10k threshold, I'm always happy to do requests by protein family if you can provide the Entrez Gene ID and the correct target page. Easiest place to put it is in User:ProteinBoxBot/Requests. Cheers, AndrewGNF (talk) 21:17, 11 April 2008 (UTC)
I am sorry for arguing here, but this artificial cutoff has no justification whatsoever. As long as we can provide a couple of sources, an article belongs in WP per WP:verifiability. I looked through the most recent log (April 10). All included articles are sourced pretty well, some even have a PDB picture. I would definitely include all genes that have either an Entrez Gene abstract (at least a couple of phrases) or a PDB structure. I also do not see any reason not to include all genes with at least 3-5 references. Biophys (talk) 02:49, 12 April 2008 (UTC)
10K was a conservative target so that we'd be reasonably sure that all pages would be pretty assuredly notable. I agree with you, we can probably go lower down the list. But I'm erring on the side of treading cautiously so that no one will accuse PBB of being too rash. Once we get the 10K run completely done, then I'll propose another task at the WP:RBA and I'll look for your enthusiastic support there. In the mean time, you're welcome to post your favorite genes on the requests page or just sit tight for now... Cheers, AndrewGNF (talk) 04:56, 12 April 2008 (UTC)
Agree. I like this project a lot. It has been a great success. Thank you, JonSDSUGrad and others for doing this! Biophys (talk) 06:30, 12 April 2008 (UTC)

MCB template in discussion pages?

Hey, could the maker of this bot be so kind as to include {{Wikiproject MCB|class=Start}} into the talk pages of the new articles it creates? I went through the first 500 in the ProteinBoxBot edit history and added it to those latest ones, but after noticing how many more pages of 500 there were I realized I don't want to waste the effort doing it manually (and am ignorant on how to make bots for tasks) when it probably is much easier for you to add that extra function to your bot (and maybe make it go back through and do similar to the new pages it had created in the past with a red "discussion" page that I was unable to get through to). 216.161.88.183 (talk) 03:38, 13 April 2008 (UTC)

Yup, it's on the list of things to do. (Both adding the template to new pages and adding it to previously-created pages...) Cheers, AndrewGNF (talk) 00:52, 14 April 2008 (UTC)
I went ahead and added the template to all new pages made by ProteinBoxBot after January. January & before are still lacking the template. I won't worry about it if it's on your to-do list, seeing how many pages there are just of articles made in January. ;-) Nagelfar (talk) 02:54, 14 April 2008 (UTC)
Well, if you're doing it in some sort of automated manner, we'd certainly welcome the help doing all of the retroactive changes (leaving us to focus on the new pages). If not, no worries, we'll get around to it. cheers, AndrewGNF (talk) 02:59, 14 April 2008 (UTC)

Incorrect naming of articles

Hi. I think that the naming of articles by this bot contravenes the style guidelines. For example MAGI1 should be Membrane associated guanylate kinase, WW and PDZ domain containing 1. The same applies to all the other articles this bot created. Can it automatically move the pages to meet the guidelines? If the page already exists and is a redirect to the page that is being moved, the bot can continue the move; otherwise, human intervention would be needed. --Seans Potato Business 16:10, 20 April 2008 (UTC)

Upon further inspection, I think I was wrong. Please disregard everything I ever said or did. ----Seans Potato Business 16:28, 20 April 2008 (UTC)
(well, I don't know if I'd go that far... Anyway, had the response all typed out, and I guess that others may have the same question.) Hi there. Yes, it was a conscious choice to default the page name to the gene symbol instead of the gene description (I know there's a discussion in the archives here of the village pump discussion...) In many cases, the gene description is quite convoluted. For example, do we really want ABCB7 to be moved to ATP-binding cassette, sub-family B (MDR/TAP), member 7? ATP6V1B1 or ATPase, H+ transporting, lysosomal 56/58kDa, V1 subunit B1 (Renal tubular acidosis with deafness)? CBFA2T3 or Core-binding factor, runt domain, alpha subunit 2; translocated to, 3? Anyway, lots of really gross examples like this. Having said that, I'm open to changing the behavior if there's consensus to. AndrewGNF (talk) 16:44, 20 April 2008 (UTC)
I wonder if the bot could automatically create redirects from the alternative names, though? ----Seans Potato Business 10:34, 21 April 2008 (UTC)
Yeah, in theory it could but is there much value in putting a redirect from a horribly obscure gene/protein name? In my ideal world, the article would be at the most common gene/protein name, to which the symbol would redirect. Unfortunately I think that requires some amount of human intervention, so thus far we've just relied on knowledgeable editors stumbling on the page. But we're certainly open to revisiting this if we can come up with some hard and fast rules... AndrewGNF (talk) 04:14, 22 April 2008 (UTC)

Empty line

Refer to Template:PBB Controls. Since the template is empty and is used first thing in some articles, it is causing a lot of pages to start with an empty line, which i think makes the article look strange. I propose the use of this bot (if possible) to correct the code source of the protein articles where this happens. The solution is to start the article with the current <!-- --> comment and, in the same line, start off the template with the {{PBB Control|...}} command.

I'm adding this same comment on the Template:PBB Control talk page. ~ Jotomicron 15:53, 23 April 2008 (UTC)

Thanks for the feedback. Yes, this is a known issue, and we hope to have the problem fixed soon for both previously created pages and future pages. Cheers, AndrewGNF (talk) 17:35, 23 April 2008 (UTC)
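The fix proposed in this thread (start the page with the existing comment and open the control template on that same line) could be sketched roughly as below. This is only an illustration, not ProteinBoxBot's actual code: the function name is hypothetical, and it assumes the page begins with a single HTML comment followed by the PBB Controls template.

```python
import re

# Hypothetical helper, not part of ProteinBoxBot.
# Assumes the page starts with one HTML comment followed by {{PBB_Controls ...}}.
_LEADING = re.compile(
    r"\A(<!--(?:(?!-->).)*-->)\s*\n\s*(\{\{PBB[ _]Controls?)",
    flags=re.DOTALL,
)

def join_leading_comment_and_template(wikitext: str) -> str:
    """Drop leading blank lines and put the first comment on the template's line."""
    text = wikitext.lstrip("\n")
    # Remove the line break between the comment and the template opening,
    # so the rendered page no longer starts with an empty paragraph.
    return _LEADING.sub(r"\1\2", text, count=1)
```

Pages that do not start with a comment followed by the template are returned unchanged apart from the stripped leading newlines.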
Image Copyright problem

Hi ProteinBoxBot!
We thank you for uploading Image:PBB Protein PPARA image.jpg, but there is a problem. Your image is currently missing information on its copyright status. Wikipedia takes copyright very seriously. Unless you can help by adding a copyright tag, it may be deleted by an Administrator. If you know this information, then we urge you to add a copyright tag to the image description page. We apologize for this, but all images must conform to policy on Wikipedia.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks so much for your cooperation.
This message is from a robot. --John Bot III (talk) 20:40, 24 April 2008 (UTC)

Taken care of... Thanks Fvasconcellos! AndrewGNF (talk) 22:04, 24 April 2008 (UTC)


Bioinformatic Harvester

  • A ProteinBoxBot link to "Bioinformatic Harvester" would be nice

(http://harvester.fzk.de)

..a very nice bot by the way...thanks :-) Ivo (talk) 05:39, 15 May 2008 (UTC)

Thanks for the note. I've added it to our ideas page. Though adding additional links will likely require consensus from the community. Cheers, AndrewGNF (talk) 17:09, 25 June 2008 (UTC)

Endash

If the bot will be making more edits, it would be appreciated if it could use an endash between page numbers, rather than a hyphen. It's causing a lot of unnecessary bot edits and would hopefully be an easy fix for you to implement! Thanks a lot. Smith609 Talk 10:44, 11 June 2008 (UTC)

Thanks, I've added it to the ideas page -- something I'm sure we can get implemented before any future mass runs... Cheers, AndrewGNF (talk) 17:09, 25 June 2008 (UTC)
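For what it's worth, the hyphen-to-endash request above might be handled with something like the following. It is a minimal sketch under stated assumptions: citations are taken to carry a `pages =` parameter, and the function name is made up for illustration; it is not the bot's real implementation.

```python
import re

# Assumption: page ranges live in a "| pages = ..." template parameter.
_PAGES_FIELD = re.compile(r"(\|\s*pages\s*=\s*)([^|}\n]*)")
# A hyphen (with optional spaces) sitting between two digits.
_HYPHEN_RANGE = re.compile(r"(?<=\d)\s*-\s*(?=\d)")

def endash_page_ranges(citation: str) -> str:
    """Replace digit-hyphen-digit with an en dash, but only inside pages fields."""
    def fix(match: re.Match) -> str:
        # Keep the "| pages = " prefix, rewrite only the value.
        return match.group(1) + _HYPHEN_RANGE.sub("\u2013", match.group(2))
    return _PAGES_FIELD.sub(fix, citation)
```

Restricting the substitution to the pages field leaves hyphens in titles and identifiers untouched.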

A tag has been placed on Image:PBB GE PCDHB11 221408 x at tn.png requesting that it be speedily deleted from Wikipedia. This has been done under section I8 of the criteria for speedy deletion, because it is available as a bit-for-bit identical copy on the Wikimedia Commons under the same name, or all references to the image on Wikipedia have been updated to point to the title used at Commons.

If you think that this notice was placed here in error, you may contest the deletion by adding {{hangon}} to the top of the page that has been nominated for deletion (just below the existing speedy deletion or "db" tag), coupled with adding a note on the talk page explaining your position, but be aware that once tagged for speedy deletion, if the article meets the criterion it may be deleted without delay. Please do not remove the speedy deletion tag yourself, but don't hesitate to add information to the article that would render it more in conformance with Wikipedia's policies and guidelines. Lastly, please note that if the article does get deleted, you can contact one of these admins to request that a copy be emailed to you. ·Add§hore· Talk/Cont 18:43, 4 July 2008 (UTC)

bot is creating unnecessary empty paragraph

the bot's template is creating an unnecessary empty paragraph directly under the page title; this should be fixed to conform to other WP page styles. See Apolipoprotein E for example. --213.23.255.101 (talk) 08:12, 8 July 2008 (UTC)

Thanks for the note. Believe it or not, this current round of edits is actually removing a lot more of those empty paragraph lines than it's creating (e.g., SIM2, STIL, SGSH, randomly chosen from recent edits). Providing examples like the one above will definitely help us improve our parsing mechanism though, so feel free to let us know if you notice any others. Thanks, AndrewGNF (talk) 13:31, 9 July 2008 (UTC)

A request from a new template patroller

Hi - I was wondering if it was possible to slow your protein box bot down a little. There are only a handful of us patrolling new templates, and I for one generally look through a whole day's worth of templates in one sitting - normally about 150 templates. Your bot has made over 3000 in the last 24 hours, making it a lot harder to get to any other new templates in the new pages lists. If it's possible for you to slow it down to, say, 40-50 an hour, it would be a big help to those of us sifting out the other templates! Grutness...wha? 01:28, 9 July 2008 (UTC)

Hi, thanks for the note. Sorry to swamp your small crew. Is there a way for you to exclude bot edits, or edits from a given user account? I'd normally be happy to dial it back a bit, but we're in a bit of a rush. We've just published an article that's getting a fair bit of press, and we're hopeful that this will attract a large number of newbie scientists to edit gene pages. Our rush comes from the realization that pages like this can be pretty intimidating for a newbie to wade through, but this change makes things much more friendly. Anyway, if you can tolerate a swamped list for just a bit longer (I think we're almost done), I think it would really be a net win for WP. Thoughts? AndrewGNF (talk) 18:04, 9 July 2008 (UTC)
Well, if it's only going to be like this for another day or so it's probably ok - I was just worried this might become a regular thing. And I understand if you think there may be a flood of new recruits. Grutness...wha? 01:51, 10 July 2008 (UTC)
Nope, it won't be a regular thing. Last I checked the edits should be finished today. Thanks for bearing with us... Cheers, AndrewGNF (talk) 12:18, 10 July 2008 (UTC)
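To put the numbers in this thread side by side: 3000 templates in 24 hours is 125 per hour, while the requested 40-50 per hour works out to one edit every 72 to 90 seconds. A small sketch of that arithmetic follows; the function name is hypothetical and nothing here reflects how the bot is actually scheduled.

```python
def seconds_between_edits(edits_per_hour: float) -> float:
    """Return the pause needed between edits to stay at the given hourly rate."""
    if edits_per_hour <= 0:
        raise ValueError("edits_per_hour must be positive")
    return 3600.0 / edits_per_hour

# Observed rate from the thread: 3000 templates in 24 hours.
observed_per_hour = 3000 / 24
# Requested ceiling and floor from the thread: 40 to 50 per hour.
slowest_pause = seconds_between_edits(40)
fastest_pause = seconds_between_edits(50)
```

A bot loop could sleep for the returned number of seconds after each edit to hold the requested rate.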

Bot control template

Hi there, I edited GPI (gene) a bit, but since there is no bot control template on the page any more, I wasn't sure if these changes will remain. I added the template myself manually, which I hope is what you need to do. However, finding and adding a template is probably beyond the capabilities of newbies, so although the new arrangement is cleaner, I think we still need to include this control template by default. Tim Vickers (talk) 16:13, 9 July 2008 (UTC)

Hi Tim, The latest PBB run actually moves the control template to the bottom of the page. Do you think that's too hidden for people to find? I edited GPI (gene) to reflect that (copied your new template over the old one at the bottom), but on second thought I probably should have asked your input on whether this is clear enough beforehand. Thoughts? In practice though, we're going to put a lot of thought into things before we do an update run. I'm even contemplating removing the PBB_Summary template completely. My guess is that the summary doesn't change very much, and I think it's confusing for people who are editing the text. Plus, as I've said before, once the summary gets touched at all by a human editor, I think PBB should not overwrite that text ever. Anyway, discussion on all these issues is welcome... AndrewGNF (talk) 16:35, 9 July 2008 (UTC)

Hadn't noticed it! If you change the hidden text to "To stop automatic updates of the summary text change the update_summary field to "no" in the Template:PBB_Controls at the bottom of the page." that is a bit more explicit. Tim Vickers (talk) 17:55, 9 July 2008 (UTC)

Ahh, I've actually just removed that comment completely. If PBB doesn't find the PBB_Summary template, it won't try to re-add it in or anything, even if the PBB_Controls has update_summary=yes. You think that works okay? AndrewGNF (talk) 18:08, 9 July 2008 (UTC)

Then we'll need to add text telling people to remove everything apart from the PBB template at the top. We need to make this as newbie-proof and approachable as possible. Tim Vickers (talk) 20:37, 9 July 2008 (UTC)

Sorry, not sure what you mean here. Are we talking about GPI (gene) in particular, or the PBB gene pages in general? AndrewGNF (talk) 12:19, 10 July 2008 (UTC)
Gene pages in general. Otherwise there is the risk of people not removing the template because they don't want to "break" anything. Tim Vickers (talk) 14:40, 10 July 2008 (UTC)
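The rule described earlier in this thread (PBB does not re-add or overwrite a summary when the PBB_Summary template is missing, even if PBB_Controls has update_summary=yes) can be written down as a small decision function. This is a sketch of the stated behavior only: the function name, the dictionary representation of the control fields, and the default of "no" for a missing field are all assumptions.

```python
def should_update_summary(page_text: str, controls: dict) -> bool:
    """Decide whether a bot run may rewrite the summary on this page.

    controls is a hypothetical dict of PBB_Controls fields, for example
    {"update_summary": "yes"}. A missing field is treated as "no" here.
    """
    # The summary is only touched if its template is still on the page.
    has_summary = "{{PBB_Summary" in page_text or "{{PBB Summary" in page_text
    # The control flag must also explicitly allow updates.
    wants_update = controls.get("update_summary", "no").strip().lower() == "yes"
    return has_summary and wants_update
```

Under this rule, an editor who deletes the PBB_Summary template has opted out of summary updates without needing to touch the control template at all.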

Hello,

You have included this picture inside Category:Copyright holder released public domain images. However, the introductory statement at the top of that category says : "Please include all evidence you have that creator desires the image to be public domain on the image description page". Do you have any such statement that might be included in the image's description page ? Teofilo talk 13:38, 8 July 2008 (UTC)

Sorry, not exactly sure what you think should be added. We have a link there to the pdb, and the pdb releases their data into the public domain. Do you suggest we replace the generic www.pdb.org link with the specific license page (which I can't find at the moment, but was referenced ad nauseam in some previous discussions on pdb licensing)? AndrewGNF (talk) 13:41, 9 July 2008 (UTC)
Sorry I did not take part in these "ad nauseam" discussions. I am a mere reader of the French news speaking about your project of using Wikipedia to share knowledge on human genes. When I click on http://www.pdb.org , what strikes me is "© RCSB Protein Data Bank" at the bottom of the page. So, if they have a specific page mentioning the fact that they release all or part of their images in the public domain, it would be better to provide a link to that page (in my humble opinion), in addition to the specific page where the picture first appeared (rather than the pdb.org homepage) (is it http://www.rcsb.org/pdb/explore/images.do?structureId=1LUI ?). Another question/suggestion is: why don't you upload these pictures on commons? As explained on Help:Images and other uploaded files, uploading images on Commons helps wikipedias in all languages to share pictures. Teofilo talk 14:20, 11 July 2008 (UTC)
Link that states the data availability: http://www.pdb.org/robohelp_f/#site_navigation/citing_the_pdb.htm "Data contained in the RCSB PDB are free of all copyright restrictions and made fully and freely available for both non-commercial and commercial use...." with some more details at that link. Mary Mangan (talk) 19:52, 11 July 2008 (UTC)
Thank you. Here is a non-flash link http://www.pdb.org/robohelp_f/site_navigation/citing_the_pdb.htm . It looks OK regarding copyright, but « Users of the data should attribute the original authors of that data ». Do we know the names of the original author(s) of Image:PBB Protein ITK image.jpg ? Teofilo talk 10:01, 12 July 2008 (UTC)
In talking with the RCSB folks, I think we determined that the images themselves are created by the PDB, hence the "attribution" to www.pdb.org. You're right that a link directly to the specific PDB entry would be more precise and an even better service to our users. I've added it to our to-do list. Regarding uploading to wiki commons, yes this is something that was also brought to our attention. We can clearly do this with the expression images that we generate, but I'd need to make absolutely sure that the wiki commons license and RCSB licenses are compatible. (I think we did conclude that earlier. more context here.) Thank you for these suggestions. Cheers, AndrewGNF (talk) 16:17, 12 July 2008 (UTC)

The Ortholog section of the genes I've looked at ( such as ITK_(gene) ) list coordinates on mm8 at http://genome.ucsc.edu but the default at ucsc is mm9 ( until you visit and switch to another build ). Adding the db parameter to the url would solve this, it's &db=mm8 in the ones I've checked. PTGS2 is another example, the differences betwixt mm8 and mm9 have shifted it out of the browser window on mm9 —Preceding unsigned comment added by 198.129.91.135 (talk) 00:56, 9 July 2008 (UTC)

Ahh, great point... Thanks for the heads up, and we'll see about the best way to fix that... Cheers, AndrewGNF (talk) 13:45, 9 July 2008 (UTC)
Possibly related bug: when I was on the c3orf58 page I clicked the RefSeq link (which goes to UCSC, would have expected it to go to NCBI). Anyway, since I had been on the mouse genome earlier at UCSC, this RefSeq number searched the mouse genome and linked to the mouse browser. So that same RefSeq link could search human if you had been on human, mouse if you had been on mouse, and it even searched Lizard for that when I went over and was using lizard prior to clicking. Is this where bug reports go? Mary Mangan (talk) 20:33, 11 July 2008 (UTC)
Hi Mary, unfortunately, that older template is not one written by us. I agree that the refseq link should go to NCBI instead of UCSC, but that change should probably be discussed at Template_talk:Protein. AndrewGNF (talk) 16:21, 12 July 2008 (UTC)
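The suggestion at the top of this thread, pinning the genome build by adding a db parameter such as &db=mm8 to the UCSC link, could look roughly like this. It is a hedged sketch: the hgTracks path and the position parameter are assumptions about the UCSC browser URL, the function name is made up, and the coordinates in the usage lines are placeholders rather than real gene locations.

```python
from urllib.parse import urlencode

# Assumed browser endpoint; only the db=mm8 parameter comes from the thread.
UCSC_HGTRACKS = "http://genome.ucsc.edu/cgi-bin/hgTracks"

def ucsc_link(db: str, chrom: str, start: int, end: int) -> str:
    """Build a browser link that names the genome build explicitly."""
    query = urlencode(
        {"db": db, "position": f"{chrom}:{start}-{end}"},
        safe=":",  # keep the chrom:start-end form readable
    )
    return f"{UCSC_HGTRACKS}?{query}"

# Placeholder coordinates, for illustration only.
mouse_link = ucsc_link("mm8", "chr1", 100, 200)
```

Because the build is part of the URL, the link no longer depends on the site default or on whichever genome the reader last visited.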

I was just editing the C3orf58 page. I wanted to indicate that it may be regulated by MEF2 as the paper states. I tried to create a link to MEF2 and there wasn't one. But I realized there was a link to Mef2. So I used that.

However, Mef2 may not be human--that's a Drosophila symbol, it appears. And that page has 4 genes on it that appear to be the human ones. But they aren't MEF2, they are MEF2A, MEF2B, MEF2C, and MEF2D.

Not sure how to handle that. —Preceding unsigned comment added by Mary Mangan (talkcontribs) 20:19, 11 July 2008 (UTC)

I will admit that some of these pages are a bit schizophrenic since it is not clear which species the page refers to. I would suggest that the Mef2 article covers the family of Mef2 genes/proteins in all species in which they are expressed (including human) and that the human MEF2 isoforms are listed as examples. Does this make sense? Cheers. Boghog2 (talk) 21:09, 11 July 2008 (UTC)
But what if there are other species gene pages with the same name, but really aren't the same gene? So that's the more general class of the question: how would you distinguish? But these Mefs aren't isoforms (if isoform means splice variant, which it does for me. But apparently not according to Wikipedia.). They are on different chromosomes. There are 5 genes on that page, and 4 of them are human. —Preceding unsigned comment added by Mary Mangan (talkcontribs) 21:30, 11 July 2008 (UTC) keep forgetting to sign, sorry. Mary Mangan (talk) 21:38, 11 July 2008 (UTC)
To avoid semantic complications, we can replace the word "isoform" with "paralog". Other than human, I am not aware of any species specific gene/protein pages. In other words, to the best of my knowledge, we only have the following type of gene/protein pages:
(A) human gene/protein family (paralogs)
(B) gene/protein family (all species in which it is expressed including human) (orthologs + paralogs)
(C) human gene/protein
(D) gene/protein (all species in which it is expressed including human) (orthologs)
In my opinion, the current Mef2 article is in class B. If there were orthologous gene/protein articles, the article names would clearly need to be different (e.g., "Mef2 (drosophila)", "Mef2 (zebrafish)", "MEF2 (human)", etc.). In the Mef2 article, it would probably be better to replace the human Template:Protein boxes with a Template:Pfam box so that it is clear that the article deals with the Mef2 family (orthologs + paralogs), but unfortunately it does not appear that there is a Pfam classification specifically for Mef2. I need to think about this more. As this is a Wiki, suggestions are welcome ;-) Cheers. Boghog2 (talk) 22:23, 11 July 2008 (UTC)
Ok, I wasn't aware of those types of pages, I don't think that was clear from the Mef2 entry. I had seen some other species pages, although it does seem that there aren't many: [[6]]. Zuotin HHV_capsid_portal_protein TraA HIS3 HisB DmX_gene. Mary Mangan (talk) 23:16, 11 July 2008 (UTC)
I stand corrected on the non-human species specific pages (one more example is HIV-1 protease). An example of an (orthologs + paralogs) article where I think the scope of the article is very clear is Hsp90. I would like to do the same to the Mef2 article, but again, there does not seem to be a Mef2 specific Pfam code. The closest I could find is this. Boghog2 (talk) 23:35, 11 July 2008 (UTC)

Ooops, I made this edit thinking I had a solution, but clearly the issue is deeper than that (as discussed here). Gotta run now, but feel free to undo any of what I did... Cheers, AndrewGNF (talk) 02:11, 12 July 2008 (UTC)

Thanks Andrew. Your edit partially addressed the problem (MEF2 = human -> Mef2 = any other species; if we wanted to be pedantic, we could call the article Mef2/MEF2, but that is going way overboard in my opinion). I made some additional edits to the Mef2 article and hopefully it is now clearer that this page is about the family of Mef2 proteins (orthologs + paralogs). Cheers. Boghog2 (talk) 11:00, 12 July 2008 (UTC)
Looks like Boghog2 has taken the lead on fixing this confusing situation, and as usual, he is doing an incredible job bringing clarity to these pages... AndrewGNF (talk) 16:42, 12 July 2008 (UTC)
As I was a contributor in creating the confusion in the first place ;-) I felt it was my responsibility to help remove it. Cheers. Boghog2 (talk) 18:22, 12 July 2008 (UTC)

SVG images

Hi! I've noticed that some images uploaded by this bot, like Image:PBB GE FREM2 gnf1h07842 at tn.png or Image:PBB_GE_FREM2_gnf1h07842_at_fs.png, are in PNG format. Besides that all these images should be better uploaded into the Wikimedia Commons to be shared among all the different Wikipedias (and put into an adequate category there), I'd like to raise your attention about the following policy: Use SVG over PNG (and [7]). The SVG format is much more appropriate for this kind of images, offering more flexibility, more image quality and space efficiency, and the MediaWiki software ensures compatibility with all browsers. Thanks! —surueña 08:20, 11 July 2008 (UTC)

Great idea, thanks for the input. We're a bit bandwidth limited at the moment, but I've added it to our to-do list. Cheers, AndrewGNF (talk) 15:41, 11 July 2008 (UTC)
Wonderful! I suggest reading also other good practices for Wikipedia diagrams like replacing captions in the image with text (see also Wikipedia:Image use policy). Cheers —surueña 11:02, 14 July 2008 (UTC)
Thank you surueña, another good suggestion that has been duly noted. AndrewGNF (talk) 15:42, 14 July 2008 (UTC)

end user suggestions

Hello folks--I've been trying to make some edits to learn the ropes, and there are a few things that I thought I would add to the discussion. Feel free to edit/move/disregard/whatever.

1. It would have helped me if there was a box on the gene pages that pointed to the project goals, guidelines, templates, situations for new genes, etc. I had to sift through a bunch of different documentation to try to figure out what to do. And ask questions on the our blog and in the talk pages. Maybe there is some central location/forum (like here?), but I don’t know where it is or how to locate that. I guess I’m asking for a contributors’ forum or something?

2. The new gene creator template is helpful, but what would really help me is to have a whole mock page to simply copy/paste that had the right sections: a discussion section (and some lorem ipsum text is fine), and then some mock references all formatted right, subsections, and the box creator. I kinda tried to do that with the Map4 page, but that didn’t work out very well.

3. I have to advocate for some training…ok, that’s my job probably…but I think there are some people out there who would contribute if you could create a community around it with some support and some carrots (still not sure what the carrots are yet). Retired scientists, stay at home parents with science training, hobbyists, family members with genetic issues, in addition to the practicing scientists and students. But there needs to be some outreach and wrangling there.

Mary Mangan (talk) 17:37, 11 July 2008 (UTC)

Hi Mary, In response to your questions...
  1. Yes, I think this is the central discussion forum that you're looking for. In retrospect, we should have put this in the paper. I took the liberty of putting the link on your blog and on another blog. Also added links to Template:GNF_Protein_box. I'm reluctant to put a comment on the gene pages themselves for fear of increasing clutter, but other suggestions for places to put a link to this page are certainly welcome...
  2. Yes, the Diberri tool gives just the code for the protein infobox. The ProteinBoxBot gives a slightly fuller output (with Entrez Gene summary and references included), but we haven't yet made a similar web form to the one Diberri made. Hopefully soon... (But if there are any enthusiastic students who want a summer project here, let me know!)
  3. Ultimately, I think the best "outreach" we can do is to make the existing pages as useful as possible. If a graduate student has just done his microarray experiment, googles the gene symbol of a differentially expressed gene(s), comes to Wikipedia and finds something useful, well that's the best way of attracting new editors I think. Many scientists are interested in getting more visibility for their papers, so I think (hope) they will be motivated to add a few lines on their favorite gene of interest, if they perceive the Gene Wiki to be a useful resource. Anyway, this is also dependent on making editing as easy as possible, which is why your feedback is so valuable. (And other suggestions on avenues of evangelism are certainly welcome!)
More comments/suggestions as you think of them are appreciated! Cheers, AndrewGNF (talk) 16:39, 12 July 2008 (UTC)

New comment: Is the Gene Wiki project the same thing as the Wikipedia:WikiProject_Molecular_and_Cellular_Biology one, as the box on the Talk:MAP4 page shows now? I went over to MCB to understand that better, but it isn't clear to me if it is part of the same thing. —Preceding unsigned comment added by Mary Mangan (talkcontribs) 17:58, 12 July 2008 (UTC)

The MCB project is a loose association of editors interested in bioscience subjects on Wikipedia. This ProteinBoxBot project is supported by the MCB, but the wikiproject is a much broader organization. Tim Vickers (talk) 00:49, 14 July 2008 (UTC)

ProteinBoxBot mitochondrial bug

PBB links to UCSC for mitochondrial (MT) genes are not working. See ND6, COX2 and the location links. The display looks like you'll get something at UCSC, but you won't. The chromosome needs to be chrM. There's some other oddness about the M genes, but I can't quite figure out what that is yet. Mary Mangan (talk) 00:10, 12 July 2008 (UTC)

Hmmm, the issue of mitochondrial genes and their genome locations hadn't even come up yet. I made this change for ND6. I think that fixes it, right? (except for the mm8/mm9 issue for mouse, which we still need to tackle). If that looks good to you, then I'll try to go through and get the full list of gene pages that need this update... AndrewGNF (talk) 16:48, 12 July 2008 (UTC)
Yeah, that appears to make the syntax correct. I think there's a separate issue of the mitochondrial data at UCSC. I need to ask them about that. But I'm not available next week, so don't wait for an answer from me before moving on the others. Mary Mangan (talk) 17:14, 12 July 2008 (UTC)
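The chrM point above amounts to normalizing the chromosome label before building the UCSC link. A minimal sketch follows, assuming the source data labels the mitochondrial genome as "MT" or "M"; the function name is hypothetical and this is not the change that was actually made to the ND6 page.

```python
def ucsc_chrom_name(chrom: str) -> str:
    """Map a chromosome label such as '7', 'X' or 'MT' to UCSC-style naming."""
    name = chrom.strip()
    # Drop an existing "chr" prefix so the input can be in either style.
    if name.lower().startswith("chr"):
        name = name[3:]
    # The thread states that mitochondrial genes need "chrM" at UCSC.
    if name.upper() in {"M", "MT"}:
        return "chrM"
    return "chr" + name
```

Applying this to every location link, rather than special-casing individual genes, would cover the full list of mitochondrial gene pages in one pass.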

Questions about your project

Dear Sirs, may I ask you four questions:

  • I was informed by an article in DER SPIEGEL [1] that you are going to make information about human gene sequences available in Wikipedia. There is no information in the Press room and I did not really find any more personal information on your user pages User:JonSDSUGrad and User:AndrewGNF. Anyway, I learnt from the SPIEGEL article that there is a team of scientists around Andrew Su from the Genomics Institute of the Novartis Research Foundation. You also mention the success of your press work, and at least you provide a link to this article.[2] Could you improve the transparency?
  • My next question is: how many articles are there really to come? 23andMe is talking about 500,000+ gene parts making up a human individual. So I have some doubts that it will stop at 10,000 as stated at the moment.
  • Further, who will improve the articles and who will check the edits? The fraction of scientific editors is rather small, due to a rough climate here and the need to spend time on one's own scientific publications to advance one's career. Will a normal user notice if someone adds the attribute "makes perspiring feet" to a gene article? So why do you not use your own independent wiki for this specific use only, with a team of users about whom you know more?
  • Finally, wouldn't it be better to combine your project with Rare Diseases? For Europe, there are an estimated 17,000 rare diseases with a genetic background and between 27 and 36 million patients in the European Community, citing a source in Seltene Krankheiten.

Yours faithfully, Simplicius, Germany

We always welcome more publicity; if you have any suggestions on how to raise the visibility of the project they would be most welcome. The number of articles is limited by verifiability, since most putative genes have no assigned function and no literature associated with them. Articles on genes are treated in the same way as any other Wikipedia article, with the same safeguards against vandalism. The gene stubs already include links to OMIM if there is a genetic disease associated with a gene. Thank you for your interest and comments, Simplicius. Tim Vickers (talk) 15:35, 11 July 2008 (UTC)
Great questions Simplicius. I completely agree with TimVickers comments above but I would like to take the liberty to expand on them:
How many articles are there to come really? Most estimates of the number of functional human genes are in the range of 20,000 - 25,000 ([8]). The 500,000+ number you quoted may be the number of gene products (i.e., proteins), which result from alternative splicing or post-transcriptional modification (i.e., the proteome >> genome). In addition there are between-individual differences in genes (e.g., SNPs). The ProteinBoxBot Project is committed to seeding gene/protein pages (one article per gene) for notable human genes. Perhaps 50% of human genes have an assigned function with associated published literature. This would mean approximately 10,000 articles would be produced. To date, 9,000 articles have been created or updated to include ProteinBoxBot content.
So why do you not use an own independent Wiki for specific use only with a team of users where you know more about them? That is certainly an option, but as I see it, Wikipedia already has an active community working on these types of articles which this project can tap into. Furthermore Wikipedia has an exceedingly low barrier to entry which is likely to attract additional contributors. If an expert reads an article and sees a glaring error, since editing is easy to do, the expert is likely to jump in and fix the article. Finally I see many parallels between Wikipedia and the Open source software movement. As Eric S. Raymond put it, "Given enough eyeballs, all bugs are shallow".
Cheers. Boghog2 (talk) 20:50, 11 July 2008 (UTC)
I don't think I have anything else to add... Simplicius, let us know if there's any other question we can answer. AndrewGNF (talk) 16:23, 12 July 2008 (UTC)
Thank you for your kind answers.
About visibility: A portal might help (please compare Portal:Biology or more specific portals). There you could explain who are we? what is our purpose? how do we work? how can you help? You could add useful tools like last changes to review changes. When doing edits you could refer to the portal.
About the 500,000 at 23andMe: yes, it refers to the number of SNPs. So I understand: there are ca 10,000 genes with a known function and there might be another 10,000 up to 15,000 genes that might be better known in future.
About the safeguards against vandalism: An important one is the team of authors per article who understand the matter. Another is links to the articles, which frequently lead readers to cross-read. Normally we try to avoid orphaned articles such as RINT1 / What links here?
Concerning rare diseases: you refer to an external database named OMIM. If the presumed advantages of Wikipedia apply here, would you consider it too far-fetched to generate articles describing rare diseases with another bot (and to set links from the disease to the genes and vice versa)? Simplicius (talk) 00:16, 14 July 2008 (UTC)
Hi Simplicius, a few thoughts from my perspective...
First, a portal is an interesting idea. To be honest, I was only vaguely aware that WP portals exist, and it appears that WP:MCB is more of an active community. Agreed, in retrospect, I wish we'd included a link or two in the paper to direct people to forums for questions. I've been adding them to blog comments though -- this page for discussion, and User:ProteinBoxBot for project overview. On that second page is a link to all pages with PBB content.
Second, I don't think orphaned articles are necessarily vandalism targets. If I'm a vandal that wants to spraypaint my work for all to see, then I probably won't choose to make my mark on the backside of a storage shed in Kansas (so to speak). Of course, the advantage of *having* all these stubs there sitting dormant is that we can't predict which genes the next user will be a world-expert in, so better to have all (well, many) genes already there. I think we all recognize that the barrier to creating an article is much higher than editing an existing one.
I'm not sure I understand your third point about rare diseases. If you're suggesting that a bot should systematically create a page for all known diseases, I think that's a great idea. It is dependent on being able to harvest enough content from other sources to create a credible stub, but in principle I'd support it. Unfortunately, it's not our area of expertise and we've got our hands full with genes already, so this isn't a project we (the PBB folks) will take on in the short term... AndrewGNF (talk) 15:55, 14 July 2008 (UTC)
Ok, so why not start a portal, for example Portal:Human Genes or something like this?
About the generated articles: yes, a small correct start is better than an empty box. Maybe a portal would help to increase the popularity. Also some overview articles might help to understand the structures of chromosomes. Moreover, internet search engines would find the single articles more easily by following overview articles. Also, a place to communicate would not be bad.
If your experiences are positive it might be encouraging for another group to make a systematic start for rare diseases. There could be a link from the gene to a known disease. From the disease there could be a link to the gene article. Your knowledge could help with generating the articles and organizing such a project. By the way, some institutes have lists and data (USA: Office of Rare Diseases Web, Germany: Universität Rostock Web). 25 to 30 million people in the United States have a rare disease.
So if you would like to do a portal I would like to help you as well as I can. Five weeks ago I helped to create a portal with a main page including news and monitoring tool, an editorial page with member list and editing tips and a communication page. If you like I can make a similar draft. -- Simplicius (talk) 17:40, 16 July 2008 (UTC)
Sure, in principle I think a portal page would be great. One common question I get is "How do I get to the gene wiki?", and I usually just point them to User:ProteinBoxBot. But something a bit more sophisticated would probably be good. What do you think about Portal:Gene Wiki (which I just created as a redirect to the user page)? If you (or anyone else for that matter) wanted to help put together a landing page where a completely naive biologist could get acquainted with our effort, I think that would be a real service.
Rare diseases is a good idea, and those links are a good starting point. One question though would be, how much of those data are downloadable in a simple format (or does one have to resort to screen-scraping)? In any case, I've got my hands full with the Gene Wiki, so if someone else wants to take that idea and run with it (including you, Simplicius), go for it! Cheers, AndrewGNF (talk) 16:42, 17 July 2008 (UTC)
A portal in principle sounds like a good idea, but it would take a significant amount of work to include enough detail to make the project more attractive to researchers. Concerning your proposal:
Some overview articles might help to understand the structures of chromosomes.
That is an interesting suggestion since it touches on the scope of the project. Understanding the structure of chromosomes is interesting in its own right but encompasses far more than functional genes. To understand the structure of chromosomes, one must also account for non-coding "junk DNA", which is clearly outside the scope of the Gene Wiki project. Nevertheless, the scope of Gene Wiki articles encompasses more than just genes. The vast majority of genes encode proteins. Therefore, to understand the function of genes, one must also understand the function of the proteins encoded by those genes. For the purposes of the Gene Wiki project, understanding the structure and function of the corresponding proteins is at least as important as, if not more important than, the structure of the chromosomes. To this end, a number of overview articles which cover protein families were created before the Gene Wiki project started. Furthermore, as a consequence of the Gene Wiki project, many of these existing articles were expanded or new ones created to provide context for Gene Wiki articles. Take the example of the Gene Wiki article estrogen receptor alpha. This is the most specific article in a chain of articles starting with transcription factor -> zinc finger -> nuclear receptor -> steroid receptor -> estrogen receptor -> estrogen receptor alpha. Similar hierarchies of articles exist for other gene/protein families such as enzymes, ion channels, or G protein-coupled receptors, to name a few. In summary, the Gene Wiki project was never intended to be autonomous, but rather integrated with existing Wikipedia articles and projects. Cheers. Boghog2 (talk) 17:11, 17 July 2008 (UTC)
Well put... That is in fact why we wanted to do the Gene Wiki project here at Wikipedia instead of in a standalone site -- exactly to take advantage of all the synergies possible at many levels of biology. In any case, I'd still support it though if someone wanted to create Portal:Gene Wiki, as well as if someone wanted to create Portal:Protein families. Or maybe these should be integrated in WP:MCB subpages, or daughter wikiprojects. But I think it would be good to have a landing point for newbie editors to quickly and easily get acquainted with a small universe that they're interested in (without getting overwhelmed by WP as a whole). If they stick around, they'll undoubtedly find themselves in multiple overlapping universes, but at least that introduction is slow and natural and non-overwhelming. AndrewGNF (talk) 17:57, 17 July 2008 (UTC)
Ok, I will make a start as soon as possible. Simplicius (talk) 12:26, 22 July 2008 (UTC)
I began. Simplicius (talk) 16:52, 30 July 2008 (UTC)
The Portal:Gene Wiki framework that you have constructed is an excellent start! I have added some content to a few of the portal components but there is a lot of work remaining. Thanks Simplicius for your initiative in creating this portal. Boghog2 (talk) 11:12, 31 July 2008 (UTC)
Ok, fine. Maybe we can go on discussing on Portal:Gene_Wiki/Discussion. Simplicius (talk) 21:33, 31 July 2008 (UTC)
This is fantastic, thanks guys... I'll take my followup over there... AndrewGNF (talk) 01:00, 1 August 2008 (UTC)

It would be useful to provide editors with an easy way to edit the PBB template page, for example to replace the protein structure graphic, add missing EC numbers, etc. This can be accomplished rather inelegantly using the following syntax:

{{PBB|geneid=2099| align=left | <small class="editlink noprint plainlinksneverexpand">[{{SERVER}}{{localurl:Template:PBB/2099|action=edit}} edit ]</small>}}

This places a small hyperlink labeled "edit" above the PBB box in the article. This solution is not ideal since (1) the syntax is rather messy and detracts from the simplification of using the PBB template and (2) the hyperlink is placed above rather than in the template. A much better solution would be to add this functionality directly in the template rather than as a parameter to the template. Boghog2 (talk) 17:48, 24 July 2008 (UTC)

Hi Boghog, do you not see an edit link above the right-side infobox? These changes were meant to create what you describe above (I think)... Let me know what behavior you're seeing. AndrewGNF (talk) 18:21, 24 July 2008 (UTC)
Good grief! I must be going blind ;-) You are right, the link is already there. I guess I never noticed it before. Sorry about that. Cheers. Boghog2 (talk) 18:33, 24 July 2008 (UTC)
No problem at all... (Keep up the great work BTW! You are an amazingly prolific editor...) Incidentally, if anyone feels like helping out, I recently added this link to User:ProteinBoxBot, which shows all edits to pages which contain {{GNF Protein box}}. In short, it's a watchlist for the "Gene Wiki". I think the more eyeballs we have periodically scanning the list (for vandalism, newbie editors, etc.), the better. AndrewGNF (talk) 19:51, 24 July 2008 (UTC)
Thanks for the heads up on how to view recent changes to Gene Wiki pages. My own watch list covers a small fraction of these pages. It will be interesting to monitor the progress of the project and of course I will jump in and help where I can. Boghog2 (talk) 11:34, 25 July 2008 (UTC)

Do PBB Summary updates remove changes made to text within them?

E.g., will the wikilinks in this edit: http://en.wiki.x.io/w/index.php?title=MAGEA3&diff=227790302&oldid=224566405 be reverted when the summary is updated? Thanks, --Rajah (talk) 07:50, 25 July 2008 (UTC)

Hi Rajah. Thanks for your additions to the MAGEA3 article. As explained in {{PBB Controls}}, to make sure that your changes to the PBB summary section are not overwritten, you need to change the setting of the "update_summary" parameter from yes to no. The PBB control section is normally at the very bottom of the text in the article's edit window. Cheers. Boghog2 (talk) 11:22, 25 July 2008 (UTC)
..., and in thinking more about how these gene pages will progress, I think it's also safe to completely remove the PBB_Summary template whenever you make a change (even something as trivial as wikilinking or fixing a typo). We've always said that any human edits trump a bot edit, and long term, it was always the hope that these summaries from Entrez Gene would eventually just dissolve away into the rest of the text. The template is good, however, for maintaining the thousands of pages that have yet to be substantially touched, especially those that do not yet have a summary from Entrez Gene (but may get one in the future). Cheers, AndrewGNF (talk) 14:57, 25 July 2008 (UTC)
Ok, yeah, my concern was that more information would be added in the future. So, if I set it on no instead of yes, potential information in the future wouldn't get added. Thanks for the clarification. --Rajah (talk) 05:43, 26 July 2008 (UTC)
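For anyone who would rather script the switch described in this thread, here is a minimal sketch in Python. It assumes the control block is a single template call with `name = value` pairs; the `SAMPLE` string and the function name `disable_summary_updates` are illustrative assumptions, not the actual PBB code, and the real layout of {{PBB Controls}} may differ.

```python
import re

# Assumed layout of the control block at the bottom of an article;
# only the "update_summary" parameter name is taken from the discussion.
SAMPLE = (
    "{{PBB_Controls | update_page = yes | update_protein_box = yes "
    "| update_summary = yes | update_citations = yes}}"
)

# Match "update_summary = yes" with flexible spacing, capturing the prefix.
_SUMMARY_FLAG = re.compile(r"(update_summary\s*=\s*)yes\b", re.IGNORECASE)


def disable_summary_updates(wikitext: str) -> str:
    """Return wikitext with update_summary switched from yes to no."""
    return _SUMMARY_FLAG.sub(r"\g<1>no", wikitext)


if __name__ == "__main__":
    print(disable_summary_updates(SAMPLE))
```

Only the summary flag is touched, so the other update switches keep whatever value they had.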

The protein boxes

I am kind of wondering about the utility of the protein boxes. Here are some basic questions I have. I noticed that on the HLA-DQA1 and HLA-DQB1 page it showed (I changed the name) a picture of HLA-DQ and called it HLA-DQA1; the same was true for HLA-DQB1. On the HLA-DR page it has an HLA-DRA gene and a picture of an HLA-DR molecule. It also erroneously claims there are no mouse homologs; the homolog for HLA-DQ in mice is MHC-IA, so that bit of information is also wrong. The repository for HLA sequence information is IMGT/EBI, not PubMed, so it is much better to have a protein box that uses IMGT/EBI as a source of information. The final issue is that it displays a list of images, but some of these images are minor variations used as a part of single studies. Even for someone who has created pages on all the HLA-A and HLA-B, I think that so much information is a waste; it's like having a full author list when author1, author2, author3, et al. will do. In addition, the box contains three graphs with essentially the same information, whereas it has been known since 1970 and is listed in the DR, DQ and DP pages that these antigens are found on lymphoid tissues. The protein box extract placed as the lead paragraph is not wikified and cannot be wikified. I can go on. What is the benefit of doing this if it simply throws trivia onto pages? Also, if a person finds an error in the box and corrects it, the next update will override the correction, probably with the same incorrect information. That process overrides expert review of materials. PB666 yap 05:21, 27 July 2008 (UTC)

Thanks for improving the DQA1 and DQB1 pages. If I can make a recommendation, there should be a special protein box and bot for dealing with heterodimers, trimers and the like. If this can be done, I will convert the old manual boxes I have used to the new style. The information in the protein box on the HLA-DR page is not correct for HLA-DR; it only contains information on the alpha chain, but DR is composed of a beta and an alpha chain. I am inclined to revert it to its old protein box since I don't want to tamper with the new template. I modified the old protein template to deal with the issue of heterodimers. For the time being I expanded HLA-DRA from a redirect page and placed the DRA page. I have replaced the illustration on the HLA-DR page, and moved the illustration of HLA-DR1 to the HLA-DR1 page. There is one thing that can be done and that I wanted to do but did not have the time for. Some of these DR pdb images are of different isoforms; if we could have one isoform pdb mentioned on the HLA-DR page, and one isoform pdb on its appropriate HLA-DRx page, then that would take some of the trivial information off the protein box. These long protein boxes also interfere with other picture placements on the pages, so reducing trivial information in the boxes is a good thing. PB666 yap 17:04, 27 July 2008 (UTC)
[I should point out that HLA-DR is not 1 heterodimer but 4 heterodimers, αβ1, αβ3, αβ4, αβ5, with the beta chains encoded by 4 separate genes: HLA-DRB1, HLA-DRB3, HLA-DRB4, HLA-DRB5. It may be best to manually style the protein box for HLA-DR, and have separate boxes for each of the above-linked pages. And why is HLA-DRB4 called HLA-DRB4 (gene)?] PB666 yap 17:04, 27 July 2008 (UTC)
Hi. Thanks for your contributions to the HLA-DQA1 and HLA-DQB1 articles. My edits to these articles were done before I read your questions above. The PBB summary templates may be a little confusing and/or intimidating initially, but they are in fact designed to be edited. As mentioned above, if you do make changes to this section, please be sure to change the value of the "update_summary" parameter from "yes" to "no" so that your edits are not overwritten by automatic bot updates. Concerning heterodimeric proteins, may I suggest that we leave the existing gene/protein specific articles in place and create a new article for this gene family and include a section discussing the composition and properties of these heterodimers. This gene family page could contain the old-style {{Protein}} boxes (which are more compact than the PBB template) for each of the four genes. Cheers. Boghog2 (talk) 18:51, 27 July 2008 (UTC)
Per the discussion above, I went ahead and edited the HLA-DR article to replace the single {{PBB}} template that only contained information about the HLA-DRA gene with five more compact {{Protein}} templates which include information about the entire gene family (HLA-DRA + HLA-DRB1 + HLA-DRB3 + HLA-DRB4 + HLA-DRB5). Is this OK? Boghog2 (talk) 20:12, 27 July 2008 (UTC)
I went to look at previous versions of the HLA-DR article and now I can see where the problem originated. You already had a nice compact protein box that contained information about all five gene/proteins that at one point was replaced with a {{PBB}} template containing information only about the HLA-DRA gene. This was a mistake. The HLA-DR article should have been left as is and the {{PBB}} template should have been added to the HLA-DRA article. I hope this clarifies things (at least it clarifies the problem for me ;-). Cheers. Boghog2 (talk) 20:54, 27 July 2008 (UTC)

Thanks guys for sorting this all out. The pages look great. I gotta admit, I'm a bit glad I was late to this party. Longstanding confusion from my days as an undergrad leaves antigen presentation (and adaptive immunity, in general) as one of my carefully nurtured areas of ignorance.  ;) AndrewGNF (talk) 17:23, 28 July 2008 (UTC)

Image Copyright problem

Thank you for uploading Image:PBB Protein NFAT5 image.jpg. However, it currently is missing information on its copyright status. Wikipedia takes copyright very seriously. It may be deleted soon, unless we can determine the license and the source of the image. If you know this information, then you can add a copyright tag to the image description page.

If you have uploaded other files, consider checking that you have specified their license and tagged them, too. You can find a list of files you have uploaded by following this link.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks again for your cooperation. Sdrtirs (talk) 16:13, 14 August 2008 (UTC)

Done. Tim Vickers (talk) 16:40, 14 August 2008 (UTC)
Image Copyright problem

Thank you for uploading Image:PBB GE TLK2 212986 s at fs.png. However, it currently is missing information on its copyright status. Wikipedia takes copyright very seriously. It may be deleted soon, unless we can determine the license and the source of the image. If you know this information, then you can add a copyright tag to the image description page.

If you have uploaded other files, consider checking that you have specified their license and tagged them, too. You can find a list of files you have uploaded by following this link.

If you have any questions, please feel free to ask them at the media copyright questions page. Thanks again for your cooperation. Sdrtirs (talk) 16:15, 14 August 2008 (UTC)

Done. Tim Vickers (talk) 16:41, 14 August 2008 (UTC)

Is it possible to copy the contents into another language Wiki?

Hi! Could I copy the contents of the protein box, for example, from the NOTCH1 when I create a page in the Russian Wiki? If yes, how do I do this? Thanx. --CopperKettle (talk) 04:37, 19 August 2008 (UTC)

Hi CopperKettle, sure, no objection from our end. As to the best technical way to do it, I'm not sure. I suppose you could just download the source from the protein box template (for NOTCH1, this lives at Template:PBB/4851). If you were going to do it programmatically, you might even use the special:export function (e.g., [9]). Hope that helps... AndrewGNF (talk) 15:40, 19 August 2008 (UTC)
... and by the way, you may or may not have noticed, every time someone asks me why we did the Gene Wiki project, I hold up Reelin as an example of what each page could become. (e.g., [10], [11], [12]) Anyway, seeing as how you played a big part in developing that page (and completely prior to our effort), I hope you take some pride and satisfaction in that... Cheers, AndrewGNF (talk) 15:47, 19 August 2008 (UTC)
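For the programmatic route mentioned above, a small sketch of how one might build the Special:Export address for a protein box template. The helper name `export_url` is hypothetical, and the base address assumes the English Wikipedia; fetching and importing the XML into another wiki is left out.

```python
from urllib.parse import quote

# Assumed base address (English Wikipedia); other wikis use their own host.
EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export/"


def export_url(page_title: str) -> str:
    """Build a Special:Export URL for a page title such as Template:PBB/4851."""
    # MediaWiki titles use underscores for spaces; keep ':' and '/' readable.
    return EXPORT_BASE + quote(page_title.replace(" ", "_"), safe=":/")


if __name__ == "__main__":
    # Template:PBB/4851 is the NOTCH1 protein box named in the reply above.
    print(export_url("Template:PBB/4851"))
```

The exported XML contains the raw template source, which could then be adapted by hand for the target wiki.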

Duplicate images

Your bot has uploaded hundreds of duplicate images. One example:

Was this intentional? And if so, what can we do to eliminate these duplicates? Image redirects now work. So perhaps for a case like this, the image could be re-uploaded as Image:PBB_Protein_PUM_image.jpg and then delete the files at PUM1 and PUM2 and make them redirects? Let me know your thoughts. One thing that's clear is that having all of these duplicates is not a good situation. --MZMcBride (talk) 02:28, 15 September 2008 (UTC)

Hmm, good point, hadn't noticed that... Agreed, we should find a better solution for this. The long-term solution I think is not to index by the gene symbol (PUM1 and PUM2 in your example above), but rather by the PDB code. For example, I think both gene pages should reference Image:PBB_Protein_1ib2.jpg, where 1ib2 is the official PDB structure ID. This fix would actually dovetail nicely with another proposal to systematically upload thumbnail images for all PDB structures to wikicommons.
Unfortunately, we're having a bit of a lag recruiting the next student to take over PBB development. I'm hoping that's due to the summer lull and that we'll have more luck soon. In fact, I'm scheduled to give a pitch to a bunch of students this Friday to see if I can hook someone. But, of course, a new student will have a ramp-up time, which is why this isn't a short-term solution. The question is, do we need to have a short-term solution, or is it sufficient to know that we have a game plan toward a long-term fix? AndrewGNF (talk) 17:14, 15 September 2008 (UTC)
Wikipedia doesn't have a deadline for completion, so a functional but non-optimal solution is fine for now. Indeed, since there is no shortage of space on our servers, duplicate images are not all that serious a problem. Tim Vickers (talk) 17:17, 15 September 2008 (UTC)
Okay great... Although realizing MZM has some scripting skills himself, if you want to implement a short-term fix for your own peace of mind and sense of order, feel free. We'll be happy to adapt PBB later so that it won't break based on your changes. Cheers, AndrewGNF (talk) 18:54, 15 September 2008 (UTC)
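As a rough illustration of the long-term idea in this thread (naming images by PDB code instead of gene symbol), the sketch below groups gene symbols that share a structure under one file name. The function name and the input mapping are assumptions for illustration; only the `PBB_Protein_<pdb>.jpg` pattern and the 1ib2 example come from the discussion.

```python
from collections import defaultdict


def shared_image_names(gene_to_pdb: dict[str, str]) -> dict[str, list[str]]:
    """Map one PDB-keyed file name to every gene symbol that would use it."""
    by_image = defaultdict(list)
    for gene, pdb_id in sorted(gene_to_pdb.items()):
        # One file per structure, so genes sharing a structure share a file.
        by_image[f"PBB_Protein_{pdb_id.lower()}.jpg"].append(gene)
    return dict(by_image)


if __name__ == "__main__":
    print(shared_image_names({"PUM1": "1ib2", "PUM2": "1ib2"}))
```

Any file name that maps to more than one gene marks a pair of uploads that could be collapsed into a single image plus redirects.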

Protein vs. Gene

Obviously there is a difference between proteins and genes. Why are there hundreds of pages (e.g. GPR3) that incorrectly state that so-and-so protein is a "human gene"? Perhaps the protein and the gene that codes for it have similar names, but that is no reason to conflate the two. It would be absurd to go through and correct all the articles with this error. Perhaps a bot could do it? Fuzzform (talk) 00:03, 2 October 2008 (UTC)

I see this issue was discussed in the past. Unfortunately, the statement that "the "CD38 molecule" is a special case" is utterly false. There are hundreds, probably even thousands, of articles that incorrectly state that a certain protein is in fact a gene. This is not "minutae". This is not unimportant. This is a ubiquitous trait of pages created with the ProteinBoxBot. It's rather strange that a bot with such a name would mostly create pages about "genes". If the majority of the articles the bot creates are factually erroneous, perhaps it would be best for the community not to have such a bot. I'm not entirely against it, but from what I've seen, it needs work at the very least. (Another consideration is the "jargon issue", brought up in the past. Not only in regards to biology jargon, but the technical jargon that the bot inserts in articles as a marker/controller.) Fuzzform (talk) 00:13, 2 October 2008 (UTC)
I don't see the problem with GPR3, Entrez lists this as a human gene (link), is the problem that this is a gene in other organisms as well? Tim Vickers (talk) 02:23, 2 October 2008 (UTC)
I couldn't resist expanding the GPR3 article. How does it look now? Obviously, as noted before, these pages are about both the gene and the protein. The gene is named after the protein, so as Tim points out above, it is not incorrect to refer to GPR3 as a gene, although to be completely correct GPR3 (note italics) refers to the gene and GPR3 (non-italicized) to the protein. The first sentence of the lead, which implies that these articles are about genes, is at worst incomplete but certainly not incorrect. The discussion about the "CD38 molecule" was really about two issues. The first was the strange yet official name given to the CD38 gene/protein and the second was whether the article was about the gene or the protein. The former, as Andrew points out, is a special case; the latter, as you point out, applies to all the articles. I hope this clarifies things. Cheers. Boghog2 (talk) 19:07, 2 October 2008 (UTC)
I don't think there is a case for having articles on both a gene and its products and I'm personally happy with having an article that discusses both - since most of the material will be common to both. Tim Vickers (talk) 20:05, 2 October 2008 (UTC)
I absolutely agree that in the vast majority of cases, we should keep the gene and corresponding protein(s) together on one page. Boghog2 (talk) 21:14, 2 October 2008 (UTC)
Agreed Fuzzform, it's poor wording. If we had to do it over again, we should probably do something like "G protein-coupled receptor 3 is a protein which in humans is encoded by the GPR3 gene". But, for all the exhaustive oversight (by WP:BAG, WP:MCB, etc.) of the bot prior to its mass runs, that one just slipped under the radar. Anyway, as Tim and Boghog have pointed out, the consensus was absolutely to keep information on the genes and their corresponding proteins on the same page, so I don't think we have a huge problem here. And, as always, anyone who feels like helping out fixing this issue (either one-by-one manually or by writing a bot), please feel free... Finally, as you've undoubtedly noticed, PBB itself is on indefinite hiatus while we search for another willing student. Cheers, AndrewGNF (talk) 23:47, 2 October 2008 (UTC)
I wasn't arguing for separate articles for gene/protein information, I just think that a better (standard) opening sentence is in order for these articles. It is unfair to the vast majority of readers for us to simply assume that it is self-evident that an article is about both a gene and a protein. This needs to be said explicitly, not merely implied. Also, in regards to the gene qualifier "human", we can't assume that the average person knows that, for example, "GPI" refers to the human gene, whereas "Gpi" refers to the gene in other species. This is esoteric information. Even if stating such information in every article seems "excessive and overly pedantic", it is preferable to having articles that are incomprehensible to the majority of readers. A very basic rule when writing non-fiction is to never assume that the reader knows what you know - if they did, why would they be reading at all? Which leads me to another point... these gene/protein articles do not (or at least, should not) constitute a reference work for experts in the field. This seems to be where the "self-evident information" fallacy comes from. Anyway, my point is this: Never assume that information is self evident; always state it explicitly. Fuzzform (talk) 19:03, 6 October 2008 (UTC)
... ooh, and just noticed on your userpage that you're a beginning python programmer. Interested in getting your feet wet with a little bot programming? Nothing like real-world experience for learning to program... AndrewGNF (talk) 23:50, 2 October 2008 (UTC)
Sure, I'm always interested in learning something new. Just point me in the right direction. Fuzzform (talk) 19:03, 6 October 2008 (UTC)

Sorry, traveling this week, but will reply as soon as I get a chance... Just didn't want anyone to think they were being ignored (especially when they come offering to participate!) AndrewGNF (talk) 06:44, 9 October 2008 (UTC)

Fuzzform, you still interested in doing some bot programming for the Gene Wiki project? If so, I'd suggest two possibilities. First, we could take the topic that you've highlighted above -- the poorly worded first sentence. I could give you the index of pages in the Gene Wiki, you could programmatically see if that ugly first line exists, and if so, we could change it to something better. Of course, we'd want to agree on what it should be changed to here, but I can't see that being a big obstacle. The second idea is one that I've been very excited about but have yet to find someone help achieve it. I'd like to create a master chart that tracks Gene Wiki growth and progress (what Edward Tufte calls a "supergraphic"). Basically, it involves going through the history of all Gene Wiki pages and tracking each page by time and size. Phase II would be converting all that data into an SVG graphic. Anyway, either of these projects I bet would be a good way to hone your python skills. Let me know if either of those strike your fancy (or if anyone else lurking is interested...) Cheers, AndrewGNF (talk) 00:18, 12 October 2008 (UTC)
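A possible starting point for the first project, sketched in Python. The pattern for the old opening line is an assumption about what the bot wrote (the thread does not quote it), and the replacement follows the wording AndrewGNF floated earlier in this section; the function name `reword_lead` is hypothetical. Names containing an apostrophe would not match this simple pattern.

```python
import re

# Assumed form of the bot-generated lead sentence; adjust once the real
# wording has been confirmed against a few Gene Wiki pages.
OLD_LEAD = re.compile(
    r"^'''(?P<name>[^']+)''', also known as '''(?P<symbol>[^']+)''', "
    r"is a human \[\[gene\]\]\.",
    re.MULTILINE,
)


def reword_lead(wikitext: str) -> str:
    """Rewrite the first matching lead sentence; leave other text untouched."""

    def repl(match: re.Match) -> str:
        name = match.group("name")
        symbol = match.group("symbol")
        return (
            f"'''{name}''' is a [[protein]] which in humans "
            f"is encoded by the ''{symbol}'' [[gene]]."
        )

    return OLD_LEAD.sub(repl, wikitext, count=1)
```

Pages whose lead has already been edited by a human will not match and are returned unchanged, in keeping with the rule that human edits trump bot edits.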

Uploading to Commons

Hello AndrewGNF. Thank you for uploading the gene expression pattern diagrams. In future, could you please upload the images to Wikimedia Commons? Then other wikis will be able to use the images, too. Thanks a lot! Regards, --Bcr-abl (talk) 10:22, 7 October 2008 (UTC)

Thanks for the feedback. Yes, it's on our to-do list to get all PD images (expression and PDB) over to wikicommons. Unfortunately, still hunting around for a willing student... (As noted above, any coders who want to take on this challenge, please let me know. In return, you get some good experience, and if your contributions are reasonably substantial, co-authorship on the next gene wiki paper...) AndrewGNF (talk) 06:46, 9 October 2008 (UTC)

Alternate names

Hi Andrew, is there any way the bot could check for alternate names for a gene/protein before it goes ahead and creates an article? I've been trying to link up all the tyrosine kinase articles. I keep coming across articles that have been created by the bot where an article already exists under the protein's alternate name; see NTRK3 and TrkC. I'll leave these as they are for now as an example. K.murphy (talk) 13:54, 10 November 2008 (UTC)

Hi there... Yes, PBB did try to do some (very) basic synonym checking, looking for articles at the official symbol and all official aliases. In the case above, it looks like PBB checked TRKC, but not TrkC. We originally started with the arrangement that pretty much every page would require a human review to figure out where PBB content should be best placed. However, things were progressing too slowly and some in the MCB community were getting impatient. So, the consensus we reached with WP:MCB was what we implemented -- an imperfect solution, but considering that we handled ~9000 genes, I hope not too bad... Cheers, AndrewGNF (talk) 17:14, 10 November 2008 (UTC)
I'll just carry on merging and redirecting duplicate synonym articles as I find them. Of the 9000 genes you handled, were they just random or for specific protein families? Sorry if this has already been covered in the archives somewhere. K.murphy (talk) 21:36, 10 November 2008 (UTC)
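The TRKC versus TrkC miss suggests a case-insensitive comparison. A minimal sketch, assuming the bot already has a list of existing article titles in hand (the function name and inputs are hypothetical, not the PBB code):

```python
from typing import Iterable, Optional


def find_existing_article(
    candidates: Iterable[str], existing_titles: Iterable[str]
) -> Optional[str]:
    """Return the first existing title matching any candidate, ignoring case."""
    # Index existing titles by their case-folded form for quick lookup.
    by_folded = {title.casefold(): title for title in existing_titles}
    for name in candidates:
        hit = by_folded.get(name.casefold())
        if hit is not None:
            return hit
    return None


if __name__ == "__main__":
    # Official symbol plus alias, checked against titles that already exist.
    print(find_existing_article(["NTRK3", "TRKC"], ["TrkC", "Reelin"]))
```

A match would still need human review, since two unrelated topics can differ only by capitalization.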
Thanks, appreciate your efforts on the Gene Wiki articles. There were a couple of protein families that we did en masse (GPCRs and transcription factors?), but mostly they were taken from what my colleague likes to refer to as the "sexy list". Basically, for each gene, count the number of linked Pubmed entries in Entrez Gene, sort descending, and then take the top 9000 or so. We figured these have the best chance at notability. (Still trying to find someone to write a simple web-wrapper around the PBB code so that users can get the formatted data on demand...) Cheers, AndrewGNF (talk) 22:07, 10 November 2008 (UTC)
Could you not sort the "sexy list" into protein family categories? Then people can set about linking articles in the same families.
I made a start in categorizing Wikipedia gene/protein articles here. As Andrew has already alluded to, the ligand-gated ion channel, voltage-gated ion channel, solute carrier family, transcription factor, and G protein-coupled receptor families (as well as additional families that I am less familiar with) have already been linked together with navboxes and category templates. Cheers. Boghog2 (talk) 10:16, 11 November 2008 (UTC)
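The ranking behind the "sexy list" described earlier in this thread (count linked PubMed entries per gene, sort descending, take the top N) can be sketched as follows. The input mapping, the toy PubMed IDs, and the function name are assumptions for illustration, not the actual Entrez Gene export format.

```python
def rank_by_citations(pubmed_links: dict[str, set[str]], top_n: int) -> list[str]:
    """Return the top_n gene symbols ordered by number of linked PubMed IDs."""
    # Sort by descending citation count; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(pubmed_links, key=lambda gene: (-len(pubmed_links[gene]), gene))
    return ranked[:top_n]


if __name__ == "__main__":
    toy = {
        "GENE_A": {"pmid1", "pmid2", "pmid3"},
        "GENE_B": {"pmid4"},
        "GENE_C": {"pmid5", "pmid6"},
    }
    print(rank_by_citations(toy, 2))
```

Grouping the ranked symbols by protein family, as suggested above, would need an additional gene-to-family mapping from another source.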

Linking terms in protein summary?

If a protein summary contains a technical term that should be linked to its Wikipedia article but isn't, what is the correct way to proceed? Please document this at the relevant template pages. Thanks, AxelBoldt (talk) 00:39, 31 December 2008 (UTC)

Thanks for your contributions to Gene Wiki articles. In order to make sure that your edits are not overwritten in a future automatic update, the only thing you need to do is turn off the PBB summary update parameter as for example in this edit. This has been documented on the Portal:Gene_Wiki page (General FAQ, item #2). Also, if you turn off the automatic update parameter, feel free to remove the template itself (the text "{{PBB_Summary | section_title = | summary_text = " and "}}" that surrounds the summary) as for example in this edit, since it is no longer needed. Cheers. Boghog2 (talk) 09:45, 31 December 2008 (UTC)
Problem is, I don't feel competent to make the decision to switch off automatic updating, just because I have found a little link that should be made. I don't know what the benefits of automatic updating are. Further the existence of the Portal and FAQ is not transparent to the editor, who is only directed to Template:PBB Controls. Editing these pages is quite intimidating. AxelBoldt (talk) 15:39, 31 December 2008 (UTC)
Hi AxelBoldt, one of our fundamental principles behind the Gene Wiki is that any human edit is more competent than a bot edit. The benefits of automatic updating of the gene/protein summary are far outweighed by attention from human eyes (even adding a wikilink), so you should feel free to do as you see fit. The automatic updating is primarily aimed at the vast majority of pages that haven't yet been edited by a human, just to keep them in sync with the source databases. Agreed, the instructions are a bit more convoluted and intimidating than they need to be, which is why Boghog suggested that you simply remove template markings once you make any sort of edit. In the future, we're hoping to simplify the pages even more (but suffering from lack of bandwidth at the moment). Cheers, AndrewGNF (talk) 17:58, 31 December 2008 (UTC)
If human edits are indeed seen as more important than bot edits, you could simply instruct the bot to only update protein descriptions that haven't been altered since the last bot update round. (The other ones should be added to a list for human review.) AxelBoldt (talk) 16:14, 2 January 2009 (UTC)
The short answer is that we'd like to do that, but technically it's non-trivial to determine if the summary has been touched. The easy thing to do is that if there's any evidence of changes (e.g., wikilinks), then the bot will not update. (So, you can assume that adding a wikilink is the same as turning off the summary.) We may even go a step further and say that any non-blank summary will never be updated by the bot. Anyway, right now, these are all theoretical discussions since we don't have a primary PBB developer at the moment... (volunteers?) Cheers, AndrewGNF (talk) 00:52, 3 January 2009 (UTC)
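The two rules floated in the reply above (skip any summary showing evidence of human editing such as wikilinks, or more strictly skip any non-blank summary) could look roughly like this. It is a sketch only; the function name and the `strict` switch are assumptions, and a wikilink is just one possible sign of a human edit.

```python
import re

# A wikilink such as [[DNA]] or [[protein|proteins]].
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")


def bot_may_update(summary_text: str, strict: bool = False) -> bool:
    """Decide whether a bot should overwrite a summary.

    strict=False: refuse only when the text contains a wikilink.
    strict=True:  refuse whenever the summary is non-blank.
    """
    text = summary_text.strip()
    if strict:
        return text == ""
    return WIKILINK.search(text) is None
```

Under the looser rule, adding a single wikilink has the same effect as turning off the summary update, which matches the behaviour described in the thread.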

Just a start

I am trying to build up a portal using two templates to make it a bit easier. The elements of each page can be changed quite easily. What elements are needed? Please feel free to make changes and to fill the elements with content. Simplicius (talk) 16:57, 30 July 2008 (UTC)

Nice start! Should we just use Portal_talk:Gene_Wiki (which links under the "normal" discussion tab) instead of this discussion subpage? Simplicius, if you have no objection, can you do the move? (Or, if I don't hear any objection in a few days, I'll do it...) Cheers, AndrewGNF (talk) 01:02, 1 August 2008 (UTC)
Sorry, at the moment I am rather busy (in real life).
The idea is to centralize all communication from all talk pages to one page because there are many talk pages due to the number of sub pages in the portal.
I would like to set a template into the talk pages of the units of this portal for a "please go to [here] for further questions and discussions". Indeed, I have forgotten to do this. If you find this proposal ok I will do so and add a link everywhere for the centralisation of the discussions (by adding a template). -- Simplicius (talk) 12:12, 6 August 2008 (UTC)
PS: we can change "Discussion" into "Talk" or "Communication" if you like.
Further, I can integrate the navigation bar "Portal - Discussion - ..." into the head bar. Simplicius (talk) 13:45, 6 August 2008 (UTC)
All sounds good... I did change the Portal talk link to a redirect to the discussion page (since, in contrast to the other pages, this one won't have other content). Thanks, AndrewGNF (talk) 18:47, 12 August 2008 (UTC)

I think we absolutely want to maintain and emphasize our connection with WP:MCB. How should we do this? In my mind, this portal will serve two purposes. First (and most obviously), an organizational hub for people who want to focus on gene and protein annotation. Second, I hope many people come to edit the gene wiki through past and upcoming press -- and these newbies will be true novices. I hope this portal is a gentle landing point for those people... AndrewGNF (talk) 01:10, 1 August 2008 (UTC)

Wow, the more I look at it, the more effort I realize went into it. Thanks Boghog and especially simplicius for spearheading the creation of this portal. I've taken my stab at adding some content and reorganizing. I trust others will edit ruthlessly. Two quick thoughts. First, I think the front page looks a bit busy. Perhaps too much bold? Perhaps we need some figures to break up all the text? Second, I like the design where there is a thin right-hand navigation bar. Maybe if someone wanted to take a stab at creating that, we could move the Quick Links section there? Cheers, AndrewGNF (talk) 01:43, 1 August 2008 (UTC)
Why don't we declare the Gene Wiki project a daughter project of MCB and cross link the two project pages? Also I agree that some graphics would be nice. Why not create a project logo (for example a Photoshop mashup of Wikipedia-logo + DNA + human) that could be included as a banner on all Gene Wiki Portal Pages. We could start a contest to see who can devise the best logo. Cheers. Boghog2 (talk) 08:36, 1 August 2008 (UTC)
I think it's a great idea on the logo contest. Any takers? The daughter MCB project sounds like a good idea too, but I don't want this to seem like a distinct community from MCB. Rather, I'd like the MCB and the Gene Wiki communities to be one and the same, where the Gene Wiki is one coordinated effort for MCB members to work on. Hmmm, perhaps I'm arguing semantics here... AndrewGNF (talk) 16:23, 1 August 2008 (UTC)
A logo should be very fine. You can also add a new box into Portal:Gene Wiki for listing other related projects and portals if you like. Simplicius (talk) 12:36, 6 August 2008 (UTC)
I will attempt to put first draft logo together, and if nothing else, I hope it will inspire others to do better ;-) Concerning adding boxes to other relevant links, I think we need some side boxes, but I am not sure how to do this in the format of this portal. Cheers. Boghog2 (talk) 19:02, 8 August 2008 (UTC)
It would be a table with two or three columns.
I can do this in early September.
At the moment I am rather offline for personal reasons. Simplicius (talk) 12:06, 20 August 2008 (UTC)

Template Box 1 and 2

The portal uses two templates to generate new paragraphs:

  • box 2 if there is a content, and box 1 if only a title is needed.
  • Due to this change by Andrew I made this change to correct paths for a solution via weblinks. The page "Facilities" should look correct now.
  • If there are new boxes please change/add/strike the overview in Portal:Gene Wiki/Sitemapping.

Thx! Simplicius (talk) 12:48, 6 August 2008 (UTC)

Andrew, (no criticism, just a question) why do you prefer an [edit]-function via web address instead of wiki link? Simplicius (talk) 13:41, 6 August 2008 (UTC)

Ahh, thanks for that catch and fix... I like the link directly to the edit page because I think it's a bit clearer to the newbie. The concepts of templates and transclusions are somewhat advanced, and I can see a newbie editor being confused if the "edit" link just goes to another wiki page. (BTW, sorry for the late replies -- was out of town last week...) AndrewGNF (talk) 18:49, 12 August 2008 (UTC)

Contents

Please use this template for some "advertisement".

  • {{Portal|Gene Wiki}}

About the portal, please develop the primary questions for interested users:

  • who are we?
  • where do we come from?
  • what do we want?
  • why is our project useful?
  • why are genes important?
  • what are we doing?
  • how can you help?
  • what are our instruments?
  • what have we done so far?
  • what are we going to do?
  • and so on

You started already. Please go on.

And please try to put the works' output (WP articles etc.) onto the first page portal as an entertaining and informing page, and the technical hints and helps onto the second page Gene Wiki – Facilities. Simplicius (talk) 14:00, 6 August 2008 (UTC)

Again, great question Simplicius. However, before we add more content to the portal, I think it is important to discuss this in more detail on this page. I think one problem we face is that we have two fundamentally different audiences. The first is the technical audience of scientists and students who presumably will be adding most of the content to Gene Wiki articles. The second, and perhaps more important, is the non-technical/lay audience whose interest provides motivation and validation of the project in general. Therefore I think the above questions by necessity must be split into two sections. These questions could and should also be extended to the individual Gene Wiki articles:
* The lead of each article should be written in such a way as to be understandable to a wide audience.
* Additionally, if the gene/protein is associated with a particularly important issue such as a disease, this should be highlighted in the article.
Returning to the project as a whole, the most important question to answer is why. Again, this question has two aspects:
* Why is this project important to scientists? Off the top of my head, I can think of the following advantages. At the risk of overstating the issue, the unique format of Wikipedia allows rapid:
  • inclusion of the latest research results
  • correction of errors in public database and publications
  • collaboration without borders between researchers
  • free format inclusion of data
  • research "trivia" – the meaning of gene/protein synonyms/acronyms, experiments that didn't work, other amusing footnotes that rarely make their way into the peer reviewed literature (like here, here, and here)
  • synergy between existing Wikipedia and Gene Wiki content through internal Wikipedia hyperlinks (to me, this is the most compelling advantage)
* Why is this project important to the general public? (again with the strong risk of hyperbole)
  • genes and the proteins encoded by these genes are critical to understanding both health and disease
  • help justify the enormous investment in biomedical research
  • provide valuable background information which is important for making informed governmental policy decisions (e.g., funding priorities, cloning research, use of genetically modified organisms as foods, sources of drugs, and research tools, etc.)
I apologize for the rambling nature of what I have written above, but again, I think it is important to discuss this here before adding additional content to the front page of the portal. In addition, we should of course solicit input from the MCB project. Your thoughts? Cheers. Boghog2 (talk) 19:46, 8 August 2008 (UTC)
I think there are many interesting points here. In my mind, the primary objective of the Gene Wiki Portal is to present a page that will encourage a reader (of the scientific paper or one of the derivative news/blog articles) to make their first edit. I think that gives us the best chance of "hooking" a new editor into becoming an active contributor (to the Gene Wiki, to WP:MCB, and to WP as a whole). After that, the users will be willing to dig deeper to find the right links and communities. But if we put too much information up front, we risk overwhelming a newbie.
Also, I think Boghog makes some great selling points of the Gene Wiki. But I think we can assume to some degree that any reader of this page has bought into the concept. True cynics won't even make it this far. So, I think it's important to remind readers of the big picture, but we probably don't need to overdo it either... My two cents... AndrewGNF (talk) 18:59, 12 August 2008 (UTC)

Supervision tools

I will try to find the tools for supervising articles in a category so that you can see changes automatically. Please allow me a question: is there a certain category for articles generated by this project or by your tools? -- Simplicius (talk) 13:41, 6 August 2008 (UTC)

As far as categories, this is a bit difficult to identify since the articles are distributed over several, partially non-overlapping categories including {{protein}}, {{protein-stub}}, {{gene}}, etc. We already have:
  • Recent changes to the Gene Wiki: [13]
The only thing that we are missing is a recent change log to this portal. If you can figure out a way to combine monitoring of this portal with the link above, that would be ideal. Cheers. Boghog2 (talk) 21:05, 8 August 2008 (UTC)
Oh yeah, that would be really cool. Nothing encourages a newbie to edit like seeing other people actively editing. Probably we should use this link though [14], which only tracks article namespace edits... AndrewGNF (talk) 19:02, 12 August 2008 (UTC)

Nature article

Link - very cool. Tim Vickers (talk) 02:19, 4 September 2008 (UTC)

I'm really chuffed that they mention PDBWiki :-D --Dan|(talk) 14:02, 4 November 2008 (UTC)


About the Bot?

I know there is an article now, which is great, but what happened to the nice description and images like these: Image:PBB_flowchart.png, Image:PBB_flowchart_Sub.png? It would be really great to have an updated, detailed article. --Dan|(talk) 14:07, 4 November 2008 (UTC)

Well, we've been really rethinking how the overall flow of the program works (including how updates are handled) so those documents became a bit obsolete. Really I think the thing worth reporting is not the program, but the output of the program. Open to being convinced otherwise though... AndrewGNF (talk) 15:21, 4 November 2008 (UTC)

Annotation

I found this page via "PLoS Biology - Metagenome Annotation Using a Distributed Grid of Undergraduate Students". After half an hour I still have no idea from this page how I may begin to annotate. Is this not a failure of a basic purpose of the page? Mccready (talk) 01:36, 26 November 2008 (UTC)

Hi. Thanks for pointing out this interesting paper by Hingamp et al. While the paper and the Gene Wiki project both deal with gene annotation, they do so at very different stages of the annotation process. Hingamp et al. describe how to annotate a newly sequenced gene whose function has not yet been determined, much less described in the published literature. This process largely relies on searching for related sequences with known function. Since related sequences often (but not always) have similar function, one can provisionally assign a function to the newly sequenced gene based on analogy.
The scope of the Gene Wiki project is at the other end of the annotation spectrum. In keeping with Wikipedia's verifiability, no original research, and notability policies, Gene Wiki articles restrict themselves to notable human genes/proteins whose function has already been described in the published scientific literature. The starting point for the Gene Wiki project is the set of stubs created by User:ProteinBoxBot, which contain basic information about the gene/protein obtained from published literature. In the context of the Gene Wiki Project, annotation means expanding on the bot-generated stubs to include more content (backed up by citations to reliable sources) and at the same time, striving to make the article accessible to a wider audience of readers and integrating the article with pre-existing Wikipedia content. The additional content could come from a scientist or student who is already familiar with the gene/protein or someone else who has read a magazine, newspaper article or scientific paper on the subject.
I hope this explanation makes things clearer. Cheers. Boghog2 (talk) 21:32, 26 November 2008 (UTC)
Thanks, good explanation. Mccready (talk) 08:11, 27 November 2008 (UTC)

Double column

Dear Mr. or Ms. Bot: The double-column format for Further Reading does not look so good on my computer. The single-column format at ST3GAL3 looks much better. Can you fix yourself so you output the Further Readings in single column? Sincerely, your friend, GeorgeLouis (talk) 05:08, 30 April 2009 (UTC)

Thanks for your comment. Based on the discussion here, an enhancement where "{{refbegin|2}}" is replaced with "{{refbegin|colwidth=30em}}" has been added here. The advantage of this solution is that the number of columns displayed is dynamically adjusted depending on the screen width. If the width is narrow, only a single column is displayed. I have made this change to the ST3GAL3 article. Does this look OK? Cheers. Boghog2 (talk) 15:13, 30 April 2009 (UTC)
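As a rough illustration of the substitution described above (a sketch only; the bot's real implementation is not shown in this discussion, and the function name is hypothetical):

```python
import re

# Matches the fixed two-column form, tolerating stray whitespace and case.
REFBEGIN_TWO_COL = re.compile(r"\{\{\s*refbegin\s*\|\s*2\s*\}\}", re.IGNORECASE)

def use_flexible_columns(wikitext: str) -> str:
    """Swap {{refbegin|2}} for the width-based variant; leave other text alone."""
    return REFBEGIN_TWO_COL.sub("{{refbegin|colwidth=30em}}", wikitext)
```

A plain `{{refbegin}}` with no column argument is left untouched by this pattern.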

Yes, it looks very nice. GeorgeLouis (talk) 17:55, 30 April 2009 (UTC)

PBB lead discussion

Does anyone happen to remember where our last discussion of the lead sentence of PBB articles is? The best I can find is in User_talk:ProteinBoxBot/Archive2#Protein_vs._Gene, but somehow I seem to remember a more detailed discussion with several options. Hoping someone else's memory is better than mine... Cheers, AndrewGNF (talk) 18:09, 30 April 2009 (UTC)

The most recent discussion was here (toward the bottom of the archive). Cheers. Boghog2 (talk) 19:53, 30 April 2009 (UTC)
Super, thanks much! AndrewGNF (talk) 21:21, 30 April 2009 (UTC)

Is it possible for PBB to include wikilinks in the summaries it provides using the PBB_Summary template? For example, TNFRSF21 would benefit from several wikilinks. Most articles in PBB's scope probably have similar stub-like articles, with a brief lead having one or two wikilinks to general terms, and then a paragraph from PBB that has some specific terms that are obscure to a typical reader. In TNFRSF21, for example, NF-kappaB, MAPK8/JNK, TRADD are meaningless to almost everyone outside of molecular biology. And the following terms are probably meaningless to most people outside of general biology: TNF-receptor, apoptosis, domain, receptor, signal transduction, knockout, T-helper cell.

I suppose it might be difficult to have PBB identify which terms should be linked, whether that term has a corresponding Wikipedia article, and whether the article matching that term is actually about the intended topic (a link to "knockout", for example, would lead to something other than knockout gene). — Twas Now ( talkcontribse-mail ) 10:22, 24 May 2009 (UTC)

Thanks for your questions. The intention of including the PBB_Summary is to provide a seed that would later be expanded or entirely replaced by human editors. Therefore please feel free to make changes to the contents of any {{PBB_Summary}} template. These edits will not interfere in any way with the display of the contents of the template. If you do make an edit, please be sure to change the update_summary parameter in the {{PBB_Controls}} from "yes" to "no". This will ensure that your edits will not be overwritten by any future automatic updates done by a bot. Finally, if you do make extensive edits to the contents of the summary, please feel free to remove the template altogether.
I agree that wiki links should be included in PBB_Summary where appropriate. However I am not aware of any automatic procedure for inserting wiki links. As you point out, this would be difficult and is probably better done by a human editor. Cheers. Boghog2 (talk) 17:31, 24 May 2009 (UTC)
Ahh, some wording at Template:PBB Summary ("high risk that it will be deleted/overwritten") scared me into thinking the template was intended to be left in the article permanently, and I assumed it was because PBB would periodically update articles as new information about that gene/protein was discovered.
One approach to including wikilinks might be to manually produce a "whitelist" of terms that could be linked, and have PBB include a link to the first instance of any term from that whitelist. (As well as a "wikilink = yes/no" parameter in the PBB controls, to turn off automatic wikilinking). When it comes across "TRADD" for the first time in an entry, it will produce [[TRADD]]; when it comes across "JNK", it will change it to [[c-Jun N-terminal kinases|JNK]]. Ambiguous terms should generally not be added to the whitelist. For example, if PBB comes across "knockout" but not "knockout gene", there is a slight (...very slight) chance it might mean a literal knockout, and not knockout gene. There are more plausible examples, with subtler ambiguity, but I already used this one above... in reality, it would probably be fine to link knockout gene in all cases. — Twas Now ( talkcontribse-mail ) 18:04, 24 May 2009 (UTC)
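A minimal sketch of the whitelist idea above, assuming a hand-maintained mapping from term to article title. The two entries come from the examples in this comment; the function name and the `wikilink` switch are hypothetical, and nothing here is existing PBB behaviour.

```python
import re

# term as it appears in the summary -> target article title
WHITELIST = {
    "TRADD": "TRADD",
    "JNK": "c-Jun N-terminal kinases",
}

def link_first_instances(summary: str,
                         whitelist: dict[str, str] = WHITELIST,
                         wikilink: bool = True) -> str:
    """Link only the first occurrence of each whitelisted term."""
    if not wikilink:                      # mirrors a "wikilink = no" control
        return summary
    for term, target in whitelist.items():
        # Skip matches inside longer words or inside an existing [[...]] link.
        pattern = re.compile(r"(?<![\w\[|])" + re.escape(term) + r"(?![\w\]])")
        link = f"[[{term}]]" if term == target else f"[[{target}|{term}]]"
        summary = pattern.sub(lambda m: link, summary, count=1)
    return summary
```

Ambiguous terms such as "knockout" are simply left out of the mapping, so they stay as plain text for a human editor to resolve.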
Hmm, good point, we should probably modify the doc at Template:PBB Summary to be a little less ominous/threatening. But it's protected at the moment. Anyway, on my list of things to do (or anyone else should feel free to modify too)...
Good point on the whitelist. But I actually prefer to leave it as plain text for two reasons. First, the presence of wikilinks means that a human has edited it, so that is a clue to the bot to not touch the template again (for fear of overwriting a human edit). Second, giving a reader a simple and obvious improvement to make (like adding a wikilink) I think is a good way to draw newbies in to make their first edit. But having said that, I can see this issue both ways... AndrewGNF (talk) 18:47, 27 May 2009 (UTC)

Relating gene pages by protein interactions

Anyone watching here might be interested in this topic: Wikipedia_talk:WikiProject_Molecular_and_Cellular_Biology#Relating_gene_pages_by_protein_interactions. Cheers, AndrewGNF (talk) 22:47, 25 June 2009 (UTC)

Protein Box Bot Update

For anyone concerned Protein Box Bot will be making some page edits soon. See the announcement here. JonSDSUGrad (talk) 00:21, 23 July 2009 (UTC)

And it does not seem to be going so well :) Fvasconcellos (t·c) 00:31, 28 July 2009 (UTC)
One more: [15] Fvasconcellos (t·c) 00:31, 28 July 2009 (UTC)
Any more? I'm not sure what happened on those two pages, but it seems to be going fine for the other pages.. JonSDSUGrad (talk) 00:59, 28 July 2009 (UTC)
Yes, quite a few; see my recent contribs. Fvasconcellos (t·c) 01:03, 28 July 2009 (UTC)
Ok, I found the bug. I've stopped the bot and hopefully not too much damage has been done. After the reverts are done I'll start it back up. JonSDSUGrad (talk) 01:08, 28 July 2009 (UTC)
Glad you found it. Fvasconcellos (t·c) 01:27, 28 July 2009 (UTC)

Two more that I found and reverted, just for the record... [16] and [17] AndrewGNF (talk) 03:00, 28 July 2009 (UTC)

The problem will occur on any page missing the refend template. If there is an easy way to cross reference the edits made yesterday with pages that are missing the template, we should find all the errors. JonSDSUGrad (talk) 18:33, 28 July 2009 (UTC)
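One way the cross-reference could be done, sketched under the assumption that the pre-edit wikitext of each page touched in the run is already available as a title-to-text mapping (how that mapping is obtained is left open, and the function name is hypothetical):

```python
import re

# Matches {{refend}} with optional whitespace, case-insensitively.
REFEND = re.compile(r"\{\{\s*refend\s*\}\}", re.IGNORECASE)

def pages_missing_refend(edited_pages: dict[str, str]) -> list[str]:
    """Return titles whose wikitext has no {{refend}} template.

    edited_pages maps page title -> wikitext as it stood before the bot edit.
    """
    return sorted(title for title, text in edited_pages.items()
                  if REFEND.search(text) is None)
```

The resulting list would be the set of pages to inspect and, where needed, revert.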
Update - Thanks to those who helped revert pages damaged by the bug (there were 29 pages damaged, now all fixed). The run was otherwise successful with about 2800 pages being updated with the PDB Gallery. JonSDSUGrad (talk) 22:35, 30 July 2009 (UTC)

A request for comment has been made at the above link. Your input is welcome. Boghog2 (talk) 20:32, 19 August 2009 (UTC)

Automatic template generator

We've created the version 0.1 tool for generating a Gene Wiki page on demand, modeled after Diberri's template filler. It's hooked up to our BioGPS application, accessible at http://biogps.gnf.org. So that the default output is focused on the GeneWikiGenerator tool, most users here will be interested in using this URL: http://biogps.gnf.org/GeneWikiGenerator/. From there, search for your favorite gene by most public identifiers, symbols or aliases (or try one of the example queries), and then select one of the returned genes in the left-hand "Current gene list". You'll then see two windows. The bottom one has the Wikipedia page for the gene, if one exists. The top one has the wikitext and instructions for creating or updating a gene page according to the standard "Gene Wiki" format. If people here have a chance to try it out, we'd love to hear feedback.

(As a plug for BioGPS more generally, you can see the other gene report views by changing the "current layout" in the upper right. If you register for a free user account, you can further customize your layouts by mixing and matching any of the 150+ plugins in the plugin library. Any comments on BioGPS in general are also welcome... </shameless_plug>) Cheers, AndrewGNF (talk) 00:11, 27 August 2009 (UTC)

Great! This template generator will keep me busy for a while :-) I did notice a few bugs however. In the {{PBB/6335}} template which I generated using the version 0.1 of the tool, all the links in the external IDs section were wrong (OMIM: 133020→603415 , MGI: 96621→107636, HomoloGene: 4051→2237). Also the mouse UniProt accession code was missing. But everything else looks correct. Thank you! Boghog (talk) 04:48, 27 August 2009 (UTC)
Yikes, some key errors there. Thanks for discovering them for us. The OMIM, MGI, and Homologene errors have been corrected (as well as another hidden EC number error). Also noticed that the PDB image tags didn't show up right, we'll look into that too. The Uniprot omission is a bit more complicated. In the raw data file we download from ftp://ftp.ncbi.nih.gov/gene/DATA/ (gene_refseq_uniprotkb_collab.gz), none of the three uniprot entries shown on the Entrez Gene page are listed. So this is a data problem. I spot-checked a few other genes and they seem to have the mouse uniprot data just fine, so hopefully this just affects a few less-well-characterized genes... Cheers, AndrewGNF (talk) 16:30, 27 August 2009 (UTC)
... and the PDB image issue has now been fixed too... Cheers, AndrewGNF (talk) 18:14, 27 August 2009 (UTC)
Thanks for the fixes above. I have discovered two additional minor bugs:
  • In the "GNF Protein Box" window, if the HUGO gene name contains a comma, the returned name in the name field is truncated before this first comma.
  • In the "Gene Page" window, in the Entrez ref tag, the opening "<" is replaced with the equivalent HTML Entity Character "& l t ;". This happens twice.
Otherwise it works great! Thanks again for creating this tool. Cheers. Boghog (talk) 15:52, 13 September 2009 (UTC)
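For anyone cleaning up generated wikitext by hand in the meantime, the escaped ref tag can be restored with a small substitution. This is only a workaround sketch with a hypothetical function name; it says nothing about where the escaping happens inside the tool.

```python
import re

# Only un-escape "&lt;" when it starts a <ref ...> or </ref> tag.
ESCAPED_REF = re.compile(r"&lt;(/?ref\b)")

def restore_ref_tags(wikitext: str) -> str:
    """Turn '&lt;ref' and '&lt;/ref' back into real ref tags."""
    return ESCAPED_REF.sub(r"<\1", wikitext)
```

Other occurrences of the entity, such as a literal less-than sign in prose, are left as they are.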
Thanks! We'll get on these fixes soon... Cheers, AndrewGNF (talk) 16:07, 13 September 2009 (UTC)
One additional bug that I noticed. The Unicode character extensions (e.g., ä, ö, etc.) to the standard ASCII character set are mangled (e.g., in the author names in the further reading section). Also, of course, the GenLoc_db parameters need to be added (Hs_GenLoc_db = hg18, Mm_GenLoc_db = mm8) and the {{GNF Ortholog box}} needs to be removed from the GNF_Protein_box template. Cheers. Boghog (talk) 12:58, 27 September 2009 (UTC)

Gene Locations

Discussion moved from User_talk:TimVickers:

Hi Tim. Thanks for all your excellent work. Did somebody go through all the human gene locations and edit out the locations? It seems as though they used to list gene location using the old cytogenetic band tags (like "15p2.23"). Next they seemed to change to a numeric address expressed in megabasepairs (which I suppose they needed to abandon because there is so much intervariability in humans, like the variable region of triplet repeats that results in fragile X syndrome). But "n/a" is not useful. At least the old cytogenetic terminology could get you in the ballpark. Last year I built a large map that labels the cytogenetic bands, and there is an approximate scale so that users could at least get close to locating what they were looking for. If we're ever going to understand the architecture of the genome, we'll need to create tools to help us see the relationships (Entrez Gene is OK, but it's so "all-inclusive" that it's hard to find anything). Tim, I'd like to help, but I'm not sure where to start. doctorwolfie (talk) 14:16, 4 September 2009 (UTC)

Sorry for butting in, but I think many of the changes from cytogenetic band to megabasepairs that you refer to were due to replacing {{protein}} with {{GNF_Protein_box}} templates. The megabasepair location was used to provide a link to the UCSC Genome Browser. The advantage of the megabasepair link is that it more precisely targets the gene in question (one-to-one relationship) whereas one cytogenetic band may contain more than one gene (one-to-several relationship). I believe situations where "N/A" is displayed are due to omissions in the data file where these gene locations were obtained (ftp://ftp.ncbi.nih.gov/gene/DATA/) and we are slowly fixing these omissions as we find them. I hope this clarifies things. Cheers. Boghog (talk) 15:07, 4 September 2009 (UTC)
Thanks, that clarifies it. I'm not much good at editing, but is there anything I can do to help? doctorwolfie (talk) 19:51, 4 September 2009 (UTC)
Help is certainly welcome :-) You can start by letting us know if you find missing data. If you are inclined to do it yourself, follow the PBB edit link at the top right of the screen, just above the protein box. This will take you to the transcluded PBB template where you can add the missing data. If the human gene location is listed as N/A, then you will need to add data to the Hs_GenLoc_chr, Hs_GenLoc_start, and Hs_GenLoc_end parameters. I get these by following the Entrez Gene link and, on the Entrez page, following the "see related" link to the Ensembl database entry. The top of the Ensembl page will list the gene location. There may be easier ways to do this, but this is the only way I know. Hope this helps. Cheers. Boghog (talk) 20:31, 4 September 2009 (UTC)
Thanks guys... If you prefer, you can also use the tool described here to update the entire PBB template, under the assumption that most of the NAs have since been resolved. If that's not the case, and you start noticing more than a few of these, then perhaps we should start assembling a list so I can track down what's happening here. Cheers, AndrewGNF (talk) 21:50, 4 September 2009 (UTC)

Help finding edits which could be candidate GO annotations

I'm looking for some examples of edits to gene pages and I'm hoping people watching here might have suggestions. I want to highlight one (or ideally a handful) of edits that satisfy these two criteria: 1) includes a citation to a scientific reference, and 2) adds information that could be expressed as a Gene Ontology annotation but has not yet been captured as one by the annotation authorities. For example:

Portal:Gene Wiki/Annotations

Hmmm, I guess not terribly hard to find, but perhaps people here have particularly good examples to highlight? Cheers, AndrewGNF (talk) 01:32, 9 September 2009 (UTC)

Orthology Database

I am an author of the German wikipedia and also edit sometimes in the English wikipedia, mainly on biochemistry articles. I have been active on some gene articles and I noticed that the GNF Protein box and the GNF Ortholog box templates have a lot of useful links to other databases but these mainly cover single genes or gene products (like Entrez, Ensembl and Uniprot) but a link to a phylogenetic database is not present. So I felt that there would be a lot of added value if there were a direct link to a phylogenetic database that lists orthologs of a specific gene. This would allow the exploration of its corresponding genes in hundreds of complete genomes. There are already links from Uniprot to phylogenetic databases but you have to scroll all the way down to find them.

I am currently working with a group at ETH that created the OMA database (http://www.omabrowser.org). It is one of the biggest orthology databases and many other websites such as Uniprot and Ensembl link to OMA. Since I know this project best I would suggest integrating this website into the GNF Protein box, but there are also other projects which might be more suited for Wikipedia. On the technical side, the change in the infobox would only consist of a small adjustment since the database could be called directly with the uniprot entry ID (e.g. http://www.omabrowser.com/cgi-bin/gateway.pl?f=DisplayEntry&p1=PROCA02187 in the case of OMA). Did you discuss such an idea already? What do you think? Greetings --hroest 12:07, 9 September 2009 (UTC)

While the name is somewhat misleading, the HomoloGene link in the external IDs section of {{GNF Protein box}} does list orthologs of the human gene. In your example above, PROCA02187 is the 5-HT1A receptor and the HomoloGene link in that article points here. Even though the species included in the two lists are not identical, there is significant overlap. Is the HomoloGene link sufficient? Are there other databases that might be better? Cheers. Boghog (talk) 12:37, 9 September 2009 (UTC)
I too had the same question about OMA vs Homologene. On a separate note though, I did register OMA as a plugin in our gene annotation portal called BioGPS. Check out the help page to see how BioGPS works... Cheers, AndrewGNF (talk) 19:13, 9 September 2009 (UTC)
Hi, the overlap here is quite large because this protein is only present in higher eukaryotes but another example might look different if there are also prokaryotes in the group. For other databases see e.g. https://www.uniprot.org/uniprot/P08908. If you scroll all the way down you find Phylogenomic databases, unfortunately the link to HoverGene does not seem to work. To compare Homologene and OMA use these two links. There are also other databases such as EggNog (entry would look like this) or RoundUp (which could not find the sample gene I have used here but another example looks like [18] this).
The group that published OMA also wrote a comparative paper. They claim to be the biggest orthology database with currently 830 species. Homologene on the other hand has only eukaryotic species, so basically they have 20 species. This might be another point to consider. In their paper the group names several good orthology databases which we might consider here [...]the ranking of the best performing projects (OMA, RoundUp, Homologene, Inparanoid). Again, Inparanoid is limited to eukaryotic species. Greetings --hroest 11:29, 10 September 2009 (UTC)
Thanks for the explanation and additional details. I have a few thoughts. First, the GNF Protein box definitely has a human gene bias (to satisfy Wikipedia's notability criterion). Not sure how relevant differences in orthology assignments to prokaryotes are to WP users (though there appear to be higher level differences as well in the example above). Second, if there were consensus, we could have a separate "Orthology" section in the GNF Protein box to contain multiple links. Third, to this non-specialist and at first glance, OMA is much less friendly than Homologene and eggNOG since it doesn't display the organism. Users either have to know the cryptic organism code or use the mouseovers. My two cents... Cheers, AndrewGNF (talk) 18:01, 10 September 2009 (UTC)
Actually, we should probably move this discussion and any follow up to the broader group over to Wikipedia_talk:WikiProject_Molecular_and_Cellular_Biology#Orthology_Database. Cheers, AndrewGNF (talk) 18:03, 10 September 2009 (UTC)

Opening sentence

As discussed previously, there is a consensus to change the lead sentence to make clear that Gene Wiki pages are about both the gene and the protein encoded by that gene. The consensus reached was that the opening sentence (using GPR3 as an example) should read as follows:

  • G protein-coupled receptor 3 is a protein which in humans is encoded by the GPR3 gene.

Perhaps this should be modified slightly according to this edit as follows (replacing "which" with "that" is probably a good idea, although the two additional commas are, in my opinion, overdoing it):

  • G protein-coupled receptor 3 is a protein that, in humans, is encoded by the GPR3 gene.

In addition, the name should probably be taken from the UniProt database (protein name) rather than HUGO (gene name) while the gene symbol would be as before (i.e., the HUGO gene symbol). The UniProt protein name is often the same as the HUGO name, but sometimes differs (compare for example estrogen receptor beta where the UniProt name is "estrogen receptor beta" while the HUGO name is "estrogen receptor 2 (ER beta)").

Given the never-ending controversy that this opening sentence has generated, I would like BogBot to get started on this in the near future. I think it will be straightforward to write a regular expression that will recognize and replace the existing boilerplate lead sentence and leave untouched any sentence that a human editor has modified.

Thoughts concerning the two minor changes proposed above? Also, how best to proceed (bring this up again with the MCB project followed by WP:Bots/Requests for approval)? Cheers. Boghog (talk) 13:53, 27 September 2009 (UTC)
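The "replace only untouched boilerplate" idea can be sketched as follows. This is a minimal illustration, not BogBot's actual code: the old boilerplate wording matched by `BOILERPLATE` is an assumption (the thread does not quote it), the function name `rewrite_lead` is hypothetical, and the output uses one of the wordings under discussion.

```python
import re

# ASSUMPTION: the shape of the old bot-generated lead sentence. The real
# boilerplate is not quoted in this thread, so this pattern is hypothetical.
BOILERPLATE = re.compile(
    r"^'''(?P<name>[^']+)''', also known as '''(?P<symbol>[^']+)''', "
    r"is a human \[\[gene\]\]\.$"
)


def rewrite_lead(sentence, protein_name):
    """Return the new lead if `sentence` is untouched boilerplate, else None."""
    match = BOILERPLATE.match(sentence.strip())
    if match is None:
        # The sentence was modified by a human editor: leave it alone.
        return None
    symbol = match.group("symbol")
    return (
        f"'''{protein_name}''' is a [[protein]] that in humans "
        f"is encoded by the ''{symbol}'' [[gene]]."
    )
```

Anchoring the pattern with `^` and `$` is what keeps hand-edited sentences safe: anything that deviates from the exact boilerplate fails to match and is skipped.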

Support, personally preferring the first version of the opening sentence ('which', no commas). Since the main topic of the articles seems to be the protein (given that it is in boldface in the opening sentence, and that this will probably make the wording of the article easier), UniProt names would be more appropriate than HUGO names in my opinion. --ἀνυπόδητος (talk) 19:25, 1 October 2009 (UTC)
Strongly support. Sorry I completely missed this section. I favor this version:
  • G protein-coupled receptor 3 is a protein that in humans is encoded by the GPR3 gene.
I think the grammarians will have a problem with #1, and I think #2 has too many commas. Uniprot name instead of HUGO sounds great (and perhaps we should do this in {{GNF Protein box}} as well). Cheers, AndrewGNF (talk) 19:39, 1 October 2009 (UTC)
Just to polish up my English: Why is "which" incorrect? I always thought this was the all-purpose inanimate relative pronoun in English. --ἀνυπόδητος (talk) 19:50, 1 October 2009 (UTC)
As a native English speaker and writer, I think I've done this wrong for pretty much my entire life. But recently I learned and follow the general rule that which must always follow a comma, and that must not. This link has more info on the slight difference in meaning. Hope that helps! (But really, I should be the last person to consult on the subject... ;) ) Cheers, AndrewGNF (talk) 19:57, 1 October 2009 (UTC)
Thanks, really an interesting point. I've also found this – but it really seems that "that" is the less controversial choice. --ἀνυπόδητος (talk) 14:33, 2 October 2009 (UTC)
Support. --Arcadian (talk) 23:59, 1 October 2009 (UTC)
Sounds good, but how about splice versions like here? (this is BCL2L13 as random example). This can lead to several UniProt files (proteins) corresponding to the same gene if I understand correctly. Not sure how this is treated by ProteinBox. Sorry, I have not looked here for a long time. Biophys (talk) 04:32, 5 October 2009 (UTC)
Good point about splice variants. One gene can code for many splice variants and each presumably would have a different UniProt accession number and protein names. This obviously complicates things. I need to dig into this some more. Boghog (talk) 06:43, 5 October 2009 (UTC)
The BCL2L13 example is probably not a good one to illustrate splice variants. According to UniProt, there is only one reviewed human UniProt entry, Q9BXK5 for the BCL2L13 gene. I am not sure how the UniProt database works, but I do know there is frequently a great deal of debate on exactly which splice variants are expressed as protein as opposed to mRNA. I think we should be conservative and only list reviewed UniProt entries in which protein expression is confirmed and not just predicted. Boghog (talk) 19:52, 5 October 2009 (UTC)
According to the UniProt documentation: all protein sequences encoded by a same gene are merged into a single UniProtKB/Swiss-Prot entry. Hence for each gene, only one UniProtKB/Swiss-Prot (reviewed) entry should be listed and not the multiple UniProtKB/TrEMBL (unreviewed) entries that might be associated with the same gene. Thanks, however, for bringing up the issue since I was not previously aware of this distinction. This one-to-one relationship between gene and UniProt entry will certainly make the job of rewriting the opening sentence much more straightforward. Boghog (talk) 20:41, 5 October 2009 (UTC)
If you go to Entrez Gene for BCL2L13 [19], you would see the following Uniprot entries (at the very bottom): Q86T62, Q8IZP5, and Q9BXK5.1. Here they are in Uniprot: [20],[21],[22], and yes, only the last of them seems to be a good reviewed entry, and it has been chosen correctly by ProteinBox. Looks good. You might wish to look at this article. There is IPI [23] to sort such things out. Support your suggestion. Biophys (talk) 01:07, 6 October 2009 (UTC)
Yes, the reviewed part of Uniprot (UniProtKB) now includes only 20,329 proteins/entries, but with TREMBL it includes 87,185 variants [24], but we do not need them. Biophys (talk) 01:19, 6 October 2009 (UTC)
Thanks for the links and for your support. The Schiöth paper looks especially interesting and I think it will be useful to fill in some holes in the Gene Wiki classification of membrane proteins.
I have made a couple of test runs to replace the opening sentence (see for example here and here). Please note the following:
  1. The HUGO gene name has been replaced with the "Recommended" human UniProt protein name.
  2. The script appends the citation(s) corresponding to the "Pubmed IDs" for the gene taken from the genenames.org database. These citations usually are to the paper that describes the original cloning of the human gene. (note: I have fixed a bug in which ", " was prepended to the authors list; also the script now checks to see if the added citation was already included in the further reading section, if so, it is removed from the further reading section; the net effect is to move the citation in-line)
  3. If any word in the UniProt protein name ends in "ase", the script replaces the protein wiki link with enzyme in the opening sentence.
Before proceeding with the rest of the Gene Wiki articles, I want to check to make sure that everyone thinks these test edits look OK. Any additional suggestions or concerns? Cheers. Boghog (talk) 04:40, 18 October 2009 (UTC)
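Rule 3 above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `lead_link` is hypothetical, and splitting the name into alphabetic runs (so that "metalloproteinase-9" yields the word "metalloproteinase") is my own tokenisation choice, not something the thread specifies.

```python
import re


def lead_link(protein_name):
    """Return 'enzyme' if any word of the name ends in 'ase', else 'protein'."""
    # ASSUMPTION: a "word" is a run of letters, so digits and hyphens
    # act as separators (e.g. "proteinase-9" -> "proteinase").
    words = re.findall(r"[A-Za-z]+", protein_name)
    if any(word.lower().endswith("ase") for word in words):
        return "enzyme"
    return "protein"
```

A suffix test like this will also fire on ordinary words such as "base" or "phase", so names containing them may need a manual check.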
Brilliant! I think all the changes you mention below are excellent. I scanned through a few of the recent BogBot contributions and everything looks great to me. Full support from me to continue. Cheers, AndrewGNF (talk) 14:03, 18 October 2009 (UTC)

Mass autogeneration of high-quality PDB images

Following a discussion here, I've written a draft program which takes in a PDB ID and outputs a ray traced image of the corresponding structure. When I enter standard commands to make the image's background transparent in PyMOL (using version 0.99r6 on Windows), however, I can't seem to make them work. Specifically, entering 'set ray_opaque_background, 0' and/or 'set opaque_background, 0' does not result in the expected checkerboard background indicating a transparent background (see a relevant entry in PyMOLWiki here: http://pymolwiki.org/index.php/Ray_opaque_background). Using the GUI to select 'Display' -> 'Background' and uncheck 'Opaque' and then check 'Show Alpha Checkerboard' does nothing. This is odd, because I've made backgrounds transparent in PyMOL before (see [25], [26]). Has anybody run into a similar problem? Trouble-shooting via Google hasn't helped. Emw2012 (talk) 14:10, 27 September 2009 (UTC)

Using MacPyMOL version 1.1 and including "set ray_opaque_background, off" in the pml script, I just created this. Have you included another command in your script like "cmd.bg_color(color="white")" which might override the ray_opaque_background command? Boghog (talk) 14:41, 27 September 2009 (UTC)
Also how are you viewing the resulting file? Perhaps your viewer doesn't depict the transparent background as a checkerboard? Boghog (talk) 15:00, 27 September 2009 (UTC)
I've viewed the resulting file in GIMP and the standard Windows previewer. Both show background transparency as checkerboard in images with transparency known to work, but not in the images produced in 0.99r6 via the above method. I also tried the above on another Windows box, with a similarly failed result. The most recent free build for Mac (pre version 1.0) also doesn't show the desired effect from entering the mentioned commands. Maybe I'll have to install the (actually) latest build for developers on my shoddy Linux machine. Emw2012 (talk) 17:02, 27 September 2009 (UTC)
OK. Sorry for asking the obvious question about the viewer. I also assume it is not some other setting in your script. Did you previously use version 0.99r6 to produce the figures that you linked above or some older version? Boghog (talk) 18:12, 27 September 2009 (UTC)
I figured out the issue. After executing 'set opaque_background, 0' and 'set show_alpha_layer, 1', the image must be ray traced in order to show the checker board indicating background transparency. So the simple script is now working as intended: it takes in a PDB ID as a command-line argument and outputs a transparent, ray traced PNG of the protein shown in 'cartoon' mode. I've uploaded a testcase here. The new PDB image needs cropping, but I'd like to get feedback on the coloring scheme in particular. Is the standard 'chainbow' color still desired, or would SS-specific coloring be better? Other suggestions would be appreciated: how would we change the ProteinBoxBot images given that they can be automatically tweaked in PyMOL? Emw2012 (talk) 14:41, 28 September 2009 (UTC)
Great! I personally prefer the chainbow color since it enables one to distinguish the N- from the C-terminus and follow the threading of the primary sequence through the 3D structure. Coloring by secondary structure is a bit redundant since one can usually infer this from the local shape of the cartoon diagram. Concerning the cropping, unless you have "set auto_zoom, off", the figure should automatically be scaled to fit the entire screen, so cropping should not be necessary. In answer to your last question, it would probably be best to upload your figures to Wikipedia Commons with the Entrez or other appropriate ID in the file name. Once stored there, we could probably write a bot script to automatically check for images and make the corresponding change in the {{GNF_Protein_box}} template so that the new figure is displayed on the Wiki Gene page. Boghog (talk) 15:32, 28 September 2009 (UTC)
I've now got the coloring for each input PDB set to 'chainbows'. However, we may want to adjust resulting images, because nucleic acids are also being colored in chainbows. I can't think of how this adds any value to the image, since it effectively hides the DNA/RNA. What are others' thoughts on hard-coding some color for nucleic acids, like blue or red (or something else)? Coloring a protein/DNA complex (like 3cmt) with chainbow (after showing the structure in cartoon mode) gives an example of the problem I've described. To see an example of my proposed solution, color the DNA red with this command: color red, resn dg+dc+da+dt. Emw (talk) 20:04, 14 November 2009 (UTC)
I agree that nucleic acids should not be rainbow colored since it then becomes harder to distinguish protein from the nucleic acid. I would suggest coloring DNA/RNA in a color like magenta that won't be confused with the protein rainbow spectrum. Boghog (talk) 20:19, 14 November 2009 (UTC)
Good idea -- magenta it is. Emw (talk) 20:28, 14 November 2009 (UTC)
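The rendering steps agreed on in this thread (cartoon mode, chainbow coloring, magenta nucleic acids, transparent background, ray trace before saving) could be collected into a generated PyMOL script along these lines. This is a sketch, not Emw's actual program: the function name `build_pml`, the file names and the image size are hypothetical, and the command spellings should be checked against the PyMOL version in use.

```python
def build_pml(pdb_path, png_path, width=1200, height=900):
    """Return the text of a PyMOL .pml script for one structure."""
    lines = [
        f"load {pdb_path}",                 # local PDB file (hypothetical path)
        "hide everything",
        "show cartoon",
        "util.chainbow",                    # rainbow each chain, N- to C-terminus
        "color magenta, resn dg+dc+da+dt",  # recolor DNA after the chainbow pass
        "orient",
        "set ray_opaque_background, 0",     # transparent background
        f"ray {width}, {height}",           # must ray trace before saving
        f"png {png_path}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write the script to disk; it would then be run with PyMOL separately.
    with open("render_3cmt.pml", "w") as handle:
        handle.write(build_pml("3cmt.pdb", "3cmt.png"))
```

The order matters in two places: the magenta coloring comes after the chainbow pass so it is not overwritten, and `ray` precedes `png` so that the saved image is the ray traced one.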
I've uploaded two images generated from this script to User:Emw/PDBImageTestcases and User:Emw/PDBImageTestcases_2. While the cropping in the first test case is much improved, it's clear from the second test case that the cropping needs work if it is to be applied universally. Ideally PyMOL would have some option to automatically crop images so that there is some fixed number of pixels between the outermost atoms and the edge of the image. Does anyone have ideas on how that could be done? Emw (talk) 22:41, 14 November 2009 (UTC)

(unindent) I think I may have found a workaround to the cropping issue by using GIMP -- simply autocrop the image and add some fixed number of pixels (e.g., 25) as padding. Presumably this will be the solution once I find out how to incorporate the simple process for the fix into a batch script. However, while the image itself looks good, it seems to be deformed once it's added to the 'GNF Protein box' template. The problem is viewable at User:Emw/PDBImageTestcases_2. Any ideas on what may be causing this? Emw (talk) 00:51, 15 November 2009 (UTC)
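The geometry behind "autocrop, then pad by a fixed number of pixels" can be illustrated without GIMP. The sketch below is plain Python rather than GIMP's own autocrop, and the function name `padded_bbox` and the representation of the image as rows of alpha values are assumptions made for the example.

```python
def padded_bbox(alpha, pad=25):
    """Bounding box of non-transparent pixels, grown by `pad` on each side.

    `alpha` is a list of rows of alpha values (0 = fully transparent).
    Returns (left, top, right, bottom) in the original image's coordinates,
    with right/bottom exclusive, or None if the image is fully transparent.
    Coordinates may fall outside the image, meaning the canvas must grow.
    """
    rows = [y for y, row in enumerate(alpha) if any(row)]
    if not rows:
        return None
    width = len(alpha[0])
    cols = [x for x in range(width) if any(row[x] for row in alpha)]
    left, right = cols[0], cols[-1] + 1
    top, bottom = rows[0], rows[-1] + 1
    return (left - pad, top - pad, right + pad, bottom + pad)
```

With the default padding the output image is always 50 pixels wider and taller than the tight bounding box, which gives every structure the same margin regardless of its shape.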

Hmm, the image doesn't look obviously deformed to me. I think the template just shrinks the image to 250px width. In any case, I think the new images look fantastic! Cheers, AndrewGNF (talk) 19:24, 15 November 2009 (UTC)
Sorry, I changed User:Emw/PDBImageTestcases_2 without noting that here. The problem I mentioned is visible by comparing the PDB image in the middle template with the PDB image in the rightmost template. The middle template uses the standard 'GNF Protein box' template, while the rightmost template uses a very slightly tweaked version of 'GNF Protein box'. (The only difference between the two templates is that the latter uses a 1 px-larger border than the standard template, i.e. this, which doesn't seem like it should cause that big of a difference in the image). Emw (talk) 19:50, 15 November 2009 (UTC)
I've uploaded ten new test cases, listed at User:Emw/PDB_Sandbox. In general, the new images are finer-resolution, more efficiently oriented and cropped, higher and wider, and 5-10 times larger (e.g., 32 KB vs. 334 KB) than the old images. At this point I have worked out all the bugs that I've noticed, and am able to output the images shown in the mentioned test cases by passing in a list of PDB IDs. Based on the script's current design, there is significant variance in the time it takes to produce an image for a given PDB ID. It takes an average of about 40 seconds to produce an image on my mid-line desktop, based on the 20 different PDBs listed at User:JonSDSUGrad/Sandbox.
Also, is there a centralized list of pages uploaded by ProteinBoxBot with the naming form 'PBB Protein [protein abbreviation] image.jpg'? If there were, then I could presumably search for a Wikipedia article with that abbreviation and scrape the PDB of interest from the phrase 'PDB rendering based on [PDB ID].' contained in each article's PBB template. This approach seems a bit naive, but feasible as a fallback if there aren't any better ideas of how to get a list of PDB images to update and their corresponding Wikipedia articles. As can be gleaned from the names of each test case at User:Emw/PDB_Sandbox, the naming convention I have for the new images is 'Protein (protein abbreviation) PDB (PDB ID).png'. I think this is a slight improvement over the previous image naming convention used by ProteinBoxBot, which doesn't seem to have accounted for the fact that one protein can have multiple associated PDB files. Emw (talk) 08:42, 23 November 2009 (UTC)
I've made a list of all 2774 protein images uploaded by ProteinBoxBot at User:Emw/PDB_Sandbox/Target_list. Unless anyone has a better idea, I plan on using this to get the names of PDBs to generate by looking up the Wikipedia article for the protein abbreviation contained in each of the image's file names, and scraping the associated PDB ID from the standard PDB image caption: "PDB rendering based on (PDB ID)." Emw (talk) 17:12, 23 November 2009 (UTC)
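The lookup described above amounts to two pattern matches and one string format. A minimal sketch, assuming the file names and caption follow exactly the forms quoted in this thread; the function names are hypothetical and the PDB ID pattern (a digit followed by three alphanumerics) is my assumption about what a valid ID looks like.

```python
import re

# Old ProteinBoxBot file name: 'PBB Protein [abbreviation] image.jpg'
# (spaces may appear as underscores in page titles).
FILE_RE = re.compile(r"^PBB[ _]Protein[ _](?P<abbr>.+?)[ _]image\.jpg$")

# Standard caption: 'PDB rendering based on [PDB ID].'
CAPTION_RE = re.compile(
    r"PDB rendering based on (?P<pdb>[0-9][A-Za-z0-9]{3})\b", re.IGNORECASE
)


def abbreviation_from_filename(filename):
    """Extract the protein abbreviation, or None if the name does not match."""
    match = FILE_RE.match(filename)
    return match.group("abbr") if match else None


def pdb_id_from_wikitext(wikitext):
    """Return the first PDB ID named in a standard caption, or None."""
    match = CAPTION_RE.search(wikitext)
    return match.group("pdb") if match else None


def new_image_name(abbreviation, pdb_id):
    """Build a name of the form 'Protein (abbreviation) PDB (PDB ID).png'."""
    return f"Protein {abbreviation} PDB {pdb_id}.png"
```

Both extractors return `None` on a miss, so articles whose caption has been hand-edited can be collected into a list for manual review instead of silently producing a wrong ID.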
Your new images look great! I also like your idea of including the PDB accession number in the image name so that it is self-documenting. After reading your previous comments, I tried to find a way to automatically crop the images in PyMOL, but unfortunately there does not seem to be an easy way to do this, so I think you will have to stick to your GIMP solution. I assume from looking at your examples that you are using the PyMOL orient command, which aligns the first principal axis of inertia along the x-axis and the second along the y-axis, which I think is a great idea. Cheers. Boghog (talk) 19:30, 23 November 2009 (UTC)

Adding SCOP classification

Has the SCOP classification feature of the 'Automate PDB uploads' project proposal been implemented (Portal:Gene_Wiki/Project_proposals#Automate_uploads_of_PDB_images)? The automation needed to implement the SCOP feature seems straightforward in concept: go to the SCOP website, enter the PDB ID of interest into the search form, and parse from the retrieved lineage any elements of interest.

Which elements of a SCOP lineage should be retrieved by the feature (e.g., class, fold, superfamily, family, protein, and/or species)? And where should the information on the elements be put -- somewhere as a new field in one of the templates used by ProteinBoxBot, and/or in a newly added 'Summary' box in the image file itself (e.g. File:PBB_Protein_MMP9_image.jpg)? Emw (talk) 16:12, 4 November 2009 (UTC)

On second glance, following the actual proposal to add SCOP classification as a category seems best and should take first priority. If others like my above suggestions about where to add the classification, those can be done as well. Emw (talk) 16:21, 4 November 2009 (UTC)
Hi, Emw. The first version of this project was completed previously. We uploaded 66,000 PDB thumbnail images to commons and categorized according to SCOP. You can browse the collection at http://commons.wikimedia.org/wiki/SCOP. Of course, if you want to amend or replace those images with your higher-quality versions, go for it. You can use the existing category structure as a guide. Also, SCOP I'm sure has a downloadable file, so you won't need to hit their web server repeatedly. Cheers, AndrewGNF (talk) 21:06, 4 November 2009 (UTC)
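For mapping a SCOP entry onto a category hierarchy, one convenient handle is SCOP's concise classification string (sccs), whose dot-separated parts name the class, fold, superfamily and family, e.g. "a.1.1.1". The sketch below only splits such a string; it is an illustration with a hypothetical function name, and it deliberately does not assume any particular layout for SCOP's downloadable files.

```python
LEVELS = ("class", "fold", "superfamily", "family")


def split_sccs(sccs):
    """Map an sccs string such as 'a.1.1.1' to its cumulative levels."""
    parts = sccs.strip().split(".")
    if len(parts) != len(LEVELS) or not all(parts):
        raise ValueError(f"not a four-part sccs string: {sccs!r}")
    # Each level is identified by the prefix up to and including its part.
    return {
        level: ".".join(parts[: index + 1])
        for index, level in enumerate(LEVELS)
    }
```

For "a.1.1.1" this gives class "a", fold "a.1", superfamily "a.1.1" and family "a.1.1.1", which lines up with nesting categories one inside another.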

HUGO gene symbol as redirecting article title

Typically, gene articles containing protein structure images are titled in the form <HUGO symbol> (e.g. MMP9) or <HUGO symbol>_(gene) (e.g. PIR (gene)). Of the 2774 articles having structure images, there are about 221 that don't follow either of those two naming conventions. For example, consider interleukin 6, with the HUGO gene symbol IL6 -- IL6 is a disambiguation page and IL6 (gene) doesn't currently exist. Would it make sense to add IL6 (gene) as a redirect for this article, and to apply this pattern to the roughly 10% of gene articles with structure images that are not accessible via HUGO gene symbol? Emw (talk) 21:31, 24 December 2009 (UTC)

I am not sure exactly what the issue is here other than perhaps to make it easier for bots to find the relevant pages. It certainly wouldn't hurt to add redirects from <HUGO symbol>_(gene) to the current Gene Wiki pages. In a somewhat related issue, as discussed here, it also might be worth adding redirects from the UniProt protein name and full HUGO gene name to the existing Gene Wiki pages. Then every article would be linked (either directly as the article title or a redirect) from:
  1. HUGO gene symbol (if unambiguous),
  2. <HUGO symbol>_(gene),
  3. HUGO gene name,
  4. UniProt protein name, and optionally
  5. a Wikipedia editor assigned name that may differ from any of the above.
(please note that 3 and 4 are in many but not all cases identical)
However, adding all these links could be overdoing it. Cheers. Boghog (talk) 23:04, 24 December 2009 (UTC)
Alright, so I'll add redirects of the form <HUGO symbol>_(gene) for those ~200 gene articles mentioned in my previous post. I agree that the main benefit of these redirects is to ease bot navigation. With regard to adding Uniprot protein name and HUGO gene name redirects, I think that would be useful as well. Emw (talk) 03:51, 25 December 2009 (UTC)
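The redirect scheme listed above could be generated per article roughly as follows. This is a sketch with hypothetical function names; it only builds titles and wikitext, leaves the actual page creation to whatever bot framework is used, and treats whether the bare symbol is unambiguous as an input the caller must supply.

```python
def redirect_titles(symbol, hugo_name=None, uniprot_name=None,
                    symbol_is_unambiguous=False):
    """List candidate redirect titles for one Gene Wiki article."""
    candidates = []
    if symbol_is_unambiguous:
        candidates.append(symbol)          # 1. HUGO gene symbol
    candidates.append(f"{symbol} (gene)")  # 2. <HUGO symbol>_(gene)
    if hugo_name:
        candidates.append(hugo_name)       # 3. HUGO gene name
    if uniprot_name:
        candidates.append(uniprot_name)    # 4. UniProt protein name
    # Names 3 and 4 are often identical, so drop duplicates, keeping order.
    unique = []
    for title in candidates:
        if title not in unique:
            unique.append(title)
    return unique


def redirect_wikitext(target_title):
    """Wikitext body of a redirect page pointing at `target_title`."""
    return f"#REDIRECT [[{target_title}]]"
```

For the interleukin 6 case, where IL6 is a disambiguation page, `redirect_titles("IL6")` yields only "IL6 (gene)", and the page body would be `redirect_wikitext("Interleukin 6")`. A real run would also need to skip titles that already exist and any title equal to the article's own.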

PBB bug or feature?

Hi folks, I'm looking at merging all the different Telomerase RNA component articles that WP has amassed. At the top of the page is an odd phrase 'n/a (protein)n/a (protein)'. I suspect this is coming from the 'PBB|geneid=7012' template. I guess something in the template is spitting the dummy because this is not actually a protein. Is there a quick fix?--Paul (talk) 08:09, 9 October 2009 (UTC)

Impeccable timing. In the same minute that you posted your note, this change was made to {{GNF Protein box}} that fixed that problem. There was a slight delay in fixing it because the template is protected (only admins can change it). In any case, let us know if you notice any other bugs... Cheers, AndrewGNF (talk) 15:50, 9 October 2009 (UTC)
Now that's what I call an efficient service! --Paul (talk) 21:34, 9 October 2009 (UTC)

Which integrins are that?

Is integrin α4β7 (called LPAM-1, lymphocyte Peyer's patch adhesion molecule 1, here) the same as ITGB7? And while I am at it: Alpha-v beta-3 is probably the same as a ProteinBoxBot-created article, but which? --ἀνυπόδητος (talk) 11:04, 14 October 2009 (UTC)

These integrins appear to be heterodimers composed of two different proteins encoded by two separate genes. The α4β7 integrin is composed of CD49d (α4) and ITGB7 (β7) proteins while Alpha-v beta-3 is composed of ITGAV and CD61. Cheers. Boghog (talk) 11:45, 14 October 2009 (UTC)
Oh, yes, how stupid of me :-) --ἀνυπόδητος (talk) 11:48, 14 October 2009 (UTC)

Glycoprotein 72

According to WHO, glycoprotein 72 is the target of several monoclonal antibodies (anatumomab mafenatox, minretumomab, indium (111In) satumomab pendetide). It seems to be the same as CA 72-4, but I'd like to be sure before I create a redirect. Can anybody help? --ἀνυπόδητος (talk) 18:50, 24 October 2009 (UTC)

HuD/ELAVL-4

The image is actually of HuC (ELAVL-3), not HuD (ELAVL-4). Only two structures exist in the PDB, both RNA-bound dimers. These are 1FXL and 1G2E. regards, Sunil060902 (talk) 17:17, 27 October 2009 (UTC)

I have taken the liberty of correcting this myself. Newly uploaded image should be of 1fxl. best, Sunil060902 (talk) 17:24, 27 October 2009 (UTC)
Thanks for the explanation and for correcting the error. Cheers. Boghog (talk) 19:05, 27 October 2009 (UTC)

The Bot that does the protein images added the full chemical name of titin, a ridiculous amount of characters, could this be fixed? --The New Mikemoral ♪♫ 06:12, 16 November 2009 (UTC)

This was a case of vandalism which I have reverted. Thanks for pointing it out. Cheers. Boghog (talk) 07:54, 16 November 2009 (UTC)
To the bot template? --The New Mikemoral ♪♫ 06:37, 20 November 2009 (UTC)
Sorry for not being clearer. The bot did not introduce the full chemical name. The name was introduced by User:Teqnique in this edit. As you can see in the history of the {{PBB/7273}} template, it has been vandalized several times by different human editors and in each case, the vandalism was reverted. Boghog (talk) 10:16, 20 November 2009 (UTC)
Ah, that makes sense. --The New Mikemoral ♪♫ 01:57, 21 November 2009 (UTC)

BogBot changed the bolded name of this protein so it doesn't match the article title any more. It should be merged with EpCAM anyway, but in which direction? And how is this done without breaking the {{PBB Summary}}? --ἀνυπόδητος (talk) 12:36, 16 November 2009 (UTC)

The official HUGO gene name and UniProt protein name are both epithelial cell adhesion molecule. So I suggest that the name of the tumor-associated calcium signal transducer 1 be changed to "epithelial cell adhesion molecule" and the material from EpCAM be merged into the newly renamed article. Does this sound reasonable? Cheers. Boghog (talk) 22:04, 16 November 2009 (UTC)
Yes, that sounds good. I did the move, but have no time for the merge at the moment. Also, I'm not sure what to do with the PBB Summary template. Could you do that? Cheers --ἀνυπόδητος (talk) 07:20, 17 November 2009 (UTC)
Fine – does that mean, just remove the PBB Summary in favour of a human-written article? --ἀνυπόδητος (talk) 18:19, 17 November 2009 (UTC)
Hi ἀνυπόδητος, in general, those PBB Summary templates should be removed at whatever point they cease to be useful. Normally they were added to pages that had no other free-text content, so it was just meant to be a starting point. Cheers, AndrewGNF (talk) 20:40, 17 November 2009 (UTC)
Thanks! --ἀνυπόδητος (talk) 09:04, 18 November 2009 (UTC)
  1. ^ Forscher sammeln menschliche Gene in Wikipedia. In: Der Spiegel, 8 July 2008 (online)
  2. ^ A Gene Wiki for Community Annotation of Gene Function. In: PLoS Biology, July 8, 2008 (online)