Objectives of Thesis
1. The main objective of this thesis is to examine the native (un-modified) use of a generic natural language processing engine on biomedical literature retrieved from PubMed for extracting binding relationships between protein entities in the mouse. The resulting system will be called “Muscorian”, abbreviated from “Mouse (Mus musculus) Corpus Librarian”. Issues arising in the course of this work will be elucidated.
Question 1: Is an un-modified generic natural language processing engine inferior to its part-of-speech specialized counterpart in extracting un-specific protein-protein interactions from text, where inferiority is defined as a reduction of 5% in both precision and recall?
Null hypothesis: Using MontyLingua as the generic natural language processing engine, un-modified MontyLingua is inferior to its part-of-speech specialized counterpart in extracting un-specific protein-protein interactions from text.
Alternate hypothesis: Un-modified MontyLingua is not inferior to its part-of-speech specialized counterpart in extracting un-specific protein-protein interactions from text.
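The inferiority criterion in Question 1 can be illustrated with a minimal sketch. It assumes that "a reduction of 5%" means a drop of at least 5 absolute percentage points in both precision and recall; the function names and the counts in the example are hypothetical and are not results from Muscorian.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def is_inferior(generic, specialized, margin=0.05):
    """Return True if the generic engine is inferior to the specialized one.

    Assumption: inferiority requires BOTH precision and recall of the
    generic engine to be lower by at least `margin` (absolute difference).
    Each argument is a (precision, recall) tuple.
    """
    precision_drop = specialized[0] - generic[0]
    recall_drop = specialized[1] - generic[1]
    return precision_drop >= margin and recall_drop >= margin


# Hypothetical counts, for illustration only.
generic = precision_recall(tp=80, fp=20, fn=30)
specialized = precision_recall(tp=90, fp=10, fn=20)
print(is_inferior(generic, specialized))
```

If the intended criterion is a relative rather than an absolute reduction, only the two subtraction lines would need to change.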
Question 2.1: Can a system that extracts un-specific protein-protein interactions from text be specialized to extract protein-protein binding interactions with the same precision (not more than 3% difference)?
Null hypothesis: A system that extracts un-specific protein-protein interactions from text cannot be specialized to extract protein-protein binding interactions without a reduction in precision of more than 3%.
Alternate hypothesis: A system that extracts un-specific protein-protein interactions from text can be specialized to extract protein-protein binding interactions without a reduction in precision of more than 3%.
2. Secondly, adaptations of Muscorian to other relationships, such as protein activation and protein-disease relationships, will be examined.
Question 2.2: Can a system that extracts un-specific protein-protein interactions from text be specialized to extract protein-protein activation interactions with the same precision (not more than 3% difference)?
Null hypothesis: A system that extracts un-specific protein-protein interactions from text cannot be specialized to extract protein-protein activation interactions without a reduction in precision of more than 3%.
Alternate hypothesis: A system that extracts un-specific protein-protein interactions from text can be specialized to extract protein-protein activation interactions without a reduction in precision of more than 3%.
3. Thirdly, the usability of the PubMed search engine as a tool for co-occurrence (co-retrieval) analysis will be compared with protein name co-occurrence analysis as information extraction methods for un-specific protein-protein interactions from text.
Question 3.1: Given that there are 3 methods of co-occurrence measurement (PubGene, CoPub Mapper, Poisson statistics), are they comparable to each other?
Null hypothesis: The 3 methods yield the same results.
Alternate hypothesis: The 3 methods are not comparable to each other, that is, they yield significantly different results.
Question 3.2: Can PubMed document retrieval by protein names be used for co-occurrence analysis?
Null hypothesis: Co-occurrence analysis using PubMed document retrieval by protein names yields the same results as the 3 published methods of co-occurrence computation (PubGene, CoPub Mapper, Poisson statistics).
Alternate hypothesis: Co-occurrence analysis using PubMed document retrieval by protein names yields significantly different results from the 3 published methods of co-occurrence computation (PubGene, CoPub Mapper, Poisson statistics).
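As an illustration of the Poisson approach named above, the sketch below scores one protein pair from abstract counts. It is a generic Poisson co-occurrence model written under stated assumptions, not necessarily the exact formulation of the published method: the two names are assumed to occur independently, so the expected number of co-occurring abstracts is n_a * n_b / n_total. The function name, parameter names and counts are hypothetical.

```python
import math


def poisson_cooccurrence_pvalue(n_a, n_b, n_ab, n_total):
    """Probability of observing at least n_ab co-occurrences by chance.

    n_a, n_b : number of abstracts mentioning protein A / protein B
    n_ab     : number of abstracts mentioning both
    n_total  : number of abstracts in the corpus
    Assumption: under independence, co-occurrence counts follow a Poisson
    distribution with mean lam = n_a * n_b / n_total.
    """
    lam = n_a * n_b / n_total
    if lam == 0:
        return 1.0 if n_ab == 0 else 0.0
    # P(X >= n_ab) = 1 - P(X <= n_ab - 1); terms are computed in log space
    # so that large counts do not overflow.
    cdf = sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(n_ab)
    )
    return max(0.0, 1.0 - cdf)


# Hypothetical counts, for illustration only.
p = poisson_cooccurrence_pvalue(n_a=100, n_b=200, n_ab=10, n_total=10000)
print(p)
```

The same counts could come either from scanning abstracts for both names or from PubMed retrieval by each name, which is the comparison raised in Question 3.2.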
4. Fourthly, Muscorian's output on protein-protein binding interactions will be compared to co-occurrence analysis of the same list of proteins on the same corpus.
Question 4: Is there a synergistic advantage from using natural language processing and various co-occurrence statistics concurrently in extracting protein-protein interactions from text?
Null hypothesis: There is no synergistic advantage when the results from combining both sets of methods are compared with the results from each set used individually.
Alternate hypothesis: There is a synergistic advantage when the results from combining both sets of methods are compared with the results from each set used individually.
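One way the comparison in Question 4 might be set up is sketched below. It assumes that each method outputs a set of unordered protein pairs and that "combination" means either the union or the intersection of those sets; both are assumptions for illustration, and the protein names (P1 to P4) and the gold standard are placeholders rather than data.

```python
def pair(a, b):
    """Unordered protein pair, so that (A, B) and (B, A) compare equal."""
    return frozenset((a, b))


def evaluate(predicted, gold):
    """Return (precision, recall) of a predicted set of pairs against gold."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Placeholder sets, for illustration only.
gold = {pair("P1", "P2"), pair("P2", "P3"), pair("P3", "P4")}
nlp = {pair("P1", "P2"), pair("P2", "P3")}
cooccurrence = {pair("P2", "P1"), pair("P3", "P4"), pair("P1", "P4")}

results = {
    "nlp": evaluate(nlp, gold),
    "cooccurrence": evaluate(cooccurrence, gold),
    "union": evaluate(nlp | cooccurrence, gold),
    "intersection": evaluate(nlp & cooccurrence, gold),
}
for name, (precision, recall) in results.items():
    print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

In such a setup the union typically trades precision for recall and the intersection does the reverse, so a synergistic advantage would have to show up as a combination that improves on both individual methods.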
5. Fifthly, the expectations and problems of using a biomedical literature analysis system to aid active biological research will be elucidated to give insight into the slow uptake of biomedical literature analysis technology by research biologists. At the same time, an application for mapping microarray results onto a literature-mined relationship map using outputs from Muscorian will be developed.
Question 5: What are the expectations and problems of using a biomedical literature analysis system to aid active biological research?
Null hypothesis 1: Research biologists have a clear understanding of biomedical literature analysis technology.
Alternate hypothesis 1: Research biologists have a poor understanding of biomedical literature analysis technology.
Null hypothesis 2: Precision and recall are the only performance measures that matter to biologists.
Alternate hypothesis 2: Precision and recall are not the only performance measures that matter to biologists.
Null hypothesis 3: Biologists face no problems in interpreting the results from a biomedical literature analysis system.
Alternate hypothesis 3: Biologists face problems in interpreting the results from a biomedical literature analysis system.
6. Sixthly, the main issues pertaining to the deployment of Muscorian for production use will be discussed.
7. Lastly, the various tagged corpora used in the evaluation of Muscorian, together with the corresponding programmatic access routines and common evaluation tools, will be collected as a package.
Work Progress So Far
Objective 1: Completed. Manuscript in preparation.
Objective 2: In progress, 50% completion.
Objective 3: In progress, 10% completion.
Objective 4: In progress, 60% completion. Computation completed; now in the analysis stage.
Objective 5: Not started, in planning.
Objective 6: Initiated.
Objective 7: Completed. Manuscript in preparation.
Thesis Writing So Far
Chapter 1 – Introduction (Draft completed)
Chapter 2 – System Description (40% completed)
Chapter 3 – System Evaluation (30% completed)
Chapter 4 – Entity Co-occurrence and Document Co-Retrieval (Not started)
Chapter 5 – Development of Microarray-Literature Mapper (Not started)
Chapter 6 – Deployment Issues (Not started)
Chapter 7 – General Discussions and Future Work (Not started)
Plans for Third Year
List of Publications / Presentations
Ling, MHT, Chung, IF, Kuo, CJ, Lefevre, C, Lonie, A, Nicholas, KR, Lin, F. 2006. Biological Corpus Collection. In preparation.
Ling, MHT, Lefevre, C, Lonie, A, Nicholas, KR, Lin, F. 2006. Re-construction of Protein-Protein Interaction Pathways by Mining Subject-Verb-Objects Intermediates. Submitted.
Ling, MHT, Chung, IF, Kuo, CJ, Lefevre, C, Lonie, A, Nicholas, KR, Lin, F. 2006. Biological Corpus Collection. [Available at: ib-dwb.sourceforge.net/BCC.html] (Software tools)
Ling, MHT, Nicholas, KR, Lin, F, Lonie, A, and Lefevre, C. 2005. Muscorian: A pipeline for biological text analysis. [Available at: ib-dwb.sourceforge.net/Muscorian.html] (Software tools)
Ling, MHT, Lefevre, C, and Nicholas, KR. 2006. A Pipeline for Analysis of Published Abstracts for Information on Protein-Protein Inter-Relations. Proceedings of the Fourth Asia-Pacific Bioinformatics Conference. (Abstract)
Ling, MHT, Lefevre, C, and Nicholas, KR. 2005. Mosirium: A Modelling and Simulation Tool for Lactation in the Mouse. Proceedings of the Third Asia-Pacific Bioinformatics Conference. (Abstract)