The protein universe, one of the fundamental entities of biotechnology, is vast. With interest in structural genomics increasing nationwide, there is a critical need to classify the structural data generated by proteomic initiatives. Although experimental methods (X-ray crystallography and NMR spectroscopy) are critical for determining protein structures, assigning the resulting structures to structural classes remains a challenge. In this project, we will develop new computer-aided methods to determine the protein structural class and integrate the research into educational activities.
These new technologies are attracting capable students and exposing them to the essentials of scholarly research in the sciences. The technologies developed also have the potential to be transferred to industry. In addition, this project gives students the opportunity to contribute new ideas to data mining, bioinformatics, and biological database research, which will train a new generation of the workforce for the high-tech industry in California. First, we investigate innovative machine-learning methods for determining the protein structure class, which could have a significant impact on the area of bioinformatics. Second, we develop educational materials and pedagogical teaching models that are integrated into a multidisciplinary education program whose aim is to recruit and retain more students in the fields of computing and biology and to increase diversity among the ranks of computer and biology professionals.
Compared with other computational tools for protein structure classification, the innovation of our approach is that it identifies protein structure categories directly from NMR spectra using chemical shift information. Chemical shift information differs significantly from the protein sequence information usually studied in the bioinformatics literature. NMR accounts for approximately 20% of the experimental structures deposited at the Research Collaboratory for Structural Bioinformatics (RCSB), and this number is rapidly increasing. This growing body of data strengthens our approach by supporting efficient and robust representations as we train the system to detect many classes and subclasses of protein structure. Our approach therefore complements existing biotechnology tools for structural bioinformatics.
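To make the idea of reading a structural class from chemical shifts concrete, the following is a minimal sketch and not the project's actual method. It replaces the machine-learning models with a simple rule-based heuristic that relies on one well-known observation: Cα secondary chemical shifts (observed shift minus random-coil shift) tend to be positive in helices and negative in β-strands. The function names, the cutoffs (0.7 ppm, 15%), and the class labels are all assumptions chosen for illustration.

```python
from collections import Counter

# Illustrative thresholds; assumptions for this sketch, not values from the project.
HELIX_CUTOFF = 0.7    # ppm: C-alpha secondary shift at or above this looks helix-like
STRAND_CUTOFF = -0.7  # ppm: C-alpha secondary shift at or below this looks strand-like
MIN_FRACTION = 0.15   # minimum fraction of residues for a state to count as present


def residue_state(ca_secondary_shift):
    """Map one C-alpha secondary shift (ppm) to 'H' (helix), 'E' (strand) or 'C' (coil)."""
    if ca_secondary_shift >= HELIX_CUTOFF:
        return "H"
    if ca_secondary_shift <= STRAND_CUTOFF:
        return "E"
    return "C"


def structural_class(ca_secondary_shifts):
    """Assign a coarse structural class from per-residue C-alpha secondary shifts.

    The input is a sequence of floats, one per residue, already referenced
    against random-coil values. The class labels are hypothetical.
    """
    shifts = list(ca_secondary_shifts)
    if not shifts:
        raise ValueError("at least one secondary shift is required")

    counts = Counter(residue_state(s) for s in shifts)
    helix_fraction = counts["H"] / len(shifts)
    strand_fraction = counts["E"] / len(shifts)

    has_helix = helix_fraction >= MIN_FRACTION
    has_strand = strand_fraction >= MIN_FRACTION

    if has_helix and has_strand:
        return "mixed alpha/beta"
    if has_helix:
        return "mainly alpha"
    if has_strand:
        return "mainly beta"
    return "few secondary structures"
```

A learned classifier would replace the fixed cutoffs with parameters estimated from labeled structures and would typically use shifts from several nuclei rather than Cα alone. The sketch only shows the shape of the mapping from shift data to a class label.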
Based on novel machine-learning techniques, we are developing software packages for protein structure class prediction, together with automated tools for data collection. Combined, the tools and software packages form a general framework of effective, automated components for data collection, data cleaning, data analysis, and data evaluation. These tools will be freely available to the scientific community, and we expect them to have a significant impact on the area of computational biology.