Supervisor: Dr. Sandra Zilles
Thesis: Precision-based selection criteria for classification with ensembles of learners
GPA: 95 (out of 100)
Supervisor: Dr. Yaser Norouzi
Thesis: Pulse-Amplitude and Time-of-Arrival Based Pulse De-interleaving
GPA: 16.9 (out of 20) - Last year (38 credits) GPA: 17.51 (out of 20)
Human beings tend to seek the opinions of multiple experts and combine them to make a wise decision. For example, multiple reviewers judge a paper and decide whether to accept it, with or without revision, or to reject it. Intuitively, this reduces the risk of making a wrong decision compared to relying on the information provided by a single expert.
A core problem in machine learning is to learn a classifier that categorizes data instances into two or more classes. Such a classifier can be viewed as an artificial expert. As in the example above, combining multiple experts is a very popular approach in machine learning. An ensemble of classifiers is made by integrating a set of base classifiers to build a predictive model, an idea that was formally introduced by Hansen and Salamon in 1990.
In this project, we introduce a boosting framework that generalizes AdaBoost.M1 and, based on a new approach within that framework, derive several variants of AdaBoost.M1.
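For orientation, the reweighting step of standard AdaBoost.M1 (the baseline that the framework generalizes) can be sketched as follows. This is a minimal illustration of the textbook algorithm, not of the new framework or its variants. The function name `adaboost_m1_update` and its argument layout are assumptions made for this example.

```python
import math


def adaboost_m1_update(weights, correct):
    """One reweighting step of standard AdaBoost.M1 (illustrative sketch).

    weights: current distribution over the training instances (sums to 1).
    correct: correct[i] is True if the base classifier labelled instance i correctly.
    Returns (new_weights, vote), where vote = log(1 / beta) is the weight
    of this base classifier in the final weighted vote.
    """
    # Weighted error of the base classifier under the current distribution.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # This sketch treats both boundary cases as stopping conditions.
    if error <= 0.0 or error >= 0.5:
        raise ValueError("this sketch requires 0 < error < 1/2")
    beta = error / (1.0 - error)
    # Shrink the weight of correctly classified instances, then renormalize.
    scaled = [w * beta if ok else w for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return [w / total for w in scaled], math.log(1.0 / beta)


new_weights, vote = adaboost_m1_update([0.25] * 4, [True, True, True, False])
```

After the update, the misclassified instances carry half of the total weight, which forces the next base classifier to focus on them.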
Angluin's pattern languages are built on patterns consisting of terminal symbols and variables. A string in the language of a pattern is obtained by replacing every variable with a finite nonempty string of terminal symbols. For example, let $\Sigma = \{a, b, c\}$ and $X = \{x_1, x_2, x_3\}$ be the sets of terminal symbols and variables, respectively. Then $p_1 = a x_1 c x_2$ is a pattern and $w_1 = abccc$ is a string in the language of $p_1$.
In Angluin's patterns, if a variable occurs more than once, each of its occurrences is replaced by the same string. For example, if $p_2 = x_1 a x_1$, then $w_2 = bcaabca$ is a string in the language of $p_2$, but $w'_2 = cbabc$ is not.
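The membership examples above can be checked mechanically. The sketch below translates a pattern into a regular expression with backreferences, so that repeated variables must be replaced by the same nonempty string. The token-list representation of patterns and the names `pattern_to_regex` and `in_language` are assumptions made for this illustration, and variable names are assumed to be valid Python identifiers such as `x1`.

```python
import re


def pattern_to_regex(pattern, alphabet):
    """Translate a pattern (list of tokens) into a regex with backreferences.

    Tokens that belong to `alphabet` are terminal symbols; every other
    token is treated as a variable name.
    """
    # A variable stands for a nonempty string over the terminal alphabet.
    char_class = "[" + "".join(re.escape(c) for c in sorted(alphabet)) + "]+"
    seen = set()
    parts = []
    for token in pattern:
        if token in alphabet:
            parts.append(re.escape(token))
        elif token in seen:
            # Repeated variable: must match the same string as before.
            parts.append(f"(?P={token})")
        else:
            seen.add(token)
            parts.append(f"(?P<{token}>{char_class})")
    return re.compile("".join(parts))


def in_language(word, pattern, alphabet):
    """Return True if `word` is in the language of `pattern`."""
    return pattern_to_regex(pattern, alphabet).fullmatch(word) is not None


sigma = {"a", "b", "c"}
p1 = ["a", "x1", "c", "x2"]
p2 = ["x1", "a", "x1"]
print(in_language("abccc", p1, sigma))
print(in_language("bcaabca", p2, sigma))
print(in_language("cbabc", p2, sigma))
```

The regex engine finds a suitable substitution by backtracking, so this approach is convenient for small examples but can be slow for long patterns with many variables.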
Michael Geilke and Sandra Zilles introduced Relational Patterns by allowing other relations between the variables in a pattern. Let $R$ be a set of relations among variables. For example, consider $p_3 = x_1 a x_2$ and $R_3 = \{x_1 = x_2^r\}$ (the substitution for $x_1$ must be the reverse of the substitution for $x_2$). Then $w_3 = cbabc$ is a string in the language of $p_3$ under the relation $R_3$, but $w'_3 = cbacb$ is not.
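A reversal relation cannot be expressed with ordinary regex backreferences, but membership for small examples can be checked with a brute-force backtracking search. The following is a minimal sketch under stated assumptions: patterns are token lists, the only supported relation is reversal, given as hypothetical pairs `(u, v)` meaning that the substitution for `u` is the reverse of the substitution for `v`, and both variables of each pair occur in the pattern. It is not the algorithm studied in the project and takes exponential time in the worst case.

```python
def matches(word, pattern, terminals, reversed_pairs=()):
    """Check membership of `word` in the language of a relational pattern.

    pattern: list of tokens; tokens in `terminals` are terminal symbols,
             all other tokens are variable names.
    reversed_pairs: pairs (u, v) requiring sub(u) == reverse(sub(v)).
    """
    # Substitutions may only contain terminal symbols.
    if any(ch not in terminals for ch in word):
        return False

    def search(pos, idx, assignment):
        if idx == len(pattern):
            # Whole pattern consumed: the word must be used up and
            # every reversal relation must hold.
            return pos == len(word) and all(
                assignment[u] == assignment[v][::-1] for u, v in reversed_pairs
            )
        token = pattern[idx]
        if token in terminals:
            return word.startswith(token, pos) and search(
                pos + len(token), idx + 1, assignment
            )
        if token in assignment:
            # Repeated variable: reuse the earlier substitution.
            value = assignment[token]
            return word.startswith(value, pos) and search(
                pos + len(value), idx + 1, assignment
            )
        # Try every nonempty substitution for a new variable.
        for end in range(pos + 1, len(word) + 1):
            assignment[token] = word[pos:end]
            if search(end, idx + 1, assignment):
                return True
        assignment.pop(token, None)
        return False

    return search(0, 0, {})


sigma = {"a", "b", "c"}
p3 = ["x1", "a", "x2"]
r3 = [("x1", "x2")]
print(matches("cbabc", p3, sigma, r3))
print(matches("cbacb", p3, sigma, r3))
```

Checking the relations only after a complete assignment keeps the sketch short; pruning earlier would be faster but is not needed for examples of this size.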
In this project, we studied a specific type of relation and considered fundamental problems such as decision problems, classical learnability questions, the properties of tell-tale sets, and the design of subclasses that can be learnt efficiently with membership queries.
Transmembrane (TM) proteins are proteins that span a cell membrane; their segments crossing the membrane are called TM domains. TM domain detection and TM protein detection are important problems in computational biology, but typical machine learning approaches yield classifiers that are difficult to interpret and hence offer no biological insight.
We study both TM domain and TM protein detection with easy-to-interpret decision trees. For TM domain detection, the use of decision trees has already been reported in the literature, but we provide a critical study of the existing approach, resulting in improved feature sets as well as observations on how to avoid biased training and test sets. In particular, we discover a motif known to be common to TM domains that was not discovered in previous research using machine learning. For TM protein detection, we propose a two-layer learning method. This method can be generalized to deal with a large class of string classification problems. It achieves sensitivity and specificity values of up to 92% on the settings we experimented with, while providing intuitive classifiers that are easy for the domain expert to interpret.
The results of this project are published and available here.