In real-world applications, training a classifier on an unbalanced dataset is a major problem, as it degrades the performance of machine learning algorithms. Unbalanced datasets can be classified effectively with the Support Vector Machine (SVM), which uses the kernel technique to find a decision boundary. High dimensionality and uneven distribution of the data have a significant impact on this decision boundary. By employing feature selection (FS), the high dimensionality of the data can be addressed by selecting prominent features. FS is usually applied as a pre-processing step in both soft computing and machine learning tasks, and it is employed in different applications with a variety of purposes: to overcome the curse of dimensionality, to speed up classification model construction, to help unravel and interpret the innate structure of data sets, to streamline data collection when the measurement cost of attributes is considered, and to remove irrelevant and redundant features, thus improving classification performance. Hence, in this paper, two different FS approaches are proposed, namely Fuzzy Rough set based FS and Fuzzy Soft set based FS. After FS, the reduced dataset is given to the proposed Iterative Fuzzy Support Vector Machine (IFSVM) for classification, which considers two different membership functions. Experiments have been carried out on four different data sets, namely Thyroid, Breast Cancer, Thoracic Surgery, and Heart Disease. The results show that classification accuracy is better for the Fuzzy Rough set based FS than for the other approach.

Keywords: Support Vector Machine, Fuzzy logic, Rough

Sets, Soft Sets, Feature selection.

—————————————————————————————————————————————

1. Introduction

SVM is one of the most well-known supervised machine learning algorithms for classification and prediction, developed by Cortes and Vapnik [1] in the 1990s as a result of the collaboration between the statistical and the machine learning research communities. SVM tries to classify cases by finding a separating boundary called a hyperplane. The main advantage of the SVM is that it can, with relative ease, overcome ‘the high dimensionality problem’, i.e., the problem that arises when there is a large number of input variables relative to the number of available observations [2]. Also, because the SVM approach is data-driven and possible without a theoretical framework, it may have important discriminative power for classification, especially in cases where sample sizes are small and a large number of features (variables) are involved (i.e., a high-dimensional space). This technique has recently been used to improve methods for detecting diseases in clinical settings [3, 4]. Moreover, SVM has demonstrated high performance in solving classification problems in bioinformatics [5, 6].
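As a rough illustration of the separating-hyperplane idea, a linear soft-margin SVM can be trained by sub-gradient descent on the hinge loss. The toy data, the training loop, and all parameters below are illustrative assumptions, not the datasets or the solver used in this paper:

```python
import numpy as np

# Illustrative 2-D data: two linearly separable clusters (not the paper's datasets).
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [4.0, 4.0], [4.5, 4.2], [4.2, 3.8]])
y = np.array([-1, -1, -1, 1, 1, 1])  # SVM labels in {-1, +1}

# Sub-gradient descent on the soft-margin objective:
#   minimize  0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i + b))
w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                       # points inside or beyond the margin
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w, b = w - lr * grad_w, b - lr * grad_b

# A new case is classified by the side of the hyperplane w.x + b = 0 it falls on.
print(np.sign(X @ w + b))
```

Kernel techniques replace the inner products here with kernel evaluations, which yields the non-linear decision boundaries mentioned above.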

In many practical engineering applications, the obtained training data is often contaminated by noise. Furthermore, some points in the training data set are misplaced far away from the main body, or even lie on the wrong side in feature space. One of the main drawbacks of the standard SVM is that its training process is sensitive to outliers or noise in the training dataset due to overfitting. When outliers or noise exist, as in many real-world classification problems, a training data point may not exactly belong to either of the two classes: a data point near the decision boundary may belong to one of the classes, or it may be a noisy point. Yet these kinds of uncertain points may be more important than others for making a decision, which leads to the problem of overfitting. Fuzzy approaches are effective in solving uncertain problems, as they reduce the sensitivity to less important data [7]. Such an approach assigns a fuzzy membership value as a weight to each training data point and uses this weight to control the importance of the corresponding data point. Many fuzzy approaches have been developed and proposed in the literature to reduce the effect of outliers. A similarity measure function to compute fuzzy memberships was introduced in [8]; however, it had to assume that outliers are somewhat separate from the normal data. In [9], the effect of the trade-off parameter C on the conventional two-class SVM model was studied, and a triangular membership function was introduced to assign higher grades to data points in regions containing data of both classes; however, this method could be applied only under certain assumptions. The above two problems are solved by the Fuzzy SVM.
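The weighting idea can be sketched as follows (a minimal illustration, not the exact Fuzzy SVM formulation: the data, the hyperplane, and the function name are assumptions): each training point i receives a membership s_i in (0, 1], and its misclassification penalty in the objective is scaled by s_i, so that low-membership points, suspected to be noise, are cheaper to misclassify.

```python
import numpy as np

def fuzzy_hinge_loss(w, b, X, y, s, C=1.0):
    """Soft-margin SVM objective with fuzzy membership weights s_i:
       0.5*||w||^2 + C * sum_i s_i * max(0, 1 - y_i*(w.x_i + b)).
       A point with small s_i contributes little even if misclassified."""
    margins = y * (X @ w + b)
    slack = np.maximum(0.0, 1.0 - margins)
    return 0.5 * (w @ w) + C * np.sum(s * slack)

X = np.array([[1.0, 1.0], [4.0, 4.0], [1.2, 3.9]])  # last point: a suspected outlier
y = np.array([-1.0, 1.0, -1.0])
w, b = np.array([0.5, 0.5]), -2.5                    # an illustrative hyperplane

full = fuzzy_hinge_loss(w, b, X, y, np.array([1.0, 1.0, 1.0]))
down = fuzzy_hinge_loss(w, b, X, y, np.array([1.0, 1.0, 0.1]))
print(full > down)  # down-weighting the outlier lowers its penalty
```

Training then minimises this weighted objective instead of the standard one, which is how the membership value controls the importance of each data point.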

The method proposed in [10] is based on the supposition that outliers in the training vector set are less trustworthy, and hence less significant than other training vectors. As outliers are detected based solely on their relative distance from their class mean, this method may be expected to produce good results if the distribution of the training vectors xi of each class is spherical with a central mean (in the space used to calculate the memberships). In general, however, this assumption may not hold, which motivates us to seek a more universally applicable method. Hence, computing fuzzy memberships is still a challenge. This problem can be solved by IFSVM.
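A common form of the class-centre membership function described above can be sketched as follows (the exact formula in [10] may differ; this linear variant, the data, and the function name are illustrative assumptions): membership decreases with a point's distance from its class mean, so the farthest point in each class receives the smallest weight.

```python
import numpy as np

def centre_based_memberships(X, y, delta=1e-6):
    """Membership s_i = 1 - d_i / (r_class + delta), where d_i is the
    distance of x_i from its class mean and r_class is the largest such
    distance in the class, so a far-out point (a likely outlier) gets a
    membership close to zero."""
    s = np.empty(len(X))
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        d = np.linalg.norm(X[idx] - X[idx].mean(axis=0), axis=1)
        s[idx] = 1.0 - d / (d.max() + delta)
    return s

X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 0.0],   # class 0; third point far from the mean
              [5.0, 5.0], [5.2, 5.0], [5.1, 5.4]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])
s = centre_based_memberships(X, y)
print(np.round(s, 3))
```

As the surrounding text notes, this works well only when each class is roughly spherical around its mean; a genuine boundary point of an elongated class is down-weighted just as aggressively as a true outlier.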

Generally, fuzzy approach based machine learning techniques face two main difficulties: how to set the fuzzy memberships and how to decrease the computational complexity. It has been found that the performance of the fuzzy SVM highly depends on the determination of the fuzzy memberships; therefore, in this paper we propose a new method to compute fuzzy memberships that uses one scheme for misclassified points only and another for all training data points. For calculating the membership values of misclassified points, an iterative method has been employed, where membership values are generated iteratively based on the positions of the training vectors relative to the SVM decision surface itself. For calculating the membership values of all training data points, a fuzzy clustering based technique has been adopted, where a clustering method is applied on the data to determine the clusters in mixed regions; points in these clusters are given a fuzzy membership value of 1, and the fuzzy memberships of other data points are determined by their closest cluster accordingly.
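The iterative scheme described above can be sketched as follows. This is a simplified sketch with a plain weighted linear SVM trained by sub-gradient descent; the paper's actual IFSVM, its membership formulas, and the clustering step are not reproduced here, and the data, update rule, and function names are illustrative assumptions. After each fit, misclassified points have their memberships shrunk according to their distance from the current decision surface, and the model is refit with the new weights.

```python
import numpy as np

def fit_weighted_linear_svm(X, y, s, C=1.0, lr=0.01, iters=1000):
    """Sub-gradient descent on 0.5*||w||^2 + C * sum_i s_i * hinge_i."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        v = margins < 1
        grad_w = w - C * ((s[v] * y[v])[:, None] * X[v]).sum(axis=0)
        grad_b = -C * (s[v] * y[v]).sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b

def iterative_fuzzy_svm(X, y, rounds=3):
    s = np.ones(len(X))                      # start with full memberships
    for _ in range(rounds):
        w, b = fit_weighted_linear_svm(X, y, s)
        f = X @ w + b
        wrong = y * f < 0                    # misclassified points
        # The farther a point sits on the wrong side of the decision
        # surface, the smaller its membership in the next round.
        s[wrong] = 1.0 / (1.0 + np.abs(f[wrong]))
    return w, b, s

X = np.array([[1.0, 1.0], [1.4, 1.1], [4.0, 4.0], [4.3, 3.9],
              [3.9, 4.2], [1.2, 1.3], [1.1, 1.2]])  # last point: label noise
y = np.array([-1, -1, 1, 1, 1, -1, 1])
w, b, s = iterative_fuzzy_svm(X, y)
print(np.round(s, 3))
```

The noisy point embedded in the opposite cluster keeps landing on the wrong side of the surface, so its membership is driven down while the clean points retain full weight, which is precisely the outlier-suppression behaviour the iterative method aims for.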