Bayesian Spam Classifier

Module by: Ali Mousavi, Ali Ayremlou.

Summary: In this module we introduce the Bayesian spam classifier.

Introduction

In today's world, a considerable portion of people around the world use email as a quick and handy means of communication. At the same time, manufacturers continuously seek new ways to advertise their products in order to maximize profit, and sending emails is an appealing channel for such advertising since it is cheap, quick and effective. For internet users, however, this creates a problem: they must spend a great deal of time and energy separating the emails that matter to them from the advertising emails among the large number of messages they receive each day. An algorithm that separates useless emails (such as advertising, which we call spam) from those containing important information (which we call ham) is therefore a natural way to make the email system more efficient.

There are two general approaches to spam classification. First, we can set general rules and check each email against them; if an email violates the rules, we classify it as spam. The problem with this method is that spammers keep coming up with new techniques to bypass the rules, so we must update the rules continuously, which is frustrating. Second, we can use a machine learning approach. In this approach, we give the algorithm a training set of pre-classified emails, and the algorithm uses this set to learn how to classify new emails as either spam or ham. This method works better than fixed rules, since we can always update the training set with new spam emails, even ones that use new tricks.

Generally speaking, as in any other machine learning task, we should first collect enough spam and ham emails as training examples, from which the algorithm will learn the main characteristics of spam emails. Next, we select the features of the emails: the properties that differ between spam and ham. This step is crucial, since we may need some feature reduction to lower the dimensionality of the data and make the computation less expensive.

The Algorithm

If we compare spam and non-spam emails, we find that particular words occur with different frequencies in the two types. In other words, there are certain words, e.g. business-related words, that occur more often in spam emails than in ham. The approach is to first provide the algorithm with a set of training examples containing both spam and ham emails, manually labeled as one or the other. The algorithm then finds, for each word, the probability of its occurring in spam and in ham emails in that training set.

As a result, when we have a new email to classify as either spam or ham, we first extract its most interesting components (rather than very common words like "and", "or", "the", etc.). We then look up the probabilities of those components, using the values found during training. For example, suppose the word "price" occurs in spam emails with probability 0.8, the word "ten" with probability 0.5, and the word "dollar" with probability 0.9; the probability that the sentence "price is ten dollars" is spam is then computed by combining the contributions of its different words, as described in the following.

As its name suggests, the main idea of the naive Bayesian spam classifier, which is the algorithm used in this article, is Bayes' theorem, which is used widely in probability theory.

As such, suppose that, given a specific word, we would like to find the probability that a message containing this word is spam. From Bayes' rule, we may write:

p(\mathrm{spam} \mid \mathrm{word}) = \frac{p(\mathrm{word} \mid \mathrm{spam}) \, p(\mathrm{spam})}{p(\mathrm{word} \mid \mathrm{spam}) \, p(\mathrm{spam}) + p(\mathrm{word} \mid \mathrm{ham}) \, p(\mathrm{ham})}

In this equation, p(spam|word) is the probability that a message containing the specific word is spam. p(spam) is the overall probability that a message is spam; it can be estimated as the fraction of spam emails in our data set, or set using prior knowledge as described at the end of this paragraph. Similarly, p(word|spam) is the probability that the specific word occurs in a spam message; we estimate it from the training set by checking the frequency with which the word occurs in the spam emails. Likewise, p(word|ham) is the probability that the specific word occurs in a ham message, again estimated from the frequency with which the word occurs in the ham emails of the training set. Finally, p(ham) is the overall probability that a message is ham, estimated as the fraction of ham emails in the training set. Alternatively, we can use prior real-world information about email to decide what fraction of all email is spam and what fraction is ham.
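As a rough MATLAB sketch (with hypothetical variable names, not taken from the authors' code), these quantities can be estimated from word counts in a labeled training set:

% spamCounts(i) and hamCounts(i) hold the number of spam and ham training
% emails containing word i; nSpam and nHam are the total email counts.
pWordGivenSpam = spamCounts / nSpam;        % p(word|spam) for every word
pWordGivenHam  = hamCounts / nHam;          % p(word|ham) for every word
pSpam = nSpam / (nSpam + nHam);             % p(spam): fraction of spam emails
pHam  = 1 - pSpam;                          % p(ham)
% Bayes rule: probability that a message containing word i is spam
pSpamGivenWord = (pWordGivenSpam * pSpam) ./ ...
    (pWordGivenSpam * pSpam + pWordGivenHam * pHam);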

As for the other part of the algorithm's name, "naive", the reason for this word is that the algorithm disregards correlations between different words in the message; i.e., we assume that the words of the message are conditionally independent given its class. This is a strong assumption, and mathematically it translates to the following:

p(\mathrm{word}_1, \mathrm{word}_2, \ldots, \mathrm{word}_n \mid \mathrm{spam}) = \prod_{i=1}^{n} p(\mathrm{word}_i \mid \mathrm{spam})

Under this assumption, it can be shown that the overall probability that a message is spam is:

p = \frac{p_1 p_2 \cdots p_n}{p_1 p_2 \cdots p_n + (1 - p_1)(1 - p_2) \cdots (1 - p_n)}

In this formula, p is the probability that the message is spam, and p_i is the probability that a message containing word number i is spam (the spamicity of word i).
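For instance, with the example spamicities from the previous section (0.8 for "price", 0.5 for "ten" and 0.9 for "dollar", ignoring the neutral word "is"), the sentence "price is ten dollars" gets

p = \frac{0.8 \times 0.5 \times 0.9}{0.8 \times 0.5 \times 0.9 + 0.2 \times 0.5 \times 0.1} = \frac{0.36}{0.37} \approx 0.973,

so it would almost certainly be classified as spam.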

After this step, we compare the value of p with a certain threshold and classify the message as spam if p exceeds the threshold.

Since the products of many small probabilities can cause numerical underflow, we rewrite the above formula in logarithmic form. We may write:

\frac{1}{p} - 1 = \prod_{i=1}^{n} \frac{1 - p_i}{p_i}

As a result we have:

\ln\left(\frac{1}{p} - 1\right) = \sum_{i=1}^{n} \left[\ln(1 - p_i) - \ln p_i\right]

If

\eta = \sum_{i=1}^{n} \left[\ln(1 - p_i) - \ln p_i\right],

then we may write:

p = \frac{1}{1 + e^{\eta}}
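As a minimal MATLAB sketch of this computation (continuing the "price is ten dollars" example with its hypothetical spamicities, not the authors' actual code):

pw  = [0.8 0.5 0.9];                  % spamicities p_i of the message words
eta = sum(log(1 - pw) - log(pw));     % eta = sum_i [ln(1 - p_i) - ln(p_i)]
p   = 1 / (1 + exp(eta));             % p = 1 / (1 + e^eta), about 0.973 here

Working in log space keeps the computation stable even when the message contains hundreds of words.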

Another issue is dealing with rare words. These are the words that produce a 0/0 term in the product used to compute the spam probability: since they do not appear in our training set, we have no information about their occurrence in spam and ham emails. We may therefore simply discard such words the first time we see them. In addition, there are neutral words like "a", "an", "the" and "some" that are very common in both spam and non-spam emails; these are the words whose spamicity is near 0.5. For simplicity, and to protect the computation of p from underflow, we may first eliminate these common words from the message being classified.
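A minimal sketch of this filtering step (the variable names and the NaN convention for unseen words are hypothetical):

pw = pw(~isnan(pw));              % discard rare words never seen in training
pw = pw(abs(pw - 0.5) > 0.1);     % discard neutral words with spamicity near 0.5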

The main advantage of the naive Bayesian classifier is that it is trained on each user's own emails. Since different users receive different kinds of legitimate email, the training process differs from user to user. In other words, certain words may occur frequently in the legitimate emails of one user but mostly in the spam of another. The spamicities learned during training therefore differ between users, and as a result so do the classifications.

There are also some disadvantages to this type of email classification. If a spammer is aware of the algorithm, they may be able to bypass the Bayesian filter.

First, spammers can send a large number of legitimate emails to corrupt the training process. That is, they send many legitimate emails that also contain some words common in spam. This shifts the learned probabilities of those words in our database, so the next time we receive a spam email containing them, we may classify it as legitimate because their probabilities have changed.

Another spammer technique is producing rare words. As mentioned previously, rare words that we have not seen before (or have seen only rarely) add a 0/0 term to the computation of the spam probability. A spammer can thus make a pseudo-word out of a common spam word like "price" by slightly changing its letters, for example "pricee" or "priice". Our classifier will ignore these as rare words, and as a result the computed spam probability goes down.

Embedding text in pictures is another technique that works well for spammers. Here, spammers do not write the common spam words in the email text; instead, they include images that contain those words. Most classifiers cannot detect the spam words inside images, and consequently classify such spam emails as legitimate. The way to counter this technique is optical character recognition (OCR), which automatically detects the characters in the pictures so that we can find the common spam words.

In the following, we state the algorithmic issues more formally.

In the context of machine learning, the problem of classifying emails as spam or ham translates to the classification of a set of objects (here, emails).

For each object (email), we have a feature vector. For emails, the components of the feature vector correspond to words, chosen according to our training set: we find the most frequently occurring words in the training set (say 10,000-50,000 words) and form a vector of that size, e.g. 50,000. For each new email we would like to classify as either spam or ham, we first form its feature vector, putting a 1 in each position whose word is contained in the message and a 0 in each position whose word is not.
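A minimal MATLAB sketch of this construction (assuming vocabList and words are cell arrays of strings; not the authors' exact code):

x = zeros(numel(vocabList), 1);              % one entry per vocabulary word
for i = 1:numel(words)
    idx = find(strcmp(vocabList, words{i})); % position of this word, if any
    if ~isempty(idx)
        x(idx) = 1;                          % mark the word as present
    end
end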

We may use several tricks to lower the algorithm's error.

We should use as much data as we can; with more data, the training process improves and the error decreases.

We may also think of using features extracted from the email routing information.

Another useful trick is to look at the message body for features that make the algorithm more efficient. For example, there is no essential difference between "deal" and "Dealer": we can treat words belonging to the same family as a single word. To do so, we first convert the message into a new message in which all words of a family appear as one unique word; instead of "Dealer", "dealing" and "deal", we use only "deal", the root of the family.

Against tricks like the deliberate misspellings used by spammers, which we mentioned earlier, we can develop algorithms that detect and correct these misspellings first, so that we do not face rare words.

In the following we give a brief description of the functions used in our MATLAB code. You can either download the zip archive of this module to find the MATLAB code, or find it at:

https://docs.google.com/open?id=0B6hylUwdsGFxNDYxN2YyOTgtODg3Ny00OTQ5LWEyY2UtOGIzNWI0YmVkMWRi

https://docs.google.com/open?id=0B-L3jxfdTyjFNzVjNWQzZWUtZDE3Mi00NmU1LThlNTYtYmQyMjE1ZjJmNTNm

For a better understanding of the process, we have designed a graphical user interface (GUI). Through the GUI, users can provide the program with their own email and check whether the given email is classified as spam or as a legitimate email by our code. The following figure shows our GUI:

Figure 8: The graphical user interface of the spam classifier (graphics8.png)

Our MATLAB code functions are as follows:

vocabList = getVocabList(n)

This function loads the pre-defined vocabulary list; n indicates the length of the list. This length-n vector is our general feature vector where, as previously mentioned, the features in spam classification are different words. We took this vocabulary list from the computer assignments of the Stanford online machine learning course.

stem = porterStemmer(inString)

This function converts words that have the same stem and belong to the same family (e.g. dealing, dealer, deal) into one word, the stem of the family. The algorithm is the Porter stemming algorithm, presented in reference [4]. The original code, written in the C language, can be found here: http://www.tartarus.org/~martin/PorterStemmer/c.txt
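For example (a usage sketch; the exact output follows the Porter algorithm):

stem = porterStemmer('dealing');   % expected to return 'deal'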

[vocabList vocabHist] = processEmail(email_contents,vocabList,vocabHist)

This function converts each email into a vector of features (defined by the vocabulary list mentioned previously) and removes unnecessary characters such as punctuation marks. In addition, it returns the distribution over the vocabulary words of the email we would like to classify. The function can also update the vocabulary list when the algorithm encounters a new word that is not yet in the list.
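The preprocessing inside such a function might look roughly like this simplified sketch (an illustration of the general steps, not the authors' exact implementation):

contents = lower(email_contents);                  % normalize to lower case
contents = regexprep(contents, '[^a-z\s]', ' ');   % strip punctuation and digits
words = strsplit(strtrim(contents));               % split into individual words
for i = 1:numel(words)
    words{i} = porterStemmer(words{i});            % reduce each word to its stem
end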

[eta I] = indicateEmail(email_contents,vocabList,vocabScore)

This function indicates whether the given email is spam or legitimate, based on the vocabulary list and the spamicities of the corresponding words of the message.

For the training process of the algorithm, we have used a data set containing both spam and ham emails from http://csmining.org/index.php/ling-spam-datasets.html

Error Analysis

For tasks like email classification, where the numbers of spam and non-spam emails differ substantially, we have skewed classes. The metrics by which we evaluate classification algorithms on skewed classes are the following:

True Positive: emails which are spam and we have classified them correctly.

True Negative: emails which are not spam and we have classified them correctly.

False Positive: emails which are not spam and we have classified them as spam incorrectly.

False Negative: emails which are spam and we have classified them as non-spam incorrectly.

Precision: of all the emails we have classified as spam (y=1), what fraction are really spam?

Recall: of all the emails which are really spam, what fraction has the algorithm classified correctly?

Here are the formulas for the above metrics:

\mathrm{Precision} = \frac{TP}{TP + FP}

\mathrm{Recall} = \frac{TP}{TP + FN}

The metric by which we compare the performance of different algorithms is called the F1 score, and is given by the following formula:

F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
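Given the counts defined above, these metrics are straightforward to compute in MATLAB (tp, fp and fn are hypothetical variable names for the three counts):

precision = tp / (tp + fp);      % fraction of flagged emails that are truly spam
recall    = tp / (tp + fn);      % fraction of spam emails that are caught
F1 = 2 * precision * recall / (precision + recall);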

Simulation Results

We have a large data set which we use both for training and for testing the algorithm. Depending on the fraction of the data set dedicated to the training set, we may see different performance from the algorithm. The following figure shows the precision of the algorithm versus the percentage of data dedicated to the training set.

Figure 12: Precision of the algorithm versus the percentage of the data set dedicated to training (graphics12.png)

As expected, training the algorithm with a larger training set provides a better result and higher precision.

The next figure corresponds to the second metric we mentioned earlier: recall.

We have plotted the recall percentage versus the portion of the data set dedicated to the training set. As expected, the more data we dedicate to the training set, the better the algorithm performs, and hence the higher the recall.

Figure 13: Recall of the algorithm versus the percentage of the data set dedicated to training (graphics13.png)

References:

[1] W. A. Awad and S. M. Elseuofi, “Machine Learning Methods For Spam Email Classification”, Int. J. of Computer Science and Information Technology (IJCSIT), Vol. 3, No. 1, Feb 2011.

[2] “Bayesian Spam Filtering”, available at http://en.wikipedia.org/wiki/Bayesian_spam_filtering

[3] Stanford Machine Learning course handouts, available at http://cs229.stanford.edu

[4] M. F. Porter, “An algorithm for suffix stripping”, Program, Vol. 14, No. 3, pp. 130-137, 1980.
