If we compare spam emails with non-spam (ham) emails, we find that particular words occur with different frequencies in the two types. In other words, there are certain words, e.g., business-related words, which occur more often in spam emails than in ham emails. The approach is to first provide the algorithm with a set of training examples containing both spam and ham emails, each manually labeled as spam or ham. The algorithm then computes, for each word, the probability of that word occurring in the spam and ham emails of the training set.

As a result, when we have a new email that we would like to classify as either spam or ham, we first pick out the most interesting components of that email (rather than very common words like "and", "or", "the", etc.). Afterwards, we look up the probabilities of those components or words using the values found during training. As an example, suppose that the word "price" occurs in spam emails with probability 0.8, the word "ten" occurs in spam emails with probability 0.5, and the word "dollar" occurs in spam emails with probability 0.9; the probability that the sentence "price is ten dollars" is spam would then be computed from the contributions of its different words, as described in the following.

As its name suggests, the main idea of the Naïve Bayesian spam classifier, which is the algorithm we have used in this article, is Bayes' theorem, which is used widely in probability theory.

As such, assume that, given a specific word, we would like to find the probability that a message containing this word is spam. From Bayes' rule, we may write:

$$p(\text{spam}\mid\text{word}) = \frac{p(\text{word}\mid\text{spam})\,p(\text{spam})}{p(\text{word}\mid\text{spam})\,p(\text{spam}) + p(\text{word}\mid\text{ham})\,p(\text{ham})}$$
In this equation, *p(spam|word)* is the probability that a message containing the specific word is spam. *p(spam)* is the prior probability that a message is spam; it can be set to the fraction of spam emails in our data set, or chosen as described later in this paragraph. *p(word|spam)* is the probability that the specific word occurs in spam messages; we estimate this value from our training set by checking the frequency with which the word occurs in the spam emails. *p(word|ham)* is the probability that the specific word occurs in ham messages, estimated in the same way from the ham emails of the training set. Finally, *p(ham)* is the prior probability that a message is ham, which can be set to the fraction of ham emails in our training set. Alternatively, we may use prior information about real-world email to decide what fraction of emails is spam and what fraction is ham in general.
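As an illustration of estimating these terms from labeled data, here is a small sketch in Python (the article's own code is MATLAB). The four-email training set, the helper names, and the equal priors are made-up assumptions for the example:

```python
# Hypothetical toy training set: each email is a list of words plus a label.
# These counts are illustrative assumptions, not the article's data set.
training = [
    (["price", "dollar", "offer"], "spam"),
    (["price", "meeting", "ten"],  "spam"),
    (["meeting", "schedule"],      "ham"),
    (["ten", "dollar", "lunch"],   "ham"),
]

def word_given_class(word, label):
    """p(word|class): fraction of emails of the given class containing the word."""
    emails = [words for words, lab in training if lab == label]
    return sum(word in words for words in emails) / len(emails)

def spam_given_word(word, p_spam=0.5):
    """Bayes rule: p(spam|word) from p(word|spam), p(word|ham) and the priors."""
    p_ham = 1.0 - p_spam
    num = word_given_class(word, "spam") * p_spam
    den = num + word_given_class(word, "ham") * p_ham
    return num / den

print(spam_given_word("price"))   # "price" appears only in the spam emails
print(spam_given_word("dollar"))  # "dollar" appears equally in both classes
```

With equal priors, a word seen only in spam gets spamicity 1.0, and a word seen equally often in both classes gets 0.5, matching the intuition behind the formula.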

Regarding the other part of the algorithm's name, "Naïve": the reason for this word is that the algorithm disregards the correlation between different words in the message, i.e., we assume that the words of the message are conditionally independent given the class of the message (spam or ham). This is a strong assumption, and mathematically it translates to the following:

$$p(w_1, w_2, \ldots, w_n \mid \text{spam}) = \prod_{i=1}^{n} p(w_i \mid \text{spam})$$
Considering this, it can be shown that the overall probability of being spam for a message is:

$$p = \frac{\prod_{i=1}^{n} p_{i}}{\prod_{i=1}^{n} p_{i} + \prod_{i=1}^{n}\left(1 - p_{i}\right)}$$
In this formula, *p* is the overall probability that the message is spam, and *p_{i}* is the probability that a message containing word number *i* is spam.

After this step, we compare the value of *p* with a certain threshold and classify the message as spam if *p* exceeds that threshold.
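A minimal sketch of this combining-and-thresholding step, in Python for illustration (the article's code is MATLAB), using the spamicities 0.8, 0.5, and 0.9 from the earlier "price is ten dollars" example; the 0.5 threshold is an assumed default:

```python
import math

def combine(p_list):
    """Overall spam probability p = prod(p_i) / (prod(p_i) + prod(1 - p_i))."""
    num = math.prod(p_list)
    den = num + math.prod(1.0 - p for p in p_list)
    return num / den

def classify(p_list, threshold=0.5):
    """Classify as spam when the combined probability exceeds the threshold."""
    return "spam" if combine(p_list) > threshold else "ham"

# Spamicities of "price", "ten", "dollar" from the article's example.
p = combine([0.8, 0.5, 0.9])
print(round(p, 4), classify([0.8, 0.5, 0.9]))
```

Here the products are 0.8·0.5·0.9 = 0.36 and 0.2·0.5·0.1 = 0.01, so p = 0.36/0.37 ≈ 0.973, well above the threshold, and the sentence is classified as spam.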

Since the value of *p* could be very small, we face the problem of underflow; we therefore manipulate the above formula to work in the logarithmic domain. We may write:

$$\frac{1-p}{p} = \frac{\prod_{i=1}^{n}\left(1-p_{i}\right)}{\prod_{i=1}^{n} p_{i}}$$
As a result we have:

$$\ln\frac{1-p}{p} = \sum_{i=1}^{n}\left[\ln\left(1-p_{i}\right) - \ln p_{i}\right]$$
If

$$\eta = \sum_{i=1}^{n}\left[\ln\left(1-p_{i}\right) - \ln p_{i}\right]$$
then we may write:

$$p = \frac{1}{1 + e^{\eta}}$$
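A sketch of this logarithmic computation, in Python for illustration. Assuming the combining formula p = ∏p_i / (∏p_i + ∏(1−p_i)) from above, summing logarithms avoids underflowing the long products; on the earlier spamicities 0.8, 0.5, 0.9 it agrees with the direct product formula:

```python
import math

def spam_probability_log(p_list):
    """Numerically stable form: eta = sum(ln(1 - p_i) - ln(p_i)),
    then p = 1 / (1 + exp(eta)).  Sums of logs replace long products."""
    eta = sum(math.log(1.0 - p) - math.log(p) for p in p_list)
    return 1.0 / (1.0 + math.exp(eta))

# Same example as before: "price", "ten", "dollar".
print(spam_probability_log([0.8, 0.5, 0.9]))
```

For a message with many words, the direct products can underflow to 0.0 in floating point while the sum of logarithms stays well-scaled, which is exactly the motivation given above.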
Another issue is dealing with rare words. These are the words which cause the term 0/0 in the product used to compute the probability of spamicity: since they do not appear in our training set, there is no information about their occurrence in spam and ham emails. Therefore, we may discard rare words when seeing them for the first time. In addition, there are neutral words like "a", "an", "the", "some", etc., which are very common in both spam and non-spam emails; that is, words whose spamicity is near 0.5. For the sake of simplicity, and to keep the computation of *p* from underflowing, we may first eliminate these common words from the message we are going to classify.
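A sketch of this filtering step in Python (for illustration): the spamicity table and the neutrality band of 0.1 around 0.5 are assumptions for the example, not values from the article:

```python
# Assumed per-word spamicities; "priice" is deliberately absent (a rare word).
spamicity = {"price": 0.8, "ten": 0.5, "dollar": 0.9, "the": 0.51}

def interesting(words, band=0.1):
    """Keep only words with known, non-neutral spamicity."""
    kept = []
    for w in words:
        if w not in spamicity:        # rare word: no training information, discard
            continue
        if abs(spamicity[w] - 0.5) <= band:  # neutral word: little evidence, discard
            continue
        kept.append(w)
    return kept

print(interesting(["price", "the", "ten", "priice", "dollar"]))
```

Only "price" and "dollar" survive: "priice" is unseen, while "the" and "ten" sit inside the neutral band around 0.5.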

The main advantage of the Naïve Bayesian classifier is that it is trained on each user's own emails. Since different users receive different types of legitimate email, the training process differs from user to user. In other words, certain words may occur more often in the legitimate emails of one user, while for another user the same words may occur more often in spam emails. Therefore, the spamicities of words computed during training differ between users, and as a result so do the resulting classifications.

However, this type of email classification also has disadvantages. If spammers are aware of the algorithm, they may be able to bypass the Bayesian filter.

First, spammers can send a large number of legitimate emails to distort the training process. That is, they send many legitimate emails that also contain some words common in spam. This shifts the probabilities of those common spam words in our database, so the next time we receive a spam email containing those words, we might classify it as legitimate because their probabilities have changed.

Another spammer technique is producing rare words. As mentioned previously, rare words which we have not seen before, or have seen only rarely, add the term 0/0 to the computation of the probability of spamicity. A spammer can therefore make a pseudo-word from a common spam word like "price" by slightly changing its letters: instead of "price", one can use "pricee" or "priice". Our classifier will ignore these as rare words, and as a result the computed probability of spamicity decreases.

Using pictures is another technique which works well for spammers. In this technique, spammers do not write the common spam words in the email; instead, they put pictures containing those words in the email. Most classifiers cannot detect the spam words inside the picture, and as a result they classify such spam emails as legitimate. The way to counter this technique is to use optical character recognition (OCR), which automatically detects the characters in the pictures so that we are able to find those common spam words.

In the following, we discuss algorithm-related issues more formally.

In the context of machine learning, the problem of classifying emails as spam or ham translates to the classification of a set of objects (here, emails).

For each object (email), we have a feature vector. For emails, the components of the feature vector correspond to words. We choose these words according to our training set: we find the most frequently occurring words in the training set (10,000-50,000 words) and form a vector of that size, say 50,000. For each new email that we would like to classify as spam or ham, we first form its feature vector by putting a 1 for each vocabulary word contained in the message and a 0 for each word not contained in it.
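A sketch of forming this binary feature vector, in Python for illustration; the five-word vocabulary stands in for the 10,000-50,000-word list described above:

```python
# Toy vocabulary standing in for the large sorted word list.
vocab = ["deal", "dollar", "meeting", "price", "ten"]

def feature_vector(email_words, vocab_list):
    """1 where the vocabulary word occurs in the email, 0 otherwise."""
    present = set(email_words)
    return [1 if w in present else 0 for w in vocab_list]

print(feature_vector(["price", "is", "ten", "dollars"], vocab))
```

Note that "dollars" fails to match the vocabulary entry "dollar" because the words differ literally; this is exactly the kind of mismatch that the stemming step discussed later in the article is meant to fix.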

We may use some tricks to reduce the algorithm's error.

We should use as much data as we can; with more data, the training process is better and the error is lower.

We may also think of using features extracted from the email routing information.

Another useful trick is to look at the message body and choose features that make the algorithm more efficient. For example, there is no essential difference between "deal" and "Dealer"; in other words, we can treat words belonging to the same family as one word. To this end, we can first convert the message into a new message in which all words of the same family appear as one unique word. For example, instead of words like "Dealer", "dealing", and "deal", we may use only the single word "deal", which is the root of this family.
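As a rough illustration of the idea (the article's actual code uses the full Porter stemmer, described later), here is a crude suffix-stripping sketch in Python; it is only correct for simple regular forms like the examples above:

```python
# Crude stemming sketch: strip a few common suffixes, keeping at least a
# three-letter stem.  The real Porter algorithm has many more rules.
def crude_stem(word):
    w = word.lower()
    for suffix in ("ing", "er", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

print([crude_stem(w) for w in ["Dealer", "dealing", "deal", "deals"]])
```

All four family members map to the single root "deal", so they contribute to one feature instead of four.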

For spammer tricks like the misspellings mentioned earlier, we can develop algorithms that first detect these misspellings and correct them, so that we do not face rare words.
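One possible way to realize such a correction step (a sketch, not the article's method) is to map an unseen word to a vocabulary word within edit distance 1; a real system would use a proper spell-checker. Python for illustration:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, vocab, max_dist=1):
    """Replace a pseudo-word by its nearest vocabulary word, if close enough."""
    best = min(vocab, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_dist else word

vocab = {"price", "ten", "dollar"}
print(correct("priice", vocab), correct("pricee", vocab))
```

Both pseudo-words from the earlier example, "priice" and "pricee", are one edit away from "price" and so are restored before the spamicity lookup.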

In the following, we give a brief description of the functions used in our MATLAB code. You can either download the zip version of this module to get the MATLAB code, or you can find it at:

https://docs.google.com/open?id=0B6hylUwdsGFxNDYxN2YyOTgtODg3Ny00OTQ5LWEyY2UtOGIzNWI0YmVkMWRi

https://docs.google.com/open?id=0B-L3jxfdTyjFNzVjNWQzZWUtZDE3Mi00NmU1LThlNTYtYmQyMjE1ZjJmNTNm

For better understanding of the process, we have designed a graphical user interface (GUI). Through the GUI, users can provide the program with their own email and check whether it is classified as spam or legitimate by our code. The following figure shows our GUI:

Our MATLAB code functions are as follows:

vocabList = getVocabList(n)

This function loads the pre-defined vocabulary list; *n* indicates its length. This length-*n* vector is our general feature vector where, as previously mentioned, the features in spam classification are different words. We obtained this vocabulary list from the computer assignments of the Stanford online machine learning course.

stem = porterStemmer(inString)

This function converts words which have the same stem and belong to the same family (e.g., dealing, dealer, deal) to the one word which is the stem of the family. The algorithm is called the Porter stemming algorithm and is presented in reference [4]. The original code, written in the C language, can be found here: http://www.tartarus.org/~martin/PorterStemmer/c.txt

[vocabList vocabHist] = processEmail(email_contents,vocabList,vocabHist)

This function converts each email into a vector of features (over the vocabulary list returned by "getVocabList", mentioned previously) and removes unnecessary characters such as punctuation marks. In addition, it returns the distribution of the vocabulary words in the email we would like to classify. This function can also update the vocabulary list when the algorithm encounters a new word that is not yet in the list.

[eta I]= indicateEmail(email_contents,vocabList,vocabScore)

This function indicates whether the given email is spam or legitimate, based on the vocabulary list and the spamicities of the corresponding words of the message.

For training the algorithm, we have used a data set containing both spam and ham emails from http://csmining.org/index.php/ling-spam-datasets.html