Spam affects our lives all over the internet. Sometimes you don’t even realize that what is in front of you is spam. So, what is spam, actually? It is the use of electronic messaging systems to send unsolicited messages, especially advertising, as well as sending the same message repeatedly on the same site.
Nobody likes spam, and the market for anti-spam software is huge: it is counted in the billions and is expected to keep growing. Spam is not black or white, and different systems draw the line in very different places. For example, a message about the characteristics of a brand new soccer ball can be spam in personal messages, but in a Facebook soccer group it might be perfectly appropriate information. That’s one of the reasons why anti-spam software should be customized for specific needs.
Spam in numbers
Spam accounts for 14.5 billion messages globally per day (on average 2 spam messages per person per day), which is around 45% of all emails. The United States is the number one generator of spam email, with Korea clocking in as the second-largest contributor of unwanted email. According to a study by the Radicati Research Group Inc., a research firm based in Palo Alto, California, spam costs businesses $20.5 billion annually in decreased productivity and technical expenses. But spam does not only affect our mailboxes. Nowadays, spam is getting much more “popular” on social networks. Typical social spam includes tricking users into liking or sharing content (like-jacking) and promoting malware hosted on a third-party site. The trickiest issue with social spam is that the message generally comes from a user’s actual friends and can be personalized. Facebook states that “less than” 4% of all posts are spam, and Twitter notes that 1.5% of all Tweets are spam.
There still isn’t a good, universal system that detects spam and saves you the time spent fighting it. So we decided to take on this challenge.
KindGeek’s spam detection software is aimed at eliminating spam in our clients’ products, which are social networks. For the first stage of product development, KindGeek chose the Naive Bayes classifier, which has proven to be quite effective. The classifier is a machine learning tool based on statistics. In simple terms, a message is converted to a bag of words, and the probability of the message being spam is calculated from how often those words appeared in previous messages that have already been categorized.
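To make the bag-of-words idea concrete, here is a minimal sketch of a Naive Bayes spam filter in plain Python. It is an illustration only, not KindGeek’s actual implementation: the class name NaiveBayesSpamFilter, the "spam"/"ham" labels, the tokenizer and the add-one (Laplace) smoothing are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word tokens (the bag of words)."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesSpamFilter:
    """Hypothetical two-class Naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}
        self.vocabulary = set()

    def train(self, text, label):
        """Record the words of one already categorized message."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.message_counts[label] += 1
        self.vocabulary.update(tokens)

    def _log_score(self, tokens, label):
        # log P(label) + sum of log P(word | label).
        # Assumes each label has seen at least one training message.
        total_messages = sum(self.message_counts.values())
        score = math.log(self.message_counts[label] / total_messages)
        total_words = sum(self.word_counts[label].values())
        vocab_size = len(self.vocabulary)
        for token in tokens:
            count = self.word_counts[label][token]
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def spam_probability(self, text):
        """Return P(spam | message) under the naive independence assumption."""
        tokens = tokenize(text)
        log_spam = self._log_score(tokens, "spam")
        log_ham = self._log_score(tokens, "ham")
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_spam, log_ham)
        spam = math.exp(log_spam - top)
        ham = math.exp(log_ham - top)
        return spam / (spam + ham)
```

After calling train() on a handful of labeled messages, spam_probability() returns a value between 0 and 1 for a new message. Where to put the cut-off (for example 0.5 or 0.9) is a product decision that depends on how costly a false positive is for the particular network.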
Despite its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods in some cases. However, this technique works well only if it has already been trained on a good dataset and only when a message doesn’t contain spelling mistakes. YouTube comments served as a good dataset, and we used Elasticsearch to deal with spelling mistakes.
To fill some gaps at this stage we also use natural language processing (NLP). This is an area related to artificial intelligence and computational linguistics that is concerned with the interactions between computers and human languages.
We believe that the future of anti-spam in social networks lies in targeting not spam but spammers. Users interact with each other, type messages that are connected with each other, and have many other properties that can help determine whether they are spammers or not. Also, users get annoyed when a legitimate message is wrongly interpreted as spam, and such mistakes lower the quality of a social network and the satisfaction of its clientele.
One of the techniques for detecting spammers is a graph-based classifier. The classifier takes into account not only individual actions but also the overall behavior of users. It constructs a huge graph of users’ interconnectivity and personal properties. The spam detection system will also include a review of the URLs and numbers used in messages.
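As a rough illustration of the graph idea, the sketch below builds a directed graph of who messages whom and derives a few per-user features from it. Everything here is a hypothetical simplification for this article: the InteractionGraph class, the three features (number of contacts, share of contacts who wrote back, share of messages containing a URL) and the threshold values are assumptions, and a hand-written rule stands in for what would be a trained classifier in a real system.

```python
import re
from collections import defaultdict

# Simple pattern for spotting links; real URL review would be stricter.
URL_PATTERN = re.compile(r"https?://\S+")


class InteractionGraph:
    """Hypothetical directed graph of user-to-user messages."""

    def __init__(self):
        # sender -> recipient -> number of messages sent
        self.edges = defaultdict(lambda: defaultdict(int))
        self.messages_sent = defaultdict(int)
        self.messages_with_url = defaultdict(int)

    def add_message(self, sender, recipient, text):
        """Add one message as an edge and update the sender's counters."""
        self.edges[sender][recipient] += 1
        self.messages_sent[sender] += 1
        if URL_PATTERN.search(text):
            self.messages_with_url[sender] += 1

    def features(self, user):
        """Compute behavior features for one user from the graph."""
        recipients = self.edges.get(user, {})
        out_degree = len(recipients)
        # Count the contacts who have ever written back to this user.
        replied = sum(1 for r in recipients if user in self.edges.get(r, {}))
        sent = self.messages_sent.get(user, 0)
        with_url = self.messages_with_url.get(user, 0)
        return {
            "out_degree": out_degree,
            "reciprocity": replied / out_degree if out_degree else 0.0,
            "url_ratio": with_url / sent if sent else 0.0,
        }

    def looks_like_spammer(self, user, min_contacts=3,
                           max_reciprocity=0.2, min_url_ratio=0.5):
        """Heuristic rule with made-up thresholds, not a trained model."""
        f = self.features(user)
        return (f["out_degree"] >= min_contacts
                and f["reciprocity"] <= max_reciprocity
                and f["url_ratio"] >= min_url_ratio)
```

The point of the design is that the decision is made per user rather than per message: an account that writes to many people, rarely gets an answer and mostly posts links stands out in the graph even when each individual message looks harmless.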