Simple Spam Detector for Video

With fast internet connections, video is growing quickly as a multimedia entertainment platform. Users no longer only consume video; they are also becoming content creators. But not every user creates content with good intent. Some create content for spamming.

Definitions of spam vary widely, so I will give my own: spam is content meant to drive other users to visit the spammer's site or page.

I will share one simple way to detect spam using only:

  1. Python
  2. ffmpeg
  3. OpenCV
  4. Tesseract

First, let's create the Prober. This class detects the video duration and extracts sample frames from the video.

file: app/prober.py

https://gist.github.com/rustyworks/b47315dd2e3c7a8776324f7b247bf84f#file-prober-py
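
For readers who cannot open the gist, here is a minimal sketch of what a prober like this might do. It is not the code from the gist: the function names are hypothetical, and it assumes the `ffprobe` and `ffmpeg` binaries are on the PATH. The idea is to ask ffprobe for the duration, pick evenly spaced timestamps, and ask ffmpeg for one frame at each timestamp.

```python
import subprocess


def duration_command(video_path):
    """Build an ffprobe command that prints only the duration in seconds."""
    return [
        "ffprobe", "-v", "error",
        "-show_entries", "format=duration",
        "-of", "default=noprint_wrappers=1:nokey=1",
        video_path,
    ]


def sample_timestamps(duration, count):
    """Return `count` evenly spaced timestamps (seconds) inside the video."""
    step = duration / (count + 1)
    return [round(step * i, 3) for i in range(1, count + 1)]


def frame_command(video_path, timestamp, output_path):
    """Build an ffmpeg command that saves the single frame at `timestamp`."""
    return [
        "ffmpeg", "-y", "-v", "error",
        "-ss", str(timestamp),
        "-i", video_path,
        "-frames:v", "1",
        output_path,
    ]


def probe_duration(video_path):
    """Run ffprobe and parse its output as a float number of seconds."""
    result = subprocess.run(
        duration_command(video_path), stdout=subprocess.PIPE, check=True
    )
    return float(result.stdout.decode().strip())
```

Building the commands as lists, separately from running them, keeps the timestamp logic easy to check without a real video file.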

Then we create a helper function that converts a duration from seconds to HH:MM:SS format, and another helper that deletes the sample pictures generated by the prober.

file: app/utils.py

https://gist.github.com/rustyworks/df4786f6cb71b57fda9b7b36a86bd22f#file-utils-py
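
As an illustration only, the two helpers could look roughly like this. The names `to_hhmmss` and `delete_samples` are my own assumptions and may not match the gist; fractional seconds are simply truncated.

```python
import os


def to_hhmmss(seconds):
    """Convert a duration in seconds to a zero-padded HH:MM:SS string."""
    total = int(seconds)  # drop any fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, secs)


def delete_samples(paths):
    """Delete the given sample pictures and return how many were removed."""
    removed = 0
    for path in paths:
        try:
            os.remove(path)
            removed += 1
        except FileNotFoundError:
            # Already gone, nothing to clean up.
            pass
    return removed
```

For example, `to_hhmmss(3661)` gives `01:01:01`.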

We create the OCR file using pytesseract (a Python wrapper for Tesseract) to detect the characters in a picture. We also desaturate the picture first using OpenCV, because this can improve character detection.

file: app/ocr.py

https://gist.github.com/rustyworks/fc5d619c193da959feb637f62675a95c#file-ocr-py
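
A rough sketch of this step, assuming the `opencv-python` and `pytesseract` packages are installed along with the Tesseract binary, and assuming a pytesseract version that accepts NumPy arrays. The function name `extract_text` is hypothetical and the gist may be organized differently.

```python
def extract_text(image_path, lang="eng"):
    """Desaturate a picture with OpenCV and return the text Tesseract finds."""
    # Imported lazily so this module still loads where the OCR stack is missing.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        # cv2.imread returns None instead of raising when it cannot read a file.
        raise ValueError("could not read image: {}".format(image_path))

    # Desaturate: convert the BGR picture to a single grayscale channel.
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(gray, lang=lang)
```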

After that, we create the spam text detector. Before detecting, we strip all non-alphabetical words. Then we check whether the text contains banned words, a link, or a phone number.

file: app/checker.py

https://gist.github.com/rustyworks/9bb39391102a974e485b512c6fc5e5a2#file-checker-py
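
To make the idea concrete, here is a small sketch of such a checker. The banned word list, both regular expressions, and the function names are my own assumptions for illustration, not the contents of the gist. In this sketch, links and phone numbers are searched in the raw text, because stripping non-alphabetical characters first would also strip the digits; only the banned word check uses the cleaned words.

```python
import re

# Hypothetical banned words; a real list depends on the platform.
BANNED_WORDS = {"casino", "giveaway", "subscribe"}

# Heuristic: a URL scheme or "www." prefix, or a bare domain with a common TLD.
LINK_PATTERN = re.compile(
    r"(https?://|www\.)\S+|\b[\w-]+\.(com|net|org|id)\b", re.IGNORECASE
)

# Heuristic: at least nine digits, optionally separated by spaces or punctuation.
PHONE_PATTERN = re.compile(r"(?<!\d)\+?\d[\d\s().-]{7,}\d(?!\d)")


def clean_words(text):
    """Lowercase the text and keep only runs of alphabetical characters."""
    return re.findall(r"[a-z]+", text.lower())


def is_spam(text):
    """Return True if the text has a link, a phone number, or a banned word."""
    if LINK_PATTERN.search(text) or PHONE_PATTERN.search(text):
        return True
    return any(word in BANNED_WORDS for word in clean_words(text))
```

These patterns are deliberately loose, so they will produce false positives on things like dates or long numeric IDs. OCR output is noisy, so tuning them against real samples matters more than making them strict.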

Finally, we integrate everything in main.py:

https://gist.github.com/rustyworks/3667cd7429e95e355be577e36d2d90f0#file-main-py

Example usage:

env python3 main.py -f spam.mkv

Explanation:

  1. Get the file location and define how many frames we want to sample
  2. Generate the sample frames using the prober
  3. Desaturate each sample picture using OpenCV
  4. Detect all characters in the picture using Tesseract
  5. Check whether the parsed characters are spam or not
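
The five steps above can be sketched as one small pipeline. This is an assumption-laden outline rather than the actual main.py: only the `-f` flag comes from the example usage, while the `-n/--frames` flag and the function names are hypothetical. The frame sampler, the OCR reader, and the spam check are passed in as plain callables.

```python
import argparse


def parse_args(argv=None):
    """Step 1: read the file location and the number of frames to sample."""
    parser = argparse.ArgumentParser(description="Simple spam detector for video")
    parser.add_argument("-f", "--file", required=True, help="path to the video")
    # Hypothetical flag; the original example only shows -f.
    parser.add_argument("-n", "--frames", type=int, default=5,
                        help="number of sample frames")
    return parser.parse_args(argv)


def detect_spam(video_path, frame_count, sample_frames, read_text, is_spam):
    """Steps 2 to 5: sample frames, OCR each one, stop at the first spam hit.

    sample_frames(video_path, frame_count) returns a list of picture paths,
    read_text(picture_path) returns the recognized text (steps 3 and 4),
    is_spam(text) returns True or False (step 5).
    """
    for frame_path in sample_frames(video_path, frame_count):
        if is_spam(read_text(frame_path)):
            return True
    return False
```

Passing the three stages in as arguments keeps the control flow separate from ffmpeg, OpenCV, and Tesseract, so the flow can be exercised with simple stand-in functions.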

Created using Python 3.5.2 on Linux Mint 18.3 Sylvia.
