I'm writing a script that compares two bodies of text and reports how similar they are, content-wise, as a percentage.

What it does:

1. Retrieves a list of all words in each text (only one of each word, so even if "the" appears twice it is listed once).
2. Iterates through both lists and builds an array of the words that both bodies of text contain.
3. Computes shared word count / word count of the larger body of text * 100 to get the percentage of similarity.
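
For reference, the three steps above could be sketched in Python roughly as follows. This is only an illustration, not my actual script: the function names, the lowercasing, and the regex used to split words are assumptions, and I am assuming "word count" in step 3 means the unique word count of the larger text.

```python
import re


def unique_words(text):
    """Step 1: collect each distinct word once (lowercased; an assumption)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def similarity_percent(text_a, text_b):
    """Steps 2 and 3: shared words / larger unique word count * 100."""
    words_a = unique_words(text_a)
    words_b = unique_words(text_b)

    # Step 2: words that both bodies of text contain.
    shared = words_a & words_b

    # Step 3: divide by the larger of the two unique word counts.
    largest = max(len(words_a), len(words_b))
    if largest == 0:
        return 0.0  # both texts are empty; avoid dividing by zero
    return 100.0 * len(shared) / largest


print(similarity_percent("the cat sat on the mat", "the dog sat on the rug"))
# prints 60.0 (3 shared words out of 5 unique words in each text)
```

Dividing by the larger word count means a short post that is entirely contained in a long one still gets a low score, which may or may not be what you want when hunting for copied content.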

The goal is to locate possible botting scripts by breaking every post down to its rawest form: removing extra filler text and looking at the content as a gutted structure. It is also used to determine whether a text differs in substance from other bodies of text.

I'm posting this to get feedback on the logic used to produce the output, and to ask whether you can think of a better method of determining the percentage of content similarity.