Bad Kitty

Coding, Security — ryanfb @ 9:11 pm

There’s been some commotion about the KittenAuth CAPTCHA and how effective it is or isn’t. Last weekend, I decided to give the PCA-SIFT algorithm a whirl against it and see how it went. After having wget grab a sufficient number of images, I deleted all duplicates and put the manually verified cats in a folder and got keypoints for them. Now finding out what is and isn’t a cat is as simple as a Perl script:

$ time find . -iname '0.*' -exec ./findkitty.pl {} \;
Finding keypoints...
64 keypoints found.
Finding keypoints...
105 keypoints found.
./www.thepcspy.com/images/dynamic/kitty/2/0.390180844063955 is a cat
Finding keypoints...
124 keypoints found.
./www.thepcspy.com/images/dynamic/kitty/3/0.390180844063955 is a cat
Finding keypoints...
96 keypoints found.
./www.thepcspy.com/images/dynamic/kitty/4/0.390180844063955 is a cat
Finding keypoints...
209 keypoints found.
Finding keypoints...
118 keypoints found.
Finding keypoints...
209 keypoints found.
Finding keypoints...
173 keypoints found.
Finding keypoints...
122 keypoints found.

real    0m6.045s
user    0m5.173s
sys     0m0.319s`

I used PCA-SIFT because, while it’s slower than a simple file hash (which would defeat this in its current form), it can still accurately match images which have been through a variety of modifications (i.e. pretty much all of the modifications I have seen suggested elsewhere).

Demonstration of PCA-SIFT on a modified image

It would be pretty hard to use the idea for this captcha and have a sufficiently large enough database that it could defeat the method I’m using to get around it. An attacker could just have it cache all the unique images it sees and only have to have a human look at any given picture once to classify it, and if you use a large source of known images that you wouldn’t have to classify (such as GIS or kittenwar), it’s reasonable to believe that an attacker could use it as well.

8 Comments »

  1. You should be testing the still buggy php version.

    You’ve also got the problem that I’m going to have nigh on limitless pictures of kittens and I’m only going to show you 3 of them at a time… and you can only get it wrong 3 times from a single ip…

    Nice to have someone keeping you on your toes though =)

    Comment by Oli — April 14, 2006 @ 12:13 am
  2. Just spend a little more time reading around this and yeah it looks quite sound.

    But please try it out on images that have been rotated and zoomed and see if it’s showing the same level of detection. (An emaill response would be preferable)

    Comment by Oli — April 14, 2006 @ 12:38 am
  3. I’ve tried my method against your new PHP version with rotation, and
    it still works. While a lower percentage of keypoints are matched, it
    is still highly accurate for matching images which have been rotated
    or zoomed (i.e. there are not false positives).

    Here’s another visual example against a screenshot of the new version:

    PCA-SIFT matches against KittenAuth

    Comment by ryanfb — April 14, 2006 @ 1:26 pm
  4. Ryan,

    I’ve updated my own version, if you’re interested :) http://www.nextbbs.com/do%5Ftopic%5Fid%5F573%5Fpost%5F3934

    -C.

    Comment by Chris FR — April 15, 2006 @ 3:33 pm
  5. Sorry I don’t know what happened to the underscores in my link :(

    Comment by Chris FR — April 15, 2006 @ 3:34 pm
  6. No problem, I’ve changed it with escaped underscores, as the Markdown syntax interprets underscores as italics.

    Comment by ryanfb — April 15, 2006 @ 5:12 pm
  7. I’m currently assembling people for getting a good solid version of KittenAuth v2 out the door.

    I’ve already spoken to Chris and he’s optimistic, but I would love to have you onboard too, Ryan, as a consultory member (and programmer — if you want) so you can help make it as inconvenient as possible for bots to automate the process of getting around the kittenauth system.

    Of course its never going to be 100% but the image processing you’ve done does have a certain cpu overhead doesnt it? You’re only using 3 pictures to compare with and that’s taking roughly 2s per picture. What happens when we scale that up to 100 or 1000 possible pictures and a 4×3 grid? I’m assuming (and hoping) its going to take much longer and that it wont be viable for the average spammer to donate so much processing power to try and break through the protection.

    Anyway, I’ve got the incomplete v2 demo up at http://www.kittenauth.com running under drupal so I’d really lvoe you input and hopefully your expertise as we develop the system.

    Comment by Oli — April 30, 2006 @ 4:32 pm
  8. Actually, when the program was running it was matching against all 17 cat pictures, not just those 3. That picture is just an illustrative example to show that fewer keypoints are matched compared to the earlier version.

    The current algorithm for finding keypoint matches is pretty simple and slow, but it’s not too hard to improve it. Basically for each input image, it generates keypoints then for each input keypoint iterates through all the known keypoints and uses a distance metric to find matches. Keypoints are just vectors in 128-space though, so optimal algorithms for finding matches within a certain distance are already a heavily researched area (look at kd-trees or priority R-trees). Using another data structure and algorithm, as well as keeping the known keys in memory, would vastly improve performance. It could probably be modified to use something like the ANN library pretty easily. Also, due to the nature of the problem, it parallelizes pretty well.

    I haven’t been able to check out v2 yet as the images don’t load for me, so I can’t comment on that.

    Comment by ryanfb — May 4, 2006 @ 11:05 pm

RSS feed for comments on this post. TrackBack URI

Leave a comment

(c) 2009 cryptosystem.org | powered by WordPress with Barecity