Wednesday, July 22

Matt Cutts SIGIR Industry Day Presentation: WebSpam and Adversarial IR: The Road Ahead

Daniel is introducing Matt Cutts, head of Google's webspam team.

- What is web spam
- Why people spam
- Spammer/black

-- He's going to teach you how to think like a spammer. Use this only for good; not for evil!

What is Web spam?
- Webspam is cheating (breaking search engine guidelines) in an attempt to rank higher in search engines. Just trying to rank higher is SEO, not spam.

Why web spam?
- Money!

Church spam!
- A Catholic priest used hidden text to spam! Catholicism, Catholicism, Catholicism!

The mind of a blackhat spammer
Anyone using Wifi?
- How do you know it's not fake and someone is trying to sniff your data?
- These guys are here to get you.
- "Recycle your badge" box at Web 2.0. The badge is worth thousands of dollars. How do you know it's real?

- It shows you the blackhat attitude.
- The official Internet registry and optimization bureau!

Final Exam
Scenario: Suppose Google starts penalizing sites that have spammy inlinks.
- People will create spammy links to their competitors!
- SPAM: "Sites Positioned Above Me"

Webspam requirements
1) Content (on-page)
2) Reputation (off-page)
- you need both to be a good spammer.
You need something else: A way to make money! (monetization)

On-page Spam
- The most famous example is hidden text (the old white text on a white background trick).
Old school: FFFFFF... new age...
Brain teaser: hidden text using JavaScript to set display: none on divs.
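
To make the hidden-text idea concrete, here is a minimal sketch of a checker that flags inline styles which hide text. It is only an illustration, not how any search engine does it: the names `find_hidden_text` and `HiddenTextFinder` are made up for this example, it only reads inline `style` attributes, it compares color strings literally, and it would miss the JavaScript variant from the brain teaser because nothing here executes scripts.

```python
from html.parser import HTMLParser


def parse_style(style):
    """Turn an inline style string into a {property: value} dict (lowercased)."""
    rules = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            rules[prop.strip().lower()] = value.strip().lower()
    return rules


class HiddenTextFinder(HTMLParser):
    """Collects (tag, reason) pairs for elements whose inline style hides text."""

    def __init__(self):
        super().__init__()
        self.suspicious = []

    def handle_starttag(self, tag, attrs):
        rules = parse_style(dict(attrs).get("style") or "")
        color = rules.get("color")
        if rules.get("display") == "none":
            self.suspicious.append((tag, "display:none"))
        elif color is not None and color == rules.get("background-color"):
            # Literal string comparison only: "#fff" vs "#ffffff" is not caught.
            self.suspicious.append((tag, "text color matches background"))


def find_hidden_text(html):
    """Hypothetical helper: return the suspicious elements found in an HTML string."""
    finder = HiddenTextFinder()
    finder.feed(html)
    return finder.suspicious
```

A real page would need its stylesheets and scripts resolved before a check like this means much, which is exactly why the JavaScript trick is harder to catch.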

Examples of spam techniques
Clipart spam: throwing queries on the page

SecretsMoney: Tax deferred. videos -- spam generated by a low-order hidden Markov model.

Scraping: (From All About Jazz) A key giveaway is that they don't escape special characters. Spammers are now being tricky and stitching content together (sentence- and phrase-level stitching).

- Cloaking: showing different content to Google than to your users.
- Steve Bartel on MIT! got hacked... He's relying on the fact that search engines don't execute JavaScript. (Google does parse JS, but many search engines don't.)

Off-page Spam
- Blog comment spam ... pretty common.
- Referrer spam.

In 2006 spammers had their own sites; today they try to use other people's!

Paid Links
- NCSU (Linda Stepper)... they are writing paid blog posts and inserting undisclosed paid links.

Hacking is the biggest trend in spam; it's easier to hack somebody else's site than to build your own.

Make spammers waste time or effort. Frustrate them!

Ways to frustrate trolls
  1. Disemvoweling. "thnk tht ths tpc s stpd nd dmb" -- invented by Boing Boing
  2. Show troll's comments only to the troll, not to anyone else.
  3. Slow down the website experience for the troll. (Wait 20s for the HTTP reply... put him in dial-up mode!)
  4. Start the troll in a -1 "hole" that they can dig out of. (You need to get someone to agree with you to get visible)
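
The first two tricks are simple enough to sketch in a few lines. This is just an illustration of the idea, not anyone's production moderation code; the function names, the `hidden_authors` set, and the comment dictionaries with an `"author"` key are all assumptions made up for this example.

```python
import re


def disemvowel(text):
    """Strip vowels from every word and drop words that end up empty."""
    stripped = (
        re.sub(r"[aeiou]", "", word, flags=re.IGNORECASE) for word in text.split()
    )
    return " ".join(word for word in stripped if word)


def visible_comments(comments, viewer, hidden_authors):
    """Trick 2: a flagged author's comments are shown only to that author."""
    return [
        c for c in comments
        if c["author"] not in hidden_authors or c["author"] == viewer
    ]


print(disemvowel("I think that this topic is stupid and dumb"))
# prints: thnk tht ths tpc s stpd nd dmb
```

The point of both is that the troll gets no clear signal that anything was blocked, so there is nothing obvious to route around.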
Reputation/trust helps
- PageRank, TrustRank, BingRank?
- eBay: your seller rating. (100% positive, since '02, with hundreds of sales)... any time he writes a post on Amazon, it's probably OK.

Off the beaten path
- Clever: have a hidden form that only bots will fill out to catch them!
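
A rough sketch of the hidden-form trick, often called a honeypot field: the form carries one extra input that is hidden from humans with CSS, so only a bot that blindly fills in every field will put something in it. The field name `website_url` and the function `looks_like_bot` are hypothetical, picked just for this example.

```python
# Hypothetical field name; in the page it would be an <input> hidden with CSS.
HONEYPOT_FIELD = "website_url"


def looks_like_bot(form_data):
    """A human never sees the honeypot input, so any non-blank value suggests a bot."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```

Usage would be something like rejecting (or silently dropping) a comment when `looks_like_bot(submitted_form)` is true. A bot written specifically for your site can skip the field, so this only catches the generic ones.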

Where is this spam from?
- The spammer reported his spam using Google's report spam form!

Good tools: nofollow
- Oompa loompa dating site! Comment spam... add nofollow to your third-party posts.
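
As an illustration of the nofollow advice, here is a small sketch that adds `rel="nofollow"` to every link in a piece of user-submitted HTML. The function name `add_nofollow` is made up for this example, and the regex approach is an assumption chosen for brevity; it handles quoted `rel` attributes only, and a real site would be better off with a proper HTML sanitizer.

```python
import re

# Matches an opening <a ...> tag and captures its attribute text.
A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
# Matches a quoted rel="..." attribute (but not data-rel or similar).
REL_ATTR = re.compile(r"""(?<![\w-])rel\s*=\s*(["'])(.*?)\1""", re.IGNORECASE)


def add_nofollow(html):
    """Return html with nofollow present in the rel attribute of every <a> tag."""

    def fix(match):
        attrs = match.group(1)
        rel = REL_ATTR.search(attrs)
        if rel is None:
            # No rel attribute at all: append one.
            return f'<a{attrs} rel="nofollow">'
        values = rel.group(2).split()
        if "nofollow" in (v.lower() for v in values):
            return match.group(0)  # already nofollowed, leave untouched
        joined = " ".join(values + ["nofollow"])
        return "<a" + attrs[:rel.start()] + f'rel="{joined}"' + attrs[rel.end():] + ">"

    return A_TAG.sub(fix, html)
```

Run over comment bodies before they are stored or rendered, this means a spammer's link no longer passes on your site's reputation, which takes away most of the reason to post it.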

Trends in webspam
  1. Search engines better at spotting spammy pages.
  2. Spammers make legit-looking pages for spammy links
  3. Spammers hack/deface legit sites for links/landing page
  4. Spammers are using malware!
Spam will soon be more dangerous!
Classifying whether or not a site was hacked, not whether it is keyword stuffing!
Porn producer: 1 in 50 converting to 1 in 200. Answer: Installing malware from a webpage!
Next wave of web spam will be hacking webservers (XSS)

* Detect when a website is hacked based on how links are added, etc...

Selling links from hacked sites.

Preventing Comment Spam
- Any way you can tell humans from bots.
- The KittenAuth captcha: mark all the foxes.

- Is there a question that everyone in the world can answer?
- What techniques prevent comment spam?
- Web service to classify content as spam/nonspam

Trust, Identity, Authentication
- Is there a PageRank for people?... something that spans across social networks and the web.
- Bring authenticated authorship to the web
- When should a website vouch for a link to another website?
- Wikipedia nofollows links to most other sites!
(In short, when can you trust a member of your community)
WoW has better authentication than most sites on the web.

Twitter and Facebook
- Study adversarial IR ... spam followers, or the "realmattcutts" account, which followed the same people as Matt!
- It recreates a lot of the same problems you see in e-mail.
- Twitter trending: the Acai Berries
- Using twitter for linking/malware spam! It's happening.

- Google bombs: the first one was "talentless hack"?

1 comment:

  1. Jeff,

    Unfortunately I couldn't attend SIGIR this year. But this was a useful summary. Thanks!