Saturday, May 1

Best Paper Awards and Nominees at WWW 2010

Matt Lease kindly sent me the best paper award information. (You should check out the grad IR class he is teaching this semester). Unfortunately, I can't find them all available online yet.

Best Poster Award
How much is your Personal Recommendation Worth
Paul Dütting, Monika Henzinger and Ingmar Weber

SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement (PDF)
Raju Balakrishnan and Subbarao Kambhampati

Best Student Paper
Privacy Wizards for Social Networking Sites (PDF)
Lujun Fang and Kristen LeFevre

Best Paper nominees
Factorizing Personalized Markov Chains for Next-Basket Recommendation (winner)
Steffen Rendle (Osaka University), Christoph Freudenthaler, and Lars Schmidt-Thieme (University of Hildesheim).

A Refreshing Perspective of Search Engine Caching (PDF)
Flavio Junqueira, Berkant Barla Cambazoglu, Vassilis Plachouras, Swee Lim, Baoqiu Cui, Scott Banachowski

AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads
Hongji Bao, Ed Chang

Thursday, April 29

Danah Boyd WWW keynote: Privacy and Publicity in Big Data

Today Danah Boyd gave an address on privacy and publicity in the context of big data at WWW 2010. Danah released a crib sheet summary on her website, which you should read. Here are Michael's notes from the talk.
  • Privacy concerns are everywhere
    • Big data (social data created by people) magnifies these concerns

  • Data is cheap today
    • Making sense of the data, however, is still hard
    • Accessing and processing it in an ethical way has not been investigated

  • Methodological issues
    • Big data introduces more
      questions than answers
    • Ethnography tries to answer
      some of these “why” questions

  • Social Sciences Approach
    – 4 key points
    • Bigger is not always better
    • Not all data are equal
    • “What” != “Why”
    • Be careful in interpretations

  • Sampling
    • The way you sample affects your results – it is hard to create a truly random, representative sample
    • Big data doesn’t mean “the whole of the data”
    • No matter how many tweets you have, your sample is always biased
      • Oversampling users who tweet

  • Not all data are equal
    • What does your network represent?
      Types of social network
      • Articulated
      • Behavioral
      • Personal

    • Data from Facebook is not necessarily more accurate than data from other (smaller) social networks
      • Facebook friends != person’s social network
      • Frequency of conversation != personal closeness
  • What != Why
    • Correlation does not mean causation
    • Even if your model shows that two events are connected, that doesn’t mean one causes the other
    • Results need to be interpreted
      • Technology can corrupt social science research by making simplifying assumptions and ignoring the context in which the original results were obtained

    • Uncertainty principle applies
      • Networks are made of people,
        not of abstract nodes on the graph
      • Data in the network is about
        real people’s lives

  • Just because data is accessible
    doesn’t mean that using it is ethical!
    • Privacy is context
    • Walls (Technology) have
      ears (and mouths)

  • Five points for privacy and security
    • Security through obscurity
      • Violated more and more by
      • Technologies change people’s
    • Not all is meant to be publicized
      • Do we all want to become
        “digital micro-celebrities” and fear the “digital paparazzi”?
    • PII vs. PEI (Personally Identifiable vs. Personally Embarrassing Information)
      • Algorithms have a hard time
        discerning PII & PEI
    • Data out of context is a
      privacy violation
    • Privacy is not access control

  • People care about privacy
    • But they also care about publicity – a right to be in public

  • Facebook
    • Facebook users have an impression
      that “Facebook is more private than MySpace”
    • Newsfeed – publicizing implicit (but accessible) content in an explicit way
      • Initially controversial,
        became a great success
      • Created a set of norms in
        the “Facebook world”
    • Beacon – people are vessels
      for advertisements
      • Was a failure, ended in
        a user lawsuit
    • New default privacy settings
      • Research shows that people
        do not understand their privacy settings in Facebook
      • In fact, their mental map
        of settings doesn’t match the actual settings
    • Slow changes from private
      to public
      • Users are like frogs who
        are slowly “cooked” and do not realize it
      • Data from 3rd-party sites is slowly aggregated
        • Tastes, web actions
          are made public
    • Opt-out is the norm at Facebook
      • People do not understand
        what they implicitly agree to

  • Regulations
    • Involvement from governments
      (esp. from Europe, Canada)
    • Researchers --- need to
      understand the consequences of their analysis
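
The sampling point in the notes above (a tweet sample oversamples the users who tweet the most) can be made concrete with a little arithmetic. The sketch below uses made-up per-user tweet counts, not real Twitter data. It compares the average activity of a user picked uniformly at random with the average activity of the author of a tweet picked uniformly at random.

```python
from fractions import Fraction

# Hypothetical population: number of tweets posted by each user (made-up numbers).
tweets_per_user = [1, 1, 1, 1, 1, 1, 1, 1, 2, 90]


def mean_activity_sampling_users(counts):
    """Average tweet count when every user is equally likely to be picked."""
    return Fraction(sum(counts), len(counts))


def mean_activity_sampling_tweets(counts):
    """Average tweet count of the author when every tweet is equally likely.

    A user with n tweets is picked with probability n / total, so the
    expectation is sum(n * n) / sum(n), i.e. a size-biased mean.
    """
    total = sum(counts)
    return Fraction(sum(n * n for n in counts), total)


by_user = mean_activity_sampling_users(tweets_per_user)
by_tweet = mean_activity_sampling_tweets(tweets_per_user)
print(float(by_user), float(by_tweet))
```

With these illustrative numbers a typical user posts 10 tweets on average, while the "typical" author seen through a tweet sample appears to post about 81. Collecting more tweets does not remove this bias, because it comes from the unit being sampled rather than from the sample size.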

Wednesday, April 28

Search is Dead! Long Live Search Panel at WWW 2010

Continuing the WWW 2010 coverage, this afternoon there was a panel, Search is Dead! Long Live Search. You can see a poor quality video stream of the panel. The following are the notes from my labmate, Michael. You can also see the discussion on Twitter, #searchisdead.

    Search is dead! Long Live Search.

    • Search for 10 blue links is already dead

    • A failure case is if a user sees just the 10 blue links

    • There are much more diverse data sources and presentations than links to web pages

    • Intense competition to get the tail queries right
      • You miss everyone if you miss the tail
      • It doesn’t take much to get into the tail – 1 or 2 more keywords

    • Enormous need to resurface implicit structured information for keyword queries

    • How to satisfy the tail?
      • 10 blue links are not enough
      • Show structured data not just for popular queries, but for tail queries as well: maps for “historical houses in Raleigh”
      • UI challenge

    • Change in how people produce content
      • From newswire docs to webdocs to blogs to Twitter

    • Searches that don’t work
      • Book Search
      • Wikipedia
      • Images
      • Complex queries

    • Capturing user behavior activity
      • Reformulations given by users are more likely to be clicked than automatically generated ones.
      • Better tools for capturing
        how users interact with the results

    • Facets
      • Already are used in some vertical domains by Bing
      • Automatically extracting facets from raw text
        • Caveat: are they meaningful?
      • Are users getting used to facets? The “jury is still out”
      • Can become exponentially complex

    • Mobile Search
      • Voice search is a big change
        • Penalty for longer queries goes down – natural language processing will become more important
        • Surprising finding – people tend to type (not speak) longer queries on mobile
          • Recognizing long stateless speech utterances is hard
      • Lots of apps
        • Tail queries can be better served by niche apps
        • How to integrate results between apps?

      • Geo-information
        • Comes for free in phones and has to be used by a successful search engine

    • Social Search
      • We already do social search by using click data
      • Can we do better using social networks?
      • Are applications like Aardvark effective?

Yahoo! Expands Restaurant Vertical with Menu information

For those of you who know me, you know that I love to cook, and to eat. So I was excited when today the Yahoo! Search blog highlighted the addition of a new feature that allows you to search for a specific dish at local restaurants. The post makes vague reference to information extraction from menus:
By extracting structured content – in this case, menu items – from unstructured web pages and matching them to restaurant entities, Yahoo! Search can return results of restaurants near you that serve the dish you crave for when you enter the name of the dish in the search box. You can also try this experience in Yahoo! Local.
I tried a search for roasted chicken, san francisco. I expected Zuni Cafe and their world-famous roasted chicken with bread salad. However, I was disappointed: the vertical did not trigger. Giving it another shot, I tried burger northampton, ma. My favorite burger joint, Local Burger, is second on the list, right after Burger King. Not horrible, but not great either. The local ranking could be improved.

I love the idea, but it still needs work. First, I usually don't think of a specific dish when I'm picking a restaurant, unless it's pizza or a burger, and it's pretty obvious in these cases. What I would really love to see are search options that let me find restaurants with menus that cater to specialty diets. For example, find me a restaurant with options that are gluten-free, low sodium, kosher, dairy-free, etc. This is important because, like many other people I know, my mom has celiac disease and other dietary restrictions.

Vint Cerf WWW 2010 Keynote

Here are the notes that Michael sent me on the Vint Cerf keynote address at WWW 2010.

“Everything is Connected” by Vint Cerf

Note: the slides from the talk are available online and you can watch a video of the talk via Wayne Sutton's livestream.

  • “I’m the guy behind the underlying plumbing, not the applications. So this talk is going to be about the plumbing not the applications built upon this plumbing”

  • Internet is a network of autonomous, independent systems

  • Internet Statistics
    • 1.8B people (26% of the world population)
    • 4.2B mobiles and 1.3B PC’s
    • Asia 770M (20% penetration)
      (Half of users in China)
    • Europe 425M (53% penetration)
    • N. America 260M (76% penetration)
    • Rapid drop of available IP’s (sometime in 2012 IPv4 will run out)

  • Major Near Term Changes
    (Nothing too surprising)
    • Introducing IPv6
    • Digitally Signed Address
      Registration to prevent fraud
    • Sensor Networks
    • Smart Grid – Appliances on the Net
    • Mobiles
    • Cloud Computing
    • Social Networks

  • Mobility
    • Persistent state, disrupted connectivity (transactions mode)
    • Multiple types of networks (Wifi, 3G, 4G)
    • New sensory inputs from the mobiles: sound, speech, video --- Everyone can report almost everything in real time

  • Beyond text search (Mainly
    Google applications)
    • Image Search --- Google
    • Speech recognition --- Easier
      for some tasks
    • Gestures controlling the device (Pattie Maes – see a TED talk)
    • Semantic Web
      • Still, a lot of dark information
        in the web
      • Web “publishing” ---
        not just making the raw data available, making it available for use
        and consumption by some other applications
      • Semantic “printing”
        --- Information Representations
      • Creating Persistent Object
        • What is an object?
        • Uniqueness of the object
        • Interpretation of the object
        • Authenticity of the object
          --- Digital Signatures (and supporting laws)
  • Security Issues – both
    system and user issues
    • Spam
    • Viruses/Trojans
    • Re-use of (poor) passwords
    • Social Engineering – phishing,
      deceiving emails
    • Human Errors (how to detect bad configurations) – incident of marking every website as malware in Google search for 15 minutes
    • organization
      – non-profit organization that detects sites that carry malware.

  • Privacy
    • Lax user behavior
    • Weak protection of personal
      data by businesses and government
    • Invasive devices: every
      mobile device has a potential for privacy invasion

  • New Technologies
    • Flow routers
    • Massive data correlation
      • Map/Reduce
      • Every datum is a query – everything is related to everything
    • Cloud Collaboration
      • So far, there are only “autonomous
        proprietary clouds”
      • How can they be connected
        – “inter-cloud interactions”
        • Send/Receive data/meta-data
          between clouds
      • How to get the data “out
        of the cloud”?
    • Innovative Storage
      • Mixing SSD’s, RAM and
        hard drives
    • Devices
      • Phones, picture frames
      • Internet-enabled surf board
        (true story)
      • Sensor data from buildings
        and houses / smart grid

  • Research Problems on the
    Internet (just a few select ones)
    • Broadcast/Multicast utilization
      vs. point-to-point delivery
    • Distributed/Multi-Core Algorithms
    • Authentication & Identity
    • Integration of Apps
    • Intellectual Property Protection
    • Rotten Bits: Archiving digital information for the (very) long term --- Opening a PowerPoint’97 slide deck in the year 3000

  • Whole lot still to be done
    (with potentially disruptive results) on the lower levels of the
    network, not just on the application level
Thanks again to Michael, and you can follow others tweeting

Tuesday, April 27

Intro to Search at Facebook: User-centric relevance

I'm not sure how I missed this, but last month the Facebook Engineering blog posted an intro to search at Facebook.

A key difference is that at Facebook, a query != keywords. The query consists of a user's social context + semi-structured profile + keywords. Currently, the typical query is a search for a person or group.

Computing the social context (FoaF graph) and using it during query processing is computationally challenging. They hint at this system for a future post.

The personal context is critical to their ranking and makes search hard:
Since our most important ranking features depend on who the searcher is, all our feature generation and ranking happens as a part of the query execution workflow i.e. our indices can't store pre-ranked results to optimize lookups. Instead, we have to generate ranking features like is_same_high_school and num_mutual_connections on the fly for every potential result, and run them through our ranking model to find the best results
It will be interesting to see how Facebook search evolves as Facebook extends to increasingly heterogeneous types of content beyond people and profiles.
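
To make the quoted point concrete, here is a minimal sketch of searcher-dependent ranking. It is not Facebook's implementation. The `Profile` class, the `name_match` feature, and the hand-picked linear weights are all assumptions for illustration; only the feature names `is_same_high_school` and `num_mutual_connections` come from the quote. The point is that the same keywords produce different rankings for different searchers, so the index cannot store pre-ranked results.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    user_id: int
    name: str
    high_school: str
    friends: set = field(default_factory=set)  # ids of connected users


# Hypothetical hand-picked weights; a real system would learn these.
WEIGHTS = {
    "is_same_high_school": 2.0,
    "num_mutual_connections": 0.5,
    "name_match": 3.0,
}


def features(searcher, candidate, keywords):
    """Compute features at query time; two of them depend on the searcher."""
    terms = keywords.lower().split()
    name_tokens = candidate.name.lower().split()
    return {
        "is_same_high_school": float(searcher.high_school == candidate.high_school),
        "num_mutual_connections": float(len(searcher.friends & candidate.friends)),
        "name_match": float(all(t in name_tokens for t in terms)),
    }


def search(searcher, candidates, keywords):
    """Score every candidate for this searcher and return them best first."""
    scored = []
    for candidate in candidates:
        f = features(searcher, candidate, keywords)
        score = sum(WEIGHTS[name] * value for name, value in f.items())
        scored.append((score, candidate))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored]


# Two people with the same name, and two different searchers.
john_a = Profile(10, "John Smith", "Lincoln High", {2, 3})
john_b = Profile(11, "John Smith", "Central High", {4, 9})
alice = Profile(1, "Alice Jones", "Lincoln High", {2, 3, 4})
bob = Profile(5, "Bob Lee", "Central High", {9})

alice_results = [p.user_id for p in search(alice, [john_b, john_a], "john smith")]
bob_results = [p.user_id for p in search(bob, [john_a, john_b], "john smith")]
print(alice_results, bob_results)
```

In this toy example Alice sees the John Smith from her high school first, while Bob sees the other one first, even though both typed the same keywords.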

Social Graph Storage and Analysis at Twitter

A quick pointer to a recent presentation by Kevin Weil on data storage and crunching at Twitter, NoSQL at Twitter.

Here are a few highlights of the different systems he describes:
  • Scribe is a distributed event logging framework open sourced by Facebook. The log data is then stored in Hadoop and analyzed using Pig. See the related Elephant-Bird project for reading/writing Protocol Buffer data in Hadoop.

  • HBase - used for offline data analytics imported from online data sources. For example, the data is used to power the Hawkwind people search system.

  • Scaling Twitter with Cassandra - Uses Cassandra for real-time service to store tweets

  • FlockDB - a simplified distributed graph store (compared with Neo4J) for storing social network data. It is an online system for computing "who follows who" set operations, built on top of the Gizzard framework for sharding/replicating data stores across a cluster.
Overall, it's great to see Twitter contributing to the open source ecosystem. Many of their distributed systems use Scala, which is quite interesting.
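
To illustrate what "who follows who" set operations look like, here is a toy in-memory sketch. It is not FlockDB's API or storage model: the `FollowGraph` class and its method names are hypothetical, and a real system shards and replicates these adjacency sets across a cluster rather than holding them in one process.

```python
from collections import defaultdict


class FollowGraph:
    """Toy follower graph kept as two adjacency maps (forward and backward)."""

    def __init__(self):
        self._following = defaultdict(set)  # user -> accounts the user follows
        self._followers = defaultdict(set)  # user -> accounts following the user

    def follow(self, src, dst):
        """Record that `src` follows `dst` in both directions of the index."""
        self._following[src].add(dst)
        self._followers[dst].add(src)

    def followers(self, user):
        return set(self._followers.get(user, ()))

    def following(self, user):
        return set(self._following.get(user, ()))

    def common_followers(self, a, b):
        """Accounts that follow both `a` and `b` (a set intersection)."""
        return self.followers(a) & self.followers(b)

    def mutual_follows(self, user):
        """Accounts that `user` follows and that follow `user` back."""
        return self.following(user) & self.followers(user)


g = FollowGraph()
for src, dst in [("ann", "bob"), ("cat", "bob"), ("ann", "cat"),
                 ("cat", "ann"), ("dan", "cat")]:
    g.follow(src, dst)

print(g.common_followers("bob", "cat"))
print(g.mutual_follows("ann"))
```

Storing each edge in both directions doubles the writes, but it turns both "who follows X" and "who does X follow" into single lookups, and the interesting queries become set intersections over those lookups.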

Monday, April 26

WWW 2010 this week: Semsearch and other events

WWW 2010 is happening this week in NC. You can follow it on Twitter, #www2010. I'm not attending, but some of my labmates are there. I'll also try to keep up with information I find online, so help me out by posting your information.

Update: I recommend reading the coverage from Krisztian Balog and Christian Grant who both attended the workshop.

Today, the SemSearch 2010 workshop is being held. You can follow it on Twitter, #semsearch2010.

Barney Pell gave the keynote talk, Why users need semantic search. I hope to have more details on that soon. From his abstract,
Our research reveals three key problems of search: imprecise results, need for query refinement, and need to support complex tasks and decisions.
Sam and I submitted runs to the Entity Search Track. Our runs were simple language modeling runs with Indri. We didn't write a system paper, but we wrote a quick abstract. Our best run placed second. Our submissions did surprisingly well considering the lack of training data and simple retrieval techniques.
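
For readers unfamiliar with language modeling retrieval, here is a minimal sketch of query-likelihood ranking with Dirichlet smoothing, the general family of technique referred to above. It is plain Python, not Indri, and not the configuration used in the submitted runs. The toy "entity documents", the `rank` helper, and the choice to skip terms unseen in the collection are assumptions for illustration.

```python
import math
from collections import Counter


def query_likelihood(query, doc, collection_tf, collection_len, mu=2500.0):
    """Return log P(query | doc) under a Dirichlet-smoothed unigram model.

    P(w | doc) = (tf(w, doc) + mu * P(w | collection)) / (|doc| + mu)
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue  # simplification: ignore terms never seen in the collection
        score += math.log((doc_tf[term] + mu * p_coll) / (doc_len + mu))
    return score


def rank(query, docs, mu=2500.0):
    """Rank document ids by descending query likelihood."""
    collection_tf = Counter(term for doc in docs.values() for term in doc)
    collection_len = sum(collection_tf.values())
    scores = {
        doc_id: query_likelihood(query, doc, collection_tf, collection_len, mu)
        for doc_id, doc in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy entity descriptions flattened into bags of words (made-up data).
docs = {
    "e1": "barack obama president united states".split(),
    "e2": "michelle obama first lady".split(),
    "e3": "united states senate".split(),
}
ranking = rank(["obama", "president"], docs, mu=10.0)
print(ranking)
```

The smoothing parameter `mu` controls how much weight the collection model gets relative to the document itself. The toy example uses a small value because the documents are only a few words long.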

The best paper award at SemSearch went to Using BM25 for Semantic Search. During the entity track discussion, one of the key questions discussed was, "is Semantic search just a special type of document search or not..."

There are several other search events at WWW today, tutorials on:
  • Web Search/Browse Log Mining: Challenges, Methods, and Applications
  • Applications of Open Search tools
More to come as it develops. E-mail me or post in the comments if you have other coverage / writeups that I can point to.