Friday, July 24

Ivory: A New MapReduce Indexing and Retrieval System

To coincide with the SIGIR MapReduce Tutorial, Jimmy Lin announced the release of Ivory, an open-source MapReduce search platform. It is a web-scale indexing and retrieval system built on top of Hadoop. Since it's based on Hadoop, it's clearly written in Java.

For retrieval it uses Don Metzler's Searching using Markov Random Fields (SMRF) Java implementation. You can read his publications on the topic. It's exciting to finally get a chance to play with the implementation of one of the state-of-the-art retrieval tools. To my knowledge this is the first time Don's Java MRF toolkit for retrieval is available to the public.

Ivory is aimed at IR researchers as a platform for experimentation. This is an early release with a lot of rough edges.

Jimmy is using Ivory to index the ClueWeb09 dataset, which has 500 million English documents for the TREC web track.

SIGIR 2009 Wednesday, Day 3, Summary

I attended the Industry Day talk that Matt Cutts gave Webspam and Adversarial IR: The Way Ahead. I have some photos of his some of his really interesting slides I'll post soon.

I then was on volunteer duty helping James with registration and getting volunteer gifts ready.

I caught Matt's talk,
An Improved Markov Random Field Model for Supporting Verbose Queries.

I missed the second morning industry track session. I heard they were quite interesting, including Nick Craswell's talk on Bing. If someone has detailed notes, please share them! I'd love to know more details about what I missed.

Lunch was the ACM SIGIR business meeting. SIGIR 2012 will be in Portland, Oregon. SIGIR 2013 will be somewhere in Europe (multiple contenders) or in Israel.

I attended the Query Formulation Session.

Niranjan did a fantastic job presenting Reducing Long Queries Using Query Quality Predictors for his co-workers at Microsoft who were unable to attend.

I also enjoyed Extracting structured information from user queries with semi-supervised conditional random fields prsented by Xiao Li. It looked at the effectiveness of tagging queries, particular for product search. I enjoyed the talk, but I think CIKM would've been a better forum for the paper; it did not use or measure retrieval, the only connection to IR is that it used query log data.

I then went to the Web Retrieval II session. Nick Craswell gave a very good presentation of The impact of crawl policy on web search effectiveness. Anyone building a large-scale web crawler should read this paper. (On a related note you should also read IRLbot: Scaling to 6 Billion Pages and Beyond, PPT slides, which won best paper at WWW 2008).

I liked Marius Pasca's prentation of Web-Derived Resources for Web Information Retrieval: From Conceptual Hierarchies to Attribute Hierarchies which aligned extracted attributes onto WordNet classes. They manually judged the accuracy which sounded challenging and time-consuming. Instead, I would have liked to have seen the evaluation based on utility in a retrieval system: such as usage of the attributes in a faceted retrieval interface or using query logs. This is another paper that I think would've been more appropriate for CIKM than SIGIR because it was more about extraction than retrieval.

There was a UMass CIIR Alumni reunion. I only attended part of it, but I got the chance to talk (briefly) to a great group of smart and interesting people.

The main event was the boat cruise on the Boston harbor. I have some beautiful photos of the water. It was a relaxing time with friends. The only destriment was the slight odor in part of the harbor coming from the sewage treatment plant. Yuck!

SIGIR 2009 Day 2 Summary

Here are some of my highlights from the day.

I didn't get a chance to go to the Retrieval models II session, but I heard there was some interesting work there on better modeling of term proximity. I hope to go back through those papers quite soon.

I had lunch at the Upper Crust with Jon, Matt, Andy, Elif, and a group from CMU/UMass that braved the rain. Great discussion and delicious thin-crust pizza.

I attended the Interactive Retrieval session.

I liked the Predicting User Interests from Contextual Information paper.
A few comments: 1) It was remarked that the ability to predict interests likely varies significantly across the different ODP categories. It would have been interesting to dig in further here. 2) Is ODP even being mantained anymore? I'm not sure how meaningful its categorizations really are.

I wanted to attend the paper on Effective Query Expansion for Federated Search. It looks interesting; another one to read later.

The last talk was the keynote, From Networks to Human Behavior.

That night was the conference banquet at JFK Library and Museum. The museum was a very fun time. It has a great location right on the water near UMass Boston. The museum and the presentations really gave you a sense of JFK's remarkable speaking ability. Many people remarked on how it was easy to see the comparison to Obama's intelligence and rhetorical skill. The museum ignored or minimized some of the more controversial aspects of his life and presidency, which is unsurprising and disappointing.

At the banquet me and Daniel sat at the 'bloggers table' along with a group from Microsoft's Sharepoint search team and talked at length about some of our blogging experiences, goals, etc... The food and setting were also very enjoyable.

It was Henry's birthday. James embarassed him by having the everyone at the banquet sing happy birthday!

Post Banquet
A group of us ended up at the Bukowski Tavern near the Sheraton. I hung out with Jeremy, Daniel, Victor, Diane, and others until the early hours of the morning. We had some really interesting conversations.

Thursday, July 23

SIGIR Social Media Workshop: Abdur Chowdhury Twitter invited talk

Abdur Chowdhury gave the invited talk at the Social media Workshop called "Emergence"

"...Emergence is the way complex systems and patterns arise out of a multiplicity of relatively simple interactions.."

Abdur's presentation primarily focused on Twitter Trends. It was a series of questions and illustrations to provoke discussion. You needed to be there or watch it to experience it. This post is a bit disorganized because it was hard to capture the discussion dynamics.

Twitter Trends
Last thanksgiving he built Twitter Trends. The word Mumbai popped up and he thought it was a bug.

What is a trend?
examples: Mumbai Bombing, the NY plane crash in the hudson, the Jakarta bombings.

The ability to share information quickly and easily:
It changes our awareness of the world and causes new emerging behaviors not fully understood:
  • What social, technological & research questions emerge from this?
  • What happens when it's not one reporter, but millions of all of us talking?
    Lowering the barrier for all of us to share information and allowing it to propagate quickly.
- He presented a movie of Twitter chatter during the super bowl.

(It's cool. I wonder what the bias is towards those who tweet. The midwest is very underrepresented considering the interest in football and the Super Bowl.)

"More interesting than query logs, you can drill in and read the tweets; what people are actually saying."
  • What will the emerging behavior and benefits from sharin events with the world?
  • What does it mean when the water cooler is the size of the planet?
Trend spam
- Hackers band together and try and game twitter. If you have 10k people sitting in basements, they can create a trend. It's getting harder as the traffic volume grows.

Country Perspective
Location based trends.
There was a good question about looking do trends in a sub-community instead of just in a location... it's possible, but it just hasn't been done yet.

One of the big uses of Twitter is shared events: when you want to share an experience.
- Shared events make it in the trends list: SIGIR, Apple developer conference
Fun examples: near: searches (e.g. near: queries)

Go Marti! She brought up a very important missing topic, food! Despite the fact that people are talking about it, it's not a "trend" according to Abdur. It's background "chatter". (I disagree. You need to look at trends in a lower level in the hierarchy.)

Emergence: Interesting things happen:
FollowFriday -- a recommendation engine that emerged.

"What's it like working somewhere that simultaneously matters, and complete doesn't matter?"
  • What is the interesting research?
(Personally, I would have liked some more depth on how Twitter and Search interact more heavily, e.g. the use of hash tags).

SIGIR Evaluation Workshop: Evaluating IR in Situ by Susan Dumais


IR systems are developed to help people find information to satisfy their need.

Success depends on two general components
- The system depends on the matching system (content and ranking)
- User interface and interaction

* Data is a critical resource!

$$ You have won $10 million $$
* Challenge: You have been asked to lead a team to improve a big web search engine. How do you spend it?
Ideas: study what users are doing and how to help them. Work out what you are doing badly and do better. ... you need to understand what improve means.

* Content
- ranking, crawling, spam detection
* User experience
- presentation (speed, layout, snippets, etc..)
- Features like spelling correction, related searches, ...
- Richer capabilities to support query articulation, results analysis...

Depends on:
* What are the problems now?
* What are you trying to optimize?
* What are the costs and effect sizes?
* What are the tradoffs
* How do various components combine?

Evaluating Search Systems
* Traditional test collections
- fixed docs, queries, relevand judgments, metrics
- goal: compare systems w/respect to the metric (mean, not variance!)
- Search engines do this, but not just this!

* What's missing
- Metrics: user model, average precision, all queries are equal
- Queries: types of queries, history of queries session and longer (trends!)
- Docs: the "set of documents - duplicates, site collapsing, diversity, etc... (interdependent)
- Selection: the nature and dynamics of queries, documents and users
- Users: individual differences (location, personalization, including re-finding, iteration, and interaction
- Presentation: Snippets, speed, features (spelling, query suggestion, the whole page

Kinds of User Data
* User studies (lab setting, controlled tasks, detailed instrumentation (including gaze, video), nuanced interpretation of behavior.

* User panels
- In the wild, user-tasks, reasonable instrumentation, can probe for more detail (put out an instrumented system to hundreds or thousand of users and collect more detail)

* Log analysis (in the large)
- in the wild, user-tasks, no explicit feedback, but lots of implicit indicators. The what vs. the why.

User Studies
- The SIS timeline experiment
- Lab setting, small scale (10-100s of users)
- Months for data
- Known tasks and known outcome (labeled data)
- Detailed logging of queries, URLs visited, scrolling, gaze, trackin, video
- Can evaluate experimental prototypes
- Challenges - user sample, behavior w/experimenter present or w/ new features
* You have to pick the questions you want to ask very carefully because it's time consuming and expensive.

User Panels

- Curious Browser toolbar: 3 panels of 2000 people each back in 2005
- Link explicit user judgments w/implicit actions
- In the wild they asked relevance about URLs, sessions
- They measured scrolling, cutting and pasting, what URLs people visited.

- Browser toolbar
- Smallish scale (100-1000s of users)
- Weeks for data
- in the wild, search interleaved with other tasks
- You can probe about specific tasks and success/failure (some labeled data)
- Challanges - user sample, drop out, some alteration of behavior

Log Analysis (in the large)
- E.g. Query-click logs
- Search engine vs. Toolbar

Search Engine - details of your services (results, features, etc...)
Toolbar - broader coverage of sites/services, less detail (this is more interesting than the first)

Millions of users and queries
real-time data
In the-wild
Benefits: diversity and dynamics of users, queries, tasks, actions
Challenges: logs are very noisy... they tell you what, not why.

Sharable Resources?
* User studies / Panel studies
- Data collection infrastructure and instruments
- Perhaps data
* Log analsysis - Queries, URLs
- Understanding how users interact with existing systems
- What they are doing; Where they are failing, etc..
Implications for retrieval models, lexical resources, interactive systems.
Lemur Query log toolbar.
-- The big issue is: How do you get users?

Operational Systems an experimental platform
- Can generate logs, but more importantly... A/B testing
- Interleave results from different methods

* Can we build a Living Laboratory?
- Web
- Web (search APIs, but ranking experiments are somewhat limited
- If there were concrete proposals she thinks the big three would be willing to open this up more.
- UX perhaps more natural

- Other content resources:
- Wikipedia, Twitter, Scholarly publications, ...
Replicability in the face of changing content, users, queries.

Again: The key is getting users. Someone has to mantain the product.

Closing thoughts
- Today's test collections are very limited with respect to user activities
- Can we develop shared resources to address this?

- Selecting your users is important: what are you likely to have in variablility? .. you need to sample the space in a way that's useful for you.
- Robust Track: looked at variance. - What behaviors precede users switching?

- Do we have a 100 processes of abstractions that characterize what users are doing? It's hard to predict what someone is going to do in an interactive setting. (imagine you want to improve snippets. Start with it in the laboratory setting with eye tracking. Then what behaviors would you observe in logs if you introduced it in the logs.)

- There is interesting discussion about abstracting the framework and modeling tasks and what is driving the behavior.

SIGIR Evaluation Workshop: Richer Theories, richer experiments by Stephen Robertson

Richer Theories, richer experiments by Stephen Robertson
- Stepehen Robertson

- Search as a science
- The role of experiment and empirical data gathering in IR
- The standoff between the Cranfield tradition and user-oriented work
- The role of theory in IR
- Abstraction

A caricature
- One the one hand we have Cranfield tradition of experimental evaluation in IR (powerful lab paradigm, but of limited scope)
- On the other hand, we have observational studies with real users (realistic, but of limited scale)

Experiment in IR
- The Cranfield method was initially only about "which system is best", system in this case meaning the complete package (language, indexing rules, indexing, esarch, etc...)
- Cranfield was not seen as being about theor or models.

Implicit models
Cranfield 1: effectiveness is a consequence of the general approach e.g. 'faceted classification, 'Uniterms'

Cranfield 2: effectiveness is a consequence of the combination of low-level components
e.g. synonyms, morpholoical varians, and generic and specific terms are conflated.
- Relevance is a way of measuring effectiveness

Theory and experiment in IR
"Theorys and models in IR (1977)
- Cranfield gave us an experimental view of what we are trying to do.
- The measurement is an explicit component of the models
- We have pursued this ever since...

* Traditional probabilistic models
- explicit hidden variable which is relevance
- prediction of this variable

* Logical models
- relevance embedded in 'd impies q'

* Language models
- Simple model: relevance embedded in 'd implies q'
- extension: relevance itself as lanuage model

* Others
- DFR: again an embedded notion
- Focal of all these models is predicting relevance (or what the model takes to be the basis for relevance)

No other hypothesis/predictions sought. nor other tests made. This is a very limited view.

Scientific method : a brief overview

Traditional science
- image of science involves experiments in laboraties, but this is misleading. This is true in some fields (small-scale physics, but others are not: biochemical end of biology)

- Lab experiments involve abstraction
- choice of variables included/excluded
- control on variables
- restrictions on values/ranges of variables
(parts of the word are included/excluded)

Models and theories involve abstraction
- Why? To make them possible
- Why else: reduce noise, clarify relationships, study simple cases.

Newton's laws
- have many uses (motion of planets)
- there are many ways to test them
- and they suggest other experimental measurements (mass, acceleration due to gravity, and G the gravitational constant)

Abstraction in Newton's laws
Abstraction allows the unification of astronomy and local physics
.. and also the separation of use, testing and..

Testing Newton's laws
- pendulums vs. a toy which also tests it, but has less use.

Information retrieval phenomena
* people writing documents
*users needing information
- to solve some problem or accomplish some task
* these users undertakin information-seeking tasks
* various mechanisms to help them
* a notion of degrees of success or failure

Science and engineering
As IR engineers we concentrate on
- constructing the mechanims
- measureing the success or failure
-- As scientists we should be looking further

A typical SIGIR paper
1) construct a model
2) Ask the model how to do ranking for a search
3) Construct a system which follows the advice of the model
4) choose a baseline
5) evaluation usin TREC data
6) It does better than the baseline...
... therefore the approach/system/model is good.

Traditional IR Evaluation
- Primarily concerned with evaluating systems rather than models or theory, but has become the usual way to evaluate models or theories
- evaluating in terms of useful outcomes (despite the above)
-- There are some disconnects here.

User-oriented Research
* A lot of observational work, but also increasingly laboratory experiments
- within and outwith the Cranfield/TREC tradition

* Emphasis of the models and theories
- understanding user behavior.

Some points of contact
- We are interested in the interaction of mechanisms and user behavior
* understanding the abstraction that is relevance
* understanding easily observable behaviors
- clicks, reformulations, termnations

Theory and models
- One way to better understanding is better models.
- The purpose of models is to make predictions, but what do we want to predict? (useful applications/ inform us about the model)

Predictions in IR
1) What predictions would be useful?
- Relevance, yes... but others!
- redundancy, novelty, divserity
- optimal thresholds
- satisfaction (and other kinds of quality judgments)
- clicks
- search termination
- query modification (and most other aspects of user behavior)
- satisfactory termnation
- abandonment/unsatisfactory termination
... and other combinations.

2) What predictions would inform us about models?
- more difficult: it depends on the models. (many models insufficiently ambitious)
- in general, observables/testables
- calibrated probabilities of relevance
- hard queries
- clicks, termination
- patterns of click behavior
- query modification

Richer models, richer experiments
* Why develop richer models?
- because we want richer understanding of the phenomena (as well as other predictions)
* Why does it ...
A rich theory should have something to say both to the lab experiments in the Cranfield tradition and observational experiments.

- Justin: the history of science can be thought of as the evolution of measuring instruments... we need to know the engineering is correct in order to measure the theory.
- As soon as observational data became stronger (polemic vs. newtonian physics)... in our case, the data is being abstracted. (e.g. collections with NLP markup, contextual information, etc...).. we're still in the epicycle phase.
- Should we be engineering tools to solve a task instead of the engine for everything?

Wednesday, July 22

Matt Cutts SIGIR Industry Day Presentation: WebSpam and Adversarial IR: The Road Ahead

Daniel is introducing Matt Cutts, head of Google web spam.

- What is web spam
- Why people spam
- Spammer/black

-- He's going to teach you how to think like a spammer. Use this only for good; not for evil!

What is Web spam?
- Webspam is cheating (breaking search engine guidelines) in an attempt to rank higher in SE; just trying to rank higher is SEO, not spam.

Why web spam?
- Money!

Church spam!
- A catholic priest used hidden text to spam! Catholicism, Catholicism, Catholicism!

The mind of a blackhat spammer
Anyone using Wifi?
- How do you know it's not fake and someone is trying to sniff your data?
- These guys are here to get you.
- "Recycle your badge" box at Web 2.0. The badge is worth thousands of dollars. How do know it's real?

- It shows you the blackhat attitude.
- The official Internet registry and optimization bureau!

Final Exam
Scenario: Suppose Google starts penalizing sites that have spammy inlinks.
- People will create spammy links to their competitors!
- SPAM: "Sites Positioned Above Me"

Webspam requirements
1) Content (on-page)
2) Reputation (off-page)
- you need both to be a good spammer.
You need something else: A way to make money! (monetization)

On-page Spam
- The most famous example is hidden text. (the old white text on a white background trip)
Old school: FFFFFF... new age...
BrainTeaser: Hidden text using Javascript and setting display none on divs.

Examples of spam techniques
Clipart spam: throwing queries on the page

SecretsMoney: Tax deferred. videos -- a low order hidden markov model generated spam.

Scraping: (From all about Jazz) A key give away is that they don't escape special characters. Spammers are now being tricky and stitching content together (sentence and phrase level stitching)

- Showing different content to Google than your users.
- Steve Bartel on MIT! got hacked... He's relying on the fact that search engines don't execute Javscript. (They do parse JS!, but many search engines don't.)

Off-page Spam
- Blog comment spam ... pretty common.
- Referrer spam.

In 2006 spammers had their own sites, today they try and use other people's!

Paid Links
- NCSU (Linda Stepper)... they are writing paid blog posts and inserting paid links (non-disclosed)

Hacking the biggest trend in spam; it's easier to hack somebody else's than build your own.

Make Spammers waste time or effort. Frustrate them!

Ways to frustrate trolls
  1. Disemvowelin. "thnk tht ths tpc s stpd nd dmb" -- invented by boing boing
  2. Show troll's comments only to the troll, not to anyone else.
  3. Slow down the website experience for the troll. (wait 20s for http reply... put him in dial up mode!)
  4. Start the troll in a -1 "hole" that they can dig out of. (You need to get someone to agree with you to get visible)
Reputation/trust helps
- PageRank, TrustRank, BingRank?
- Ebay: your seller rating. (100% positive, since 02, with hundreds of sales)... any time he writes a post on Amazon, it's probably ok.

Off the beaten path
- Clever: have a hidden form that only bots will fill out to catch them!

Where is this spam from?
- The spammer reported his spam using Google's report spam form!

Good tools: nofollow
- Oompa loompa dating site! Comment spam... add nofollow to your third-party posts.

Trends in webspam
  1. Search engines better at spotting spammy pages.
  2. Spammers make legit-looking pages for spammy links
  3. Spammers hack/deface legit sites for links/landing page
  4. Spammers are using malware!
Spam will soon be more dangerous!
Classifying whether or not a site was hacked, not whether it is keyword stuffing!
Porn producer: 1 in 50 converting to 1 in 200. Answer: Installing malware from a webpage!
Next wave of web spam will be hacking webservers (XSS)

* Detect when a website is hacked based on how links are added, etc...

Selling links from hacked sites.

Preventing Comment Spam
- Any way you can tell humans from bots.
- The KittenAuth captcha: mark all the foxes.

- Is there a question that everyone in the world can answer?
- What techniques prevent comment spam?
- Web service to classify content as spam/nonspam

Trust, Identity, Authentication
- Is there a PageRank for people?... something that spans across social networks and the web.
- Bring authenticated authorship to the web
- When should a website vouch for a link to another website?
- Wikipedia nofollows links to most other sites!
(In short, when can you trust a member of your community)
WoW has better authentication than most sites on the web.

Twitter and Facebook
- Study adversial IR ... spam followers, or the "realmattcutts" he followed the same people as Matt!
- It recreates a lot of the same problems you see in e-mail.
- Twitter trending: the Acai Berries
- Using twitter for linking/malware spam! It's happening.

- Google bombs the first one was "talented hack?"

SIGIR 2009 Best Paper Award Goes to...

Sources of Evidence for Vertical Selection (ACM DL link) by Jaime Arguello (Carnegie Mellon University), Fernando Diaz (Yahoo! Labs Montreal, UMass Alum), Jamie Callan (Carnegie Mellon University, UMass Alum), Jean-Francois Crespo (Yahoo! Labs Montreal)
Vertical selection is the task of selecting the relevant verticals, if any, in response to a user's query. We focus on single vertical selection, defined as the task of predicting a single relevant vertical.
They looked at 18 different verticals and compared how their vertical selection methods compared with prior work on federated search. They used:
  1. Query string features which used key phrases in the query ("weather", "movies", etc...)
  2. Query log features which used the likelihood of the query given past queries on the vertical
  3. Corpus features which uses documents retrieved from the different verticals.
My congratulations to Jaime, Fernando, Jamie, and Jean-Francois!

(Note: Since the best paper was from a student there was no separate best student paper award)

Tuesday, July 21

SIGIR Keynote: From networks to human biology by Albert-Laszlo Barabasi

Here are my raw notes from the keynote address.

Update: My Flickr album includes photos from the keynote. I'm behind on labeling, but they're the cool looking network images.

He will talk about all kinds of networks. When thinking about large networks most people think of the Internet, a system of hardware routers and cables.

Structure of an organization
Another network: The social network. Each node is a person. The color is a department and the links are communications between them.

Understanding the structure of the social network, what advantage does that give the company? ( The social network defines how a company performs, or how successful a person is in pursuing their goals.

The Genome
Humans have 25,000 genes about three times as many genes as the fly. They have less than the onion, which has 40,000. So how do we explain the complexity differences in organisms? Human complexity seems unlikely to come from a sheer quantity of genes. There is no correlation between the number of genes and intelligence. It has to do with the internal network and how the genes are connected.

What makes a system complex?
Complex Systems are made of many non-identical elements connected by diverse interactions.

How do we model and analyze networks?
Pal Erdos (1913 – 1996) A Hungarian mathematician who lived out of his suitcase, moved from place to place. He stayed only as long as it took to wear out his welcome, which was usually about 5 days. Then he moved on. This made him one of the most networked academics of his time…

Erdos-Renyi model
  • A network is simple: nodes and links
  • Generate a network that has the complexity of the real network. (democratic, random) Connect two nodes with a probability, p.
Animation: Start with nodes floating in space, and you connect them randomly. Start by connecting pairs. Eventually you get a big glob – a magical moment. Results in a Poisson distribution

A poisson distribution. If it modeled how a society formed (friends), we would all have roughly the same number of friends. It would be very democratic. Do we believe that real networks are random? Intuitively the answer is no. We're here because there is order (not random). We need to find the signature of the non-randomness.

They selected the WWW has a model of a large graph.
Around a trillion documents. They expected a degree of randomness, but when they looked a the data they found a power law distribution. There are a small number of nodes with large numbers of links. They compared it with the road network; it's very different. It's much more like the hub system (scale-free network) used by airlines.

Internet backbone too is a scale-free network. We’re talking about just the backbone here, as distinguished from the Internet as a whole.

He also showed a similar trend in scientific co-authorship: Nodes are people and links are working relationships. Most people have have a few collaborators. A few have many. The story is similar with academic citations, i.e. how papers refer to each other.

The same trend is shown in the Swedish sex-web (nodes: females;males and the links are sexual relationships). Liljeros et al. Nature 2001. They were looking at how AIDs spreads and they saw the same underlying scale-free network.
  • Why do all these networks converge to a similar architecture?
Erdos-Renyi made the assumption that as you’re placing the links, the number of nodes don’t change, which is different from the networks we’re observing.

Origin of SF networks: Growth and preferential attachment.
  1. Networks continuously expand by the addition of new nodes.
  2. Growth add a new node with m links. New nodes prefer to connect to highly connected nodes. Your biased towards more connected nodes (preferential attachment).
These two properties sufficient and necessary for the emergence of scale-free networks. Both of these properties are essential ingredients, although there is much more going in real networks (remove nodes, change links, etc...).

See A.-L. Barabasi, R. Albert and H. Jeong, Physica A 272, 173 (1999)

Fitness Model: Can Latecomers Make it?
- Scale-Free Model: (first mover advantage)
- How can you cure the model to account for the latecomers? Each node has a fitness, it's a fuzzy quantity. It tells you how attractive a node is. What is the likelihood that you will connect to the person once exposed?

See Bianconi & Barabasi, Physical Review Letters 2001; Europhys. Lett. 2001.

With fitness, you may have a latecomer take over the whole network. You don’t typically have this with a basic scale-free network w/out the fitness property.

These are manmade networks. Is this a result of what we do, or is there something more fundamental?

- Protein-gene network (genome), protein-protein interactions (proteome), bio-chemical reacctions (metabolism).

- Is it only yeast and bacteria, or do we carry scale-free networks? Rual et al. Nature 2005; Stelze et al. Cell 2005. Yes, we do. So what?

  • The property of complex systems maintain their basic functions even under errors and failures. (animals function despite cell mutations; Internet functions despite broken routers).
  • In a network, what happens when there are node failures? As we remove nodes the network will start breaking apart. What is the fraction of the largest connected component? The critical point where too many nodes have been removed can be calculated.
  • As you remove nodes, the size of the critical point will only happen when fc=1.
  • In a scale-free network most nodes are small: so if they are chosen randomly it's likely you'll choose a node that is not highly connected (hubs are rare).
  • However, if you start at the largest node and remove in order, that within 10-15 nodes the network is shattered. (attack is its Achilles heel). (now from structure to behavior...)

When do things happen?
- Basic assumption: Poisson process: events take place randomly at a constant rate. Is this a good model? Do we really follow this?

In the case of e-mail the pattern is bursty; periods of inactivity and intense activity. It's more of a power law (long tail) distribution of events than a constant Poisson model. The same patterns emerge for library usage, web browsing behavior, document printing, cell phone calls, etc... This means you'll have a burst of behavior followed by a pause with sparse events interspersed.

Prioritized Task Execution
Where does this come from? How do we decide what to do next? Imagine you have a to-do list. Do you throw a die and decide what to do? Uh, no. Most of us assign priorities. We deem some things to be important, and others less so.

Imagine a model where you prioritize each task. You do the highest priority task you can right now, and when you finish you have an opening. Meanwhile you may have gotten another task on your todo list, which you prioritize. Repeat. So how long do things stay on your todo list? (execute a random task and a highest priority task with a prob of 1-p) This explains the power law distribution. Just the fact that you prioritize things creates burstiness and the power law distribution.

Q: Does the length of the queue matter? No. the distribution is the same.
- Just the fact that you prioritize creates the bursty pattern.

See Vazquez, PRL 95, 248701 (2005)

Universality Classes
- Is there life beyond alpha=1?
- The letters of Einstein, Darwin. (pre-email)
Einstein sent 14,500 letters, received 16,200.
Darwin sent 7591 letters and received 6530.
Graphs of number of letters per year, and response time for letters look very similar.
- The response time for their letters follow the same power law distribution (with a=3/2)

Cobham Model 1954. They were overwhelmed with the number of letters and could not keep up.

Thinking about networks: Is it just a fad?
Lorenz’s paper on Chaos (1963) In 1970’s citations took off and today its perhaps the most referenced paper on complexity. [He showed us other graphs on the reference growth of popular papers over time.]

The three most cited network papers’ citations are off the charts. Nothing like it has ever been seen before in the history of academic literature. What’s going on?

What is “network science”?
An attempt to understand networks emerging in nature, technology and society using a unified set of tools and principles.

He displayed a graph of how network science citations have exploded over the last decade. This is due to interdisciplinary recognition of this as a phenomenon. The study of this as a science he argues is due to the fact that "networks emerge and evolve driven by a fundamental set of laws and mechanisms".

Each field has adopted network science as its own, and thinks they invented it.

It was noted that these types of networks are vulnerable to attacks; they are not the safest. However, it's more likely that they represent a trade-off between efficiency and safety (my interpretation). The most efficient network is a single hub with spokes. The chance of bringing down this network by random chance is very small; only 1/n. The power law network has a relatively small number of these hubs.

Jeremy asked about the implications that the network structure has for the discoverability of information. Namely, that it will be harder to find information from non-hub sources. The response was that on the Internet:
"Everybody has a voice, but not everybody is heard."
He added in a follow-up that Google will be hard to beat by Bing and other competitors because it has such a large lead. Their fitness rate has to be significantly higher than Google, which means they need to be significantly better/different (... or use their desktop monopoly, my thought).

SIGIR 2009 Day 1 Summary

Daniel has his summary of the day as well.

Monday Morning
I missed most of the morning talks yesterday, it was a frenetic rush as people were registering. I organized the layout of the poster session, which was non-trivial for the 100+ posters.

I heard good things about Paul's paper, Refined Experts: Improving Classification in Large Taxonomies.

I also want to check out Segment-Level Display Time as Implicit Feedback: A Comparison to Eye Tracking. I missed the session, but there's some coverage from CSAIL.

In the afternoon I attended the Retrieval Models I track. There were multiple papers that tried to address the risk/reward of ranking documents that are similar to one another. In short, if the most relevant (highest prob) pages are very similar, but are not the desired context that you can achieve better performance by introducing diversity into the results. I wonder if some simple document similarity/clustering models would be simpler and be just as effective.

One notable talk I heard about and want to read later is:
Enhancing Cluster Labeling Using Wikipedia by David Carmel

The evening was the poster session. There were an overwhelming number of posters and I know that I didn't get a chance to look at all of them in the depth that I wanted.

Marc, Elif, and I presented a poster Characterizing the Subjectivity of Topics. The goal is to better inform users about the broader context of the topic a given a single document (blog post) on the topic.

As usual, the best part was talking to people about their work.
Justin Zobel had a provocative poster Has Adhoc Retrieval Improved Since 1994? which they have in their evaluation workshop.

Another highlight was Omar's work using Mechanical Turk work.

The Glasgow guys had about half a dozen posters!

After the poster session Michael and a group of the Glasgow guys ended up at Uno's for drinks and ended up discussing our experiences with Hadoop and distributed computation.

Monday, July 20

Susan Dumais: SIGIR 2009 Salton Award Keynote Talk

See also Jon's notes.

- I'm live at the SIGIR 2009 conference. Liz Liddy is introducing Susan Dumais as the Salton Award winner.

- Sue is now on stage giving her presentation:

  • Sue’s research is interdisciplinary, at the intersection of IR and HCI
  • User-centric vs system centric
  • Emphasizes empirical vs theoretical work
Common themes:
Understanding the user, domain, and task contexts

Her Background
  • PhD in Mathematics and cognitive psychology (vision, perception, and cognition).
  • Joined the HCI group at Bell Labs in 1979
  • Introduction to IR, 1980-82
  • Human factors in database access
  • Vocabulary
  • Semantic Indexing
From Verbal Disagreement to LSI
They observed a mismatch between the way people want to retrieve information from a computer and the way that system designers describe that information.

Linux Command names
The trouble with Unix: cryptic command names (cat, grep, ls, etc…). It stimulated thought on the vocabulary mismatch problem.

They studied how people describe objects and operations in text editors, common objects.

She just had us pick names for a Boston weekend activity site/service and compare with our neighbors. Very few people agreed!

Findings: the tremendous diversity in the name that people use to describe the same objects and actions.

Details on alternative aliases:
Single keyword - .07-0.18 “repeat rate”
Single normative keyword: 0.16-0.35
Three aliases: 0.38-0.67
Synonyms for mismatch -- very funny!

Chi 1982 - 0th CHI Conference. Statistical Semantics: How can a computer use what people name things to guess what things people mean when they name things?
"In describing items in a database, however, system designers are at a disadvantage in that they do not usually get explicit, immediate, and continuous feedback from users. Knowing how people describe common objects and shift their descriptions from audiences of different levels of sophistication may help designers build systems whose information is accessible to the widest possible audience."
She's talking about earlier collaborators on LSI.

Tackling word mismatch

Rich Aliasing
  • Allow alternative words for the same item.
  • "Natural" in the world of full-text indexing, but less so for keyword indexing or command naming.
Adaptive Indexing
  • Associated (failed) user queries to destination objects
  • Add these queries as new document field
  • Quickly reduces failure rate for common requests/tasks
Latent Semantic Indexing
  • Model relationships among words, using dimension reduction
  • Especially useful when query and documents are short
  • Done earlier by Baker, Borko/Bernick, Ossario (1962-1966); Kohl (SIGIR 1879 p1). They lacked the computational resources to do this computationally.
Many applications and algorithms
- Bell labs directory of services, expert finding, reviewer assignment, handwritten notes, data evidence analysis, IR and Information filtering test collections

Rich aliasing and adaptive indexing in the web era
  • Full text indexing (rich aliases from authors)
  • Anchor text or tags
  • Historical query-click data (adaptive indexing with implicit measures)
"If you have an operational system and you don't use what your users are doing to improve, you should have your head examined"

The work on vocab mismatch started in CHI and built IR systems to help them, then the work went back to psychology.

More and more the work has been presented in an interdisciplinary fashion. Besides LSI, her most cited paper is in psychology:
A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge

By 1995 or so, she was firmly entrenched in IR.

Common Themes
The last 10-20 years have been an amazing time to be in IR
  • TREC and evaluations starting in 1992
  • Search is everywhere - desktop, enterprise, web
  • Web search - big advances in scale, diversity of content and users, quality of results for some tasks, etc..
  • SIGIR community has a lot to be proud of, but many search tasks are still quite hard.
Common theme: represent and leverage richer contextual information about users, domains, and task environments in which search occurs.

Web Search at 15
  • Scale and diversity have exploded.
  • Number of pages indexed
  • 7/94 Lycos - 54,000 pages. (not the full text, only a few hundred words).. for IP reasons, not just technical reasons.
  • 1995: 10^6
  • 1997: 10^7
  • 1998: 10^8
  • 2001: 10^9
  • 2005: 10^10
  • (my note, 2009: 50^10+?)

Content Diversity
It started with Web pages and newsgroups. Now it includes maps videos, blogs, health... etc... the diversity of content has exploded.

How it's accessed
She showed the evolution (or lack thereof) the search interface across over a decade by comparing screenshots. It has not changed other than the width of the box (wider) and the label on the search button.
  • Has not changed as rapidly.
  • It's still a similar search box!
  • The result presentation has not changed significantly.
Support for Searchers
  • The search box
  • Spelling suggestions (getting the UI right is important to get right)
  • Auto-complete.
  • Inline answers
  • Richer snippets
Search in the future will look nothing like today's simple search engine interfaces said, Susan Dumais, adding, "if in 10 years we are still using a rectangular box and a list of results, I should be fired." (Mar 7, 2007)

In retrospect she regretted saying it, not expecting it to get picked up by the NY times. She got calls from Bill Gates and others telling here not to worry about it – she’d be retired by then!

Search Today
Query words --> Server --> ranked List
  • User context - where are the query words coming from?
  • Document context - Documents are related by interaction patterns, etc...
  • IR occurs in a larger task/use context -- search is done to solve a problem.
Search and Context (Sue's work)
You get context from: Users, Task, and Document Content.
  • Modeling users
  • Using User Models
  • Inter-relationships among Documents
  • Evaluation is very important
What problems are hard for people to solve? How can we build interfaces for people to solve them? Iteration and empirical testing is critical.

User Modeling
  • Modeling searcher's interests and activities over time (short-term vs. long-term, individual vs. group, and implicit vs. explicit)
  • Refinding and personalization

Refinding on the desktop
: Stuff I've Seen (SIS) (SIGIR 2003)
  • Unified access to many types of information
  • Index of content and metadata
  • Rich UI possibilities because it's your stuff and client applicaiton
Deployed six versions to over 4,000 people. They varied the query syntax, result previews, and the default ranking.

  • People issues shorter queries than web.
  • They don't use a lot of advanced search operators. filter, sorting
  • Date is by far the most common sort attribute (vs best match)
    - importance of time, people, episodes in human memory
    - few searches for "best match"; many other criteria
  • Need for "abstractions" in time - date, people, kind
Rich client-side interface
- support fast iteration/refinement
- Fast filter-sort-scroll vs next-next-next

Re-finding on the web (Teevan et al., SIGIR 2007)
  • 50-80 pages visits are re-vists
  • 30-50% of queries are re-finding queries
  • Total = 43% There is a big opportunity to support re-finding on the web.
  • There are few models to combine web rank w/ personal history of interaction.
  • Interfaces to support finding and re-finding.
  • Today: People et the same results, independent of current session, previous search history, etc...
  • PSearch (SIGIR 2005) : Uses a rich client-side model of a user to personalize the search results.
Building a user profile
  • Type of information: past queries, web pages, desktop
  • Behavior: visited pages, explicit feedback
  • Time frame: Short term, long term
  • Who: Individual, group

  • Where the profile resides:
    - Local: Richer profile, improved privacy (but, increasingly rich public data)
    - Server: Richer communities, portability
Using the user profile
  • Ranking
  • Query support
  • Result presentation
Ranking Algorithm (SIGIR 2007)

When To personalize? (SIGIR 2008, TOCHI)
  • Personalization works well for some queries, but not others
  • Models for predicting when to personalize using features the query and query-user
Evaluating Personalized Search (TOIS, TOCHI)
  • What's relevant for YOU, now
  • Explicit jdugments (offline and in situ)
  • Implicit "judments" from behavioral interaction
  • Linking explicit and implicit (e.g. curious browser study , 4000 volunteers. Used data to learn models that help you link implicit information from other people. If you use just the click, 45% of the time. 75% w/click+ dwell + session
Future Challenges

Dynamics and Search
  • As a community, SIGIR deals with static test collections (such as Gov2), but the Web is very dynamic, constantly changing. The community could use a dynamic information environment
  • Knowledge of how a page has changed over time is largely discarded by today’s search engines. Changes are important, and should be highlighted in snippets when applicable. For example, a SIGIR 2009 page was updated with information about a free breakfast: That would be really important for someone at the conference who was searching the website to help plan out their day.
  • Improved crawling policies
  • Improved ranking using temporal patterns, term longevity, etc.
Data and Evaluation
  • TREC and the Cranfield evaluation methodology is great, but don’t let it narrow the experiments we do. Search is inherently interactive and iterative.

  • User interaction data is important! How do users interact with existing systems, what are they trying to do, and what happens when they fail?
  • Implications for models and interactive system
  • Lemur query log toolbar – developing a community resource

  • The SIGIR community needs more large-scale log data. As a community, how can we get this? A community-based collection of queries and user interactions that can be used for research is needed.

  • Can the SIGIR community build a living laboratory where researchers can test ideas on an experimental platform? This would be an operational environment to test algorithms, do A/B testing, etc. (see the Microsoft Experimentation Platform)
General Challenge: Think outside the traditional IR box. What frustrates you? Understand users and their tasks. Solve their problems.

Sunday, July 19

Off to SIGIR 2009

I'm off to SIGIR 2009. I'm volunteering with registration, so stop this afternoon or tomorrow morning to say hi.

Look for SIGIR coverage soon!