<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-18315968</id><updated>2012-01-27T17:32:35.130-05:00</updated><category term='lingpipe'/><category term='nlp'/><category term='information retrieval'/><category term='java'/><category term='information extraction'/><category term='stemming'/><category term='personalization'/><category term='software'/><category term='local community'/><category term='chandler'/><category term='open source'/><category term='local search'/><category term='google'/><title type='text'>Jeff's Search Engine Caffè</title><subtitle type='html'>Information Retrieval research and search engine development discussion.&lt;br&gt;&lt;br&gt;</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default?start-index=101&amp;max-results=100'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>526</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-18315968.post-1109172498414699934</id><published>2011-11-08T09:30:00.000-05:00</published><updated>2011-11-08T09:30:01.307-05:00</updated><title type='text'>Notes on Strata 2011:  Entities, Relationships, and Semantics: the State of Structured Search</title><content type='html'>&lt;br /&gt;&lt;div&gt;&lt;a href="http://thenoisychannel.com/"&gt;Daniel Tunkelang&lt;/a&gt; moderated a &lt;a href="http://thenoisychannel.com/2011/11/05/entities-relationships-and-semantics-strata-ny-panel-on-the-state-of-structured-search/"&gt;panel on the state of structured search at Strata 2011&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Entities, Relationships, and Semantics: the State of Structured Search&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;iframe allowfullscreen="" frameborder="0" height="315" src="http://www.youtube.com/embed/vr1blOJxXfQ" width="560"&gt;&lt;/iframe&gt;&lt;br /&gt;I didn't attend the talk, but I watched the video and took down notes on it for future reference.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Andrew Hogue (Google NY)&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- worked on google squared&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- QA on google, NER, local search&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- (extraction is never perfect) even with a clean db, with freebase. 
&amp;nbsp;coverage isn't good, 20/200 dog breeds&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- if you try to build a search engine on top of the incomplete db, users hit the limit, fall off the cliff and get frustrated&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Tried to build user models of what people like (for Google+). &amp;nbsp;Do you like Tom Hanks, BIG? In the real-world.&lt;br /&gt;&amp;nbsp; &amp;nbsp;(Coincidentally, Google just rolled out &lt;a href="http://googleblog.blogspot.com/2011/11/google-pages-connect-with-all-things.html"&gt;Google+ Pages&lt;/a&gt; that represent entity pages)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; &amp;nbsp; --&amp;gt; if the universe isn't complete, people, entities, then they get frustrated&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; &amp;nbsp; --&amp;gt; 1) get a bigger db. &amp;nbsp;2) fall back gracefully to a world of strings (hybrid systems)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;b&gt;Breck Baldwin (alias-i)&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- go hunt down my blog post (on March 8 '09 on &lt;a href="http://lingpipe-blog.com/2009/03/08/how-breck-approaches-new-projects-in-natural-language-processing/"&gt;how to approach new NLP projects&lt;/a&gt;)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- the biggest problem is the NLP system in the head vs. reality&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- three steps: 1) take some data and annotate it. &amp;nbsp;10 examples. &amp;nbsp;force fights earlier. &amp;nbsp;#1 best thing. &amp;nbsp;#2 build simple prototypes. info flow is hard. 
&amp;nbsp;#3 eval metric that maps to the business need&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;b&gt;Evan Sandhaus (NY Times)&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- on the semantic web (3.0)&amp;nbsp;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- the semantic web is a complex implementation of good, simple ideas&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- get your toes wet with a few areas: 1) linked data, and 2) semantic markup&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- 1) linked data - all articles get categorized from a controlled vocabulary (strong ids tied to all docs). BUT - &amp;nbsp;No context to what those IDs mean. e.g. Barack Obama is the president of the United States. &amp;nbsp;Kansas City is the capital... &amp;nbsp;you need to link the external data to add new understanding.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; &amp;nbsp;-- e.g. find all articles in A1, P1 that mention presidents of the United States&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; &amp;nbsp;-- e.g. find all articles that occur near Park Slope, Brooklyn&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;2) semantic markup (RDFa, microformats, rich snippets). &amp;nbsp;They use the rNews vocab as part of schema.org.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;b&gt;Wlodek Zadrozny (IBM. 
&amp;nbsp;Watson)&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- what are the open problems in QA&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Trying to detect relations that occur in the candidate passages that are retrieved (relevant to the question)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Then scores and ranks the candidate answers. &amp;nbsp;Some of it in RDF data. &amp;nbsp;Confidences are important because wrong answers are penalized.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;keys to success: 1) data, 2) methodology, testing often &amp;nbsp;1. QA answer sets from historic archives (200k QA pairs). &amp;nbsp;2. collection data sources, and 3. test (trace) data (7k experiments, 20-700 MB per experiment). &amp;nbsp;lots of error analysis.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- medical, legal, education&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;b&gt;Questions&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;Q: NYT R&amp;amp;D. &amp;nbsp;The trend around NLP. &amp;nbsp;Certain things graduate on reliability. &amp;nbsp;What will these be over the next decade?&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; -- Andrew. &amp;nbsp;The most interesting thing is QA. &amp;nbsp;Surface answers to direct questions. &amp;nbsp;(Harvard college vs LeBron James college)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; -- statistical approaches to language (when do we have a good parse, vs. we don't know)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; -- Breck - classifiers are getting robust on sentiment, topic classification. 
breakthroughs in highly customized systems. &amp;nbsp;finely tuned to a domain in ways that bring lots of value.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;Query vs. Document centric&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; -- reason across documents at a meta-level. &amp;nbsp;What can you do when you have great meta-data? (we have hand-checked, clean data)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; -- in Watson, an alternative to high-quality hand-curated data is to augment existing sources with data from the web&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;(see&amp;nbsp;Statistical Source Expansion for Question Answering from Nico Schlaefer at CIKM 2011)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;QA on the open web&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Problem - not enough information from users. &amp;nbsp;People don't ask full NLP questions (30 to 1)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;- Is there an answer? &amp;nbsp;(Google wins by giving people documents and presenting many possible answers)&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;Evan - the real-time metadata is needed for the website. &amp;nbsp;They use a rule-based information extraction system which suggests terms they might want to suggest. &amp;nbsp;Then the librarians review the producers' tags. 
&amp;nbsp;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;Breck - Recall is hard. &amp;nbsp;In NER and others.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&lt;b&gt;Overall Summary&lt;/b&gt;&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Wlodek - QA depends on having the data: 1) training/test data, 2) sources, and 3) system tests&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Evan - Structured data is valuable to get out there, rNews and schema.org. &amp;nbsp;Publishers should publish it! &amp;nbsp;It will be a game changer.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Breck - 1) annotate your data before you do it. 2) have an eval metric, and 3) lingpipe is free, so use it.&lt;/div&gt;&lt;div style="font-family: arial; font-size: small;"&gt;&amp;nbsp;- Andrew - (involved in schema.org, freebase). &amp;nbsp;Share your data. &amp;nbsp;Get it out there. 
&amp;nbsp;And -- Ask longer queries!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1109172498414699934?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1109172498414699934/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/11/notes-on-strata-2011-entities.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1109172498414699934'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1109172498414699934'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/11/notes-on-strata-2011-entities.html' title='Notes on Strata 2011:  Entities, Relationships, and Semantics: the State of Structured Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://img.youtube.com/vi/vr1blOJxXfQ/default.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5451085197245481695</id><published>2011-10-27T11:57:00.003-04:00</published><updated>2011-10-27T12:01:16.079-04:00</updated><title type='text'>CIKM 2011 Industry: Toward Deep Understanding of User Behavior on the Web</title><content type='html'>&lt;b&gt;Toward Deep Understanding of User Behavior on the Web&lt;/b&gt;&lt;br /&gt;Vanja Josifovski, Yahoo! 
Research&lt;br /&gt;&lt;br /&gt;Where is user understanding going?&lt;br /&gt;&lt;br /&gt;What is the future of the web?&lt;br /&gt;&amp;nbsp;- prevalent - everyone and everything&lt;br /&gt;&amp;nbsp;- mutual understanding&lt;br /&gt;&lt;br /&gt;Personalized laptops&lt;br /&gt;&lt;br /&gt;Personalization today&lt;br /&gt;&amp;nbsp;- Search personalization. &amp;nbsp;low entropy of intent. &amp;nbsp;Difficult to improve over the baseline&lt;br /&gt;&amp;nbsp;--&amp;gt; effects are small in practice&lt;br /&gt;&lt;br /&gt;Content recommendation and ad targeting&lt;br /&gt;&amp;nbsp;- High entropy of intent&lt;br /&gt;&amp;nbsp;- Still very crude with relatively low success rates&lt;br /&gt;&lt;br /&gt;What do we need to move to the next level?&lt;br /&gt;&amp;nbsp;- more data, better reasoning, and scale&lt;br /&gt;&lt;br /&gt;Data today&lt;br /&gt;&amp;nbsp;- searches, page views&lt;br /&gt;&amp;nbsp;- connections: friends, followers, and others&lt;br /&gt;&amp;nbsp;- tweets&lt;br /&gt;&lt;br /&gt;The data we don't have&lt;br /&gt;&amp;nbsp;- jetlagged? need a run? need a pint? worried about government debt?&lt;br /&gt;&amp;nbsp;- the observable state is very thin&lt;br /&gt;&lt;br /&gt;How to get more user data?&lt;br /&gt;&amp;nbsp;- Only with added value to the user&lt;br /&gt;&amp;nbsp;- Must be motivated to provide their data&lt;br /&gt;&lt;br /&gt;Privacy is not dead, it's hibernating&lt;br /&gt;&amp;nbsp;- the impact of data leaks online is relatively small&lt;br /&gt;&lt;br /&gt;Methods&lt;br /&gt;&amp;nbsp; - State of the art as we know it. 
&lt;br /&gt;&amp;nbsp; - Popular methods that seem to work well in practice&lt;br /&gt;&amp;nbsp; --&amp;gt; Learn relationship between features rij = xiCzj&lt;br /&gt;&amp;nbsp; --&amp;gt; Dimensionality reduction (random, topical models, recommender systems rij = uivj)&lt;br /&gt;&amp;nbsp; --&amp;gt; Use of external knowledge: smoothing&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; --&amp;gt; taxonomies&lt;br /&gt;&lt;br /&gt;An elaborate user topic model (Ahmed, KDD 2011, Smola et al. VLDB 2010), yet so simple&lt;br /&gt;&amp;nbsp;- the user behavior at time t is a mixture of his behavior at time t-1 + global overall behavior&lt;br /&gt;&amp;nbsp;- Very simple model&lt;br /&gt;&lt;br /&gt;Using External Knowledge&lt;br /&gt;&amp;nbsp;- Agarwal et al. KDD 2007, KDD 2010&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Is there more to it?&lt;/div&gt;&lt;div&gt;&amp;nbsp;-&amp;gt; What is the relative merit of the methods?&lt;/div&gt;&lt;div&gt;&amp;nbsp;-&amp;gt; They use the data in the same way and are mathematically very similar&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Where is the limit?&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp; -&amp;gt; what is the upper bound on the performance increase on a given dataset with this family of algorithms?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Scale&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;- Today - MapReduce is a limiting barrier for many algorithms&lt;/div&gt;&lt;div&gt;&amp;nbsp;- Need the right abstractions in parallel environments&lt;/div&gt;&lt;div&gt;&amp;nbsp;- Move towards shared in-memory, message-passing models (like Giraph)&lt;br /&gt;&amp;nbsp;-- (we'll work this out)&lt;br /&gt;&lt;br /&gt;Workflow complexity &lt;br /&gt;&amp;nbsp;- the reality bites Hatch et al. CIKM 2011. 
&amp;nbsp;Massive workflows that run for hours.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Summary&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;CIKM&lt;br /&gt;1) Deep user understanding - the tale of three communities&lt;br /&gt;&lt;br /&gt;IR:&lt;br /&gt;&amp;nbsp;- Good formalisms that function in practice&lt;br /&gt;&amp;nbsp;- emphasis on metrics and standard collections&lt;br /&gt;&lt;br /&gt;DB&lt;br /&gt;&amp;nbsp;- seamless running of complex algorithms&lt;br /&gt;&amp;nbsp;- new parallel computation paradigms&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Towards deeper understanding&lt;br /&gt;1) get users to give you more data by providing value&lt;br /&gt;2) significantly increase the complexity of the models&lt;br /&gt;3) scale in terms of data and system complexity&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5451085197245481695?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5451085197245481695/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-toward-deep.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5451085197245481695'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5451085197245481695'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-toward-deep.html' title='CIKM 2011 Industry: Toward Deep Understanding of User Behavior on the Web'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6212552698913075625</id><published>2011-10-27T11:13:00.002-04:00</published><updated>2011-11-10T14:21:33.280-05:00</updated><title type='text'>CIKM 2011 Industry: Model-Driven Research in Social Computing</title><content type='html'>&lt;br /&gt;&lt;b&gt;Model-Driven Research in Social Computing&lt;/b&gt;&lt;br /&gt;Ed Chi&lt;br /&gt;&lt;br /&gt;Google Social Stats&lt;br /&gt;250k words per minute on blogger, 360 million words per day&lt;br /&gt;100M+ people take a social action on YouTube&lt;br /&gt;&lt;br /&gt;Google+ Stats&lt;br /&gt;40 million joined since launch&lt;br /&gt;2x-3x more likely to share content with one of their circles than to make a public post&lt;br /&gt;&lt;br /&gt;Hard to talk about because the systems are changing quite rapidly&lt;br /&gt;Ed joined Google to work on Google+&lt;br /&gt;&lt;br /&gt;Social Stream Research&lt;br /&gt;Analytics&lt;br /&gt;&amp;nbsp;- Factors impacting retweetability (IEEE Social computing)&lt;br /&gt;&amp;nbsp;- Location field of user profiles&lt;br /&gt;&lt;br /&gt;Motivation for studying languages&lt;br /&gt;&amp;nbsp;- Twitter is an international phenomenon&lt;br /&gt;&amp;nbsp;- How do users of different languages use Twitter?&lt;br /&gt;&amp;nbsp;- How do bilingual users spread information across languages?&lt;br /&gt;&lt;br /&gt;Data Collection &amp;amp; Processing&lt;br /&gt;&amp;nbsp;- 62 M tweets (4 week), spritzer feed in april-may june 2010&lt;br /&gt;&amp;nbsp;- Language detection with Google language API + LingPipe&lt;br /&gt;&amp;nbsp;- 104 languages&lt;br /&gt;&amp;nbsp;- Top 10 languages&lt;br /&gt;&lt;br /&gt;English - 51%&lt;br /&gt;Japanese - 19 %&lt;br /&gt;Portuguese - 9.6% (mostly Brazil)&lt;br /&gt;Indonesian - 5.6%&lt;br /&gt;Spanish - 4.7%&lt;br /&gt;&lt;br /&gt;Sampled 2000 
random tweets&lt;br /&gt;&amp;nbsp;- 2 human judges for each of the top 10 languages&lt;br /&gt;&lt;br /&gt;Problems with French, German, and Malay.&lt;br /&gt;Accuracy of Language Detection&lt;br /&gt;&amp;nbsp;- Two types of errors: poor recognition for "tweet English" and for tweets with 1-2 words&lt;br /&gt;&lt;br /&gt;Korean - recommend for conversation tweets&lt;br /&gt;German - promote tweets with URLs&lt;br /&gt;&lt;br /&gt;English serves as a hub language&lt;br /&gt;&lt;br /&gt;Implications - need to understand language barriers when building a global network&lt;br /&gt;&amp;nbsp;- building a global community&lt;br /&gt;&amp;nbsp;- the need for brokers of information between languages&lt;br /&gt;&lt;br /&gt;Visible Social Signals from Shared Items (Chen, et al. CHI 2010/CHI 2011)&lt;br /&gt;- After a whole day without Wi-Fi, he would like a summary of what's happening in his social stream&lt;br /&gt;- Eddi - Summarizing Social Streams&lt;br /&gt;&amp;nbsp; --&amp;gt; What's happened since you last logged in&lt;br /&gt;&amp;nbsp; --&amp;gt; A tag cloud of entities that were mentioned&lt;br /&gt;&amp;nbsp; - A topic dashboard where tweets are organized into categories to drill into&lt;br /&gt;&lt;br /&gt;Information Gathering/Seeking&lt;br /&gt;&amp;nbsp;- The Filtering problem - I get 1,000+ things in my stream, but only have time for 10. 
&amp;nbsp;Which ones should I read?&lt;br /&gt;&lt;br /&gt;&amp;nbsp;- The Discovery Problem&lt;br /&gt;&amp;nbsp;-- millions of URLs are posted,&lt;br /&gt;&lt;br /&gt;Zerozero88.com&lt;br /&gt;&amp;nbsp;- Twitter as the platform&lt;br /&gt;&amp;nbsp;- URLs as the medium&lt;br /&gt;&amp;nbsp;- a personal newspaper that produces personal headlines&lt;br /&gt;&lt;br /&gt;URL Sources (from tweets) -&amp;gt; Topic &amp;nbsp;Relevance Model, and Social Network Model&lt;br /&gt;&lt;br /&gt;URL Sources&lt;br /&gt;&amp;nbsp;- Considering all URLs was impossible&lt;br /&gt;&amp;nbsp;-- FoF - URLs from followee-of-followers&lt;br /&gt;&amp;nbsp; --&amp;gt; Social local news is better&lt;br /&gt;- Popular - URLs that are popular across the whole of Twitter&lt;br /&gt;&amp;nbsp; &amp;nbsp;--&amp;gt; popular news is better&lt;br /&gt;&lt;br /&gt;Topic Relevance Model&lt;br /&gt;&amp;nbsp;- A user tweets about things, which creates a term vector profile.&lt;br /&gt;&amp;nbsp;- Cosine similarity with URLs&lt;br /&gt;&amp;nbsp;- Topic Profile of URLs - Built from tweets that contain the URL, but tweets are short and RT makes word frequencies goofy. &lt;br /&gt;&amp;nbsp;- Adopt a term expansion technique: extract nouns from the tweet and feed them into a Wikipedia search engine as a topic detection technique&lt;br /&gt;&lt;br /&gt;Topic Profile of User&lt;br /&gt;&amp;nbsp;- Self-topic&lt;br /&gt;&amp;nbsp;- Information producer - the things they tweet about&lt;br /&gt;&amp;nbsp;- Information gatherer - what they like to read&lt;br /&gt;&amp;nbsp;- Build profiles from froms and aggregate them.&lt;br /&gt;&lt;br /&gt;Social Module&lt;br /&gt;&amp;nbsp;- Take FoF neighborhood, and count the votes for a URL&lt;br /&gt;&amp;nbsp;- Simple counting doesn't work very well.&lt;br /&gt;&amp;nbsp;- Votes are weighted using social network structure&lt;br /&gt;&lt;br /&gt;Study Design&lt;br /&gt;&amp;nbsp;- Each subject evaluates 5 URL recommendations from each of the 12 algorithms. 
&amp;nbsp;Show 60 URLs in a random order and ask for a binary rating.&lt;br /&gt;&lt;br /&gt;Summary of Results&lt;br /&gt;&amp;nbsp;- Global popularity (1%) -- 32.50% are relevant, not bad, but not good enough&lt;br /&gt;&amp;nbsp;- FoF only - 33% - naive by itself without voting doesn't work great&lt;br /&gt;&amp;nbsp;- FoF voting method - 65% (social voting only)&lt;br /&gt;&amp;nbsp;- Popularity voting - 67%&lt;br /&gt;&amp;nbsp;- FoF Self-Vote - 72% best performing&lt;br /&gt;&lt;br /&gt;Algorithms differ not only in accuracy!&lt;br /&gt;&amp;nbsp;- Relevance vs. Serendipity in recommendations (tension between discovery and affirming aspect)&lt;br /&gt;&amp;nbsp;-&amp;gt; "What I crave is surprising, interesting, whimsy" this is where the value is&lt;br /&gt;&amp;nbsp;-&amp;gt; Two elements to surprise: 1) have I seen this before, 2) non-obvious relationships between things&lt;br /&gt;&lt;br /&gt;Design Rule&lt;br /&gt;- Interaction costs determine the number of people who participate&lt;br /&gt;&amp;nbsp;- Reduce the interaction costs, then you can get a lot more people into the system&lt;br /&gt;&amp;nbsp;- For Google+ this is key to deliver this to people&lt;br /&gt;&lt;br /&gt;Q&amp;amp;A:&lt;br /&gt;Japanese crams more information into a tweet. 
&amp;nbsp;It is used more for conversation than broadcast in these environments&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6212552698913075625?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6212552698913075625/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-model-driven.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6212552698913075625'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6212552698913075625'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-model-driven.html' title='CIKM 2011 Industry: Model-Driven Research in Social Computing'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-9122086345583513253</id><published>2011-10-27T06:35:00.001-04:00</published><updated>2011-10-27T07:06:58.725-04:00</updated><title type='text'>CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms</title><content type='html'>&lt;br /&gt;&lt;b&gt;Experiences Evolving a New Analytical Platform: What Works and What's Missing&lt;/b&gt;&lt;br /&gt;Jeff Hammerbacher, Cloudera&lt;br /&gt;&lt;br /&gt;Built the infrastructure team at Facebook, 0 to 2 PB of data&lt;br /&gt;&lt;br /&gt;Take the infrastructure and make it available as open 
source.&lt;br /&gt;&lt;br /&gt;Philosophy&lt;br /&gt;The true challenges in the task of data mining: &amp;nbsp;creating a data set with the relevant and accurate information, and determining the appropriate analysis techniques&lt;br /&gt;&lt;br /&gt;Exploratory data processing (IBM)&lt;br /&gt;&lt;br /&gt;Taught the &lt;a href="http://datascienc.es/"&gt;data science course at Berkeley&lt;/a&gt; earlier this year&lt;br /&gt;&lt;br /&gt;1) Store all your organization's data in one place&lt;br /&gt;&amp;nbsp; - data first, questions later&lt;br /&gt;&amp;nbsp; - store first, structure later&lt;br /&gt;&lt;br /&gt;Engineers are constrained when you force them to stop and model the data, which is constantly evolving.&lt;br /&gt;&lt;br /&gt;Raw storage: $0.4 / GB ($67 for a 2 TB disk), Single HDFS instance &amp;gt; 50 PB on commodity hardware in one data center&lt;br /&gt;&lt;br /&gt;&amp;nbsp;Enable everyone to party on the data. &amp;nbsp;Use files because developers are not analysts.&lt;br /&gt;&lt;br /&gt;Like the LAMP stack, there is a coherent analytical data management stack&lt;br /&gt;&lt;br /&gt;Better underlying abstractions&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Platform - Substrate&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- commodity servers (a big warehouse)&lt;br /&gt;&amp;nbsp;-- &lt;a href="http://opencompute.org/"&gt;open compute project&lt;/a&gt; (FB open source)&lt;br /&gt;&amp;nbsp;- open source OS&lt;br /&gt;&amp;nbsp; -- Linux&lt;br /&gt;&amp;nbsp;- Open source config management&lt;br /&gt;&amp;nbsp; -- Puppet, Chef&lt;br /&gt;&amp;nbsp;- Coordination service&lt;br /&gt;&amp;nbsp; -- ZooKeeper&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Platform - Storage&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- Distributed, schemaless storage&lt;br /&gt;&amp;nbsp;--&amp;gt; HDFS, Ceph (UCSC), MapR&lt;br /&gt;&amp;nbsp;- Append-only table storage and metadata&lt;br /&gt;&amp;nbsp; --&amp;gt; Avro, RCFile, HCatalog (Also: Thrift, Protocol Buffers)&lt;br /&gt;&amp;nbsp;- Mutable table storage and metadata&lt;br 
/&gt;&amp;nbsp;-- HBase&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Compute&lt;/b&gt;&lt;br /&gt;- Cluster resource management&lt;br /&gt;&amp;nbsp;-- YARN (inter-job scheduling, like grid engine, for data-intensive computing)&lt;br /&gt;&amp;nbsp;-- Mesos&lt;br /&gt;- Processing Frameworks&lt;br /&gt;&amp;nbsp;-- MapReduce, Hamster (MPI), Spark, Dryad, Pregel (Giraph), Dremel&lt;br /&gt;- High-level interfaces&lt;br /&gt;&amp;nbsp; -- Crunch (like Google's FlumeJava), DryadLINQ, Pig, Hive&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Platform&lt;/b&gt;&lt;br /&gt;Integration&lt;br /&gt;- Tool access&lt;br /&gt;- Data ingest&lt;br /&gt;&amp;nbsp;-- Sqoop, Flume&lt;br /&gt;(Document ingest is still an area that needs work. &amp;nbsp;There are crawlers, but they're still immature)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Trends&lt;/b&gt;&lt;br /&gt;fat servers with fat pipes&lt;br /&gt;2U, 24 GB RAM, 12 drives (bigger nodes)&lt;br /&gt;OS support for isolation (VMs have downsides)&lt;br /&gt;Linux containers&lt;br /&gt;&amp;nbsp; -- Google contributed initial patches, used for Borg&lt;br /&gt;Local file system improvements&lt;br /&gt;&amp;nbsp;-- btrfs&lt;br /&gt;&lt;br /&gt;language&lt;br /&gt;&amp;nbsp;- scalaql&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-9122086345583513253?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/9122086345583513253/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-industry-talk-jeff-hammerbacher-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/9122086345583513253'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/18315968/posts/default/9122086345583513253'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-industry-talk-jeff-hammerbacher-on.html' title='CIKM Industry talk: Jeff Hammerbacher on Analytical Platforms'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1543608984658528634</id><published>2011-10-27T05:50:00.001-04:00</published><updated>2011-10-27T06:16:34.321-04:00</updated><title type='text'>CIKM 2011 Industry: Freebase: A Rosetta Stone for Entities</title><content type='html'>John Giannandrea, Google&lt;br /&gt;&lt;br /&gt;&lt;b&gt;What is Freebase?&lt;/b&gt;&lt;br /&gt;&amp;nbsp;-&amp;gt; A machine representation of things in the world. (Person, place, thing in the world)&lt;br /&gt;&amp;nbsp;-&amp;gt; Instead of working in the domain of text, we work in the domain of strongly identified things&lt;br /&gt;&amp;nbsp;-&amp;gt; Each object has an identifier, once you have it, it will also refer to an identity&lt;br /&gt;&lt;br /&gt;Properties - relationships between objects&lt;br /&gt;&amp;nbsp;- edges between the entity ids&lt;br /&gt;&amp;nbsp;- edges are directional&lt;br /&gt;&amp;nbsp;- properties create meaning&lt;br /&gt;&lt;br /&gt;Graphs&lt;br /&gt;&amp;nbsp;- encode knowledge&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Types&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- a categorization of an entity&lt;br /&gt;&amp;nbsp;- An entity can have multiple Types in Freebase&lt;br /&gt;&amp;nbsp;- "Co-types" - Types are a mix-in&lt;br /&gt;&amp;nbsp;- e.g. Arnold (politician, actor, athlete)&lt;br /&gt;&lt;br /&gt;The real world is extremely messy. 
&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Knowledge you can use&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- the current state&lt;br /&gt;&amp;nbsp;- 25 Million topics (entities)&lt;br /&gt;&amp;nbsp;- 500 million connections&lt;br /&gt;&amp;nbsp;- 2x the size it was last year&lt;br /&gt;&lt;br /&gt;&amp;gt;= 10 instances, 5790 types&lt;br /&gt;1772 commons (survived community scrutiny)&lt;br /&gt;4019 bases (people created)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Identity matching&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- reconciliation at scale&lt;br /&gt;&amp;nbsp;- Wikipedia, WordNet, Library of Congress terms, Stanford library&lt;br /&gt;&amp;nbsp;- any large open source term they have tried to import into Freebase&lt;br /&gt;&lt;br /&gt;-&amp;gt; How? Whatever works. &amp;nbsp;MapReduce, Google Refine, and human judgment&lt;br /&gt;-&amp;gt; This is possible if you know what an entity is. &amp;nbsp;(IBM example)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Freebase as a Rosetta Stone&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- keys&lt;br /&gt;&amp;nbsp;- behind the websites, there is a structured database with keys (relational db with tables that have primary keys)&lt;br /&gt;&amp;nbsp;- all of these keys leak out onto the web, "shakira" in the url&lt;br /&gt;&amp;nbsp;- In the Freebase system they try to collect these keys to link the entity to external websites&lt;br /&gt;&lt;br /&gt;URLs and Freebase keys&lt;br /&gt;&amp;nbsp;-&amp;nbsp;accrete&amp;nbsp;the URLs and keys onto the object&lt;br /&gt;&amp;nbsp;- Names are just another key (the entities themselves are the same across languages)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Schema&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- Freebase is schemaless&lt;br /&gt;&amp;nbsp;- It is fundamentally based on a graph store&lt;br /&gt;&amp;nbsp;- Schema is described in the graph itself, just as the data ("Type: type")&lt;br /&gt;&amp;nbsp;- The person "type" is an entity with an id, "Type:type:person"&lt;br /&gt;&amp;nbsp;- Put the predicates into the graph system so that it can be 
updated&lt;br /&gt;&lt;br /&gt;&lt;a href="http://wiki.freebase.com/wiki/New_Freebase_API"&gt;Google API&lt;/a&gt; to Schema Data&lt;br /&gt;&amp;nbsp;- MQL read (a query language for inspecting the Freebase graph)&lt;br /&gt;&lt;br /&gt;How does Google use Freebase?&lt;br /&gt;&amp;nbsp;- "I work in the search division"&lt;br /&gt;&lt;br /&gt;Time in Freebase&lt;br /&gt;&amp;nbsp;- everything has a start date and end date&lt;br /&gt;&lt;br /&gt;How good is the quality?&lt;br /&gt;&amp;nbsp;- varies depending on the entities (e.g. presidents are high quality, but for obscure books there may be some duplicates)&lt;br /&gt;&amp;nbsp;-&amp;gt; 99% accuracy&lt;br /&gt;&amp;nbsp;-&amp;gt; curate the top 100k entries&lt;br /&gt;&amp;nbsp;-&amp;gt; we'd rather not import data than import data that is bad&lt;br /&gt;&amp;nbsp;-&amp;gt; (We imported the open library catalog, which has lots of duplicates. &amp;nbsp;never again.)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Q&amp;amp;A:&lt;/b&gt;&lt;br /&gt;Recall&lt;br /&gt;&amp;nbsp;- 25 M entities, 2x from last summer, 100M by next year&lt;br /&gt;&amp;nbsp;- It depends on the domain&lt;br /&gt;&amp;nbsp;- For common queries in search engines, it does very well&lt;br /&gt;&amp;nbsp;- search engines handle lots of queries for celebrities, common places on earth&lt;br /&gt;&lt;br /&gt;Confidence on facts&lt;br /&gt;&amp;nbsp;- common&amp;nbsp;criticisms&amp;nbsp;1) it's not a real database, 2) the assertions are not given weight, it doesn't capture uncertain facts&lt;br /&gt;&amp;nbsp;- you can create a mediated way of doing that in the schema&lt;br /&gt;&amp;nbsp;- how do you deal with controversial facts? &amp;nbsp;1) careful with type definitions. &amp;nbsp;countries are hard. (use UN definition) &amp;nbsp;unusual categories. &amp;nbsp;FIFA has its own definition. &amp;nbsp;World cups have been played with countries that don't exist. 
&lt;br /&gt;&amp;nbsp;- for head entities, there are a large number of people arguing&lt;br /&gt;&lt;br /&gt;Ian - quality 99% accuracy is still 1 million incorrect for 100M.&lt;br /&gt;&amp;nbsp;- sampling rate for how you draw entities. &amp;nbsp;you have a probability of confidence.&lt;br /&gt;&amp;nbsp; (two kinds of sampling: 1) random sample, 2) traffic weighted sampling based on popularity)&lt;br /&gt;&amp;nbsp;- 99% at the 95th percentile&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1543608984658528634?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1543608984658528634/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-freebase-rosetta.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1543608984658528634'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1543608984658528634'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-industry-freebase-rosetta.html' title='CIKM 2011 Industry: Freebase: A Rosetta Stone for Entities'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6850296420000159656</id><published>2011-10-26T04:30:00.000-04:00</published><updated>2011-10-26T04:31:33.239-04:00</updated><title 
type='text'>CIKM 2011 Keynote II: Justin Zobel on Biomedicine</title><content type='html'>&lt;br /&gt;&lt;table style="border-collapse: collapse; margin-bottom: 1em; margin-left: 0px; margin-right: 0px; margin-top: 1em;"&gt;&lt;tbody style="border-top-color: initial; border-top-style: none; border-top-width: initial;"&gt;&lt;tr&gt;&lt;td align="justify" valign="top"&gt;&lt;strong&gt;Data, Health, and Algorithmics: Computational Challenges for Biomedicine&lt;br /&gt;by &lt;a href="http://ww2.cs.mu.oz.au/~jz/"&gt;Justin Zobel&lt;/a&gt;&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;(I missed the first part of this talk)&lt;br /&gt;&lt;br /&gt;The Central Dogma&lt;br /&gt;- DNA consists of a sequence of four bases, A, C, G, T&lt;br /&gt;- The concept of a gene is now uncertain&lt;br /&gt;&lt;br /&gt;SNP analysis (Nature 00)&lt;br /&gt;&amp;nbsp;- used PCA&lt;br /&gt;&lt;br /&gt;Revolution&lt;br /&gt;&amp;nbsp;- Read DNA directly&lt;br /&gt;&amp;nbsp;- $1000 by the end of 2012&lt;br /&gt;&lt;br /&gt;The data is errorful, incomplete, voluminous, ambiguous.&lt;br /&gt;&lt;br /&gt;Also, reads are not very random&lt;br /&gt;&lt;br /&gt;Within a few years there will be DNA databases of 10-100 terabases, which we will use to find matches to short read data&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Challenge: Assembly&lt;br /&gt;&amp;nbsp;- Imagine a million copies of a phone book, a million pages long&lt;br /&gt;&amp;nbsp;- Shredded into tiny pieces, each no more than 20 or 30 characters&lt;br /&gt;&amp;nbsp;- 99.999% are thrown away.&lt;br /&gt;&amp;nbsp;- The task: reconstruct the phone book from the billion remaining pieces&lt;br /&gt;&lt;br /&gt;The Problem of assembling short reads.&lt;br /&gt;&lt;br /&gt;Genomes are a combinatorial minefield&lt;br /&gt;&amp;nbsp;- vast quantities of repeated material&lt;br /&gt;&lt;br /&gt;The genome is cheap, but the analysis is expensive&lt;br /&gt;&lt;br /&gt;de Bruijn graph&lt;br /&gt;&amp;nbsp;- Divide 7-base 
reads into kmers (3mers)&lt;br /&gt;&amp;nbsp;- each node is a kmer, each arc is an overlap&lt;br /&gt;&lt;br /&gt;The graph is about 4 terabytes, and it needs to be in memory.&lt;br /&gt;&lt;br /&gt;Succinct 'Gossamer' representation&lt;br /&gt;&amp;nbsp;- fast access with simple index&lt;br /&gt;&amp;nbsp;- space down by a factor &amp;gt; 10&lt;br /&gt;&amp;nbsp;- cuts the storage down to 32 GB&lt;br /&gt;&lt;br /&gt;DNA dictionaries&lt;br /&gt;&amp;nbsp;- there is no grammar for DNA that would allow construction of a parser&lt;br /&gt;&amp;nbsp;- a dictionary of all possible tokens would be impossibly large&lt;br /&gt;&lt;br /&gt;Dictionary - any representative string&lt;br /&gt;&amp;nbsp;-&amp;gt; solves the text compression problem for DBs&lt;br /&gt;&lt;br /&gt;Genetics for diagnosis&lt;br /&gt;&amp;nbsp;-&amp;gt; inference diagnosis based on symptoms replaced by ones based on DNA analysis&lt;br /&gt;&amp;nbsp;-&amp;gt; drug effect and health outcome determined directly from historical health records&lt;br /&gt;&amp;nbsp;- Built to simplify, improve, and automate bureaucratic decisions&lt;br /&gt;&lt;br /&gt;'Guardian Angel' clinical decisions&lt;br /&gt;&amp;nbsp;- Electronic health records analyzed on the fly to check whether a mistake is about to be made.&lt;br /&gt;&lt;br /&gt;Health at Home&lt;br /&gt;&amp;nbsp;-&amp;gt; health monitoring deeply embedded in our e-lifestyle activities&lt;br /&gt;&amp;nbsp;-&amp;gt; webcam that determines how well you are based on your skin&lt;br /&gt;&amp;nbsp;-&amp;gt; iPhone app that tells if you're drunk&lt;br /&gt;&lt;br /&gt;Computer Science vs health research&lt;br /&gt;&amp;nbsp;- many algorithmic solutions are not biologically meaningful&lt;br /&gt;&amp;nbsp;- spend money on IT and the number of errors is decreased (It saves lives.)&lt;br /&gt;&lt;br /&gt;13 of the top 25 questions (Science mag July 2005) are about DNA&lt;br /&gt;&lt;br /&gt;The real way to have an impact on medicine is in the clinic -&amp;gt; text 
mining records. &amp;nbsp;Helping doctors make decisions.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6850296420000159656?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6850296420000159656/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-keynote-ii-justin-zobel-on.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6850296420000159656'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6850296420000159656'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-keynote-ii-justin-zobel-on.html' title='CIKM 2011 Keynote II: Justin Zobel on Biomedicine'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8689496840965902371</id><published>2011-10-25T03:45:00.002-04:00</published><updated>2011-10-25T04:28:47.385-04:00</updated><title type='text'>CIKM 2011 Keynote: David Karger</title><content type='html'>&lt;br /&gt;&lt;b&gt;Creating User Interfaces that Entice People to Manage Better Information&lt;/b&gt;&lt;br /&gt;By &lt;a href="http://people.csail.mit.edu/karger/"&gt;David Karger&lt;/a&gt; (MIT)&lt;br /&gt;&lt;div&gt;&lt;br /&gt;History:&lt;/div&gt;&lt;div&gt;HayStack - Per user Information Environments 
(1999)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Current State of IKM (Information and Knowledge Management)&lt;/div&gt;&lt;div&gt;&amp;nbsp; - We take users with extremely rich landscapes of information and we give them keyboards to barely sketch their interests. &amp;nbsp;Algorithms work really really hard on that sketch. &amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;- We work hard to make computers do IKM well&lt;/div&gt;&lt;div&gt;&amp;nbsp;- People are better than computers at IKM&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Question:&lt;/div&gt;&lt;div&gt;&amp;nbsp; - In what ways can we give people the ability to manage more or better information?&lt;/div&gt;&lt;div&gt;&amp;nbsp;- How do we make them want to?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) Capture more data digitally&lt;/div&gt;&lt;div&gt;2) Collaborate to understand lecture notes&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Capture of Information Scraps &amp;nbsp;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;- The state of PIM&lt;/div&gt;&lt;div&gt;&amp;nbsp;- The desks all have computers, but we have huge piles of paper (never put into it)&lt;/div&gt;&lt;div&gt;&amp;nbsp;- 27 participants, 5 Orgs, 1 hour in situ interviews&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;#1 using computer is distracting / impossible&lt;/div&gt;&lt;div&gt;&amp;nbsp; -- people instead just grab random notes to write things down&lt;/div&gt;&lt;div&gt;&amp;nbsp; -- Interfaces for Staying in the Flow (Ben Bederson, Ubiquity 2004)&lt;/div&gt;&lt;div&gt;&amp;nbsp; -- (Being "in the zone", in the flow)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;#2 chimeras fight between apps&lt;/div&gt;&lt;div&gt;&amp;nbsp; -- Meeting notes with TODOs and follow up meetings&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;#3 Diverse information forms don't fit apps&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;Types of 
information&lt;/div&gt;&lt;div&gt;&amp;nbsp; TODOs, meeting Notes, Name and Contact information&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;#4 Want in view at right time --workflow integration&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Costs to digital capture&lt;/div&gt;&lt;div&gt;&amp;nbsp;- costs: effort to choose place, imposed schema, entry time is a distraction&lt;/div&gt;&lt;div&gt;Fixes: no organization, plain text, in the browser, cross-computer sync offline+online&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;list.it (open source micro note tool for Firefox)&lt;br /&gt;&amp;nbsp;--&amp;gt; 25,000 downloads, 16,625 registered users, 920 volunteers, 116k contributed notes&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Types of notes:&lt;/b&gt;&lt;br /&gt;TODOs, Web bookmarks, Contact information&lt;br /&gt;median time to write something is 7.4s&lt;br /&gt;median number of lines is 4&lt;br /&gt;&lt;br /&gt;35% - ease/speed&lt;br /&gt;20% simplicity&lt;br /&gt;20% direct replacement for post-its&lt;br /&gt;&lt;br /&gt;Detour: Note Science&lt;br /&gt;&amp;nbsp; -- How do people keep and access information in list.it?&lt;br /&gt;&lt;br /&gt;3 coders&lt;br /&gt;first clustered, identified 4 archetypes&lt;br /&gt;&lt;br /&gt;MISC - MIT Open Scrap Corpus (available online)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;NB: Classroom Discussion&lt;/b&gt;&lt;br /&gt;Stellar Classroom discussion tool&lt;br /&gt;&amp;nbsp;- 50 most active classes made 3275 posts&lt;br /&gt;&amp;nbsp; -- no heavily populated posts&lt;br /&gt;- Nb: forum in context &amp;nbsp;(happen in the margin of lecture notes)&lt;br /&gt;&amp;nbsp;- Highlight a section of the post, write a comment&lt;br /&gt;&amp;nbsp;--&amp;gt; Implicit context (how do I get 3 from 1)&lt;br /&gt;&lt;br /&gt;Benefits - Discuss as you read without exiting note view&lt;br /&gt;&amp;nbsp;-- Context is clear because the PDF content is there&lt;br /&gt;&amp;nbsp;-- annotations create a heat map of lecture 
notes&lt;br /&gt;&lt;br /&gt;15 classes, 4 different universities&lt;br /&gt;(Annotation required), usage of the tool doubled over the term.&lt;br /&gt;&amp;nbsp;--&amp;gt; they liked seeing that they weren't the only one that was confused.&lt;br /&gt;&amp;nbsp; --&amp;gt; rich interaction&lt;br /&gt;&lt;br /&gt;NB specific benefits&lt;br /&gt;&amp;nbsp;--&amp;gt; "Why?" &lt;br /&gt;&amp;nbsp;--&amp;gt; The social benefits outweighed the use of paper&lt;br /&gt;&lt;br /&gt;&lt;b&gt;FEEDME&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- Artificial Collaborative Filtering&lt;br /&gt;&lt;br /&gt;Vast amounts of content, how do we get the good stuff&lt;br /&gt;machine learning recommenders - users rate what they read, content recommendation, collaborative filtering (find people with similar likes, predict what they will like)&lt;br /&gt;&lt;br /&gt;Effort&lt;br /&gt;&amp;nbsp;- have to read lots of junk to train system&lt;br /&gt;&amp;nbsp;- have to spend energy now for future benefit&lt;br /&gt;&amp;nbsp;- many users won't ever get started&lt;br /&gt;&lt;br /&gt;Quality&lt;br /&gt;&amp;nbsp;- ML algorithms imperfect&lt;br /&gt;&amp;nbsp;- Deliver reading irrelevant content, worry about what is missed&lt;br /&gt;&lt;br /&gt;Alternative: People&lt;br /&gt;&lt;br /&gt;Email is dominant in information sharing&lt;br /&gt;Median 6 - people do want more relevant links&lt;br /&gt;Sharers are reluctant to spam their friends&lt;br /&gt;&amp;nbsp;(unsure of relevance, may have seen it already, too much effort)&lt;br /&gt;&lt;br /&gt;Fixes&lt;br /&gt;-&amp;gt; let them use email, reassure sender that content is relevant. &amp;nbsp;And that the recipient isn't overloaded. One-click sharing&lt;br /&gt;&lt;br /&gt;Firefox plugin&lt;br /&gt;1. 
recommend recipients to reduce time and effort for sharing&lt;br /&gt;&amp;nbsp;(uses ML to find people to recommend)&lt;br /&gt;&lt;br /&gt;One-click thanks&lt;br /&gt;&lt;br /&gt;Recommendation Algorithm&lt;br /&gt;&amp;nbsp;-- Rocchio classifier&lt;br /&gt;&lt;br /&gt;Assessment&lt;br /&gt;&amp;nbsp;- two week study for $30&lt;br /&gt;&amp;nbsp;- 60 google reader users recruited on blogs&lt;br /&gt;&amp;nbsp;- Viewed 85k posts, shared 713 posts&lt;br /&gt;&amp;nbsp;- Significant increase in sharing&lt;br /&gt;&lt;br /&gt;Recipients were happy - 80.4% of the posts contain novel content&lt;br /&gt;&lt;br /&gt;Recommendations Useful&lt;br /&gt;&lt;br /&gt;Do overload indicators help&lt;br /&gt;&amp;nbsp;- 1/3 of subjects with them said they were favorite feature&lt;br /&gt;&amp;nbsp;- 30 of shares resulted in a thanks&lt;br /&gt;&lt;br /&gt;Machine filtering&lt;br /&gt;&amp;nbsp;- have to read stuff&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Structured Data&lt;/b&gt;&lt;br /&gt;We all know structured data is good.&lt;br /&gt;it supports&lt;br /&gt;&amp;nbsp;-&amp;gt; rich visualizations, filtering, sorting, queries, merge data&lt;br /&gt;&lt;br /&gt;Epicurious (old version)&lt;br /&gt;&amp;nbsp;-&amp;gt; filter by ingredient, cuisine, part of meal&lt;br /&gt;&lt;br /&gt;Mere mortals just write text or HTML&lt;br /&gt;&lt;br /&gt;Structured data takes skill&lt;br /&gt;&amp;nbsp;- design a data model,&lt;br /&gt;&lt;br /&gt;Plain authors are left behind&lt;br /&gt;&amp;nbsp;-&amp;gt; less power to communicate effectively&lt;br /&gt;&lt;br /&gt;Coping: Information Extraction&lt;br /&gt;&amp;nbsp;- Entity Recognition, Coreference, relationship extraction&lt;br /&gt;&lt;br /&gt;Imperfect, so errors creep in.&lt;br /&gt;&lt;br /&gt;Alternative: Give regular people tools that let people author structured data&lt;br /&gt;&amp;nbsp;-&amp;gt; to communicate well&lt;br /&gt;&lt;br /&gt;Do we need this? 
Yes.&lt;br /&gt;&lt;br /&gt;Approach&lt;br /&gt;- HTML is the language of the web&lt;br /&gt;&amp;nbsp;- Extend it to talk about data&lt;br /&gt;&amp;nbsp;- Anyone authoring HTML should be able to author data and interactive visualization&lt;br /&gt;- Edit data-html in web, blogs, wikis&lt;br /&gt;&lt;br /&gt;(like spreadsheets)&lt;br /&gt;&lt;br /&gt;Publishing data is easy, just put a spreadsheet online. &amp;nbsp;rows are items, columns are properties&lt;br /&gt;&lt;br /&gt;Data&lt;br /&gt;&amp;nbsp;Items (recipes)&lt;br /&gt;&amp;nbsp;- Each has properties, Title, source magazine, publication date, etc...&lt;br /&gt;&amp;nbsp;- Visualization - a collection of a view of data items&lt;br /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;-- bar chart, sortable list, map, thumbnail set&lt;br /&gt;&lt;br /&gt;Bound to properties&lt;br /&gt;&amp;nbsp;- sort by property&lt;br /&gt;&lt;br /&gt;Facets for filtering information&lt;br /&gt;&amp;nbsp;-&amp;gt; specify a property, user clicks to select&lt;br /&gt;&amp;nbsp;-&amp;gt; templates -&amp;gt; format per item. &lt;br /&gt;&amp;nbsp;- HTML with "fill in the blanks"&lt;br /&gt;&lt;br /&gt;Key primitives of a data page&lt;br /&gt;Data - a spreadsheet&lt;br /&gt;&lt;br /&gt;Exhibit javascript library&lt;br /&gt;&lt;br /&gt;1800 websites using exhibits&lt;br /&gt;hobby stores, science&lt;br /&gt;(lots of strange hobbyists)&lt;br /&gt;Veggie guide to Glasgow&lt;br /&gt;&lt;br /&gt;Not very scalable (fast for &amp;lt; 100 items)&lt;br /&gt;&lt;br /&gt;Side effects - the data is out there. 
&amp;nbsp;(structured data is the side effect)&lt;br /&gt;&lt;br /&gt;Wibit&lt;br /&gt;Datapress - data visualization inside the blog&lt;br /&gt;DIDO - WYSIWYG editor&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion&lt;/b&gt;&lt;br /&gt;- People are powerful information managers&lt;br /&gt;In each case, it's about giving people the tools to be information managers&lt;br /&gt;&lt;br /&gt;Wait, There's more&lt;br /&gt;&amp;nbsp;--&amp;gt; manage structured data by making it look like a spreadsheet&lt;br /&gt;--&amp;gt; Atomate -&amp;gt; help users translate incoming data into structured data&lt;br /&gt;&lt;br /&gt;We work hard to make computers do IKM well,&lt;br /&gt;Don't assume people are passive IK consumers&lt;br /&gt;Give people tools that can encourage active engagement in IKM&lt;br /&gt;&lt;br /&gt;All the links are at &lt;a href="http://groups.csail.mit.edu/haystack/blog/2011/10/25/cikm-2011-keynote-user-interfaces-that-entice-people-to-manage-better-information/"&gt;haystack.csail.mit.edu/blog&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Questions:&lt;br /&gt;The success of exhibit came from why Haystack didn't succeed. &amp;nbsp;It's not the only measure of success that lots of people use a tool. 
&amp;nbsp;It's still an interesting piece of research.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8689496840965902371?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8689496840965902371/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-keynote-david-karger.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8689496840965902371'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8689496840965902371'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/10/cikm-2011-keynote-david-karger.html' title='CIKM 2011 Keynote: David Karger'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7053306324071238649</id><published>2011-09-21T14:01:00.004-04:00</published><updated>2011-09-21T14:19:17.225-04:00</updated><title type='text'>Twitter Acquires Julpan Real-Time Search Engine</title><content type='html'>Today, &lt;a href="http://www.julpan.com/"&gt;Julpan&lt;/a&gt;&amp;nbsp;(a stealth-mode search engine) &lt;a href="http://www.julpan.com/twitter_acquires_julpan.html"&gt;announced&lt;/a&gt; it is being acquired by Twitter. 
(see &lt;a href="http://techcrunch.com/2011/09/21/twitter-julpan/"&gt;TechCrunch coverage&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;There are not a lot of public details, but here are a few.&lt;br /&gt;&lt;br /&gt;Julpan is a real-time search engine based in NYC. &amp;nbsp;It focuses on analyzing social and real-time information from news, Twitter, and other sources. &amp;nbsp;It was founded in mid-2010 by &lt;a href="http://en.wikipedia.org/wiki/Ori_Allon"&gt;Ori Allon&lt;/a&gt;. &amp;nbsp;Ori is an Ex-Googler from the search quality team (Google acquired his patented thesis work called &lt;a href="http://www.unsw.edu.au/news/pad/articles/2005/sep/Orion.html"&gt;Orion&lt;/a&gt;, see the &lt;a href="http://googleblog.blogspot.com/2009/03/two-new-improvements-to-google-results.html"&gt;Google feature post&lt;/a&gt;.).&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Sadly, pending the integration with Twitter, &amp;nbsp;the Julpan search products (Newsgrep and LiveBite) are no longer available online.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7053306324071238649?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7053306324071238649/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/09/twitter-acquires-julpan-real-time.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7053306324071238649'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7053306324071238649'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/09/twitter-acquires-julpan-real-time.html' title='Twitter Acquires Julpan Real-Time Search 
Engine'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-510173700965496102</id><published>2011-07-26T13:11:00.002-04:00</published><updated>2011-07-26T13:19:24.622-04:00</updated><title type='text'>SIGIR 2011 Best Paper Award</title><content type='html'>The SIGIR 2011 best paper awards were announced.  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The winner is:&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.sigir2011.org/papershow.asp?PID=102"&gt;Find It If You Can, A Game for Modeling Different Types of Web Search Success Using Interaction Data&lt;/a&gt; &lt;/div&gt;&lt;div&gt;M. Ageev, &lt;a href="http://www.mathcs.emory.edu/~qguo3/"&gt;Q. Guo&lt;/a&gt;, D. Lagun, and &lt;a href="http://www.mathcs.emory.edu/~eugene/"&gt;E. 
Agichtein&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Honorable mention goes to:&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.sigir2011.org/papershow.asp?PID=84"&gt;Enhanced Results for Web Search&lt;/a&gt;&lt;/div&gt;&lt;div&gt;Kevin Haas, Peter Mika, Paul Tarjan and Roi Blanco&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;See also the notes from the SIGIR 2011 keynote addresses by &lt;a href="http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-talk-qi-lu-and.html"&gt;Qi Lu&lt;/a&gt; and &lt;a href="http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-chenxiang-zhai.html"&gt;ChengXiang Zhai&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-510173700965496102?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/510173700965496102/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-best-paper-award.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/510173700965496102'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/510173700965496102'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-best-paper-award.html' title='SIGIR 2011 Best Paper Award'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2621771769436735884</id><published>2011-07-26T12:07:00.002-04:00</published><updated>2011-07-26T12:21:54.214-04:00</updated><title type='text'>SIGIR 2011 Keynote ChengXiang Zhai: Beyond Search: Statistical topic models for text analysis</title><content type='html'>&lt;a href="http://www.cs.uiuc.edu/~czhai/"&gt;ChengXiang Zhai&lt;/a&gt;  gave the second &lt;a href="http://www.sigir2011.org/keynotes.htm"&gt;keynote address&lt;/a&gt; at &lt;a href="http://www.sigir2011.org/"&gt;SIGIR 2011&lt;/a&gt; held this week in Beijing.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;Here are the notes from my friend and fellow UMass grad student &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael Bendersky&lt;/a&gt; (follow him on &lt;a href="https://twitter.com/#!/bemikelive"&gt;@bemikelive&lt;/a&gt;).  Also, be sure to check out his workshop on &lt;a href="http://ciir.cs.umass.edu/sigir2011/qru/"&gt;Query Representation and Understanding&lt;/a&gt;. &lt;br /&gt;&lt;br /&gt;Be sure to read Michael's notes from Qi Lu's first keynote talk on the &lt;a href="http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-talk-qi-lu-and.html"&gt;Future of the Web &amp;amp; Search&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;Beyond Search: Statistical topic models for text analysis&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Complex Task Completion Flow&lt;br /&gt;- Multiple Searches → Information Synthesis &amp;amp; Analysis → Task Completion&lt;br /&gt;- Sometimes the process above is iterative&lt;br /&gt;&lt;br /&gt;Examples of complex tasks&lt;br /&gt;• What laptop to buy?&lt;br /&gt;• What’s hot in database research?&lt;br /&gt;• What do people say in blogs on a certain topic? 
How does the topic coverage change over time?&lt;br /&gt;• What do people like/dislike about “Da Vinci Code”?&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Can we model complex tasks in a general way?&lt;/li&gt;&lt;li&gt;Can we solve them in a unified framework?&lt;/li&gt;&lt;li&gt;How do we bring users into the loop?&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Proposed solution – Statistical Topic Models&lt;br /&gt;- Generative model&lt;br /&gt;- Captures language model shifts based on topics&lt;br /&gt;- Language model serves as a convenient topic representation&lt;br /&gt;- Every document has a lot of contextual data (metadata)&lt;br /&gt;  o Author&lt;br /&gt;  o Communities&lt;br /&gt;  o Location&lt;br /&gt;  o Author’s occupation&lt;br /&gt;  o User labels&lt;/li&gt;&lt;li&gt;Any combination of contextual data can induce a partition over the documents&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;We should make topics depend on context variables&lt;br /&gt;o Text is generated from a contextualized PLSA model&lt;br /&gt;o Fitting such a model enables a wide range of analysis tasks on a document&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Applications of contextual topic models&lt;br /&gt;o Social Network Analysis can help derive more coherent topic models&lt;br /&gt;o Opinion mining – integration of expert reviews and personal opinions&lt;br /&gt;   • Take into account the well-formed and faceted design of expert reviews to impose context on personal opinions, which come from a variety of unstructured sources (blogs, micro-blogs, review sites, comments)&lt;br /&gt;   • Derive integrated expert/personal opinions on different aspects&lt;br /&gt;   • Infer aspect ratings and weights&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Using topic models to go from search engine to analysis engine&lt;br /&gt;o Tasks&lt;br /&gt;   • What is a task?&lt;br /&gt;   • How is a task different from information need/intent?&lt;br /&gt;   • How do we help users to express tasks?&lt;br 
/&gt;o What does ranking mean in an analysis engine?&lt;br /&gt;o How to evaluate the output of the analysis engine?&lt;br /&gt;o Operators to allow analysis of search results&lt;br /&gt;-- Select, Split, Intersection/Union, Interpret, Rank, Compare&lt;br /&gt;• Operators can be combined, similar to SQL/InQuery languages&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2621771769436735884?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2621771769436735884/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-chenxiang-zhai.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2621771769436735884'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2621771769436735884'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-chenxiang-zhai.html' title='SIGIR 2011 Keynote ChengXiang Zhai: Beyond Search: Statistical topic models for text analysis'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1911934323829169955</id><published>2011-07-26T11:23:00.003-04:00</published><updated>2011-07-26T12:05:03.661-04:00</updated><title type='text'>SIGIR 2011 Keynote Talk: Qi Lu and The 
Future of the Web &amp; Search</title><content type='html'>&lt;div&gt;&lt;a href="http://www.microsoft.com/presspass/exec/lu/"&gt;Qi Lu&lt;/a&gt;, the president of Microsoft's Online Services Division, gave the first keynote address at SIGIR 2011, happening this week in Beijing.  He laid out Microsoft's vision for the future.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am in San Francisco at Twitter, but luckily my friend and fellow UMass grad student &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael Bendersky&lt;/a&gt; is taking notes (follow him on &lt;a href="https://twitter.com/#!/bemikelive"&gt;@bemikelive&lt;/a&gt;).  Also, be sure to check out his workshop on &lt;a href="http://ciir.cs.umass.edu/sigir2011/qru/"&gt;Query Representation and Understanding&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Future of the Web &amp;amp; Search&lt;br /&gt;&lt;/b&gt;&lt;ul&gt;&lt;li&gt;Agenda&lt;br /&gt;- Perspective of the web/IT industry&lt;br /&gt;- Future of search&lt;br /&gt;- Role of IR&lt;br /&gt;- Challenges&lt;br /&gt;- Opportunity&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The heritage: web of documents&lt;br /&gt;The future:&lt;br /&gt;- Social web - Facebook profiles, like buttons&lt;br /&gt;- Geospatial web: Mobile devices&lt;br /&gt;- Temporal web: Collection of information over time, real-time microblogging&lt;br /&gt;- Application web: Fundamental design of the browser doesn’t support new application models&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;IT industry of the future&lt;br /&gt;- Devices + cloud services&lt;br /&gt;- Changing the user intent capturing from rigid keyboard/mouse/keywords combination to more natural modalities&lt;br /&gt;  • Understanding the natural language&lt;br /&gt;  • Voice recognition&lt;br /&gt;     - On mobile devices&lt;br /&gt;     - In living room products&lt;br /&gt; • Body gestures - Microsoft Kinect&lt;br /&gt; • Image/Audio/Video capturing&lt;br /&gt;&lt;br 
/&gt;&lt;/li&gt;&lt;li&gt;Vision of the future of search&lt;br /&gt;o Empower people with knowledge&lt;br /&gt;o Re-organize the web for search to unlock the full potential of the web&lt;br /&gt;  • Better discovery&lt;br /&gt;  • More informed decisions&lt;br /&gt;  • Easier task completions&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Role of IR&lt;br /&gt;o Understanding user intent&lt;br /&gt;o Modeling the web of the world&lt;br /&gt;    • People/places/things&lt;br /&gt;    • Relations&lt;br /&gt;o Task completion &amp;amp; decision making&lt;br /&gt;o Incentive engineering for making people do more things on the web&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Challenges&lt;br /&gt;o Measurement, evaluation &amp;amp; self-correction&lt;br /&gt;    • Some things are inherently hard to evaluate: objectivity, design, opinions&lt;br /&gt;    • Search results have a profound influence on the way people perceive the world&lt;br /&gt;          • It is important that they have no inherent bias or skew&lt;br /&gt;&lt;br /&gt;o Privacy&lt;br /&gt;&lt;br /&gt;o Lack of&lt;br /&gt;    • Tools &amp;amp; understanding in existing disciplines&lt;br /&gt;    • Training &amp;amp; development of cross-disciplinary talent&lt;br /&gt;&lt;br /&gt;o Barriers for academic research&lt;br /&gt;    • Access to data&lt;br /&gt;    • Computing infrastructure&lt;br /&gt;    • Funding&lt;br /&gt;    • Not just based on company agenda&lt;br /&gt;    • Funding projects based on pure creativity&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Opportunities&lt;br /&gt;    • Opportunities for key breakthroughs in the areas of&lt;br /&gt;        • Serendipitous discovery (e.g. 
Hunch.com)&lt;br /&gt;        • Information theory for the age of the web and social networks&lt;br /&gt;        • Science of big data&lt;br /&gt;&lt;br /&gt;   • Broadening collaborations&lt;br /&gt;      • Research&lt;br /&gt;      • Development (API/tools)&lt;br /&gt;      • Investment (Training &amp;amp; Development)&lt;br /&gt;&lt;br /&gt;  • Vibrant community&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Follow &lt;a href="https://twitter.com/#!/search/%23sigir2011"&gt;#sigir2011&lt;/a&gt; for more news, although given the censorship in China, the results are very sparse. &lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1911934323829169955?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1911934323829169955/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-talk-qi-lu-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1911934323829169955'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1911934323829169955'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/07/sigir-2011-keynote-talk-qi-lu-and.html' title='SIGIR 2011 Keynote Talk: Qi Lu and The Future of the Web &amp; Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8062435317663738053</id><published>2011-06-21T08:00:00.001-04:00</published><updated>2011-06-21T08:00:03.785-04:00</updated><title type='text'>Inside ACL: Building Watson DeepQA keynote Address by David Ferrucci</title><content type='html'>&lt;div&gt;This morning &lt;a href="https://researcher.ibm.com/researcher/view.php?person=us-ferrucci"&gt;David Ferrucci&lt;/a&gt; gave the &lt;a href="http://www.acl2011.org/"&gt;Association for Computational Linguistics (ACL) 2011&lt;/a&gt; keynote talk.  &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael Bendersky&lt;/a&gt; is attending the conference and was generous enough to send me his notes on the first keynote talk.  Be sure to read his paper, &lt;a href="http://ciir.cs.umass.edu/~bemike/pubs/2011-3.pdf"&gt;Joint Annotation of Search Queries&lt;/a&gt;.  
Here are his notes from the talk,&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;Building Watson: An Overview of the DeepQA Project&lt;/i&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;What’s the difference between playing chess and understanding human language?&lt;br /&gt;- People find chess difficult and natural language easy&lt;br /&gt;- Many non-scientists don’t realize how difficult human language understanding really is&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Computers are good at&lt;br /&gt;- Understanding formulas&lt;br /&gt;- Understanding structured query languages&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Computers are bad at&lt;br /&gt;- Parsing ambiguous natural language&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The system challenges&lt;br /&gt;- Open domain&lt;br /&gt;- Complex language&lt;br /&gt;- High precision&lt;br /&gt;- Accurate confidence – only buzz in when you’re very confident&lt;br /&gt;- High speed&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Core technologies&lt;br /&gt;- Deep parsing – using a proprietary IBM technology that has been developed over the last 20 years&lt;br /&gt;- Relation detection&lt;br /&gt;- Multiple parse interpretations&lt;br /&gt;- Multiple query formulations per parse&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Co-reference resolution&lt;br /&gt;- The entire research was driven by a single end-to-end metric – how much the proposed solution improves the Jeopardy game&lt;br /&gt;- Some improvements on a single algorithm might be redundant or harmful in the overall solution&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Jeopardy is open-domain – not using ontologies that were crafted specifically for Jeopardy&lt;br /&gt;- Using general resources: Wordnet, YAGO&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Learning from Reading&lt;br /&gt;- Parsing sentences in the text&lt;br /&gt;- Generalization and Statistical Aggregation&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Some questions require decomposition and 
synthesis&lt;br /&gt;- Using techniques to decompose questions into parts&lt;br /&gt;- Synthesis of answers from different parts&lt;br /&gt;- Helps in answering questions that involve puns/rhyming&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Some questions require finding a missing link between concepts&lt;br /&gt;- Using spreading activation to find links&lt;br /&gt;- e.g., the link between “shirt”, “tv remote”, “telephone” -&amp;gt; buttons&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Metrics for performance evaluation&lt;br /&gt;- Plot: x – % answered, y – precision&lt;br /&gt;- Winners cloud – answered at least 50% of the questions, precision 80-92%&lt;br /&gt;- The goal was to get Watson into the winners cloud – achieved, and Watson exceeded the cloud by the time of the Jeopardy game&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Great leaps in performance from 2007. In the beginning, breaking even in the game seemed like an accomplishment&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Watson is self-contained. Deciding what content to use is very hard – the amount of hardware is limited.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Guidelines&lt;br /&gt;- Specific large hand-crafted methods won’t cut it&lt;br /&gt;- Combining intelligence from diverse methods using machine learning techniques&lt;br /&gt;- Massive Parallelism is a Key Enabler&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;DeepQA – QA system underlying Watson&lt;br /&gt;- Many components for parsing and multiple answer generation&lt;br /&gt;- Logistic regression to weight the different features and rank the answers&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Search systems used: Indri &amp;amp; Lucene. 
Both were modified to reduce run-time&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Work process&lt;br /&gt;- All group members working in the same open space room&lt;br /&gt;  - NLP researchers, IR researchers, ML researchers, linguists, statisticians&lt;br /&gt;- 8,000 experiments – all documented with tools that allow analysis by question/algorithm/features&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Run-time&lt;br /&gt;- Single CPU time for answering a question  – 2 hours&lt;br /&gt;- Scaled out to 3,000 CPUs – 2-3 seconds&lt;br /&gt;- Enabled by the built-in parallelization of the algorithms&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;What I find particularly striking is the deep analysis of a contained corpus, particularly the analysis to find various kinds of missing links.  The hardware is limited and the corpus is tightly circumscribed so that complex and expensive algorithms can run - and it results in significant improvements!  &lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;How would you develop a system for the real-time web where what's meaningful is constantly in flux?  &lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Ultimately, the true test of DeepQA will be how it generalizes to domains beyond Jeopardy.  I hope this is just the beginning for Watson.&lt;/div&gt;&lt;div&gt;  &lt;/div&gt;&lt;div&gt;  Thanks again to &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael&lt;/a&gt; for his notes.  
Look for more highlights from ACL coming soon!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8062435317663738053?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8062435317663738053/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/06/inside-acl-building-watson-deepqa.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8062435317663738053'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8062435317663738053'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/06/inside-acl-building-watson-deepqa.html' title='Inside ACL: Building Watson DeepQA keynote Address by David Ferrucci'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2001915541461525</id><published>2011-06-14T12:56:00.007-04:00</published><updated>2011-06-14T13:58:19.004-04:00</updated><title type='text'>Google Inside Search Event today</title><content type='html'>Today is a big press event on search at Google, &lt;a href="http://www.google.com/insidesearch/"&gt;Inside Search&lt;/a&gt;.   
Be sure to check out the "Evolution of Search" timeline at the bottom of the page.&lt;br /&gt;&lt;br /&gt;Check out the new &lt;a href="http://www.googleinsidesearch.com/underthehood.html#globe"&gt;Search Globe&lt;/a&gt;, a visualization of worldwide search.&lt;br /&gt;&lt;br /&gt;A big theme of the event is:&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight:bold;"&gt;"Breaking down the barriers to knowledge"&lt;br /&gt;&lt;/span&gt; - Make it faster and easier to enter queries across all platforms - especially mobile.  Combining voice search and translation across all search platforms.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;On Search and Knowledge re-org.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Amit - the classic data hierarchy:&lt;br /&gt;&lt;div&gt;Data&lt;/div&gt;&lt;div&gt;Information&lt;/div&gt;&lt;div&gt;Knowledge&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Search has done an amazing job of taking the billions and billions of pages of data and turning them into information.  
We are now setting our sights on knowledge - the relationship of things to one another.&lt;br /&gt;&lt;br /&gt;You can watch the live stream, and Danny Sullivan is &lt;a href="http://searchengineland.com/live-blogging-googles-%E2%80%9Cinside-search%E2%80%9D-event-81531"&gt;live blogging&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;More as it develops.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2001915541461525?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2001915541461525/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/06/google-inside-search-event-today.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2001915541461525'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2001915541461525'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/06/google-inside-search-event-today.html' title='Google Inside Search Event today'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7661547119735398617</id><published>2011-06-06T18:22:00.003-04:00</published><updated>2011-06-06T18:33:09.354-04:00</updated><title type='text'>Watch me Compete on MasterChef Season 2 TONIGHT</title><content type='html'>Set your DVRs and tune in to the premiere of Fox's MasterChef today, Monday, at 8pm.  
I am competing to be America's next MasterChef! For more on my cooking, read my &lt;a href="http://www.cookingphd.com"&gt;modernist cooking blog&lt;/a&gt;, CookingPhD.&lt;br /&gt;&lt;br /&gt;MasterChef is cooking meets American Idol for amateur cooks.  I was selected as one of 100 final contestants flown to LA out of 40,000 people who auditioned for the show.  Watch me cook my signature dish for Gordon Ramsay, Graham Elliot, and Joe Bastianich.  &lt;br /&gt;&lt;br /&gt;Here is a quick promo video, with me searing my signature smoked duck at 2:11:&lt;br /&gt;&lt;iframe width="480" height="303" src="http://www.youtube.com/embed/oQBQYBfC1co" frameborder="0" allowfullscreen&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;br /&gt;A bunch of the cast will be tweeting on &lt;a href="http://twitter.com/#!/search/%23masterchef"&gt;#masterchef&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The second part of the premiere will air tomorrow, Tuesday at 8pm!  &lt;br /&gt;&lt;br /&gt;(The episodes should also be on Hulu at some future date)&lt;br /&gt;&lt;br /&gt;Jeff, aka &lt;a href="http://twitter.com/#!/cookingphd"&gt;@cookingphd&lt;br /&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7661547119735398617?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7661547119735398617/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/06/masterchef-season-2-premier-tonight-at.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7661547119735398617'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7661547119735398617'/><link rel='alternate' type='text/html' 
href='http://www.searchenginecaffe.com/2011/06/masterchef-season-2-premier-tonight-at.html' title='Watch me Compete on MasterChef Season 2 TONIGHT'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://img.youtube.com/vi/oQBQYBfC1co/default.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6957241817330459227</id><published>2011-06-01T15:03:00.004-04:00</published><updated>2011-06-01T17:35:20.282-04:00</updated><title type='text'>Twitter Releases Search+ Relevance based search</title><content type='html'>&lt;div&gt;Today marks a significant milestone for real-time search.  Twitter's search results are now ranked based on relevance instead of purely based on recency.  The announcement was made by CEO Jack Dorsey at the &lt;a href="http://www.allthingsd.com/"&gt;All Things D&lt;/a&gt; conference earlier today.  A key feature is that the new search incorporates rich media results, as mentioned in their &lt;a href="http://blog.twitter.com/2011/06/searchphotos.html"&gt;blog announcement&lt;/a&gt;,&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;Not only will it deliver more relevant Tweets when you search for something or click on a trending topic, but it will also show you related photos and videos, right there on the results page. 
It's never been easier to get a sense of what's happening right now, wherever your curiosity takes you.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;Danny Sullivan has an &lt;a href="http://searchengineland.com/goodbye-time-sorting-twitter-gets-most-relevant-search-results-79368"&gt;article&lt;/a&gt; covering the release and what "relevance" means in the context of real-time search:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;“Relevance for us today is using a combination of signals, your follower graph, who you follow, who’s following you. Another aspect is just looking at the content itself and the resonance of the content,” Mike Abbott, Twitter’s vice president of engineering.&lt;/blockquote&gt;&lt;/div&gt;The &lt;a href="http://engineering.twitter.com/"&gt;Twitter Engineering blog&lt;/a&gt; post on the &lt;a href="http://engineering.twitter.com/2011/05/engineering-behind-twitters-new-search.html"&gt;update&lt;/a&gt; has more detail on the evolution of Twitter search since the original Summize days.  
Here is a small excerpt on what is needed to provide personalized relevance and filtering:&lt;div&gt;&lt;div&gt;&lt;ul&gt;&lt;blockquote&gt;&lt;li&gt;Static signals, added at indexing time&lt;/li&gt;&lt;li&gt;Resonance signals, dynamically updated over time&lt;/li&gt;&lt;li&gt;Information about the searcher, provided at search time&lt;/li&gt;&lt;/blockquote&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;The post has more details worth reading on the infrastructure that goes into the search.&lt;br /&gt;For more news on Twitter search, be sure to follow &lt;a href="https://twitter.com/#!/twittersearch"&gt;@twittersearch&lt;/a&gt;.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6957241817330459227?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6957241817330459227/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/06/twitter-releases-search-relevance-based.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6957241817330459227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6957241817330459227'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/06/twitter-releases-search-relevance-based.html' title='Twitter Releases Search+ Relevance based search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3623410631646503035</id><published>2011-05-18T18:41:00.003-04:00</published><updated>2011-05-18T18:54:55.571-04:00</updated><title type='text'>Inside Search: New Official Google Search Quality Blog</title><content type='html'>Today, Amit Singhal, head of search quality, announced the creation of a new &lt;a href="http://insidesearch.blogspot.com/"&gt;Inside Search blog&lt;/a&gt;.  It is an extension of the &lt;a href="http://googleblog.blogspot.com/search/label/This%20Week%20in%20Search"&gt;This Week in Search&lt;/a&gt; column that highlights new features and product announcements.  As Amit writes,&lt;div&gt;&lt;blockquote&gt;...we got feedback that people wanted their search news and information as it happens, not just weekly. So, we’re starting Inside Search as a place where you can find regular updates on the intricacies of search and our team. We have more engineers working on search than any other product, and each one of us has stories to tell.&lt;/blockquote&gt;&lt;/div&gt;I look forward to hearing from a wide variety of voices on the team.  With more than 500 improvements last year, it can be difficult to keep up with the changes, so it's useful to have them pointed out more explicitly.&lt;br /&gt;&lt;br /&gt;Perhaps it will also lead to a bit more transparency in search ranking and quality at Google.  
At least there is an official place for voices to speak publicly.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3623410631646503035?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3623410631646503035/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/05/inside-search-new-official-google.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3623410631646503035'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3623410631646503035'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/05/inside-search-new-official-google.html' title='Inside Search: New Official Google Search Quality Blog'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6743613595988441923</id><published>2011-05-07T13:01:00.005-04:00</published><updated>2011-05-07T13:50:03.549-04:00</updated><title type='text'>NY Times: The Stanford Facebook App Class</title><content type='html'>The NY Times today has an article, &lt;a href="http://www.nytimes.com/2011/05/08/technology/08class.html"&gt;The Class That Built Apps, and Fortunes&lt;/a&gt;. The Stanford FB app class is &lt;a href="http://www.stanford.edu/group/captology/cgi-bin/facebook/"&gt;CS377W: Creating Engaging Facebook Apps&lt;/a&gt;.  
The class was taught by &lt;a href="http://www.bjfogg.com/"&gt;BJ Fogg&lt;/a&gt;, &lt;a href="http://500hats.typepad.com/"&gt;David McClure&lt;/a&gt;, and &lt;a href="http://www.dan.ag/"&gt;Dan Greenberg&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One takeaway from the class is an important reminder from BJ Fogg,&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;What smart people do, what engineers tend to do, is overthink and from the beginning we said to do simple things.  But, the inclination is to do something fancier, more complicated.  What happened over time was that the students teams discovered that over time is that the complicated things never worked, that simple things took off.  &lt;/blockquote&gt;&lt;div&gt;It was a hugely popular class, with hundreds of people interested in it.  From the NYT article, &lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;blockquote&gt;Working in teams of three, the 75 students created apps that collectively had 16 million users in just 10 weeks.&lt;/blockquote&gt;&lt;div&gt;&lt;div&gt;A key component of the class is the social aspect of the applications being built.  The class is part of the &lt;a href="http://captology.stanford.edu/"&gt;Stanford Persuasive Tech Lab&lt;/a&gt;. From the lab's description,&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;Our lab specializes in persuasion via technology, so this is naturally our focus when studying Facebook. We want to understand the how motivation and influence operate on Facebook. &lt;/blockquote&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;It was a big experiment in virality.  As student &lt;a href="http://www.johnnyhwin.com/"&gt;Johnny Hwin&lt;/a&gt; describes it,&lt;/div&gt;&lt;blockquote&gt;The hardest part of any project is to find the initial traction that will get you users and engagement and build from there.  Rather than building a road to the moon, build the first step.&lt;/blockquote&gt;&lt;/div&gt;&lt;/div&gt;The students learned important lessons.  
You need people to use a product you build. It's great to capture attention, but it's more important to do something meaningful.   There are enough punch the monkey, hot or not, and similar apps to waste your time on.  Create apps that help people solve a problem that matters.  One example is ReCaptcha, which helps to digitize books and fight spam.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Stanford Persuasive Tech Lab has taken on important projects: Health, Peace Dot, and others.  I hope that, in addition to learning how to create popular apps, the students were also taught the principles that motivate these projects.&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6743613595988441923?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6743613595988441923/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/05/ny-times-stanford-facebook-app-class.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6743613595988441923'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6743613595988441923'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/05/ny-times-stanford-facebook-app-class.html' title='NY Times: The Stanford Facebook App Class'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2716743737710394032</id><published>2011-05-07T11:47:00.010-04:00</published><updated>2011-05-07T12:56:56.101-04:00</updated><title type='text'>From Search to Knowledge at Google</title><content type='html'>&lt;div&gt;TechCrunch &lt;a href="http://techcrunch.com/2011/05/03/google-dissolves-search-group-internally-now-called-knowledge/"&gt;reports&lt;/a&gt; that Search as a high-level product group in Google no longer exists. As part of Google's re-org under new CEO Larry Page, the search group has been renamed the "Knowledge Group". Search Engine Land is reporting on the &lt;a href="http://searchengineland.com/knowledge-replaces-search-for-google-75739"&gt;promotion of Alan Eustace&lt;/a&gt; from SVP of Engineering and Research to Google’s Senior Vice President, &lt;b&gt;&lt;i&gt;Knowledge&lt;/i&gt;&lt;/b&gt;. My understanding is that this represents an expanded view of the products in search.  Beyond helping people find information, the group's goals include,&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;... enhancing people’s understanding and facilitating the creation of knowledge. &lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;Although all the details are not public, it sounds as if Udi Manber leads the engineering team on information products that are not core search.  The details are speculation on my part, but his responsibilities may include products like &lt;a href="http://knol.google.com/"&gt;Knol&lt;/a&gt;, &lt;a href="http://www.freebase.com/"&gt;Freebase&lt;/a&gt;, and &lt;a href="http://vark.com/"&gt;Aardvark&lt;/a&gt;. 
It might also include some of Google's data management tools: &lt;a href="http://code.google.com/p/google-refine/"&gt;Google Refine&lt;/a&gt;, &lt;a href="http://www.google.com/fusiontables"&gt;Fusion Tables&lt;/a&gt;, and &lt;a href="http://www.google.com/publicdata"&gt;Public Data Explorer&lt;/a&gt;. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Does this mean that search is no longer a "core" product at Google?  I don't think so.  Instead, it indicates an astute awareness that search needs to be tied to other projects that manage information - social QA, Wikipedia-like knowledge bases, structured data, and other information tools.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the academic world, we should also consider information retrieval in the context of tools and communities that create and share information: digital libraries, NLP, information (and relation) extraction, and the semantic web.  These are all inter-connected components of the information processing and knowledge management ecosystem. 
As Google's re-org to create a "Knowledge" team indicates, these communities need to communicate and coordinate effectively towards a broader vision of helping people find, create, analyze, and share information.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2716743737710394032?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2716743737710394032/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/05/from-search-to-knowledge-at-google.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2716743737710394032'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2716743737710394032'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/05/from-search-to-knowledge-at-google.html' title='From Search to Knowledge at Google'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8210725177163432961</id><published>2011-05-04T15:30:00.004-04:00</published><updated>2011-05-06T08:03:24.950-04:00</updated><title type='text'>Watch me on Fox's MasterChef USA Season 2</title><content type='html'>&lt;span class="Apple-style-span" style="color: rgb(38, 38, 38); font-family: Arial, Helvetica, Geneva, sans-serif; font-size: 12px; line-height: 
21px; "&gt;Fox announced that I am one of the 100 contestants chosen for the new season of MasterChef!  &lt;/span&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 12px; line-height: 21px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;span class="Apple-style-span" style="font-size: 12px; line-height: 21px;"&gt;Read the &lt;a href="http://www.cookingphd.com/blog/2011/5/4/watch-me-on-foxs-masterchef-usa-season-2.html"&gt;MasterChef announcement&lt;/a&gt; on my&lt;a href="http://www.cookingphd.com"&gt; cooking and recipe blog&lt;/a&gt;.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8210725177163432961?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8210725177163432961/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/05/watch-me-on-foxs-masterchef-usa-season.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8210725177163432961'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8210725177163432961'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/05/watch-me-on-foxs-masterchef-usa-season.html' title='Watch me on Fox&apos;s MasterChef USA Season 2'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-397951146230136068</id><published>2011-04-28T09:17:00.009-04:00</published><updated>2011-04-28T10:46:44.491-04:00</updated><title type='text'>Greplin: Personal Search for the Social Network Era</title><content type='html'>&lt;a href="http://www.greplin.com/"&gt;Greplin&lt;/a&gt; is a cloud search service that indexes your social network and personal information stored in web services. It provides a central hub for searching all your online data.  Greplin is a small startup company with six engineers. Instead of building its own cluster, it leverages Amazon EC2 for indexing capacity.  The &lt;a href="http://techcrunch.com/2011/04/27/greplin-1-5-billion-documents-indexed-six-engineers/"&gt;TechCrunch article&lt;/a&gt; reports today that:&lt;div&gt;&lt;blockquote&gt;They’ve now indexed some 1.5 billion documents. And they’re indexing about 30 million new documents per day.&lt;/blockquote&gt; &lt;div&gt;The TechCrunch article exaggerates the scale issue. The more significant scale issues relate to query volume, and the article does not report on those numbers. Furthermore, a large component of the documents Greplin indexes are short FB and Twitter updates.  Greplin has more relaxed indexing requirements than real-time search: in the FAQ Greplin says it can take up to 20 minutes or even up to a day to index your documents.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My current Greplin index has approximately 54,000 documents.  It has 30k from Gmail, 7k from Facebook, 17k from Twitter, and around 500 from LinkedIn. The basic search functionality seems reasonable enough. It is very snappy with search as you type.  The advanced search capabilities are a bit limited. 
For example, search by date is missing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Greplin is still in its infancy.  The search interface could benefit from blending document results from different sources into a more unified result list.  For example, see the recent work on "aggregated" and "federated" search [e.g. &lt;i&gt;A Methodology for Evaluating Aggregated Search Results&lt;/i&gt; from ECIR 2011].  Furthermore, I would like a faceted search UI to support &lt;a href="http://en.wikipedia.org/wiki/Exploratory_search"&gt;exploratory search&lt;/a&gt;.  They could learn a lot by looking at the extensive research on Personal Information Management (PIM) and Desktop search, like &lt;a href="http://research.microsoft.com/en-us/um/people/teevan/publications/"&gt;Jaime Teevan's&lt;/a&gt;  research along with &lt;a href="http://research.microsoft.com/en-us/um/people/sdumais/"&gt;Sue Dumais'&lt;/a&gt; work on &lt;a href="http://research.microsoft.com/en-us/um/people/sdumais/SISLandmarks-Interact2003-final.pdf"&gt;Landmarks&lt;/a&gt; and &lt;a href="http://research.microsoft.com/en-us/um/people/sdumais/SISCore-SIGIR2003-Final.pdf"&gt;Stuff I've Seen&lt;/a&gt;.  (For more on PIM - you can also read &lt;a href="http://lifidea.wordpress.com/"&gt;Jinyoung Kim&lt;/a&gt;, one of my labmates).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have significant reservations concerning my data privacy.  Do I trust Greplin with my indexed data?  It needs at least partial copies to show snippets of results.  At least it claims I can delete my indices for a service at any time.  However, it is a very coarse mechanism.  There is no version of a robots.txt for my personal data so that I can specify mechanisms for "do not index" or "do not cache" at a granular level.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I have a few invites. If you want to try it out leave a request in the comments. 
&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-397951146230136068?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/397951146230136068/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/04/greplin-personal-search-for-social.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/397951146230136068'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/397951146230136068'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/04/greplin-personal-search-for-social.html' title='Greplin: Personal Search for the Social Network Era'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2444138973528033979</id><published>2011-04-19T16:42:00.009-04:00</published><updated>2011-04-19T18:40:51.948-04:00</updated><title type='text'>ECIR 2011 Best Paper Awards and Other Highlights</title><content type='html'>First up are the best paper awards which were announced tonight. 
There were two awards, one for best paper and one for best student paper:&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.springerlink.com/content/512053516m133852/"&gt;A User-oriented Model for Expert Finding&lt;/a&gt; by Smirnova and Balog (best paper)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.springerlink.com/content/j8nvr881n3686161/"&gt;A Methodology for Evaluating Aggregated Search Results.&lt;/a&gt; by Arguello, Diaz, Callan, and Carterette. (best student paper)&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;The paper &lt;a href="http://www.dcs.gla.ac.uk/~craigm/publications/macdonald11learned.pdf"&gt;Learning Models for Ranking Aggregates&lt;/a&gt; by Craig Macdonald and Iadh Ounis was also nominated.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The best poster prize was &lt;a href="http://yfrog.com/z/hs821ztj"&gt;A novel reranking approach inspired by quantum measurement&lt;/a&gt; by Zhao et al. (via &lt;a href="http://twitter.com/#!/phelo"&gt;Owen Phelan&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A trend at the conference is the handling of ranking and evaluating "aggregate" results, aka "blended" results or "universal search" where results from multiple verticals are blended into a single presentation.  In addition to the above two papers, there is:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://maroo.cs.umass.edu/pub/web/getpdf.php?id=895"&gt;Smoothing Click Counts for Aggregated Vertical Search&lt;/a&gt; by Jangwon Seo et al.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;Other trends in the conference appear to be:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Crowd sourcing evaluation (an entire session)&lt;/li&gt;&lt;li&gt;Realtime and Microblog (Twitter) applications (multiple papers across tracks)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;The &lt;a href="http://www.dcs.gla.ac.uk/workshops/ddr2011/"&gt;DDR 2011 workshop&lt;/a&gt; on diversity in document retrieval also proved popular.  
The &lt;a href="http://www.dcs.gla.ac.uk/workshops/ddr2011/ddr2011.proceedings.pdf"&gt;proceedings&lt;/a&gt; are available for download.  There is a fair bit of discussion on Twitter, &lt;a href="http://twitter.com/#!/search/ddr2011"&gt;#ddr2011&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There are two other papers from UMass that I want to highlight:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://ciir-publications.cs.umass.edu/pub/web/getpdf.php?id=912"&gt;Passage Reranking for Question Answering Using Syntactic Structures and Answer Types&lt;/a&gt; by &lt;a href="http://www.cs.umass.edu/~elif/"&gt;Elif Aktolga&lt;/a&gt; et al.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cs.cmu.edu/~vitor/papers/ecir_2011.pdf"&gt;An Analysis of Time-instability in Web Search Results&lt;/a&gt; by &lt;a href="http://lifidea.wordpress.com/"&gt;Kim&lt;/a&gt; and &lt;a href="http://www.cs.cmu.edu/~vitor/"&gt;Carvalho&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2444138973528033979?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2444138973528033979/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/04/ecir-2011-best-paper-awards-and-other.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2444138973528033979'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2444138973528033979'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/04/ecir-2011-best-paper-awards-and-other.html' title='ECIR 2011 
Best Paper Awards and Other Highlights'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-783176662526678797</id><published>2011-04-07T23:25:00.004-04:00</published><updated>2011-04-08T00:03:40.765-04:00</updated><title type='text'>SIGIR 2011 Results</title><content type='html'>Today, the SIGIR paper acceptance/rejections were sent out.  What was your result?  Let me know in the comments.  What did you think of the review quality?  Will there be an influx of new submissions to &lt;a href="http://nonrel.wordpress.com/"&gt;online journals for rejected papers&lt;/a&gt;?&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This year there were 545 papers submitted and 108 were accepted (19.8%).  Despite controversy that some papers might not receive oral presentations, all papers will have full presentations.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead of complaining about the reviewers of my rejected paper, I would like to thank the reviewers for their time and consideration, regardless of the outcome, because writing reviews takes a lot of time and effort. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My congratulations to the accepted authors.  
I look forward to the papers.&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-783176662526678797?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/783176662526678797/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/04/sigir-2011-results.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/783176662526678797'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/783176662526678797'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/04/sigir-2011-results.html' title='SIGIR 2011 Results'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6577913756946312377</id><published>2011-03-29T10:17:00.004-04:00</published><updated>2011-03-29T17:03:31.020-04:00</updated><title type='text'>WWW 2011 Day 2: More Workshops and Tutorials</title><content type='html'>This is more on the WWW conference in India this week.  I'm not attending, but I wanted to point out a few things that caught my attention. Today there are more tutorials and workshops.  
See also the &lt;a href="http://www.searchenginecaffe.com/2011/03/www-2011-this-week.html"&gt;workshops and tutorials from Day 1&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b&gt;Workshops&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://wwwhome.math.utwente.nl/~volkovichyv/some2011"&gt;&lt;b&gt;Social Media Engagement (SoME 2011)&lt;/b&gt;&lt;/a&gt; - It focuses on how to measure user engagement (captivated and motivated to participate) and satisfaction with social media.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="https://km.aifb.kit.edu/ws/semsearch11"&gt;&lt;b&gt;SemSearch 2011&lt;/b&gt;&lt;/a&gt; - The fourth workshop on Semantic Search.  The most interesting aspect of the workshop is the data challenge.  What I find most compelling is the manually constructed "List" or "Type" &lt;a href="http://km.aifb.kit.edu/ws/semsearch10/Files/samplequeries-list"&gt;queries&lt;/a&gt; that are more complex than the other entity queries.  The manually constructed queries utilize the attributes and relationships, which make semantic data unique, e.g. [Japanese-born players who have played in MLB where the British monarch is also head of state].&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Tutorials&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://snap.stanford.edu/proj/socmedia-www/"&gt;&lt;b&gt;Social Media Analytics&lt;/b&gt;&lt;/a&gt; - is being taught by &lt;a href="http://cs.stanford.edu/people/jure/"&gt;Jure Leskovec&lt;/a&gt; from Stanford.  The &lt;a href="http://snap.stanford.edu/proj/socmedia-www/index.html#materials"&gt;slide materials&lt;/a&gt; are available online. Last fall he also taught a related class, &lt;a href="http://www.stanford.edu/class/cs224w/"&gt;Social and Information Network Analysis&lt;/a&gt;.  
He is currently teaching a class on &lt;a href="http://cs246.stanford.edu/"&gt;Mining Massive Datasets&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www2011india.com/tutorialstr34.html"&gt;&lt;b&gt;Web-Based Open Domain Information Extraction&lt;/b&gt;&lt;/a&gt; is being presented by &lt;a href="http://research.google.com/pubs/author107.html"&gt;Marius Pasca&lt;/a&gt; from Google.  He will be giving a related tutorial, &lt;a href="http://www.acl2011.org/tutorials_10pasca.shtml"&gt;Web Search Queries as a Corpus&lt;/a&gt; at the upcoming ACL conference in Portland.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www2011india.com/tutorialsInvited3.html"&gt;&lt;b&gt;Latent Variable Models on the Internet &lt;/b&gt;&lt;/a&gt;- &lt;a href="http://www.cs.cmu.edu/~amahmed/"&gt;Amr Ahmed&lt;/a&gt; and &lt;a href="http://alex.smola.org/"&gt;Alex Smola&lt;/a&gt; are presenting work on using Graphical Models on web data.  From the description,  &lt;i&gt;we will describe inference algorithms for collaborative filtering, recommendation, latent dirichlet allocation, and advanced clustering models&lt;/i&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.www2011india.com/tutorialstr23.html"&gt;&lt;b&gt;Social Recommender Systems&lt;/b&gt;&lt;/a&gt; - &lt;a href="http://www.research.ibm.com/haifa/dept/imt/ct_st.shtml"&gt;Ido Guy&lt;/a&gt; and &lt;a href="https://researcher.ibm.com/researcher/view.php?person=il-CARMEL"&gt;David Carmel&lt;/a&gt; from IBM Research are giving a tutorial on social recommender systems.  
See also the recent &lt;a href="http://www.comp.hkbu.edu.hk/~lichen/srs2011/"&gt;SRS2011 workshop&lt;/a&gt; which was also organized by Ido.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt; &lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6577913756946312377?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6577913756946312377/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/www-2011-day-2-more-workshops-and.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6577913756946312377'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6577913756946312377'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/www-2011-day-2-more-workshops-and.html' title='WWW 2011 Day 2: More Workshops and Tutorials'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2855361885241538131</id><published>2011-03-28T10:10:00.005-04:00</published><updated>2011-03-28T10:35:46.383-04:00</updated><title type='text'>WWW 2011 this week</title><content type='html'>The WWW 2011 conference is happening this week in Hyderabad India.  I'm not attending, so drop me a message or an email with notes or highlights. 
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Be sure to check out the &lt;a href="http://www2011india.com/interactive_schedule.html"&gt;full program&lt;/a&gt;.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Today is Tutorial and Workshop  day, &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Relevant Tutorials&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www2011india.com/tutorialstr25.html"&gt;Ranking on Large-Scale Graphs with Rich Metadata&lt;/a&gt; by &lt;a href="http://research.microsoft.com/en-us/people/bingao/"&gt;Bin Gao&lt;/a&gt;, &lt;a href="http://research.microsoft.com/en-us/people/taifengw/"&gt;Taifeng Wang&lt;/a&gt;, and &lt;a href="http://research.microsoft.com/en-us/people/tyliu/"&gt;Tie-Yan Liu&lt;/a&gt; from MSR Asia.&lt;/div&gt;&lt;div&gt; - For a preview see their tech report, &lt;a href="http://research.microsoft.com/apps/pubs/?id=146769"&gt;Semi-Supervised Ranking on Very Large Graph with Rich Metadata&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.blogger.com/www.www2011india.com/tutorialsInvited2.html"&gt;Distributed Web Retrieval &lt;/a&gt;by Ricardo Baeza Yates from Yahoo! 
Research Barcelona.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Selected Workshops&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://data.semanticweb.org/usewod/2011/"&gt;USEWOD2011&lt;/a&gt; - 1st International Workshop on Usage Analysis and the Web of Data&lt;/div&gt;&lt;div&gt;This workshop will investigate the synergy between semantics and semantic-web technology on the one hand and analysis and mining of usage data on the other hand.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.dl.kuis.kyoto-u.ac.jp/webquality2011/"&gt;WICOW/AIRWeb Workshop on Web Quality (WebQuality 2011)&lt;/a&gt;&lt;/div&gt;&lt;div&gt;It is encouraging to see a discussion beyond blatant spam to more subtle issues of authority, credibility, and reputation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://temporalweb.net/"&gt;Temporal Web Analytics Workshop&lt;/a&gt;&lt;br /&gt;TWAW focuses on temporal data analysis along the time dimension for Web data that has been collected over extended time periods.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Stay tuned for more WWW news later this week!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2855361885241538131?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2855361885241538131/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/www-2011-this-week.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2855361885241538131'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/18315968/posts/default/2855361885241538131'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/www-2011-this-week.html' title='WWW 2011 this week'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8375467663271148081</id><published>2011-03-04T06:00:00.003-05:00</published><updated>2011-03-04T09:04:02.215-05:00</updated><title type='text'>Evgeniy Gabrilovich wins 2010 Karen Spärck Jones Award</title><content type='html'>&lt;div&gt;The &lt;a href="http://irsg.bcs.org/"&gt;British Computer Society IRSG&lt;/a&gt; announced that the winner of the 2010 &lt;a href="http://irsg.bcs.org/ksjaward.php"&gt;Karen Spärck Jones Award&lt;/a&gt; is &lt;a href="http://www.cs.technion.ac.il/~gabr/"&gt;Evgeniy Gabrilovich&lt;/a&gt;. Evgeniy is a Senior Research Scientist and Manager of the NLP &amp;amp; IR Group of Yahoo! Research. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;He will present a keynote talk at the upcoming &lt;a href="http://www.ecir2011.dcu.ie/"&gt;ECIR 2011&lt;/a&gt; conference later this month.  His presentation will be &lt;a href="http://irsg.bcs.org/ksjaward/gabr_ksj.pdf"&gt;Ad Retrieval Systems in vitro and in vivo: Knowledge-Based Approaches to Computational Advertising&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Congratulations Evgeniy!  I have heard a lot of great things from Yahoo! 
Research interns who worked under his guidance.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8375467663271148081?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8375467663271148081/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/evgeniy-gabrilovich-wins-2010-karen.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8375467663271148081'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8375467663271148081'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/evgeniy-gabrilovich-wins-2010-karen.html' title='Evgeniy Gabrilovich wins 2010 Karen Spärck Jones Award'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8661155516163707554</id><published>2011-03-03T19:24:00.004-05:00</published><updated>2011-03-03T19:55:50.544-05:00</updated><title type='text'>Google's War on Content Farms: Project Big Panda</title><content type='html'>In late February Google launched a significant update to its ranking algorithm to address "shallow content" pages.  The change has been referred to as the "Farmer" update externally and internally it is known as "Panda". 
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Amit Singhal and Matt Cutts posted about the change on the Google blog, &lt;a href="http://googleblog.blogspot.com/2011/02/finding-more-high-quality-sites-in.html"&gt;Finding more high quality sites in search&lt;/a&gt;.  It reduced the rankings of "low quality sites" that aggregated content from other websites and didn't add significant value for users.  According to the post, the update affected 11.8% of queries.  They also launched the &lt;a href="https://chrome.google.com/webstore/detail/nolijncfnkgaikbjbdaogikpmpbdcdef"&gt;Chrome Blocklist Extension&lt;/a&gt; to let people block websites from their Google results.  O'Reilly Radar published an &lt;a href="http://radar.oreilly.com/2011/03/search-notes-content-farms.html"&gt;article&lt;/a&gt; with a very good overview of the discussion.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What is behind the change?  The most informative article is a recent Wired interview by Steven Levy, &lt;a href="http://www.wired.com/epicenter/2011/03/the-panda-that-hates-farms/all/1"&gt;The ‘Panda’ That Hates Farms&lt;/a&gt;.  It interviews Matt Cutts and Amit Singhal, who managed the update.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What was the answer?  In short, they built a document quality classifier trained on lots of rater data.  Here are some of the questions they asked raters, as quoted in the article:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Would you be comfortable giving this site your credit card?&lt;/li&gt;&lt;li&gt;Would you be comfortable giving medicine prescribed by this site to your kids?&lt;/li&gt;&lt;li&gt;Do you consider this site to be authoritative?&lt;/li&gt;&lt;li&gt;Would it be okay if this was in a magazine?&lt;/li&gt;&lt;li&gt;Does this site have excessive ads?&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;These questions seem to ask about the authoritativeness and trustworthiness of the content on a page.  
The results were also confirmed by an 84% overlap between sites downgraded in the change and those that people blocked using the Chrome extension, even though the blocklist data is not used as a feature in the update.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How did Google become overrun with almost-spam content? Amit sheds a bit of light on the question in one of his answers:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;So we did Caffeine in late 2009.  Our index grew so quickly, and we were just crawling at a much faster speed. When that happened, we basically got a lot of good fresh content, and some not so good. The problem had shifted from random gibberish, which the spam team had nicely taken care of, into somewhat more like written prose. But the content was shallow.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;The interview then gets bogged down in bigger issues around editorial process and transparency, which are important but not as technically interesting.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8661155516163707554?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8661155516163707554/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/googles-war-on-content-farms-project.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8661155516163707554'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8661155516163707554'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/googles-war-on-content-farms-project.html' title='Google&apos;s War on Content Farms: Project Big 
Panda'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1259220878034398487</id><published>2011-03-02T09:39:00.004-05:00</published><updated>2011-03-02T12:52:47.183-05:00</updated><title type='text'>HeyStaks launches: Social and Collaborative Web Search App</title><content type='html'>&lt;a href="http://www.heystaks.com/"&gt;HeyStaks&lt;/a&gt; is a collaborative search startup that launched publicly yesterday at DemoCon.  HeyStaks has a browser / iPhone app that lets you share your search experiences.  It lets you save searches and pages you find into "Staks" and then share them with your "Search Buddies". &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;VentureBeat has an &lt;a href="http://venturebeat.com/2011/03/01/demo-heystaks-makes-searching-social/"&gt;article&lt;/a&gt; covering the launch that is worth checking out. Here is the video from their website:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;iframe src="http://player.vimeo.com/video/20062652" width="400" height="225" frameborder="0"&gt;&lt;/iframe&gt;&lt;p&gt;&lt;a href="http://vimeo.com/20062652"&gt;HeyStaks v4&lt;/a&gt; from &lt;a href="http://vimeo.com/user5508146"&gt;HeyStaks&lt;/a&gt; on &lt;a href="http://vimeo.com/"&gt;Vimeo&lt;/a&gt;.&lt;/p&gt;&lt;/div&gt;&lt;div&gt;The chief scientist at the company is &lt;a href="http://barry.smyth.ucd.ie/"&gt;Barry Smyth&lt;/a&gt;, a professor at University College Dublin.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It's a bit early for a full review, but I tried it out and it seems promising.  
I have some privacy concerns about browser toolbars that save and share my search history, especially when the service is oriented towards public sharing of the information.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;HeyStaks reminds me of the failed &lt;a href="http://www.ysearchblog.com/2009/07/07/unveiling-yahoo-search-pad/"&gt;Yahoo! Search Pad&lt;/a&gt;, but with a more social focus, and it works across search engines.  I hope it has better luck.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I would like to see the service evolve to support more collaboration in the search itself, beyond saving and sharing results. For example, it could offer the kind of deeper integration that &lt;a href="http://palblog.fxpal.com/?cat=22"&gt;Gene Golovchinsky&lt;/a&gt;, &lt;a href="http://irgupf.com/"&gt;Jeremy Pickens&lt;/a&gt;, and others have been advocating.  See their paper, &lt;a href="http://portal.acm.org/citation.cfm?id=1390389"&gt;Algorithmic mediation for collaborative exploratory search&lt;/a&gt;, which won the best paper award at SIGIR 2008.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;My congratulations to HeyStaks on the launch.  
I look forward to the Chrome and Android versions, which I hope will follow soon.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1259220878034398487?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1259220878034398487/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/heystaks-launches-collaborative-web.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1259220878034398487'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1259220878034398487'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/heystaks-launches-collaborative-web.html' title='HeyStaks launches: Social and Collaborative Web Search App'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4301601060993529952</id><published>2011-03-02T09:29:00.003-05:00</published><updated>2011-03-02T11:42:32.220-05:00</updated><title type='text'>News Highlights: Bing Price search, Yahoo! 
Boss, Google Data Publishing, and more</title><content type='html'>&lt;div&gt;Here is a roundup of news from around the web:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.bing.com/community/site_blogs/b/search/archive/2011/03/01/bing-feature-update-searching-for-a-good-deal-new-natural-language-capabilities-in-bing-shopping-understand-prices.aspx"&gt;Bing adds price recognition&lt;/a&gt; to its query support.  You can now search for "digital camera under $200" and it will automatically add the price filter.  It is a good step in the right direction.  How about something a bit harder?  For example, "Canon 12 MP Camera under $200", with manufacturer and megapixel attribute restrictions.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Google &lt;a href="http://googleresearch.blogspot.com/2011/02/slicing-and-dicing-data-for-interactive.html"&gt;announced&lt;/a&gt; the public release of the &lt;a href="http://code.google.com/apis/publicdata/"&gt;Dataset Publishing Language (DSPL)&lt;/a&gt;, a representation language for the data and metadata of datasets.  &lt;/li&gt;&lt;/ul&gt;&lt;blockquote&gt;&lt;i&gt;We created this format to address a key problem in the Public Data Explorer and other, similar tools, namely, that existing data formats don’t provide enough information to support easy yet powerful data exploration by non-technical users.&lt;/i&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Yahoo! announced the &lt;a href="http://developer.yahoo.com/search/boss/boss_api_guide/"&gt;Yahoo! 
Search Boss v2.0 API&lt;/a&gt; for the new Bing-based search.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.cccblog.org/2011/03/01/congressman-rush-holt-beats-watson-at-jeopardy"&gt;Congressman beats Watson&lt;/a&gt; in a round of Jeopardy.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Scala tip:  Check out the &lt;a href="http://www.scala-lang.org/node/2097"&gt;REPL&lt;/a&gt; for interactive debugging.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4301601060993529952?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4301601060993529952/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/news-highlights-bing-price-search-yahoo.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4301601060993529952'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4301601060993529952'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/news-highlights-bing-price-search-yahoo.html' title='News Highlights: Bing Price search, Yahoo! 
Boss, Google Data Publishing, and more'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2793317625708615193</id><published>2011-03-01T09:12:00.004-05:00</published><updated>2011-03-01T09:26:38.304-05:00</updated><title type='text'>WhistlePig: A minimalist real-time search engine</title><content type='html'>William Morgan recently &lt;a href="http://all-thing.net/whistlepig-0.1-released"&gt;announced&lt;/a&gt; the release of &lt;a href="https://github.com/wmorgan/whistlepig"&gt;Whistlepig&lt;/a&gt;, a real-time search engine written in C with Ruby bindings. It is now up to release 0.4.  Whistlepig is a minimalist in-memory search system that returns results in reverse date order. You can read William's blog post for his motivations for writing it.  Here is a description from the current &lt;a href="https://github.com/wmorgan/whistlepig/blob/master/README"&gt;readme&lt;/a&gt;:&lt;br /&gt;&lt;blockquote&gt;Roughly speaking, realtime search means:&lt;br /&gt;- documents are available to to queries immediately after indexing, without any reindexing or index merging;&lt;br /&gt;- later documents are more important than earlier documents.&lt;br /&gt;&lt;br /&gt;Whistlepig takes these principles to an extreme.&lt;br /&gt;- It only returns documents in the reverse (LIFO) order to which they were&lt;br /&gt;added, and performs no ranking, reordering, or scoring.&lt;br /&gt;- It only supports incremental indexing. 
There is no notion of batch indexing or index merging.&lt;br /&gt;- It does not support document deletion or modification (except in the&lt;br /&gt;special case of labels; see below).&lt;br /&gt;- It only supports in-memory indexes.&lt;br /&gt;&lt;br /&gt;Features that Whistlepig does provide:&lt;br /&gt;- Incremental indexing. Updates to the index are immediately available to&lt;br /&gt;readers.&lt;br /&gt;- Fielded terms with arbitrary fields.&lt;br /&gt;- A full query language and parser with conjunctions, disjunctions, phrases, negations, grouping, and nesting.&lt;br /&gt;- Labels: arbitrary tokens which can be added to and removed from documents at any point, and incorporated into search queries.&lt;br /&gt;- Early query termination and resumable queries.&lt;br /&gt;- A tiny, &lt;&gt;&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2793317625708615193?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2793317625708615193/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/03/whistlepig-minimalist-real-time-search.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2793317625708615193'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2793317625708615193'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/03/whistlepig-minimalist-real-time-search.html' title='WhistlePig: A minimalist real-time search engine'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4464821394895998604</id><published>2011-02-28T10:59:00.009-05:00</published><updated>2011-02-28T12:45:36.894-05:00</updated><title type='text'>Palantir: Next Gen Platform for Information Analysis</title><content type='html'>&lt;a href="http://palantirtech.com/"&gt;Palantir&lt;/a&gt; is a very ambitious &lt;del&gt;new&lt;/del&gt; tech company building a high-powered information analysis platform.  They currently have products targeted for the government and the financial industries.  Their product is a highly specialized enterprise data system to support intelligence and business analysts.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;a href="http://blog.palantir.com/2007/12/04/what-do-we-do/"&gt;What does Palantir do&lt;/a&gt;?&lt;br /&gt;&lt;blockquote&gt;... the most central hard problem that we address in trying to enable the analyst is data modeling, the process of figuring out what data types are relevant to a domain, defining what they represent in the world, and deciding how to represent them in the system. At Palantir we make sure our data model (ontology) is both flexible and dynamic, and that it mirrors the concepts people naturally use when reasoning about the domain.&lt;/blockquote&gt;&lt;/div&gt;The platform handles both structured and unstructured information and performs extraction and data integration. See their &lt;a href="http://www.palantirtech.com/infrastructure"&gt;infrastructure page&lt;/a&gt; and &lt;a href="http://www.palantirtech.com/government/videos/whitevideos"&gt;white videos&lt;/a&gt; for a few more details. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Their data platform handles objects.  
An object in their platform has four components:&lt;/div&gt;&lt;div&gt;- Properties: text object attributes&lt;/div&gt;&lt;div&gt;- Media: images, video, and binary data&lt;/div&gt;&lt;div&gt;- Notes: free text&lt;/div&gt;&lt;div&gt;- Relationships: links between objects&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Clients can specialize this generic object into specific types, using the "Dynamic Ontology" tool to define the semantics.  Their platform has one fixed schema with five tables: object, property, notes, media, and object-object.  An object is linked to one or more data sources, which is critical for data lineage and access controls.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A key component of the platform is search over the objects. According to their blog, their scenario differs from web search in two ways:&lt;br /&gt;&lt;ul&gt;&lt;blockquote&gt;&lt;li&gt;Realtime indexing and querying – we need information to be available immediately as it changes in the system.&lt;/li&gt;&lt;li&gt;Leak-proof access controls – we need the search engine to help us make sure that we don’t have information leaking across access control boundaries.&lt;/li&gt;&lt;/blockquote&gt;&lt;/ul&gt;They go into more detail on their modifications to Lucene for their use cases in two blog posts, &lt;a href="http://blog.palantirtech.com/2009/08/13/palantir-search-with-a-twist-part-one-memory-efficiency/"&gt;Search with a Twist Part I&lt;/a&gt; and &lt;a href="http://blog.palantir.com/2009/10/27/palantir-search-with-a-twist-part-two-realtime-indexing-and-security/#more-1260"&gt;Part II&lt;/a&gt;. From the comments, it sounds like they are using a custom branch of Lucene 2.4.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Palantir's platform combines data processing over large heterogeneous datasets, filtering, mapping, visualization, and search in unique ways to create a compelling toolset.  
By recruiting a team of uber-geek talent lured by hip Silicon Valley panache worthy of James Bond, Palantir built an intelligence platform that the government could not build itself.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4464821394895998604?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4464821394895998604/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/palantir-next-gen-platform-for.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4464821394895998604'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4464821394895998604'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/02/palantir-next-gen-platform-for.html' title='Palantir: Next Gen Platform for Information Analysis'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1497545240903383859</id><published>2011-02-25T11:23:00.013-05:00</published><updated>2011-02-28T23:17:24.075-05:00</updated><title type='text'>Google "Recipe View" Search Disappointing and Dangerous</title><content type='html'>Today Google announced &lt;a href="http://www.google.com/landing/recipes/"&gt;Recipe View&lt;/a&gt; in a &lt;a 
href="http://googleblog.blogspot.com/2011/02/slice-and-dice-your-recipe-search.html"&gt;blog post&lt;/a&gt;. It is a specialized view of search results restricted to recipes. Recipe View lets you search for recipes without adding text to your query. It searches over recipes from most of the major recipe websites. Google is using semantic data that is marked up using the &lt;a href="http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html"&gt;rich snippets format&lt;/a&gt;. I'm very excited by the idea. I want to like it, but I don't. Let me explain.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;a href="http://4.bp.blogspot.com/-6DUek9dAQ80/TWfmIO_y1GI/AAAAAAAAABU/WNjlfSKY1xI/s1600/recipeview.jpg"&gt;&lt;img style="TEXT-ALIGN: center; MARGIN: 0px auto 10px; WIDTH: 400px; DISPLAY: block; HEIGHT: 300px; CURSOR: hand" id="BLOGGER_PHOTO_ID_5577679692660659298" border="0" alt="" src="http://4.bp.blogspot.com/-6DUek9dAQ80/TWfmIO_y1GI/AAAAAAAAABU/WNjlfSKY1xI/s400/recipeview.jpg" /&gt;&lt;/a&gt; &lt;/div&gt;&lt;div&gt;It is exciting to see structured data being leveraged by Google for recipe search. Exploratory search and faceted metadata offer a lot of potential to improve food search. However, I'm disappointed by Google's incarnation. The biggest feature the interface adds is the ability to restrict the results by whether or not a recipe contains a particular ingredient. I don't think that this is very interesting or useful. Would anyone who really cooks use this? The other facets are similarly lacking in utility. Calories aren't as meaningful as sodium, sugar, and fat content.  They could have considered useful facets: chef/publisher, cuisine, vegan/vegetarian, gluten-free, cooking technique, complexity, etc., but they ignored these.  
Clearly, they didn't put much effort or thought into this revision.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;More importantly, I think that the Google Recipe View vertical is currently dangerous and detrimental. When activated, it effectively excludes content from blogs and small website publishers. These websites do not use the rich snippet format. Rich snippet markup provides additional metadata, but it should not be required for inclusion in Recipe View. It is pretty easy to automatically identify whether or not a page contains a recipe using a text classifier and search logs. Personally, I find that content from these websites is often the most useful and interesting. Until Google fixes this issue, webmasters and publishers should consider whether the markup is worth the effort to adopt. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div&gt;I would send Google Recipe View back to the kitchen... it's undercooked and lacks seasoning.&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;&lt;i&gt;Note:&lt;/i&gt;&lt;/b&gt; I recently started a &lt;a href="http://www.cookingphd.com"&gt;food blog&lt;/a&gt;, which does not use rich snippet markup (yet).&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1497545240903383859?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1497545240903383859/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/google-recipe-view-search-disappointing.html#comment-form' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1497545240903383859'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/18315968/posts/default/1497545240903383859'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/02/google-recipe-view-search-disappointing.html' title='Google &quot;Recipe View&quot; Search Disappointing and Dangerous'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-6DUek9dAQ80/TWfmIO_y1GI/AAAAAAAAABU/WNjlfSKY1xI/s72-c/recipeview.jpg' height='72' width='72'/><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7585834993720603588</id><published>2011-02-11T13:41:00.004-05:00</published><updated>2011-02-11T13:51:37.663-05:00</updated><title type='text'>WSDM 2011 - Best paper awards</title><content type='html'>Best paper award: &lt;a href="http://www.research.rutgers.edu/%7Elihong/pub/Li11Unbiased.pdf"&gt;&lt;strong style="font-weight: normal;"&gt;Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms&lt;/strong&gt;&lt;/a&gt; &lt;em&gt;by Lihong Li, Wei Chu, John Langford and Xuanhui Wang&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/em&gt;Best student paper award: &lt;a href="http://www-cs.stanford.edu/%7Ejure/pubs/cascades-wsdm11.pdf"&gt;&lt;strong style="font-weight: normal;"&gt;Correcting for Missing Data in Information Cascades&lt;/strong&gt;&lt;/a&gt;  &lt;em&gt;by Eldar Sadikov, Montserrat Medina, Jure Leskovec and Hector Garcia-Molina&lt;/em&gt;&lt;br /&gt;&lt;em&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;&lt;/span&gt;&lt;br 
/&gt;&lt;/em&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7585834993720603588?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7585834993720603588/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/wsdm-2011-best-paper-awards.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7585834993720603588'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7585834993720603588'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/02/wsdm-2011-best-paper-awards.html' title='WSDM 2011 - Best paper awards'/><author><name>Michael Bendersky</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4877378704569769338</id><published>2011-02-10T22:55:00.006-05:00</published><updated>2011-02-12T12:19:01.203-05:00</updated><title type='text'>"Bing Dialogue Model"by  Harry Shum - Second WSDM 2011 Keynote</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_kWwNGYD5Z9k/TVWBGJkcWAI/AAAAAAAAA_g/x_TsDNQlVpQ/s1600/IMG_2115.JPG"&gt;&lt;img style="display: block; margin: 0px auto 10px; text-align: center; cursor: pointer; width: 320px; height: 240px;" src="http://4.bp.blogspot.com/_kWwNGYD5Z9k/TVWBGJkcWAI/AAAAAAAAA_g/x_TsDNQlVpQ/s320/IMG_2115.JPG" alt="" id="BLOGGER_PHOTO_ID_5572502056588826626" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;!--[if gte mso 9]&gt;&lt;xml&gt;  
&lt;w:worddocument&gt;   &lt;w:view&gt;Normal&lt;/w:View&gt;   &lt;w:zoom&gt;0&lt;/w:Zoom&gt;   &lt;w:trackmoves/&gt;   &lt;w:trackformatting/&gt;   &lt;w:punctuationkerning/&gt;   &lt;w:validateagainstschemas/&gt;   &lt;w:saveifxmlinvalid&gt;false&lt;/w:SaveIfXMLInvalid&gt;   &lt;w:ignoremixedcontent&gt;false&lt;/w:IgnoreMixedContent&gt;   &lt;w:alwaysshowplaceholdertext&gt;false&lt;/w:AlwaysShowPlaceholderText&gt;   &lt;w:donotpromoteqf/&gt;   &lt;w:lidthemeother&gt;EN-US&lt;/w:LidThemeOther&gt;   &lt;w:lidthemeasian&gt;X-NONE&lt;/w:LidThemeAsian&gt;   &lt;w:lidthemecomplexscript&gt;X-NONE&lt;/w:LidThemeComplexScript&gt;   &lt;w:compatibility&gt;    &lt;w:breakwrappedtables/&gt;    &lt;w:snaptogridincell/&gt;    &lt;w:wraptextwithpunct/&gt;    &lt;w:useasianbreakrules/&gt;    &lt;w:dontgrowautofit/&gt;    &lt;w:splitpgbreakandparamark/&gt;    &lt;w:dontvertaligncellwithsp/&gt;    &lt;w:dontbreakconstrainedforcedtables/&gt;    &lt;w:dontvertalignintxbx/&gt;    &lt;w:word11kerningpairs/&gt;    &lt;w:cachedcolbalance/&gt;   &lt;/w:Compatibility&gt;   &lt;m:mathpr&gt;    &lt;m:mathfont val="Cambria Math"&gt;    &lt;m:brkbin val="before"&gt;    &lt;m:brkbinsub val="&amp;#45;-"&gt;    &lt;m:smallfrac val="off"&gt;    &lt;m:dispdef/&gt;    &lt;m:lmargin val="0"&gt;    &lt;m:rmargin val="0"&gt;    &lt;m:defjc val="centerGroup"&gt;    &lt;m:wrapindent val="1440"&gt;    &lt;m:intlim val="subSup"&gt;    &lt;m:narylim val="undOvr"&gt;   &lt;/m:mathPr&gt;&lt;/w:WordDocument&gt; &lt;/xml&gt;&lt;![endif]--&gt;&lt;!--[if gte mso 9]&gt;&lt;xml&gt;  &lt;w:latentstyles deflockedstate="false" defunhidewhenused="true" defsemihidden="true" defqformat="false" defpriority="99" latentstylecount="267"&gt;   &lt;w:lsdexception locked="false" priority="0" semihidden="false" unhidewhenused="false" qformat="true" name="Normal"&gt;   &lt;w:lsdexception locked="false" priority="9" semihidden="false" unhidewhenused="false" qformat="true" name="heading 1"&gt;   &lt;w:lsdexception locked="false" 
priority="9" qformat="true" name="heading 2"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 3"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 4"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 5"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 6"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 7"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 8"&gt;   &lt;w:lsdexception locked="false" priority="9" qformat="true" name="heading 9"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 1"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 2"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 3"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 4"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 5"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 6"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 7"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 8"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 9"&gt;   &lt;w:lsdexception locked="false" priority="35" qformat="true" name="caption"&gt;   &lt;w:lsdexception locked="false" priority="10" semihidden="false" unhidewhenused="false" qformat="true" name="Title"&gt;   &lt;w:lsdexception locked="false" priority="1" name="Default Paragraph Font"&gt;   &lt;w:lsdexception locked="false" priority="11" semihidden="false" unhidewhenused="false" qformat="true" name="Subtitle"&gt;   &lt;w:lsdexception locked="false" priority="22" semihidden="false" unhidewhenused="false" qformat="true" name="Strong"&gt;   &lt;w:lsdexception locked="false" priority="20" semihidden="false" unhidewhenused="false" qformat="true" name="Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="59" semihidden="false" 
unhidewhenused="false" name="Table Grid"&gt;   &lt;w:lsdexception locked="false" unhidewhenused="false" name="Placeholder Text"&gt;   &lt;w:lsdexception locked="false" priority="1" semihidden="false" unhidewhenused="false" qformat="true" name="No Spacing"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" 
name="Light Shading Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" unhidewhenused="false" name="Revision"&gt;   &lt;w:lsdexception locked="false" priority="34" semihidden="false" unhidewhenused="false" qformat="true" name="List Paragraph"&gt;   &lt;w:lsdexception locked="false" priority="29" semihidden="false" unhidewhenused="false" qformat="true" name="Quote"&gt;   &lt;w:lsdexception locked="false" priority="30" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Quote"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List 
Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" 
name="Light Shading Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" 
unhidewhenused="false" name="Light Grid Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="64" 
semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 6"&gt;   &lt;w:lsdexception 
locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="19" semihidden="false" unhidewhenused="false" qformat="true" name="Subtle Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="21" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="31" semihidden="false" unhidewhenused="false" qformat="true" name="Subtle Reference"&gt;   &lt;w:lsdexception locked="false" priority="32" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Reference"&gt;   &lt;w:lsdexception locked="false" priority="33" semihidden="false" unhidewhenused="false" qformat="true" name="Book Title"&gt;   &lt;w:lsdexception locked="false" priority="37" name="Bibliography"&gt;   &lt;w:lsdexception locked="false" priority="39" qformat="true" name="TOC Heading"&gt;  &lt;/w:LatentStyles&gt; &lt;/xml&gt;&lt;![endif]--&gt;&lt;!--[if gte mso 10]&gt; &lt;style&gt;  /* Style Definitions */  table.MsoNormalTable  {mso-style-name:"Table Normal";  
} &lt;/style&gt; &lt;![endif]--&gt;The second keynote talk at WSDM 2011 was an intriguing peek at Bing's model of user intent, by Harry Shum, VP of Search Product Development, Microsoft.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Challenges at launch&lt;/span&gt;&lt;br /&gt;* Google's market share grew steadily from 2005 to 2008 (Bing launch)&lt;br /&gt;* Google is a consumer brand and a habit&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Bing has gained 5.1% query traffic share (worldwide) since launch&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;3 elements of Search Quality&lt;/span&gt;&lt;br /&gt;1) Relevance&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Ranking based on meaning, not keywords&lt;/li&gt;&lt;li&gt;Direct answers&lt;/li&gt;&lt;/ul&gt;2) Speed&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Reduce effort to complete tasks&lt;/li&gt;&lt;li&gt;Direct answers&lt;/li&gt;&lt;li&gt;Fewer clicks&lt;/li&gt;&lt;/ul&gt;3) Ease of Use (User experience)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Intuitive query interface&lt;/li&gt;&lt;li&gt;Relevance is hard&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Demo of Bing features&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;1) “Quick 
access”: surfacing the customer service phone number for the query {delta airline}&lt;br /&gt; &lt;br /&gt;2) Enhanced movie results for queries that match movie titles&lt;br /&gt;      * Rent, buy, watch online, reviews, posters&lt;br /&gt; &lt;br /&gt;3) Microsoft Academic Search&lt;br /&gt;      * Faceted interface: filter results by author, venue&lt;br /&gt;      * Summary pages for author information&lt;br /&gt;      * Academic activity&lt;br /&gt;      * Co-author graph&lt;br /&gt;      * Disambiguation of author names&lt;br /&gt;&lt;br /&gt;4) Summary of important information for queries that match geo-locations&lt;br /&gt;      * Weather&lt;br /&gt;      * Overview of tourist destinations&lt;br /&gt;      * Maps&lt;br /&gt; &lt;br /&gt;5) Parsing natural language queries&lt;br /&gt;      * Parse the query {flight to Taipei feb 12 returning feb 13} to provide fast access to Bing Travel&lt;br /&gt;&lt;br /&gt;6) Music search&lt;br /&gt;      * Enhanced results for queries that match musician names&lt;br /&gt;      * Preview songs, lyrics, bio&lt;br /&gt;&lt;br /&gt;7) Facebook Integration&lt;br /&gt;      * Results that were liked by Facebook friends&lt;br /&gt;      * Surface Facebook profiles for searches that match friends' names&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Internet searchers are becoming more task-centric&lt;/span&gt;&lt;br /&gt;* Decision making: 66% of people use search to make decisions&lt;br /&gt;* Top search tasks: Entertainment, Games, Health, Travel, Shopping, Directions,…&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Tasks are becoming more sophisticated&lt;/span&gt;&lt;br /&gt;  * Longer queries&lt;br /&gt;  * Longer sessions&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;10 blue links are no longer sufficient&lt;/span&gt;&lt;br /&gt;* Instead, there’s a need for an organized “Whole Page” experience&lt;br /&gt;* Search Paradigm Shift&lt;br /&gt;  * From a “hit or miss” model to a “dialogue” 
model&lt;br /&gt;  * Understanding query intent&lt;br /&gt;  * Incorporating structured data into search results&lt;br /&gt;  * Relevance on the session level&lt;br /&gt;  * Minimize the effort to complete a task&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Bing Dialogue Model&lt;/span&gt; (see the image above)&lt;br /&gt;&lt;br /&gt;* &lt;span style="font-weight: bold;"&gt;Four levels of dialogue&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;1) Query level&lt;br /&gt;        * Query auto-completion&lt;br /&gt;        * Spelling correction&lt;br /&gt;        * Interaction with the user on touch-screen mobile devices&lt;br /&gt;&lt;br /&gt;2) Document level&lt;br /&gt;        * Presentation of title, snippet, and deep links&lt;br /&gt;        * Extended document preview when hovering over a result&lt;br /&gt;&lt;br /&gt;3) Page level&lt;br /&gt;        * Quick Tabs for relevant verticals&lt;br /&gt;        * Entity-based result summary&lt;br /&gt;        * Algorithmic results&lt;br /&gt;        * Related query suggestions&lt;br /&gt;        * Search history&lt;br /&gt;  &lt;br /&gt;4) Session level&lt;br /&gt;  * History-aware result comparisons&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4877378704569769338?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4877378704569769338/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/bing-dialogue-modelby-harry-shum-second.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4877378704569769338'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4877378704569769338'/><link rel='alternate' type='text/html' 
href='http://www.searchenginecaffe.com/2011/02/bing-dialogue-modelby-harry-shum-second.html' title='&quot;Bing Dialogue Model&quot;by  Harry Shum - Second WSDM 2011 Keynote'/><author><name>Michael Bendersky</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_kWwNGYD5Z9k/TVWBGJkcWAI/AAAAAAAAA_g/x_TsDNQlVpQ/s72-c/IMG_2115.JPG' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5668910766808839596</id><published>2011-02-09T22:27:00.005-05:00</published><updated>2011-02-10T23:09:36.037-05:00</updated><title type='text'>"Mining Billion-node Graphs: Patterns, Generators and Tools" Christos Faloutsos - First WSDM 2011 Keynote</title><content type='html'>&lt;!--[if gte mso 9]&gt;&lt;xml&gt;  &lt;w:latentstyles deflockedstate="false" defunhidewhenused="true" defsemihidden="true" defqformat="false" defpriority="99" latentstylecount="267"&gt;   &lt;w:lsdexception 
locked="false" priority="39" name="toc 6"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 7"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 8"&gt;   &lt;w:lsdexception locked="false" priority="39" name="toc 9"&gt;   &lt;w:lsdexception locked="false" priority="35" qformat="true" name="caption"&gt;   &lt;w:lsdexception locked="false" priority="10" semihidden="false" unhidewhenused="false" qformat="true" name="Title"&gt;   &lt;w:lsdexception locked="false" priority="1" name="Default Paragraph Font"&gt;   &lt;w:lsdexception locked="false" priority="11" semihidden="false" unhidewhenused="false" qformat="true" name="Subtitle"&gt;   &lt;w:lsdexception locked="false" priority="22" semihidden="false" unhidewhenused="false" qformat="true" name="Strong"&gt;   &lt;w:lsdexception locked="false" priority="20" semihidden="false" unhidewhenused="false" qformat="true" name="Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="59" semihidden="false" unhidewhenused="false" name="Table Grid"&gt;   &lt;w:lsdexception locked="false" unhidewhenused="false" name="Placeholder Text"&gt;   &lt;w:lsdexception locked="false" priority="1" semihidden="false" unhidewhenused="false" qformat="true" name="No Spacing"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1"&gt;   &lt;w:lsdexception locked="false" priority="66" 
semihidden="false" unhidewhenused="false" name="Medium List 2"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" unhidewhenused="false" name="Revision"&gt;   &lt;w:lsdexception locked="false" priority="34" semihidden="false" unhidewhenused="false" qformat="true" name="List Paragraph"&gt;   &lt;w:lsdexception locked="false" priority="29" semihidden="false" unhidewhenused="false" qformat="true" name="Quote"&gt;   &lt;w:lsdexception 
locked="false" priority="30" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Quote"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 1"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 2"&gt; 
  &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 2"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 
2 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 3"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" 
name="Dark List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 4"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" 
unhidewhenused="false" name="Colorful List Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 5"&gt;   &lt;w:lsdexception locked="false" priority="60" semihidden="false" unhidewhenused="false" name="Light Shading Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="61" semihidden="false" unhidewhenused="false" name="Light List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="62" semihidden="false" unhidewhenused="false" name="Light Grid Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="63" semihidden="false" unhidewhenused="false" name="Medium Shading 1 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="64" semihidden="false" unhidewhenused="false" name="Medium Shading 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="65" semihidden="false" unhidewhenused="false" name="Medium List 1 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="66" semihidden="false" unhidewhenused="false" name="Medium List 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="67" semihidden="false" unhidewhenused="false" name="Medium Grid 1 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="68" semihidden="false" unhidewhenused="false" name="Medium Grid 2 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="69" semihidden="false" unhidewhenused="false" name="Medium Grid 3 Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="70" semihidden="false" unhidewhenused="false" name="Dark List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="71" semihidden="false" unhidewhenused="false" name="Colorful Shading Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="72" semihidden="false" unhidewhenused="false" name="Colorful List Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="73" semihidden="false" unhidewhenused="false" name="Colorful Grid Accent 6"&gt;   &lt;w:lsdexception locked="false" priority="19" 
semihidden="false" unhidewhenused="false" qformat="true" name="Subtle Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="21" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Emphasis"&gt;   &lt;w:lsdexception locked="false" priority="31" semihidden="false" unhidewhenused="false" qformat="true" name="Subtle Reference"&gt;   &lt;w:lsdexception locked="false" priority="32" semihidden="false" unhidewhenused="false" qformat="true" name="Intense Reference"&gt;   &lt;w:lsdexception locked="false" priority="33" semihidden="false" unhidewhenused="false" qformat="true" name="Book Title"&gt;   &lt;w:lsdexception locked="false" priority="37" name="Bibliography"&gt;   &lt;w:lsdexception locked="false" priority="39" qformat="true" name="TOC Heading"&gt;  &lt;/w:LatentStyles&gt; &lt;/xml&gt;&lt;![endif]--&gt;&lt;!--[if gte mso 10]&gt; &lt;style&gt;  /* Style Definitions */  table.MsoNormalTable  {mso-style-name:"Table Normal";  mso-tstyle-rowband-size:0;  mso-tstyle-colband-size:0;  mso-style-noshow:yes;  mso-style-priority:99;  mso-style-qformat:yes;  mso-style-parent:"";  mso-padding-alt:0in 5.4pt 0in 5.4pt;  mso-para-margin-top:0in;  mso-para-margin-right:0in;  mso-para-margin-bottom:10.0pt;  mso-para-margin-left:0in;  line-height:115%;  mso-pagination:widow-orphan;  font-size:11.0pt;  font-family:"Calibri","sans-serif";  mso-ascii-font-family:Calibri;  mso-ascii-theme-font:minor-latin;  mso-hansi-font-family:Calibri;  mso-hansi-theme-font:minor-latin;  mso-bidi-font-family:"Times New Roman";  mso-bidi-theme-font:minor-bidi;} &lt;/style&gt; &lt;![endif]--&gt;  &lt;p class="MsoNormal"&gt;Christos Faloutsos, Professor at Dept. of Computer Science, CMU, gave an excellent keynote this morning on pattern mining in large scale graphs. 
Below is a brief summary of this keynote.&lt;br /&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why should we care about graphs?
&lt;ul&gt;
&lt;li&gt;Internet map&lt;/li&gt;
&lt;li&gt;Predator-prey networks in nature&lt;/li&gt;
&lt;li&gt;Social networks&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Patterns in graphs
&lt;ul&gt;
&lt;li&gt;What is normal/abnormal?&lt;/li&gt;
&lt;li&gt;Which patterns hold in large datasets?&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Types of patterns
&lt;ul&gt;
&lt;li&gt;Graphs are *not* uniformly distributed&lt;/li&gt;
&lt;li&gt;Power law in the degree distribution&lt;/li&gt;
&lt;li&gt;Power law in the eigenvalues of the adjacency matrices&lt;/li&gt;
&lt;li&gt;Triangle law (a triangle is a three-node clique in the graph)
&lt;ul&gt;
&lt;li&gt;Power law in the distribution of triangles in graphs&lt;/li&gt;
&lt;li&gt;#triangles = 1/6 * SUM(EV^3), where EV ranges over the eigenvalues of the graph's adjacency matrix
&lt;ul&gt;
&lt;li&gt;Can be used to speed up computation over large graphs&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;EigenSpokes
&lt;ul&gt;
&lt;li&gt;Principal component analysis shows that many large-scale graphs are made up of tight communities that are only loosely connected to each other&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Patterns on weighted graphs
&lt;ul&gt;
&lt;li&gt;Weights grow superlinearly with in-degree&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Time evolution in graphs
&lt;ul&gt;
&lt;li&gt;Surprisingly, graph diameter shrinks over time as the graph grows&lt;/li&gt;
&lt;li&gt;NODES(t+1) = 2*NODES(t)&lt;/li&gt;
&lt;li&gt;EDGES(t+1) = 1.6*NODES(t+1)&lt;/li&gt;
&lt;li&gt;Node popularity over time (e.g., visits to blog posts)
&lt;ul&gt;
&lt;li&gt;Drop-off in popularity follows a power law with exponent 1.6&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Duration of tasks (e.g., duration of phone calls)
&lt;ul&gt;
&lt;li&gt;The longer the task has taken so far, the longer it is expected to take in the future&lt;/li&gt;
&lt;li&gt;Similar to the log-logistic distribution in survival analysis&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Tools for anomaly detection in graphs
&lt;ul&gt;
&lt;li&gt;Egonet of a node: the neighbors of the node and the edges between them
&lt;ul&gt;
&lt;li&gt;Examining the features of the egonets can be used to detect anomalous nodes&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Belief propagation can be used for fraud detection&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Scalability
&lt;ul&gt;
&lt;li&gt;Map-Reduce for graphs
&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.cs.cmu.edu/%7Epegasus/"&gt;PEGASUS&lt;/a&gt;: a tool for parallelizing computation of various graph characteristics for large graphs (using Hadoop)
&lt;ul&gt;
&lt;li&gt;Graph radius
&lt;ul&gt;
&lt;li&gt;14 for a web graph of 1.4B nodes&lt;/li&gt;
&lt;li&gt;Many loosely connected tight communities&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;li&gt;Connected components
&lt;ul&gt;
&lt;li&gt;Giant connected component ~ 1B nodes&lt;/li&gt;
&lt;li&gt;Many disconnected nodes&lt;/li&gt;
&lt;li&gt;Many suspicious connected components (link spam)&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p
class="MsoListParagraphCxSpLast" style="margin-left: 2in;"&gt; &lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5668910766808839596?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5668910766808839596/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/mining-billion-node-graphs-patterns.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5668910766808839596'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5668910766808839596'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/02/mining-billion-node-graphs-patterns.html' title='&quot;Mining Billion-node Graphs: Patterns, Generators and Tools&quot; Christos Faloutsos - First WSDM 2011 Keynote'/><author><name>Michael Bendersky</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1665027089204019509</id><published>2011-02-07T19:02:00.005-05:00</published><updated>2011-02-09T22:27:08.228-05:00</updated><title type='text'>WSDM 2011 coverage</title><content type='html'>&lt;span style="font-family:arial;"&gt;As I am attending &lt;a href="http://www.wsdm2011.org/"&gt;WSDM 2011&lt;/a&gt; in Hong Kong to present a &lt;a href="http://ciir.cs.umass.edu/%7Ebemike/pubs/2011-1.pdf"&gt;paper&lt;/a&gt;, I will be taking over Jeff's blog for the next week. 
Expect posts about invited talks by &lt;strong style="font-family: arial; font-weight: normal;"&gt;&lt;a href="http://www.wsdm2011.org/wsdm2011/keynotes#mining_billion-node_graphspatterns_generators_and_tools"&gt;Christos Faloutsos&lt;/a&gt; and &lt;/strong&gt;&lt;a href="http://www.wsdm2011.org/wsdm2011/keynotes#bing_dialog_modelintent_knowledge_and_user_interaction"&gt;&lt;strong style="font-family: arial; font-weight: normal;"&gt;Harry Shum&lt;/strong&gt;&lt;/a&gt;&lt;span style="font-family:arial;"&gt;, &lt;a href="http://www.wsdm2011.org/wsdm2011/awards"&gt;best paper&lt;/a&gt; award updates and more.&lt;/span&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="font-family:arial;"&gt;Comment if you have any wishes or qu&lt;/span&gt;&lt;span style="font-family:arial;"&gt;estions.&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;a href="http://ciir.cs.umass.edu/%7Ebemike/"&gt;&lt;span style="font-family:arial;"&gt;Michael Bendersky&lt;/span&gt;&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1665027089204019509?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1665027089204019509/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/02/wsdm-2011-coverage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1665027089204019509'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1665027089204019509'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/02/wsdm-2011-coverage.html' title='WSDM 2011 coverage'/><author><name>Michael Bendersky</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' 
width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2657617698066059489</id><published>2011-01-27T06:13:00.004-05:00</published><updated>2011-01-27T14:19:15.603-05:00</updated><title type='text'>Forget the Superbowl, Host a Watson Jeopardy Party</title><content type='html'>The &lt;a href="http://www-03.ibm.com/innovation/us/watson/"&gt;IBM Watson Jeopardy&lt;/a&gt; contest is happening February 14th, 15th, and 16th.  The time to start planning your party is now!&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Buy a &lt;a href="http://www.etsy.com/listing/65993473/ibmjeopardy-challenge-commemorative"&gt;Watson Poster&lt;/a&gt; now to start advertising.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It will be a human vs. machine showdown and costumes are encouraged!  Perhaps a stick-on &lt;a href="http://www.lemurproject.org/indri/"&gt;Lemur &lt;/a&gt;tattoo might be appropriate.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For food, consider:&lt;/div&gt;&lt;div&gt;&lt;div&gt;- Robot 'Chips' - homemade potato chips tossed in herb salt&lt;/div&gt;&lt;div&gt;- "Buzz in" Mousse - A chocolate-coffee mousse with chocolate-covered espresso beans&lt;/div&gt;&lt;/div&gt;&lt;div&gt; - A Jeopardy-themed cake with a robot chasing a human (think Bender from Futurama)&lt;/div&gt;&lt;div&gt; - 'Deconstructed' &lt;a href="http://www.thisisbrandx.com/2010/02/michael-voltaggio-makes-molecular-super-bowl-hot-wings.html"&gt;Techno-Style boneless buffalo wings&lt;/a&gt; &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Leave more suggestions in the comments!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(For some more details on how Watson works, read &lt;a href="http://www.stanford.edu/class/cs124/AIMagzine-DeepQA.pdf"&gt;Building Watson: An Overview of the DeepQA Project&lt;/a&gt; - AI 
Magazine).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also read Stephen Baker's new book, &lt;a href="http://thenumerati.net/"&gt;Final Jeopardy&lt;/a&gt;.  The eBook is available on &lt;a href="http://www.amazon.com/Final-Jeopardy-Machine-Quest-Everything/dp/0547483163"&gt;Amazon &lt;/a&gt;now and the first chapter is free to read.  The last chapter will be released after the match.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2657617698066059489?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2657617698066059489/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/01/forget-superbowl-host-watson-jeopardy.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2657617698066059489'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2657617698066059489'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/01/forget-superbowl-host-watson-jeopardy.html' title='Forget the Superbowl, Host a Watson Jeopardy Party'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4972632911478776566</id><published>2011-01-26T13:03:00.005-05:00</published><updated>2011-01-26T13:19:24.343-05:00</updated><title type='text'>Sofia-ML and Maui Two Cool Machine Learning and Extraction 
libraries</title><content type='html'>&lt;a href="http://www.cs.umass.edu/~niranjan/"&gt;Niranjan&lt;/a&gt; recently suggested I check out &lt;a href="http://code.google.com/p/sofia-ml/"&gt;Sofia-ML&lt;/a&gt; as a fast tool for some learning-to-rank applications.  From the project description:&lt;div&gt;&lt;blockquote&gt;The suite of fast incremental algorithms for machine learning (sofia-ml) can be used for training models for classification, regression, ranking, or combined regression and ranking. Several different techniques are available. This release is intended to aid researchers and practitioners who require fast methods for classification and ranking on large, sparse data sets.&lt;/blockquote&gt;&lt;/div&gt;It is written in C/C++.  One nice feature is that it supports the LETOR feature file format for ranking.  The authors describe some of the algorithms in &lt;a href="http://www.eecs.tufts.edu/~dsculley/papers/large-scale-rank.pdf"&gt;Large Scale Learning to Rank&lt;/a&gt; from NIPS, where they presented fast learning algorithms for approximate SVMs. The &lt;a href="http://portal.acm.org/citation.cfm?id=1835804.1835928"&gt;Combined Regression and Ranking&lt;/a&gt; (CRR) work was presented as a paper at &lt;a href="http://www.sigkdd.org/kdd2010/"&gt;KDD 2010&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The other package I recently discovered is &lt;a href="http://code.google.com/p/maui-indexer/"&gt;Maui-Indexer&lt;/a&gt;.   Maui-Indexer is an extension of the &lt;a href="http://www.nzdl.org/Kea/"&gt;KEA&lt;/a&gt; key phrase extractor.&lt;br /&gt;&lt;blockquote&gt;... it allows the assignment of topics to documents based on terms from Wikipedia using Wikipedia Miner. 
Maui also has many new features that help identify topics more accurately.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;You can read more about it on the web page and read the &lt;a href="http://code.google.com/p/maui-indexer/wiki/Publications"&gt;publications&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4972632911478776566?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4972632911478776566/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/01/sofia-ml-and-maui-two-cool-machine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4972632911478776566'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4972632911478776566'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/01/sofia-ml-and-maui-two-cool-machine.html' title='Sofia-ML and Maui Two Cool Machine Learning and Extraction libraries'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8092008858607112020</id><published>2011-01-19T10:56:00.002-05:00</published><updated>2011-01-19T11:53:03.302-05:00</updated><title type='text'>Microsoft Research Spelling Corrector Challenge</title><content type='html'>Microsoft is sponsoring the &lt;a href="http://web-ngram.research.microsoft.com/spellerchallenge"&gt;Speller 
Challenge&lt;/a&gt;, a competition to create the best search query correction service.  The contest started on Monday and will run until May 27th.  It appears to be an initiative to spur use of its &lt;a href="http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx"&gt;web n-gram service&lt;/a&gt;, which includes query log n-grams.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The training dataset that they provide is based on the &lt;a href="http://ir.cis.udel.edu/million/"&gt;TREC Million Query Track&lt;/a&gt;.  The speller challenge team annotated a sample of the track logs and has it &lt;a href="http://research.microsoft.com/en-us/downloads/ff7aba09-fbb4-4201-bc98-23e2a3674e3c/default.aspx"&gt;available for download&lt;/a&gt;.  The dataset contains 5892 queries, including 311 misspelled queries and 1122 queries that have some suggested spelling change.  The challenge will be evaluated using the Expected F1 (EF1) score, the harmonic mean of precision and recall.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;I am a bit skeptical about the training dataset.  First, the number of misspellings is significantly lower than I would expect.  The dataset is also quite small.  The website appears to offer a place for contestants to share data, the "Team datasets", so it's possible that some teams could annotate a larger dataset.  Also, the non-misspelling "suggestions" are not clearly described.  I think these need more explanation and differentiation from query reformulations.  To put this in context, here is a very brief overview of query correction.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;According to several studies of search queries, approximately 10-15% of search queries contain spelling errors. 
For example, an important paper in the area is &lt;a href="http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Cucerzan.pdf"&gt;Spelling correction as an iterative process that exploits the collective knowledge of web users&lt;/a&gt;.  Beyond spelling correction, there has been a recent trend towards other types of query reformulations: &lt;a href="http://portal.acm.org/citation.cfm?id=1277741.1277851"&gt;stemming&lt;/a&gt;, &lt;a href="http://portal.acm.org/citation.cfm?id=1135777.1135835"&gt;substitutions&lt;/a&gt;, and query expansion. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you want an easy way to get started, consider the &lt;a href="http://alias-i.com/lingpipe"&gt;LingPipe toolkit&lt;/a&gt;, which has a &lt;a href="http://alias-i.com/lingpipe/demos/tutorial/querySpellChecker/read-me.html"&gt;tutorial&lt;/a&gt; and is offering a &lt;a href="http://lingpipe-blog.com/2010/12/21/a-call-to-code-microsoft-researchbing-query-spell-check-challenge/"&gt;special license for the competition&lt;/a&gt;.  
You may also be inspired by &lt;a href="http://norvig.com/spell-correct.html"&gt;Peter Norvig's 21 line spelling corrector&lt;/a&gt; in Python.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8092008858607112020?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8092008858607112020/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2011/01/microsoft-research-spelling-corrector.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8092008858607112020'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8092008858607112020'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2011/01/microsoft-research-spelling-corrector.html' title='Microsoft Research Spelling Corrector Challenge'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7715582511524210159</id><published>2010-12-10T10:00:00.005-05:00</published><updated>2010-12-10T13:56:44.468-05:00</updated><title type='text'>Seeking Summer Internship Opportunities</title><content type='html'>I am beginning to explore internship opportunities for the summer of 2011.  
&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am an applied researcher whose interests are search in specialized domains and building search tools to solve complex information needs.  My experience includes search in the engineering domain, medical search, local business objects, food and recipes, and information extraction on &lt;a href="http://www.archive.org/details/texts"&gt;book data&lt;/a&gt;.  My work often involves processing of large datasets using distributed processing frameworks such as Hadoop and PIG.  &lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am looking for opportunities that fit with my background and preferably include research that could lead to a publishable paper in a major conference. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you know of any opportunities that would be appropriate, please contact me via email.  My CV is available from my website (&lt;a href="http://www.cs.umass.edu/~jdalton/files/Jeff_Dalton_CV_12-10-2010.doc"&gt;Word&lt;/a&gt;, &lt;a href="http://www.cs.umass.edu/~jdalton/files/Jeff_Dalton_CV_12-10-2010.pdf"&gt;PDF&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Several of my fellow PhD students here in the &lt;a href="http://ciir.cs.umass.edu/"&gt;CIIR&lt;/a&gt; are also seeking internships, so I would be happy to pass along any appropriate opportunities.&lt;/div&gt;&lt;div&gt;  &lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7715582511524210159?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7715582511524210159/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/12/seeking-summer-internship-opportunities.html#comment-form' title='4 Comments'/><link rel='edit' 
type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7715582511524210159'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7715582511524210159'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/12/seeking-summer-internship-opportunities.html' title='Seeking Summer Internship Opportunities'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4323969295841147927</id><published>2010-12-08T17:19:00.003-05:00</published><updated>2010-12-08T17:35:06.440-05:00</updated><title type='text'>New Book: Mining of Massive Datasets</title><content type='html'>&lt;a href="http://infolab.stanford.edu/~anand/"&gt;Anand Rajaraman&lt;/a&gt; and &lt;a href="http://infolab.stanford.edu/~ullman/"&gt;Jeffrey D. Ullman&lt;/a&gt; have put together a new ebook, &lt;a href="http://infolab.stanford.edu/~ullman/pub/book.pdf"&gt;Mining of Massive Datasets&lt;/a&gt;.  
The book builds on the course materials for the Stanford CS345 course "Web Mining" and the CS246 class, &lt;a href="http://www.stanford.edu/class/cs246/cs246-11-mmds/"&gt;Mining Massive Data Sets&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From the ToC, the book covers:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;An introduction to data mining&lt;/li&gt;&lt;li&gt;Large-scale processing with distributed file systems and MapReduce&lt;/li&gt;&lt;li&gt;Similarity search: nearest neighbor, minhashing, LSH, etc.&lt;/li&gt;&lt;li&gt;Algorithms for mining streaming data&lt;/li&gt;&lt;li&gt;(Web) Graph analysis: PageRank, HITS, and spam detection&lt;/li&gt;&lt;li&gt;Frequent Itemset algorithms&lt;/li&gt;&lt;li&gt;Clustering Algorithms&lt;/li&gt;&lt;li&gt;Advertising on the web&lt;/li&gt;&lt;li&gt;Recommendation Systems&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;It is an interesting blend of topics that are not usually taught together.  I look forward to examining it in more detail.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4323969295841147927?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4323969295841147927/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/12/new-book-mining-of-massive-datasets.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4323969295841147927'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4323969295841147927'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/12/new-book-mining-of-massive-datasets.html' title='New Book: Mining of Massive 
Datasets'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3951470714203292064</id><published>2010-12-07T15:47:00.006-05:00</published><updated>2010-12-08T10:28:51.612-05:00</updated><title type='text'>Barriers to Entry in Search Getting Lower</title><content type='html'>The Mims's Bits column in the MIT Tech Review has an article, &lt;a href="http://www.technologyreview.com/blog/mimssbits/26106/"&gt;You, Too Can Be the Next Google&lt;/a&gt;.  In the article, Tom Annau, the CTO of &lt;a href="http://blekko.com/"&gt;blekko&lt;/a&gt; (see my &lt;a href="http://www.searchenginecaffe.com/2010/11/blekko-launches-brings-transparency-to.html"&gt;previous post&lt;/a&gt;), argues that computing power is growing faster than the amount of 'useful' and 'interesting' content on the web.&lt;div&gt;&lt;blockquote&gt;"Web search is still an application that pushes the boundaries of current computing devices pretty hard," says Annau. But Blekko accomplishes a complete, up-to-the-minute index of the Web with less than 1000 servers...&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;To be more efficient, they are more careful about what they crawl by:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Avoiding crawling spam and splog content&lt;/li&gt;&lt;li&gt;Using a "split-crawl" strategy that refreshes different genres of content at different rates to ensure that blogs and news are refreshed often.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;&lt;div&gt;I'm not sure blekko's "efficiency" techniques are particularly interesting or novel.  
However, I do think that, overall, crawling and indexing the entire web is getting easier, especially with distributed crawlers (like &lt;a href="http://bixo.101tec.com/"&gt;Bixo&lt;/a&gt;).&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;"Whether we succeed or fail as a startup, it will be true that every year that goes by individual servers will become more and more powerful, and the ability to crawl and index the useful info on the Web will actually become more and more affordable," says Annau.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;The 2008 paper by Mei and Church, &lt;a href="http://portal.acm.org/citation.cfm?id=1341540"&gt;Entropy of search logs: how hard is search? with personalization? with backoff?&lt;/a&gt;, analyzed a large search engine log to determine the size of this 'interesting' part of the web.  They find that the URLs in search logs can be encoded using approximately 22 bits, which corresponds to millions of pages (2^22 is roughly 4 million).  As they say,&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;blockquote&gt;Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions with a small cache of millions.&lt;/blockquote&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;In principle, if you knew these pages and had a way of accurately predicting which ones change, then the price of search could be significantly reduced.  In the paper, they go on to highlight that a personalized page cache, or one based on profiles of similar users, offers an even greater opportunity.  In short, there is great opportunity for small, highly personalized verticals.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I think the main reason that blekko needs only a modest number of servers is that its query volume is small.  
One of the key reasons that Google and other web search engines need thousands and thousands of computers is to support very low query latency for billions of queries per day from hundreds of millions of users around the world.  To pull this off, Google keeps its search index in memory (see the &lt;a href="http://www.searchenginecaffe.com/2009/02/jeffrey-dean-wsdm-keynote-building.html"&gt;Jeff Dean WSDM 2009 keynote&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3951470714203292064?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3951470714203292064/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/12/barriers-to-entry-in-search-getting.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3951470714203292064'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3951470714203292064'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/12/barriers-to-entry-in-search-getting.html' title='Barriers to Entry in Search Getting Lower'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8453337198405442466</id><published>2010-12-01T15:42:00.002-05:00</published><updated>2010-12-01T16:15:44.401-05:00</updated><title type='text'>Google Fixes DecorMyEyes.com 
Problem</title><content type='html'>This afternoon Amit Singhal, from Google Search Quality, wrote a blog post about how Google &lt;a href="http://googleblog.blogspot.com/2010/12/being-bad-to-your-customers-is-bad-for.html"&gt;fixed the recent DecorMyEyes.com fiasco&lt;/a&gt;.  The story was broken by a &lt;a href="http://www.nytimes.com/2010/11/28/business/28borker.html"&gt;NY Times article&lt;/a&gt; exposing how a disreputable merchant gained a high ranking by being mean to his customers.  He gained links and reputation by being written up negatively on many popular and important sites.  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What's interesting in Amit's post is the insight into how Google's team approached solving the problem.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) Blocking only this site would not solve the underlying problem.&lt;/div&gt;&lt;div&gt;2) Sentiment analysis wouldn't solve the problem because reputation was coming from neutral news sites with solid reputations.  
Google has not yet found a useful way to incorporate sentiment into ranking.&lt;/div&gt;&lt;div&gt;3) Exposing reviews and ratings next to the results would not actually alter the ranking.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Their new, undisclosed fix detected the problematic merchant and several hundred other bad apples.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8453337198405442466?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8453337198405442466/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/12/google-fixes-decormyeyescom-problem.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8453337198405442466'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8453337198405442466'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/12/google-fixes-decormyeyescom-problem.html' title='Google Fixes DecorMyEyes.com Problem'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4737219084485686931</id><published>2010-11-30T12:21:00.003-05:00</published><updated>2010-11-30T12:39:54.973-05:00</updated><title type='text'>Lectures on User Behavior Modeling and Implicit Feedback from Query Logs</title><content 
type='html'>I am one of the TAs for the &lt;a href="http://cs646.cs.umass.edu/"&gt;graduate IR course&lt;/a&gt; at UMass this semester.  I recently gave two lectures on modeling user behavior and utilizing implicit user feedback from logs.&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://cs646.cs.umass.edu/sites/default/files/ir19-information-seeking.pdf"&gt;User Behavior Modeling&lt;/a&gt;.  I covered models of information seeking behavior. Then, I went over the Google 3M (micro-, meso-, and macro-) characterizations of interactions.  We looked at how we learn about these various levels of interactions through field and lab studies, instrumented panels, and query logs.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://cs646.cs.umass.edu/sites/default/files/ir20-implicit-user-feedback.pdf"&gt;Implicit User Feedback&lt;/a&gt;.  We finished up query log analysis including query classification, applications like disambiguation and trends.  Most of the time was spent on interpreting clickthrough and browsing behavior to generate preference and relevance data.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you want to learn more, a lot of the lectures build on materials from Eugene Agichtein's &lt;a href="http://ir.mathcs.emory.edu/intent_tutorial/"&gt;tutorial on Inferring User Intent&lt;/a&gt; at WWW 2010.  
If you want more detail, their &lt;a href="http://ir.mathcs.emory.edu/intent/"&gt;intent project&lt;/a&gt; is a good place to start.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4737219084485686931?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4737219084485686931/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/lectures-on-user-behavior-modeling-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4737219084485686931'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4737219084485686931'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/lectures-on-user-behavior-modeling-and.html' title='Lectures on User Behavior Modeling and Implicit Feedback from Query Logs'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4910591245706887130</id><published>2010-11-30T11:27:00.003-05:00</published><updated>2010-11-30T14:46:11.687-05:00</updated><title type='text'>Call for participation of Academic IR community in Lucene</title><content type='html'>Otis Gospodnetic, a committer on the Lucene project put out a call on the &lt;a href="http://blog.sematext.com/"&gt;SemaText blog&lt;/a&gt; for greater engagement of academia with the open 
source Solr/Lucene community.  In particular, he is seeking &lt;a href="http://blog.sematext.com/2010/11/29/lucene-solr-for-academia-phd-thesis-ideas/"&gt;ideas for advanced topics&lt;/a&gt; that would be worthy of an MS/PhD thesis and could be implemented and contributed to the community.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you have ideas, please add them to the public &lt;a href="https://spreadsheets.google.com/ccc?key=0Amm2Nme6sg3PdGJacWpTNktuN3VmTGdOY19LcWFDWXc&amp;amp;authkey=CNDe6IcC&amp;amp;hl=en#gid=0"&gt;idea spreadsheet&lt;/a&gt; he started. I strongly encourage you to go there and contribute.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lucene is the most widely used search engine library.  If important new academic ideas that improve retrieval get incorporated, the impact would be huge. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;However, historically, the Lucene community and academia have remained largely separate.  Research teams have instead developed their own systems; the fragmentation is apparent if you look at my list of &lt;a href="http://www.searchenginecaffe.com/2007/03/open-source-search-engines-in-java-and.html"&gt;open source search libraries&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lucene's ranking algorithms are dated, and the library is inflexible and difficult to change. Because it is so widely adopted, it is hard to modify and extend in radical ways.  
If academia is going to get involved, some of these issues need to be addressed, and a lot of it is straightforward engineering work that would enable it to be a better research platform.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;    &lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4910591245706887130?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4910591245706887130/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/call-for-participation-of-academic-ir.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4910591245706887130'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4910591245706887130'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/call-for-participation-of-academic-ir.html' title='Call for participation of Academic IR community in Lucene'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3165144568642657922</id><published>2010-11-04T14:45:00.002-04:00</published><updated>2010-11-04T17:17:48.635-04:00</updated><title type='text'>Susan Dumais CIKM 2010 Keynote: Temporal Dynamics in Information Retrieval</title><content type='html'>I am still catching up on a backlog of items from last week.&lt;br /&gt;&lt;br /&gt;Here are more of &lt;a 
href="http://ciir.cs.umass.edu/%7Ebemike/"&gt;Michael&lt;/a&gt;'s notes from &lt;a href="http://research.microsoft.com/en-us/um/people/sdumais/"&gt;Susan Dumais&lt;/a&gt;' keynote presentation at CIKM 2010 that addressed the impact of time on web search. Gene also has &lt;a href="http://palblog.fxpal.com/?p=4873"&gt;his notes&lt;/a&gt; from the presentation.&lt;br /&gt;&lt;ul&gt;&lt;li&gt; Change in IR&lt;/li&gt;&lt;ul&gt;&lt;li&gt;New documents and queries &lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Query volume changes seasonally/periodically &lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Document content changes over time&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;User interactions change over time (e.g., anchor text, page visits)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Relevant document for query change over time, “Hurricane Earl” (Sept. 2010 vs. before/after)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;But -&gt; evaluation corpora is usually static&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;li&gt;Digital dynamics are relatively easy to capture, however tools for interacting with information are static (Browsers/search engines)&lt;/li&gt;&lt;br /&gt;&lt;li&gt;Characteristics of Web page change&lt;/li&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.blogger.com/www.cs.cmu.edu/%7Ejelsas/papers/wsdm09-change-camready.pdf"&gt;Adar et al. (WSDM, 2009)&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Measuring web page change in a large web crawl&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;33% of web pages changed over a period of 11 weeks&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;66% of visited pages changed over 5 weeks, 63 changed every hr&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Avg. 
time between changes – 123 hr.&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;.com pages change more often than .gov,.org&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Knot point – the place on the change curve where the page stabilizes over time; Characterizes the way pages change&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Term-level changes&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Looking at characteristic term for the page and their “staying power”, e.g. “cookbooks” &amp;amp; “ingredients” have a high staying power for allrecipes.com, “barbeque” is more transient&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;li&gt;Revisitation Patterns on the Web&lt;/li&gt;&lt;ul&gt;&lt;li&gt;60-80% of the pages you visited, you’ve already seen before&lt;br /&gt;&lt;/li&gt;&lt;li&gt;4 revisit patterns:&lt;br /&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Fast - Navigation within site&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Hybrid - High quality fast pages&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Medium - Popular homepages/mail &amp;amp; web applications&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Slow - Entry pages, bank pages, accessed via search engines&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;li&gt;Revisitations &amp;amp; Search (&lt;a href="http://portal.acm.org/citation.cfm?id=1277770&amp;amp;CFID=112026226&amp;amp;CFTOKEN=25358452"&gt;Teevan et al, SIGIR 2007&lt;/a&gt;, &lt;a href="http://www.blogger.com/people.csail.mit.edu/teevan/work/publications/papers/wsdm10.pdf"&gt;Tyler et al., WSDM 2010&lt;/a&gt;)&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Repeat query 33%&lt;/li&gt;&lt;li&gt;Repeat click 39%&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;li&gt;Relationships between revisits and change (&lt;a href="http://research.microsoft.com/apps/pubs/default.aspx?id=79631"&gt;Adar et al., CHI 2009&lt;/a&gt;)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Monitor change&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Effect change is not related to change&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Change can interfere with re-finding&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The more visitors 
the page has, the more often it changes&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Three pages: nytimes.com, woot.com, costco.com&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Similar change patterns, but different revisit patterns:&lt;/li&gt;&lt;li&gt;NYT – fast revisit&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Woot – medium revisit&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Costco – slow revisit&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://research.microsoft.com/en-us/projects/diffie/default.aspx"&gt;Diff-IE&lt;/a&gt; – Building support for understanding change&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Browser toolbar that highlights content that was changed since the last visit&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Non-intrusive and personalized --- changes that are of interest to you, not to the publisher of the page&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Helps to uncover unexpected important content&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Facilitates serendipitous encounters&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Helps to understand page dynamics&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Will be publicly available later this month from&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://research.microsoft.com/en-us/projects/diffie/default.aspx"&gt;http://research.microsoft.com/en-us/projects/diffie/default.aspx&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Research surveys show that Diff-IE drives more revisitation&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Driving visits to pages that change frequently&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.blogger.com/www.cs.cmu.edu/%7Ejelsas/.../WSDM2010-ContentChangeInRanking.pdf"&gt;Leveraging Temporal Dynamics for IR&lt;/a&gt; (Elsas &amp;amp; Dumais, WSDM 2010)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Use document change rate to set document priors&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Use term longevity to weight terms&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Evaluation using static data&lt;br /&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Using 
2k navigational queries&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Dynamic model outperforms the static baseline&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;li&gt;Ongoing evaluation collection (&lt;a href="http://wiki.cse.cuhk.edu.hk/wsdm2011/accepted-papers"&gt;Understanding Temporal Query Dynamics&lt;/a&gt;, to appear in WSDM 2011)&lt;br /&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Collect relevance judgments over time, e.g. “march madness” query&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Document relevance changes over time&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3165144568642657922?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3165144568642657922/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/susan-dumais-cikm-2010-keynote-temporal.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3165144568642657922'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3165144568642657922'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/susan-dumais-cikm-2010-keynote-temporal.html' title='Susan Dumais CIKM 2010 Keynote: Temporal Dynamics in Information Retrieval'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-9209175811379640975</id><published>2010-11-03T16:30:00.003-04:00</published><updated>2010-11-03T16:55:01.731-04:00</updated><title type='text'>Yahoo! Open Sources S4 Real-Time MapReduce framework</title><content type='html'>Today Yahoo! &lt;a href="http://twitter.com/#!/s4project"&gt;announced&lt;/a&gt; the release of a new real-time MapReduce framework written in Java called &lt;a href="http://s4.io/"&gt;S4&lt;/a&gt;. From the website,&lt;div&gt;&lt;blockquote&gt;S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;For more technical details you can read the &lt;a href="http://wiki.s4.io/Manual/S4Overview#toc1"&gt;technical overview&lt;/a&gt; or check out the &lt;a href="http://github.com/s4/core"&gt;code on github&lt;/a&gt;.  The &lt;a href="https://github.com/s4/examples"&gt;example application&lt;/a&gt; keeps counts of hash tags in a Twitter stream.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The framework was previously announced at a &lt;a href="http://www.blogger.com/Yahoo%20announced%20the%20release%20of%20S4,%20a%20real-time%20MapReduce%20stream%20processing%20framework.%20%20http://labs.yahoo.com/event/99"&gt;Y! 
lab event&lt;/a&gt; which discussed processing in Y!'s advertising platform.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-9209175811379640975?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/9209175811379640975/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/yahoo-open-sources-s4-real-time.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/9209175811379640975'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/9209175811379640975'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/yahoo-open-sources-s4-real-time.html' title='Yahoo! Open Sources S4 Real-Time MapReduce framework'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5980481775034838945</id><published>2010-11-03T16:20:00.002-04:00</published><updated>2010-11-03T16:30:56.590-04:00</updated><title type='text'>Google Open Sources Sawzall</title><content type='html'>Google today open sourced &lt;a href="http://code.google.com/p/szl/"&gt;sawzall&lt;/a&gt;, see the &lt;a href="http://research.google.com/archive/sawzall.html"&gt;original publication&lt;/a&gt;.  
From its description,&lt;div&gt;&lt;blockquote&gt;Sawzall is a procedural language developed for parallel analysis of very large data sets (such as logs). It provides protocol buffer handling, regular expression support, string and array manipulation, associative arrays (maps), structured data (tuples), data fingerprinting (64-bit hash values), time values, various utility operations and the usual library functions operating on floating-point and string values. For years Sawzall has been Google's logs processing language of choice and is used for various other data analysis tasks across the company.&lt;/blockquote&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5980481775034838945?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5980481775034838945/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/google-open-sources-sawzall.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5980481775034838945'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5980481775034838945'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/google-open-sources-sawzall.html' title='Google Open Sources Sawzall'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-93356729352090188</id><published>2010-11-03T12:31:00.010-04:00</published><updated>2010-11-05T11:14:41.349-04:00</updated><title type='text'>Components of Compelling Vertical Search</title><content type='html'>&lt;div&gt;In this post, I will discuss key components of successful topic-specific &lt;a href="http://www.searchenginecaffe.com/2005/11/vertical-search-definition-and-context.html"&gt;vertical search&lt;/a&gt;.   I was motivated to write it by the &lt;a href="http://www.searchenginecaffe.com/2010/11/blekko-launches-brings-transparency-to.html"&gt;launch of blekko&lt;/a&gt; earlier this week. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Blekko is marketing its ability to slice the web up into verticals using slashtags. Blekko's slashtags define a list of hosts or pages to focus a search.  But that is not enough to be successful. Search in a vertical needs to provide a significantly different experience from general web search.  A compelling vertical search engine has the following key components:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;&lt;b&gt;Vertical-specific ranking&lt;/b&gt;. A focused topic should define and utilize ranking features unique to the vertical.  It may be as simple as the topical classification score for a page.  It often requires applying information extraction to identify meaningful document fields.  It should also leverage vertical-specific static rank features.  
For example, use a technique like &lt;a href="http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank.pdf"&gt;topic-sensitive PageRank&lt;/a&gt;, an author/source popularity score, or other features.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Rich results.&lt;/b&gt; The result objects should be presented in a way that uses the structured and semantic information from the topic.  Simple examples of this include presentations that use data from &lt;a href="http://googlewebmastercentral.blogspot.com/2009/05/introducing-rich-snippets.html"&gt;Google Rich Snippets&lt;/a&gt; and &lt;a href="http://developer.yahoo.com/searchmonkey/"&gt;SearchMonkey&lt;/a&gt;. This may include topic-specific metadata like authors, political perspectives, addresses, or aggregated user rating scores.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Faceted UI&lt;/b&gt;.  A vertical should exploit structured metadata for &lt;a href="http://en.wikipedia.org/wiki/Exploratory_search"&gt;exploratory search&lt;/a&gt;.  It should allow you to flexibly combine keyword search and structured attribute restriction to limit the search space by price, airline, manufacturer, genre, date, etc. See the &lt;a href="http://www.blogger.com/flamenco.berkeley.edu/talks/chi_course06.pdf"&gt;CHI 2006 tutorial&lt;/a&gt; and the relevant section from Marti Hearst's &lt;a href="http://searchuserinterfaces.com/book/"&gt;Search UI book&lt;/a&gt; on &lt;a href="http://searchuserinterfaces.com/book/sui_ch8_navigation_and_search.html#figure_8.12"&gt;eBay Express&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Domain knowledge&lt;/b&gt;.  A restricted topical domain should model important relationships between objects and concepts to improve retrieval. For example, it should use a &lt;a href="http://www.freebase.com/"&gt;Freebase&lt;/a&gt;-like knowledge base of objects and their attributes.  
In a recipe search engine, it would model ingredients and relationships such as contains:gluten or is kosher.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;b&gt;Task Modeling.&lt;/b&gt;  A key benefit of a narrow domain is that it should allow users to accomplish &lt;a href="http://ils.unc.edu/ISSS/papers/papers/aula.pdf"&gt;complex tasks&lt;/a&gt; more easily.  It should provide tools and interfaces to more directly allow users to get things done.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;Of course, a vertical search engine also needs to keep up with web search engines in ranking, comprehensiveness, and freshness, which are all key components of &lt;a href="http://jopedersen.com/Presentations/Berkeley_Search_Quality_9-10-07.pdf"&gt;search quality&lt;/a&gt;.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For more of my thoughts on these issues, you can see the slides from my ECIR 2008 Industry Day talk &lt;a href="http://ecir2008.dcs.gla.ac.uk/id_slides/ECIR_08_Dalton.pdf"&gt;The Challenge of Engineering Vertical Search&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Overall, creating a compelling vertical experience currently requires a lot of hard work and painstaking curation.  It requires a deep understanding of the tasks that users perform.  It requires modeling the topic and domain objects in meaningful ways. Combining these elements is difficult to do well.  
It is extremely hard to do at the scale of the entire web across all topics.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-93356729352090188?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/93356729352090188/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/components-of-compelling-vertical.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/93356729352090188'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/93356729352090188'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/components-of-compelling-vertical.html' title='Components of Compelling Vertical Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-724133769883585143</id><published>2010-11-01T14:19:00.017-04:00</published><updated>2010-11-03T22:45:26.394-04:00</updated><title type='text'>Blekko Launches: Brings Transparency to Relevance Ranking</title><content type='html'>&lt;a href="http://blekko.com/"&gt;blekko&lt;/a&gt; launched its public beta on Monday. blekko is a new web search engine that focuses on creating an open and transparent process around search engine relevance ranking. 
blekko is attempting to differentiate itself as the open alternative to "closed" search engines and involve greater public participation in the ranking process.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Creating blekko is an impressive feat because they built their own system from the ground up to crawl, index, and rank a multi-billion page search index. This is hard to do well. They have accomplished a lot in a short period of time, so I am excited about the changes we'll see as they evolve. I hope that they will take the risks that other search engines can't afford. One of their risky moves is opening up their ranking features.&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Open vs. Closed Ranking&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Google's "closed policy" is a difficult issue that has garnered significant criticism. For example, at &lt;a href="http://thenoisychannel.com/2008/04/08/qa-with-amit-singhal-2/"&gt;ECIR 2008 in a QA with Amit Singhal&lt;/a&gt;, Daniel Tunkelang questioned the need to rely on security through obscurity. (For an updated perspective now that Daniel works at Google, read his recent post on &lt;a href="http://thenoisychannel.com/2010/03/07/google-and-transparency/"&gt;Google and Transparency&lt;/a&gt;.) In response to an EU inquiry, Amit Singhal laid out the &lt;a href="http://googlepolicyeurope.blogspot.com/2010/02/this-stuff-is-tough.html"&gt;underlying philosophy of Google's ranking&lt;/a&gt;:&lt;div&gt;&lt;ol&gt;&lt;li&gt;Algorithmically-generated results.&lt;/li&gt;&lt;li&gt;No query left behind.&lt;/li&gt;&lt;li&gt;Keep it simple.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;Although Google uses signals from humans in the form of links and click through on search results, it does not actively involve humans in the search process. 
Blekko is going to be different.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;As a first step involving users in ranking, blekko allows users to define their own search experience using "slashtags". Founder &lt;a href="http://www.skrenta.com/"&gt;Rich Skrenta&lt;/a&gt; describes this in a recent blog post on &lt;a href="http://www.skrenta.com/2010/10/blekko_launch.html"&gt;crowdsourcing relevance&lt;/a&gt;,&lt;div&gt;&lt;blockquote&gt;We're starting by letting users define their own vertical search experiences, using a feature we call slashtags. &lt;b&gt;Slashtags&lt;/b&gt; let all of the vertical engines that people define on blekko live within the same search box. They also let you do a search and quickly pivot from one vertical to another.&lt;/blockquote&gt;You can contrast Google's philosophy with Blekko's; here are the first 3 of 11 points in Rich Skrenta's &lt;a href="http://www.skrenta.com/2010/07/if_blekko_sees_its_shadow_6_mo.html"&gt;post&lt;/a&gt;,&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Search shall be open&lt;/li&gt;&lt;li&gt;Search results shall involve people&lt;/li&gt;&lt;li&gt;Ranking data shall not be kept secret&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;&lt;div&gt;A philosophy is great, but it doesn't matter if your results suck. Blekko just launched, so let's take a closer look.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Blekko's ranking&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I tried blekko and it is a very solid first effort.  To experiment, I re-ran a variety of searches from my Google web history. I didn't conduct thorough experiments, but my impression is that the ranking and coverage are very reasonable, but not as good as Google's or Bing's. 
&lt;a href="http://searchengineland.com/blekko-the-slashtag-search-engine-goes-live-54447"&gt;SELand&lt;/a&gt; has a more comprehensive review with a side-by-side comparison with Google.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-weight: normal; "&gt;One frustration I encountered using blekko is that slashtags autofired and automatically restricted my query to a vertical that was too narrow. This limited scope led to missing key relevant results, and I manually backed off to /web several times.  Slashtags create added complexity which leads to problems.&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-weight: normal; "&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I'd like to point out a few queries where I found that blekko's relevance particularly stumbled and could improve: [&lt;a href="http://blekko.com/ws/carbonell+mmr"&gt;carbonell mmr&lt;/a&gt;] and [&lt;a href="http://blekko.com/ws/iron+cook+northampton"&gt;iron cook northampton&lt;/a&gt;]. Neither of these is an easy query. The first is somewhat ambiguous and the second is about a small local event. What I find hopeful with blekko is that I can begin to understand the underlying reason for failure. I clicked on "rank stats"; you can also use the /rank tag, e.g. [&lt;a href="http://blekko.com/ws/carbonell+mmr+/rank"&gt;carbonell mmr /rank&lt;/a&gt;]. For each result blekko also provides an "seo" link to see static rank features, e.g. for &lt;a href="http://blekko.com/ws/http://www.ml.cmu.edu/mlpeople/affiliatedfaculty.html+/seo"&gt;http://www.ml.cmu.edu/mlpeople/affiliatedfaculty.html&lt;/a&gt;. As an IR researcher I find open access to this feature data very exciting. However, for the average searcher this level of detail is distracting and unnecessary. 
The "openness" features need to earn their real estate by being actionable, but right now they don't do that.&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead of cluttering the search UI, I would like to see blekko be more open by providing the data through an API.  It would let academics and searchers use this raw material to rerank results in new and novel ways.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;On "Crowdsourcing relevance"&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Slashtags are operators that can both restrict a search and change its ranking. They currently allow you to sort the results by /relevance or /date. Users can define slashtags that tag hosts as relevant to a topic. I started a slashtag on &lt;a href="http://blekko.com/ws/+/jeffd/information-retrieval"&gt;information-retrieval&lt;/a&gt;.  However, restricting a query to a set of hosts using slashtags is a bit like performing surgery with a chainsaw.  In the end you are missing key bits.  This approach has several problems:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;The granularity of hosts is too coarse. The amount of content relevant to a topic could be a single page or section of a website.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Recall. A slashtag cannot be maintained by people in real-time and will miss relevant content.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The semantics of a slashtag are not well defined and it is not obvious how to combine them, e.g. how to combine a topic slashtag with a /date ranking.&lt;/li&gt;&lt;/ol&gt;&lt;/div&gt;&lt;div&gt;The claim is that slashtags reduce spam by limiting search to a restricted set of trusted websites. However, I don't think that the impact of spam on search results is very compelling. 
Search engines are quite good at identifying and incorporating implicit user feedback to reduce the impact of irrelevant (spammy) results.  There needs to be a more compelling reason.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Using slashtags doesn't address several key issues in crowdsourcing ranking. First, they don't address the obvious need to involve people in making relevance assessments on the results in a systematic way. Second, the core of search ranking is determining what features indicate relevance for a query and how they should be combined.  blekko is not currently surfacing a way to change either of those aspects.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It remains to be seen how you could really let users change the relevance in meaningful ways and, more importantly, measure the utility to everyone. It may be that academia could play some role in creating and testing features.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Presentation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;blekko's "10 blue links" search UI feels outdated. Modern search engines are incorporating rich results into SERPs. The "Universal Search" results blend images, videos, maps, even places and people. I hope that this is an area where we see blekko evolve quickly to catch up.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Overall&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I won't be switching to blekko for regular use.  Still, I find the level of ranking information and features that they share very exciting and compelling.  Because I can see the ranking pieces, I feel compelled to jump in and help make things better.  However, I question the utility of the information for average users and the ability to deeply engage the public in useful ways to improve ranking.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Slashtags are fun to play with, but are they useful?  
Slicing the web into groups creates mini vertical search experiences. However, using the tags adds complexity that may not be necessary most of the time.  The value offered by the slashtag verticals is quite limited right now.  I hope that slashtags will evolve to allow users to do more curation and add more value as blekko matures.  &lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-724133769883585143?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/724133769883585143/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/11/blekko-launches-brings-transparency-to.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/724133769883585143'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/724133769883585143'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/11/blekko-launches-brings-transparency-to.html' title='Blekko Launches: Brings Transparency to Relevance Ranking'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7183774245172867365</id><published>2010-10-27T15:38:00.005-04:00</published><updated>2010-10-28T12:36:37.791-04:00</updated><title type='text'>CIKM 2010 Jamie Callan Keynote: Search Engine Support for Software Applications</title><content 
type='html'>&lt;p class="MsoNormal"&gt;I am not at CIKM, but &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael Bendersky&lt;/a&gt; sent me his notes from &lt;a href="http://www.cs.cmu.edu/~callan/"&gt;Jamie Callan&lt;/a&gt;'s &lt;a href="http://www.yorku.ca/cikm10/keynote_speakers.php"&gt;keynote address&lt;/a&gt;.  Gene also gave his &lt;a href="http://palblog.fxpal.com/?p=4866"&gt;writeup on the FXPal Blog&lt;/a&gt;.&lt;/p&gt;&lt;span style="font-weight:bold;"&gt;Jamie Callan: Search Engine Support for Software Applications&lt;/span&gt;&lt;br /&gt;&lt;ul&gt;&lt;br /&gt;&lt;li&gt; Motivation: SE (search engine) as a "language DB"&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Computer Assisted Language Learning&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Q&amp;amp;A&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Read-the-Web&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; IR typically assumes a "user" is a person&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Software applications are a new, challenging class of SE users&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; There are very low expectations of an SE from an application "user" perspective&lt;br /&gt;&lt;ul&gt;&lt;li&gt;E.g., SE's are mostly used for keyword search&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Recall-Precision tradeoff keeps SE's from using a highly structured query language (like Indri)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;BOW query - high recall/low precision&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Structured query - low recall/high precision&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt; Motivation II: using rich language/information resources&lt;br /&gt;&lt;ul&gt;&lt;li&gt;WordNet, Freebase, DBpedia, ...&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;SE's are not very good at using them&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Structured queries and documents are well-studied IR topics, but&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Do we really understand 
them?&lt;/li&gt;&lt;li&gt;Maybe the basic structures, but not the more advanced ones&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Document = structured object&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Metadata:&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Fielded text: title, chapters, sections, references&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Relations to other documents &lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Example application: REAP Project: Computer Assisted Language Learning&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Find interesting documents/passages for students based on their language level&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Use a structured Indri query language to find relevant documents or document parts&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; A typical approach to fields&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Exact Boolean match on the attributes&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Can be brittle.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Another type of document structure&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Text annotations in documents (POS, semantic labeling, co-referencing)&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;Annotations can be considered to be "small fields"&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;li&gt; Problems with retrieval with text annotations&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Annotations are not always 100% accurate / ambiguous&lt;ul&gt;&lt;li&gt; Missing annotations&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Wrong annotation boundaries&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Conflated annotations: white/JJ house/NN should be white/NP house/NP&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Term weighting in short fields is hard - need to take field length normalization into account.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Problem of multiple matches: combining evidence from different fields of the same type is not a solved problem.&lt;/li&gt;&lt;/ul&gt;&lt;br 
/&gt;&lt;/li&gt;&lt;li&gt; Relations among documents/entities&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Hyperlinks &amp;amp; RDF&lt;/li&gt;&lt;/ul&gt;&lt;ul&gt;&lt;li&gt;XML&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Relational Retrieval (&lt;a href="http://www.cs.cmu.edu/~wcohen/postscript/ecml-2010-ni.pdf"&gt;Lao &amp;amp; Cohen 2010&lt;/a&gt;)&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Example for use: journal recommendations, expert finding&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Some parts of metadata are "domain knowledge" --- they really reside outside the documents.&lt;/li&gt;&lt;br /&gt;&lt;li&gt;How to model domain knowledge as an integral part of the documents&lt;/li&gt;&lt;ul&gt;&lt;li&gt; Have different types of documents: paper, journal, authors...&lt;/li&gt;&lt;li&gt;Have typed relations between the documents: transcribes, appears in, ...&lt;/li&gt;&lt;li&gt; Have an Indri-like query language to match documents and relations&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt; Inferred knowledge: Read-the-Web project&lt;br /&gt;&lt;ul&gt;&lt;li&gt;How to integrate the accumulated knowledge in SE's&lt;br /&gt;&lt;/li&gt;&lt;li&gt;Entity search is one example&lt;/li&gt;&lt;li&gt;General purpose solutions are still in progress.&lt;/li&gt;&lt;/ul&gt;&lt;/li&gt;&lt;/ul&gt;More CIKM coverage soon.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7183774245172867365?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7183774245172867365/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/10/cikm-2010-jamie-callan-keynote-search.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7183774245172867365'/><link 
rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7183774245172867365'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/10/cikm-2010-jamie-callan-keynote-search.html' title='CIKM 2010 Jamie Callan Keynote: Search Engine Support for Software Applications'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7178732546040216895</id><published>2010-10-25T20:22:00.005-04:00</published><updated>2010-10-26T09:44:29.226-04:00</updated><title type='text'>Ray Ozzie on the future of computing</title><content type='html'>Ray Ozzie is leaving his post as Microsoft's Chief Software Architect.  In a farewell memo, &lt;a href="http://ozzie.net/docs/dawn-of-a-new-day/"&gt;dawn of a new day&lt;/a&gt;, he points to the future,&lt;br /&gt;&lt;blockquote&gt;Instead, to cope with the inherent complexity of a world of devices, a world of websites, and a world of apps &amp;amp; personal data that is spread across myriad devices &amp;amp; websites, a simple conceptual model is taking shape that brings it all together.  We’re moving toward a world of 1) cloud-based continuous services that connect us all and do our bidding, and 2) appliance-like connected devices enabling us to interact with those cloud-based services....&lt;br /&gt;&lt;br /&gt;It’s the dawn of a new day – the sun having now arisen on a world of&lt;i&gt; continuous services&lt;/i&gt; and &lt;i&gt;connected devices&lt;/i&gt;.&lt;br /&gt;&lt;br /&gt;&lt;/blockquote&gt;What does this shift imply for search?  We are already seeing growth in mobile search.  People are searching more because they have the capability.  
And these searches tend to be more local in nature because people are more often looking for actionable information now.  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;One possibility is what Eric Schmidt described as &lt;a href="http://www.searchenginecaffe.com/2010/09/autonomous-search-did-you-know.html"&gt;&lt;i&gt;autonomous search&lt;/i&gt;&lt;/a&gt;.  In this model the retrieval system is proactive, responding to queries, but also actively notifying the user in response to changes in the environment.  One might describe such a system as an "intelligent information agent".&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7178732546040216895?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7178732546040216895/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/10/ray-ozzie-on-future-of-computing.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7178732546040216895'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7178732546040216895'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/10/ray-ozzie-on-future-of-computing.html' title='Ray Ozzie on the future of computing'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2061530283669432798</id><published>2010-10-07T09:33:00.002-04:00</published><updated>2010-10-07T10:11:05.242-04:00</updated><title type='text'>Twitter Launches Lucene Real-Time Search Architecture</title><content type='html'>The &lt;a href="http://twitter.com/twittersearch"&gt;Twitter Search&lt;/a&gt; engineering team announced that they launched a &lt;a href="http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html"&gt;New Twitter search Architecture&lt;/a&gt;.  The previous system was based on the original Summize search system that &lt;a href="http://blog.twitter.com/2008/07/finding-perfect-match.html"&gt;Twitter acquired in 2008&lt;/a&gt;. The old technology was a MySQL-based system that became difficult to scale.  I'm really amazed that they were able to make a MySQL search work for this long.&lt;br /&gt;&lt;br /&gt;According to the blog, the new search system was designed to handle &lt;span style="font-style:italic;"&gt;over 1,000 TPS (Tweets/sec) and 12,000 QPS (queries/sec) = over 1 billion queries per day&lt;/span&gt;.  Besides the challenging query volume, the data needs to be available quickly: a tweet needs to be searchable in less than 10 seconds.&lt;br /&gt;&lt;br /&gt;They turned to &lt;a href="http://lucene.apache.org"&gt;Lucene&lt;/a&gt;, a popular &lt;a href="http://www.searchenginecaffe.com/2007/03/open-source-search-engines-in-java-and.html"&gt;open source search library&lt;/a&gt;.  To meet their latency and query serving requirements they needed to make extensive modifications to Lucene's core,&lt;br /&gt;&lt;blockquote&gt;That’s why we rewrote big parts of the core in-memory data structures, especially the posting lists, while still supporting Lucene’s standard APIs. This allows us to use Lucene’s search layer almost unmodified. 
Some of the highlights of our changes include:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;significantly improved garbage collection performance&lt;/li&gt;&lt;li&gt;lock-free data structures and algorithms&lt;/li&gt;&lt;li&gt;posting lists, that are traversable in reverse order&lt;/li&gt;&lt;li&gt;efficient early query termination&lt;/li&gt;&lt;/ul&gt;&lt;/blockquote&gt;Their post says that these contributions will be rolled into Lucene and the &lt;a href="http://svn.apache.org/viewvc/lucene/dev/branches/realtime_search/"&gt;Lucene realtime branch&lt;/a&gt;.  It is a tantalizing overview and I would really like to see pointers to the details (e.g. JIRA issues).  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The main benefit to users is that the new system is much more scalable and can support an index that is twice as large as previous versions which means that you can search for tweets further back in time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An interesting parallel is that LinkedIn recently released &lt;a href="http://blog.linkedin.com/2010/09/29/linkedin-signal/"&gt;LinkedIn Signal&lt;/a&gt;, which is a mashup of Twitter data with LinkedIn social network data for professionals.  For details on how that system works, see their &lt;a href="http://sna-projects.com/blog/2010/10/linkedin-signal-a-look-under-the-hood/"&gt;Signal Under the Hood&lt;/a&gt; post.  
One of the key components is the &lt;a href="http://sna-projects.com/zoie/"&gt;Zoie real-time search system&lt;/a&gt; built on top of Lucene.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2061530283669432798?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2061530283669432798/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/10/twitter-launches-lucene-real-time.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2061530283669432798'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2061530283669432798'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/10/twitter-launches-lucene-real-time.html' title='Twitter Launches Lucene Real-Time Search Architecture'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4390474038785216284</id><published>2010-09-28T14:05:00.008-04:00</published><updated>2010-09-28T18:10:13.549-04:00</updated><title type='text'>Twitter Talk at UMass: Discovery and Emergence</title><content type='html'>Today &lt;a href="http://twitter.com/abdur"&gt;Abdur&lt;/a&gt;, the Chief Scientist at Twitter gave a talk, &lt;i&gt;&lt;a href="http://www.cs.umass.edu/speakers/abdur-chowdhury-chief-scientist/2010/sep/28"&gt;Discovery and Emergence&lt;/a&gt;&lt;/i&gt; here at 
UMass.  The talk was interactive with lots of questions.  It was similar to the one he presented at &lt;a href="http://www.searchenginecaffe.com/2009/07/sigir-social-media-workshop-abdur.html"&gt;SIGIR Social Media Workshop&lt;/a&gt; last year.  If you read it, be sure to read Jon's comment.  Here are a few of my notes from the talk, which focused heavily on the Trending Topics feature.  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A few key points to remember.  First we have to keep in the forefront:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;It's not about the technology, it's how it enriches our lives and makes it better.&lt;/blockquote&gt;&lt;/div&gt;&lt;br /&gt;&lt;b&gt;The Data&lt;/b&gt;&lt;br /&gt;160 Million accounts.  90 million tweets per day. 16.7 gb of tweets. &gt; 1000 tps&lt;br /&gt;200,000 time line rps, 3GBs outbound data, 1 B queries per day&lt;br /&gt;&lt;br /&gt;Tweets are searchable within seconds and the data is kept forever.&lt;br /&gt;&lt;br /&gt;About 30% of search traffic is generated by clicks from trending topics.&lt;br /&gt;&lt;br /&gt;In 1ms answer the following about a tweet:&lt;br /&gt;- what language is this tweet?&lt;br /&gt;- where was this tweet posted from?&lt;br /&gt;- what are the entities in this tweet?&lt;br /&gt;&lt;br /&gt;Every X min answer the following:&lt;br /&gt;- Which tweets should you ignore?&lt;br /&gt;- What topics are trending and where?&lt;br /&gt;&lt;br /&gt;A key problem is how to evaluate the quality of trending topics.  What makes one topic 'better' than another?  &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One of the coolest things I saw from the talk was the visualization of the World Cup tweets, which was on their blog, &lt;a href="http://blog.twitter.com/2010/07/2010-world-cup-global-conversation.html"&gt;World Cup 2010: A Global Conversation&lt;/a&gt;.  
It was created by &lt;a href="http://twitter.com/miguelrios"&gt;Miguel Rios&lt;/a&gt;, whose work you can check out on &lt;a href="http://www.miguelrios.org/"&gt;his website&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Abdur ended with an admonition to researchers to think about the impact of their work,&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;Why does your research matter? Will it make the world a better place?&lt;br /&gt;&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4390474038785216284?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4390474038785216284/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/twitter-at-umass-discovery-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4390474038785216284'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4390474038785216284'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/twitter-at-umass-discovery-and.html' title='Twitter Talk at UMass: Discovery and Emergence'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5013222884762018572</id><published>2010-09-21T12:13:00.003-04:00</published><updated>2010-09-21T12:33:33.829-04:00</updated><title 
type='text'>ECML PKDD 2010 Data Challenge: Measuring Web Data Quality</title><content type='html'>&lt;div&gt;Yesterday the &lt;a href="http://www.ecmlpkdd2010.org/articles-mostra-2041-eng-discovery_challenge_2010.htm"&gt;ECML PKDD Data Discovery challenge&lt;/a&gt; results were presented.  See the website for the papers of the winning participants.  The winning team used a bagged C4.5 decision tree for learning given the features.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The website gives a high-level overview of the challenge:&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;In this year's Discovery Challenge we target at more and different aspects. We want to develop site-level classification for the genre of the web sites (editorial, news, commercial, educational, "deep web", or Web spam and more) as well as their readability, authoritativeness, trustworthiness and neutrality.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;The &lt;a href="http://datamining.sztaki.hu/?q=en/DiscoveryChallenge/"&gt;challenge dataset&lt;/a&gt; consists of 23M pages from 99K hosts in the .eu domain.  Read the &lt;a href="http://datamining.sztaki.hu/?q=en/DiscoveryChallenge/AssessmentGuidelines"&gt;assessment guidelines&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;The competition involves three tasks; see the &lt;a href="http://datamining.sztaki.hu/?q=en/DiscoveryChallenge/rules"&gt;full description of tasks&lt;/a&gt;.  Here is a summary:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1. 
Classification task (English)&lt;br /&gt;&lt;/b&gt;&lt;ul&gt;&lt;li&gt;Web Spam&lt;/li&gt;&lt;li&gt;News/Editorial&lt;/li&gt;&lt;li&gt;Commercial&lt;/li&gt;&lt;li&gt;Educational/Research&lt;/li&gt;&lt;li&gt;Discussion&lt;/li&gt;&lt;li&gt;Personal/Leisure&lt;/li&gt;&lt;li&gt;Neutrality: from 3 (normal) to 1 (problematic)&lt;/li&gt;&lt;li&gt;Bias: 1 flags significant problems&lt;/li&gt;&lt;li&gt;Trustiness: from 3 (normal) to 1 (problematic)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;b&gt;2. Quality task (English)&lt;/b&gt;&lt;br /&gt;Quality is measured as an aggregate function of genre, trust, factuality, and bias; spam has the lowest (0) quality.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3. Multilingual quality task (German and French)&lt;br /&gt;&lt;/b&gt;Same as task 2, but for non-English sites.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The interesting aspect of the challenge is that it moves away from spam/not spam labels to assessing more complex aspects of the quality of information.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5013222884762018572?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5013222884762018572/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/ecml-pkdd-2010-data-challenge-measuring.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5013222884762018572'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5013222884762018572'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/ecml-pkdd-2010-data-challenge-measuring.html' title='ECML PKDD 2010 Data Challenge: Measuring Web Data 
Quality'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4927906887067347640</id><published>2010-09-17T17:26:00.002-04:00</published><updated>2010-09-17T17:40:01.798-04:00</updated><title type='text'>A lesson in defining topic-based communities</title><content type='html'>There is a post on the stack overflow blog on how they are managing communities, &lt;a href="http://blog.stackoverflow.com/2010/09/merging-season/"&gt;Merging Season&lt;/a&gt;.  At the heart of discussion: What is the right size of domain for a topic-based community?  They are against one giant community, as they say:&lt;br /&gt;&lt;blockquote&gt;Yahoo! Answers. Monumentally popular, enormous traffic, and containing absolutely no useful information, Yahoo! Answers is actually more of a teenage chat room than a place to get real answers.&lt;/blockquote&gt;They also highlight failed attempts to bring the Ubuntu and Unix community sites together to make a single community.  The process of defining a "topical community" reminds me of the problems we have in IR when we define a "topic based vertical" to apply domain knowledge in retrieval.  From their blog:&lt;blockquote&gt;Communities consist of concentric circles. You share more with people in the inner circle than you do with people in the outer circles, but if you were in a strange place, you’d seek out people even from the larger circles. If you’re building a community (or a Stack Exchange site), it’s not immediately obvious which level is going to work...&lt;/blockquote&gt;They are developing rules that use the size and degree of overlap between communities  to guide the process.  
It will be interesting to see how this plays out and what lessons we can apply to IR.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4927906887067347640?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4927906887067347640/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/lesson-in-defining-topic-based.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4927906887067347640'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4927906887067347640'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/lesson-in-defining-topic-based.html' title='A lesson in defining topic-based communities'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7929277683477023193</id><published>2010-09-13T13:32:00.005-04:00</published><updated>2010-09-13T14:41:11.899-04:00</updated><title type='text'>Google Scribe: Autocomplete beyond queries</title><content type='html'>Overshadowed by Google Instant last week, a labs project called &lt;a href="http://scribe.googlelabs.com/"&gt;Google Scribe&lt;/a&gt; was launched.  
See some information on the &lt;a href="http://scribe.googlelabs.com/static/help.html"&gt;help page&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is an example of what it did with the initial words when I accepted all further suggestions:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Jeff Dalton is &lt;i&gt;a researcher at the University of California at Berkeley...&lt;/i&gt;&lt;/div&gt;&lt;div&gt;&lt;i&gt;&lt;br /&gt;&lt;/i&gt;&lt;/div&gt;&lt;div&gt;An amusing example.  I'm actually quite surprised it autocompleted "researcher" correctly.  However, Scribe got the university wrong.  It looks like UC Berkeley wins the web popularity contest.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Overall, Scribe appears to be a straightforward application of web n-gram language models wrapped in an AJAX interface.  Some of its mistakes demonstrate the drawbacks of not utilizing long-range word dependencies and topical context.  Still, it is an interesting step towards broader autocompletion.  
I think there may be interesting opportunities to improve effectiveness by leveraging custom language models built from my other documents and web history.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7929277683477023193?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7929277683477023193/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/google-scribe-autocomplete-beyond.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7929277683477023193'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7929277683477023193'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/google-scribe-autocomplete-beyond.html' title='Google Scribe: Autocomplete beyond queries'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-507944566348295390</id><published>2010-09-08T13:12:00.009-04:00</published><updated>2010-09-08T22:04:12.137-04:00</updated><title type='text'>AJAX Results as you Type: Google Instant</title><content type='html'>&lt;div&gt;Google announced &lt;a href="http://googleblog.blogspot.com/2010/09/search-now-faster-than-speed-of-type.html"&gt;Google Instant&lt;/a&gt; on their official blog.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a 
href="http://www.google.com/instant/"&gt;Google Instant&lt;/a&gt; - "Search at the speed of thought".  Google Instant shows results as you type, with a new AJAX view of Google's search results.  It takes the autocomplete feature a step further and sends the most probable search to the server to fetch the results.  A few optimizations make it possible:&lt;div&gt;&lt;ul&gt;&lt;li&gt;Prioritizing searches - the biggest optimization is to run only the most probable searches.&lt;/li&gt;&lt;li&gt;User state - short-circuit in-process searches that have become obsolete, to avoid running every search to completion&lt;/li&gt;&lt;li&gt;Result caches - improved result caching&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Overall, Google claims that instant search saves 2-5 seconds per query.  It took some really committed engineers at Google, including &lt;a href="http://twitter.com/bengomes"&gt;Ben Gomes&lt;/a&gt;, to make it possible.  Kudos!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It is worth noting that many of these ideas have been in the community for a while.  For example, it reminds me of the &lt;a href="http://search.mpi-inf.mpg.de/"&gt;CompleteSearch&lt;/a&gt; system, which has been around since SIGIR 2006.  
CompleteSearch has some novel prefix based search capability which is still beyond what Google rolled out today.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-507944566348295390?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/507944566348295390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/ajax-results-as-you-type-google-instant.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/507944566348295390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/507944566348295390'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/ajax-results-as-you-type-google-instant.html' title='AJAX Results as you Type: Google Instant'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3880933147122051832</id><published>2010-09-07T14:28:00.013-04:00</published><updated>2010-09-07T16:50:47.863-04:00</updated><title type='text'>Autonomous Search: Did you know?</title><content type='html'>Eric Schmidt, CEO of Google gave a &lt;a href="http://www.promeas.com/ifa-tv/webcasts2010/keynote5/index.html"&gt;keynote address at IFA&lt;/a&gt;, a consumer electronics show in Germany.  
The keynote was covered in an &lt;a href="http://paidcontent.co.uk/article/419-googles-schmidt-autonomous-fast-search-is-our-new-definition/"&gt;article by paidContent&lt;/a&gt;. He emphasizes "mobile first" as very important.  According to him, the new and most interesting applications are happening on smart phones.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This leads to what Eric describes as "autonomous search",&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;Ultimately, search is about finding what you want right now and The next step of search is doing this automatically.  And so when I'm walking down Berlin and I like history my smart phone is doing searches constantly -  &lt;b&gt;did you know? did you know? did you know?&lt;/b&gt; This occurred here, this occurred there.&lt;br /&gt;&lt;br /&gt;Because it knows who I am, it knows what I care about, and it knows where roughly where I am.  This notion of &lt;i&gt;autonomous search&lt;/i&gt;, the ability to tell me things I didn't know, but I'm probably very interested in, is the next great stage in search.&lt;/blockquote&gt;&lt;/div&gt;See also an earlier interview with Amit Singhal on the &lt;a href="http://www.searchenginecaffe.com/2010/07/amit-singhal-on-evolution-of-search.html"&gt;Evolution of Search&lt;/a&gt;.  Many of these future search applications share common ground with the field of Agent Planning in AI.  One company taking an initial step in this direction is &lt;a href="http://siri.com/"&gt;Siri&lt;/a&gt; (read my previous post on &lt;a href="http://www.searchenginecaffe.com/2010/04/battelle-interviews-on-search-siri.html"&gt;Siri and DARPA's CALO project&lt;/a&gt;), a task-oriented virtual assistant for your iPhone.  You could reframe much of what Eric described as &lt;i&gt;"Intelligent Search"&lt;/i&gt;.  
&lt;a href="http://www.tomgruber.org/"&gt;Tom Gruber&lt;/a&gt;, the CTO of Siri, describes some principles from &lt;a href="http://tomgruber.org/news/sdforum-dec13.htm"&gt;&lt;i&gt;Intelligence at the Interface&lt;/i&gt;&lt;/a&gt;:&lt;div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;It knows a lot about you.&lt;/li&gt;&lt;li&gt;It understands you in context.&lt;/li&gt;&lt;li&gt;&lt;b&gt;It is proactive.&lt;/b&gt;&lt;/li&gt;&lt;li&gt;It gets better with experience.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;I think one of the key things that is new in "autonomous" or "intelligent" search is that the system proactively surfaces interesting information to the user and assists the user in performing actions.  A key challenge is how to perform rigorous evaluation in such an immature and developing area.  The task is a significant departure from some of the more traditional ad hoc search tasks and requires a much richer user model. &lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3880933147122051832?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3880933147122051832/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/09/autonomous-search-did-you-know.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3880933147122051832'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3880933147122051832'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/09/autonomous-search-did-you-know.html' title='Autonomous Search: Did you 
know?'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3166342646337681513</id><published>2010-08-20T12:20:00.003-04:00</published><updated>2010-08-20T12:33:27.563-04:00</updated><title type='text'>GraphLab: Beyond MapReduce for Parallel Machine Learning</title><content type='html'>A team at the CMU &lt;a href="http://www.select.cs.cmu.edu/"&gt;Select Lab&lt;/a&gt; recently released a new software package, called &lt;a href="http://www.graphlab.ml.cmu.edu/"&gt;GraphLab&lt;/a&gt; that provides an alternative to the MapReduce paradigm for developing Machine Learning algorithms.  The work is described in the paper,&lt;br /&gt;&lt;br /&gt;Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein (2010). "&lt;a href="http://www.select.cs.cmu.edu/publications/paperdir/uai2010-low-gonzalez-kyrola-bickson-guestrin-hellerstein.pdf"&gt;GraphLab: A New Parallel Framework for Machine Learning&lt;/a&gt;." Conference on Uncertainty in Artificial Intelligence (UAI). (&lt;a href="http://www.select.cs.cmu.edu/publications/paperdir/uai2010-low-gonzalez-kyrola-bickson-guestrin-hellerstein.pptx"&gt;PPT slides&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;From the description on the website,&lt;br /&gt;&lt;blockquote&gt;GraphLab provides a similar analog to the Map in the form of an &lt;span style="font-weight: bold;"&gt;Update&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;Function&lt;/span&gt;. The Update Function however, is able to read and modify overlapping sets of data... 
In addition the update functions can be recursively triggered with one update function spawning the application of update functions to other vertices in the graph enabling dynamic iterative computation...&lt;br /&gt;&lt;br /&gt;The GraphLab analog to Reduce is the &lt;span style="font-weight: bold;"&gt;Sync&lt;/span&gt; &lt;span style="font-weight: bold;"&gt;Operation&lt;/span&gt;. The Sync Operation also provides the ability to perform reductions in the background while other computation is running. Like the update function sync operations can look at multiple records simultaneously providing the ability to operate on larger dependent contexts.&lt;/blockquote&gt;Other than the paper, you can read the &lt;a href="http://www.graphlab.ml.cmu.edu/details.html"&gt;details page&lt;/a&gt; for more information.&lt;br /&gt;&lt;br /&gt;I need to think about this more.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3166342646337681513?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3166342646337681513/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/08/graphlab-beyond-mapreduce-for-parallel.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3166342646337681513'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3166342646337681513'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/08/graphlab-beyond-mapreduce-for-parallel.html' title='GraphLab: Beyond MapReduce for Parallel Machine Learning'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7084323110319103025</id><published>2010-08-11T12:53:00.003-04:00</published><updated>2010-08-11T13:27:35.164-04:00</updated><title type='text'>Yahoo! Labs Interview: Towards Web Object Search</title><content type='html'>The Yahoo! Search blog has an &lt;a href="http://www.ysearchblog.com/2010/08/11/shahshahani/"&gt;interview with Dr. Ben Shahshahani&lt;/a&gt; of Y! labs on search.  The questions in the interview cover real-time search, the use of social data, and object retrieval.&lt;br /&gt;&lt;br /&gt;The interview begins with an introduction that search is moving beyond 10 links to a federated model that blends objects from different verticals, which Google calls "Universal Search".  He then continues about the increasing role that structured data is playing on the web.  He says,&lt;br /&gt;&lt;blockquote&gt;Now, the other thing that has been happening is an integration of structured data and unstructured data, so structured meaning that there are particular attributes to different entities. We have a pretty active technology and science effort in trying to understand the main object, attributes, and relationships – not just the text on a web page... &lt;/blockquote&gt;Later, he continues the thread when it comes to answering specific user intents,&lt;br /&gt;&lt;blockquote&gt;Once a query comes in, the question is: “what is the intent” or “what are the common intents of the users submitting this query?” To answer that question, we use a variety of ways to understand the query – a lot of the queries are about objects... Objects are things in the real-world.  They can be events, a location, a person or a product. 
Our active effort in understanding attributes and their relationships helps us find out the things you can do with those objects. &lt;/blockquote&gt;The last quote reminded me of the great presentation given by &lt;a href="http://www.mathcs.emory.edu/%7Eeugene/"&gt;Eugene Agichtein&lt;/a&gt; at the &lt;a href="http://ciir.cs.umass.edu/sigir2010/qru/"&gt;Query Representation and Understanding workshop&lt;/a&gt; at SIGIR, &lt;span style="font-style: italic;"&gt;Inferring User Intent from Interactions with the Search Results&lt;/span&gt;.  As I recall, Eugene used search logs gathered from toolbar data to analyze different object attributes and tasks associated with different types of objects from the log.  However, I don't recall all the details and the slides are not online.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;Disclosure: I am an intern at Y! this summer working on object retrieval, so I'm a bit biased.&lt;/span&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7084323110319103025?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7084323110319103025/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/08/yahoo-labs-interview-towards-web-object.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7084323110319103025'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7084323110319103025'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/08/yahoo-labs-interview-towards-web-object.html' title='Yahoo! 
Labs Interview: Towards Web Object Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-1817116131931641890</id><published>2010-08-10T03:45:00.003-04:00</published><updated>2010-08-10T04:07:05.905-04:00</updated><title type='text'>Quick Links of the Day: Blekko, Silicon Valley History, and I hate your paper, and P!=NP</title><content type='html'>&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.blekko.com/"&gt;Blekko&lt;/a&gt; - Daniel has a &lt;a href="http://thenoisychannel.com/2010/08/06/taking-blekko-out-for-a-spin/"&gt;Blekko preview&lt;/a&gt; from the beta.  See also Michael Arrington's &lt;a href="http://searchengineland.com/blekko-a-new-search-engine-that-lets-you-spin-the-web-47215"&gt;post on SE Land&lt;/a&gt;.  From Daniel's post:&lt;br /&gt;&lt;blockquote&gt;Rather, they are a way for users to “spin” their search results using a variety of filters. For example, [climate /liberal] and [climate /conservative] return very different results, because they are restricted to different sets of sites... In addition to providing a set of curated slashtags, Blekko allows users to define their own slashtags by specifying the sets of sites to be included. &lt;/blockquote&gt;This is a very primitive means of creating mini vertical search engines.  
My first reaction is that slashtags remind me of &lt;a href="http://www.rollyo.com/"&gt;Rollyo&lt;/a&gt;, where you can "roll your own" search engine by restricting search to a group of websites.&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.the-scientist.com/2010/8/1/36/1/#ixzz0w7Bl1duO"&gt;I Hate Your Paper&lt;/a&gt; - an article in The Scientist that looks at how the reviewing process is broken and at some of the reforms that journals are exploring.  (thanks, &lt;a href="http://www.eecs.qmul.ac.uk/%7Ehany/"&gt;Hany&lt;/a&gt;)&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://measuringmeasures.com/blog/2010/8/9/the-next-silicon-valley.html"&gt;A historical perspective on the evolution of Silicon Valley&lt;/a&gt; by &lt;a href="http://www.linkedin.com/in/russelljurney"&gt;Russell Jurney&lt;br /&gt;&lt;br /&gt;&lt;/a&gt;&lt;/li&gt;&lt;li&gt;And in case you've been living under a rock, there is the &lt;a href="http://www.scribd.com/doc/35539144/pnp12pt"&gt;P != NP proof&lt;/a&gt; attempt from HP Labs; see &lt;a href="http://twitter.com/#search?q=%23pnp"&gt;#pnp&lt;/a&gt;.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-1817116131931641890?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/1817116131931641890/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/08/quick-links-of-day-blekko-silicon.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1817116131931641890'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/1817116131931641890'/><link rel='alternate' type='text/html' 
href='http://www.searchenginecaffe.com/2010/08/quick-links-of-day-blekko-silicon.html' title='Quick Links of the Day: Blekko, Silicon Valley History, and I hate your paper, and P!=NP'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3502549055199059266</id><published>2010-08-02T04:41:00.003-04:00</published><updated>2010-08-02T05:12:41.070-04:00</updated><title type='text'>NY Times Article: Bing Kicks Google in the Pants</title><content type='html'>The NY Times has an article on a new &lt;a href="http://www.nytimes.com/2010/08/02/technology/02google.html"&gt;search "cold war"&lt;/a&gt; between Bing and Google.  I think the article is pretty poorly done.  The author paints Bing as an innovative upstart giving Google a "kick in the pants" and forcing it to play catch-up. It selectively uses facts to misrepresent reality for the sake of a good story.  The author fails to mention areas where Google is innovating and Microsoft is playing catch-up: &lt;a href="http://techcrunch.com/2010/07/29/google-mobile-search-market-share/"&gt;mobile search, &lt;/a&gt;&lt;a href="http://googlemobile.blogspot.com/2010/05/translate-real-world-with-google.html"&gt;visual search&lt;/a&gt;, &lt;a href="http://googleblog.blogspot.com/2009/12/relevance-meets-real-time-web.html"&gt;real-time search&lt;/a&gt;, &lt;a href="http://googleblog.blogspot.com/2009/10/introducing-google-social-search-i.html"&gt;social search&lt;/a&gt;, and others.&lt;br /&gt;&lt;br /&gt;One valid point where I think Microsoft has done well is in its vertical strategy as a means of differentiation.  
As the article says,&lt;br /&gt;&lt;blockquote&gt;Microsoft has tried to attract people like Mr. Callan by excelling at answering frequently asked questions, like those related to travel, health, shopping, entertainment and local businesses. For example, Bing has flight search and prediction tools that reveal price fluctuations for certain routes, and advises customers whether to buy or wait. Bing Health uses data from sources like the Mayo Clinic and Healthwise...&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3502549055199059266?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3502549055199059266/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/08/ny-times-article-bing-kicks-google-in.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3502549055199059266'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3502549055199059266'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/08/ny-times-article-bing-kicks-google-in.html' title='NY Times Article: Bing Kicks Google in the Pants'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-9098839897778182921</id><published>2010-07-29T09:48:00.004-04:00</published><updated>2010-07-30T04:16:31.521-04:00</updated><title 
type='text'>Recorded Future: Trend and event spotting from real-time news data</title><content type='html'>Yesterday Wired featured an article, &lt;a href="http://www.wired.com/dangerroom/2010/07/exclusive-google-cia/"&gt;Google, CIA Invest in ‘Future’ of Web Monitoring&lt;/a&gt;.  The article stretches the truth a bit when it says that Google is doing business with the CIA.  The link is tenuous: both are interested in predictive analytics on news and real-time data.  The subject of the article is a small Cambridge-based company, &lt;a href="https://www.recordedfuture.com/"&gt;Recorded Future&lt;/a&gt;.  From the article's description,&lt;br /&gt;&lt;blockquote&gt;Recorded Future strips from web pages the people, places and activities they mention. The company examines when and where these events happened (“spatial and temporal analysis”) and the tone of the document (“sentiment analysis”)... Recorded Future maintains an index with more than 100 million events, hosted on Amazon.com servers.  &lt;/blockquote&gt;For a more detailed look at what the company is doing, see the white paper published on the &lt;a href="http://www.blogger.com/%20http://blog.recordedfuture.com/"&gt;company blog&lt;/a&gt;, &lt;a href="http://www.blogger.com/%20http://blog.recordedfuture.com/2010/03/13/recorded-future-%E2%80%93-a-white-paper-on-temporal-analytics/"&gt;A whitepaper on temporal analytics&lt;/a&gt;.  You can also read the &lt;a href="http://www.predictivesignals.com/"&gt;Predictive Signals&lt;/a&gt; blog by Bill Ladd, the Chief Analytic Officer at Recorded Future.&lt;br /&gt;&lt;br /&gt;Recorded Future is not alone in this field.  
For example, the &lt;a href="http://livingknowledge.europarchive.org/"&gt;Living Knowledge Project&lt;/a&gt; is also working on &lt;a href="http://livingknowledge.europarchive.org/index.php/about/application_scenario_future_predictor/"&gt;future prediction&lt;/a&gt; of news events from web data.&lt;br /&gt;&lt;br /&gt;The people working in this field should be aware of the wealth of previous research analyzing event data in news.  For example, the DARPA TIDES program on &lt;a href="http://www.itl.nist.gov/iad/mig//tests/tdt/"&gt;Topic Detection and Tracking&lt;/a&gt; (TDT).  See &lt;a href="http://ciir.cs.umass.edu/%7Eallan/"&gt;James Allan&lt;/a&gt;'s book,  &lt;a href="http://books.google.com/books?id=50hnLI_Jz3cC"&gt;Topic Detection and Tracking&lt;/a&gt; for an overview.  You can also look at some of &lt;a href="http://homepages.inf.ed.ac.uk/vlavrenk/"&gt;Victor Lavrenko&lt;/a&gt;'s work, specifically on TDT and &lt;a href="http://homepages.inf.ed.ac.uk/vlavrenk/doc/kdd2k_uncut.pdf"&gt;AEnalyst&lt;/a&gt; for financial market prediction from news.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-9098839897778182921?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/9098839897778182921/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/recorded-future-trend-and-event.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/9098839897778182921'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/9098839897778182921'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/recorded-future-trend-and-event.html' title='Recorded Future: Trend and event 
spotting from real-time news data'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2831150572871702123</id><published>2010-07-29T04:13:00.000-04:00</published><updated>2010-07-29T04:13:19.212-04:00</updated><title type='text'>Quick Links of the Day: KDD Cup, Task Oriented Search, ScalaNLP, SIGIR</title><content type='html'>Any of these stories could be a full blog post.  But for now, I'll just have to give you a few quick pointers:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.eurospider.com/acm-sigir-industry-track-2010.html"&gt;SIGIR 2010 Industry day videos&lt;/a&gt; - complete videos of all the talks, via &lt;a href="http://thenoisychannel.com/2010/07/27/sigir-2010-day-3-industry-track-afternoon-sessions/"&gt;Noisy Channel&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.scalanlp.org/"&gt;ScalaNLP&lt;/a&gt; - A new NLP package in Scala from the Berkeley and Stanford NLP teams.  Scala is a hip new language for NLP that runs on the JVM.  
See also the &lt;a href="http://code.google.com/p/factorie/"&gt;factorie project&lt;/a&gt; from &lt;a href="http://iesl.cs.umass.edu/"&gt;UMass's IESL lab&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://pslcdatashop.org/KDDCup/workshop/"&gt;KDD Cup Challenge Results&lt;/a&gt; -  This year's competition asked participants to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.azarask.in/blog/post/tabcandy/"&gt;TabCandy&lt;/a&gt; - from &lt;a href="http://datamining.typepad.com/data_mining/2010/07/task-oriented-search-bing-task-oriented-browsing-firefox-lets-dance.html"&gt;Matthew Hurst&lt;/a&gt;.  Create groups of tabs for task-oriented search.  Create a "save for later" group of tabs.  Share "groups of tabs" across platforms and with your friends - "group browsing".&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.youtube.com/watch?v=nyu5ZxGUfgs"&gt;How Google Builds APIs&lt;/a&gt; from Google I/O&lt;br /&gt;&lt;br /&gt;&lt;a href="http://mobblog.cs.ucl.ac.uk/2010/07/26/sigir-research-vs-reality/"&gt;Research vs. 
Reality&lt;/a&gt; - Discuss.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2831150572871702123?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2831150572871702123/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/quick-links-of-day-kdd-cup-task.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2831150572871702123'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2831150572871702123'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/quick-links-of-day-kdd-cup-task.html' title='Quick Links of the Day: KDD Cup, Task Oriented Search, ScalaNLP, SIGIR'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-657223559021977585</id><published>2010-07-27T04:00:00.003-04:00</published><updated>2010-07-27T04:26:07.103-04:00</updated><title type='text'>KDD 2010 Coverage, Best Paper Awards</title><content type='html'>&lt;a href="http://www.kdd.org/kdd2010/"&gt;KDD 2010&lt;/a&gt; is being held in Washington D.C. this week.  I'm not attending, but everyone can participate because the keynotes are being &lt;a href="http://www.kdd.org/kdd2010/video.shtml"&gt;streamed live&lt;/a&gt;.  
The keynote at 9am EST is from &lt;a href="http://kdl.cs.umass.edu/people/jensen/"&gt;David Jensen&lt;/a&gt; from UMass Amherst, giving a talk on &lt;a href="http://kdd10.crowdvine.com/talks/14088"&gt;Computational Social Science&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Yesterday was the first day of papers.  Two that garnered lots of discussion on Twitter are:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://research.google.com/pubs/archive/36371.pdf"&gt;Suggesting Friends Using the Implicit Social Graph&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;In this paper, we describe the implicit social graph which is formed by users' interactions with contacts and groups of contacts, and which is distinct from explicit social graphs in which users explicitly add other individuals as their "friends".  &lt;/blockquote&gt;It won honorable mention in the industry paper category.  Look for the "Got the wrong Bob" and "Don't forget Bob" features in Gmail Labs.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36500.pdf"&gt;Overlapping Experiment Infrastructure: More, Better, Faster Experimentation&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;In this paper, we describe Google’s overlapping experiment infrastructure that is a key component to solving these problems. In addition, because an experiment infrastructure alone is insufficient, we also discuss the associated tools and educational processes required to use it effectively. 
&lt;/blockquote&gt;The awards were also announced; see the &lt;a href="http://www.kdd.org/kdd2010/awards.shtml"&gt;KDD awards&lt;/a&gt; for the full list.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Best Research Paper:&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.select.cs.cmu.edu/publications/paperdir/kdd2010-shahaf-guestrin.pdf"&gt;Connecting the Dots Between News Articles&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;In this paper, we investigate methods for automatically connecting the dots - providing a structured, easy way to navigate within a new topic and discover hidden connections. We focus on the news domain: given two news articles, our system automatically finds a coherent chain linking them together. For example, it can recover the chain of events starting with the decline of home prices (January 2007), and ending with the ongoing health-care debate.&lt;/blockquote&gt;&lt;span style="font-weight: bold;"&gt;Best Industry/Government Paper&lt;/span&gt;&lt;br /&gt;&lt;a href="http://www.prem-melville.com/publications/constrained-reinforcement-learning-kdd2010.pdf"&gt;Optimizing Debt Collections Using Constrained Reinforcement Learning&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;In this paper, we propose and develop a novel approach to the problem of optimally managing the tax, and more generally debt, collections processes at financial institutions...We report on our experience in an actual deployment of a tax collections optimization system based on the proposed approach, at New York State Department of Taxation and Finance.&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-657223559021977585?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/657223559021977585/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.searchenginecaffe.com/2010/07/kdd-2010-coverage.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/657223559021977585'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/657223559021977585'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/kdd-2010-coverage.html' title='KDD 2010 Coverage, Best Paper Awards'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5756217950692196809</id><published>2010-07-27T03:32:00.004-04:00</published><updated>2010-07-27T03:59:49.198-04:00</updated><title type='text'>SIGIR 2010 Workshops: CrowdSourcing for Search Evaluation</title><content type='html'>Last Friday was &lt;a href="http://www.sigir2010.org/doku.php?id=program:workshops"&gt;SIGIR workshop day&lt;/a&gt;. First up is the workshop on &lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/"&gt;CrowdSourcing for Search Evaluation&lt;/a&gt;.  It focuses on using Amazon's Mechanical Turk (MT) and similar services to provide judgments.  I did not attend this workshop, but heard positive things from the attendees.  
The workshop is organized by &lt;a href="http://www.ischool.utexas.edu/%7Eml/"&gt;Matt Lease&lt;/a&gt;, &lt;a href="http://www.cs.cmu.edu/%7Evitor/"&gt;Vitor Carvalho&lt;/a&gt;, and &lt;a href="http://research.microsoft.com/en-us/people/eminey/"&gt;Emine Yilmaz&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The presentations and papers in the &lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/program.htm"&gt;program&lt;/a&gt; are available online.  Here are a few I want to highlight:&lt;br /&gt;&lt;br /&gt;A main highlight was the &lt;a href="http://crowdflower.com/"&gt;CrowdFlower&lt;/a&gt; keynote:&lt;br /&gt;&lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/slides/biewald.pptx"&gt;Better Crowdsourcing through Automated Methods for Quality Control&lt;/a&gt;&lt;br /&gt;CrowdFlower provides commercial support for companies performing tasks on Mechanical Turk.  Everyone had great things to say about this talk, which kept people enthralled even though it was the end of the day; some said it was the best talk of the conference.&lt;br /&gt;&lt;br /&gt;The other keynote was:&lt;br /&gt;&lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/slides/alonso-invited.pdf"&gt;Design of experiments for crowdsourcing search evaluation: challenges and opportunities&lt;/a&gt; by &lt;a href="http://wwwcsif.cs.ucdavis.edu/%7Ealonsoom"&gt;Omar Alonso&lt;/a&gt;. Don't miss the slides from Omar's &lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/slides/alonso-ECIR2010-tutorial.pdf"&gt;ECIR tutorial&lt;/a&gt;. They also had a paper at the workshop,&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/slides/alonso-paper.pdf"&gt;Detecting Uninteresting Content in Text Streams&lt;/a&gt;, which looked at using crowdsourcing to evaluate the 'interestingness' of tweets.  They found that most tweets (57%) were not interesting.  
They found that generally, tweets that contain links tend to be interesting (81% accuracy) and that those without links that were interesting generally contained named entities.&lt;br /&gt;&lt;br /&gt;Omar, &lt;a href="http://www.eecs.qmul.ac.uk/%7Egabs/"&gt;Gabriella Kazai&lt;/a&gt;, and &lt;a href="http://users.dimi.uniud.it/%7Estefano.mizzaro/"&gt;Stefano Mizzaro&lt;/a&gt; are working on a book on crowdsourcing that will be published by Springer in 2011.&lt;br /&gt;&lt;br /&gt;My labmate, &lt;a href="http://ciir.cs.umass.edu/%7Ehfeild/"&gt;Henry Feild&lt;/a&gt;, presented a paper, &lt;a href="http://www.ischool.utexas.edu/%7Ecse2010/slides/feildEtAl.pptx"&gt;Logging the Search Self-Efficacy of Amazon Mechanical Turkers&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Be sure to read over the rest of the program, because there are other great papers that I haven't had a chance to feature here.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5756217950692196809?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5756217950692196809/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-workshops-crowdsourcing-for.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5756217950692196809'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5756217950692196809'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-workshops-crowdsourcing-for.html' title='SIGIR 2010 Workshops: CrowdSourcing for Search 
Evaluation'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8855278592939980390</id><published>2010-07-22T10:35:00.004-04:00</published><updated>2010-07-23T05:05:37.013-04:00</updated><title type='text'>SIGIR 2010 Industry Day: Being Social: Context-aware and Personalized Info. Access</title><content type='html'>&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Being Social: Research in Context-aware and Personalized Information Access @ Telefonica&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;a href="http://twitter.com/xamat"&gt;Xavier Amatriain&lt;/a&gt;, &lt;a href="http://twitter.com/karenchurch"&gt;Karen Church&lt;/a&gt; and &lt;a href="http://twitter.com/solso"&gt;Josep M. Pujol&lt;/a&gt;, Telefónica&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;div&gt;Context overload&lt;/div&gt;&lt;div&gt;- the device of the future for information seeking is no longer the desktop&lt;/div&gt;&lt;div&gt;- it is mobile: iPad, mobile phone.&lt;/div&gt;&lt;div&gt;- Mobile phones are "personal"&lt;/div&gt;&lt;div&gt;- Mobile users tend to seek "fresh" content&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Where is the nearest florist? &lt;/div&gt;&lt;div&gt; -- this is pretty easy&lt;/div&gt;&lt;div&gt; -- where is that really cool cocktail bar I went to last month? (harder)&lt;/div&gt;&lt;div&gt; -- What about discovery?&lt;/div&gt;&lt;div&gt; -- Interesting things close to me? 
Events?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Can we improve the search and discovery experience of mobile users using social information?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Social Search Browser - SSB&lt;/div&gt;&lt;div&gt; - Karen Church&lt;/div&gt;&lt;div&gt; - iPhone web application + Facebook app&lt;/div&gt;&lt;div&gt; - displays queries/questions by other users in that location&lt;/div&gt;&lt;div&gt; - users can post and interact with queries from others&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;SSB was a tool for helping and sharing....&lt;/div&gt;&lt;div&gt;A tool for supporting curiosity... an extension to my social network&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;But!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Crowds are not always wise.  Predictions are based on large datasets that are sparse and noisy.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;User feedback is noisy&lt;/div&gt;&lt;div&gt; - you can trust it when something is rated excellent, but not necessarily the other way around.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"Trust Us - We're Experts"&lt;/div&gt;&lt;div&gt;- "It is really only experts who can reliably account for the decisions"&lt;/div&gt;&lt;div&gt;- The Wisdom of the Few - SIGIR '09&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Expert-based CF&lt;/div&gt;&lt;div&gt;An expert = an individual that we can trust to have produced thoughtful, consistent, and reliable evaluations (ratings) of items in a given domain.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Working prototypes&lt;/div&gt;&lt;div&gt;- Music recommendations, mobile geo-located recommendations...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Summary&lt;/div&gt;&lt;div&gt; - Sometimes the experts are better than your direct social network.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/18315968-8855278592939980390?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8855278592939980390/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-being-social.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8855278592939980390'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8855278592939980390'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-being-social.html' title='SIGIR 2010 Industry Day: Being Social: Context-aware and Personalized Info. Access'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3587493807804049356</id><published>2010-07-22T09:18:00.005-04:00</published><updated>2011-11-10T11:25:22.289-05:00</updated><title type='text'>SIGIR 2010 Industry Day: Lessons and Challenges from Product Search</title><content type='html'>&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: large;"&gt;Lessons and Challenges from Product Search&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;div&gt;Daniel Rose, A9&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Different Domains, Different Solutions&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Traditional IR,&lt;/div&gt;&lt;div&gt;- Enterprise search&lt;/div&gt;&lt;div&gt;- Web search&lt;/div&gt;&lt;div&gt;- 
Product Search&lt;/div&gt;&lt;div&gt;How are the issues different?  Let's go back to user goals...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Goals of Web Search&lt;/div&gt;&lt;div&gt;- Understanding user goals in web search paper (WWW 2004). &amp;nbsp;Manually clustered queries until they were stable.&lt;br /&gt;- Done at AltaVista in 2003 (not completely representative queries)&lt;br /&gt;- Most product queries fell into other categories&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Why do people search on Amazon?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- When they want to buy something?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Even ignoring the non-buying issues..&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The Goals of the product Search &lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Depends on where you are in the buying funnel.&lt;/div&gt;&lt;div&gt;-- Top: awareness, then Desire, then Interest, finally Action&lt;/div&gt;&lt;div&gt;St. Elmo Lewis, 1898&lt;/div&gt;&lt;div&gt;- Provide the right tools at the right stage in the process.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[roller coaster]&lt;/div&gt;&lt;div&gt;- toys and games&lt;/div&gt;&lt;div&gt;- sort by average customer review&lt;/div&gt;&lt;div&gt;- sort by price (is actually hard:  new vs. used, amazon vs. 
third-party, etc...)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Different Tools for Different Stages&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Product search shows more fluid movement between searching and browsing behavior (relying on faceted metadata)&lt;/div&gt;&lt;div&gt;- Because of the nature of the search task?&lt;/div&gt;&lt;div&gt;- Because of the interfaces?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What Amazon Queries Look Like&lt;/div&gt;&lt;div&gt;- [which old testament book best represent the chronological structure]&lt;/div&gt;&lt;div&gt;- [shipping rates for amazon]&lt;/div&gt;&lt;div&gt;- [long black underbust corset] - still looking  &lt;/div&gt;&lt;div&gt;- vs ISBN number -&amp;gt; about to buy it&lt;br /&gt;&lt;br /&gt;(mostly one word, mostly the name of a thing. &amp;nbsp;except "generator")&lt;br /&gt;top 10 across the US&lt;br /&gt;(kindle, kindle fire, skyrim, mw3, sonic generations, cars 2)&lt;br /&gt;&lt;br /&gt;queries in frequency deciles, by category&lt;br /&gt;US, books, electronics, apparel&lt;br /&gt;&amp;nbsp;--&amp;gt; very diverse, misspelling, miscategorization, all levels of the buying funnel&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Context is King&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Some facets for  Dresses vs. 
Digital Cameras&lt;/div&gt;&lt;div&gt;- The problem of facet selection&lt;/div&gt;&lt;div&gt;- Not a one size fits all UI solution for different facet types&lt;/div&gt;&lt;div&gt;- We can interpret your query in a smarter way: [timberland] boots inside shoes is a brand&lt;/div&gt;&lt;div&gt;- Timberland in music -&amp;gt; Timbaland the band (context dependent spelling correction)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Amazon is a MarketPlace...&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- So search must be realtime&lt;/div&gt;&lt;div&gt;-- new products&lt;/div&gt;&lt;div&gt;-- new merchants&lt;/div&gt;&lt;div&gt;-- prices being changed all the time&lt;br /&gt;-- items going in and out of stock all the time&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Structured Data: "It's a gift... and a curse"&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Unlike web search, we know the semantics of different bits of text&lt;/div&gt;&lt;div&gt;- We know what fields are important for customers (e.g. brand)&lt;/div&gt;&lt;div&gt;- A large degree of quality control (less adversarial problems)&lt;/div&gt;&lt;div&gt;- We don't have to do sentiment analysis to know if a review is positive/negative&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A Curse&lt;/div&gt;&lt;div&gt;- Search engine needs to have both DBMS-like "right answer" behavior and IR-like "best answer" behavior&lt;/div&gt;&lt;div&gt;- Traditional IR mechanisms don't always work well for structured data&lt;/div&gt;&lt;div&gt;-- e.g. naive tf x idf &amp;nbsp;doesn't work well (see BM25F)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What happens when one of the fields is orders of magnitude bigger than others?&lt;/div&gt;&lt;div&gt;-- Search inside the book vs. brand name&lt;/div&gt;&lt;div&gt;- What happens when you don't have all the fields all the time?  
(missing data)&lt;/div&gt;&lt;div&gt;-- ratings, reviews correlate with user satisfaction, but they may not be there&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Search Inside the Book&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- how often do you want to surface full-text matches vs. filter them out&lt;br /&gt;&amp;nbsp;- (example query: &amp;nbsp;[byte-aligned compression])&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Using Behavioral Data&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Powerful source of information for any search engine&lt;/div&gt;&lt;div&gt;- When is using behavioral data an invasion of privacy (or just plain creepy), and when is it better for users?&lt;/div&gt;&lt;div&gt;- Customers of a business seem more comfortable with that business learning from past behavior.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Interpreting Behavioral Signals&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Example: Are search result clicks good or bad?&lt;/div&gt;&lt;div&gt;- How many clicks are best?&lt;/div&gt;&lt;div&gt;-- 1: the customer found what they are looking for right away&lt;/div&gt;&lt;div&gt;-- many: the customer is comparison shopping and looking around at multiple items&lt;/div&gt;&lt;div&gt;-- zero: the search result contained all the information necessary&lt;/div&gt;&lt;div&gt;Also, some items are inherently "click attractive", e.g. a book with a sexy cover&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- "Why is the web so hard... 
to evaluate" (from snippet evaluation at Yahoo!)&amp;nbsp;2004&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Evaluating Product Search Relevance&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Common argument&lt;/div&gt;&lt;div&gt;-- Customers go to a shopping site to buy stuff&lt;/div&gt;&lt;div&gt;-- if a search engine change leads to customers buying more stuff, they must have had their search need met more effectively.&lt;/div&gt;&lt;div&gt;-- Therefore, relevance can be measured by how much customers buy.&lt;/div&gt;&lt;div&gt;What's wrong with this argument?&lt;/div&gt;&lt;div&gt;-- besides ignoring the rest of the buying funnel, and assuming that someone is ready to buy.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;The A/B Test Mystery&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Compare ranking algorithms A and B&lt;/div&gt;&lt;div&gt;- Assign half of users to A and half to B&lt;/div&gt;&lt;div&gt;- At the end, the avg. revenue is higher in A than B.  &lt;/div&gt;&lt;div&gt;-&amp;gt; Algorithm A could be better than B, or Algorithm A could be recommending higher priced items than B&lt;/div&gt;&lt;div&gt;-&amp;gt; Algorithm A could be recommending completely unrelated, but very popular items.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;So How to do Evaluation?&lt;/b&gt;&lt;br /&gt;&amp;nbsp;- A/B tests, automated metrics, editorial relevance assessments (possibly crowdsourced).&lt;br /&gt;&amp;nbsp;- Use all of them!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Lessons from IR&lt;/b&gt;&lt;/div&gt;&lt;div&gt;One idea: Generalizing the buying funnel&lt;/div&gt;&lt;div&gt;- The information seeking funnel&lt;/div&gt;&lt;div&gt;- Wandering: no information seeking goal in mind&lt;/div&gt;&lt;div&gt;- Exploring: have a general goal, but not a plan for achieving it&lt;/div&gt;&lt;div&gt;- Seeking: have started to identify info needs that must be satisfied, but needs are open-ended&lt;/div&gt;&lt;div&gt;- Asking: have a very specific information need corresponding 
to a closed class question&lt;/div&gt;&lt;div&gt;- Published in: The information seeking funnel, in Information-seeking support systems workshop 2008.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Summary&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Start thinking about how to meet user needs before user knows she has a need&lt;/div&gt;&lt;div&gt;- Offer different interaction mechanisms for different parts of the information seeking process&lt;/div&gt;&lt;div&gt;- Let type of content influence the way search works&lt;/div&gt;&lt;div&gt;- Design for realtime&lt;/div&gt;&lt;div&gt;- Interpret behavioral data carefully&lt;/div&gt;&lt;div&gt;- Exploit structure when you have it&lt;/div&gt;&lt;div&gt;- Exploit context when you have it&lt;/div&gt;&lt;div&gt;&lt;br /&gt;(My Thoughts and Questions)&lt;br /&gt;&amp;nbsp;- The world is not only Amazon. &amp;nbsp;What about linking the products to external sources, like Consumer Reports, dpreview and other sites?&lt;br /&gt;&amp;nbsp; --&amp;gt; Amazon enhanced Wikipedia (e.g. &lt;a href="http://www.amazon.com/wiki/Orson_Scott_Card/ref=sr_1_3_wp?qid=1320942156&amp;amp;sr=1-3-wp"&gt;Orson Scott Card&lt;/a&gt;)&lt;br /&gt;&amp;nbsp;- Social: how is Amazon incorporating social search?&lt;br /&gt;&amp;nbsp;--&amp;gt; delicate balancing act with Facebook and other sources&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&amp;nbsp;- Do you try to leverage mentions of products on book review sites? or within other books?&lt;br /&gt;&amp;nbsp;- I recently went to Barnes and Noble and saw the new book by Orson Scott Card, one of my favorite authors. &amp;nbsp;Why didn't Amazon surface that to me? (support for subscribing to authors) &amp;nbsp;Or, "buy the new top picks from this month's Cook's Illustrated"...&lt;br /&gt;&amp;nbsp;- From my perspective, the recommendation quality of Amazon has decreased over time despite more of my data. 
&amp;nbsp;Does this reflect a shift in emphasis?&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3587493807804049356?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3587493807804049356/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-lessons-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3587493807804049356'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3587493807804049356'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-lessons-and.html' title='SIGIR 2010 Industry Day: Lessons and Challenges from Product Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2316458481641302770</id><published>2010-07-22T06:59:00.004-04:00</published><updated>2010-07-22T07:06:56.448-04:00</updated><title type='text'>Microsoft Releases Learning to Rank Datasets</title><content type='html'>Microsoft Research announced that it is releasing a new &lt;a href="http://research.microsoft.com/en-us/projects/mslr/default.aspx"&gt;MS LTR dataset&lt;/a&gt;.&lt;br /&gt;&lt;blockquote&gt;We release two large 
scale datasets for research on learning to rank: MSLR-WEB30k with more than 30,000 queries and a random sampling of it MSLR-WEB10K with 10,000 queries.&lt;br /&gt;&lt;br /&gt;136 features have been extracted for each query-url pair.&lt;/blockquote&gt;The dataset is a retired dataset. What makes this quite interesting is that the features have been released.  You can see the &lt;a href="http://research.microsoft.com/en-us/projects/mslr/feature.aspx"&gt;feature list&lt;/a&gt;.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;See also the &lt;a href="http://learningtorankchallenge.yahoo.com/"&gt;Y! LTR datasets&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2316458481641302770?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2316458481641302770/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/microsoft-releases-learning-to-rank.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2316458481641302770'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2316458481641302770'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/microsoft-releases-learning-to-rank.html' title='Microsoft Releases Learning to Rank Datasets'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4124538116363676183</id><published>2010-07-22T05:26:00.005-04:00</published><updated>2010-07-22T05:44:57.609-04:00</updated><title type='text'>SIGIR 2010 Industry Day: Machine Learning in Search Quality at Yandex</title><content type='html'>&lt;b&gt;Machine Learning in Search Quality at Yandex&lt;br /&gt;&lt;/b&gt;Ilya Segalovich, Yandex&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;Russian Search Market&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Yandex has 60+% market share&lt;/div&gt;&lt;div&gt;- It's all about attention to small details of the search&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;A Yandex overview&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- started in 1997&lt;/div&gt;&lt;div&gt;- No. 7 search engine in the world by # of queries&lt;/div&gt;&lt;div&gt;- 150 million queries per day&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Variety of Markets&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- 15 countries with Cyrillic alphabet&lt;/div&gt;&lt;div&gt;- 77 regions in Russia&lt;/div&gt;&lt;div&gt;-&gt; different culture, standard of living, average income, for example: Moscow, Magadan&lt;/div&gt;&lt;div&gt;-&gt; large semi-autonomous ethnic groups (Tatar, Chechen, Bashkir)&lt;/div&gt;&lt;div&gt;-&gt; neighbouring bilingual markets&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Geo-specific queries&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Relevant result sets vary significantly across regions and countries&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;pFound&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- a probabilistic measure of user satisfaction&lt;/div&gt;&lt;div&gt;- optimization goal at Yandex since 2007&lt;/div&gt;&lt;div&gt;- Similar to ERR, Chapelle 2009 --&gt; hopefully someone can fill in the exact formula&lt;/div&gt;&lt;div&gt;- 
pFound, pBreak, pRel &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Geo-specific Ranking&lt;/b&gt;&lt;/div&gt;&lt;div&gt;query -&gt; query + user's region&lt;/div&gt;&lt;div&gt;- may need to build a specific formula for countries/regions because of the variance and missing/lacking features in some of them.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Alternatives in Regionalization&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- separate local indices or unified index with geo-coded pages&lt;/div&gt;&lt;div&gt;- one query or region specific query&lt;/div&gt;&lt;div&gt;- query based local intent detection vs. results based local intent detection&lt;/div&gt;&lt;div&gt;- single ranking function vs. co-ranking and re-ranking of local results&lt;/div&gt;&lt;div&gt;- train one formula or train many formulas on local pools&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Why use MLR?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Machine learning as a conveyor&lt;/div&gt;&lt;div&gt;- Some query classes require specific ranking&lt;/div&gt;&lt;div&gt;- many features&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;MatrixNet &lt;/b&gt;&lt;/div&gt;&lt;div&gt;A learning method&lt;/div&gt;&lt;div&gt;- boosted decision trees, "oblivious" trees.&lt;/div&gt;&lt;div&gt;- optimize for pFound&lt;/div&gt;&lt;div&gt;- solve regression tasks, train classifiers&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Complexity of ranking formulas&lt;/b&gt;&lt;/div&gt;&lt;div&gt;20 bytes - 2006&lt;/div&gt;&lt;div&gt;14 kb - 2008&lt;/div&gt;&lt;div&gt;220 kb - 2009&lt;/div&gt;&lt;div&gt;120 MB - 2010&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;A sequence of More and More complex rankers&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- pruning with the static rank (static features)&lt;/div&gt;&lt;div&gt;- use of simple dynamic features (such as BM25)&lt;/div&gt;&lt;div&gt;- complex formula that uses all the features 
available&lt;/div&gt;&lt;div&gt;- potentially up to a million matrices/trees for the very top documents&lt;/div&gt;&lt;div&gt; - see Cambazoglu, 2010 early exit optimization&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Geo-dependent queries: pFound&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- a big jump in 2009 in Quality&lt;/div&gt;&lt;div&gt;- 3x more local results than competitors in Russia (than the #2 player)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Lessons&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- MLR is the only key to regional search: it provides us the possibility of tuning many geo-specific models at the same time.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Challenges&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Complexity of the models is increasing rapidly&lt;/div&gt;&lt;div&gt; -&gt; don't fit into memory!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;MLR in its current setting does not fit well to time-specific queries&lt;/div&gt;&lt;div&gt;-&gt; features of the fresh content are very sparse and temporal&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Opacity of results of the MLR&lt;/div&gt;&lt;div&gt;- The backside of ML&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Number of features grows faster than the number of judgments&lt;/div&gt;&lt;div&gt;-&gt; hard to train ranking&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Learning from clicks and user behavior is hard&lt;/div&gt;&lt;div&gt;Tens of GB of data per day!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Yandex and IR&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Participation and Support&lt;/div&gt;&lt;div&gt;- Yandex MLR at IR context&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4124538116363676183?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://www.searchenginecaffe.com/feeds/4124538116363676183/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-machine.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4124538116363676183'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4124538116363676183'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-machine.html' title='SIGIR 2010 Industry Day: Machine Learning in Search Quality at Yandex'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3707886878411706909</id><published>2010-07-22T04:48:00.008-04:00</published><updated>2010-07-22T05:24:24.576-04:00</updated><title type='text'>SIGIR 2010 Industry Day: Query Understanding at Bing</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;b&gt;Query Understanding at Bing&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Jan Pedersen&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Standard IR assumptions&lt;/b&gt;&lt;/div&gt;&lt;div&gt; - Queries are well-formed expressions of intent&lt;/div&gt;&lt;div&gt; - Best effort response to the query as given&lt;/div&gt;&lt;div&gt;Reality: queries contain errors&lt;/div&gt;&lt;div&gt; - 10% of queries are misspelled&lt;/div&gt;&lt;div&gt; - incorrect use of terms (large vocabulary gap)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Users will reformulate&lt;/div&gt;&lt;div&gt; - if results do not meet 
information need&lt;/div&gt;&lt;div&gt;Reality: If you don't understand what's wrong you can't reformulate.  You miss good content and go down dead ends&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Take the query, understand what is being said and modify the query to get better results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Problem Definitions&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Best effort retrieval&lt;/div&gt;&lt;div&gt;  -- Find the most relevant results for the user query&lt;/div&gt;&lt;div&gt;  -- Query segmentation&lt;/div&gt;&lt;div&gt;  -- Stemming and synonym expansion&lt;/div&gt;&lt;div&gt;  -- Term deletion&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Automated Query Reformulation&lt;/div&gt;&lt;div&gt;- Modify the user query to produce more relevant results for the inferred intent&lt;/div&gt;&lt;div&gt; -- spell correction&lt;/div&gt;&lt;div&gt; -- term deletion&lt;/div&gt;&lt;div&gt; -- This takes more liberty with the user's intent&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Spelling correction&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Example: blare house&lt;/div&gt;&lt;div&gt;- corrected to "blair house".  Because the query was changed, there is a "recourse link" to back out.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Stemming&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- restaurant -&gt; restaurants&lt;/div&gt;&lt;div&gt;- sf -&gt; san francisco&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Abbreviations&lt;/b&gt;&lt;/div&gt;&lt;div&gt;-&gt; un jobs -&gt; united nations (may already be there in anchor text)&lt;/div&gt;&lt;div&gt;- utilize co-click patterns to find un/united nations for that page&lt;/div&gt;&lt;div&gt;- it is especially important for long queries, tail queries&lt;/div&gt;&lt;div&gt;&lt;div&gt;- not so good: federated news results for the query.  Is the same query interpretation being used consistently?  
The news vertical did not perform expansion and there is a problem.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Term Relaxation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;[what is a diploid chromosome] -&gt; "what is a" is not important from the SE matching, it introduces noise&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[where can I get an iPhone 4] -&gt; where is an important part of the query.  Removing "where" misses the whole point of the query&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[bowker's test of change tutorial] -&gt; test of symmetry is the correct terminology.  How do you know that it is the incorrect terminology?  If you relax it to "bowker's test" you get better results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Key Concepts&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Win/Loss ratios&lt;/div&gt;&lt;div&gt; -- wins are queries whose results improve&lt;/div&gt;&lt;div&gt; -- losses are queries whose results degrade&lt;/div&gt;&lt;div&gt;- Related to precision&lt;/div&gt;&lt;div&gt; -- but not all valid reformulations change results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Pre vs. Post result analysis&lt;/div&gt;&lt;div&gt; -- Query alternatives generated pre-results&lt;/div&gt;&lt;div&gt; -- Blending decisions are post results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Query Evaluation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;"Matching" level 0/l1/l2 -&gt; inverted index, matching and ranking.  Reduce billions to hundreds of thousands of pages.  Much of the loss can occur here because it never made it into the candidate set.  Assume that the other layers that use ML, etc... will bubble the correct results to the top.  
&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"Merge" layer L3 -&gt; the blending with multiple queries will be brought together&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Federation layer L4 -&gt; top layer coordinating activity&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;An important component is the query planner in L3 that performs annoation and rewriting.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Matching and Ranking&lt;/b&gt;&lt;/div&gt;&lt;div&gt;L0/l1 - 10^10 docs.  l0 - boolean set operations, l1- ir score (a linear function over simple features like bm25, simple and fast, but not very accurate)&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;L2 reranking - 10^5 docs - ML heavily lifting: 1500 features, proximity&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;L3 reranking - 10^3 - federation and blending&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;L4 -&gt; 10^1 &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Learning to rank Using Gradient Descent (for L2 layer)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Query Annotation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;NLP Query annotation&lt;/div&gt;&lt;div&gt;-  offline analysis&lt;/div&gt;&lt;div&gt;- Think of the annotations as a parse tree&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Ambiguity preserving&lt;/div&gt;&lt;div&gt;- multiple interpretations&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Backend independent&lt;/div&gt;&lt;div&gt; - shared&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Structure and Attributes&lt;/div&gt;&lt;div&gt;- Syntax and semantics (how to handle leaf nodes in the tree)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Query Planning&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;[{un  | "united nations"} jobs] -&gt; l3-merge(l2-rank([un jobs]), l2-rank(["united 
nations" jobs]))&lt;/div&gt;&lt;div&gt;OR&lt;/div&gt;&lt;div&gt;[{un| "united nations"} jobs] -&gt; l3-cascade(threshold, l2-rank([un jobs]), l2-rank(["united nations" jobs]))&lt;/div&gt;&lt;div&gt;-- the second is less certain and conditional&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Design Considerations&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Efficiency&lt;/div&gt;&lt;div&gt;- one user query may generate multiple backend queries that are merged in L3&lt;/div&gt;&lt;div&gt;- Some queries are cheaper than others&lt;/div&gt;&lt;div&gt; -- query reduction can improve performance&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Relevance&lt;/div&gt;&lt;div&gt; - L3 merging has maximal information, but is costly&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Multiple query plan strategies&lt;/div&gt;&lt;div&gt; - Depending on query analysis confidence&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Query Analysis Models&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Noisy Channel Model&lt;/div&gt;&lt;div&gt;argmax{ P(rewrite | query) } = argmax{ P(rewrite) P(query | rewrite) }&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;-- Bayes inversion&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- example: spelling&lt;/div&gt;&lt;div&gt; -- language model: likelihood of the correction&lt;/div&gt;&lt;div&gt; -- translation model: likelihood of the error occurring&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Language Models&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Based on Large-scale text mining&lt;/div&gt;&lt;div&gt; -- unigrams and N-grams (to favor common previously seen things, they make sense)&lt;/div&gt;&lt;div&gt; -- Probability of query term sequence&lt;/div&gt;&lt;div&gt;  -- favor queries seen before&lt;/div&gt;&lt;div&gt;  -- avoid nonsensical combinations&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1T n-gram resource&lt;/div&gt;&lt;div&gt; -- see MS ngram 
work here in SIGIR&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Translation Models&lt;/div&gt;&lt;div&gt;- training sets of aligned pairs (misspelling/correction; surface/stemmed)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Query log analysis &lt;/div&gt;&lt;div&gt; -- session reformulation&lt;/div&gt;&lt;div&gt; -- co-click -&gt; associated queries&lt;/div&gt;&lt;div&gt; -- manual annotation&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(missed the references, but see: Wei et al, Pang et al., Craswell)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Summary&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- 60-70% of queries are reformulated&lt;/div&gt;&lt;div&gt;- Can radically improve results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Trade-off between relevance and efficiency&lt;/div&gt;&lt;div&gt; - rewrites can be costly&lt;/div&gt;&lt;div&gt; - win/loss ratio is the key&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Especially important for tail queries&lt;/div&gt;&lt;div&gt; - no metadata to guide matching and ranking&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3707886878411706909?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3707886878411706909/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-query.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3707886878411706909'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3707886878411706909'/><link rel='alternate' type='text/html' 
href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-query.html' title='SIGIR 2010 Industry Day: Query Understanding at Bing'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7893834536407249560</id><published>2010-07-22T03:43:00.011-04:00</published><updated>2010-07-22T04:19:18.340-04:00</updated><title type='text'>SIGIR 2010 Industry Day: Search Flavours at Google</title><content type='html'>&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Search Flavours:  Recent updates and Trends&lt;/span&gt;&lt;div&gt;Yossi Matias&lt;/div&gt;&lt;div&gt;Director of Israel R&amp;amp;D Center, Google&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Solution for the search problem: imitate a person &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Wish list&lt;/div&gt;&lt;div&gt;- knows everything&lt;/div&gt;&lt;div&gt;- language agnostic&lt;/div&gt;&lt;div&gt;- always up to date&lt;/div&gt;&lt;div&gt;- context sensitive&lt;/div&gt;&lt;div&gt;- understands me&lt;/div&gt;&lt;div&gt;- Good sense of timing&lt;/div&gt;&lt;div&gt;- Good sense of scope&lt;/div&gt;&lt;div&gt;- Smart about interaction&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(Suggest answers to questions I didn't ask or didn't ask accurately)&lt;/div&gt;&lt;div&gt;In short, things we expect from people when we interact with experts or friends.  
This is subtle.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Demo of things&lt;/div&gt;&lt;div&gt;- auto suggest of weather; an intelligent guess at what the user will ask&lt;/div&gt;&lt;div&gt;- flight information for ua 101&lt;/div&gt;&lt;div&gt;- weather in the suggestion  &lt;/div&gt;&lt;div&gt;- This is new because the user does not have a chance to finish the question&lt;/div&gt;&lt;div&gt;- How do we understand the feedback when they don't have any feedback (except to maybe stop typing)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Being local [restaurant] (implicit context)&lt;/div&gt;&lt;div&gt;- world cup (now is a general answer, but a week or two ago it was very different)&lt;/div&gt;&lt;div&gt;- new forms of information: user generated content in real-time, Twitter&lt;/div&gt;&lt;div&gt;- [whale] it turns out there was a whale jumping on the ship&lt;/div&gt;&lt;div&gt;- Google trends shows hot topics&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Greater Depth With Real-Time&lt;/div&gt;&lt;div&gt;- Example of an earthquake.&lt;/div&gt;&lt;div&gt;- Two minutes after an earthquake, tweets were surfacing in the results before a formal announcement&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;How?&lt;/div&gt;&lt;div&gt;-- quick slide showing a chart, which he's not going into&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Social Circle Personalization&lt;/div&gt;&lt;div&gt; - someone I know blogs about something or a picture, surface it&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Understanding:  What does &lt;b&gt;&lt;span class="Apple-style-span"  style="color:#FF6666;"&gt;Change &lt;/span&gt;&lt;/b&gt;mean?&lt;/div&gt;&lt;div&gt;- change = to adjust (adjust the brightness) , or convert, or switch all depending on the context&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Paul McCartney Concert&lt;/div&gt;&lt;div&gt; - uploading real-time 
video from the concert&lt;/div&gt;&lt;div&gt;- A few may be good, but we don't want 300 clips all from the same concert&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Web translation&lt;/div&gt;&lt;div&gt;- language agnostic&lt;/div&gt;&lt;div&gt;- NY Times translated into Chinese&lt;/div&gt;&lt;div&gt;- translated search&lt;/div&gt;&lt;div&gt;- automatic captioning (translation of Obama speech to add Arabic captions)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Search by voice... any Voice&lt;/div&gt;&lt;div&gt;- People are starting to use it.&lt;/div&gt;&lt;div&gt;- How do you do it for any person, any language?  &lt;/div&gt;&lt;div&gt;- The combination of voice search and translation is almost like science fiction&lt;/div&gt;&lt;div&gt;- This is a significant technology worth paying attention to&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Search by Sight&lt;/div&gt;&lt;div&gt;- Google Goggles&lt;/div&gt;&lt;div&gt;- Mobile is important for contextual understanding (location)&lt;/div&gt;&lt;div&gt;- Phones are starting to take on behavior of smart agents&lt;/div&gt;&lt;div&gt;- 10 or 20 results are not useful on a smart phone, "I'm Feeling Lucky" is important&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The power of data&lt;/div&gt;&lt;div&gt;- 1.6 billion Internet users&lt;/div&gt;&lt;div&gt;- A billion searches a day on Google worldwide&lt;/div&gt;&lt;div&gt;- He started working in ML and data mining&lt;/div&gt;&lt;div&gt;- From a research perspective there is a massive benefit of working with it&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Trendonomics&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Timeliness&lt;/div&gt;&lt;div&gt;- how to leverage trends of data, such as user search to derive insights&lt;/div&gt;&lt;div&gt;- &lt;a href="http://www.google.com/insights/search"&gt;Google insights for Search&lt;/a&gt;&lt;/div&gt;&lt;div&gt;- Trends over time, location, etc.   
&lt;/div&gt;&lt;div&gt;- Identify outbreaks of flu: find queries that correlate with CDC reports&lt;/div&gt;&lt;div&gt;- Google could predict the outbreak two weeks ahead of the CDC, a heads up of something happening now.&lt;/div&gt;&lt;div&gt;- Nowcasting: forecasting the present based on information from the past&lt;/div&gt;&lt;div&gt;- Hal Varian: Predict economic indicators before they were published by the government.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Real Estate&lt;/div&gt;&lt;div&gt;- Using statistical models to provide up-to-the-minute information on where we are on economic indicators for sectors of real estate.&lt;/div&gt;&lt;div&gt;- It doesn't always work, but it's helpful&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2010 World Cup - new Search&lt;/div&gt;&lt;div&gt;- popularity of David Villa, etc...  &lt;/div&gt;&lt;div&gt;- South Africa, and sponsors getting attention&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Researching Search Trends Time-Series&lt;/div&gt;&lt;div&gt;- Forecasting.  Seasonality is a common case.  Many queries have strong seasonal components (yearly/weekly cycles)&lt;/div&gt;&lt;div&gt;- we can use time-series prediction models to forecast&lt;/div&gt;&lt;div&gt;- (e.g. skiing, sports)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Define notions of how predictable and regular the search queries are&lt;/div&gt;&lt;div&gt;- About half the search queries are predictable in a 12-month-ahead forecast with a mean absolute prediction error of 12% on average&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Health, Food &amp;amp; Drink, and ..   
are quite seasonal.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Categories are more predictable than individual queries&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Deviation from modeled prediction&lt;/div&gt;&lt;div&gt;- US automotive industry, forecasting: August 08 - July 09&lt;/div&gt;&lt;div&gt;- Maintenance and parts were ahead of forecast; new sales were below&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;See papers (a big long list...)&lt;/div&gt;&lt;div&gt;...&lt;/div&gt;&lt;div&gt;What can search predict?  many publications by Hal Varian&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There is no API, but it is possible to download.  They are encouraging collaboration with researchers.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Big themes of the talk:&lt;/div&gt;&lt;div&gt; - real-time is expected ('local'), mobile access&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7893834536407249560?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7893834536407249560/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-search-flavours.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7893834536407249560'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7893834536407249560'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-industry-day-search-flavours.html' title='SIGIR 2010 Industry Day: Search Flavours at 
Google'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7149684972705809025</id><published>2010-07-22T03:04:00.003-04:00</published><updated>2010-07-22T03:31:28.665-04:00</updated><title type='text'>SIGIR 2010 Best Paper Award Winners</title><content type='html'>The best paper awards were awarded last night at the banquet. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Best Paper Award&lt;/b&gt;&lt;div&gt;&lt;a href="http://portal.acm.org/ft_gateway.cfm?id=1835548&amp;amp;type=pdf&amp;amp;coll=ACM&amp;amp;dl=ACM"&gt;Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs&lt;/a&gt;, R. White, J. Huang&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;In this paper, we present a log-based study estimating the user value of trail following. We compare the relevance, topic coverage, topic diversity, novelty, and utility of full trails over that provided by sub-trails, trail origins (landing pages), and trail destinations (pages where trails end). Our findings demonstrate significant value to users in following trails, especially for certain query types. The findings have implications for the design of search systems, including trail recommendation systems that display trails on search result pages.&lt;/blockquote&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Best Student Paper&lt;/b&gt;&lt;/div&gt;&lt;a href="http://portal.acm.org/ft_gateway.cfm?id=1835512&amp;amp;type=pdf&amp;amp;coll=ACM&amp;amp;dl=ACM"&gt;A comparison of general vs personalized affective models for the prediction of topical relevance&lt;/a&gt;, I. Arapakis, K. Athanasakos, J. 
Jose&lt;br /&gt;&lt;blockquote&gt;The main goal is to determine whether the behavioural differences of users have an impact on the models' ability to determine topical relevance and if, by personalising them, we can improve their accuracy. For modelling relevance we extract a set of features from the facial expression data and classify them using Support Vector Machines. Our initial evaluation indicates that accounting for individual differences and applying personalisation introduces, in most cases, a noticeable improvement in the models' performance.&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7149684972705809025?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7149684972705809025/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-best-paper-award-winners.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7149684972705809025'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7149684972705809025'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-best-paper-award-winners.html' title='SIGIR 2010 Best Paper Award Winners'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-2398714732468690438</id><published>2010-07-22T03:03:00.011-04:00</published><updated>2010-07-22T03:43:44.139-04:00</updated><title type='text'>SIGIR Industry Day: Baidu on Future Search</title><content type='html'>Future Search: From Information Retrieval to Information Enabled Commerce&lt;div&gt;William Chang, Baidu&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Two commerce revolutions&lt;/div&gt;&lt;div&gt;- 1995 the first web search engines (ebay, amazon, etc...)&lt;/div&gt;&lt;div&gt;- China miracle&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Early History of IEC&lt;/div&gt;&lt;div&gt;- Early shippers:  created corporations, but more important there is a futures market&lt;/div&gt;&lt;div&gt;- Commerce: coming together to trade: trading goods and information&lt;/div&gt;&lt;div&gt;- Local: Yellow pages created in 1886&lt;/div&gt;&lt;div&gt;- Local classified ads in papers&lt;/div&gt;&lt;div&gt;- Mail order: Sears catalogue in 1888 for farming supplies (enabled by efficient postal service)&lt;/div&gt;&lt;div&gt;- Credit cards: consumer production and data mining&lt;/div&gt;&lt;div&gt;- Development of "advertising science": print, radio, tv&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I.E.C in our Daily Lives&lt;/div&gt;&lt;div&gt;- Restaurant menus&lt;/div&gt;&lt;div&gt;- Zagat, Michelin&lt;/div&gt;&lt;div&gt;- Shopping guides, supermarket aisles&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Technology and Internet&lt;/div&gt;&lt;div&gt;- Walmart: real-time transaction tracking and inventory management; scale and speed&lt;/div&gt;&lt;div&gt;- Amazon: user generated reviews and recommendations, common business platform&lt;/div&gt;&lt;div&gt;- eBay&lt;/div&gt;&lt;div&gt;- Craigslist&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;Search Engines&lt;/div&gt;&lt;div&gt;- Y! Directory&lt;/div&gt;&lt;div&gt;- Lycos Crawler, Altavista big index, Excite, HotBot&lt;/div&gt;&lt;div&gt;- Infoseek (1996-1999) where he worked&lt;/div&gt;&lt;div&gt;  - OR queries&lt;/div&gt;&lt;div&gt;  - Phrase inference and query rewriting&lt;/div&gt;&lt;div&gt;  - Banner ads tied to search keywords&lt;/div&gt;&lt;div&gt;  - real-time addurl&lt;/div&gt;&lt;div&gt;  - anti-spam (adversarial IR)&lt;/div&gt;&lt;div&gt;  - hyperlink voting and anchor text indexing&lt;/div&gt;&lt;div&gt;  - log analysis and query suggestion&lt;/div&gt;&lt;div&gt;- Goto.com / Overture paid placement&lt;/div&gt;&lt;div&gt;- Google ad platform: AdWords, AdSense&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Search as Media&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Working definition of a media company (1997)&lt;/div&gt;&lt;div&gt;&lt;blockquote&gt;A media company's business is to help other businesses build brands, and a brand is the total loyalty of the company's customers.  
A "new media company" does this by leveraging the interactive nature of the Internet to enable users to communicate with one another..."&lt;/blockquote&gt;&lt;/div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;China Economics&lt;br /&gt;&lt;/span&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;China Background&lt;/div&gt;&lt;div&gt;- reality: only 15% of Internet users earn 5000/year&lt;/div&gt;&lt;div&gt;- inflation at 5%, with spurts of hyper-inflation&lt;/div&gt;&lt;div&gt;- education and personal aspiration: virtually no illiteracy, but there is a problem with brain drain (about a million of the best and brightest left and never came back)&lt;/div&gt;&lt;div&gt;- competition is fierce in school and in work&lt;/div&gt;&lt;div&gt;- gender equality: one child policy&lt;/div&gt;&lt;div&gt;- entrepreneurial spirit&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Economy&lt;/div&gt;&lt;div&gt;- GDP is growing 10% annually&lt;/div&gt;&lt;div&gt;- Despite a tradition of honoring "old brands" there are few new domestic brands and little marketing know-how&lt;/div&gt;&lt;div&gt;- Domestic commerce is still nascent, lacking IEC tools (no yellow pages or directories that work)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Prize&lt;/div&gt;&lt;div&gt;- Highly developed Internet in user and usage count: 420 million users, 85% broadband, the average spends 20/week on the internet&lt;/div&gt;&lt;div&gt;- Sitra (sp?) 
the Expedia of China&lt;/div&gt;&lt;div&gt;- Micropayments are made via phone bills: even children use them to buy games online&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The Money&lt;/div&gt;&lt;div&gt;- Half the Internet population is under 25&lt;/div&gt;&lt;div&gt;- Tencent QQ is an IM used by everyone, virtual currency, with real economics&lt;/div&gt;&lt;div&gt;- Online games: Shanda, Giant, etc.&lt;/div&gt;&lt;div&gt;- Taobao/Alibaba already 1% of GDP, dominates B2C goods&lt;/div&gt;&lt;div&gt;- Baidu web search dominates B2C services (health, education... help on cramming)&lt;/div&gt;&lt;div&gt;- China Mobile: everyone uses it, and for almost everything &lt;/div&gt;&lt;div&gt;- Ctrip: integration of online, mobile, offline services&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Baidu&lt;/div&gt;&lt;div&gt;- Aladdin: Open Search Platform (allows webmasters to submit query and content pairs)&lt;/div&gt;&lt;div&gt;  -- rich results that form an application&lt;/div&gt;&lt;div&gt;&lt;br /&gt;- iKnow (2005) an open Q&amp;amp;A platform: the largest in the world  &lt;/div&gt;&lt;div&gt;  -- has many partner websites, all with a Q&amp;amp;A panel on their website&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Ark: Open product database&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- Map++ (embed yellow-pages-like information on a map)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Baidu Aladdin:&lt;br /&gt;Travel&lt;/div&gt;&lt;div&gt;- On the result page, there is a full panel with airline reservation &lt;/div&gt;&lt;div&gt;Housing, Shopping&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A few more ideas:&lt;/div&gt;&lt;div&gt;- The average Chinese worker spends 2-3 hours per day on public transportation.  They spend the time playing games or reading "new literature".  This is an opportunity for mobile shopping recommendation&lt;/div&gt;&lt;div&gt;- Shopping malls are almost impossible to navigate.  
There are no directories or ways to find things&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Conclusions&lt;/div&gt;&lt;div&gt;- Depends critically on information quality and security: spam&lt;/div&gt;&lt;div&gt;- Users demand quality, but there are still not solid reliable brands&lt;/div&gt;&lt;div&gt;- There are new novel business models to explore: a trillion dollar opportunity&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-2398714732468690438?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/2398714732468690438/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-industry-day-baidu-on-future.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2398714732468690438'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/2398714732468690438'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-industry-day-baidu-on-future.html' title='SIGIR Industry Day: Baidu on Future Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-4529318880046277857</id><published>2010-07-21T03:08:00.014-04:00</published><updated>2010-07-23T05:07:43.461-04:00</updated><title type='text'>SIGIR 2010 Keynote: Donna Harman on Cranfield Paradigm</title><content 
type='html'>&lt;div&gt;&lt;span class="Apple-style-span"   style="  line-height: 17px; font-family:'Lucida Grande', Verdana, Lucida, Helvetica, Arial, sans-serif;font-size:14px;"&gt;&lt;strong style="padding-top: 0px; padding-right: 0px; padding-bottom: 0px; padding-left: 0px; margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "&gt;&lt;/strong&gt;&lt;/span&gt;&lt;/div&gt;&lt;span&gt;&lt;span&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;&lt;b&gt;Is the Cranfield Paradigm Outdated?&lt;br /&gt;&lt;/b&gt;&lt;/span&gt;by Donna Harman, NIST&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;b&gt;Cranfield 1 - (1958 - 1960?) &lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Missed most of this due to a late bus.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Cranfield 2 - 1962-1966&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Goal: learn what makes a good descriptor&lt;/div&gt;&lt;div&gt;New user model: researcher wanting all documents relevant to their question&lt;/div&gt;&lt;div&gt;Documents: 1400 recent papers in aeronautical engineering&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Questions gathered from authors of the papers, asking for the basic problem the paper addressed and also supplemental questions that could have been put to an information service&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Full relevance assessments at 5 levels&lt;/div&gt;&lt;div&gt;- complete answer to a question&lt;/div&gt;&lt;div&gt;- high degree of relevance... 
necessary for the work&lt;/div&gt;&lt;div&gt;- useful as background&lt;/div&gt;&lt;div&gt;- minimal interest, historical interest only&lt;/div&gt;&lt;div&gt;- no interest&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Hundreds of manual experiments using different combinations of the index terms, specificity, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Metrics used were recall ratio and precision ratio (set retrieval)&lt;br /&gt;&lt;br /&gt;The results said we could just use the words in the document (used title and abstract)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Cranfield paradigm (defined)&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Faithfully model a real user application, in this case searching appropriate abstractions with real questions&lt;/div&gt;&lt;div&gt;- have enough documents and queries to allow significant testing on results&lt;/div&gt;&lt;div&gt;- build the collection before the experiments in order to prevent human bias and enable re-usability&lt;/div&gt;&lt;div&gt;- define a metric that reflects real user  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Continuation in SMART project&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Mike Keen spent time at Cornell working on new collections&lt;/div&gt;&lt;div&gt;- (a description of SMART Test Collections)&lt;/div&gt;&lt;div&gt;- They found there was only a 30% agreement between questioners and assessors, but there was no significant difference in how the systems ranked.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Continuation in TREC&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- In 1990 DARPA asked NIST to build a new test collection for the TIPSTER project&lt;/div&gt;&lt;div&gt;- User model: intelligence analysts&lt;/div&gt;&lt;div&gt;- large numbers of newspaper articles&lt;/div&gt;&lt;div&gt;- TIPSTER Disk 1 and 2 (mixed short and long documents to force people to focus on length normalization and scale up to full-text from 
abstracts) &lt;/div&gt;&lt;/div&gt;&lt;div&gt;- Topics 1-50 were training topics.  Topics 51-100 were created by one person&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Relevance Judgments&lt;/div&gt;&lt;div&gt;- Used pooling (took the top 100 docs from each run).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Overlap for 8 years of Ad Hoc&lt;/div&gt;&lt;div&gt;- The queries from TREC-1 to TREC-8 got progressively narrower with fewer relevant documents &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;What is relevant?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Back to the user model&lt;/div&gt;&lt;div&gt;- A document is relevant if you would use it in a report in some manner &lt;/div&gt;&lt;div&gt;- This means that even if only one sentence is useful, the document is relevant&lt;/div&gt;&lt;div&gt;- "Duplicates" are also relevant, as it would be very difficult to define and remove these &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;How complete is the relevant set? 
(Tipster)&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- some relevant documents are not in pools&lt;/div&gt;&lt;div&gt;- But, lack of bias in pools is crucial so that systems that don't contribute to the pool can be fairly judged&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Other Relevancy issues (Tipster)&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Relevancy is time and user dependent&lt;/div&gt;&lt;div&gt; - learning issues, novelty issues&lt;/div&gt;&lt;div&gt;- user profile issues such as prior knowledge, reason for doing search, etc...&lt;/div&gt;&lt;div&gt;- TREC picked the broadest definition of relevancy for several reasons&lt;/div&gt;&lt;div&gt;  - it fit the user model well&lt;/div&gt;&lt;div&gt;  - it was well-defined and thus likely to be followed&lt;/div&gt;&lt;div&gt;  - thousands of documents must be judged quickly (300 documents per hour)&lt;/div&gt;&lt;div&gt;  - (Keep these lessons in mind when using Mechanical Turk)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b&gt;TREC Genomics Track&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- User Model: medical researchers working with MEDLINE and full-text journals&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Topics: Started with a user survey looking for questions&lt;/div&gt;&lt;div&gt;- Included topics based on 4 generic topic type templates and instantiated from real user requests&lt;/div&gt;&lt;div&gt;System response&lt;/div&gt;&lt;div&gt;- ranked list of up to 1000 passages (pieces of paragraphs)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;TREC Legal Track&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Very dependent on user model.  
It is modeled after actual legal discovery practice with topics and relevance judgments done by lawyers&lt;/div&gt;&lt;div&gt;- Documents: 7 million messy XML records on tobacco &lt;/div&gt;&lt;div&gt;- Topics: hypothetical complaints&lt;/div&gt;&lt;/div&gt;&lt;div&gt;- Relevance judgments: from pool created by sampling&lt;/div&gt;&lt;div&gt;- Metrics: set retrieval, F @ k&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Others: NTCIR, ImageCLEF, INEX - the requirements are all determined by the user model&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;TREC Web Tracks&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Initially used ad hoc user model, just scaled up to 100 GB&lt;/div&gt;&lt;div&gt;- Then scaled to 426 gigabytes&lt;/div&gt;&lt;div&gt;  - judgments unlikely to be complete&lt;/div&gt;&lt;div&gt;  - possible bias in relevant documents &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Cranfield Paradigm outdated??&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Faithfully model a real user application&lt;/div&gt;&lt;div&gt;- However, we need to think outside the current implementations of the Cranfield paradigm to find new user models for the web&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;User Tasks and Types&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- TREC-6, Allan et al. on ranked list vs. visualization&lt;/div&gt;&lt;div&gt;- Bhavnani, TREC 2001: medical librarians and CS students&lt;/div&gt;&lt;div&gt;- White, Dumais, and Teevan: Large-scale log studies looking at how domain experts search, such as vocabulary, resources, etc...  &lt;/div&gt;&lt;div&gt;- Alonso and Mizzaro SIGIR '09 -- Interesting results on what users find important qualities of result sets&lt;/div&gt;&lt;div&gt;- Lin &amp;amp; Smucker, SIGIR '09: PubMed study&lt;/div&gt;&lt;div&gt;- Using logs to determine goals:  Rose and Levinson, WWW 2004 manually classified search goals from the Y! 
logs &lt;/div&gt;&lt;div&gt;- Others...&lt;/div&gt;&lt;div&gt;- Guo, White, Dumais, Wang &amp;amp; Anderson at RIAO 2010: predicting query performance based on user interaction features&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Diversity study using logs&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Clough, et al.  SIGIR '09 poster, work in WSCD '09 to study diversity, ambiguity in MS log&lt;/div&gt;&lt;div&gt;- Size of Wikipedia article on the topic and query reformulations indicated diversity&lt;/div&gt;&lt;div&gt;Bendersky &amp;amp; Croft WSCD'09 - work on describing long queries.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;How can we apply these lessons?&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Ad hoc experiments must continue&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- There are many different access needs that are basically traditional ad hoc retrieval: specific tasks, long queries, etc...&lt;/div&gt;&lt;div&gt;- Scores in Robust, etc. are still not good.  We know that there are "easy" things that could be done to improve results significantly: better term weighting, stemming, relevance feedback, etc...  
&lt;/div&gt;&lt;div&gt;- However, we need to think more about other information access methods, especially on the web/mobile phone, etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;ClueWeb 09&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- If we are going to do "ad hoc" retrieval, where can we get "enough" of the "right" topics?&lt;/div&gt;&lt;div&gt;- How do we get relevance judgments; is it possible to sample and still have a "reusable" collection?&lt;/div&gt;&lt;div&gt;- Is reusability important; how do we reconcile the fact that users only look at the top (the web user model) with the reusability of a collection?&lt;/div&gt;&lt;div&gt; - search engines only judge the very top&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What else should we look at in the web track? Specific subsets of the web.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Retrofit TREC etc. collections&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;User Simulation&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- Lin &amp;amp; Smucker suggested that Cranfield is only one model for user simulation; that new test collections could be built for other user models&lt;/div&gt;&lt;div&gt;- We have log studies, plus examples of feature tables from log studies to provide some reality&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Cranfield Paradigm not outdated&lt;/b&gt;&lt;/div&gt;&lt;div&gt;- We still need to work on ad hoc! &lt;/div&gt;&lt;div&gt;- But, we have to look at new web user models&lt;/div&gt;&lt;div&gt;  -- focus on specific web queries where we can contribute (e.g. 
not 'britney spears')&lt;/div&gt;&lt;div&gt;- We also need to think outside the ranked list mindset; surely that is not all there is!!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-4529318880046277857?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/4529318880046277857/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-keynote-donna-harmon-on.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4529318880046277857'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/4529318880046277857'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-keynote-donna-harmon-on.html' title='SIGIR 2010 Keynote: Donna Harman on Cranfield Paradigm'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3532110544473889898</id><published>2010-07-20T08:34:00.003-04:00</published><updated>2010-07-20T08:44:49.675-04:00</updated><title type='text'>Amit Singhal on the Evolution of Search: Searching without Searches</title><content type='html'>Engadget has an article covering a presentation (no details provided) given by &lt;a href="http://singhal.info/"&gt;Amit Singhal&lt;/a&gt; on the&lt;a 
href="http://www.engadget.com/2010/07/16/googles-amit-singhal-tells-us-about-the-dreams-search-engines-a/"&gt; evolution of search&lt;/a&gt;.  Most of the interview outlines the evolution of search towards multimedia, real-time search, etc...  Most of it has been well covered in the past.  One interesting note is that Amit outlines his vision for one possible future direction of search.&lt;br /&gt;&lt;blockquote&gt;Your phone knows about your shopping needs because they're in your to-do list and it knows about your meetings because they're in your schedule. All it needs is your location (which, of course, it has) and some local area information, and it'll ping out a message advising you that you can just pop down the road, buy that wooden stick, and be back in time for your 2PM with Marty from the Synergy Department.&lt;/blockquote&gt;The search engine detected an implicit information need from your to do list that it could satisfy efficiently.  It's still a distant dream.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3532110544473889898?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3532110544473889898/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/amit-singhal-on-evolution-of-search.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3532110544473889898'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3532110544473889898'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/amit-singhal-on-evolution-of-search.html' title='Amit Singhal on the Evolution of Search: Searching without 
Searches'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6072621004488306344</id><published>2010-07-20T03:29:00.014-04:00</published><updated>2010-07-20T04:16:39.671-04:00</updated><title type='text'>SIGIR 2010 Keynote Address: Refactoring Search by Gary Flake</title><content type='html'>&lt;div&gt;&lt;div&gt;&lt;div&gt;SIGIR 2010 coverage is starting.  You can also follow the coverage on Twitter, &lt;a href="https://twitter.com/search?q=%23sigir2010"&gt;#sigir2010&lt;/a&gt;. Here are the raw notes from the first keynote address.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Refactoring Search by &lt;/span&gt;&lt;a href="http://flakenstein.net/"&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Gary Flake&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;- aka Zoomable UIs, IR, and the Uncanny Valley&lt;/div&gt;&lt;div&gt;- Bing search meets Pivot.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://msdn.microsoft.com/en-us/library/cc645050(VS.95).aspx"&gt;DeepZoom&lt;/a&gt; &lt;a href="http://www.seadragon.com/"&gt;Seadragon &lt;/a&gt;demo&lt;/div&gt;&lt;div&gt;- 50 gigs of scans of the seattle intelligencer, 600 dpi&lt;/div&gt;&lt;div&gt;- a proof of concept.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.getpivot.com/"&gt;Microsoft Pivot&lt;/a&gt;&lt;/div&gt;&lt;div&gt; - take raw data; combine it with metadata for faceted navigation&lt;/div&gt;&lt;div&gt; - A look at Census data on death.  
A very novel way of navigating the dataset&lt;/div&gt;&lt;div&gt; - Between search and browsing.&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Web Search Retrospective&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What's worked well:&lt;/div&gt;&lt;div&gt; - instant answers&lt;/div&gt;&lt;div&gt; - spell correction&lt;/div&gt;&lt;div&gt; - vertical tabs&lt;/div&gt;&lt;div&gt; - query suggestions&lt;/div&gt;&lt;div&gt; - query completion&lt;/div&gt;&lt;div&gt; - grouping results&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;- The biggest improvement is in overall index scale&lt;/div&gt;&lt;div&gt;- Some improvement in core relevance&lt;/div&gt;&lt;div&gt;- But, this list is actually pretty modest&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What hasn't worked as well&lt;/div&gt;&lt;div&gt; - Natural language queries&lt;/div&gt;&lt;div&gt; - richer representations for results&lt;/div&gt;&lt;div&gt; - richer presentation for one result&lt;/div&gt;&lt;div&gt; - clustering (visual or otherwise)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A lack of fluidity is part of the problem&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Grokker (RIP, 2009)&lt;/div&gt;&lt;div&gt; - The sexiest search experience that no one was going to use.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead of discrete shifts from one query to the next, can we make it a more fluid, interactive process?&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Uncanny Valley&lt;/div&gt;&lt;div&gt;- As you increase the sophistication it becomes more pleasing until it becomes "too real" and then they feel like zombies.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Discrete vs. 
Continuous Interactions&lt;/div&gt;&lt;div&gt;CGI: stick figures -&gt; The Simpsons -&gt; Toy Story -&gt; Polar Express -&gt; Avatar&lt;/div&gt;&lt;div&gt;UIs: text terminal -&gt; web 1.0 -&gt; Rich client -&gt; over ambitious ajax -&gt; Good Zoomable UIs&lt;/div&gt;&lt;div&gt;Search:  Grep -&gt; Altavista -&gt; present day search engines -&gt; Grokker -&gt; ???&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Surpassing the uncanny valley is exceedingly difficult because it requires excellence in science, technology, and.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our dilemmas&lt;/div&gt;&lt;div&gt;- We are already familiar with the dilemma of precision and recall.&lt;/div&gt;&lt;div&gt;- There exists a similar dilemma around scale, fluidity, and complexity.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Zoomable UIs and Similarities to IR&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;DeepZoom items&lt;/div&gt;&lt;div&gt; * each tile is an image file&lt;/div&gt;&lt;div&gt; * each level is a set of image files in a folder&lt;/div&gt;&lt;div&gt; * each pyramid is a set of folders with image tiles for each level&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;DeepZoom collections&lt;/div&gt;&lt;div&gt; * thumbnails are packed onto shared tiles&lt;/div&gt;&lt;div&gt; * loading 100s of images requires loading few tiles.&lt;br /&gt;* very simple: hierarchical file structure with XML description and metadata (no db)&lt;/div&gt;&lt;div&gt;- The fidelity of the experience is independent from the size of the object.&lt;/div&gt;&lt;div&gt;- The trick is to build the pyramid on the backend&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The net outcome is that the user feels in control.  
"It's like having super human powers" to change levels of &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Why user control is essential&lt;/div&gt;&lt;div&gt;- they feel empowered to explore&lt;/div&gt;&lt;div&gt;- Actions are more clearly invertible&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lessons from ZUIs to apply to IR&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;- Preprocess on the backend&lt;/div&gt;&lt;div&gt;- Assume the front end can do a lot&lt;/div&gt;&lt;div&gt;- Build UI around continuous interactions&lt;/div&gt;&lt;div&gt;- Use asynchronous I/O between endpoints&lt;/div&gt;&lt;div&gt;- Use the two in combination to reinforce one another (left versus right brain)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Higher level goals&lt;/div&gt;&lt;div&gt;- Turn the present discrete mode of interaction of search into a continuous dialogue&lt;/div&gt;&lt;div&gt;- Support fluid interactions that are powerful, informative, and fun&lt;/div&gt;&lt;div&gt;- Scale to thousands of items within the user / client interactions&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The biggest challenge is in dynamic generation of collections on the backend&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Server-side IR problems&lt;/div&gt;&lt;div&gt; - ranking, facet determinations&lt;/div&gt;&lt;div&gt; - cleaning / augmenting bipartite graph&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Pivot + search architecture&lt;/div&gt;&lt;div&gt; - Uses the Bing API + thumbnail cache to use Pivot to explore search results&lt;/div&gt;&lt;div&gt;- The UI supports a novel way of analysing a larger corpus of pages across multiple queries.&lt;/div&gt;&lt;div&gt;   -- e.g. 
dpreview.com is prominent across multiple different queries about camera reviews&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;First: do no harm&lt;/div&gt;&lt;div&gt; - Linear order must be obvious&lt;/div&gt;&lt;div&gt; - First result or instant answer is prominent&lt;/div&gt;&lt;div&gt; - First 4 or so items are easily visible&lt;/div&gt;&lt;div&gt; - Preserve title / url / description format&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Next modestly improve&lt;/div&gt;&lt;div&gt; - handle more results: &gt; 50&lt;/div&gt;&lt;div&gt; - basic n-gram extraction&lt;/div&gt;&lt;div&gt;...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Not done&lt;/div&gt;&lt;div&gt; - Document classes as facets&lt;/div&gt;&lt;div&gt; - Document similarity as synthetic facets&lt;/div&gt;&lt;div&gt; - Folksonomies and community tags&lt;/div&gt;&lt;div&gt; - Federation and verticals&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Vicious cycle of the web&lt;/div&gt;&lt;div&gt;- easy to create -&gt; more people create&lt;/div&gt;&lt;div&gt;- More stuff created - harder to find good stuff&lt;/div&gt;&lt;div&gt;....&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;What's the cure?&lt;/div&gt;&lt;div&gt;We desperately need a mode of interaction where the whole of the data is greater than the sum of the parts.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Wisdom &gt; knowledge &gt; information &gt; data&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Q&amp;amp;A&lt;/div&gt;&lt;div&gt;- For facets: word frequency with stop words from abstracts and titles (with just a little cleanup)&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6072621004488306344?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://www.searchenginecaffe.com/feeds/6072621004488306344/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-keynote-address-refactoring.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6072621004488306344'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6072621004488306344'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/sigir-2010-keynote-address-refactoring.html' title='SIGIR 2010 Keynote Address: Refactoring Search by Gary Flake'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5612749196136397926</id><published>2010-07-19T04:57:00.002-04:00</published><updated>2010-07-19T05:08:24.992-04:00</updated><title type='text'>Headed to SIGIR 2010</title><content type='html'>I'm leaving for Geneva today to attend &lt;a href="http://www.sigir2010.org/"&gt;SIGIR&lt;/a&gt;.  I look forward to seeing you there!  I will be live-blogging the keynote talks (subject to WiFi availability) and providing other coverage.  I will also be &lt;a href="http://twitter.com/jeffd"&gt;tweeting&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Today is &lt;a href="http://www.sigir2010.org/doku.php?id=program:tutorials"&gt;tutorial day.&lt;/a&gt;   The main talks start tomorrow.  To get started, here are the best paper nominees from the website.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;    A comparison of general vs personalized affective models for the prediction of topical relevance, I. 
Arapakis, K. Athanasakos, J. Jose&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Assessing the Scenic Route: Measuring the Value of Search Trails in Web Logs, R. White, J. Huang&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Caching Search Engine Results over Incremental Indices, F. Junqueira, R. Blanco, E. Bortnikov, R. Lempel, L. Telloli, H. Zaragoza&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Comparing the Sensitivity of Information Retrieval Metrics, F. Radlinski, N. Craswell&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Extending Average Precision to Graded Relevance Judgments, S. Robertson, E. Kanoulas, E. Yilmaz&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Information Based Model for ad hoc information retrieval, S. Clinchant, E. Gaussier&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Multi-style language model for web scale information retrieval, K. Wang, J. Gao, X. Li&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;      Properties of Optimally Weighted Data Fusion in CBMIR, P. Wilkins, A. 
Smeaton&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5612749196136397926?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5612749196136397926/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/headed-to-sigir-2010.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5612749196136397926'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5612749196136397926'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/headed-to-sigir-2010.html' title='Headed to SIGIR 2010'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8553678739918084986</id><published>2010-07-16T09:40:00.005-04:00</published><updated>2010-07-16T11:03:39.115-04:00</updated><title type='text'>The Impact of TREC and its Future Directions</title><content type='html'>This week NIST released a report on the &lt;a href="http://trec.nist.gov/pubs/2010.economic.impact.pdf"&gt;Economic Impact Assessment of TREC&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;In section 6 they report the results of a survey where stakeholders were  asked to rate the importance of the different TREC tracks.  The most  important tracks were Adhoc Track with 77% rating it very important and  the Web Track with 74%.  
Other highly rated tracks were the TrecVid and  Q&amp;amp;A tracks.  In contrast, the Spam Track (mainly email) and  Speech track were ranked at the bottom.&lt;br /&gt;&lt;br /&gt;Here are a few highlights from the conclusion.  In monetary terms:&lt;br /&gt;&lt;blockquote&gt;As described in Section 6, $16 million of discounted investments have made by NIST and others in TREC have resulted in $81 million in discounted extrapolated benefits or a net present value of $65 million.&lt;/blockquote&gt;The report goes on to say that the same rate of benefit may not continue because much  of the growth happened in the late 90s with the rapid expansion of the PC and the use of the Web.&lt;br /&gt;&lt;br /&gt;The other main section I found interesting was the possible future directions:&lt;br /&gt;&lt;blockquote&gt;Several trends in survey responses emerged; 37 respondents indicated that TREC should expand into new tracks, 20 said TREC should develop new evaluation methods, and 17 said TREC should develop new data sets. 
Common suggestions were the following:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Focus on more user behavior data (e.g., social data, Twitter, geographically based) to improve on the Interactive track.&lt;/li&gt;&lt;li&gt;Continue to look at multimedia search techniques (e.g., pictures, video).&lt;/li&gt;&lt;li&gt;Expand into more focused search areas (e.g., chemistry, drug design, evidence-based medicine).&lt;/li&gt;&lt;/ul&gt;More broadly, several respondents suggested that TREC should work with industry to increase their participation in the TREC workshops, as well as to solicit data that they might allow the TREC audience to use, thus increasing the usefulness of TREC results...&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8553678739918084986?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8553678739918084986/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/impact-of-trec-and-its-future.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8553678739918084986'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8553678739918084986'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/impact-of-trec-and-its-future.html' title='The Impact of TREC and its Future Directions'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-7071327168133263835</id><published>2010-07-14T07:34:00.003-04:00</published><updated>2010-07-14T07:46:28.269-04:00</updated><title type='text'>Google Funds Digital Classics Research on Google Books</title><content type='html'>Today Google announced that it is providing almost 1 million dollars to researchers working on projects in the digital humanities. &lt;br /&gt;&lt;br /&gt;The first award winners are &lt;a href="http://googleresearch.blogspot.com/2010/07/our-commitment-to-digital-humanities.html"&gt;listed in the blog entry&lt;/a&gt;.   In particular I want to highlight Gregory Crane's award on classics.  They previously worked to release a collection of &lt;a href="http://booksearch.blogspot.com/2010/06/google-releases-500-scans-of-ancient.html"&gt;high quality scans for 500 important classics works&lt;/a&gt; which are available for download.&lt;br /&gt;&lt;br /&gt;Also, congratulations to &lt;a href="http://www.cs.umass.edu/%7Emimno"&gt;David Mimno&lt;/a&gt; from UMass with David Blei for &lt;i&gt;The Open Encyclopedia of Classical Sites.  
&lt;/i&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-7071327168133263835?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/7071327168133263835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-funds-digital-classics-research.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7071327168133263835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/7071327168133263835'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-funds-digital-classics-research.html' title='Google Funds Digital Classics Research on Google Books'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8997407768508469312</id><published>2010-07-09T09:40:00.005-04:00</published><updated>2010-07-09T09:46:24.643-04:00</updated><title type='text'>SELand Interview with Johanna Wright from Google on Directions in Search</title><content type='html'>SELand has an &lt;a href="http://searchengineland.com/where-is-search-going-googles-johanna-wright-45983?utm_source=feedburner&amp;amp;utm_medium=feed&amp;amp;utm_campaign=Feed%3A+searchengineland+%28Search+Engine+Land%29&amp;amp;utm_content=Google+Feedfetcher"&gt;interview with Johanna Wright&lt;/a&gt;, Google’s Product Management Director for Web Search.  
While I think some of the questions were marginally interesting, one part of the interview caught my attention:&lt;br /&gt;&lt;blockquote&gt;...users are finding things like event dates and locations right in their search results—information that someone could “use” to go to an event. With more recent innovations like our updated answers feature, we’ve moving beyond ranking webpages to actually extract and understand structured information from across the web. Do a search today for [barack obama birthday] and you don’t just get a website, you get the answer, sourced from across the web.&lt;br /&gt;&lt;/blockquote&gt;It is interesting to see the emphasis on exploiting structured and semantic data in search.  However, overall there is really nothing new in the interview.  It's mostly a rehash of the company line on previous product rollouts related to Universal Search, real-time search, etc...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8997407768508469312?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8997407768508469312/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/seland-interview-with-johanna-wright.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8997407768508469312'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8997407768508469312'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/seland-interview-with-johanna-wright.html' title='SELand Interview with Johanna Wright from Google on Directions in 
Search'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3245109366795828895</id><published>2010-07-05T11:51:00.001-04:00</published><updated>2010-07-05T11:51:00.970-04:00</updated><title type='text'>Google gets into Travel: ITA Acquisition</title><content type='html'>I use &lt;a href="http://www.itasoftware.com/"&gt;ITA software&lt;/a&gt; quite regularly when planning trips.  However, it caught me by surprise when Google announced late last week that it was buying them.  The acquisition was &lt;a href="http://googleblog.blogspot.com/2010/07/taking-off-with-ita.html"&gt;announced on the Google blog&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Google is trying to make travel easier.  From the post:&lt;br /&gt;&lt;blockquote&gt;... finding the right flight at the best price is a frustrating  experience; pricing and availability change constantly, and even a  simple two city itinerary involves literally thousands of different  options. We’d like to make that search much easier... 
Once we’ve completed our acquisition of ITA, we’ll work on creating new  flight search tools that will make it easier for you to search for  flights, compare flight options and prices and get you quickly to a site  where you can buy your ticket.&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3245109366795828895?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3245109366795828895/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-gets-into-travel-ita-acquisition.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3245109366795828895'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3245109366795828895'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-gets-into-travel-ita-acquisition.html' title='Google gets into Travel: ITA Acquisition'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8372393775495604477</id><published>2010-07-02T03:53:00.003-04:00</published><updated>2010-07-02T04:19:28.433-04:00</updated><title type='text'>Google Scholar and Microsoft Academic Search Updates</title><content type='html'>&lt;a href="http://scholar.google.com"&gt;Google Scholar&lt;/a&gt; &lt;a 
href="http://googlescholar.blogspot.com/2010/07/search-within-citing-articles.html"&gt;announced yesterday&lt;/a&gt; a new feature that allows searching within citing articles.&lt;br /&gt;&lt;br /&gt;Microsoft has also re-released the &lt;a href="http://academic.research.microsoft.com/"&gt;Microsoft Academic Search&lt;/a&gt; from MSR Asia.  The MAS search has a faceted search interface which allows filtering by author, conference, and journal.  In addition, it goes beyond searching papers by providing a page for each author.  This is quite useful and it has interesting visualizations to show citations over time and connections to other authors. &lt;br /&gt;&lt;br /&gt;It's great to see improvements in this area.  Scholarly research is still far too difficult and we have a long way to go in making the search tools more capable.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8372393775495604477?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8372393775495604477/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-scholar-and-microsoft-academic.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8372393775495604477'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8372393775495604477'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/google-scholar-and-microsoft-academic.html' title='Google Scholar and Microsoft Academic Search Updates'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' 
height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6610747289306925227</id><published>2010-07-01T10:31:00.001-04:00</published><updated>2010-07-01T10:31:00.062-04:00</updated><title type='text'>2010 Hadoop Summit</title><content type='html'>Yesterday was the &lt;a href="http://developer.yahoo.com/events/hadoopsummit2010/"&gt;2010 Y! Hadoop Summit&lt;/a&gt;. Be sure to read &lt;a href="http://www.bytemining.com/2010/06/my-experience-at-hadoop-summit-2010-hadoopsummit/"&gt;Ryan  Rosario's coverage&lt;/a&gt;.  Many of the presentations are linked from the &lt;a href="http://developer.yahoo.com/events/hadoopsummit2010/agenda.html"&gt;Agenda page&lt;/a&gt; and via the &lt;a href="http://www.slideshare.net/ydn/presentations"&gt;YDN slideshare channel&lt;/a&gt;.  I'll highlight a few presentations that caught my attention:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.slideshare.net/ydn/3-xxl-graphalgohadoopsummit2010"&gt;XXL Graph Algorithms&lt;/a&gt; - by &lt;a href="http://research.yahoo.com/Sergei_Vassilvitskii"&gt;Sergei Vassilvitskii&lt;/a&gt; Connected component analysis in large graphs&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.slideshare.net/ydn/2-mining-billionnodeshadoopsummit2010"&gt;Mining Billion-node Graphs: Patterns, Generators and Tools&lt;/a&gt; - Jimmy Lin's presentation on experience computing PageRank on a section from the ClueWeb09 web graph.&lt;br /&gt;&lt;br /&gt;Hopefully, the morning talks will also be made available online.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6610747289306925227?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6610747289306925227/comments/default' title='Post Comments'/><link 
rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/07/2010-hadoop-summit.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6610747289306925227'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6610747289306925227'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/07/2010-hadoop-summit.html' title='2010 Hadoop Summit'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-5975543473888524632</id><published>2010-06-30T05:44:00.004-04:00</published><updated>2010-06-30T05:59:16.979-04:00</updated><title type='text'>Eclipse Helios Release - Improved C++ and Web tools</title><content type='html'>Every June we look forward to a new version of &lt;a href="http://www.eclipse.org"&gt;Eclipse&lt;/a&gt;.  Earlier this month we had the release of Eclipse 3.6, aka Helios.&lt;br /&gt;&lt;br /&gt;Get started with a &lt;a href="http://www.ibm.com/developerworks/opensource/library/os-eclipse-helios/index.html"&gt;tour from IBM developerworks&lt;/a&gt;.  
A few participating plugins worth highlighting:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.eclipse.org/egit/"&gt;EGit&lt;/a&gt; - an Eclipse plugin for the Git version control system.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.eclipse.org/linuxtools/"&gt;Linux Tools&lt;/a&gt; - A full-featured C and C++ IDE building on the older CDT development toolkit.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://www.eclipse.org/webtools/jsdt/"&gt;JavaScript Development Tools&lt;/a&gt; - Updated and improved JavaScript development and debugging&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-5975543473888524632?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/5975543473888524632/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/06/eclipse-helios-release-improved-c-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5975543473888524632'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/5975543473888524632'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/06/eclipse-helios-release-improved-c-and.html' title='Eclipse Helios Release - Improved C++ and Web tools'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6633183299107448127</id><published>2010-06-29T06:24:00.006-04:00</published><updated>2010-06-30T04:23:42.339-04:00</updated><title type='text'>ICML 2010 and Yahoo! Learning to Rank Workshop</title><content type='html'>Last week was &lt;a href="http://www.icml2010.org/"&gt;ICML 2010&lt;/a&gt; in Haifa.  You can read Hal's &lt;a href="http://nlpers.blogspot.com/2010/06/icml-2010-retrospective.html"&gt;coverage on NLPers&lt;/a&gt;. The conference also had two workshops of note, the &lt;a href="http://learningtorankchallenge.yahoo.com/workshop.php"&gt;Yahoo! Learning to Rank Workshop&lt;/a&gt; and the &lt;a href="http://learningtorankchallenge.yahoo.com/workshop.php"&gt;Machine Learning Open Source Software (mloss)&lt;/a&gt; workshop.  I'm going to focus mainly on the LTR workshop, but be sure to check out the mloss site for more details.&lt;br /&gt;&lt;br /&gt;One highlight of YLTR was &lt;a href="http://research.microsoft.com/en-us/people/cburges/"&gt;Chris Burges&lt;/a&gt;' MSR team winning track 1 with LambdaMART.  They give an overview of their method in a recent tech report:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://research.microsoft.com/en-us/um/people/cburges/tech_reports/MSR-TR-2010-82.pdf"&gt;From RankNet to LambdaRank to LambdaMART: An Overview&lt;/a&gt;&lt;br /&gt;&lt;blockquote&gt;LambdaMART is the boosted tree version of LambdaRank, which is based on RankNet. RankNet, LambdaRank, and LambdaMART have proven to be very successful algorithms for solving real world ranking problems: for example an ensemble of LambdaMART rankers won Track 1 of the recent Yahoo! Learning To Rank Challenge.&lt;/blockquote&gt;The other winning teams were from the Russian search company, Yandex.  
See the &lt;a href="http://translate.google.es/translate?hl=en&amp;amp;sl=ru&amp;amp;u=http://habrahabr.ru/company/yandex/blog/97689/&amp;amp;ei=fcwpTLfLAoaJOK6Y4LID"&gt;company blog post&lt;/a&gt; on the topic (via Google Translate).  You can also read the presentations from the top teams:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://download.yandex.ru/company/ICML2010_pavlov.pdf"&gt;BagBoo: Bagging the Gradient Boosting&lt;/a&gt; by Dmitry Pavlov and Cliff Brunk&lt;br /&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://download.yandex.ru/company/ICML2010-kuralenok.pdf"&gt;YetiRank: Everybody Lies&lt;/a&gt; by Andrey Gulin and Igor Kuralenok&lt;br /&gt;I find YetiRank particularly interesting because they show that modeling error and uncertainty in relevance judgments can improve model effectiveness. &lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;The two teams' methods are related to those used by Yandex for its &lt;a href="http://translate.google.com/translate?hl=en&amp;amp;sl=ru&amp;amp;u=http://company.yandex.ru/technology/matrixnet/"&gt;ranking&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6633183299107448127?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/6633183299107448127/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/06/icml-2010-and-learning-to-rank-workshop.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6633183299107448127'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6633183299107448127'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/06/icml-2010-and-learning-to-rank-workshop.html' title='ICML 
2010 and Yahoo! Learning to Rank Workshop'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-49057895394093770</id><published>2010-06-09T04:20:00.003-04:00</published><updated>2010-06-09T04:32:15.057-04:00</updated><title type='text'>Google Indexing gets high on Caffeine</title><content type='html'>Last August, I wrote about &lt;a href="http://www.searchenginecaffe.com/2009/08/google-unveils-new-caffeine-search.html"&gt;Google testing the new Caffeine infrastructure&lt;/a&gt;.  Today, Carrie Grimes &lt;a href="http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html"&gt;announced on their blog&lt;/a&gt; that the new indexing system is complete.  &lt;div&gt;&lt;blockquote&gt;Caffeine provides 50 percent fresher results for web searches than our last index... Our old index had several layers, some of which were refreshed at a faster rate than others; the main layer would update every couple of weeks. To refresh a layer of the old index, we would analyze the entire web, which meant there was a significant delay between when we found a page and made it available to you.&lt;br /&gt;&lt;br /&gt;With Caffeine, we analyze the web in small portions and update our search index on a continuous basis, globally...&lt;/blockquote&gt;&lt;/div&gt;My congratulations to the search infrastructure team.  
It sounds like a significant milestone!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-49057895394093770?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/49057895394093770/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/06/google-indexing-gets-high-on-caffeine.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/49057895394093770'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/49057895394093770'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/06/google-indexing-gets-high-on-caffeine.html' title='Google Indexing gets high on Caffeine'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3054797630832497496</id><published>2010-06-08T04:12:00.004-04:00</published><updated>2010-06-08T05:07:33.141-04:00</updated><title type='text'>SIGMOD 2010 live stream</title><content type='html'>Just a quick note to say that I'm back from holiday.  I'm settled in Barcelona for the summer, so the posting time schedule may be slightly shifted.&lt;br /&gt;&lt;br /&gt;The &lt;a href="http://www.sigmod2010.org/"&gt;SIGMOD 2010&lt;/a&gt; conference is underway this week.  The papers are online at the ACM DL, &lt;a href="http://portal.acm.org/toc.cfm?id=1807167"&gt;SIGMOD page&lt;/a&gt;. 
The keynote talks are &lt;a href="http://www.sigmod2010.org/stream.shtml"&gt;being streamed live&lt;/a&gt;.  Today's keynote talk is by Jon Kleinberg at 8:30am EDT.&lt;br /&gt;&lt;br /&gt;Also, in case you've been living under a rock (like me), you can catch up on the &lt;a href="http://videolectures.net/wsdm2010_newyork/"&gt;WSDM 2010 videos&lt;/a&gt; that have been online for a while.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3054797630832497496?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3054797630832497496/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/06/getting-back-into-things.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3054797630832497496'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3054797630832497496'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/06/getting-back-into-things.html' title='SIGMOD 2010 live stream'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-8363560860604628538</id><published>2010-05-20T04:33:00.004-04:00</published><updated>2010-05-20T04:42:43.346-04:00</updated><title type='text'>Traveling...</title><content type='html'>You've probably noticed the lack of updates around here.  I am travelling.  
I'm currently in London on my way to Barcelona.  I'll be in Barcelona this summer for an internship at &lt;a href="http://research.yahoo.com/"&gt;Yahoo! Research&lt;/a&gt;.  My project there will likely involve search over semantic web data, in the vein of the work at the &lt;a href="http://km.aifb.kit.edu/ws/semsearch10/"&gt;SemSearch Workshop&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I expect it will be early June before I'm settled and get back to a more regular schedule.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-8363560860604628538?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/8363560860604628538/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/05/traveling.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8363560860604628538'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/8363560860604628538'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/05/traveling.html' title='Traveling...'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-882942796328051340</id><published>2010-05-10T11:41:00.004-04:00</published><updated>2010-05-10T11:58:27.811-04:00</updated><title type='text'>OpenLibrary relaunch, LSH on MapReduce, and WWW Inferring User Intent 
Tutorial</title><content type='html'>A couple quick links from late last week to share.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://openlibrary.org/"&gt;OpenLibrary &lt;/a&gt;Relaunch- A redesigned OpenLibrary website was launched last week, designed by Caterina Flake.   The new site is like Wikipedia meets a Library catalog.  See their &lt;a href="http://blog.openlibrary.org/"&gt;blog for details&lt;/a&gt;.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://code.google.com/p/likelike/"&gt;LikeLike &lt;/a&gt;- An implementation of LSH written for Hadoop.  There isn't much documentation on how it is parallelizing the computation.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a href="http://ir.mathcs.emory.edu/intent_tutorial/"&gt;Inferring Web Searcher Intent Tutorial&lt;/a&gt; - The slides from &lt;a href="http://www.mathcs.emory.edu/~eugene/"&gt;Eugene Agichtein's&lt;/a&gt; WWW 2010 tutorial are now available.  The first part of the tutorial provides an overview of user task and behaviour models.  
The second part focuses on utilizing implicit feedback from clicks and other interaction activity.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-882942796328051340?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/882942796328051340/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/05/openlibrary-relaunch-lsh-on-mapreduce.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/882942796328051340'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/882942796328051340'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/05/openlibrary-relaunch-lsh-on-mapreduce.html' title='OpenLibrary relaunch, LSH on MapReduce, and WWW Inferring User Intent Tutorial'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-3502214893307978681</id><published>2010-05-01T00:29:00.003-04:00</published><updated>2010-05-01T08:03:51.184-04:00</updated><title type='text'>Best Paper Awards and Nominees at WWW 2010</title><content type='html'>&lt;span&gt;&lt;span&gt;&lt;a href="http://www.ischool.utexas.edu/~ml/"&gt;Matt Lease&lt;/a&gt; kindly sent me the best paper award information. (You should check out the&lt;a href="http://www.ischool.utexas.edu/~ml/ir-sp10/"&gt; grad IR class&lt;/a&gt; he is teaching this semester). 
 Unfortunately, I can't find them all available online yet.&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;b&gt;Best Poster Award&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;How much is your Personal Recommendation Worth&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;Paul Dütting, Monika Henzinger and Ingmar Weber&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;div&gt;SourceRank: Relevance and Trust Assessment for Deep &lt;span&gt;Web Sources Based on Inter-Source Agreement (&lt;/span&gt;&lt;a href="http://rakaposhi.eas.asu.edu/SourceRank_Poster_WWW.pdf"&gt;PDF&lt;/a&gt;&lt;span&gt;)&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;span&gt;Raju Balakrishnan and &lt;/span&gt;&lt;a href="http://rakaposhi.eas.asu.edu"&gt;Subbarao Kambhampati&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;b&gt;Best Student Paper&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;Privacy Wizards for Social Networking Sites (&lt;/span&gt;&lt;a href="http://www.eecs.umich.edu/~klefevre/Publications_files/www2010.pdf"&gt;PDF&lt;/a&gt;&lt;span&gt;)&lt;br /&gt;&lt;a href="http://www.eecs.umich.edu/~ljfang/"&gt;Lujun Fang&lt;/a&gt;, &lt;/span&gt;&lt;a href="http://www.eecs.umich.edu/~klefevre/Home.html"&gt;Kristen LeFevre&lt;/a&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;b&gt;Best Paper nominees&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;Factorizing Personalized Markov Chains for Next-Basket Recommendation &lt;i&gt;&lt;b&gt;(winner)&lt;/b&gt;&lt;/i&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;Steffen Rendle (Osaka University), Christoph Freudenthaler, and Lars Schmidt-Thieme (University of Hildesheim).&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;A 
Refreshing Perspective of Search Engine Caching (&lt;a href="http://research.yahoo.com/files/paper_13.pdf"&gt;PDF&lt;/a&gt;)&lt;br /&gt;Flavio Junqueira, Berkant Barla Cambazoglu, Vassilis Plachouras, Swee Lim, Baoqiu Cui, Scott Banachowski&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span&gt;&lt;span&gt;&lt;br /&gt;&lt;div&gt;AdHeat: An Influence-based Diffusion Model for Propagating Hints to Match Ads&lt;/div&gt;&lt;div&gt;Hongji Bao, Ed Chang&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;div&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-3502214893307978681?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/3502214893307978681/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/05/best-paper-awards-and-nominees-at-www.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3502214893307978681'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/3502214893307978681'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/05/best-paper-awards-and-nominees-at-www.html' title='Best Paper Awards and Nominees at WWW 2010'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' 
src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-237220996426966172</id><published>2010-04-29T12:43:00.004-04:00</published><updated>2010-04-29T12:52:30.759-04:00</updated><title type='text'>Danah Boyd WWW keynote: Privacy and Publicity in Big Data</title><content type='html'>Today &lt;a href="http://twitter.com/Zephoria"&gt;Danah Boyd&lt;/a&gt; gave an address on &lt;i&gt;Privacy and Publicity in the context of big data &lt;span class="Apple-style-span" style="font-style: normal;"&gt;at &lt;/span&gt;&lt;/i&gt;&lt;a href="http://www2010.org/www/"&gt;WWW 2010&lt;/a&gt;.  Danah released a &lt;a href="http://www.danah.org/papers/talks/2010/WWW2010.html"&gt;crib sheet summary&lt;/a&gt; on her website, which you should read.  Here are &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael&lt;/a&gt;'s notes from the talk.&lt;div&gt;&lt;ul type="DISC"&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Privacy concerns are everywhere&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Big data (Social data created&lt;br /&gt;by people) magnifies these concerns&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Data is cheap today&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Making sense of the data, however, is still hard&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Accessing 
and processing it in an ethical way is not investigated&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Methodological  issues&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Big data introduces more&lt;br /&gt;questions than answers&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Ethnography tries to answer&lt;br /&gt;some of these “why” questions&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Social Sciences Approach&lt;br /&gt;– 4 key points&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Bigger is not always better&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Not all data are equal&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;“What” != “Why” &lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Be careful in interpretations&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  
style="font-size:small;"&gt;Sampling&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;The way you sample affects&lt;br /&gt;your results – hard to create truly random representative sample&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Big data doesn’t mean&lt;br /&gt;“the whole of the data”&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;No matter how many tweets&lt;br /&gt;you have, your sample is always biased&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Oversampling users who tweet&lt;br /&gt;  frequently&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Not all data are equal&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;What does your network represent?&lt;br /&gt;Types of social network&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Articulated&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Behavioral&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Personal&lt;br /&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Data from Facebook is not&lt;br /&gt;necessarily more accurate than data from other (smaller) social networks&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Facebook friends  != person’s social network&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Frequency of conversation != personal closeness&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;What != Why&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Correlation does not mean causation&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Even if your model indicates that two events are connected, it doesn’t mean one causes&lt;br /&gt;the other&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Results need to be interpreted&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Technology can corrupt social&lt;br /&gt;  science research by making simplifying assumptions and ignoring&lt;br /&gt;  the context in which original results were obtained&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ul&gt;&lt;span class="Apple-style-span"  
style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Uncertainty principle applies&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Networks are made of people,&lt;br /&gt;  not of abstract nodes on the graph&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Data in the network is about&lt;br /&gt;  real people’s lives&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Just because data is accessible&lt;br /&gt;doesn’t mean that using it is ethical!&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Privacy is context&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Walls (Technology) have&lt;br /&gt;ears (and mouths)&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Five point for privacy security&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;  &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Security through obscurity&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Violated more and more by&lt;br /&gt;  
technology&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Technologies change people’s&lt;br /&gt;  behavior&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Not all is meant to be publicized&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Do we all want to become&lt;br /&gt;  “digital micro-celebrities” and fear the “digital paparazzi”?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;PII vs. PEI (Personal Identifiable&lt;br /&gt;vs. Embarrassing Information)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Algorithms have a hard time&lt;br /&gt;  discerning PII &amp;amp; PEI&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Data out of context is a&lt;br /&gt;privacy violation&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Privacy is not access control&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;People care about privacy&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span 
class="Apple-style-span"  style="font-size:small;"&gt;But they all also care about&lt;br /&gt;publicity – a right to be in public&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Facebook&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Facebook users have an impression&lt;br /&gt;that “Facebook is more private than MySpace”&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Newsfeed – publicizing&lt;br /&gt;implicit (but accessible) content in explicit way&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Initially controversial,&lt;br /&gt;  became a great success&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Created a set of norms in&lt;br /&gt;  the “Facebook world”&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Beacon – people are vessels&lt;br /&gt;for advertisements&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Was a failure, ended in&lt;br /&gt;  a user lawsuit&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;New default privacy settings&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      
&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Research shows that people&lt;br /&gt;  do not understand their privacy settings in Facebook&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;In fact, their mental map&lt;br /&gt;  of settings doesn’t match the actual settings&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Slow changes from private&lt;br /&gt;to public&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Users are like frogs who&lt;br /&gt;  are slowly “cooked” and do not realize it&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Data from 3&lt;/span&gt;&lt;sup&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;rd&lt;/span&gt;&lt;/sup&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;  party sites is slowly aggregated&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;        &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Tastes,  web actions&lt;br /&gt;    are made public&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Opt-out is the norm at Facebook&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;People do not understand&lt;br /&gt;  what they implicitly agree 
to&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Regulations&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Involvement from governments&lt;br /&gt;(esp. from Europe, Canada)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;font-size:78%;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Researchers --- need to&lt;br /&gt;understand the consequences of their analysis&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-237220996426966172?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/237220996426966172/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/04/danah-boyd-www-keynote-privacy-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/237220996426966172'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/237220996426966172'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/04/danah-boyd-www-keynote-privacy-and.html' title='Danah Boyd WWW keynote: Privacy and Publicity in Big 
Data'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-6302767643323374550</id><published>2010-04-28T16:10:00.006-04:00</published><updated>2010-04-28T16:32:03.067-04:00</updated><title type='text'>Search is Dead! Long Live Search Panel at WWW 2010</title><content type='html'>&lt;div&gt;Continuing the WWW 2010 coverage, this afternoon there was a panel, &lt;a href="http://www.google.com/calendar/render?eid=cjJpaDYyMG12Z2VyZDYyM3FrNGF1cHY0cmMgY2E0ZmNqNW0xdW1tMDFmanZhZjJjcmtmNjhAZw&amp;amp;ctz=America/New_York&amp;amp;sf=true&amp;amp;output=xml"&gt;Search is Dead! Long Live Search&lt;/a&gt;. You can see a poor-quality &lt;a href="http://qik.com/video/6360405"&gt;video stream&lt;/a&gt; of the panel.  The following are the notes from my labmate, &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael&lt;/a&gt;.  You can also see the discussion on Twitter, &lt;a href="http://twitter.com/search?q=%23searchisdead"&gt;#searchisdead&lt;/a&gt;.&lt;/div&gt;&lt;ul&gt;&lt;p&gt;&lt;span style="font-family:Calibri;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:large;"&gt;Search is dead! 
Long Live Search.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;&lt;/ul&gt;&lt;ul&gt;&lt;ul type="DISC"&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Search for 10 blue links is already dead&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;A failure case is if a user sees just the 10 blue links&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;There are far more diverse data sources and presentations than links to web pages&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Intense competition to get the tail queries right&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;You miss everyone if you miss the tail&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;It doesn’t take much to get into the tail – 1 or 2 more keywords&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Enormous need to resurface implicit structured information for keyword queries&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span 
class="Apple-style-span"  style="font-size:small;"&gt;How to satisfy the tail?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;10 blue links are not enough&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Show structured data not&lt;br /&gt;just for popular queries, but for tail queries as well: maps for “historical houses in Raleigh”&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;UI challenge&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Change in how people produce content&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;From newswire docs to webdocs to blogs to Twitter&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Searches that don’t work&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Book Search&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Wikipedia&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span 
class="Apple-style-span"  style="font-size:small;"&gt;Images&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Complex queries&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Capturing user behavior activity&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Reformulations given by users are more likely to be clicked than automatically generated ones.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Better tools for capturing&lt;br /&gt;how users interact with the results&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Facets&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Already are used in some vertical domains by Bing&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Automatically extracting facets from raw text&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;     &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Caveat: are they meaningful?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;   &lt;li&gt;&lt;span 
style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Are users getting used to facets? The “jury is still out”&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Can become exponentially complex&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Mobile Search&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Voice search is a big change&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Penalty for longer queries go down – natural language processing will become more important&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Surprising finding – people tend to type (not speak) longer queries on mobile&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Recognizing long stateless speech  utterances is hard&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Lots of apps&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;     &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Tail queries can be better served by niche apps&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;     &lt;li&gt;&lt;span 
style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;How to integrate results between apps?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;     &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Geo-information&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;     &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Comes for free in phones and has to be used by a successful search engine&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;     &lt;/ul&gt;&lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt; &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Social Search&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;We already do social search by using click data&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Can we do better using social networks?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;   &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Are applications like Aardvark effective?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-6302767643323374550?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://www.searchenginecaffe.com/feeds/6302767643323374550/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/04/search-is-dead-panel-at-www-2010.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6302767643323374550'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/6302767643323374550'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/04/search-is-dead-panel-at-www-2010.html' title='Search is Dead! Long Live Search Panel at WWW 2010'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-687378101305061150</id><published>2010-04-28T15:12:00.008-04:00</published><updated>2010-04-28T15:59:42.195-04:00</updated><title type='text'>Yahoo! Expands Restaurant Vertical with Menu information</title><content type='html'>&lt;div&gt;For those of you who know me, you know that I love to cook, and to eat.  So I was excited when today the Yahoo! Search blog &lt;a href="http://www.ysearchblog.com/2010/04/28/get-the-dish-that-you-wish/"&gt;highlighted the addition of a new feature&lt;/a&gt; that allows you to search for a specific dish at local restaurants.  
The post makes vague reference to information extraction from menus: &lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"   style="  color: rgb(85, 85, 85); line-height: 18px; font-family:'lucinda grande', 'trebuchet ms', verdana, sans-serif;font-size:14px;"&gt;&lt;blockquote&gt;By extracting structured content – in this case, menu items – from unstructured web pages and matching them to restaurant entities, Yahoo! Search can return results of restaurants near you that serve the dish you crave for when you enter the name of the dish in the search box. You can also try this experience in Yahoo! Local.&lt;/blockquote&gt;&lt;/span&gt;&lt;div&gt;I tried a search for &lt;a href="http://search.yahoo.com/search?p=roasted+chicken+san+francisco"&gt;roasted chicken, san francisco&lt;/a&gt;.  I expected Zuni Cafe and their world famous &lt;a href="http://www.zunicafe.com/pdfs/zuni_dinner_menu.pdf"&gt;roasted chicken with bread salad&lt;/a&gt;.  However, I was disappointed: the vertical did not trigger.  Giving it another shot, I tried &lt;a href="http://search.yahoo.com/search?p=burger+northampton,+ma"&gt;burger northampton, ma&lt;/a&gt;.  My favorite burger joint, Local Burger, is second on the list, right after Burger King. Not horrible, but not great either.  The local ranking could be improved.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I love the idea, but it still needs work.  First, I usually don't think of a specific dish when I'm picking a restaurant, unless it's pizza or a burger, and it's pretty obvious in these cases.  What I would really love to see are search options that let me find restaurants with menus that cater to specialty diets.  For example, find me a restaurant with options that are gluten-free, low sodium, kosher, dairy-free, etc...  This is important because like many other people I know, my mom has celiac disease and other dietary restrictions.  
&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/18315968-687378101305061150?l=www.searchenginecaffe.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://www.searchenginecaffe.com/feeds/687378101305061150/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.searchenginecaffe.com/2010/04/yahoo-expands-restaurant-vertical-with.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/687378101305061150'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/18315968/posts/default/687378101305061150'/><link rel='alternate' type='text/html' href='http://www.searchenginecaffe.com/2010/04/yahoo-expands-restaurant-vertical-with.html' title='Yahoo! Expands Restaurant Vertical with Menu information'/><author><name>jeff.dalton</name><uri>http://www.blogger.com/profile/12887721174386884522</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='31' src='http://1.bp.blogspot.com/-BQPIreWshSg/Tf-6pG_XoCI/AAAAAAAAACs/0kJUPQH9tQI/s220/tw-32-sm.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-18315968.post-414538397170427690</id><published>2010-04-28T10:26:00.005-04:00</published><updated>2010-04-28T13:00:30.552-04:00</updated><title type='text'>Vint Cerf WWW 2010 Keynote</title><content type='html'>Here are the notes that &lt;a href="http://ciir.cs.umass.edu/~bemike/"&gt;Michael&lt;/a&gt; sent me  on the Vint Cerf keynote address at WWW 2010.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style=" ;font-family:Calibri;"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;span class="Apple-style-span" style="font-weight: normal; 
"&gt;&lt;b&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;“Everything is Connected” by &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;Vint Cerf &lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;p&gt;&lt;span class="Apple-style-span" style="font-family: Calibri; "&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Note: the slides from the talk are &lt;/span&gt;&lt;a href="http://analytics.ncsu.edu/reports/www/www2010-cerf.pdf"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;available online&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt; and you can watch a video of the talk via &lt;/span&gt;&lt;a href="http://www.ustream.tv/recorded/6503627"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Wayne Sutton's livestream&lt;/span&gt;&lt;/a&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;.&lt;/span&gt;&lt;/b&gt;&lt;/span&gt;&lt;/p&gt;&lt;ul type="DISC"&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;“I’m the guy behind the underlying plumbing, not the applications. 
So this talk is going to be about the plumbing not the applications built upon this plumbing”&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Internet is a network of autonomous, independent systems&lt;br /&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Internet Statistics&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;1.8B people (26% of the world population)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;4.2B mobiles and 1.3B PC’s&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Asia 770M (20% penetration)&lt;br /&gt;(Half of users in China)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Europe 425M (53% penetration)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;N. 
America 260M (76% penetration)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Rapid drop in available IPs (IPv4 will run out sometime in 2012)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt; &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Major Near Term Changes&lt;br /&gt;(Nothing too surprising)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Introducing IPv6&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Digitally Signed Address&lt;br /&gt;Registration to prevent fraud&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Sensor Networks&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Smart Grid – Appliances on the Net&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Mobiles&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Cloud Computing&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Social Networks&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;li&gt;&lt;span 
style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Mobility&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Persistent state, disrupted connectivity (transactions mode)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Multiple types of networks (Wifi, 3G, 4G)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;New sensory inputs from the mobiles: sound, speech, video --- Everyone can report almost everything&lt;br /&gt;in real time&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Beyond text search (Mainly&lt;br /&gt;Google applications)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Image Search --- Google&lt;br /&gt;Goggles&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Speech recognition --- Easier&lt;br /&gt;for some tasks&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Gestures controlling the&lt;br /&gt;device (Patti Maes – see a TED talk)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Semantic Web&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul 
type="DISC"&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Still, a lot of dark information&lt;br /&gt;  in the web&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Web “publishing” ---&lt;br /&gt;  not just making the raw data available, making it available for use&lt;br /&gt;  and consumption by some other applications&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Semantic “printing”&lt;br /&gt;  --- Information Representations&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;      &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Creating Persistent Object&lt;br /&gt;  Identifiers&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;        &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;What is an object?&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;        &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Uniqueness of the object&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;        &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Interpretation “ “ “&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;        &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Authenticity “ “ “&lt;br /&gt;    --- Digital Signatures (and supporting laws)&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;/ul&gt;  &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Security Issues – both&lt;br /&gt;system and user 
issues&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Spam&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Viruses/Trojans&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Re-use of (poor) passwords&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Social Engineering – phishing,&lt;br /&gt;deceptive emails&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Human Errors (how to detect bad configurations) – incident of marking every website as malware&lt;br /&gt;in Google search for 15 minutes&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;StopBadWare.org organization&lt;br /&gt;– non-profit organization that detects sites that carry malware.&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Privacy&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Lax user behavior&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Weak protection of personal&lt;br /&gt;data by businesses and 
government&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Invasive devices: every&lt;br /&gt;mobile device has a potential for privacy invasion&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;/ul&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;New Technologies&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;ul type="DISC"&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="font-size:small;"&gt;Flow routers&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;    &lt;li&gt;&lt;span style="font-family:Calibri;"&gt;&lt;span class="Apple-style-span"  style="fo
