Tuesday, October 28, 2014

Rankings Everywhere

Like many (theoretical) computer scientists, I spent far, far too much time this weekend thinking about rankings.  Well, really, it was more that I was being far, far too amused by the various posts and comments on the novel ranking site "Ranking of CS Departments based on the Number of Papers in Theoretical Computer Science",  which led to posts at the Computational Complexity blog and in theory, the first of which shut down the comments rather quickly, and the second of which is at this time nearing the amazing 100 comment line.  (Luca's post deserves it -- it was the funniest thing I've seen all week.  Kudos.) 

Now, to fan these flames to further untold heights, I was informed that new US News and World Report Computer Science Rankings are out.  Needless to say, I disagree with them.  How exactly can Harvard be second to MIT?  Ridiculous.

(For those who are concerned about such things, the methodology is also given, though I haven't checked to see what parts of it are reproducible.  And what is that "Web Of Science" thing?  Is it as good as the citation tools in Google Scholar?)

For the ultimate in ranking tongue-in-cheek-iness, and for those who haven't seen it, I also recommend the NRC Deranker, which will (eventually) provide you the ranking you personally agree most with. 


Monday, October 27, 2014

Welcome, Attorneys!

You might wonder why I am welcoming attorneys to the blog.  Well, I happen to know that some attorneys read my blog.  I know this because recently, as an expert witness, while I was testifying in court, the attorney on the other side during my cross-examination specifically asked me about my blog.  Based on their questions, I am reasonably sure they had downloaded all the posts.  I imagine some poor associate attorney low on the totem pole had to read through it all to see if they could find anything embarrassing they could use in the context of the case. 

I don't think it was successful.  But who knows.  At one point, the attorney cross-examining me asked if one of the reasons I liked to do consulting work was because it pays well.  I answered, "I would say that's one of the reasons."  That wasn't a hard answer to give;  it's an honest answer.  But afterward I recalled that, years ago, I had talked about consulting on my blog.  I looked it up, and sure enough, here is what I wrote years ago:
I enjoy consulting, for several reasons:
  1. It fulfills a need to be more directly relevant to the real world.
  2. It gives me a chance to work with different sets of people.
  3. It allows me to learn new things and exercise skills that I don't necessarily use as a professor.
  4. It pays well.
Based on my experience as a witness, I am of very high certainty that the attorney had this post in front of him when he asked the question.  Of course, when he asked the question, he failed to note the other reasons I had given for consulting, or provide me the blog post for context.  But my assumption is that they were simply looking for a "gotcha" moment.  Perhaps the question and my response made me look bad to the jury.  (After subsequent clarifying questions on the issue from the client's attorney, I would like to believe it had no effect.)  I imagine that they were hoping that I would say that the pay wasn't important, in which case I am sure they would have readily put up the post from my blog to show the jury to try to discredit me.   

I talk very little about my legal consulting on the blog or publicly.  The main reason, obviously, is out of respect for and professional obligation to my clients.  But I find I don't blog about it even in very general terms, where confidentiality or other similar issues would not come into play.  And one good reason not to is that an attorney could use anything I write on the subject to try to impeach my credibility as a witness, even if it is out of context. 

Of course, as this example shows, anything I post online can potentially later be used against me in a case or in court.  Or in other situations.  These issues regarding our permanent digital identity, and how modern technology allows our past to easily be brought up and used against us, are not new;  I'm not suggesting I'm offering novel insight here.  Really, I'm just relaying an instance where the issue of my digital identity, in the context of this blog, hit me head on.  Certainly when I wrote that post years ago about consulting I did not think it would come up later in this way.  Mostly it makes me wonder how my children, who are busy exploring their own identities (digital and otherwise), will manage in a world where most of their history will be available online, for people to use or misuse in ways that I could not imagine.     

Friday, October 24, 2014

Harvard Junior Faculty Job Search

Harvard Computer Science is doing a junior faculty search this year.  We pretty much have been doing one every year, but it's nice to be able to make an announcement and point people to the ad.

So here is the ad, telling you what you need so send in if you're applying. 

Also as usual, we're looking in all areas, because we're always interested in great people.  However, in the interest of full disclosure, the focus this year is described as
This is a broad faculty search and we welcome outstanding applicants in all areas of computer science, including applicants whose research and interests connect to such areas as engineering, health and medicine, or the social sciences. Of particular interest are candidates with a focus on systems and data science, including operating systems, networking, software engineering, data storage and management, and computational science.
We'll look forward to your application.

Wednesday, October 22, 2014

Cuckoo Filters

An upcoming paper appearing at CoNext will be of interest to any Bloom Filter user or aficionado:

Cuckoo Filter:  Practically Better than Bloom
(Bin Fan, David Andersen, Michael Kaminsky, Michael Mitzenmacher)

Let me describe a cuckoo filter and some of what's in the paper for you.  If you want to avoid a technical discussion, all you need to know is that for reasonably large sized sets, for the same false positive rate as a corresponding Bloom filter, cuckoo filters use less space than Bloom filters, are faster on lookups (but slower on insertions/to construct), and amazingly also allow deletions of keys (which Bloom filters cannot do).  If you want to look at code, there's even a github repository for you with code for cuckoo filters. 

Note:  I'll assume you probably know what a Bloom filter is in the discussion that follows.  As a reminder, it's a hash-based data structure for set membership that can give false positives;  by allowing false positives, you can save space in your representation of the set.  I'll assume you probably know what cuckoo hashing is.  As a reminder, in cuckoo hashing, you have buckets that can contain a fixed number of keys, and each key gets some number d of buckets (obtained by getting d values from hashing the key) where it can be placed.  When you insert a new key, if all of its buckets are full, you can move keys out of that bucket to one of its other buckets to make room.  Since apparently there are drinking games that exist based on taking a drink whenever I use the word "Bloom filter" or "cuckoo hashing" in a talk, I imagine most readers of the blog know about these things.  (Or, always, Wikipedia.)

The framework used in the paper comes from the following idea.  One way of implementing Bloom filter functionality when given a static set of items is to use a perfect hash function;  that is, as hash function that maps the n elements of your set into an array of size n in a 1-1 fashion.  Then, you use another hash function to store a fingerprint of the element instead of the element itself in the array location.  To do a set membership test on an element z, you hash z once to find its location in the array, and hash z again to find its fingerprint, and check against the fingerprint in the array location.  For elements not in the set, you get a false positive rate of 2^{# of fingerprint bits}.

The problem is that the perfect hashing framework doesn't allow you to insert new items naturally.  Or do deletions (but standard Bloom filters don't allow deletions also -- you need a counting Bloom filter for that).  So a natural thing to do is to think of using some other sort of hash table, besides a perfect hash table, and see what happens.  This idea was explored way back in for instance some papers I worked on in 2006 (here and here) using "d-left" hash tables, which are based on a multiple choice hashing scheme.

If you're going to use multiple choice hashing schemes, though, you should think about using cuckoo hashing.  The ability to move keys around means you should get better space utilization;  for example, even with 2 choices, if your buckets can hold 4 items, cuckoo hashing can get you about 95% space utilization.  The problem with cuckoo hashing in this setting is that, for a Bloom filter, you want to just keep fingerprints of keys, not the keys themselves.  So, when you want to move the key, how do you figure out where to move it to -- you no longer have the key to hash?

The approach used in the paper is a trick called partial-key cuckoo hashing.  To get two buckets for our key, we do a hash to get the first bucket (idealized to be uniformly at random), but to get the second bucket, we do not hash the key, but just the fingerprint of the key;  we XOR that value with the value for the first bucket to get the second.  The upside of this approach is given a bucket and a fingerprint in the bucket we can find the other bucket corresponding to the fingerprint.  (XORing the hash of the fingerprint works if the fingerprint is in the first bucket or the second, as is easily checked.)  The downside is if our fingerprint has F bits, that means our second bucket is really only chosen from 2^F possibilities;  it is not uniform conditioned on the first.

Because of this limitation, one interesting aspect of the paper is that we show, in theory, this approach just will not work.  Specifically, there is an Omega(log n) lower bound on the size of the fingerprint that must be stored; this means the cuckoo filter is super-linear in space, unlike standard Bloom filters which are linear in space.  This lower bound is easy to derive.  Suppose that you have buckets that can hold b keys.  If 2b+1 keys have the same pair of buckets, then there is trouble;  you can't fit 2b+1 keys into 2b bucket slots.  So the question is whether there are collections of 2b+1 keys that have the same pair of buckets.  With standard cuckoo hashing, this would be a low probability event.  With partial-key cuckoo hashing, the probability depends on the size of the fingerprint, since this determines how many possible second choices of a bucket there are for each key (given their first random choice of a bucket).  If you have too few choices, you will get too many collisions, or too many keys in a pair of buckets.  Some straightforward calculations yield the Omega(log n) lower bound on the size of the fingerprint.

So now that we've proven theoretically that this won't work, how could it work in practice?  If you go back to those calculations, you find that the lower bound has some nice constant factors.  The bounds you get on the fingerprint size are actually (roughly speaking) log n/(2b).  That constant factor in the denominator, even when b=4, covers a large range of n.  In practice, fingerprints can be quite a reasonable size, and still avoid the potential bottleneck of too large a group of keys all hashing to the same small number of buckets. 

In any case, this may be the first paper I've written that contains a proof that the described data structure fails in theory but works in practice.  (The other way around I generally find much easier.) 

Big open theoretical question:  try to analyze why partial-key cuckoo hashing gets roughly the same load threshold as regular cuckoo hashing.  (Anyone want to work on that?  Let me know...)

This post is long enough;  there's a lot more in the paper, including comparisons with other schemes, bit-saving tricks, and lots of performance results.  Have a look.  

Postscript/Tangent:  In a recent review (for a different paper, with a different theme) one reviewer made the following comment/criticism:  "Despite years of research, no one has found a variant of cuckoo hashing that can outperform simpler methods in realistic situations."  To which I say, hunh?  Yes, I understand linear probing is wonderful in practice due to cache effects and block sizes and in many and perhaps even most situations can yield better performance than cuckoo hash tables.  But certainly not everywhere.  Cuckoo hash tables certainly seem at least potentially better for many hardware settings, and for at least some important applications -- such as this one -- cuckoo hashing seems to be a clear win.  (The issue here is the number of fingerprints you have to deal with on a lookup affects the false positive probability -- so even if linear probing lookups are "faster" due to cache effects and block sizes, if you try to get this sort of space utilization you get larger contiguous blocks which increase your false positives.)

My co-author David Andersen also violently disagrees with this review comment, and points to these recent papers at NSDI and Eurosys.  

Perhaps I'm missing something, but my thought was this was a sign of a bad review (both in content, and in its overall rudeness).  I felt obliged to try to correct the record (and I can only hope the reviewer sees this post).     
 

Friday, October 03, 2014

Andreessen-Horowitz Again

Andreessen Horowitz had a second Academic Roundtable, gathering together academics, VCs, and people in startups to talk for a few days about a variety of topics.  I wasn't sure why I was invited the first time (which I wrote about last year), and am even less clear on why I was invited back, but I again had a very good time. 

Before a few words about the talks, the high order point:  I'm told that most/all of the talks will be put online over the coming weeks.  So if you're interested, you can experience them yourself.  I think it's great that they're doing that this year;  if you like the talks, you should tell them, so they keep doing it.  (If they don't have comments there, you can always comment here.)  Sadly, they don't seem to be up yet, but I'll post the links when they become available. 

Highlights would include a discussion of Bitcoin -- interesting to hear what Ed Felten, well-known Princeton professor and now ex-Chief Technologist of the FTC, thinks about the Bitcoin economy.  Dawn Song of Berkeley gave a general talk on security issues of the present and future, while Dan Boneh of Stanford gave a talk on the power of program obfuscation.  Raj Rajikumar of CMU gave some history and some peeks into the future of driverless cars -- it's not just Google, you know.  Tuomas Sandholm of CMU talked about his take on the starting of startups while still being an academic (based on now-multiple experiences), and Josh Bloom of UC Berkeley (and wise.io) described the differences between writing papers about machine learning and building products using machine learning.

Of course, some heated discussion about the variety of issues that arise between academic work and transfer of technology to startups inevitably ensued.  (Maybe that's why I get invited.)  The key idea repeated by many (on both sides of the fence) in various forms was that businesses (and in particular startups) are very busy going down their road of development and product, and they may see many things out the sides of that road that are very interesting, but don't have time to explore off the road.  Academics are interested in things way off the road, often thinking of issues much further out in time-scale.  And (at least in my opinion) the role academics play is a good thing;  there (obviously) remains a lot of ways the two worlds can interact and cooperate.