Interviewing programmers: Analysis taster (work in progress)

Groxx · on June 4, 2010

I'd be interested in an analysis of the compiled assembly for those 5 versions in 2b. A quick test on my end shows them all to be extremely similar.

I know little assembly, and am not going to attempt to pick it apart, so I can't attest to any speed difference. But as an example the first and the last compile to the same, though the last has 2 fewer instructions and some -16s changed to -24s. The first is identical to the second. The third has a bit of shuffling, but also appears identical to the first. And the last two are identical to each other (well, obviously).

How much optimization are you looking for in your C-code vs your coder's productivity? If they can get the last ones in one shot, all well and good, but if not is refactoring worth anything in this case?

edit: with -O3 optimizations on gcc, they all appear to be literally identical, the only differences I can see being variable names and the order of functions (in code, not in call order) between the first three and the last two.

RiderOfGiraffes · on June 5, 2010

Indeed - these are issues that I will be discussing, and would discuss with a candidate once I had their code in hand. I say in 2b:

    Each of these changes is unnecessary and in some cases
    damage readability. Further, a good compiler will get
    most, if not all, of the efficiency for you without
    making these changes.

If changes damage readability and don't change the compiled code, don't make them!

Readability has to be balanced against efficiency, but too often people think something is more efficient when it isn't. As always, profile before "improving."

As I say, these are issues for discussion. Let me add at this point that some of the submitted versions, and some of my reference versions, are significantly different in both style and efficiency.

There is more to come.

Groxx · on June 5, 2010

I'll definitely follow along, it's been interesting. And I'd probably come up with something like the last one after a little thought, because it's pretty idiomatic of string copying.

I just tend to seriously dislike code refactoring instead of algorithm refactoring, and anything that advocates it. Typically, the compiler is far better than the coder, especially when you turn on optimizations, but the compiler (nearly) can't swap out your code for a better algorithmic approach. Thus, you'd be much better off with expansive, seemingly inefficient code from a smart programmer who knows the be(st|tter) algorithm(s).

dkersten · on June 4, 2010

Wow, I was surprised to see mine at the edge of the graph on its own. I wonder is that a good or bad thing..! :-)

cperciva · on June 4, 2010

My first two submissions aren't even on the graph. :-)

RiderOfGiraffes · on June 4, 2010

There's a surprise. I did say I'd excluded outliers.

dunstad · on June 4, 2010

Mine's also absent, though I suppose that's to be expected: I did it in Python.

RiderOfGiraffes · on June 4, 2010

Yours does have an unusual feature.

what · on June 4, 2010

Is there a full size image so I can read the labels and find my submission? :)

RiderOfGiraffes · on June 4, 2010

It wasn't intended for that purpose, but stand by ...

EDIT: OK, I've uploaded a larger version where the labels are readable, but I've had to change the layout so the nodes don't overlap. The lengths of the edges are now not always realted to the distance, although it's approximate.

http://www.solipsys.co.uk/Writings/x.dot.png

Note: Yours might still not be on it, as yours might be one of the outliers. You can email me your ID and I'll try to find time to send you some data. Or you can wait.

ydant · on June 4, 2010

Thanks - that's a bit better of a teaser. I'm happy to see mine isn't red, but it's pretty tightly clumped in the main group of solutions. That's not really surprising as it's not that far off from your reference solution, which seems to be the most straightforward way to go about it.

dpritchett · on June 4, 2010

Yeowch! Mine (g034.d) shows up as the second red dot from the left. I submitted a quick pseudocode answer without any research in an effort to mimic interview whiteboarding conditions.

I had no idea my solution would be automatically validated - I was happy enough to get an in-place O(n) solution. The complete lack of pointers on my part must be what separated me from the cluster of correct solutions at the center.

RiderOfGiraffes · on June 4, 2010

You use sizeof(test_string) which isn't what you want.

As an experiment I fixed that, but you had another problem as well which also caused it to fail.

what · on June 4, 2010

Thanks! I just wanted to see how similar my submission was to everyone else's and if it was a fail.

cperciva · on June 4, 2010

I see a "g902.d" on that graph -- is that a typo?

EDIT: and also a g900.d and a g901.d. Maybe not a typo after all.

RiderOfGiraffes · on June 4, 2010

The 900's are my reference solutions. I'm intending to create solutions with the various feature sets to try to create clumps around them. If there are suitable submissions, I'll just identify and use them instead. For example, g080.d is pretty central.

Agent101 · on June 4, 2010

Err. Never mind. Found what I was looking for in my inbox, buried. Thanks for the interesting experiment.

Edit: I suspect one of the clusters will be people like me who use array access syntax rather than pointer access.

xenophanes · on June 4, 2010

What proportion of the entries are on the graph?

RiderOfGiraffes · on June 6, 2010

Sorry, missed your question. About 75%.

nitrogen · on June 4, 2010

I'm quite familiar with C, but I don't see the difference between the while and for loops in the trivia box on article 2 of the series.

I would expect the while loop to compile something like this:

  while(condition)
  {
    // stuff
    i++;
  }

becomes

  loop:
    if(!condition)
      goto end;
    // stuff
    i++;
    goto loop;
  end:

I would expect the for loop to do this:

  for(;condition;i++) {
    // stuff
  }

becomes

  loop:
    if(!condition)
      goto end;
    // stuff
    i++;
    goto loop;
  end:

So what's actually going on?

RiderOfGiraffes · on June 5, 2010

Consider the effect of a "continue" statement in the "stuff" section.

nitrogen · on June 5, 2010

Ah, thanks. It makes perfect sense now.

cperciva · on June 4, 2010

How are you computing your similarity measure between routines?

RiderOfGiraffes · on June 4, 2010

I'm replacing recognised keywords with their initial letter, other symbols with "x", removing all spacing, retaining all punctuation, and then using a Levenshtein distance.

For example, this:

  void condense_by_removing(
      char *z_terminated ,
      char char_to_remove
      ) {
    char *p_read;
    for (p_read = z_terminated;*p_read;p_read++)
      if (*p_read != char_to_remove)
        *z_terminated++ = *p_read;

    *z_terminated = '\0';
  }

gets mapped to this:

  vx(c*x,cx){c*x;f(x=x;*x;x++)i(*x!=x)*x++=*x;*x='\0';}

I'm debating inserting braces around every block to assist with the similarity concept, but that's hard to do automatically without fully parsing the routine. The above "fingerprint" would then become this:

  vx(c*x,cx){c*x;f(x=x;*x;x++){i(*x!=x){*x++=*x;}}*x='\0';}

cperciva · on June 4, 2010

Did you consider tokenizing the inputs and comparing those? Based on Levenshtein distance alone you're basically saying that "++" and "!=" are twice as important as "=" or "*", which doesn't seem right to me.

Next question: How are you picking (x,y) coordinates for the graph? You've explained how you determine the connectivity, but the positioning is a bit unclear -- edges with the same score often have quite different lengths.

RiderOfGiraffes · on June 4, 2010

I did consider tokenising the inputs, and probably will. The only reason not to have done so yet was that this was a no-brainer in terms of getting something working just to see if produced something useful.

I'm using neato for the layout. Graph layout is hard, and in some cases unsolved. I'm using this for rough visualisation, then I'll write code to find true clusters.

cperciva · on June 4, 2010

I'm using neato for the layout.

Ok, so you're using all the pairwise distances for computing the layout, even though you're only showing the tree edges on the graph?

RiderOfGiraffes · on June 4, 2010

No, I'm generating a tree.

Put every node in its own component. Find the shortest edge that joins two components, emit that edge, merge the components. Lather, Rinse, Repeat.

Also, braces penalise you twice. Code that is identical except that one includes, the other excludes, a pair of braces are distance 2 apart. There is some reason to say they should have distance 0. Fully parenthesised code, and then ignore the close (or open) brace would fix that.

unwind · on June 4, 2010

This one is awesomely embarassing. I submitted a quick response, but it turned out to (probably) be quite pessimal. Heh.

The challenge was of the category that triggers my "oh, that's easy!" knee-jerk response, and I still think it's easy. It's just that my knee didn't jerk towards the solution shown by RiderOfGiraffes, which really feels like the best one. :)

I think I'll blame job stress for my performance when it comes to this one.

RiderOfGiraffes · on June 4, 2010

I wouldn't claim my solution is the best. In fact, I think it's definitely not the best. It is, however, a starting point for comparison.

Given your comment I'd be interested to know which is yours - email me your submission ID?

RiderOfGiraffes · on June 5, 2010

Earlier item in this series: http://news.ycombinator.com/item?id=1401117

ghfdrvtsyup · on June 4, 2010

whew! I'm glad I passed, though it should be obvious from my submission that my C is far from fluent! I am looking forward to the analysis - thank you for doing this RoG!

johkra · on June 4, 2010

I seem to have one of the outliers. Well, let's wait for the detailed analysis to see whether my test cases were good enough. ;-)