
I'm replacing recognised keywords with their initial letter and every other symbol with "x", removing all spacing, retaining all punctuation, and then taking the Levenshtein distance between the results.

For example, this:

  void condense_by_removing(
      char *z_terminated ,
      char char_to_remove
      ) {
    char *p_read;
    for (p_read = z_terminated;*p_read;p_read++)
      if (*p_read != char_to_remove)
        *z_terminated++ = *p_read;

    *z_terminated = '\0';
  }
gets mapped to this:

  vx(c*x,cx){c*x;f(x=x;*x;x++)i(*x!=x)*x++=*x;*x='\0';}
I'm debating inserting braces around every block to assist with the similarity concept, but that's hard to do automatically without fully parsing the routine. The above "fingerprint" would then become this:

  vx(c*x,cx){c*x;f(x=x;*x;x++){i(*x!=x){*x++=*x;}}*x='\0';}
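The mapping described above could be sketched roughly as follows in Python. This is not the poster's code: the name `fingerprint`, the keyword list, and the decision to keep literals and numbers verbatim are all assumptions, and comments and preprocessor lines are not handled.

```python
import re

# Assumed keyword list (C89 keywords); the poster's actual list is not given.
C_KEYWORDS = {
    "auto", "break", "case", "char", "const", "continue", "default", "do",
    "double", "else", "enum", "extern", "float", "for", "goto", "if", "int",
    "long", "register", "return", "short", "signed", "sizeof", "static",
    "struct", "switch", "typedef", "union", "unsigned", "void", "volatile",
    "while",
}

_TOKEN = re.compile(r"""
    '(?:\\.|[^'\\])*'     # character literal, kept verbatim (assumption)
  | "(?:\\.|[^"\\])*"     # string literal, kept verbatim (assumption)
  | \d\w*                 # number, kept verbatim (assumption)
  | [A-Za-z_]\w*          # keyword or identifier
  | \s+                   # whitespace, removed
""", re.VERBOSE)

def fingerprint(source):
    """Keywords -> initial letter, identifiers -> 'x', spacing removed."""
    def repl(match):
        tok = match.group(0)
        if tok.isspace():
            return ""
        if tok[0] in "'\"" or tok[0].isdigit():
            return tok
        return tok[0] if tok in C_KEYWORDS else "x"
    # Anything the pattern does not match (punctuation) passes through as-is.
    return _TOKEN.sub(repl, source)

SAMPLE = r"""
void condense_by_removing(
    char *z_terminated ,
    char char_to_remove
    ) {
  char *p_read;
  for (p_read = z_terminated;*p_read;p_read++)
    if (*p_read != char_to_remove)
      *z_terminated++ = *p_read;

  *z_terminated = '\0';
}
"""

print(fingerprint(SAMPLE))
```

Note that keywords sharing an initial (for example `char`, `const` and `case`) collapse to the same letter, which follows from the scheme as described.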



Did you consider tokenizing the inputs and comparing those? Based on Levenshtein distance alone you're basically saying that "++" and "!=" are twice as important as "=" or "*", which doesn't seem right to me.
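To make the weighting point concrete, here is a small Python sketch comparing character-level and token-level Levenshtein distance on fingerprint fragments. The `tokens` helper and its operator list are assumptions for illustration, not anything from the original tool.

```python
import re

def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution or match
        prev = cur
    return prev[-1]

# Hypothetical tokeniser: character literals and common two-character
# operators become single tokens, everything else is one character each.
_TOKEN = re.compile(r"'(?:\\.|[^'\\])*'|\+\+|--|->|<<|>>|&&|\|\||[!=<>]=|.")

def tokens(fp):
    return _TOKEN.findall(fp)

# Swapping one two-character operator for another:
char_cost_2 = levenshtein("x++;", "x--;")                    # per character
token_cost_2 = levenshtein(tokens("x++;"), tokens("x--;"))   # per token

# Swapping one one-character operator for another:
char_cost_1 = levenshtein("x*x", "x/x")
token_cost_1 = levenshtein(tokens("x*x"), tokens("x/x"))

print(char_cost_2, token_cost_2, char_cost_1, token_cost_1)
```

At the character level the `++` to `--` change costs twice as much as the `*` to `/` change; at the token level both cost one edit.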

Next question: How are you picking (x,y) coordinates for the graph? You've explained how you determine the connectivity, but the positioning is a bit unclear -- edges with the same score often have quite different lengths.


I did consider tokenising the inputs, and probably will. The only reason I haven't done so yet is that this was a no-brainer in terms of getting something working, just to see if it produced something useful.

I'm using neato for the layout. Graph layout is hard, and in some cases unsolved. I'm using this for rough visualisation, then I'll write code to find true clusters.


> I'm using neato for the layout.

Ok, so you're using all the pairwise distances for computing the layout, even though you're only showing the tree edges on the graph?


No, I'm generating a tree.

Put every node in its own component. Find the shortest edge that joins two components, emit that edge, merge the components. Lather, Rinse, Repeat.
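The procedure described here is the same idea as Kruskal's minimum spanning tree algorithm. A minimal Python sketch, assuming nodes are numbered `0..n-1` and a `dist(i, j)` function is supplied by the caller (both are illustrative choices, not details from the thread):

```python
def spanning_tree(n, dist):
    """Return tree edges (i, j, d) in the order they are emitted."""
    parent = list(range(n))   # every node starts in its own component

    def find(i):
        # Follow parent links to the component's representative,
        # shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted((dist(i, j), i, j)
                   for i in range(n) for j in range(i + 1, n))

    tree = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # edge joins two different components
            parent[ri] = rj       # merge them
            tree.append((i, j, d))
    return tree

# Toy usage: four points on a line, distance = absolute difference.
points = [0, 1, 3, 7]
tree = spanning_tree(len(points), lambda i, j: abs(points[i] - points[j]))
print(tree)
```

Sorting every pair is quadratic in the number of routines, which is probably fine for a rough visualisation but may need rethinking for large code bases.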

Also, braces penalise you twice. Two pieces of code that are identical except that one includes a pair of braces and the other omits it are distance 2 apart. There is some reason to say they should have distance 0. Fully bracing the code and then ignoring the close (or open) braces would fix that.
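The double penalty can be checked on the two fingerprints quoted earlier in the thread, which differ by exactly two pairs of braces. This Python sketch only measures the effect of ignoring close braces; it does not attempt the full bracing step, which would need a parser. The variable names are illustrative.

```python
def levenshtein(a, b):
    """Character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,
                           cur[j - 1] + 1,
                           prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

# The two fingerprints from the thread: without and with optional braces.
unbraced = r"vx(c*x,cx){c*x;f(x=x;*x;x++)i(*x!=x)*x++=*x;*x='\0';}"
braced = r"vx(c*x,cx){c*x;f(x=x;*x;x++){i(*x!=x){*x++=*x;}}*x='\0';}"

# Two extra pairs of braces, each pair counted twice.
with_close = levenshtein(unbraced, braced)

# Ignoring close braces halves the penalty for each pair.
without_close = levenshtein(unbraced.replace("}", ""),
                            braced.replace("}", ""))

print(with_close, without_close)
```

Getting the distance all the way down to 0 would still require inserting the optional braces into both routines before fingerprinting.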



