Most-mentioned links in Hacker News comments, 2006–2015 (github.com/antontarasenko)
254 points by simonebrunozzi on Nov 12, 2020 | 68 comments



A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually!

The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today).

However, I'm not 100% sure the query is correct for unilaterally getting all links, as running the query on the full dataset returns the same results as running it from 2006-2015. And I value my sanity enough to not fuss around with the regex.
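For anyone who wants to sanity-check the counting on a local sample, the general idea (pull URLs out of comment text, then tally them) can be sketched in a few lines. To be clear, this is not the query from the linked repo: the regex, the `count_links` helper, and the sample comments are all assumptions made up for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: a scheme followed by anything that is not
# whitespace, an angle bracket, or a double quote.
URL_RE = re.compile(r"https?://[^\s<>\"]+")

def count_links(comments):
    """Tally every URL found in an iterable of comment strings."""
    counts = Counter()
    for text in comments:
        for url in URL_RE.findall(text):
            # Drop trailing punctuation that tends to stick to URLs in prose.
            counts[url.rstrip(".,;:!?)")] += 1
    return counts

# Made-up sample comments, not real HN data.
sample = [
    "Relevant: https://xkcd.com/927/",
    "See https://xkcd.com/927/, and also http://www.paulgraham.com/disagree.html.",
]
top = count_links(sample).most_common(2)
```

A loose pattern like this one will still over- or under-match on real comment HTML, which is probably the kind of fussing the parent wanted to avoid.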


What is the best way to download this dataset? Last time I messed with it I had to pay for a Google Cloud bucket and run through some awkward sequence of steps to eventually get a local copy.

I think I ended up following the advice here: https://stackoverflow.com/questions/18493533/how-to-download....


That's essentially it (export the BQ table as a CSV to a Google Cloud Storage bucket, then download it from there), but you can do that entirely in the web UI, no CLI needed.

If you just want a subset of the data, run a query, then save the query as a table in your project and export from there.


Does it cost money to do that or it is only time consuming?


It’s not free, but you pay only for storage and egress, which are trivial.

Running the query in BQ is free up to 1 TB, and the 16,000-row download is free, so I recommend that if necessary.


I think an alternative option could be to have a torrent (or other file sharing mechanism) with the public HN information. Am I missing something? 16k rows seems very tiny for doing an analysis of HN.


The vast majority of potential HN analysis can be done in SQL alone within BigQuery, which is the proper way to handle a data warehouse regardless.

You mostly need the raw data for ML model training which is a very niche use case.


You can read it directly into a pandas DataFrame. [1]

[1] https://pandas.pydata.org/pandas-docs/stable/reference/api/p...


This pandas trick is for small result datasets, in which case the 16,000 row limit from BigQuery more than satisfies it.

I would not recommend using it for the full many-million-row dataset.


Oh no way, thank you!


Funny that searchyc.com was so necessary for so long, coming in at #54 and #68 (it seems it should be higher as these should be combined). Now it just redirects to a spam/ad website, but before HN had a search bar it was very useful.

Also interesting that it contains at #55: https://news.ycombinator.com/best

But not the ostensibly more useful: https://news.ycombinator.com/active

The URL "u.ly/73I" at #63 is very interesting: it's seen almost nowhere else on the web (at least on Google) and is apparently spam now. When you click on its mentions, you get:

> We found no comments matching u.ly/73I

What's the deal with that one? Was it spam comments that all got removed? It may be since some of the others were spam, like this one: https://goo.gl/l5v0b

It's impressive how little spam spam (as opposed to submarines; this is where I link to PG's essay) is on HN.


The u.ly link seems to have been spam for shoes. The wayback machine captured it: https://web.archive.org/web/20110319115346/http://u.ly/73I


HN has a search bar? *scrolls down* Oh, wow! I always use hn.algolia.com.


That bar uses hn.algolia.com for search as well.


I’m pretty sure the search bar just redirects to that.


The HN guidelines deserve to be #1, not just in this list, but for the whole internet.

If conversation was as civil elsewhere as it is on HN, Americans might rediscover the value of community which they have lost to mindless bickering encouraged by commercial algorithms elsewhere.


I very much appreciate the urbane nature of HN, and strive to contribute in a way that brings light; not darkness.

My digital community experience harkens from the USENET days, which made the worst pissing matches on Faecesbook look like polite disagreements between scholars.

A day or two ago, someone made a real mild slap at me (I can come across as a bit tiresome, if you haven’t noticed. Take my word for it; it’s preferable to my USENET persona), using a very tired old troll technique, and someone else flagged it.

I was actually surprised it was flagged, but it does show that people take civility seriously, hereabouts.


Since we’re on the topic of civil conversation, I’m going to be a bit picky:

“American” is a generalization. Accusing them is not productive to discussion.


I was thinking of American social media companies with exploitative engagement algorithms that foment adversarial discourse. You are right if you are saying the users are all over the world, not just America.

Is that what you mean? Or is the term “American” problematic in some other way?


I thought you were referring to American users. Accusing American companies is ok.



I'm quite surprised #3 has only been mentioned 197 times (well, a few more after this thread) in 14 years.


Linked 197 times, possibly referenced more than that; plus it’s a neatly exaggerated example of a joke that I’d imagine has been made many more times again:

“I have a problem I’m trying to fix with X.”

“Ah, so now you have 2 problems.”


I find it very hopeful that that particular one has been linked to so many times, maybe it's sinking in? :)


Surprised that 552 is not among these. One of my favorites:

https://xkcd.com/552/


My personal experience is that posting this leads to an instant surge of downvotes.


The semi-inevitable downvotes may be because xkcd links are seen as cheap canned injects that although relevant and often funny don't really add anything novel to a debate. And perhaps - the horror! - they might not actually always be correct in a specific case.

Disclaimer: I really love xkcd.


Probably because someone posts "correlation does not imply causation" as a rote "I am very smart" reply to any published study, regardless of the actual contents and claims of the study.


#11 literally changed my life. I used to debate people online and get irrationally upset when I couldn't change their opinions. Reading that xkcd was like getting a dope slap. I still debate but I rarely let myself get upset and if I do I try to hold that in my mind.


I'm honestly surprised that the Wisdom of the Ancients xkcd [0] isn't on the list...

[0] https://xkcd.com/979/


The only thing worse than that is when they come back and say, "Oh nevermind, I figured it out." and don't post the fix.


Surprised 2347, Dependency [1], didn't make the list. But it's pretty recent.

[1] https://xkcd.com/2347/


#3 makes sense (I believe I've posted it in a few comments) because HN has a lot of news on new standards


Some XKCD comics have been translated to Chinese:

https://github.com/stevenliuyi/xkcd-cn/tree/master/pics

I've combined them side-by-side with the English:

https://mega.nz/folder/modDRIYB#uxRKw78Fcv6mpclovOTRbw

I'd also like to transcribe them for use with Pingtype, but didn't get around to that yet.


936 is my personal favorite.


My understanding is that diceware is no longer considered secure though.

Aren't there wordlists that just consist of combining dictionary words, rendering diceware more dangerous than a password manager?


There have always been dictionary attacks.

You're judging it from the wrong position. Diceware is an improvement upon the usual kinds of passwords people feel forced to come up with, and then re-use everywhere.

A password manager is better than diceware, and whilst it feels like less friction to people who use them, people can be unwilling to use one.

Just rejecting passwords found in HIBP's stolen passwords list had me receive death threats and long rants about how password security isn't my responsibility and I was putting up too much friction.


Ah, that makes sense. Thanks for clarifying for my slow self.

Edit: Also

> rejecting passwords found in HIBP's stolen passwords list had me receive death threats and long rants about how password security isn't my responsibility

That is amazing in a horrifying sort of way. I'm sorry you went through that.


The comic calculates entropy with the assumption that the attacker knows that your password is four common words. Even if they run a dictionary attack on your password, it will still be secure.

Diceware passwords are still recommended by security experts over any other method.

https://www.eff.org/dice
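The arithmetic behind that can be sketched as follows, assuming every word is drawn uniformly at random and independently. The 2048-word list corresponds to the comic's 11 bits per word, and 7776 is the size of a five-dice Diceware list (6^5); the function name is just something I made up.

```python
import math

def passphrase_entropy_bits(wordlist_size, num_words):
    """Entropy in bits of num_words words drawn uniformly from the list.

    Assumes the attacker already knows the list and the number of words.
    """
    return num_words * math.log2(wordlist_size)

# The comic's setup: four words at 11 bits each.
comic_bits = passphrase_entropy_bits(2048, 4)

# Six words from a 7776-word Diceware list, about 12.9 bits per word.
diceware_bits = passphrase_entropy_bits(7776, 6)
```

The point is that the figure already assumes the attacker has the wordlist, so a wordlist-based attack is the case being measured, not an extra weakness.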


#78 - xkcd 810 broke my brain. Nice!

Will add to my list of things I’d look into if I had more time.


Haha, just #927, nothing else. #927 XKCD receipt about everything


I thought we'd see #91 (1053/"Ten Thousand") but it's lame having a 4-digit XKCD in the top 100


"The submarine" was nice for its time. The reality is today astroturfing is very much a real thing especially as print media has died and the same types of firms find ways to hawk their wares on the internet.

pg also had the right idea, but you could, if you desire, sit further back and ask how someone's personal self-interests affect what they write (or nowadays, post or whatever) about, and how that can affect the sincerity of what they are writing.


the submarine article was originally about me (but was wrong, we hadn’t hired PR)


Well, maybe the target was wrong but the idea was still quite good.


browsing some of these, i'm reminded of "Cool URIs don't change": https://www.w3.org/Provider/Style/URI


If I ever needed to summarize Hacker News into a single list, this would absolutely be it.


Was kinda surprised to see https://circleci.com at #37. When I clicked through, turns out it was me repeatedly reminding anyone who'd listen that we existed.


Is it possible to do this for 2015 until now as well?


Very cool! It's already useful, but you could make it even more so by enabling sorting by smaller time ranges (e.g. 1 year).

It would also be interesting to see a version of this list weighted by karma scores of users who posted the links.

Edit: even better, use your own h-index ranking https://github.com/antontarasenko/smq/blob/master/reports/ha...


Looks like someone made this 5 years ago from a data dump, no updates to the repo in 5 years.


Nice idea! Note the scraper seems to have produced a duplicate entry:

  1. en.wikipedia.org/wiki/Betteridges_Law_of_Headlines
  2. en.wikipedia.org/wiki/Betteridge%2527s_law_of_headlines
The second is a bad link, so I am curious how that link got shared so much.


Interesting, looks like the wikipedia title spells it Betteridge's with an apostrophe. %27 is the url encoding for the apostrophe and %25 is the url encoding for the %. So it looks like the link somehow got url encoded twice before it was shared resulting in the apostrophe becoming %27 and then the %27 becoming %2527 and producing a bad link.

Some common method of people finding and sharing that url must have had an issue where it incorrectly encoded the url twice. I wonder what it was :) .

Or perhaps the scraper has a bug where it is double url encoding the en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines link?
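The double-encoding theory is easy to reproduce with Python's standard library. This only demonstrates the mechanism; it says nothing about which tool (a sharing widget or the scraper) actually did it.

```python
from urllib.parse import quote, unquote

title = "Betteridge's_law_of_headlines"

# First pass: the apostrophe becomes %27.
once = quote(title)

# Second pass: the % in %27 is itself encoded as %25, giving %2527.
twice = quote(once)

# Decoding twice recovers the original title.
recovered = unquote(unquote(twice))
```

A server that decodes only once sees a literal "%27" in the title instead of an apostrophe, which would explain why the second link is bad.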


I watched https://www.youtube.com/watch?v=dBnniua6-oM a few days ago and was surprised to see it is the #1 YouTube link. It does have 11M views, however, and it is a great talk.


Had missed this Paul Graham one: http://www.paulgraham.com/disagree.html but it's really good.



Can someone explain why Paul Graham? I read a few of those links and don’t get it.


He is the founder of Y Combinator.


I’m so happy to see XKCD 927, because it was my immediate first thought. It is one of the few XKCD comics I know by number, and while witty I do feel it gets misused in thought-terminating ways sometimes. Still, I feel validated in my knee-jerk reaction that it would be the top XKCD comic on the list.


What about 2015 - now?


Very cool! Inspired me to get the most upvoted XKCD comics on Reddit for 2019.

https://gist.github.com/davidgasquez/3aeaac54c5a61216ffc8f7d...


Impressive that the 17th most upvoted XKCD comic isn't even an XKCD comic!


Nice catch! I used the first regex I could find on SO.
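For what it's worth, a stricter pattern could look something like the sketch below. I don't know what regex the gist actually uses or which link slipped through, so the pattern and the `xkcd_comic_id` helper are purely hypothetical: accept only xkcd.com (optionally with www. or m.) followed by a numeric comic ID.

```python
import re

# Hypothetical pattern: anchored at both ends so that other subdomains
# and non-numeric paths are rejected.
XKCD_RE = re.compile(r"^https?://(?:www\.|m\.)?xkcd\.com/(\d+)/?$")

def xkcd_comic_id(url):
    """Return the comic number as an int, or None if url is not a comic link."""
    match = XKCD_RE.match(url)
    return int(match.group(1)) if match else None

comic = xkcd_comic_id("https://xkcd.com/927/")
not_a_comic = xkcd_comic_id("https://what-if.xkcd.com/1/")
```

Anchoring the pattern is the main design choice here: an unanchored `xkcd\.com` will happily match image hosts, subdomains, and lookalike domains.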


This is gold!


It amuses me that the Dunning–Kruger effect shows up here. It has become such a cliche for people to reference it.


I link this[1] every time I see it mentioned, because the standard depiction of it bears no resemblance to what's in their paper.

[1]: https://www.talyarkoni.org/blog/2010/07/07/what-the-dunning-...


I'm one of today's 10,000.

Thanks!


It seems like most people greatly overestimate their understanding of the Dunning-Kruger effect.



