A reminder that BigQuery (as used in the query in this link) is the best way to play with Hacker News data; don't scrape HN data manually!
The `bigquery-public-data.hacker_news.full` table appears to be up to date with the most recent HN data as well (table last updated today).
However, I'm not 100% sure the query is correct for reliably getting all links, as running the query on the full dataset returns the same results as running it on 2006-2015 only. And I value my sanity enough to not fuss around with the regex.
What is the best way to download this dataset? Last time I messed with it I had to pay for a Google Cloud bucket and run through some awkward sequence of steps to eventually get a local copy.
That's essentially it (export the BQ table as a CSV to a Google Cloud Storage bucket, then download it from there), but you can do that entirely in the web UI, no CLI needed.
If you just want a subset of the data, run a query, then save the query as a table in your project and export from there.
I think an alternative option could be to have a torrent (or other file sharing mechanism) with the public HN information. Am I missing something? 16k rows seems very tiny for doing an analysis of HN.
Funny that searchyc.com was so necessary for so long, coming in at #54 and #68 (it seems it should be higher as these should be combined). Now it just redirects to a spam/ad website, but before HN had a search bar it was very useful.
The URL "u.ly/73I" at #63 is very interesting: it's seen almost nowhere else on the web (at least on Google) and is apparently spam now. When you click through to its mentions, you get:
> We found no comments matching u.ly/73I
What's the deal with that one? Was it spam comments that all got removed? It may be since some of the others were spam, like this one: https://goo.gl/l5v0b
It's impressive how little actual spam (as opposed to submarines; this is where I link to PG's essay) there is on HN.
The HN guidelines deserve to be #1, not just in this list, but for the whole internet.
If conversation were as civil elsewhere as it is on HN, Americans might rediscover the value of community, which they have lost to mindless bickering encouraged by commercial algorithms elsewhere.
I very much appreciate the urbane nature of HN, and strive to contribute in a way that brings light, not darkness.
My digital community experience harkens from the USENET days, which made the worst pissing matches on Faecesbook look like polite disagreements between scholars.
A day or two ago, someone made a real mild slap at me (I can come across as a bit tiresome, if you haven’t noticed. Take my word for it; it’s preferable to my USENET persona), using a very tired old troll technique, and someone else flagged it.
I was actually surprised it was flagged, but it does show that people take civility seriously, hereabouts.
I was thinking of American social media companies with exploitative engagement algorithms that foment adversarial discourse. You are right if you are saying the users are all over the world, not just America.
Is that what you mean? Or is the term “American” problematic in some other way?
Linked 197 times, possibly referenced more than that; plus it’s a neatly exaggerated example of a joke that I’d imagine has been made many more times again:
The semi-inevitable downvotes may be because xkcd links are seen as cheap canned replies that, although relevant and often funny, don't really add anything novel to a debate. And perhaps - the horror! - they might not actually always be correct in a specific case.
Probably because someone posts "correlation does not imply causation" as a rote "I am very smart" reply to any published study, regardless of the actual contents and claims of the study.
#11 literally changed my life. I used to debate people online and get irrationally upset when I couldn't change their opinions. Reading that xkcd was like getting a dope slap. I still debate but I rarely let myself get upset and if I do I try to hold that in my mind.
You're judging it from the wrong position. Diceware is an improvement upon the usual kinds of passwords people feel forced to come up with, and then re-use everywhere.
A password manager is better than diceware, and whilst it feels like less friction to people who use one, many people are unwilling to use one.
Just rejecting passwords found in HIBP's stolen passwords list had me receive death threats and long rants about how password security isn't my responsibility and I was putting up too much friction.
Ah, that makes sense. Thanks for clarifying for my slow self.
Edit: Also
> rejecting passwords found in HIBP's stolen passwords list had me receive death threats and long rants about how password security isn't my responsibility
That is amazing in a horrifying sort of way. I'm sorry you went through that.
The comic calculates entropy with the assumption that the attacker knows that your password is four common words. Even if they run a dictionary attack on your password, it will still be secure.
Diceware passwords are still recommended by security experts over any other method.
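For anyone who wants to check the arithmetic, here is a rough sketch of the entropy calculation. It assumes each word is picked uniformly at random and that the attacker knows the word list and the number of words. The 2048-word list matches the comic's 11 bits per word; 7776 is the size of the standard Diceware list (five dice rolls per word). The function name is just made up for illustration.

```python
import math


def passphrase_entropy_bits(wordlist_size: int, num_words: int) -> float:
    """Entropy in bits of a passphrase of uniformly random words.

    Assumes the attacker knows the word list and the number of words,
    so the only unknown is which words were picked.
    """
    return num_words * math.log2(wordlist_size)


# Four words from a 2048-word list (11 bits per word, as in the comic).
xkcd_bits = passphrase_entropy_bits(2048, 4)

# Four words from the 7776-word Diceware list (6**5 entries).
diceware_bits = passphrase_entropy_bits(7776, 4)

print(f"xkcd-style: {xkcd_bits:.1f} bits")
print(f"diceware:   {diceware_bits:.1f} bits")
```

The point is that the number doesn't depend on the attacker being unaware of the scheme; it only holds if the words really are chosen at random rather than by a human picking a memorable phrase.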
"The submarine" was nice for its time. The reality is that today astroturfing is very much a real thing, especially as print media has died and the same types of firms find ways to hawk their wares on the internet.
pg also had the right idea, but you could, if you desire, sit further back and ask how someone's personal self-interests affect what they write (or nowadays, post) about, and how that can affect the sincerity of what they are writing.
Was kinda surprised to see https://circleci.com at #37. When I clicked through, turns out it was me repeatedly reminding anyone who'd listen that we existed.
Interesting, looks like the Wikipedia title spells it Betteridge's with an apostrophe. %27 is the URL encoding for the apostrophe and %25 is the URL encoding for the % character. So it looks like the link somehow got URL-encoded twice before it was shared: the apostrophe became %27, and then the %27 became %2527, producing a bad link.
Some common method of people finding and sharing that url must have had an issue where it incorrectly encoded the url twice. I wonder what it was :) .
Or perhaps the scraper has a bug where it is double url encoding the en.wikipedia.org/wiki/Betteridge%27s_law_of_headlines link?
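The double encoding described above is easy to reproduce. A minimal sketch using Python's standard `urllib.parse` (this only illustrates the mechanism; it says nothing about which tool actually mangled the link):

```python
from urllib.parse import quote, unquote

title = "Betteridge's_law_of_headlines"

# First pass: the apostrophe is percent-encoded as %27.
once = quote(title, safe="")

# Second pass on the already-encoded string: the "%" itself becomes %25,
# so "%27" turns into "%2527".
twice = quote(once, safe="")

print(once)   # Betteridge%27s_law_of_headlines
print(twice)  # Betteridge%2527s_law_of_headlines

# Decoding once only undoes the outer layer, leaving a literal "%27"
# in the page title, which is why the link breaks.
print(unquote(twice))
```

Decoding twice gets back to the original title, so a scraper could normalize such links by unquoting until the string stops changing.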
I watched https://www.youtube.com/watch?v=dBnniua6-oM a few days ago and was surprised to see it is the #1 YouTube link. It does have 11M views, however, and it is a great talk.
I’m so happy to see XKCD 927, because it was my immediate first thought. It is one of the few XKCD comics I know by number, and while witty I do feel it gets misused in thought-terminating ways sometimes. Still, I feel validated in my knee-jerk reaction that it would be the top XKCD comic on the list.