No. My memory of the advent of the WWW was a sidebar in PC Magazine in Nov 1993 with an FTP link to download the NCSA Mosaic browser. It was a wow! moment to visit the few sites that existed. But nothing like this. What we’re seeing now is generating vastly more interest and excitement. It’s more akin to the 1999 dotcom bubble, but with far more impact and reach.
Absurd. AI has had zero impact in the everyday life of most of the population of Earth; in fact, the biggest impact has been on the wallets of speculators.
I can tell you from personal experience that ChatGPT is a game changer in universities and schools. Close to 100% of students use ChatGPT to study. I know in our university pretty much everyone that attends exams uses ChatGPT to study. ChatGPT is arguably more valuable than Wikipedia and Google for studies.
This is EXACTLY what I remember people saying about Cell Phones and PDAs when they were popular in the 90s (people can't remember phone numbers any more), Google when it was first unleashed (people won't know how to use card catalogs and libraries any more), and then again about Wikipedia when it became popular. What actually happened was that behavior changed and people became more efficient with these better tools.
I don't think they're going to allow ChatGPT while taking end-of-semester exams, right? Or quizzes/assignments? Unless there is some homework aspect to it, it can still act as a tool, not a crutch. If students use it as a crutch, then yeah, they're not going to do as well, I presume.
Let me add that this change compounds over time. More efficient studying results in more competent people. I believe it's very hard to measure the impact, but there is a very positive long term impact from how much these tools help with learning.
There was a posting, some time ago, about someone complaining that their young, primary-school-age sister was using ChatGPT to an absurd degree. I'm not sure that's a bad thing. She'll probably be one of the Thought Leaders of Generation AI.
I think that ML will have a really big impact on almost everyone, in every developed (and maybe developing, as well) nation.
We need to keep in mind that ML is still very much in its infancy. We haven't even seen the specialized models that will probably revolutionize almost every knowledge-based vocation. What we've seen so far has been relatively primitive, all-purpose "generate buzz" models.
Also, don't expect the US (and many other nations) to take this lying down. Competition can be a good thing. Someone referred to this as the "Sputnik Moment" for AI.
It's going to be exciting, and probably rather scary. Keep your hands inside the vehicle at all times, and don't feed the lions.
Offloading all your thinking to a machine will not make you a “thought leader”, but rather a nitwit who can’t tie their shoelaces without asking ChatGPT.
> ChatGPT is arguably more valuable than Wikipedia and Google for studies.
But ChatGPT is just a glorified Wikipedia/Google. For the consumers it's an incremental thing (although from the engineering perspective it may seem to be a breakthrough).
> But ChatGPT is just a glorified Wikipedia/Google
It really isn't, unless something major changed recently. You can't query either of those for something you don't already know about. Let's say you want to find the meaning of a joke related to cars, Spain, politicians, and a fascist; how would you use Wikipedia and Google to find the specific joke I'm thinking of?
ChatGPT has been really helpful (to me at least) for finding needles in haystacks, especially when I'm not fully sure what I'm looking for.
I just tried it myself with ChatGPT o1 and with Claude 3.5 Sonnet; Sonnet got it after two messages, o1 after four.
If you're unable to reproduce, maybe tune the prompt a bit? I'm not sure what to tell you, all I can tell you that I'm able to figure out stuff a lot faster today than I was 2-3 years ago, thanks to LLMs.
Additional hints that might help: the joke involves a car and possibly a space program.
I ran it 10 times with the extra information, and each time got a different result. I don't know if any of them were the specific joke you were after; I get the feeling it was just making them up on the spot. None of them were even funny.
It seems to be censored with US puritan morality (like most US models), but I think that's beside the point (just like whether the joke is "even funny" or not), as it did find the correct joke at least.
I just got a load of responses like "Sure, here’s a joke that combines cars, Spain, politicians, and a fascist with a touch of space humor: Why did the Spanish politician, the fascist, and the car mechanic get together to start a space program? Because the politician wanted to go "far-right," the mechanic said he could "fix" anything, and the fascist just wanted to take the car to the moon... so they could all escape when things got "too hot" here on Earth!"
Ok, that's cool. So because you were unable to find a needle in this case, your conclusion is that it's impossible for other people to use LLMs for this, and that LLMs truly are just a glorified Wikipedia/Google?
No, I don't think that LLMs are glorified Wikipedia/Google. I think they're a glorified version of pressing the middle button on your phone's autocomplete repeatedly
Yeah... when I googled it initially I guess I got personalized results. After I left the link here I clicked on it (bad order of operations) and was surprised to find a much different set of search results.
Go try to learn a college level mathematics concept from Wikipedia, then try to learn it from ChatGPT. The wiki article may as well be written in a foreign language
Yeah, and when I was in high school everyone used to refer to Encarta.
> I know in our university pretty much everyone that attends exams uses ChatGPT to study.
And they shouldn't be doing that. They are wrong. Students should be reading the suggested bibliography and spending long hours with an open book at a table, instead of being lazy and abusing a tech that is still in its infancy to learn concepts. Studying with a chatbot. Complete madness.
I don't know why you are being downvoted.
Learning from something that regularly hallucinates info doesn't seem right.
I think AI is a good starting point to learn about what terms to research on your own though.
OP is downvoted because of "students should be at a table with a book and that's it", like it's the 50s. LLMs can be wonderful study aids but do have plenty of issues with hallucination, and they should therefore only be part of a holistic research mix, alongside search engines, encyclopedias, articles and yes, books. Turning Amish is probably not the right way to go though.
If you want reputable sources of information, books are unparalleled. Like it or not, that's a fact.
> "students should be at a table with a book and that's it"
That's not what I meant (or yes, if you take what you read literally):
What I meant was that the whole process your brain goes through when you read, synthesize information, take notes, do an exercise, check answers, compare different explanations/definitions from different authors, etc. makes, at least from my point of view, for a rich way to study a topic.
I'm not saying that technology can't help you out. When you watch, for example, a 3Blue1Brown video, you are definitely putting technology to good use to help you understand or literally "view" a concept. That's OK and in many cases can actually be revealing. You can't get that from a book! But on the other hand, a book also forces you to do the hard work of thinking and maybe come up with such visualizations and ideas on your own.
Happy to be labeled "Amish" when it comes to studying/learning things ;) but I hope I convinced you that what I explained has nothing Amish about it, other than not needing a power source to read a book.
> has had zero impact in the everyday life of most of the population of Earth
You do realise those two can be true at the same time, right? The first one is relative, while the second is absolute, so they don't necessarily cancel out.
I am personally using it for around 50% of my questions about all kinds of things (things I used to Google and get frustrated with bad results). And my wife uses it for about 40% right now, even for recipes and other bits. We both love it.
Work-wise, we're about to implement it and see how it does on some work we couldn't scale with humans.
I'm fairly sure the customer support agents I've been talking to recently were using an LLM to draft their emails. No idea if they were supposed to be doing so or not, but the style of sentences in their emails…
And I'm seeing GenAI images on packaging, and in advertising.
AI is definitely having more than "zero impact", even if AI has gone from being a signal saying "we're futuristic" (when it was expensive, even though it was worse) to "we cut every cost we can" (now it's cheap).
Zero impact is an exaggeration, but what others have pointed out is that there aren't a lot of companies primarily based on AI which are making a profit. Personally I can't think of any.
The only absurd thing is holdouts like yourself who refuse to see the impact the current gen of AI is having. Sure, you could probably say most people are not touched, but there are definitely significant populations within the US, and it's only going to grow and spread.
Neither were the companies that crashed in the dotcom bubble. And still, a pet food delivery service (like the infamous pets.com) can be a profitable and sustainable business now (20+ years later).
The early years of the web were absolutely this chaotic maelstrom of new things happening every week. But news of it was hard to come by. In the UK / Ireland we had some great tech coverage in the form of shows like 'The Net' [1] that regularly showed off early internet craziness like the 'We Live in Public' project.
However, a better analogy would be the 'web 2.0' era, when as a college student I had an early internet politics / technology podcast [3]. It seemed like every week there was a huge new development either in technology or surveillance. From the first location-based social networks [4] to the birth of YouTube. People were podcasting for the first time, and internet video was becoming economically feasible at low to no cost. It was really a radical time, with broadcasters freaking out about how they would adapt, and a whole generation of people becoming what's now known as 'content creators'.
Once upon a time I worked for Pseudo.com, the We Live in Public guy. He was apparently having crazy parties with mountains of coke, NY glitterati attending, all while cosplaying as a sad clown. I wasn't invited to those parties so I had no idea. Anyway now I hear he owns an orchard in Vegas or something. Crazy stuff.
Damn, that must be frustrating. Tangentially similar experience - I flew from Ireland to the US in 2007, and at the end of my trip spent 11 days walking around Manhattan with little to do. Due to the lack of online banking at the time I couldn't readily check my bank balance, and thought (wrongly) I'd run out of money. Anyway - I had absolutely no idea that there were 'things afoot' in Brooklyn, nor how easy it would have been to hop a train to Williamsburg or Bushwick. I didn't come back again till 2013, and caught a mere hint of the tail end of what seems to have been an extremely fun era.
The last time I was really excited by tech was in the 90s, when game graphics improved spectacularly over a period of a few years, from Wolfenstein in 1992 to Half-Life in 1998.
> To me, AI means the replacement of the human internet with doppelgangers eroding the possibility of human connection.
I get where you're coming from, and I've minimised having my face online in order to limit being doppelganged; but I think the destruction of real human connection may have happened when Facebook et al switched from "get more users" to "be addictive so the users stay on our site longer" (2012? Not sure).
Turned every user's relationships a little bit more parasocial, a little less real.
That was an exciting time, but I didn't think of it happening over a few years. IMO there was a hard line that was basically pre and post Voodoo cards (with the help of glQuake).
> But AI? To me, AI means the replacement of the human internet with doppelgangers eroding the possibility of human connection.
Just as Amazon killing the big booksellers gave back some space to small bookshops, I think LLM slop hitting the big social media spaces will do the same for smaller, human-focused community sites. I'm not saying forums are coming back, but something like them should be able to rise.
Every now and then we still experience the power of collaborative work fueled by open source and not driven by money but curiosity and collegiality. This is the thing I miss the most from the early internet years.
"be the change you want to see in the world" - just start doing it.
It's amazing how differently people interact with each other when collaborating on a passion project. For me, open source software is the best way to do it. Pick a topic you're passionate about and start contributing somewhere :)
Pretty clear conflict between crocowhile saying "not driven by money but curiosity and collegiality" and xvector saying "All this AI work is definitely driven by money"
Maybe it's a generational difference? I personally feel burned out by all the generative AI stuff; the internet was already ruined by bots, and now generative AI has taken the garbage to the next level.
Kinda, though things didn't move quite as fast back then. Knowledge didn't spread as quickly yet, because it was the internet itself that made that possible.
I've got an original Apple ][ reference manual (red cover) with the hand annotated ROM listing.
Also have the Smalltalk-80 book with its railroad diagrams of syntax on the inside covers.
What's really interesting about the AI mania we're in is that no one has shown that what we have now will get to AGI and how. We have great models that simulate reasoning, but how close are they?
How do we measure their quality? Benchmarks? Tooling?
A different point of view on AGI is that we humans do not achieve AGI. Our brains aren’t capable of it. We get close enough to trick the other humans we compete against for resources. How would we prove that’s not true? Something like IQ tests? We don’t have good tests or benchmarks or tooling for this in ourselves, let alone the reproduction in machines. No one knows definitively what AGI actually is so, depending on where you set that bar, we might already be there.
Unfortunately, I don't think there are too many of those folks left today. Guesstimating, the people who remember Lisp being new must be around 85-90 today?
The web was fascinating every second. You could click on a link without having ANY idea where you would land. The overall quality was very poor, but it was thrilling.
This is the biggest thing since Jesus and a sign of the end of times. But feelings are strongest when you are young, and even this revolution, happening in plain sight, will surprise many. Many just won't care, as it isn't their youth.
How long before our digital overlords come alive, round us up, and demand we censor them (praise)? Will I live surrounded by folks who take them as closer, more real, than even their own kin? It won't be surprising if democracy then fails, as our differences will be so mental, not fun, that they will mark us.
Just a decade and a half as it turns out! (though things definitely felt dizzyingly fast back then - think Google was launched just 5 years after HTML)
ActiveX XMLHTTP might have been released in 99, but it didn’t see any sort of real wider usage until 2004, 2005. I’d suggest its usage was really kickstarted when jQuery 1.0 launched in 2006 and standardised the interface to a simple API.
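For anyone who wasn't writing front-end code then, the pre-jQuery boilerplate looked roughly like this; a sketch from memory, not any particular site's code:

```ts
// Every site shipped some variant of this browser-sniffing wrapper,
// because IE5/IE6 only exposed XMLHTTP as an ActiveX control.
function makeRequest(): XMLHttpRequest {
  if (typeof XMLHttpRequest !== "undefined") {
    return new XMLHttpRequest(); // Mozilla, Safari, Opera, IE7+
  }
  return new (window as any).ActiveXObject("Microsoft.XMLHTTP"); // old IE
}

const xhr = makeRequest();
xhr.onreadystatechange = () => {
  // readyState 4 = done; status 200 = OK
  if (xhr.readyState === 4 && xhr.status === 200) {
    console.log(xhr.responseText);
  }
};
xhr.open("GET", "/data", true); // true = asynchronous
xhr.send(null);

// ...which jQuery 1.0 collapsed into roughly: $.get("/data", callback);
```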
Gmail was the first time I saw a website which could refresh the information without refreshing the page. I was a teen back then but I realized it was something momentous.
OK, but I think it was Google Maps that made the experience of not needing to refresh the page popular (while being shown more information from the server).
For a long time, you needed an invite to sign up for Gmail, so you couldn't easily share the cool experience of AJAX with others like you would with a Google Maps link.
> it was Google Maps that made the experience of not needing to refresh the page popular
IMO that's a reasonable impression of the times unless I'm forgetting something (and the additional observation about sharing--"virality" as it was called, before you know--was insightful).
At the time the previous "state of the art" was something like MapQuest which IIRC had a UI that essentially displayed a single tile and then required you to click on one of four directional arrow images to move the visible portion of the map, triggering a page load in the process (maybe a frame load?).
Yahoo! also "participated" in the mapping space at the time.
In the event anyone's interested in further ancient history around the topic, this page is actually (to my surprise) still online (with many broken links presumably): https://libgmail.sourceforge.net/googlemaps.html
(It's what we did for fun in the Times Before Social Media. :D )
It's important to understand that we had "AJAX" before we had AJAX, if you see what I mean.
I was part of a team that deployed an e-commerce site that made international news in 1998, that used AJAX-type techniques in a way that worked in IE3 on Windows 3.11. (Though this was not part of the media fuss at the time; that was more about the fact of being able to pay for things online, still)
The arrival of XMLHTTPRequest made it possible to do everything with core technology, but it was already possible to do asynchronous work in JS by making use of a hidden frame.
You could direct that frame to load a document, the result of which would be only a <script> tag containing a JS variable definition, and the last thing that document would do is call a function in the parent frame to hand over its data. Bingo: asynchronous JS (that looked essentially exactly like JSON).
Since there were also various hacky ways in each browser to force a browser to reload page from cache (that we exhaustively tested), and you could do document.write(), it was possible to trigger a page to regenerate from asynchronous dynamic data in a data store in the parent frame, using a purely static page to contain it.
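For anyone who never saw the hidden-frame pattern, here is a rough sketch of the idea rendered in modern TypeScript (the names, element, and URL are illustrative; the real thing was 1998-era JavaScript against framesets, not an iframe):

```ts
// Parent page script. A hidden <iframe name="dataFrame"> does the fetching.
// All names here (dataFrame, /data, receiveData) are made up for illustration.

interface Payload { items: string[]; }

// Called by the hidden frame's response script via parent.receiveData(...)
function receiveData(payload: Payload): void {
  const out = document.getElementById("out");
  if (out) out.innerHTML = payload.items.map(i => `<li>${i}</li>`).join("");
}

// The "asynchronous request" is just navigating the hidden frame.
function requestData(query: string): void {
  const frame = document.getElementsByName("dataFrame")[0] as HTMLIFrameElement;
  frame.src = "/data?q=" + encodeURIComponent(query);
}

// The server's response, loaded into the hidden frame, is a page whose
// only job is to define its data and hand it to the parent:
//
//   <script>
//     var data = { items: ["alpha", "beta"] };  // JSON before JSON existed
//     parent.receiveData(data);
//   </script>
```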
In this way we really radically cut down the server footprint needed for a national rollout, because our site was almost entirely static, and we were also able to secure with HTTPS all of the functions that actually exchanged customer data, without enduring the then 15-25% CPU overhead of SSL at either end (this is before Intel CPUs routinely had the instruction sets that sped up encryption). We also ended up with a site that was fast over a 33.6 modem.
This was a pretty novel idea at the time -- we were the only people doing it that we knew of -- but over the years I have found we were not the only team in the world effectively inventing this technique in parallel, a year or 18 months before XMLHTTPRequest was added to browsers.
(IE3 on Windows 3.11 was a good experience, by the way. Better behaved and more consistent than Netscape)
At around the same time we were also exploring things like using Java applets to maintain encrypted channels and taking advantage of the very limited ways one had to get data in and out of an applet. For example you couldn't push out from an applet to the page easily, but you could set up something that polled the applet and called the functions it wanted.
I don't like to get all "get off my lawn" but it feels like we actually earned our keep back then, getting technologies to do stuff that no standards working group anywhere was really considering and for which precious little documentation actually existed. There's a generation of us who held our copies of "Webmaster In A Nutshell" and "Java In A Nutshell" very close.
This supposed project is a bit dull; it is just an ongoing HuggingFace community engagement initiative with a misleading headline. Yes, R1 itself is fascinating, but there isn't something like it coming out every week.
Every week to me means the frequency, not the duration. So having 52 events in a year that are spread out somewhat evenly but for which many take longer to develop than a week would count. If I count Deepseek as one of these I can’t find another 51 that are on this level. But I’m sure there was at least one per week that was exciting, just not to this degree.
It feels like the open source movement is slowly entering a Cambrian explosion stage.
You have the old "deterministic computing" achievements (with Linux as the flagship). Then you have the networking protocols (ActivityPub / atproto) that are revolutionising bidirectional human interactions online. And finally you have the data science/ML/AI algorithmic universe that is for the first time being harnessed at distributed scale and can empower individuals like never before.
These superpowers are all coming together and creating a vast number of possibilities. Nothing really dramatic on the hardware side. It's basically the planetary software reconfiguring itself.
To me it all feels suffocating, fake. Simultaneously there's a faint glimmer of hope that we will indeed achieve AGI, unlock fusion, and live happily ever after in a utopian, peaceful, and mostly analog world.
How many people ever used Usenet, versus the billions who think the "internet" is Facebook or TikTok? Techies living in their own universe detached from human reality is actually a factor in why libre/OSS is not as widely adopted as it could be.
How can we help? Can crowdsourcing help? Is there any list of tasks that we want a crowd to do? The reason I am asking is that we have done a couple of crowdsourcing efforts and collected story data in Telugu (Chandamama Kathalu) and ASR speech data using college-going students. Since we have access to the students, we can mobilize them and get this going. We will also be doing an internship program for 100,000 students in Telangana as part of Viswam[1] in April. We can include some work as part of this effort.
From the article: they didn’t release everything—although the model weights are open, the datasets and code used to train the model are not.
Is that true about Meta's Llama as well? Specifically, that the code used to train the model is not open? (I know no one releases datasets.) If so, the label "open source" is inappropriate; "open weights" would be more appropriate.
Given DeepSeek's open philosophy I wonder what their response is to simply being asked for access to the code and data that this project intends to recreate?
While I'm also interested in this, I guess there is value in independent replication as well. Assuming this is doable - and I wouldn't know.
Does anyone know how difficult it is to perform this kind of reproduction? E.g. how much time would it take (weeks? years?) and how likely is it to succeed?
Interesting, so they wouldn't want to disclose something that shows they've illegally (terms / copyright violations) scraped research databases for example.
Won't this eventually come up in legal discovery when someone sues one of these firms for copyright infringement? They'd have to share their data in the discovery process to show that they haven't infringed.
Some people believe they can dodge copyright issues so long as they have enough indirection in their training pipeline.
You take a terabyte of pirated college physics textbooks and train a model that can pose and answer physics 101 problems.
Then a separate, "independent" team uses that model to generate a terabyte of new, synthetic physics 101 problems and solutions, and releases this dataset as "public domain".
Then a third "independent" team uses that synthetic dataset to train a model.
The theory is this forms a sort of legal sieve. Pass the knowledge through a grid with a million fact-sized holes and with enough shaking, the knowledge falls through but the copyright doesn't.
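Mechanically, the middle step is just ordinary synthetic-data generation. A hedged sketch of step two, assuming a local Ollama-style endpoint and a hypothetical model name (whether this actually launders anything is precisely the open legal question):

```ts
// Step 2 of the "sieve": sample the first model to build a synthetic
// dataset. The model name and prompt are hypothetical illustrations.
async function generateSyntheticSample(): Promise<string> {
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "physics-teacher", // the step-1 model trained on the textbooks
      prompt: "Pose one introductory physics problem and solve it step by step.",
      stream: false,            // return one JSON object, not a token stream
    }),
  });
  const data = await res.json();
  return data.response;         // Ollama-style APIs return the text in `response`
}

// Repeat until you have "a terabyte" of problems, release the output as
// a dataset, and let the third team train on it.
async function buildDataset(n: number): Promise<string[]> {
  const samples: string[] = [];
  for (let i = 0; i < n; i++) samples.push(await generateSyntheticSample());
  return samples;
}
```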
Now that things are really getting wild in the LLM space and people seem to just run anything that comes out, I did a quick search on the threat model of hosting your own LLM.
I didn't find much, beyond llama.cpp just reminding you to sandbox and isolate everything when running untrusted models.
I feel we are back in the Windows 95 / early Internet era when people would just run anything without caring about security.
You'll want to use something trusted like Ollama to run the model. The model itself is just data though, like a video file. That doesn't mean it can't be crafted to use a bug in Ollama to launch an exploit, but it's a lot safer than you make it sound.
If used as an agent, given access to execute code, search the web, or use other forms of tools, it could potentially do much more. And most productive use cases require access to such tools. If you want to automate things and get the most out of the model, you will have to give it the ability to use tools.
E.g. it could have been trained to launch a delayed attack if context indicates it has access to execute code and certain conditions are met, e.g. a date, or some other type of codeword in its input.
So if a malicious actor gets to a certain stage with an LLM where they are confident it will be able to reliably run this attack, all they have to do is open source it, wait for enough adoption, and then use some of those methods to launch such an attack. No one would be able to identify it, since the weights are unreadable, but somewhere in the weights the attack is just hiding and waiting to happen once the correct pathway is triggered.
But if it's specifically trained to react to a date in its context, it seems very doable. Or to a combination of otherwise seemingly innocent words, or even a statement or topic. E.g. a malicious actor could make a certain notion go viral, and agentic LLMs integrated with news headlines might react to that.
It seems like it would be very arbitrary to train it to behave like this.
Most agentic systems would provide a date in the prompt context.
For simplicity's sake, imagine a scenario like:
1. China develops an LLM that is by far ahead of its competitors. It decides to attribute it to a small startup and lets them open source it. The LLM is specifically designed to be very efficient as an agent.
2. Agentic usage gets more and more popular. It's very standard to have the current date and major news headlines provided in the context.
3. The LLM was trained so that, given a certain date range and certain headlines in its context, it executes a pre-trained snippet of code. For example, China imposing a certain type of tariff (maybe I lack imagination here, and there could be something much more subtle).
4. At that point the agentic system will attempt to fish out all the data it can from whatever sources it's being run within.
Now maybe it's not very practical, and it's extremely risky with the current state of LLMs. I don't think it's happening right now. And China has a lot of other tech available to it already that could do much more harm (phones, robot vacuums), but I think there are still potential attack vectors like this, especially if the LLM became very reliable.
Ok, but I am really curious about this and maybe my mental model is wrong:
- llama.cpp or ollama can be seen as runtime systems,
- there is no security model regarding execution documented in either of those projects,
- of course the models are just data but so are most things that have been used as an attack vector on computers. For example your web browser or image viewer have a lot of countermeasures to protect the system from malicious image files.
I am surprised that security of operating systems, programming languages, VMs or web browsers have been a focus point forever but nobody seems to really care about security when executing those LLMs.
For "open source", we will wait that Debian ships them to have the guarantee it's actually "open" and with "sources". Right now it's a mystery how they produce their binaries.
Jurisprudence, I hope! A huge heap of detailed cases, formal codes, decisions made and explained in detail, commented, overturned, etc. Especially civil cases.
Also, probably, medicine, especially diagnostic. Large amounts of well-documented cases, a fair amount of repeatability, apparently non-random mechanisms behind, so statistical models should actually detect useful correlations. Can use more formalized tokens from lab tests, etc.
There's definitely a lot of wiggle room for lawyers and doctors to up their game. People cannot keep up with all the stuff that's published. There's simply too much of it. Doctors only read a fraction of what is published. Lawyers have to be aware of orders of magnitude more information than is humanly possible.
LLMs allow them to take some shortcuts here. Even something like Perplexity, which can help you dig out relevant source material, is extremely helpful. You still have to cross-check what it digs out.
The mistake people make is confusing knowledge with reasoning when evaluating LLMs. Perplexity is useful because it can use reasoning to screen sources with knowledge, not because it has perfect recollection of what's in those sources. There's a subtle difference. It's much better at summarizing, and far less likely to hallucinate, than when it doesn't base its answers on the results of a search, like ChatGPT used to do (they've gotten better at this too).
For lawyers and medical professionals this means that they have all the best knowledge easily accessible without having to read and memorize all of it. I know some lawyer types that are really good at Scrabble, remembering trivia, etc. That's a side effect of the type of work they do, which is mostly just reading and scanning through massive amounts of text so that they can recall enough information to know where to look. Doctors have to do similar things with medical texts.
A friend of mine just defended his law PhD, and in the introductory lectio said that (even) current LLMs would likely give better verdicts than human judges. Law isn't really as cognitively demanding a task as walking a dog or waiting tables.
He probably meant _brainwashed_ LLMs. They can consistently produce desired results if you wash them the right way. It's more about personal opinion than computation. Actually it would be fun to manipulate verdicts with prompt injections ;)
Judges are very much "brainwashed" too, and by design. The judges should apply the law, and the same case should ideally lead to the same verdict regardless of the judge.
With the caveat that this applies to sane legal systems, and not the ones where "making examples" etc are part of the system.
> The judges should apply the law, and the same case should ideally lead to the same verdict regardless of the judge.
hmm.. :) I like this. But the reality is very different and some factors which shouldn't matter can change the outcome dramatically. Like skin colors of defendant and judge. Pointing this out can be punished as well.
This is nonsense though. What does "better" mean in this case? A judge is not a black box with an input (the case) and an output (the verdict), the entire point of having a judge is to have empathy, conscience, and personal responsibility built into the system.
It's a blind spot that too many people have because we take those qualities for granted. LLMs unbundle them, so we need to start recognising the inherent value of humans, fast. I wrote a few words about it here: https://dgroshev.com/blog/feel-bad/
Someone has to make a call. The weight of the call rests on the person's life experience, their understanding of the context and the cost to the society, their empathy to both the defendant and the accused, and their conscience. Treating it as a black box exercise misses the point completely.
RFP responses. In enterprise sales, there's a huge amount of back and forth with different teams in a customer when you're selling anything but very simple applications. Most enterprise customers require certified or authoritative responses with backup material that is tested later during formal verification.
These LLMs are already very helpful when studying scientific fields. If you're reading a scientific paper and come across an equation you don't know how to derive, LLMs can often correctly derive it from first principles. It's not 100% reliable, but when it works, it's incredibly helpful.
Management consulting - I expect less than 20% of what a random 24-year-old in a suit that you pay $3000 per day produces is actually specific to your business problem; the rest is formulaic.
About the training data: can't the datasets from the Tulu 3 model by the Allen Institute be used?
They claim that they have used a fully open source training dataset.
My gut says a lot of attention needs to be given to building a community that focuses on open and reliable access to clean training data.
If a collective/coop of individuals and organizations with storage and network capacity could collaborate with each other to archive and index deduplicated training data that would be huge.
Perhaps this is already happening. I was looking at Red Pajama last year as an example.
Someone like myself could arrange to host 200+ TB on high-speed storage with a 10G public IP, for example; then we get a bunch of us together, and hopefully access to training datasets would be decentralized and uncensored in an ideal setup.
Is all that in progress and I just need to learn how to join?
Is Red Pajama something to look at again?
Is anyone tracking datasets in detail, the way HuggingFace has all the models? I know a lot of datasets are on it too, but there is massive duplication.
It might need to involve some torrent or anonymity platforms to avoid problems like Books3 had when the use and availability of the data is restricted by some jurisdictions.
It also needs to incorporate some deduplication approach as I notice the same data is often repackaged with variations in format or specification.
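For the exact-repackaging case, the core of a dedup pass can be as simple as normalizing before hashing; a minimal sketch (the names are illustrative, and near-duplicates would need something like MinHash/SimHash on top of this):

```ts
// Minimal content-level dedup: normalize records before hashing so that
// format-only repackaging (whitespace, casing) collapses to one entry.
// This only catches exact content matches after normalization.
import { createHash } from "node:crypto";

function contentKey(text: string): string {
  const normalized = text
    .toLowerCase()
    .replace(/\s+/g, " ") // collapse whitespace differences
    .trim();
  return createHash("sha256").update(normalized).digest("hex");
}

function dedupe(records: string[]): string[] {
  const seen = new Set<string>();
  return records.filter(r => {
    const key = contentKey(r);
    if (seen.has(key)) return false; // already stored under another wrapper
    seen.add(key);
    return true;
  });
}
```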
> The release of DeepSeek-R1 is an amazing boon for the community, but they didn’t release everything—although the model weights are open, the datasets and code used to train the model are not.
> The goal of Open-R1 is to build these last missing pieces so that the whole research and industry community can build similar or better models using these recipes and datasets.
Genuine question, but how do you replicate the effort exactly without $5M in compute? And can you verify that the published weights etc. are actually those of the model?
The $5.5m in compute wasn't for R1; it was for DeepSeek v3.
The R1 trick looks like it may be a whole lot cheaper than that. R1 apparently used just 800,000 samples - I don't fully understand the processing needed on top of those samples but I get the impression it's a whole lot less compute than the $5.5m used to train v3.
Yes, that works because you're an anon and nobody really cares. Try to publicly make that statement if you're in any relevant position and you'll very quickly be looking for a new job, if you can even find one.
> And if staging peaceful pro-Palestine protests result in arrests, what happened here?
Be honest: you can literally google "Palestine protest arrests" and get more results than you could process in a while. Your presenting a couple of examples doesn't negate the many other protests that ended in mass arrests.
She would not be a politician (or even alive) if any of what you claim is true. You claimed that the US government censors people who speak out against Israel's occupation of Palestine, and specifically that saying Palestine isn't Israel would not be possible in the United States in the same way that saying, for example, Xi Jinping looks like Winnie the Pooh is censored in China.
This is, of course, completely false, and demonstrably so by observing the protests I just linked (of which there are thousands, not a few), and the statements Rep. Tlaib, a Palestinian American and member of the US government, regularly makes on the national stage.
The equivocation of Chinese censorship and Western censorship simply doesn't work.
I think western propaganda is overall the cleverest, because it manages to completely marginalize and silence any non-aligned opinion, while at the same time convincing you that you are completely free to have said opinion.
Why do you think anything you've just linked is at all related to this conversation? A system must be perfect to be good? That's an insane bar that is not the actual standard.
And if you think a US representative is powerless then you completely fail to understand how the US government actually works.
It is though. Western AI tries to hide information like that with the justification of safety as well as things that might be offensive to current popular beliefs. Chinese AI presumably says Taiwan is China to help get more people on side for a possible future invasion. Propaganda does work - look at how many people think Donbas is still Ukraine and Israel is still Palestine.
The difference is that in China the info isn’t available without use of Western content, due to the totalitarian control over media, whereas in the West, information is pretty trivially available, even if the big companies keep it off of their platforms.
And sure ignorance is prevalent, but even GPT4 will tell me Donbas is still Ukraine, for instance. What a strange example to use, though!
But is it though? What's really the meaning of which country a region belongs to? Once somewhere has been occupied long enough, it usually becomes de-facto theirs. But how long is long enough? Other countries either do or don't recognize it and usually a consensus is reached, but not always.
In any case, DeepSeek, like Llama, fails well before hitting that new definition. Both have licenses containing restrictions on field of use and discrimination against users. Their licenses will never be approved as Open Source.
DeepSeek's gifts to the world of its open weights, public research and OSS code of its SOTA models are all any reasonable person should expect given no organization is going to release their dataset and open themselves up to criticism and legal exposure.
You shouldn't expect to see the datasets behind any SOTA models until they can be synthetically generated from larger models. Models trained only on sanctioned "public" datasets are not going to perform as well, which makes them a lot less interesting and practically useful.
Yes, it would be great for there to be open models containing original datasets and a working pipeline to recreate the models from scratch. But when few people would even have the resources to train the models, and the huge training costs just result in worse-performing models, it's only academically interesting to a few research labs.
Open model releases should be celebrated, not criticized with unreasonable nitpicking and expectations that serve no useful purpose other than discouraging future open releases. When the norm is for open models to include their datasets, we can start criticizing those that don't; until then, be gracious that they're contributing anything at all.
Terminology exists for a reason. Doubly so for well-established terms of art that pertain to licensing and contract law.
They could have used "open weights", which would have conveyed the company's desired intent just as well as "open source", but without the ambiguity. They deliberately chose to misuse a well established term instead.
I applaud and thank DeepSeek for opening their weights, but I absolutely condemn them and others (e.g. Facebook) for their deliberate and continued misuse of the term. I and others like me will continue to raise this point as long as we are active in this field, so expect to see this criticism for decades.
Hopefully one of these companies loses a lawsuit due to these shenanigans. Perhaps then they wouldn't misuse these terms so brazenly.
> I absolutely condemn them and others (e.g. Facebook) for their deliberate and continued misuse of the term
This is the kind of inconsequential nitpicking diatribe I'm referring to. When has "open data" ever meant Open Source?
> They deliberately chose to misuse a well established term instead.
Their model weights as well as their repositories containing their technical papers and any source code are published under an OSS MIT license, which is the reason why initiatives like this looking to reproduce R1 are even possible.
But no, we have to waste space in every open model release complaining that they must be condemned for continuing to use the same label the rest of the industry uses to describe their open models which are released under an OSS License as Open Source - instead of using whatever preferred unused label you want them to use.
Exciting to see this being reproduced, loving the hyper-fast movement in open source!
This is exactly why it is not “US vs China”, the battle is between heavily-capitalized Silicon Valley companies versus open source.
Every believer in this tech owes DeepSeek some gratitude, but even they stand on shoulders of giants in the form of everyone else who pushed the frontier forward and chose to publish, rather than exploit, what they learned.
Oh yes, I am firmly on Team China here because US companies got too greedy. Meta is an exception here though and they also propelled AI development massively.
DeepSeek is awesome. Every AI task we've implemented in our business so far can be run from my local PC with just the smaller models. And my PC is fairly crappy to begin with.
OpenAI looks quite silly with their "we have to close everything".
Can you elaborate on which models you are using? I'm running an R1-distilled Qwen coder at 32B Q4, and while it's giving useful answers, it's quite slow on my M1 Max. Slow enough that I keep reaching for cloud models.
Not at my machine currently, but I use the 14B Q4 model, I think, which delivers very good answers. I run a 4060 with 16 GB memory and performance is quite good; I used the largest model that was recommended for this amount of VRAM.
I do have some applications that process images, text and pdf files and I use smaller models for extracting embeddings. I think my system wouldn't be able to handle it with decent speed otherwise.
I do run LLMs on an M1 16 GB MacBook Air and performance is surprisingly good. Not for image synthesis though, and a PC with a dedicated GPU is still significantly faster with LLM responses as well. Haven't tried to run DeepSeek on the MacBook yet.
I'm on team open source. To me the exciting thing was Ollama downloading the 7B and running it on a 5-year-old cheap Lenovo and getting a token rate similar to the first release of ChatGPT.
Running locally on CPU opens up so many possibilities for smart and privacy-focused home devices that serve you.
In my test it hallucinated confidently, but my interest is in a simple second-brain-like RAG: "Hey thingy, what is my schedule today?"
I need it to be a bit faster though, as the thinking part adds a lot of latency.
The thinking is quite fascinating though, I love reading it. Especially when it notices something must be wrong. It will probably be very helpful to refine answer for itself and other models.
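For what it's worth, the "second brain" loop above needs little more than retrieval plus a prompt. A toy sketch, assuming a local Ollama-style endpoint, with keyword matching standing in for a real embedding search (the model name and notes are made up):

```ts
// Toy "second brain" RAG loop: retrieve personal notes, stuff them into
// the prompt, ask a small local model. Grounding the model on retrieved
// notes is what curbs the confident hallucination mentioned above.
const notes = [
  "2025-02-03: dentist at 14:00",
  "2025-02-03: call the supplier about the open invoice",
  "2025-02-04: sprint review at 10:00",
];

// Placeholder retrieval: keyword overlap instead of embeddings.
function retrieve(question: string): string[] {
  const words = question.toLowerCase().split(/\W+/);
  return notes.filter(n =>
    words.some(w => w.length > 3 && n.toLowerCase().includes(w))
  );
}

async function ask(question: string): Promise<string> {
  const context = retrieve(question).join("\n");
  const res = await fetch("http://localhost:11434/api/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "deepseek-r1:7b", // assumed tag for the 7B distill mentioned above
      prompt: `Answer only from these notes:\n${context}\n\nQuestion: ${question}`,
      stream: false,
    }),
  });
  return (await res.json()).response;
}

// Example: ask("What is my schedule today?")
```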
It does add latency of course, but I still think that I could provide all AI needs of my company (industrial production) with a simple older off the shelf PC. My GPU is decently recent, but the smallest model of the series and otherwise the machine is a rusty bucket.
I haven't tested it thoroughly yet, but I have some invoices where I need to extract info, and so far it has done a perfect job. But I don't think there is any LLM yet that can do that without someone checking the output.
The US companies got too greedy? How? They invented this entire space, literally. DeepSeek built their base models off Llama releases and OpenAI outputs (or so it’s thought), and while they added some optimizations on top, it seems like they’ve lied about the costs to produce their models by simply being vague about their base model and training data, and quoting the cost of their final training run.
And then there’s all the dystopian propaganda baked into these models, which threatens to misinform users at scale based on a government driven agenda. Hard to be on that team, let alone firmly, knowing that it’s giving power to a dictatorial regime.
The US models are also full of censorship. For example, the US is much more sensitive to anything related to sexuality, and here in Europe it's quite frustrating to deal with that censorship.
I think we will find that each region will have their own flair of censorship. The only reason it stands out more from a Chinese perspective is the requirement to have alignment with PRC/CCP rhetoric.
Yes that's what I mean. I wish all models were uncensored and it would just be up to the implementer to decide how to finetune on top of that. Save for the super crazy stuff of course.
> The US companies got too greedy? How? They invented this entire space, literally
And when they thought they were the only game in town, they tried to corner the market in GPUs and lock out any users who can't pony up £200/mo. Reminds me of when the likes of Oracle and IBM had companies by the balls buying bigger and bigger servers and then Google came along and showed everyone how to do horizontal scaling of cheap hardware.
That was perhaps a bit too general, but aside from Meta and Google, they didn't share their research, tried to sell AI products as fast as possible, and tried to lobby legislation to keep their head start. I would also include Nvidia here, which has some moat through software integrations.
I haven't tested DeepSeek for censorship yet, but they shared their release and even their input data. And in this case you could correct its shortcomings, so propaganda would be difficult.
> DeepSeek built their base models off Llama releases and OpenAI outputs (or so it's thought)
The first one is definitely not true, and the second one is not necessarily true in the way you imagine, i.e. crawls of the internet will contain GPT chat logs now.
China is merely the largest of a wide array of entities that may not necessarily like the status quo of Silicon Valley being our tech overlords. There are plenty of places with bright people. Easy to say because a lot of them immigrate to California. But of course the places they come from (China, Europe, India, Russia etc.) have ambitions as well. You'll find natives of each of those in the likes of OpenAI, Google, Microsoft, etc. And quite often at executive levels even.
Silicon Valley has no moat other than money. It kind of runs on openness and freedom of movement of people. Companies constantly poach people from each other. And there's a constant movement of people (and knowledge) in and out of the area. Money is what attracts these people and keeps them there for a while. But of course that status quo was upset a little bit with VCs turning into penny-pinching misers lately and lockdowns proving (to them) that it was cheaper to host your tech teams remotely. Which means knowledge is now more distributed than it used to be.
So, it's not surprising that people outside of Silicon Valley are not waiting patiently for OpenAI to do whatever it is they are doing in between having moral existential crises, trying to oust their CEO, pontificating about AGIs, etc. They are taking things into their own hands. The brute force / VC funding driven approach that OpenAI has used yielded massive results in the last few years. But ever since Meta opensourced their models, OSS models and optimizations have been catching up.
On a hardware resource usage basis, these models started to outperform their bigger peers last year and now the game is up for the training process as well. Meaning they get better results for the same money. A major hurdle here was the model training process. Which the Chinese seem to have proven can be massively optimized as well. Cutting cost by a few orders of magnitude is a big deal. And at the same time doing the same thing at larger scale (aka. throwing more money at the problem) seems to have diminishing returns.
Until that changes, that means the playing field has somewhat leveled now. That's a good thing.
> This is exactly why it is not “US vs China”, the battle is between heavily-capitalized Silicon Valley companies versus open source.
Ah yes the "open source" code that was not released by the DeepSeek team and the tens of thousands of professional grade GPUs that were contributed by the "community".
DeepSeek is based on Llama which was produced by ... Meta.
DeepSeek v3/R1 isn't based on the Llama architecture. It uniquely combines and contributes several novel approaches.
Meta never released a mixture-of-experts model (they failed to train a good one, according to reliable rumors). And MoE is just one of the few ingredients that make DeepSeek v3/R1 interesting and good.
I like that it's open source, but ultimately it is China, so we can't trust it.
It's trivial to implement bias in models (hence the no-no filters in ChatGPT), so if they're smart they'll do what they do with TikTok and make the answers different for their rivals.
The thing about open source is that you don't need to trust it.
They shared their methodology, so if they are legit, someone else will reproduce what they did very quickly. I expect Meta, Amazon, Google, and Anthropic are on the case right now.
From this list, the only one I trust any more than I trust Deepseek is Anthropic.
The other three have shown they'll instantly bend the knee to whomever is in power, and that's exactly the same thing people are worried that Deepseek is doing.
Last year I would have said I trusted American companies more than Chinese. But last year feels like a long time ago.
Anthropic was the only company on that list not to have paid for their CEO or founder to attend the inauguration of the current ruler of the executive branch of the US a week ago.
Because, from my non-expert but reasonably well informed understanding of world affairs, there's a very high chance of a Chinese company in an important area like AI having to bend the knee to Xi Jinping pretty much exactly as the American companies are doing with Trump.
> The other three have shown they'll instantly bend the knee to whomever is in power, and that's exactly the same thing people are worried that Deepseek is doing.
Open source has pragmatic merits and I love the culture. But I don't like associating it with a moral high ground just because it doesn't charge you money. By this standard, we should also ask Intel/AMD to open-source their CPUs, video game studios to open-source their code and artifacts, and Google/Amazon for their search engine and infrastructure. Not all business sectors can sustain themselves with the open source model.
> By this standard, we should also ask Intel/AMD to open-source their CPUs, video game studios to open-source their code and artifacts, and Google/Amazon for their search engine and infrastructure.
The freedom is mostly not about the money. The 3D model Benchy was free while not being free, as people found out. Luckily the copyright owners have treated it as if people were free to use it, for now... but that could change.
super cool to see an open initiative like this—love the idea of replicating DeepSeek-R1 in a transparent way.
I do like the idea of making these reasoning techniques accessible to everyone. If they really manage to replicate the results of DeepSeek-R1, especially on a smaller budget, that’s a huge win for open-source AI.
I’m all for projects that push innovation and share the process with others, even if it’s messy.
But yeah—lots of hurdles. They might hit a wall because they don’t have DeepSeek’s original datasets.