I skimmed through the book, and I think it does a very poor job of showcasing how R and Python are actually used alongside each other in industry.
To be fair, the book advertises showing R and Python code side-by-side. And that’s what it does. But it does it unlike how the languages are most often used in industry.
As a quick example, I saw no tidyverse code, which is essentially the only thing keeping R in the game. Learning R from this book won’t prepare you for writing R in most R shops.
I don’t see the utility in knowing how to do the same thing in both Python and R if you’re a beginner. This is even more true if you’re not taking advantage of the strengths/weaknesses of either language.
Instead, just learn one of the languages well, and then learn the other well. Shallow dives in both will make you weak in both.
Unfortunately, 90% of data science content seems to be geared at beginners.
> tidyverse code, which is essentially the only thing keeping R in the game
In my experience this is not the case. In biomedicine and bioinformatics few people actually use the tidyverse, because the data is much better represented as a matrix, not in the "tidy" form.
Outside of that, corporations (well, at least the two I contracted with) used `data.table` explicitly. Join 3 ad-click dataframes matching by userID, sessionID and the closest possible time-point - that's one line in `data.table`.
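A minimal sketch of what I mean (all the table and column names here are made up, and I've split it over two joins for readability; the key piece is `roll = "nearest"` on the timestamp column):

```r
library(data.table)

# clicks, sessions, conversions: hypothetical data.tables that each
# carry userID, sessionID and a timestamp column ts
step1  <- sessions[clicks, on = .(userID, sessionID, ts), roll = "nearest"]
merged <- conversions[step1, on = .(userID, sessionID, ts), roll = "nearest"]
```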
Tidyverse is well suited for learning and for managing (relatively) simple datasets, but it becomes cumbersome for more complex data. It can be used for those data too, of course - it's just that you end up adding ad-hoc solutions, and it may get in the way more than it helps.
I've only used base R for my medical data (subsetting dataframes and such). Very rarely do I need tidy, and I also find the pipe operator makes debugging harder. If and when I need it, I'll use it; that's that.
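For what it's worth, this is the kind of base-R workflow I mean (just a sketch; `df` and the `id`, `age`, `group` columns are hypothetical):

```r
# Base R: plain subsetting and aggregation, easy to inspect step by step
adults <- df[df$age >= 18 & !is.na(df$age), c("id", "age", "group")]
mean_by_group <- aggregate(age ~ group, data = adults, FUN = mean)
```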
I think R has many more packages in medicine, especially statistical packages; many fields within medicine care about inference, not just prediction/forecasting. So I disagree with the "essentially the only thing keeping R in the game". The breadth of packages in R is one of the many things that keep R in the game.
The tribalism and highly biased comments make it very toxic and harder to have an honest discourse.
They are just tools, use what makes you happy and get the job done.
I have a similar feeling, and that is why I spent one whole chapter on data.table (and pandas). I hope more R users would like to learn and use data.table.
I agree for the most part, but R does have a few things beyond the tidyverse: built-in dataframe support, lots of domain-specific packages, more consistent interfaces for basic statistics and machine learning models, etc. Python is definitely better for matrices (because of NumPy) and anything involving custom gradient descent methods (because of TensorFlow).
I think 90% of data science content is for beginners because anything more advanced isn't best described as data science. As soon as you get beyond the initial stages of data analysis (cleaning and processing data), you're doing something best described as some other word (statistics, machine learning, etc.) - although, granted, there isn't much content in these areas if you don't know _exactly_ what you're looking for.
Even if you ignore the tidyverse, the example code for "roll your own linear regressions by hand" uses the R6 object system, which is... not even one of the two popular object systems for R (which are S3 and S4). No beginner needs to learn how to write classes in R.
> No beginner needs to learn how to write classes in R.
a) Using classes properly is great for R users of all levels;
b) A major reason that classes are not widely used (by beginners) is that S3/S4 are not easy to follow. R6 provides a natural and clear way to understand and write classes, especially for beginners - see the sketch below.
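For example, a minimal R6 class (the `Accumulator` name and fields are purely illustrative):

```r
library(R6)

Accumulator <- R6Class("Accumulator",
  public = list(
    total = 0,
    add = function(x) {
      self$total <- self$total + x
      invisible(self)   # returning self enables method chaining
    }
  )
)

acc <- Accumulator$new()
acc$add(3)$add(4)
acc$total  # 7
```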
Using classes at all is unnecessary for most R users. R is really, to the extent that paradigms matter to the average R user at all (which is: not much), a functional-first language. The idiomatic way to deal with the things you would use classes for is to use functions and closures. There are people who need objects in R, which is why R has object systems available, but it is of no help to a beginner to know them -- it doesn't help them interact with the code they are going to see, and they don't have the background to understand why you would use classes instead of functions.
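To make that concrete, here is the same toy accumulator as the R6 example above, done with a closure instead of a class (just a sketch):

```r
make_accumulator <- function() {
  total <- 0               # state captured in the enclosing environment
  function(x) {
    total <<- total + x    # update the captured state
    total
  }
}

acc <- make_accumulator()
acc(3)
acc(4)  # 7
```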
Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.
> lots of domain-specific packages
That's true.
> more consistent interfaces for basic statistics and machine learning models
Can't disagree more - there is no one go-to library for ML in R (like sklearn in Python), and each package has its own strange interface and implementation.
> Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.
I've been fortunate to only work on projects that use built-in data frames, never encountered tibble or data.table in the wild.
> there is no one go-to library for ML in R (like sklearn in Python) and each package has it's own strange interface and implementation.
I still disagree here - one example being the unified interface for generalized linear models. Also, the vast majority of classifiers (RF, SVM, etc.) have similar or identical interfaces, and there's the unified `predict` interface as well (a rough sketch below). Granted, `sklearn` does have a consistent API too.
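A quick illustration of what I mean, using the built-in mtcars data (the models themselves are arbitrary):

```r
# the same formula interface and the same predict() generic work
# across lm, glm and most modelling packages
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

predict(fit_lm,  newdata = head(mtcars))
predict(fit_glm, newdata = head(mtcars), type = "response")
```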
That said, some of this is just a personal preference for the vaguely functional interface in R. The object-orientedness in Python feels a little forced for some tasks in `sklearn`.
> because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it.
To make what I think your point is more explicit, people build their own things in R because R must maintain compatibility with S. So by and large, changes happen in packages and not the base language. This does lead to a proliferation of solutions for the same kinds of problems.
You mean like keras? or tensorflow?
Or base random forest. You know, like the original Breiman implementation.
Python has utility. But R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.
I run a machine learning shop. Right now all of the training, application, and data management is handled via R. R is simply superior in too many ways for us to be bothered with python for the scale of work we are doing.
Since we're moving some big applications to keras/TF we do use Python and will be using more in the future. However, for almost all data management, munging, movement, visualization, and reporting, it's an R world.
> You mean like keras? or tensorflow? Or base random forest. You know, like the original Breiman implementation.
> ...
> Since we're moving some big applications to keras/ TF we do use python and will be using more in the future.
Not sure if I misunderstood, or you're contradicting yourself there.
> R is far superior in the quality of its packages, their documentation, their ability to behave predictably on a given data type.
I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.
> > R is far superior in the quality of its packages, their documentation, their ability to behave predictably on a given data type.
> I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.
I partially agree with you here. I'm extremely careful about what non-standard packages I use in R. Code quality varies wildly outside of these, likewise for documentation. But outside of neural networks, I've never found a package in Python that I felt better about in terms of code quality or documentation than its equivalent in R.
My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's mix-and-match / dealer's choice on what you like to work in (since the backends of both are identical).
The primary reason for moving these to Python is convenience / the community. Most new work is published in Python. If we find a new/interesting model we want to implement, it's probably written in Python. Rather than reskin the thing in its entirety, it's easier here to work in Python.
A couple of disclaimers: my group works primarily in geospatial data, and principally in LiDAR and multispectral imagery.
The coarse division I see between R and Python is that if you come from a research/academic background (non-engineering), you probably learned to program in R. If you were an engineer, you probably learned MATLAB. If you are self-taught / Coursera / YouTube, you probably learned in Python.
R libraries are generally more geared towards academic research, and specifically towards working within existing frameworks (handling geospatial data as geospatial data rather than turning them into numpy arrays). Working in Python, there is far more reinvention of the wheel, and it's always a pain in the ass to get things back into the structures they came in as.
Python has huge utility and is an important tool for certain work. But it's really, really not faster than R (it definitely used to be, but this isn't the case any more).
R has better support for scientific programming than Python.
> My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's mix-and-match / dealer's choice on what you like to work in (since the backends of both are identical).
Not as a point of argument, just additional information:
R's support for keras and TF is a wrapper around the Python interface to those libraries.
numpy is significantly faster and arguably more usable (e.g. broadcasting) than anything in R, and only recently has there been progress in more efficient matrix manipulation in R like rray[0], a wrapper for xtensor.
As far as I can tell, R uses BLAS for matrix operations, and Python probably does the same, so in terms of efficiency I wouldn't expect a big difference between the two.
Both R and numpy use BLAS, and if both are linked to the same library, say OpenBLAS or Intel MKL, then performance is in fact almost identical for expensive operations like matrix multiplication. (R also ships with its own internal BLAS implementation, which is reliable but not very fast, and I believe is still single threaded, so the first thing you should do if you are using R and care about performance is to swap it out.)
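If you want to check what you're linked against before benchmarking anything, something like this works in recent R (>= 3.4); the paths reported will obviously vary by install:

```r
# sessionInfo() reports the BLAS/LAPACK shared libraries R is using
si <- sessionInfo()
si$BLAS
si$LAPACK

# crude sanity check of dense matrix-multiply speed
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)
```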
For more sophisticated linear algebra algorithms, such as SVD, both will typically use LAPACK, and again, both will exhibit essentially identical performance.
There is one important difference though: when R is compiled for 64-bit machines, it can only use 64-bit floats, while numpy supports 32-bit and even (through software emulation) 16-bit floats. This can halve memory usage, which in turn halves cache misses, which results in a significant speed-up in cases where 64 bits of precision are not needed.
This is really interesting! I had always just claimed that Python was faster (see the benchmark I linked above [1]) based on personal experience. I wonder if this internal implementation has something to do with it...
So then I would assume you must be working with tables of fewer than 1000 rows, because that's pretty much the only case where it doesn't matter. At anything more than 1k rows, the differences are substantial.
Hundreds of rows is about usual for me. I do analysis on clinical studies with human participants. Nothing too tricky, most of my munging runs in effectively zero time.
I was going to make this point, but yeah.
The only thing I think people have a bit of a hard time with is how you do operations in data.table. If you are coming from plyr/dplyr, the transition can be difficult. However, I've found that the more I do, the more I prefer it, in spite of the fact that the main reason I use dt over tidy is the phenomenal performance gain.
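For anyone curious about the transition, the rough correspondence is that dplyr's chained verbs collapse into data.table's single `[i, j, by]` call (the `df` data frame and its columns here are hypothetical):

```r
library(data.table)

# dplyr:
#   df %>% filter(year == 2019) %>% group_by(id) %>% summarise(total = sum(x))
# data.table equivalent, everything inside one [i, j, by]:
dt <- as.data.table(df)
dt[year == 2019, .(total = sum(x)), by = id]
```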
That's pretty much the only group of people who will use this, though. Those that are serious about it or have some background won't really look at another book on data science, and will probably do the necessary research themselves.
This is my initial thought. I can't imagine learning more than one syntax at once. This is probably aimed towards eidetic folks (tbh, probably not, but it should be).
Yeah, to your last thought - without being too cynical, it looks like the author wanted to write a beginner data science book but realised that the good beginner data science books are already written (as well as many bad ones).
I've been in that position; for most situations, I can't imagine it being efficient for the data scientists to move over to Python themselves, if they don't have experience (or interest) in software engineering.
Depends on what you're interested in, your goals and your starting point. You have two main learning lanes:
1. Theory.
Math, calculus, linear algebra, probability, statistics, ML algorithms. ISLR is a very good beginner-ish resource that helps you understand the algorithms but doesn't go too deep into the math. As you go deeper, you may realise that you have gaps in your math knowledge and you need to cover a lot more probability, calculus and linear algebra.
2. Programming.
2.1. Good software engineering practices - writing maintainable code, design patterns, version control, unit testing etc.
2.2. Tooling - knowing the language-specific ecosystem of libraries (the OP is an attempt to teach you this in two languages at the same time). This is what most beginner resources focus on; your knowledge here has the least transferability and tends to go out of date quickly. Still, using the right tools and knowing them well goes a long way.
I'm sorry in advance if I'm taking too much of your time.
#2: Coding is my hobby and I have been writing well-designed apps for a long time, so that's not an issue.
#3: ISLR teaches how to do ML algorithms in R, so there goes that point.
#1 is what I'd like more information on. How important is the maths for working as a data scientist? I'm planning on ISLR and then maybe ESL or some advanced course.
A few of my friends work in data science and they said that maths isn't that important - as in, one needs to know the formulas and why things work the way they do, not rote-learn the math.
It'd be great if you could list some intermediate courses; the learning market has drowned good tutorials and books in not-so-good guides!
> A few of my friends work on Data science and they said that maths isn't that important
Without the math you won't understand anything in ESL.
Which might be okay if the job doesn't require you to go into that much depth - some data science jobs are more focused on research (very math-heavy), some on ETL and/or engineering, others on business understanding and communication; it's a really broad title.
SAS is still used by companies who don't want to use open source tools and would rather pay a lot of money for an established product name and have a support line that they can call if anything goes wrong. Understandably, it's slowly dying out.
Yes, though at least where I am it tends to be a legacy system these days. So that means governments and large corporations might reasonably feature some use of it somewhere within their walls.
I only read the chapter on optimization/linear programming - it's far too brief and misses key concepts for it to be useful. There are no real applications of the concepts covered in the section.
Yes, it is very brief. I tried to give some useful reference books/papers/links for each subject that appears in the book; hope that is useful if readers are willing to dive deeper. But the concept of linear programming itself is not complex, and users are generally not required to understand the simplex/interior point algorithms. If they are able to translate the actual problem into code, then it is basically done. It is not like constrained optimization, for which customized treatment based on mathematical theory is more important than writing the code.
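As an illustration of what "translate the problem into code" looks like, here is a toy LP using the lpSolve package (a sketch only; other LP solver interfaces look similar):

```r
library(lpSolve)

# maximise 3x + 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0
obj <- c(3, 2)
con <- rbind(c(1, 1),
             c(1, 3))
dir <- c("<=", "<=")
rhs <- c(4, 6)

res <- lp("max", obj, con, dir, rhs)
res$solution  # optimal x and y
res$objval    # optimal objective value
```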
So? This whole argument over which programming language is the best all seems a bit tools-first to me. I used to program in R and was a package owner, but switched to python because it had stuff I needed (TF before it ever got ported to R).
Nowadays I program in Python because it has all I need. If something comes out in Julia that makes the cost of picking up another language worth it, then I'll do it without a second thought. Until then, why bother? The "don't waste your time" argument can cut both ways, you know.
To clarify: I see no inherent reason to not program in Julia or any other language. But you work with whatever gets your job done efficiently and right now that's neither of those languages for a lot of people.
Uh, adoption by the community at large? The best program is the one you didn't have to write because a package existed for it already. Just because a language has a foreign function interface doesn't mean it's easy to interact with other libraries.
R, Python, Matlab, and C++ are the big dogs in scientific programming, and the inertia of having a large community behind them will continue to drive adoption.
It's a matter of ecosystem of packages. R has a huge number of packages for many fields. Python has fewer, but might work for particular use cases. I was excited for Julia, and played around with it since 0.2, but it really hasn't generated very many packages of note in my particular field (bioinformatics).
I mean fewer (and less developed) in the context of data analysis and statistics where it is a competitor to R. Every time I consider using Python after getting frustrated with the more ugly features of R as a language, I take a look at what's available in Python, as from a language point of view it is a bit nicer (as is Julia). But then I see what I would have to reimplement myself if I switched, so I don't.
It would really still depend on the support libraries around the task you want to write, if you really need more performance for the parts that those support libraries don't cover (and beyond what you get with PyPy or Numba) and which language you have more experience with.
If you're really going down to the FFI, it's hard to think it wouldn't be more productive, but that's not what a true beginner like the target of this book would do. Though it's quite nice to quickly extend some tool for your purpose without compromising anything or to understand how something works thanks to being written in high level code.
Syntax-wise, Julia's Common Lisp-like feature set gives the language a lot of power, but normal use will probably be just on par with Python in terms of productivity.
I love Julia, but feel the interop story could use some work. If I want to have my python package depend on some Julia code, how easy is that? Last time I checked (a few months ago), it was pretty difficult.
That's an opinion, my friend, not a fact. If you want to be taken seriously, be careful with your assertions and always have hard data to back them up.
I wholeheartedly agree. Python is garbage for data science. If an industrial-grade NN library were written for it, plus some quant libraries, I think most people would switch.
I work in finance doing data science-y things and have yet to meet anyone who doesn’t think that Python is a pile of garbage.
People used to make the easy to learn argument, but Julia is even easier. And more elegant, extensible, and faster.
> If an industrial-grade NN library were written for it
Can you tell us which language and industrial-grade NN library you're using?
Because from where I'm sitting, I see that Python is the only language that gets first-class support for both Tensorflow and Pytorch. It's so far ahead for working with NNs that it's not even close.
Whenever I see the “I’ve never met x” argument, I’m always looking out for the catch. Because it’s a bad argument to begin with. But the catch is almost always in a situation where x is everywhere, and you would have to have your head willfully stuck in a hole in the ground to not see it.
Not sure how that's relevant. Everything is written in something else.
Python itself is written in C. The Julia github repo shows Julia 68.2%, C 16.3%, C++ 10.4%, Scheme 3.2%. R is a mix of C, C++, R and some Fortran I think...
It is relevant from the point of view of what a developer is able to achieve without being forced to drop down to a 2nd programming language, aka "2 language syndrome".
And how many Python libraries are just plain wrappers, not really written in Python.
Are you suggesting that data scientists use C++ for day-to-day work? Those libraries have first-class wrappers in Python (there is R support, but not at the same level).