I skimmed through the book, and I think it does a very poor job of showcasing how R and Python are actually used alongside each other in industry.
To be fair, the book advertises showing R and Python code side-by-side. And that’s what it does. But it does it unlike how the languages are most often used in industry.
As a quick example, I saw no tidyverse code, which is essentially the only thing keeping R in the game. Learning R from this book won’t prepare you for writing R in most R shops.
I don’t see the utility in knowing how to do the same thing in both Python and R if you’re a beginner. This is even more true if you’re not taking advantage of the strengths/weaknesses of either language.
Instead, just learn one of the languages well, and then learn the other well. Shallow dives in both will make you weak in both.
Unfortunately, 90% of data science content seems to be geared at beginners.
> tidyverse code, which is essentially the only thing keeping R in the game
In my experience this is not the case. In biomedicine and bioinformatics few people actually use the tidyverse, because the data is much better represented as a matrix, not in the "tidy" form.
Outside of that, corporations (well, at least the two I contracted with) used `data.table` explicitly. Join 3 ad-click dataframes matching by userID, sessionID and the closest possible time-point - that's one line in `data.table`.
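A minimal sketch of what I mean (all the table and column names here are made up, and I've split it over two joins for readability; the key piece is `roll = "nearest"` on the timestamp column):

```r
library(data.table)

# clicks, sessions, conversions: hypothetical data.tables that each
# carry userID, sessionID and a timestamp column ts
step1  <- sessions[clicks, on = .(userID, sessionID, ts), roll = "nearest"]
merged <- conversions[step1, on = .(userID, sessionID, ts), roll = "nearest"]
```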
Tidyverse is well suited for learning and for managing (relatively) simple datasets, but it becomes cumbersome for more complex data. It can be used for those data too, of course - it's just that you end up adding ad-hoc solutions, and it may get in the way more than it helps.
I've only used base R for my medical data (subsetting dataframes and such). Very rarely do I need tidy, and I also find the pipe operator makes debugging harder. If and when I need it, I'll use it; that's that.
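For what it's worth, this is the kind of base-R workflow I mean (just a sketch; `df` and the `id`, `age`, `group` columns are hypothetical):

```r
# Base R: plain subsetting and aggregation, easy to inspect step by step
adults <- df[df$age >= 18 & !is.na(df$age), c("id", "age", "group")]
mean_by_group <- aggregate(age ~ group, data = adults, FUN = mean)
```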
I think R has many more packages in medicine, especially statistical packages; many fields within medicine care about inference, not just prediction/forecasting. So I disagree with the "essentially the only thing keeping R in the game". The breadth of packages in R is one of the many things that keep R in the game.
The tribalism and highly biased comments make it very toxic and harder to have an honest discourse.
They are just tools, use what makes you happy and get the job done.
I have a similar feeling, and that is why I spent one whole chapter on data.table (and pandas). I hope more R users would like to learn and use data.table.
I agree for the most part, but R does have a few things beyond the tidyverse: built-in dataframe support, lots of domain-specific packages, more consistent interfaces for basic statistics and machine learning models, etc. Python is definitely better for matrices (because of NumPy) and anything involving custom gradient descent methods (because of TensorFlow).
I think 90% of data science content is for beginners because anything more advanced isn't best described as data science. As soon as you get beyond the initial stages of data analysis (cleaning and processing data), you're doing something best described as some other word (statistics, machine learning, etc.) - although, granted, there isn't much content in these areas if you don't know _exactly_ what you're looking for.
Even if you ignore the tidyverse, the example code for "roll your own linear regressions by hand" uses the R6 object system, which is... not even one of the two popular object systems for R (which are S3 and S4). No beginner needs to learn how to write classes in R.
> No beginner needs to learn how to write classes in R.
a) Using classes properly is great for R users of all levels;
b) A major reason that classes are not widely used (by beginners) is that S3/S4 are not easy to follow. R6 provides a natural and clear way to understand and write classes, especially for beginners - see the sketch below.
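For example, a minimal R6 class (the `Accumulator` name and fields are purely illustrative):

```r
library(R6)

Accumulator <- R6Class("Accumulator",
  public = list(
    total = 0,
    add = function(x) {
      self$total <- self$total + x
      invisible(self)   # returning self enables method chaining
    }
  )
)

acc <- Accumulator$new()
acc$add(3)$add(4)
acc$total  # 7
```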
Using classes at all is unnecessary for most R users. R is really, to the extent that paradigms matter to the average R user at all (which is: not much), a functional-first language. The idiomatic way to deal with the things you would use classes for is to use functions and closures. There are people who need objects in R, which is why R has object systems available, but it is of no help to a beginner to know them -- it doesn't help them interact with the code they are going to see, and they don't have the background to understand why you would use classes instead of functions.
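To make that concrete, here is the same toy accumulator as the R6 example above, done with a closure instead of a class (just a sketch):

```r
make_accumulator <- function() {
  total <- 0               # state captured in the enclosing environment
  function(x) {
    total <<- total + x    # update the captured state
    total
  }
}

acc <- make_accumulator()
acc(3)
acc(4)  # 7
```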
Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.
> lots of domain-specific packages
That's true.
> more consistent interfaces for basic statistics and machine learning models
Can't disagree more - there is no one go-to library for ML in R (like sklearn in Python), and each package has its own strange interface and implementation.
> Not an advantage if you ask me - exactly because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it. That's how R ended up with 3 different structures that are similar but have inconsistent APIs and behaviour.
I've been fortunate to only work on projects that use built-in data frames, never encountered tibble or data.table in the wild.
> there is no one go-to library for ML in R (like sklearn in Python) and each package has it's own strange interface and implementation.
I still disagree here - one example being the unified interface for generalized linear models. Also, the vast majority of classifiers (RF, SVM, etc.) have similar or identical interfaces, and there's the unified `predict` interface as well (a rough sketch below). Granted, `sklearn` does have a consistent API too.
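A quick illustration of what I mean, using the built-in mtcars data (the models themselves are arbitrary):

```r
# the same formula interface and the same predict() generic work
# across lm, glm and most modelling packages
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)
fit_glm <- glm(am ~ wt + hp, data = mtcars, family = binomial)

predict(fit_lm,  newdata = head(mtcars))
predict(fit_glm, newdata = head(mtcars), type = "response")
```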
That said, some of this is just a personal preference for the vaguely functional interface in R. The object-orientedness in Python feels a little forced for some tasks in `sklearn`.
> because data.frame is built in, people have been building their own versions (tibble, data.table) instead of improving it.
To make what I think your point is more explicit, people build their own things in R because R must maintain compatibility with S. So by and large, changes happen in packages and not the base language. This does lead to a proliferation of solutions for the same kinds of problems.
You mean like keras? or tensorflow?
Or base random forest. You know, like the original Breiman implementation.
Python has utility. But R is far superior in the quality of its packages, their documentation, and their ability to behave predictably on a given data type.
I run a machine learning shop. Right now all of the training, application, and data management is handled via R. R is simply superior in too many ways for us to be bothered with python for the scale of work we are doing.
Since we're moving some big applications to keras/TF we do use Python and will be using more in the future. However, for almost all data management, munging, movement, visualization, and reporting, it's an R world.
> You mean like keras? or tensorflow? Or base random forest. You know, like the original Breiman implementation.
> ...
> Since we're moving some big applications to keras/ TF we do use python and will be using more in the future.
Not sure if I misunderstood, or you're contradicting yourself there.
> R is far superior in the quality of its packages, their documentation, their ability to behave predictably on a given data type.
I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.
> > R is far superior in the quality of its packages, their documentation, their ability to behave predictably on a given data type.
> I not only disagree, but I think that the exact opposite is true for each one of these points. But if things are working well in your shop, I'm not going to try to convince you otherwise.
I partially agree with you here. I'm extremely careful about what non-standard packages I use in R. Code quality varies wildly outside of these, likewise for documentation. But outside of neural networks, I've never found a package in Python that I felt better about in terms of code quality or documentation than its equivalent in R.
My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's mix-and-match / dealer's choice on what you like to work in (since the backends of both are identical).
The primary reason for moving these to Python is convenience / the community. Most new work is published in Python. If we find a new/interesting model we want to implement, it's probably written in Python. Rather than reskin the thing in its entirety, it's easier here to work in Python.
A couple of disclaimers: my group works primarily in geospatial data, and principally in LiDAR and multispectral imagery.
The coarse division I see between R and Python is that if you come from a research/academic background (non-engineering), you probably learned to program in R. If you were an engineer, you probably learned MATLAB. If you are self-taught / Coursera / YouTube, you probably learned in Python.
R libraries are generally more geared towards academic research, and specifically towards working within existing frameworks (handling geospatial data as geospatial data rather than turning them into numpy arrays). Working in Python, there is far more reinvention of the wheel, and it's always a pain in the ass to get things back into the structures they came in as.
Python has huge utility and is an important tool for certain work. But it's really, really not faster than R (it definitely used to be, but this isn't the case any more).
R has better support for scientific programming than Python.
> My point behind the keras/TF comment is that the libraries have front ends in both Python and R, so it's mix-and-match / dealer's choice on what you like to work in (since the backends of both are identical).
Not as a point of argument, just additional information:
R's support for keras and TF is a wrapper around the Python interface to those libraries.
numpy is significantly faster and arguably more usable (e.g. broadcasting) than anything in R, and only recently has there been progress in more efficient matrix manipulation in R like rray[0], a wrapper for xtensor.
As far as I can tell, R uses BLAS for matrix operations, and Python probably does the same, so in terms of efficiency I wouldn't expect a big difference between the two.
Both R and numpy use BLAS, and if both are linked to the same library, say OpenBLAS or Intel MKL, then performance is in fact almost identical for expensive operations like matrix multiplication. (R also ships with its own internal BLAS implementation, which is reliable but not very fast, and I believe is still single threaded, so the first thing you should do if you are using R and care about performance is to swap it out.)
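If you want to check what you're linked against before benchmarking anything, something like this works in recent R (>= 3.4); the paths reported will obviously vary by install:

```r
# sessionInfo() reports the BLAS/LAPACK shared libraries R is using
si <- sessionInfo()
si$BLAS
si$LAPACK

# crude sanity check of dense matrix-multiply speed
n <- 2000
A <- matrix(rnorm(n * n), n, n)
system.time(A %*% A)
```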
For more sophisticated linear algebra algorithms, such as SVD, both will typically use LAPACK, and again, both will exhibit essentially identical performance.
There is one important difference though: when R is compiled for 64-bit machines, it can only use 64-bit floats, while numpy supports 32-bit and even (through software emulation) 16-bit floats. This can halve memory usage, which in turn halves cache misses, which results in a significant speed-up in cases where 64 bits of precision are not needed.
This is really interesting! I had always just claimed that Python was faster (see the benchmark I linked above [1]) based on personal experience. I wonder if this internal implementation has something to do with it...
So then I would assume you must be working with tables of fewer than 1000 rows, because that's pretty much the only case where it doesn't matter. At anything more than 1k rows, the differences are substantial.
Hundreds of rows is about usual for me. I do analysis on clinical studies with human participants. Nothing too tricky, most of my munging runs in effectively zero time.
I was going to make this point, but yeah.
The only thing I think people have a bit of a hard time with is how you do operations in data.table. If you are coming from plyr/dplyr, the transition can be difficult. However, I've found that the more I do, the more I prefer it, in spite of the fact that the main reason I use dt over tidy is the phenomenal performance gain.
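For anyone curious about the transition, the rough correspondence is that dplyr's chained verbs collapse into data.table's single `[i, j, by]` call (the `df` data frame and its columns here are hypothetical):

```r
library(data.table)

# dplyr:
#   df %>% filter(year == 2019) %>% group_by(id) %>% summarise(total = sum(x))
# data.table equivalent, everything inside one [i, j, by]:
dt <- as.data.table(df)
dt[year == 2019, .(total = sum(x)), by = id]
```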
That's pretty much the only group of people who will use this, though. Those that are serious about it or have some background won't really look at another book on data science, and will probably do the necessary research themselves.
This is my initial thought. I can't imagine learning more than one syntax at once. This is probably aimed towards eidetic folks (tbh, probably not, but it should be).
Yeah, to your last thought - without being too cynical, it looks like the author wanted to write a beginner data science book but realised that the good beginner data science books are already written (as well as many bad ones).
I've been in that position; for most situations, I can't imagine it being efficient for the data scientists to move over to Python themselves, if they don't have experience (or interest) in software engineering.
Depends on what you're interested in, your goals and your starting point. You have two main learning lanes:
1. Theory.
Math, calculus, linear algebra, probability, statistics, ML algorithms. ISLR is a very good beginner-ish resource that helps you understand the algorithms but doesn't go too deep into the math. As you go deeper, you may realise that you have gaps in your math knowledge and you need to cover a lot more probability, calculus and linear algebra.
2. Programming.
2.1. Good software engineering practices - writing maintainable code, design patterns, version control, unit testing etc.
2.2. Tooling - knowing the language-specific ecosystem of libraries (the OP is an attempt to teach you this in two languages at the same time). This is what most beginner resources focus on; your knowledge here has the least transferability and tends to go out of date quickly. Still, using the right tools and knowing them well goes a long way.
I'm sorry in advance if I'm taking too much of your time.
#2: Coding is my hobby and I have been writing well-designed apps for a long time, so that's not an issue.
#3: ISLR teaches how to do ML algorithms in R, so there goes that point.
#1 is what I'd like more information on. How important is the maths for working as a data scientist? I'm planning on ISLR and then maybe ESL or some advanced course.
A few of my friends work in data science and they said that maths isn't that important - as in, one needs to know the formulas and why things work the way they do, not rote-learn the math.
It'd be great if you could list some intermediate courses; the learning market has drowned good tutorials and books in not-so-good guides!
> A few of my friends work on Data science and they said that maths isn't that important
Without the math you won't understand anything in ESL.
Which might be okay if the job doesn't require you to go into that much depth - some data science jobs are more focused on research (very math-heavy), some on ETL and/or engineering, others on business understanding and communication; it's a really broad title.
SAS is still used by companies who don't want to use open source tools and would rather pay a lot of money for an established product name and have a support line that they can call if anything goes wrong. Understandably, it's slowly dying out.
Yes, though at least where I am it tends to be a legacy system these days. So that means governments and large corporations might reasonably feature some use of it somewhere within their walls.
I only read the chapter on optimization/linear programming - it's far too brief and misses key concepts for it to be useful. There are no real applications of the concepts covered in the section.
Yes, it is very brief. I tried to give some useful reference books/papers/links for each subject that appears in the book; hope that is useful if readers are willing to dive deeper. But the concept of linear programming itself is not complex, and users are generally not required to understand the simplex/interior point algorithms. If they are able to translate the actual problem into code, then it is basically done. It is not like constrained optimization, for which customized treatment based on mathematical theory is more important than writing the code.
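As an illustration of what "translate the problem into code" looks like, here is a toy LP using the lpSolve package (a sketch only; other LP solver interfaces look similar):

```r
library(lpSolve)

# maximise 3x + 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0
obj <- c(3, 2)
con <- rbind(c(1, 1),
             c(1, 3))
dir <- c("<=", "<=")
rhs <- c(4, 6)

res <- lp("max", obj, con, dir, rhs)
res$solution  # optimal x and y
res$objval    # optimal objective value
```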
So? This whole argument over which programming language is the best all seems a bit tools-first to me. I used to program in R and was a package owner, but switched to python because it had stuff I needed (TF before it ever got ported to R).
Nowadays I program in Python because it has all I need. If something comes out in Julia that makes the cost of picking up another language worth it, then I'll do it without a second thought. Until then, why bother? The "don't waste your time" argument can cut both ways, you know.
To clarify: I see no inherent reason to not program in Julia or any other language. But you work with whatever gets your job done efficiently and right now that's neither of those languages for a lot of people.
Uh, adoption by the community at large? The best program is the one you didn't have to write because a package existed for it already. Just because a language has a foreign function interface doesn't mean it's easy to interact with other libraries.
R, Python, Matlab, and C++ are the big dogs in scientific programming, and the inertia of having a large community behind them will continue to drive adoption.
It's a matter of ecosystem of packages. R has a huge number of packages for many fields. Python has fewer, but might work for particular use cases. I was excited for Julia, and played around with it since 0.2, but it really hasn't generated very many packages of note in my particular field (bioinformatics).
I mean fewer (and less developed) in the context of data analysis and statistics where it is a competitor to R. Every time I consider using Python after getting frustrated with the more ugly features of R as a language, I take a look at what's available in Python, as from a language point of view it is a bit nicer (as is Julia). But then I see what I would have to reimplement myself if I switched, so I don't.
It would really still depend on the support libraries around the task you want to write, if you really need more performance for the parts that those support libraries don't cover (and beyond what you get with PyPy or Numba) and which language you have more experience with.
If you're really going down to the FFI, it's hard to think it wouldn't be more productive, but that's not what a true beginner like the target of this book would do. Though it's quite nice to quickly extend some tool for your purpose without compromising anything or to understand how something works thanks to being written in high level code.
Syntax-wise, Julia's Common Lisp-like feature set gives the language a lot of power, but normal use will probably be just on par with Python in terms of productivity.
I love Julia, but feel the interop story could use some work. If I want to have my python package depend on some Julia code, how easy is that? Last time I checked (a few months ago), it was pretty difficult.
That's an opinion, my friend, not a fact. If you want to be taken seriously, be careful with your assertions and always have hard data to back them up.
I wholeheartedly agree. Python is garbage for data science. If an industrial-grade NN library were written for it, plus some quant libraries, I think most people would switch.
I work in finance doing data science-y things and have yet to meet anyone who doesn’t think that Python is a pile of garbage.
People used to make the easy to learn argument, but Julia is even easier. And more elegant, extensible, and faster.
> If an industrial-grade NN library were written for it
Can you tell us which language and industrial-grade NN library you're using?
Because from where I'm sitting, I see that Python is the only language that gets first-class support for both Tensorflow and Pytorch. It's so far ahead for working with NNs that it's not even close.
Whenever I see the “I’ve never met x” argument, I’m always looking out for the catch. Because it’s a bad argument to begin with. But the catch is almost always in a situation where x is everywhere, and you would have to have your head willfully stuck in a hole in the ground to not see it.
Not sure how that's relevant. Everything is written in something else.
Python itself is written in C. The Julia github repo shows Julia 68.2%, C 16.3%, C++ 10.4%, Scheme 3.2%. R is a mix of C, C++, R and some Fortran I think...
It is relevant from the point of view of what a developer is able to achieve without being forced to drop down to a 2nd programming language, aka "2 language syndrome".
And how many Python libraries are just plain wrappers, not really written in Python.
Are you suggesting that data scientists use C++ for day-to-day work? Those libraries have first-class wrappers in Python (there is R support, but not at the same level).