Python vs Julia – an example from machine learning

hartror · on March 12, 2014

As others have pointed out improvements could be made to improve performance of the Cython code. However it seems to me the take away from this post should be in this application Julia has a better performance profile while being as simple and expressive as Python [1].

As a Python programmer that often works with these sorts of problems this reinforces my perception that Julia, while not suited to all programming problems offers a productivity gain when it comes to mathematical/algorithmic computing.

Sure I could sit and tweak the Cython code or rewrite the hot loops in C++, but it seems to me Julia hits a sweet spot where I am less likely to have to compromise programmer performance for computational performance.

[1] I will note however that the code in the OP doesn't seem like the clearest implementation of the algorithms in either language, some of that heavy nesting should be factored for a start.

StefanKarpinski · on March 12, 2014

Another subtle but important point is that the Julia code can be very easily made completely generic without sacrificing any performance or clarity. All you need to do is remove those Float64 annotations and replace the 0.0 values with `zero(y[1]*weights[1])`. With that small change, the same exact code will work for Float32, Rational, BigFloat, Complex (if that even makes sense), etc. – even new, user-defined numeric types that didn't exist when you wrote this code and that you know nothing about, as long as they can be compared, multiplied, added, and divided.

This kind of generality isn't necessary for a lot of user-level code, where there's a specific application, but it is quite important when writing libraries that need to be as reusable as possible. And this is where the Cython solution has some problems – sure, you can make it pretty fast, but in doing so, you lose the generality that the original Python code had, and your code becomes as monomorphic as C.

btown · on March 12, 2014

Cython does have limited support for generics in the form of its fused types [1]; it can be made to generate C code for different primitive types, for instance. It's nowhere near as flexible as Julia in that regard, since you can't redefine operators for working on Complex numbers for instance, but if you want the interoperability and maturity (read: available programmers and existing libraries) that Python offers, you don't need to trade away generics completely. I'd love to live in a completely-Julia ecosystem, but unfortunately that's not quite reality yet, and luckily Cython is more than capable for making do.

[1] http://docs.cython.org/src/userguide/fusedtypes.html

ajtulloch · on March 12, 2014

In terms of the heavy nesting, that's born out of performance measurements - factoring out the inner loop into a `cdef inline` function in Cython appeared to be a performance hit, so I manually inlined it. When it came for the Julia implementation, I wanted to keep it as similar as possible to the tuned Cython implementation for pedagogical purposes.

In general, if anyone has performance improvements or code cleanups to the Cython PAVA code [1] (the implementation used by scikit-learn), please benchmark and send a pull request over to the scikit-learn team. I'm sure they'll be very happy to accept improvements.

[1]: https://github.com/scikit-learn/scikit-learn/blob/master/skl...

sesqu · on March 14, 2014

This isn't github, but I'll point out one cleanup:

  # solution[i:k + 1] is a decreasing subsequence, so

should read

  # solution[i:k] is a decreasing subsequence, so

heydenberk · on March 12, 2014

This Python code is "as expressive" as the Julia code because it's not written in idiomatic Python. Por ejemplo:

    for i in range(n):
        solution[i] = y[i]

when one could simply

     solution[:n] = y[:n]

Don't mean to bikeshed, but this is one of many cases.

ajtulloch · on March 13, 2014

FWIW, benchmarks indicated that the former was faster than the latter. Remember, this is Cython, so the goal is to write code that is easily compiled into efficient C code, not to write idiomatic Python.

srean · on March 12, 2014

If the objective is to get a fast implementation of isotonic regression, I think the priority would be to choose the right algorithm. PAVA is indeed one of the fastest such algorithms and has a running time bound that is linear in the number of data points. I may be wrong, but from a quick look at the code it seems neither the [P/C]ython version nor the Julia version has an implementation that ensures linear time. The files are named Linear PAVA so I believe there was an expectation that they are indeed linear time.

Consider applying it to this sequence

     1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, -30.

It seems that the number of operations required according to these implementations would be quadratic in the size of the sequence. (It will average the tail repeatedly.)

I cannot read Julia yet, so not sure about the Julia version, but the others do indeed look very much a quadratic time algorithm at the least. The behavior will of course depend on input, so on certain inputs they may have a runtime that scales linearly. Complexity addresses the worst case behavior and as I showed coming up with such bad cases is not hard, and they need not be very un-natural either.

   OTOH:

If the objective was to compare speeds of CPython, Cython and Julia, then it does not matter whether an efficient algorithm was chosen for the task. My hunch is, because the current implementations computes more than necessary, it is going to magnify the speed difference between Cython and Julia.

Does Julia do any inlining ? Or are there any plans for such.

If so, that would be one aspect where it could shine. As mentioned in the comment the Cython version is unfactored and manually inlined for performance. If Julia inlines in its JIT pass, such measures would be unnecessary and would allow one to write cleaner code.

ajtulloch · on March 12, 2014

Yeah, I implemented a guaranteed O(N) PAVA in Julia (see https://github.com/ajtulloch/Isotonic.jl/blob/master/src/poo...), and experimented with one in Cython - it was uniformly slower than the 'linear' PAVA across a few sample datasets I tried. See e.g. http://bit.ly/1oDF7nH for some graphs.

I'm not claiming I've written the fastest possible implementations or anything though - please feel free to improve any of this code and submit to scikit-learn if you find some improvements!

srean · on March 12, 2014

> I'm not claiming I've written the fastest possible implementations or anything though

Oh I never thought that you were claiming so. Thanks for clarifying the implementation. Julia is indeed something that I want to pickup. BTW what data-structure does this snippet invoke

    ydash = [1 => y[1]]

fixed length array ?

> please feel free to improve any of this code and submit to scikit-learn if you find some improvements!

Ah! then you would all hound me for my insistence of keeping https://bitbucket.org/sreangsu/libpav LGPL. A discussion I dont want to engage in. BTW I am not even sure if libpav would be faster as it stands now, because it does some really stupid stuff, like unnecessary allocation/deallocation on the heap. Its an old piece that I wrote years ago to entertain myself (proof: comments like these

   // I hereby solemnly declare
   // that 'it' is not aliased 
   // with anything that may get
   // deleted during this object's
   // lifetime. May the  runtime
   // smite me to my core otherwise.

https://bitbucket.org/sreangsu/libpav/src/dcc8411b10a0d4e220... and the perverse namespace pollution).

Its such a strange coincidence: I started hacking on it again just a few days ago, and lo and behold a thread on HN on PAVA. Now motivated enough to fix my crap :) Once I get rid of that damned linked-list I think it would be decent enough, although there is a fair chance that it would be faster than the current Julia version.

That said, libpav does a lot more than just isotonic regression, its more of a marriage between isotonic regression and large margin methods like SVM. Even for just the isotonic regression case, it can work with any abstraction of an iterator (so it is container agnostic) can take a ordering operator, and is of course fully generic (meaning can handle float32, double64 without any change).

@jamesjporter Thanks

jamesjporter · on March 12, 2014

    ydash = [1 => y[1]]

creates a Dict that maps for from Int64 to Float64, `=>` is used for declaring Dict literals in Julia. c.f.: http://docs.julialang.org/en/latest/stdlib/base/?highlight=d...

idunning · on March 12, 2014

Very cool. We had a similar experience when implementing the Simplex method for linear optimization, except we evaluated PyPy versus Julia instead of Cython. Its similar in that it benchmarks "just" the language, not BLAS, which is common pitfall I see when people compare the two. I guess at this point, a year or so later, it isn't surprising to me anymore that Julia is as fast as it is for these sorts of computational tasks - its been demonstrated pretty comprehensively at this point.

Slides: http://www.mit.edu/~mlubin/informs2013.pdf

Paper: http://arxiv.org/abs/1312.1431

lqdc13 · on March 12, 2014

I am always confused by this. I think the primary reason for picking a language is how simple it is to write something in it.

So if I can call scipy and get the result easily in one line, I would do that. When implementing new algorithms, it will indeed be easier with Julia.

Pypy unfortunately cannot be used by most people who would benefit a lot from it because numpy/scipy isn't currently supported.

I am curious if you tried running your pypy code with Numba.

EpicEng · on March 12, 2014

Depends on your requirements. In many applications, performance matters a whole lot more than "how much can I accomplish in one line of code."

lqdc13 · on March 12, 2014

In those applications, everyone would be writing everything in C then and such benchmarks would be useless.

pjmlp · on March 12, 2014

Why spend time increasing the amount of unsafe code in the world, if comparable speeds can be achieved in better designed languages?!

EpicEng · on March 12, 2014

That's certainly not true. Take the matter at hand; if I have performance requirements that can be met using Julia, but not by using Python, I'll go with Julia over C because ease of use also matters, though it's not always the driving requirement.

tenfingers · on March 12, 2014

Actually, performance evaluation is one of my first considerations for picking a language if I already know what kind of operating environment/performance I need to achieve.

I would argue that my personal choice is 1) familiarity with the language (ie, pick the most comfortable language for the task at hand) 2) performance required

thisrod · on March 12, 2014

This looks dodgy to me. The purple and red lines on the graph don't end up parallel, so the active set algorithm appears to have time complexity O(n) in Julia, but O(n^2) or O(n^3) in Python. Can that be explained?

Sweyla · on March 12, 2014

No kidding the Active Set solution in Cython is slow, it barely leverages Cython at all! It uses a Python list which is managed by Python and uses none of Cython's static typing.

Implement the algorithm inside a "with nogil" and you will see some nice speed-ups. It won't look pretty, but for certain performance-sensitive code it will be worth it.

ajtulloch · on March 12, 2014

Hi, author here.

Yeah, the active set implementation in Cython is relatively slow - see [1] for where I discussed how I benchmarked the existing algorithm and demonstrated the PAVA implementation was faster.

Then again, until two days ago the Cython active set algorithm was the production implementation of isotonic regression in scikit-learn, so it's not like it was a strawman example that I made up.

[1]: http://tullo.ch/articles/speeding-up-isotonic-regression/

Sweyla · on March 12, 2014

Try this implementation for PAVA:

https://gist.github.com/gustavla/9499068

Also, how are you timing Python? When I run this on a sample size of 10 using IPython's %timeit, it takes a few microseconds, which is 2 orders of magnitude from what you report.

crasta · on March 12, 2014

Coming from R, I have to ask... Why am I seeing 'for' and 'while' loops? Why isn't this code vectorized? It looks like both implementations suck.

mbauman · on March 12, 2014

Yeah, it's a big change in mindset for Julia.

Basically, the reason MATLAB/Numpy/R folks harp so much about vectorizing code is because their interpreters are slow, and within a vectorized function they can punt to some C/C++/Cython implementation. But Julia loops compile down to almost ideal assembly; they are remarkably fast. In fact, verbose loops acting in-place almost always beat the vectorized alternatives because they don't need to allocate space for the result.

The wonderful thing about Julia is that you can code it vectorized first, and if that's not fast enough (or if it uses too much memory), you can easily rewrite it in the same language.

shabadoop · on March 12, 2014

That, and the fact that Julia has pretty good CUDA modules, makes it seem like Julia is perfectly suited for work with Neural Networks.

crasta · on March 12, 2014

You can't parallelize a for-loop the way you can apply a function on every member of a vector. If code can be vectorized in Julia, that is what I'm interested in seeing. These samples are obviously not intended for distributed compute. Thousands of CUDA cores sit idle waiting for this synchronous for-loop to finish.

maxerickson · on March 12, 2014

Cython has a construct for parallelizing for loops.

http://docs.cython.org/src/userguide/parallelism.html

I wouldn't want to claim an informed opinion about whether it is a good style or not, but a block with no side effects isn't that far off from a function with no side effects.

crasta · on March 12, 2014

@maxerickson I can't reply to you so I'll reply next to you. That looks very interesting. I will have to look into it. Thanks

CatMtKing · on March 12, 2014

Aye, first thing I learned about R was that their for loops suck. On the bright side, it taught me how to use apply.

westurner · on March 12, 2014

1. Where is the source for this benchmark?

2.http://benchmarksgame.alioth.debian.org could be a bit more representative of broad-based algorithmic performance.

3. There are lots of Python libraries for application features other than handpicked algorithms. I would be interested to see benchmarks of the marshaling code in IJulia (IPython with Julia)

maxerickson · on March 12, 2014

How much time is Cython spending making the copy into the solution array?

(Or is Julia making a copy and I just don't see it?)

gregimba · on March 12, 2014

If you need speed wouldn't the ability to write as a C extension beat using Julia?

ajtulloch · on March 12, 2014

For what it's worth, a quickly done C++11 implementation is <1.5x speedup over Julia.

https://gist.github.com/ajtulloch/9499872

ihnorton · on March 12, 2014

Writing performant extensions for (C)Python requires knowing both C and the CPython API. In Julia it is easy to call C shared library functions directly with no overhead (similar to PyPy with CFFI).

mmerickel · on March 12, 2014

CFFI works fine with CPython, and ctypes loads shared libraries without issue as well.

ihnorton · on March 12, 2014

Both of which have non-trivial overhead in CPython.

zaphar · on March 12, 2014

Julia makes it pretty easy to write it in C if you want as well: http://docs.julialang.org/en/release-0.2/stdlib/base/#c-inte...

So that doesn't really give python a leg up over julia.

pekk · on March 12, 2014

If I am going to write it in C then what in the world am I using Julia for?

dagw · on March 12, 2014

Data IO, formatting, pre and post processing and presentation.

My standard approach is to use python to read in all my data, parse it, format it and beat it into the shape I want. Then, if I need the performance, I pass the data to a highly tuned C function for slowest number crunching parts. Then I take the data back from C and parse it, format it, summarize it and write it out to the file formats I want using python. C may be faster for number crunching, but python is much nicer for data handling, and there the performance difference is negligible.

idunning · on March 12, 2014

Exactly - the C interop is used almost exclusively for connecting to existing libraries.

ermintrude · on March 12, 2014

Can you also call C++ libraries directly from Julia?

mbauman · on March 12, 2014

Name mangling, as always, is a mess. If they export extern "C" headers, then of course there's no trouble.

There's work on interfacing with the C++ ABI, though. See Cpp.jl[0] and Clang.jl (whose C++ support is still a WIP)[1].

[0]. https://github.com/timholy/Cpp.jl

[1]. https://github.com/ihnorton/Clang.jl/issues/20

_clhx · on March 12, 2014

I think the point is the fun of the language. See here: http://julialang.org/ and scroll down a bit for the speed comparison.

nnq · on March 12, 2014

Wow, looking at those benchmarks it amazes me that Mathematica is so much faster than Matlab and Octave!

I used to recommend Mathematica to anyone wishing to "play around with math" and "get a feel for it", due to the user friendly interface and 'manipulate' function that allows you to pop a few sliders for realtime manipulation of parameters in functions and instant visual feedback in one line of code, but always thought of it as either a "toy" or a "learning tool", compared to Matlab that is used by "serious engineers" (mind it, I like neither, I'm a Python and C person, but I'm also closer to what some would consider a "software engineer" and definitely not a mathematician or scientist).

dagw · on March 12, 2014

There are really two reasons that Mathematica isn't taken seriously by engineers. The first is that Mathematica started life as a symbolic math package while engineers need numeric math. Now mathematica has done a lot over the past few release to catch up on the numeric side, but the notion that it is just for symbolic math lingers.

The second is the language itself. Matlab is a procedural language heavily inspired by Fortran. Mathematica of the other hand as a largely functional language that takes several design cues from Lisp. Historically most engineers came from a Fortran rather than a Lisp background and most engineering programming is still taught in a "Fortran-inspired" way, so the Matlab language is simply more comfortable for most engineers to think and work in.

carlob · on March 12, 2014

To be honest, I've given a quick look at the implementation in Mathematica, and I think one could do much better.

The code is mostly wrapped in Compile statements and is very procedural in style.

As a quick test I compared their quicksort with the built in Sort obtaining results about 30 times faster than theirs.

pjmlp · on March 12, 2014

Another good example how dynamic languages can be made to execute faster with good native code generation support than having to force people to use C or similar.

Looking forward to Julia's JIT improvements. This is only the early days.

e12e · on March 12, 2014

Wait, are we looking at the same graphs? Doesn't this show that for large problem sets cython and julia are pretty much equivalent as long as you choose the best algorithm?

amit_m · on March 12, 2014

tl;dr: Julia is killing it.

est · on March 12, 2014

In every benchmark article: you are doing X wrong. Use Y method instead.

coldtea · on March 12, 2014

Yes, it's as if those people think that instead of trying to measure rough relative speed for similar implementations of some algorithm (which is what you want when you benchmark), you are interest in finding the optimal algorithm for the problem.

halayli · on March 12, 2014

Why not use C++11 if performance matters?

ajtulloch · on March 12, 2014

I'm not averse to C++11 - I personally think it's a great language for certain tasks, and have used it extensively for performance-critical ML - see [1] for some public work.

The value proposition of Julia (and Python with Cython) is that you don't have to choose between the productivity you have from a dynamic language (IPython/IJulia [2] are incredibly productive environments for ML/data science IMO), and you have the ability to easily optimize hotspots in your code as you come across them - whether that be adding type annotations/disabling bounds checking in Julia, dropping into Cython, etc.

[1]: http://tullo.ch/articles/decision-tree-evaluation/

[2]: http://bit.ly/1cOTzJY

sixbrx · on March 12, 2014

I think the biggest thing is that you would lose the REPL if you went entirely C++, which can be a big deal when you want to try things incrementally.

On the other hand, this code is written in such a low-level style (nested loops mutating arrays, all functions are top level with no env-dependence) that it's hard to see any other advantage in code like this over just using C++. In fact C++ can leverage closures now with little or no performance loss, unlike either of these languages, so hof-style programming becomes feasible even if high performance is a goal. Also I think there's some benefit to static typing for catching errors early (especially as the program gets larger), and documenting return types which I always appreciate as a code reader (!).

If it's sufficiently easy to link to C++ (or Rust), you could still implement parts in C++ (or Rust) and use the REPL for playing around, that might make a good combination...

emidln · on March 12, 2014

"Why not use Fortran 2008 if performance matters?"

pekk · on March 12, 2014

Well, why not? Julia is very different from Python, if I am willing to use a completely different tool then the field is wide open and includes anything that has better performance on average

StefanKarpinski · on March 12, 2014

Julia's mental programming model for many things is not much different from Python: dynamically typed, pass by sharing, the implicit scoping rules are even quite similar. Sure, Julia has a few extra bells and whistles, like type annotations and dynamic multiple dispatch (which you can ignore and you have something very much like Python but faster), but still, Fortran and C++ are much more different from Python than Julia is.

tejinderss · on March 12, 2014

Any tentative idea when Julia 1.0 will be released?

StefanKarpinski · on March 12, 2014

Possibly late 2014 or early 2015, but we're not going to rush it if things don't feel ready yet.

coldtea · on March 12, 2014

Because not all "completely different tools" are equally valid or use the same compromises.

pjmlp · on March 12, 2014

Because the goal should be to use productive safer languages, while increasing the code quality of their toolchains.

Many languages that have native code compilers (AOT or JIT) can be used. It is just many seem to think only C and C++ are the only game in town.

Personally I look forward to improved code generation in alternative language compilers, so that we go back to the time where C and C++ were just two among many languages with available compilers one could choose from.

rasz_pl · on March 12, 2014

Dont know about c++, but matlab/octave comparison would be nice.

Blahah · on March 12, 2014

Matlab (and Octave) has terrible performance in every benchmark comparison I've ever seen. For example see the Julia benchmarks: http://julialang.org/benchmarks/

crasta · on March 12, 2014

Have you only seen one benchmark or did you choose such a biased source for trolling?

How I know this graph has nothing to do with machine learning? It has JavaScript listed and JavaScript is on the faster end of the chart.

Do you really expect anyone to believe that a single threaded JavaScript is going to outperform python or R with highly parallel machine learning algorithms?

t3kcit · on March 12, 2014

Why are you not using typed memory views in Cython? That might easily close the performance gap.

dbecker · on March 12, 2014

Would love to see numba added into this benchmark.