As others have pointed out improvements could be made to improve performance of the Cython code. However it seems to me the take away from this post should be in this application Julia has a better performance profile while being as simple and expressive as Python [1].
As a Python programmer that often works with these sorts of problems this reinforces my perception that Julia, while not suited to all programming problems offers a productivity gain when it comes to mathematical/algorithmic computing.
Sure I could sit and tweak the Cython code or rewrite the hot loops in C++, but it seems to me Julia hits a sweet spot where I am less likely to have to compromise programmer performance for computational performance.
[1] I will note however that the code in the OP doesn't seem like the clearest implementation of the algorithms in either language, some of that heavy nesting should be factored for a start.
Another subtle but important point is that the Julia code can be very easily made completely generic without sacrificing any performance or clarity. All you need to do is remove those Float64 annotations and replace the 0.0 values with `zero(y[1]*weights[1])`. With that small change, the same exact code will work for Float32, Rational, BigFloat, Complex (if that even makes sense), etc. – even new, user-defined numeric types that didn't exist when you wrote this code and that you know nothing about, as long as they can be compared, multiplied, added, and divided.
This kind of generality isn't necessary for a lot of user-level code, where there's a specific application, but it is quite important when writing libraries that need to be as reusable as possible. And this is where the Cython solution has some problems – sure, you can make it pretty fast, but in doing so, you lose the generality that the original Python code had, and your code becomes as monomorphic as C.
Cython does have limited support for generics in the form of its fused types [1]; it can be made to generate C code for different primitive types, for instance. It's nowhere near as flexible as Julia in that regard, since you can't redefine operators for working on Complex numbers for instance, but if you want the interoperability and maturity (read: available programmers and existing libraries) that Python offers, you don't need to trade away generics completely. I'd love to live in a completely-Julia ecosystem, but unfortunately that's not quite reality yet, and luckily Cython is more than capable for making do.
In terms of the heavy nesting, that's born out of performance measurements - factoring out the inner loop into a `cdef inline` function in Cython appeared to be a performance hit, so I manually inlined it. When it came for the Julia implementation, I wanted to keep it as similar as possible to the tuned Cython implementation for pedagogical purposes.
In general, if anyone has performance improvements or code cleanups to the Cython PAVA code [1] (the implementation used by scikit-learn), please benchmark and send a pull request over to the scikit-learn team. I'm sure they'll be very happy to accept improvements.
FWIW, benchmarks indicated that the former was faster than the latter. Remember, this is Cython, so the goal is to write code that is easily compiled into efficient C code, not to write idiomatic Python.
If the objective is to get a fast implementation of isotonic regression, I think the priority would be to choose the right algorithm. PAVA is indeed one of the fastest such algorithms and has a running time bound that is linear in the number of data points. I may be wrong, but from a quick look at the code it seems neither the [P/C]ython version nor the Julia version has an implementation that ensures linear time. The files are named Linear PAVA so I believe there was an expectation that they are indeed linear time.
Consider applying it to this sequence
1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, -30.
It seems that the number of operations required according to these implementations would be quadratic in the size of the sequence. (It will average the tail repeatedly.)
I cannot read Julia yet, so not sure about the Julia version, but the others do indeed look very much a quadratic time algorithm at the least. The behavior will of course depend on input, so on certain inputs they may have a runtime that scales linearly. Complexity addresses the worst case behavior and as I showed coming up with such bad cases is not hard, and they need not be very un-natural either.
OTOH:
If the objective was to compare speeds of CPython, Cython and Julia, then it does not matter whether an efficient algorithm was chosen for the task. My hunch is, because the current implementations computes more than necessary, it is going to magnify the speed difference between Cython and Julia.
Does Julia do any inlining ? Or are there any plans for such.
If so, that would be one aspect where it could shine. As mentioned in the comment the Cython version is unfactored and manually inlined for performance. If Julia inlines in its JIT pass, such measures would be unnecessary and would allow one to write cleaner code.
I'm not claiming I've written the fastest possible implementations or anything though - please feel free to improve any of this code and submit to scikit-learn if you find some improvements!
> I'm not claiming I've written the fastest possible implementations or anything though
Oh I never thought that you were claiming so. Thanks for clarifying the implementation. Julia is indeed something that I want to pickup. BTW what data-structure does this snippet invoke
ydash = [1 => y[1]]
fixed length array ?
> please feel free to improve any of this code and submit to scikit-learn if you find some improvements!
Ah! then you would all hound me for my insistence of keeping https://bitbucket.org/sreangsu/libpav LGPL. A discussion I dont want to engage in. BTW I am not even sure if libpav would be faster as it stands now, because it does some really stupid stuff, like unnecessary allocation/deallocation on the heap. Its an old piece that I wrote years ago to entertain myself (proof: comments like these
// I hereby solemnly declare
// that 'it' is not aliased
// with anything that may get
// deleted during this object's
// lifetime. May the runtime
// smite me to my core otherwise.
Its such a strange coincidence: I started hacking on it again just a few days ago, and lo and behold a thread on HN on PAVA. Now motivated enough to fix my crap :) Once I get rid of that damned linked-list I think it would be decent enough, although there is a fair chance that it would be faster than the current Julia version.
That said, libpav does a lot more than just isotonic regression, its more of a marriage between isotonic regression and large margin methods like SVM. Even for just the isotonic regression case, it can work with any abstraction of an iterator (so it is container agnostic) can take a ordering operator, and is of course fully generic (meaning can handle float32, double64 without any change).
Very cool. We had a similar experience when implementing the Simplex method for linear optimization, except we evaluated PyPy versus Julia instead of Cython. Its similar in that it benchmarks "just" the language, not BLAS, which is common pitfall I see when people compare the two. I guess at this point, a year or so later, it isn't surprising to me anymore that Julia is as fast as it is for these sorts of computational tasks - its been demonstrated pretty comprehensively at this point.
That's certainly not true. Take the matter at hand; if I have performance requirements that can be met using Julia, but not by using Python, I'll go with Julia over C because ease of use also matters, though it's not always the driving requirement.
Actually, performance evaluation is one of my first considerations for picking a language if I already know what kind of operating environment/performance I need to achieve.
I would argue that my personal choice is 1) familiarity with the language (ie, pick the most comfortable language for the task at hand) 2) performance required
This looks dodgy to me. The purple and red lines on the graph don't end up parallel, so the active set algorithm appears to have time complexity O(n) in Julia, but O(n^2) or O(n^3) in Python. Can that be explained?
No kidding the Active Set solution in Cython is slow, it barely leverages Cython at all! It uses a Python list which is managed by Python and uses none of Cython's static typing.
Implement the algorithm inside a "with nogil" and you will see some nice speed-ups. It won't look pretty, but for certain performance-sensitive code it will be worth it.
Yeah, the active set implementation in Cython is relatively slow - see [1] for where I discussed how I benchmarked the existing algorithm and demonstrated the PAVA implementation was faster.
Then again, until two days ago the Cython active set algorithm was the production implementation of isotonic regression in scikit-learn, so it's not like it was a strawman example that I made up.
Also, how are you timing Python? When I run this on a sample size of 10 using IPython's %timeit, it takes a few microseconds, which is 2 orders of magnitude from what you report.
Basically, the reason MATLAB/Numpy/R folks harp so much about vectorizing code is because their interpreters are slow, and within a vectorized function they can punt to some C/C++/Cython implementation. But Julia loops compile down to almost ideal assembly; they are remarkably fast. In fact, verbose loops acting in-place almost always beat the vectorized alternatives because they don't need to allocate space for the result.
The wonderful thing about Julia is that you can code it vectorized first, and if that's not fast enough (or if it uses too much memory), you can easily rewrite it in the same language.
You can't parallelize a for-loop the way you can apply a function on every member of a vector. If code can be vectorized in Julia, that is what I'm interested in seeing. These samples are obviously not intended for distributed compute. Thousands of CUDA cores sit idle waiting for this synchronous for-loop to finish.
I wouldn't want to claim an informed opinion about whether it is a good style or not, but a block with no side effects isn't that far off from a function with no side effects.
3. There are lots of Python libraries for application features other than handpicked algorithms. I would be interested to see benchmarks of the marshaling code in IJulia (IPython with Julia)
Writing performant extensions for (C)Python requires knowing both C and the CPython API. In Julia it is easy to call C shared library functions directly with no overhead (similar to PyPy with CFFI).
Data IO, formatting, pre and post processing and presentation.
My standard approach is to use python to read in all my data, parse it, format it and beat it into the shape I want. Then, if I need the performance, I pass the data to a highly tuned C function for slowest number crunching parts. Then I take the data back from C and parse it, format it, summarize it and write it out to the file formats I want using python. C may be faster for number crunching, but python is much nicer for data handling, and there the performance difference is negligible.
Wow, looking at those benchmarks it amazes me that Mathematica is so much faster than Matlab and Octave!
I used to recommend Mathematica to anyone wishing to "play around with math" and "get a feel for it", due to the user friendly interface and 'manipulate' function that allows you to pop a few sliders for realtime manipulation of parameters in functions and instant visual feedback in one line of code, but always thought of it as either a "toy" or a "learning tool", compared to Matlab that is used by "serious engineers" (mind it, I like neither, I'm a Python and C person, but I'm also closer to what some would consider a "software engineer" and definitely not a mathematician or scientist).
There are really two reasons that Mathematica isn't taken seriously by engineers. The first is that Mathematica started life as a symbolic math package while engineers need numeric math. Now mathematica has done a lot over the past few release to catch up on the numeric side, but the notion that it is just for symbolic math lingers.
The second is the language itself. Matlab is a procedural language heavily inspired by Fortran. Mathematica of the other hand as a largely functional language that takes several design cues from Lisp. Historically most engineers came from a Fortran rather than a Lisp background and most engineering programming is still taught in a "Fortran-inspired" way, so the Matlab language is simply more comfortable for most engineers to think and work in.
Another good example how dynamic languages can be made to execute faster with good native code generation support than having to force people to use C or similar.
Looking forward to Julia's JIT improvements. This is only the early days.
Wait, are we looking at the same graphs? Doesn't this show that for large problem sets cython and julia are pretty much equivalent as long as you choose the best algorithm?
Yes, it's as if those people think that instead of trying to measure rough relative speed for similar implementations of some algorithm (which is what you want when you benchmark), you are interest in finding the optimal algorithm for the problem.
I'm not averse to C++11 - I personally think it's a great language for certain tasks, and have used it extensively for performance-critical ML - see [1] for some public work.
The value proposition of Julia (and Python with Cython) is that you don't have to choose between the productivity you have from a dynamic language (IPython/IJulia [2] are incredibly productive environments for ML/data science IMO), and you have the ability to easily optimize hotspots in your code as you come across them - whether that be adding type annotations/disabling bounds checking in Julia, dropping into Cython, etc.
I think the biggest thing is that you would lose the REPL if you went entirely C++, which can be a big deal when you want to try things incrementally.
On the other hand, this code is written in such a low-level style (nested loops mutating arrays, all functions are top level with no env-dependence) that it's hard to see any other advantage in code like this over just using C++. In fact C++ can leverage closures now with little or no performance loss, unlike either of these languages, so hof-style programming becomes feasible even if high performance is a goal. Also I think there's some benefit to static typing for catching errors early (especially as the program gets larger), and documenting return types which I always appreciate as a code reader (!).
If it's sufficiently easy to link to C++ (or Rust), you could still implement parts in C++ (or Rust) and use the REPL for playing around, that might make a good combination...
Well, why not? Julia is very different from Python, if I am willing to use a completely different tool then the field is wide open and includes anything that has better performance on average
Julia's mental programming model for many things is not much different from Python: dynamically typed, pass by sharing, the implicit scoping rules are even quite similar. Sure, Julia has a few extra bells and whistles, like type annotations and dynamic multiple dispatch (which you can ignore and you have something very much like Python but faster), but still, Fortran and C++ are much more different from Python than Julia is.
Because the goal should be to use productive safer languages, while increasing the code quality of their toolchains.
Many languages that have native code compilers (AOT or JIT) can be used. It is just many seem to think only C and C++ are the only game in town.
Personally I look forward to improved code generation in alternative language compilers, so that we go back to the time where C and C++ were just two among many languages with available compilers one could choose from.
Matlab (and Octave) has terrible performance in every benchmark comparison I've ever seen. For example see the Julia benchmarks: http://julialang.org/benchmarks/
Have you only seen one benchmark or did you choose such a biased source for trolling?
How I know this graph has nothing to do with machine learning? It has JavaScript listed and JavaScript is on the faster end of the chart.
Do you really expect anyone to believe that a single threaded JavaScript is going to outperform python or R with highly parallel machine learning algorithms?
As a Python programmer that often works with these sorts of problems this reinforces my perception that Julia, while not suited to all programming problems offers a productivity gain when it comes to mathematical/algorithmic computing.
Sure I could sit and tweak the Cython code or rewrite the hot loops in C++, but it seems to me Julia hits a sweet spot where I am less likely to have to compromise programmer performance for computational performance.
[1] I will note however that the code in the OP doesn't seem like the clearest implementation of the algorithms in either language, some of that heavy nesting should be factored for a start.