Paper review: TensorFlow, Machine Learning on Heterogeneous Distributed Systems (muratbuffalo.blogspot.com)
90 points by mad44 on Jan 11, 2016 | 22 comments



I think Google has designed and developed TensorFlow as a Maui-style integrated code-offloading framework for machine learning. What is Maui you ask? (Damn, I don't have a Maui summary in my blog?)

Maui is a system for offloading smartphone code execution onto backend servers at method granularity. The system relies on the ability of a managed code environment (the .NET CLR) to run on different platforms. By introducing this automatic offloading framework, Maui enables applications that exceed a smartphone's memory/computation limits to run on it in a battery- and bandwidth-efficient manner.
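
To make the idea concrete, here is a hypothetical Python sketch of method-granularity offloading. This is not Maui's actual API (Maui annotates .NET methods with an attribute and uses a solver over energy/latency profiles); every name and number below is made up:

    # Hypothetical sketch of Maui-style, method-granularity offloading.
    # Not Maui's real API; all names and cost numbers are made up.

    def should_offload(cpu_cost_s, input_bytes, uplink_bytes_per_s=125000.0):
        # Crude cost model: offload when the estimated local compute time
        # exceeds the time needed to ship the inputs over the network.
        return cpu_cost_s > input_bytes / uplink_bytes_per_s

    def rpc_call(name, args, kwargs):
        # Stand-in for the real "run this method on the server" RPC.
        raise NotImplementedError("server side not shown")

    def remoteable(cpu_cost_s, input_bytes):
        # Decorator marking a method as eligible for remote execution.
        def wrap(fn):
            def dispatch(*args, **kwargs):
                if should_offload(cpu_cost_s, input_bytes):
                    return rpc_call(fn.__name__, args, kwargs)
                return fn(*args, **kwargs)
            return dispatch
        return wrap

    @remoteable(cpu_cost_s=2.0, input_bytes=50000)
    def classify(image):
        pass  # stand-in for an expensive model evaluation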

TensorFlow adds cloud-backend support to the private, device-level machine learning going on in your smartphone. It doesn't make sense for an entire power-hungry TensorFlow program to run on your wimpy smartphone: your smartphone would run only certain TensorFlow nodes and modules, while the rest of the TensorFlow graph runs on the Google cloud backend. Such a setup is also great for preserving the privacy of your phone while still enabling machine-learned insights on your Android.
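
A hedged sketch of what that split could look like, using TensorFlow's real placement mechanism, tf.device(). The single-machine device names are real; the phone/cloud reading of them is the speculative part:

    import tensorflow as tf

    # Sketch of TensorFlow's placement mechanism. This is the
    # single-machine form; the distributed runtime uses the same
    # tf.device() annotations with names like "/job:worker/task:0".
    with tf.device("/cpu:0"):                 # imagine: the phone
        x = tf.placeholder(tf.float32, [None, 784], name="input")
        w1 = tf.Variable(tf.zeros([784, 256]))
        h = tf.nn.relu(tf.matmul(x, w1))

    with tf.device("/gpu:0"):                 # imagine: the cloud backend
        w2 = tf.Variable(tf.zeros([256, 10]))
        y = tf.nn.softmax(tf.matmul(h, w2))
    # Graph construction only; actually running this on a machine with
    # no GPU would need allow_soft_placement=True in the session config.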

I don't see any real evidence of this in the paper, code, or examples. Certainly, having run it, it doesn't feel like it is designed to operate over high-latency connections like that.

Instead, I think TensorFlow is designed to work on mobile so that it is comparatively easy for a pre-trained model to be deployed and run in a constrained mobile environment.

That's what Google did for the Android voice recognition model (although that was pre-TensorFlow), and they have released a pre-trained image classification model that runs on TensorFlow.
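
As a minimal sketch of that deployment story (file and tensor names below are hypothetical), loading a frozen pre-trained GraphDef for inference looks roughly like:

    import numpy as np
    import tensorflow as tf

    # Load a frozen, pre-trained GraphDef and run inference only --
    # no training ops, no optimizer state. File and tensor names
    # are hypothetical.
    graph_def = tf.GraphDef()
    with open("frozen_inception.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

    image_batch = np.zeros((1, 299, 299, 3), dtype=np.float32)  # dummy input
    with tf.Session() as sess:
        preds = sess.run("softmax:0", feed_dict={"input:0": image_batch})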


Another piece of evidence supporting your point is the TensorFlow Android example: https://github.com/tensorflow/tensorflow/tree/master/tensorf...

It runs the network locally on the device. It's fast enough.


"Sometimes, a cigar is just a cigar."

Phones are increasingly powerful, prevalent, and for a lot of people, their primary computer. Being able to use the same code, data, and test cases for every platform you want to train or run your model on is good from a software engineering perspective. Don't underestimate the challenges of software engineering when ML is involved -- see, for example, http://icml.cc/2015/invited/LeonBottouICML2015.pdf


Where can I learn more about the Maui that OP speaks of?



Yes, that was it, thanks.


> 9.2 Performance Tracing - We also have an internal tool called EEG (not included in the initial open source release in November, 2015) that we use to collect and visualize very fine-grained information about the exact ordering and performance characteristics of the execution of TensorFlow graphs. This tool works in both our single machine and distributed implementations, and is very useful for understanding the bottlenecks in the computation and communication patterns of a TensorFlow program.

The flow graphs look very similar to those in a talk given by Dick Sites [2, 3], where he talks about performance monitoring across the Google fleet. I highly recommend watching it if you are at all interested in performance monitoring (he has a clear passion for and deep understanding of this area). What's really interesting about this talk is that it appears Google engineers do not need to bake this into their distributed applications: it works at the OS level, along with supporting tracing systems (tagging, log collection, aggregation into a central tool, analytics, etc. [4 - Dapper]). So you basically get it for free if you are running in the Google environment. Compare this with something you want to debug outside of Google: you need to add all types of debug bits to your code, and then do the analysis in your own little silo. A very thought-provoking idea. I think this is why they are not releasing it, though that is a guess based on a few things I've read, so I could be wrong.

ps - if you want to get really crazy about this, you might read into this that they "would like to" or "have built" their own CPUs, to address some of the limitations described in the talk.

[1] https://en.wikipedia.org/wiki/Electroencephalography

[2] http://www.pdl.cmu.edu/SDI/2015/slides/DatacenterComputers.p...

[3] https://vimeo.com/121396406

[4] http://static.googleusercontent.com/media/research.google.co...


There are three reasons Google open-sourced TensorFlow:

1. Their stated reason.

2. Branding. Google likes to show off advanced products that aren't fully developed in order to elevate their brand and recruit engineers (even though most do not work on such products). Project Soli is the best example of this: the product video shows off a small radar and some range-doppler video, but it does not show complicated gesture recognition capabilities, which is the hardest part.

3. Free labor. Deep learning algorithms require a ton of data to avoid overfitting the training data. Google has tons of data while academic researchers do not. If Google gets researchers to develop new algorithms that work well on smaller amounts of data, and the researchers develop those algorithms in TensorFlow, then Google can easily incorporate the work into their own products and make them better.


The free labor argument isn't particularly compelling in this case. Academic researchers focus on publishing their results in papers, and published neural network architectures are comparatively trivial to port between software platforms.

Even when code is published, it's a few hours' work to copy a (say) Torch-based architecture to TensorFlow - indeed, most platforms import (and export?) Caffe-style (protobuf-based) model descriptions.
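
For example, here is a hedged sketch of what such a port amounts to (layer sizes made up for illustration):

    import tensorflow as tf

    # TensorFlow equivalent of a tiny Torch description, e.g.
    #   nn.Sequential():add(nn.Linear(784, 10)):add(nn.SoftMax())
    x = tf.placeholder(tf.float32, [None, 784])
    W = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b)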


> Branding. Google likes to show off advanced products that aren't fully developed in order to elevate their brand and recruit engineers (even though most do not work on such products). Project Soli is the best example of this: the product video shows off a small radar and some range-doppler video, but it does not show complicated gesture recognition capabilities, which is the hardest part.

Project Soli doesn't seem remotely analogous. One is an open-source library that is actually being run as an open-source project and works today, but (so far) holds back a killer feature. The other is a product concept video.


4. (Related to 2) Screening college graduates. By releasing "toy" versions of their stuff, they get professors and students using their tools for projects, so when those students go to interview with Google, Google already has some data on whether they use the tools well, and it drives those kids toward Google so they can work on "the real thing". It's a great strategy.


Plus 5. Devs learn to use their tool. So when they hire the best devs (hopefully) from elsewhere, those devs already know how to use the tool without requiring any training time.


Add "bazel" to this growing list. Blaze (the real thing) is a great build tool, the best. But they kept the distributed engine for themselves, rendering Bazel useless. So far, they have done the same thing with TensorFlow - kept the distributed engine proprietary.


I wouldn't call it "useless". It's less useful than the internal distributed version, sure, but it does offer value: a drop-dead-simple build file syntax that can be learned in 15 minutes and then just works.
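
For instance, a minimal BUILD file (target and file names here are hypothetical):

    # A minimal Bazel BUILD file; target and file names are hypothetical.
    cc_library(
        name = "matrix",
        srcs = ["matrix.cc"],
        hdrs = ["matrix.h"],
    )

    cc_binary(
        name = "bench",
        srcs = ["bench.cc"],
        deps = [":matrix"],
    )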


Useless is a bit harsh, but judging by Bazel's uptake, that's the "market's" take on Bazel.


What did you expect? It's only been a few months. There's tremendous inertia when it comes to build systems; people don't change them on a whim.


Without the fancy distributed bits, what's the win I get from using Bazel? Maybe it's simpler, but if I already know another system, then why switch?


Simple, repeatable builds, almost no learning curve. I do agree that this is a vastly less potent value proposition than "build your C++ in minutes instead of hours", but it's a start. I think Google will offer the distributed cloud piece for this at some point. They'd be stupid not to.


It takes years for a new build format to be widely used.

There are many more important things holding up wider adoption before the lack of distributed support becomes an issue: Windows support, how hard it is to get it working on ARM, etc.

TensorFlow is its own thing.


This seems to be a big problem for some people, but if you can stick 8 GPUs in a box with two 12-core CPUs and a terabyte "disk" doing 450,000 IOPS, it seems to me that you have something interesting to TensorFlow over.


The stated reason for extending this to phones was so that they could reuse these models for inference purposes. I'm not sure what this post adds to that (the author seems to miss that point).


I don't see the justification for criticizing "shoehorning Tensorflow to run on a wimpy phone". Phones today are nearly as powerful as laptops, or even more so [1]. Mobile and laptop processor speeds are outpacing network speeds, so it's often faster for a program/game to run these computations locally rather than send data to a server. In addition, you might use TensorFlow more for its ease of use and pre-implemented structures than for raw speed, like the Python-vs-C argument, so not all use cases will be huge data-crunching.

[1] http://fossbytes.com/iphone-6s-is-more-powerful-than-the-new...
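
A back-of-the-envelope sketch of the local-vs-remote tradeoff, where every number is an illustrative assumption rather than a measurement:

    # Back-of-the-envelope: run inference locally vs. send the input
    # to a server. All numbers below are illustrative assumptions.
    image_kb = 150.0            # a compressed camera frame
    uplink_kb_per_s = 500.0     # a so-so cellular uplink
    rtt_ms = 100.0              # round trip to the data center
    server_infer_ms = 20.0      # fast GPU in the data center
    local_infer_ms = 250.0      # slower mobile CPU/GPU

    remote_ms = image_kb / uplink_kb_per_s * 1000 + rtt_ms + server_infer_ms
    print("remote: %.0f ms, local: %.0f ms" % (remote_ms, local_infer_ms))
    # remote: 420 ms, local: 250 ms -- under these assumptions the
    # "wimpy" phone wins, before even counting radio power draw.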



