Re: TensorFlow, PyTorch, and manylinux1


Can PyTorch provide and maintain a conda-forge recipe?

This would allow the large and growing conda-forge ecosystem to easily
install PyTorch in a community-supported way.

Are there problems with using conda or another general package manager?

I agree that the machine learning packages are trying to make a
language-specific package manager do more than it was intended to do, and
other open-source solutions already exist.

Thanks,

Travis


On Mon, Dec 17, 2018, 12:32 AM soumith <soumith@xxxxxxxxx> wrote:

> I'm reposting my original reply below the current reply (below a dotted
> line). It was filtered out because I wasn't subscribed to the relevant
> mailing lists.
>
> tl;dr: manylinux2010 looks pretty promising, because CUDA supports CentOS6
> (for now).
>
> In the meantime, I dug into what pyarrow does, and it looks like it links
> with `-static-libstdc++` along with a linker version script [1].
>
> PyTorch did exactly that until January of this year [2], except that our
> linker version script didn't cover the subtleties of statically linking
> stdc++ as well as Arrow's did. Because we weren't covering all of those
> subtleties, we faced huge issues that amplified wheel incompatibility
> (`import X; import torch` crashing for various X). Hence, we have since
> moved to linking against the system-shipped libstdc++, with no static
> stdc++ linking.
>
> I'll revisit this in light of manylinux2010, and go down the path of static
> linkage of stdc++ again, though I'm wary of the subtleties around handling
> of weak symbols, std::string destruction across library boundaries [3] and
> std::string's ABI incompatibility issues.
>
> I've opened a tracking issue here:
> https://github.com/pytorch/pytorch/issues/15294
>
> I'm looking forward to hearing from the TensorFlow devs if manylinux2010 is
> sufficient for them, or what additional constraints they have.
>
> As a personal thought, I find multiple libraries in the same process
> statically linking to stdc++ gross, but without a package manager like
> Anaconda that actually is willing to deal with the C++-side dependencies,
> there aren't many options on the table.
>
> References:
>
> [1] https://github.com/apache/arrow/blob/master/cpp/src/arrow/symbols.map
> [2] https://github.com/pytorch/pytorch/blob/v0.3.1/tools/pytorch.version
> [3] https://github.com/pytorch/pytorch/issues/5400#issuecomment-369428125
>
> ............................................................................................................................................................
> Hi Philipp,
>
> Thanks a lot for getting a discussion started. I've sunk 100+ hours over
> the last 2 years into making PyTorch wheels play well with OpenCV,
> TensorFlow and other wheels, so I'm glad to see this discussion started.
>
>
> On the PyTorch wheels, we have been shipping with the minimum glibc and
> libstdc++ versions we can possibly work with, while keeping two hard
> constraints:
>
> 1. CUDA support
> 2. C++11 support
>
>
> 1. CUDA support
>
> manylinux1 is not an option, considering CUDA doesn't work on CentOS5.
> I explored this option [1] without success.
>
> manylinux2010 is an option at the moment wrt CUDA, but it's unclear when
> NVIDIA will drop support for CentOS6 out from under us.
> Additionally, cuDNN 7.0 (if I remember correctly) was compiled on Ubuntu
> 12.04 (meaning the glibc version is newer than CentOS6's), and binaries
> linked against cuDNN refused to run on CentOS6. I requested that this
> constraint be lifted, and the next dot release fixed it.
>
> PyTorch binaries are not manylinux2010 compatible at the moment because
> of the next constraint: C++11.
>
> 2. C++11
>
> We picked C++11 as the minimum supported dialect for PyTorch, primarily to
> serve the default compilers of older machines, i.e. Ubuntu 14.04 and
> CentOS7. The newer options were C++14 / C++17, but we decided to polyfill
> what we needed to support older distros better.
>
> A fully fleshed out C++11 implementation landed in gcc in various stages,
> with gradual ABI changes [2]. Unfortunately, the libstdc++ that ships with
> CentOS6 (and hence manylinux2010) isn't sufficient to cover all of C++11.
> For example, the binaries we built with devtoolset3 (gcc 4.9.2) on CentOS6
> didn't run with the default libstdc++ on CentOS6, either because of ABI
> changes or because the minimum GLIBCXX version required by some symbols
> was unavailable.
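
To make the GLIBCXX constraint concrete, here is a minimal sketch of the
version check involved. It assumes the symbol-version ceilings listed in
PEP 571 for manylinux2010 (GLIBC 2.12, CXXABI 1.3.3, GLIBCXX 3.4.13,
GCC 4.3.0). The function names and the sample symbol list are hypothetical;
a real check would read the required versions out of the built shared
library rather than from a hard-coded list.

```python
import re

# Maximum symbol versions permitted by the manylinux2010 policy (PEP 571).
MANYLINUX2010_MAX = {
    "GLIBC": (2, 12),
    "CXXABI": (1, 3, 3),
    "GLIBCXX": (3, 4, 13),
    "GCC": (4, 3, 0),
}

_VERSIONED = re.compile(r"^(GLIBC|CXXABI|GLIBCXX|GCC)_(\d+(?:\.\d+)*)$")


def parse_symbol_version(tag):
    """Split e.g. 'GLIBCXX_3.4.20' into ('GLIBCXX', (3, 4, 20))."""
    match = _VERSIONED.match(tag)
    if match is None:
        return None
    name, version = match.groups()
    return name, tuple(int(part) for part in version.split("."))


def too_new_for_manylinux2010(required_tags):
    """Return the tags whose version exceeds the manylinux2010 ceiling."""
    offenders = []
    for tag in required_tags:
        parsed = parse_symbol_version(tag)
        if parsed is None:
            continue  # not a symbol family that the policy restricts
        name, version = parsed
        if version > MANYLINUX2010_MAX[name]:
            offenders.append(tag)
    return offenders


# Hypothetical requirements of a library built with a newer toolchain.
required = ["GLIBC_2.12", "GLIBCXX_3.4.13", "GLIBCXX_3.4.20", "CXXABI_1.3.8"]
print(too_new_for_manylinux2010(required))
```

Versions are compared as integer tuples, so 3.4.20 correctly sorts above
3.4.13, which a plain string comparison would get wrong.
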
>
> We tried our best to keep our binaries running on CentOS6 and above with
> various static linking hacks until 0.3.1 (January 2018), but at some point
> piling hacks on hacks was only making things more fragile. Hence we moved
> to a CentOS7-based image in April 2018 [3], and relied only on dynamic
> linking to the system-shipped libstdc++.
>
> As Wes mentions [4], one option is to host a modern C++ standard library
> on PyPI, which would put manylinux2010 on the table. There are, however,
> subtle consequences with this -- if this package gets installed into a
> conda environment, it'll clobber the anaconda-shipped libstdc++, possibly
> corrupting environments for thousands of anaconda users (this is actually
> similar to the issues with `mkl` shipped via PyPI and Conda clobbering
> each other).
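
One way a PyPI-hosted toolchain package could guard against the clobbering
scenario described above is to detect a conda environment before touching
libstdc++. The following is a minimal sketch, assuming that a conda
environment can be recognized by a `conda-meta` directory under the
interpreter prefix. The helper name is hypothetical, and nothing in this
thread says any project actually does this.

```python
import os
import sys


def looks_like_conda_env(prefix=None):
    """Heuristic: conda keeps package metadata in <prefix>/conda-meta."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


if looks_like_conda_env():
    print("conda environment detected: leave libstdc++ to conda")
else:
    print("no conda-meta directory found under", sys.prefix)
```

This is only a heuristic for the active interpreter; it says nothing about
which libstdc++ the dynamic loader will actually pick up at import time.
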
>
>
> References:
>
> [1] https://github.com/NVIDIA/nvidia-docker/issues/348
> [2] https://gcc.gnu.org/wiki/Cxx11AbiCompatibility
> [3] https://github.com/pytorch/builder/commit/44d9bfa607a7616c66fe6492fadd8f05f3578b93
> [4] https://github.com/apache/arrow/pull/3177#issuecomment-447515982
>
> ..............................................................................................................................................................................................
>
> On Sun, Dec 16, 2018 at 2:57 PM Wes McKinney <wesmckinn@xxxxxxxxx> wrote:
>
> > Reposting since I wasn't subscribed to developers@xxxxxxxxxxxxxx. I
> > also didn't see Soumith's response since it didn't come through to
> > dev@xxxxxxxxxxxxxxxx
> >
> > In response to the non-conforming ABI in the TF and PyTorch wheels, we
> > have attempted to hack around the issue with some elaborate
> > workarounds [1] [2] that have ultimately proved to not work
> > universally. The bottom line is that this is burdening other projects
> > in the Python ecosystem and causing confusing application crashes.
> >
> > First, to state what should hopefully be obvious to many of you, Python
> > wheels are not a robust way to deploy complex C++ projects, even
> > setting aside the compiler toolchain issue. If a project has
> > non-trivial third party dependencies, you either have to statically
> > link them or bundle shared libraries with the wheel (we do a bit of
> > both in Apache Arrow). Neither solution is foolproof in all cases.
> > There are other downsides to wheels when it comes to numerical
> > computing -- it is difficult to utilize things like the Intel MKL
> > which may be used by multiple projects. If two projects have the same
> > third party C++ dependency (e.g. let's use gRPC or libprotobuf as a
> > straw man example), it's hard to guarantee that versions or ABI will
> > not conflict with each other.
> >
> > In packaging with conda, we pin all dependencies when building
> > projects that depend on them, then package and deploy the dependencies
> > as separate shared libraries instead of bundling. To resolve the need
> > for newer compilers or a newer C++ standard library, libstdc++.so and
> > other system shared libraries are packaged and installed as
> > dependencies. In manylinux1, the Red Hat devtoolset compiler toolchain
> > is used, as it performs selective static linking of symbols to enable
> > C++11 libraries to be deployed on older Linuxes like RHEL5/6. A conda
> > environment functions as a sort of portable miniature Linux
> > distribution.
> >
> > Given that using the TensorFlow and PyTorch wheels in the same process
> > as conforming manylinux1 wheels is currently unsafe, it's hard to see
> > how one can continue to recommend pip as a preferred installation path
> > until the ABI problems are resolved. For example, "pip" is what is
> > recommended for installing TensorFlow on Linux [3]. It's unclear whether
> > non-compliant wheels should be allowed in the package index at all (I'm
> > aware that verifying policy compliance was deemed not to be the
> > responsibility of PyPI [4]).
> >
> > A couple of possible paths forward (there may be others):
> >
> > * Collaborate with the Python Packaging Authority to evolve the
> > manylinux ABI to be able to produce compliant wheels that support the
> > build and deployment requirements of these projects
> > * Create a new ABI tag for CUDA/C++11-enabled Python wheels so that
> > projects can ship packages that can be guaranteed to work properly
> > with TF/PyTorch. This might require vendoring libstdc++ in some kind
> > of "toolchain" wheel that projects using this new ABI can depend on
> >
> > Note that these toolchain and deployment issues are absent when
> > building and deploying with conda packages, since build- and run-time
> > dependencies can be pinned and shared across all the projects that
> > depend on them, ensuring ABI cross-compatibility. It's great to have
> > the convenience of "pip install $PROJECT", but I believe that these
> > projects have outgrown the intended use for pip and wheel
> > distributions.
> >
> > Until the ABI incompatibilities are resolved, I would encourage more
> > prominent user documentation about the non-portability and potential
> > for crashes with these Linux wheels.
> >
> > Thanks,
> > Wes
> >
> > [1]: https://github.com/apache/arrow/commit/537e7f7fd503dd920c0b9f0cef8a2de86bc69e3b
> > [2]: https://github.com/apache/arrow/commit/e7aaf7bf3d3e326b5fe58d20f8fc45b5cec01cac
> > [3]: https://www.tensorflow.org/install/
> > [4]: https://www.python.org/dev/peps/pep-0513/#id50
> > On Sat, Dec 15, 2018 at 11:25 PM Robert Nishihara
> > <robertnishihara@xxxxxxxxx> wrote:
> > >
> > > On Sat, Dec 15, 2018 at 8:43 PM Philipp Moritz <pcmoritz@xxxxxxxxx> wrote:
> > >
> > > > Dear all,
> > > >
> > > > As some of you know, there is a standard in Python called manylinux
> > > > (https://www.python.org/dev/peps/pep-0513/) to package binary
> > > > executables and libraries into a “wheel” in a way that allows the code
> > > > to be run on a wide variety of Linux distributions. This is very
> > > > convenient for Python users, since such libraries can be easily
> > > > installed via pip.
> > > >
> > > > This standard is also important for a second reason: If many different
> > > > wheels are used together in a single Python process, adhering to
> > > > manylinux ensures that these libraries work together well and don’t
> > > > step on each other’s toes (this could easily happen if different
> > > > versions of libstdc++ are used, for example). Therefore *even if
> > > > support for only a single distribution like Ubuntu is desired*, it is
> > > > important to be manylinux compatible to make sure everybody’s wheels
> > > > work together well.
> > > >
> > > > TensorFlow and PyTorch unfortunately don’t produce manylinux-compatible
> > > > wheels. The challenge is due, at least in part, to the need to use
> > > > nvidia-docker to build GPU binaries [10]. This causes various levels of
> > > > pain for the rest of the Python community, see for example [1] [2] [3]
> > > > [4] [5] [6] [7] [8].
> > > >
> > > > The purpose of this e-mail is to get a discussion started on how we can
> > > > make TensorFlow and PyTorch manylinux compliant. There is a new
> > > > standard in the works [9], so hopefully we can discuss what would be
> > > > necessary to make sure TensorFlow and PyTorch can adhere to this
> > > > standard in the future.
> > > >
> > > > It would make everybody’s lives just a little bit better! Any ideas are
> > > > appreciated.
> > > >
> > > > @soumith: Could you cc the relevant list? I couldn't find a pytorch dev
> > > > mailing list.
> > > >
> > > > Best,
> > > > Philipp.
> > > >
> > > > [1] https://github.com/tensorflow/tensorflow/issues/5033
> > > > [2] https://github.com/tensorflow/tensorflow/issues/8802
> > > > [3] https://github.com/primitiv/primitiv-python/issues/28
> > > > [4] https://github.com/zarr-developers/numcodecs/issues/70
> > > > [5] https://github.com/apache/arrow/pull/3177
> > > > [6] https://github.com/tensorflow/tensorflow/issues/13615
> > > > [7] https://github.com/pytorch/pytorch/issues/8358
> > > > [8] https://github.com/ray-project/ray/issues/2159
> > > > [9] https://www.python.org/dev/peps/pep-0571/
> > > > [10] https://github.com/tensorflow/tensorflow/issues/8802#issuecomment-291935940