This page reproduces the content of https://speakerdeck.com/ohe/introduction-to-scientific-programming-in-python.


Uploaded on 2014/09/13, in the Technology category.

Talk given at Pycon JP on 2014/09/13

- Introduction to scientific programming in python

Olivier Hervieu - Pycon JP - 2014/09/13 - Why this talk?

• born on twitter

• not my initial proposal for pyconjp*

!

* speakerdeck.com/ohe/shit-happens-dot-dot-dot-v2 - You already are scientific programmers!

what can I teach you? - What to expect?

• tour of tools that can (must?) be used by the everyday scientific programmer

• some guidelines on how to industrialize your stack - A little about me

• software engineer at @tinyclues

• 10 years of experience, working every day with python for 6 years (I know, I’m old)

• first conference in japan (yeah!)

• more about.me/ohe

• slides can be found on speakerdeck.com/ohe - ipython
- • ipython is a must-use (for every pythonista)

• if you don’t use it, install it now (pip install ipython)

• ipython provides:

• a powerful interactive shell

• a browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media (as described on their website)

• easy to use, high performance tools for parallel computing - • notebook mode supports literate programming and reproducible sessions

• the notebook lets you store chunks of python alongside the results and additional comments (HTML, LaTeX, Markdown)

• a notebook can be exported in various file formats - • ipython is the de-facto standard for sharing python sessions -> see nbviewer.ipython.org

• the project is well maintained and very stable (no surprises when you upgrade your version) - numpy
- • numpy provides a powerful N-dimensional array object

• methods on these arrays are fast because they rely on well-optimised linear algebra libraries (BLAS, ATLAS, MKL)

• numpy is tolerant of python’s lists (most functions accept them in place of arrays) - python vs numpy
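As a small illustration of the numpy points above, here is a minimal sketch that builds an N-dimensional array from a plain nested list and passes python lists directly to a numpy function. The variable names (`rows`, `arr`, `column_sums`, `product`) are made up for this example; only `np.array`, `np.dot` and the array attributes and methods come from numpy.

```python
import numpy as np

# A plain nested python list, as you might already have in your code.
rows = [[1, 2, 3], [4, 5, 6]]

# numpy converts it into a 2-dimensional array.
arr = np.array(rows)
shape = arr.shape  # (2, 3)
ndim = arr.ndim    # 2

# Array methods operate on the whole array at once (here: one sum per column).
column_sums = arr.sum(axis=0)

# numpy functions also accept lists directly and return arrays.
product = np.dot([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Note that the result of `np.dot` is a numpy array even though both inputs were lists; call `.tolist()` if you need plain lists back.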

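def matmult(a, b):
    # Transpose b once; list() keeps the columns reusable for every row of a
    # (in Python 3, zip returns a one-shot iterator)
    zip_b = list(zip(*b))
    # Each result cell is the dot product of a row of a with a column of b
    return [[sum(ele_a * ele_b for ele_a, ele_b in zip(row_a, col_b))
             for col_b in zip_b] for row_a in a]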
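A comparison like this one could be timed roughly as follows. This is only a sketch under assumptions, not the benchmark used for the slides: the helper `compare`, the random test matrices and the use of `timeit` are choices made for this example, and the absolute numbers will depend on your machine and on the BLAS library numpy is linked against.

```python
import timeit

import numpy as np


def matmult(a, b):
    # Pure-python matrix product; list() lets the columns be reused for every row.
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row_a, col_b)) for col_b in cols_b]
            for row_a in a]


def compare(n, repeats=3):
    """Time matmult and np.dot on two random (n, n) matrices.

    Returns the best time of each, in seconds per call.
    """
    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    a_list, b_list = a.tolist(), b.tolist()

    # Both implementations should agree up to floating-point rounding.
    assert np.allclose(matmult(a_list, b_list), np.dot(a, b))

    t_python = min(timeit.repeat(lambda: matmult(a_list, b_list),
                                 number=1, repeat=repeats))
    t_numpy = min(timeit.repeat(lambda: np.dot(a, b),
                                number=1, repeat=repeats))
    return t_python, t_numpy


if __name__ == "__main__":
    for n in (10, 100):
        t_python, t_numpy = compare(n)
        print(f"({n}, {n}): matmult {t_python * 1e6:.0f} µs, "
              f"np.dot {t_numpy * 1e6:.0f} µs")
```

The (1000, 1000) case is left out of the loop on purpose: the pure-python version takes minutes at that size.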

shape | matmult | np.dot | speedup
(10, 10) | 936µs | 2µs | 450x
(100, 100) | 693000µs | 53µs | 13000x
(1000, 1000) | 744000000µs | 13900µs | 53000x

- • you don’t want to implement your matrix multiplication method :-)

• numpy builds on years of computer-based numerical analysis problem solving

• don’t believe benchmarks about python performance (did someone say Julia?) - scipy
- • provides numerous numerical routines that run efficiently on top of numpy arrays, for:

• optimization

• signal processing

• linear algebra …

• also provides some convenient data structures, such as compressed sparse matrices and spatial data structures - • if you have already used some scikits (scikit-learn, scikit-image), you have already used scipy extensively

• in other words, scipy is a toolbox for mathematicians; it contains many hidden treasures for them

• for the programmer, the APIs are a bit harsh, as is the naming of methods (although this naming is totally explicit for mathematicians) - matplotlib
- • The ultimate plotting library that renders high-quality 2D and 3D plots for python (I think other languages are a bit jealous too ;)

• The API mimics the MATLAB one in many ways, easing the transition to python for MATLAB users

• Once again, no surprises: matplotlib is a very stable and mature project (expect one major release per year)

• I recommend watching “Introduction to Numpy and Matplotlib” (4 hours!) on youtube*

* https://www.youtube.com/watch?v=3Fp1zn5ao2M - scikit-learn
- • scikit-learn is one of the numerous scikits that have been developed in recent years (there’s also scikit-image, scikits.statsmodels etc…)

• it provides a ready-to-use environment to play with standard machine learning algorithms

• expect a very clean API

• the project is very active and has an awesome community - pandas
- • fairly “new” project (open-sourced in 2009), but development has been really active since 2012

• data manipulation library based on numpy

• provides a DataFrame data structure with methods for accessing, merging/grouping and indexing data easily

• doesn’t play well (yet?) with scikits (there are some attempts, like sklearn-pandas) - Industrial-grade scientific python

(lessons learned) - • numpy/scipy/scikit-learn rely on many low-level Fortran/C libraries such as BLAS, ATLAS, the Intel MKL…

• most of these libraries are shipped unoptimized by your favorite OS (well, this is not the case for Mac OS)

• you may want to re-compile these libraries - • re-compiling is the (very) long way!

• we did that at tinyclues for two years; we’re now using a packaged python distribution. Some of them:

• anaconda (powered by continuum analytics)

• canopy (powered by enthought) - • sadly, these distributions come with their own package management tools (conda, enpkg) that sometimes don’t play nicely with pip and/or virtualenv

• adds a new step to this famous tweet about python package managers :) - We’re not done

• libraries for performance: numba, cython, …

• domain-specific libraries: sympy, nltk, …

• bindings: rpy2, …

• storage: pytables, … - Free (as in free beer)

• All these libraries come for free and are developed by passionate developers.

• Please, be grateful; help them!

• by finding and filing bugs (we always love to see that our code is really used by someone)

• by fixing bugs or giving a beer to developers

• by supporting them financially

• by hosting one of their sprints (if your office is big enough) - scikit-learn sprint hosted at tinyclues

july 2014 - ありがとう! (Thank you!)
- Recommended

• API design for machine learning software: experiences from the scikit-learn project: http://arxiv.org/abs/1309.0238

• Programming Collective Intelligence: http://shop.oreilly.com/product/9780596529321.do

• PyData Channel on Vimeo: http://vimeo.com/pydata