
A collection of data science examples implemented across a variety of languages and libraries.


h2oai/data-science-examples


data-science-examples

View this site in GitHub Pages: http://h2oai.github.io/data-science-examples/



1. Goals

Goal: To provide a side-by-side framework for adding code examples in many different environments

  • Think of this as a Rosetta Stone for examples. If you are familiar with one environment but want to explore a different language or library, find the use case you are interested in and compare the implementation for the environment you know with the others.

Goal: It should be easy to add an example

  • Adding a new example requires a minimum of files to touch.
  • Actual ugly html is autogenerated.
  • The gen.py generator tool produces human-readable error messages.

Goal: Encourage lots of different people to add an example

  • Use a public GitHub repository.
  • Provide docs making it clear how to contribute.

Goal: It should be easy to add a new kind of example

  • e.g. a new language type.

Goal: Examples should be testable

  • Example code snippets should be runnable.
  • Jenkins job should run them automatically.

Goal: Provide a library of runnable and easy-to-access answers to common questions

  • A library of "How do I do this?" answers.

Goal: It should be possible to cut-and-paste a "stable" link for a given example

  • When responding to a question "on the internet" (i.e. something searchable with historical memory), it's good to provide a stable link.
  • The relative_hyperlink() method in gen.py calculates this stable hyperlink and embeds it into the generated examples.html document.
  • The hyperlink depends on the example name ('ex.txt'), not the directory names.
  • Note: The numbering is automatic and can change if new items are inserted. As such, the numbering is not stable. If items are only "appended", then the number sequence is stable.
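
As a rough illustration of why a name-based link stays stable, here is a minimal sketch. It is not the actual relative_hyperlink() implementation from gen.py; the function name stable_anchor and the slug rule are assumptions made for this example only.

```python
import re


def stable_anchor(example_name):
    """Hypothetical helper: build an anchor from the one-line name in 'ex.txt'."""
    # Lower-case the name and collapse each run of other characters to a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", example_name.strip().lower())
    # The result depends only on the example name, never on directory names
    # or on the automatic numbering.
    return "#" + slug.strip("-")
```

Because the anchor is computed from the example name alone, moving the example to another directory or inserting new examples before it would leave the link unchanged under this scheme.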

Goal: Provide support for tags

  • Have a convenient way of tagging examples with descriptive words.

Non-goals

  • While providing big-data examples is certainly in scope, providing examples "on" big-data is not. The data sets actually in the repo should be small. An example of a good practice would be to provide a runnable example on a small dataset with a non-running comment pointing to where to find the big data.
  • Creating a new blog framework.
  • Human language internationalization.


2. Adding a new example

Conventions

  • gen.py is PEP-8 compliant. Please keep it that way.
  • IntelliJ IDEA (or PyCharm) was used to develop gen.py, and has built-in PEP-8 support.
  • Data files should be small (in general, less than 1 MB). Cloning the repo should be fast so that a new contributor can get started quickly. Re-use data files whenever possible.

The generation process

The gen.py tool creates the resulting examples.html file. (Look at the trivial Makefile.)

Tools required to run the generator

  • Python (gen.py was developed with Python 2.7)
  • npm
  • npm's markdown command-line tool

I installed the markdown tool on my MacBook Pro with the following command:

npm install markdown-to-html -g

Commands to run

On MacBook Pro:

make
git add examples.html
git commit
git push

Top-level directory layout

README.md
This file.

Makefile
Very simple helper for running the generation process.

./gen.py
Tool to generate examples.html.

examples
The example code. New files generally want to go somewhere in here.

examples.html
Generated from files in the examples directory.

data
Data used by examples.

index.html
What gh-pages points to.

packages
Some helper packages used by the examples (for example, the 'ex' package for R).

static
Static resources (jquery, bootstrap, highlight.js).

Adding a new case for an existing example

Usually this is as easy as dropping one more file, with a name that gen.py knows to look for, into the existing example directory. No metadata files need to be updated.

If you want to add a totally new kind of example, read on.

Adding a new kind of example (i.e. language type)

gen.py has the following three arrays. (The variables are named oddly so that they satisfy PEP-8 and still line up visually.)

    _lang__________ = ["lang-r", "lang-r"]
    _tabs_to_check_ = ["R",      "h2o-R"]
    _files_to_check = ["ex-R.R", "ex-h2o.R"]

Adding a new kind of example means adding an element to each of these arrays.

  • The lang array holds the name of the language according to highlight.js. If you add a new, less common language, the highlight.js package might need updating (see the 'static' directory). Note that the following additional languages were added to the set of "checked language boxes" when downloading highlight.js.
    • R
    • Scala
  • The tabs array is the name of the tab as seen by the user.
  • The files array is the name of the file checked for by gen.py.
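
The three arrays are parallel: position i in each one describes the same kind of example. Here is a minimal sketch of how they could be consumed together. The function name find_tabs is an assumption for illustration and is not taken from gen.py.

```python
import os

# Copied from the listing above.
_lang__________ = ["lang-r", "lang-r"]
_tabs_to_check_ = ["R",      "h2o-R"]
_files_to_check = ["ex-R.R", "ex-h2o.R"]


def find_tabs(example_dir):
    """Return (tab, lang, path) for each known code file present in example_dir."""
    found = []
    # zip() walks the three parallel arrays in lockstep, one example kind at a time.
    for lang, tab, filename in zip(_lang__________,
                                   _tabs_to_check_,
                                   _files_to_check):
        path = os.path.join(example_dir, filename)
        if os.path.isfile(path):
            found.append((tab, lang, path))
    return found
```

Under this reading, supporting a new kind of example means appending one entry to each list at the same position, so that the highlight.js language, the tab label, and the file name stay aligned.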

Adding a new example

  • Create a new directory. The name of the directory is not used for any of the generated output.

  • Add your new directory to the 'idx' file of the category that contains it.

  • Add an 'ex.txt' file in your new directory. This is one line that contains the example name.

  • Add an 'ex.md' file in your new directory. This is a markdown file that describes the example. For consistency with the generated code from gen.py, this should not include H1, H2 or H3 tags. It may include H4, H5 and H6 tags.

  • The example description markdown file 'ex.md' is converted to html using the node markdown tool.

  • Optionally create an 'ex.tags' file with one tag per line. Tags may include lower-case letters, numbers, underscores, and spaces.

  • Create a single code file for each kind of example you want to provide.

    • ex-R.R
    • ex-h2o.R
    • ... etc.
  • Code example files are copied verbatim into the generated examples.html.

The names of the code example files must match exactly what gen.py expects.
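
The steps above can be sketched as a small scaffolding script. This helper is not part of the repository; the names scaffold_example and is_valid_tag are hypothetical. It also assumes the parent 'idx' file, if it already exists, ends with a newline.

```python
import os
import re

# Tags may include lower-case letters, numbers, underscores, and spaces.
_TAG_RE = re.compile(r"^[a-z0-9_ ]+\Z")


def is_valid_tag(tag):
    """Return True if tag uses only the characters allowed in 'ex.tags'."""
    return bool(_TAG_RE.match(tag))


def scaffold_example(parent_dir, dir_name, example_name, tags=()):
    """Create the metadata files for a new example and register it in 'idx'."""
    bad = [t for t in tags if not is_valid_tag(t)]
    if bad:
        raise ValueError("invalid tags: %r" % (bad,))

    example_dir = os.path.join(parent_dir, dir_name)
    os.makedirs(example_dir)

    # 'ex.txt' is a single line holding the example name.
    with open(os.path.join(example_dir, "ex.txt"), "w") as f:
        f.write(example_name + "\n")

    # 'ex.md' is the description; placeholder text, no H1-H3 headings.
    with open(os.path.join(example_dir, "ex.md"), "w") as f:
        f.write("Describe the example here.\n")

    # 'ex.tags' is optional, one tag per line.
    if tags:
        with open(os.path.join(example_dir, "ex.tags"), "w") as f:
            f.write("\n".join(tags) + "\n")

    # Register the new directory in the containing category's 'idx' file.
    with open(os.path.join(parent_dir, "idx"), "a") as f:
        f.write(dir_name + "\n")

    return example_dir
```

The code files themselves (ex-R.R, ex-h2o.R, and so on) still have to be written by hand and placed in the returned directory.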

Finding data files

The ex R package has a locate function which you may find helpful.

Adding a new category (or subcategory)

  • Create a new directory. The name of the directory is not used for any of the generated output.

  • Add your new directory to the idx file of the category that contains it.

  • Add a 'cat.txt' file in your new directory. This is one line that contains the category name.

  • Add an 'idx' file in your new directory. 'idx' is a multi-line file. Each line contains the name of a directory. Each directory is either a sub-category or an example. Items appear in the order they are listed in the idx file.

  • Note that the top-level category (Data Science Examples) is special and generally ignored by gen.py so that "Data Science Examples" isn't repeated all over the place.
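
To make the layout concrete, here is a minimal sketch of a traversal that follows the 'idx' files in order. It shows how the 'cat.txt', 'idx' and 'ex.txt' files relate to each other. It is not the code in gen.py, and the function names are assumptions.

```python
import os


def read_lines(path):
    """Return the non-empty lines of a file, stripped of whitespace."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def walk(directory, depth=0):
    """Yield (depth, kind, name) for directory and everything below it."""
    cat_file = os.path.join(directory, "cat.txt")
    ex_file = os.path.join(directory, "ex.txt")
    if os.path.isfile(cat_file):
        # A category: report its name, then visit children in 'idx' order.
        yield depth, "category", read_lines(cat_file)[0]
        for child in read_lines(os.path.join(directory, "idx")):
            for item in walk(os.path.join(directory, child), depth + 1):
                yield item
    elif os.path.isfile(ex_file):
        # An example: 'ex.txt' holds its one-line name.
        yield depth, "example", read_lines(ex_file)[0]
    else:
        raise ValueError("neither cat.txt nor ex.txt in " + directory)
```

Note that the directory names only steer the traversal. The names reported come from 'cat.txt' and 'ex.txt', which matches the statement that directory names are not used in the generated output.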



3. Testing

Testing will be driven by a Jenkins job that makes some assumptions.

  • Assumption: H2O can run anywhere (in terms of cwd) on the local machine. The test must provide an absolute path to h2o. (This is why the "locate" function is useful.)
  • Assumption: H2O is running with 1 node on the local machine. It will be started ahead of time. h2o.init() will work and find an h2o.
  • Assumption: Tests will be run one at a time on a machine with 8 GB RAM.
  • Assumption: H2O will be started with -Xmx5g.
  • Assumption: Each example has a "fresh" H2O with no keys in the K/V store.
  • Assumption: Tests can be run in any order.
  • Assumption: The test itself will be started with the cwd that the test file lives in.
  • Assumption: The ex package will be installed when R is run.
  • Assumption: The H2O package will be installed when R is run.

Other dos and don'ts:

  • Examples should not call h2o.shutdown().
  • H2O R tests will be run via:
    • R -f ex-h2o.R
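
Since the Jenkins job is still a TODO, the following is only a sketch of how a runner might honor two of the assumptions above: the command is R -f ex-h2o.R, and the cwd is the directory the test file lives in. The function name run_h2o_r_example and the runner parameter are assumptions for illustration.

```python
import os
import subprocess


def run_h2o_r_example(example_dir, runner=subprocess.call):
    """Run 'R -f ex-h2o.R' with cwd set to the example's own directory.

    Returns the exit code reported by the runner (0 means success).
    """
    script = "ex-h2o.R"
    if not os.path.isfile(os.path.join(example_dir, script)):
        raise IOError("no %s in %s" % (script, example_dir))
    # The script name is relative because cwd is the example directory.
    return runner(["R", "-f", script], cwd=example_dir)
```

The runner parameter exists only so the function can be exercised without R installed. In normal use the default subprocess.call starts R, and a non-zero return value could drive a per-example red/green result.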

TODO:

  • Set up Jenkins job
  • How to see the results? Would be nice to have a per-example red/green light somehow.

How to locally build and install the ex R package

I did this with RStudio... TODO: Need better instructions here.
