Tiling the Genome into Consistently Named Subsequences
Enables Precision Medicine and Machine Learning with
Millions of Complex Individual Data-Sets

This project contains 502 whole genomes from the 1000 Genomes Project and 178 whole genomes from the Harvard Personal Genome Project (PGP).

The image and text below describe this Arvados Project and each of its Sub-projects. Each project is bounded by a purple dashed box and contains Collections (shown as squares), Jobs (shown as ovals), and Pipelines (abstracted as arrows that connect Collections and Jobs). Inputs to projects that contain pipelines are colored green. Outputs of projects that contain pipelines are colored light pink.

Raw data (FASTA, GFF, and masterVar files) are stored in Raw Input Data. These data are the input to Tiling – Raw Data to FASTJ, which outputs files in FASTJ, a verbose tiling format described in that project. The FASTJ files are in turn the input to Tiling – FASTJ to tile library and pythonic tilings, which outputs pythonic tilings and a tile library.

The pythonic tilings and the tile library are contained in this Project and are used as inputs to Blood Type Classifiers and Principal Component Analysis. Blood Type Classifiers is an example of supervised learning and uses ABO blood type phenotypes. Principal Component Analysis is an example of unsupervised learning and uses ethnicity phenotypes. These phenotypes, and others for the PGP participants, can be found in Harvard PGP Database Snapshot.

Two Sub-projects are used for provenance (to enable users to rerun jobs): Log files and Docker images. VCF-based Precision Medicine contains CAVA annotations on the BRCA regions for the 1000 Genomes and Harvard PGP genomes and cross-references each annotation with ClinVar and ExAC. Specific documentation for each step is given within each project.


  • Some pipeline templates take an ‘accepted-paths’ parameter, which lets the user restrict the pipeline to a subset of the genome by giving ranges of path strings. For example, [["000","002"], ["00f","014"]] results in paths 000, 001, 00f, 010, 011, 012, and 013 being used. Conversions between paths and cytobands can be found here.
  • Each pipeline template uses an Arvados git repository.
  • Please contact science@curoverse.com with any problems.
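
The ‘accepted-paths’ example above can be reproduced with a short sketch. It assumes, based only on that one example, that path strings are three-digit lowercase hexadecimal numbers and that each pair is a range that includes its start and excludes its end. The function name expand_accepted_paths and the width parameter are hypothetical and are not part of Arvados or of the pipeline templates.

```python
def expand_accepted_paths(ranges, width=3):
    """Expand [start, end) pairs of hex path strings into individual paths.

    Assumption (inferred from the documented example, not from the pipeline
    source): paths are hexadecimal, and the end of each range is excluded.
    """
    paths = []
    for start, end in ranges:
        lo, hi = int(start, 16), int(end, 16)
        if hi < lo:
            raise ValueError(f"range end {end!r} precedes start {start!r}")
        # Zero-pad each path back to the same width as the input strings.
        paths.extend(format(p, f"0{width}x") for p in range(lo, hi))
    return paths


example = expand_accepted_paths([["000", "002"], ["00f", "014"]])
print(example)
# ['000', '001', '00f', '010', '011', '012', '013']
```

Under these assumptions the output matches the seven paths listed in the example, so the sketch may be useful for checking which paths a given ‘accepted-paths’ value would select before launching a pipeline.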