Pig Contributors Workshop
- 2 - Agenda
Introductions
What we are working on
Usability
Howl
TLP
Lunch
Turing Completeness
Workflow
Fun (Bocce ball)
- 3 - Richard Ding – Usage stats collection
New Top-level API
    package org.apache.pig;
    public class PigRunner {
        public static PigStats run(String args[]);
    }
New Entries in Job XML
    pig.script.id, pig.launcher.host, pig.command.line, pig.parent.jobid, pig.alias, pig.script.features, pig.job.feature, pig.version, pig.hadoop.version
New Counter Groups
    MultiStoreCounters, MultiInputCounters
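One way a stats collector might consume the new job XML entries is to read a job's configuration file and keep only the properties whose names start with "pig.". The sketch below is a rough illustration, not part of Pig: it assumes the usual Hadoop configuration layout (property elements with name and value children), the class PigJobXmlEntries and its method names are hypothetical, and the sample values are placeholders rather than real job data.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not a Pig class: extracts pig.* entries from a job XML string.
class PigJobXmlEntries {

    /** Returns every property whose name starts with "pig.", sorted by name. */
    static Map<String, String> pigEntries(String jobXml) {
        Map<String, String> entries = new TreeMap<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(jobXml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String name = childText(prop, "name");
                if (name != null && name.startsWith("pig.")) {
                    entries.put(name, childText(prop, "value"));
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse job XML", e);
        }
        return entries;
    }

    /** Text of the first child element with the given tag, or null if absent. */
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        // Placeholder values, for illustration only.
        String sample = "<configuration>"
                + "<property><name>pig.alias</name><value>A,B</value></property>"
                + "<property><name>pig.job.feature</name><value>placeholder</value></property>"
                + "<property><name>mapred.job.name</name><value>placeholder</value></property>"
                + "</configuration>";
        pigEntries(sample).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Filtering on the "pig." prefix keeps the helper independent of the exact list of entries, which may change between releases.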
- 4 - Ashutosh Chauhan – UDFs in scripting languages
- 5 - Daniel Dai – Optimizer rewrite
Why do we need an optimizer
–Complex scripts are hard to optimize by hand
–In practice, the optimizer kicks in quite often on user scripts
Brand new framework that makes adding a rule easier (PIG-1178)
Optimization rules (PIG-1319)
–Split filter
–Pushup filter
–Merge filter
–Prune columns
–Pushdown foreach flatten
–Expression optimizer
–Merge foreach
–…
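To make the idea of a rewrite rule concrete, here is a toy version of a merge-filter rule: adjacent filters in a linear plan are collapsed into one filter whose condition is the conjunction of the originals. This is only an illustration of the general technique. It does not use or resemble the actual framework from PIG-1178, and every name in it (MergeFilterSketch, Op, mergeAdjacentFilters) is hypothetical. Rows are modeled as int arrays to keep the example small.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Toy model of a linear plan; not Pig's logical plan classes.
class MergeFilterSketch {

    /** A plan operator. Filters carry a predicate; other operators leave it null. */
    static final class Op {
        final String description;
        final Predicate<int[]> predicate;

        Op(String description, Predicate<int[]> predicate) {
            this.description = description;
            this.predicate = predicate;
        }

        boolean isFilter() {
            return predicate != null;
        }
    }

    static Op filter(String condition, Predicate<int[]> predicate) {
        return new Op(condition, predicate);
    }

    static Op other(String name) {
        return new Op(name, null);
    }

    /** Collapses each run of adjacent filters into a single filter (logical AND). */
    static List<Op> mergeAdjacentFilters(List<Op> plan) {
        List<Op> out = new ArrayList<>();
        for (Op op : plan) {
            Op last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (op.isFilter() && last != null && last.isFilter()) {
                String merged = "(" + last.description + ") and (" + op.description + ")";
                out.set(out.size() - 1, new Op(merged, last.predicate.and(op.predicate)));
            } else {
                out.add(op);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Op> plan = Arrays.asList(
                other("load"),
                filter("$0 > 1", row -> row[0] > 1),
                filter("$0 < 5", row -> row[0] < 5),
                other("store"));
        for (Op op : mergeAdjacentFilters(plan)) {
            System.out.println((op.isFilter() ? "filter by " : "") + op.description);
        }
    }
}
```

The merged plan evaluates one predicate per row instead of two, which is the kind of saving the rule is after. A real rule would also have to check that the filters are safe to combine.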
- 6 - Aniket Mokashi – Custom partitioner && Scalar
Custom partitioner
–Use case
    Controls how output is sprayed across reducers through the getPartition function
    Allows a custom grouping policy
    B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
Scalar
    A = load 'censors_total' as (state, population);
    B = group A all;
    total = foreach B generate SUM(population);
    C = foreach A generate state, population/(long)total as percentage;
    store C into 'censors_percentage';
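The getPartition function mentioned above takes a key, a value, and the number of partitions, and returns the index of the partition (reducer) that should receive the record. The sketch below mirrors that shape with a stand-in interface so it runs without Hadoop on the classpath. The interface KeyPartitioner and the modulo policy in ModuloPartitioner are assumptions made for illustration, and are not claimed to match what SimpleCustomPartitioner does.

```java
// Stand-alone illustration of a partitioning policy; no Hadoop dependency.
class PartitionerSketch {

    /** Stand-in with the same method shape as Hadoop's Partitioner.getPartition. */
    interface KeyPartitioner<K, V> {
        int getPartition(K key, V value, int numPartitions);
    }

    /** Hypothetical policy: integer keys go to key mod N, other keys go by hash code. */
    static final class ModuloPartitioner implements KeyPartitioner<Object, Object> {
        @Override
        public int getPartition(Object key, Object value, int numPartitions) {
            if (key == null) {
                return 0; // send null keys to the first partition
            }
            int h = (key instanceof Integer) ? (Integer) key : key.hashCode();
            // floorMod keeps the result in [0, numPartitions) even for negative input
            return Math.floorMod(h, numPartitions);
        }
    }

    public static void main(String[] args) {
        KeyPartitioner<Object, Object> partitioner = new ModuloPartitioner();
        Object[] keys = {1, 2, 3, 4, "CA", "NY"};
        for (Object key : keys) {
            System.out.println(key + " -> partition " + partitioner.getPartition(key, null, 2));
        }
    }
}
```

With parallel 2, as in the group statement above, a policy of this kind decides which of the two reducers sees each group key.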
- 7 - Olga Natkovich – Usability and error messages
New parser that allows better control over error messages
More meaningful error messages
Early error detection
Clarified language semantics
Resurrect support for illustrate
- 8 - Howl, Why We Need It
What we have now
Hive has its own data catalog
Pig, Map Reduce can
–Use an InputFormat or loader that knows the schema (e.g. ElephantBird)
–Describe the schema in code
    A = load 'foo' as (x:int, y:float)
–Still have to know where to read and write files themselves
Must write a Loader and a SerDe to read a new file type in Pig and Hive
Workflow systems must poll HDFS to see when data is available
- 9 - Howl, What We Want
Given an InputFormat and OutputFormat, only need to write one piece of code to read/write data for all tools
Schema shared across tools
Disk location and storage format abstracted by service
Workflow notified of data availability by service
[Diagram labels: table mgmt service; Pig, Hive, Map Reduce, Streaming; RCFile, Sequence File, Text File]
TLP
Alan Gates – Turing complete Pig
Options on the table so far
–Extend Pig Latin itself
–Embed in scripting language via precompiler
–Embed in scripting language as DSL
Pig Integration With Workflow
In Conclusion
Should we do this more often?
Thanks everyone for coming