Published by Quentin Stokes. Modified over 9 years ago.
Capybara Hive Integration Testing
Issues We’ve Seen at Hortonworks
- Many tests for different permutations
  – e.g. does it work with ORC, with Parquet, with Text?
- Can’t run Hive tests on a cluster
  – Forces QE to rewrite tests from scratch; hard to share resources with dev
- Tests are all small; no ability to scale
- Golden files are a grievous evil
  – Test writers have to eyeball results, which is error prone
  – A small change in a query plan forces hundreds of expected-output changes
- QE and dev work in different languages and frameworks
- It’s hard to get user queries with user-like data into the framework
  – Tests are built around feature testing and bug fixing, not user experience
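The golden-file complaint above can be illustrated with a minimal sketch: rather than diffing an expected-output file by eye, compare the rows Hive returned against rows from a benchmark source, order-insensitively. The class and method names here are illustrative, not Capybara’s real API.

```java
import java.util.*;

// Sketch: compare a Hive result set against a benchmark result set,
// ignoring row order, instead of maintaining golden files.
// ResultComparator is a hypothetical name, not part of Capybara.
public class ResultComparator {

    // Sort both result sets and compare row by row, so queries without an
    // ORDER BY don't fail just because row order differs between engines.
    public static boolean sortAndCompare(List<String> hiveRows, List<String> benchRows) {
        List<String> a = new ArrayList<>(hiveRows);
        List<String> b = new ArrayList<>(benchRows);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }

    public static void main(String[] args) {
        List<String> fromHive = Arrays.asList("b\t2", "a\t1");
        List<String> fromBenchmark = Arrays.asList("a\t1", "b\t2");
        System.out.println(sortAndCompare(fromHive, fromBenchmark)); // true
    }
}
```

Because the comparison is programmatic, a plan change that reorders rows no longer forces hundreds of expected-output edits.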
Proposed Requirements
- One test should run in all reasonable permutations
  – Spark/Tez, ORC/Parquet/Text, secure/non-secure, etc.
  – Tests can specify which options make no sense for them
- Same tests run locally and on a cluster
- Auto-generation of data and expected results
  – At varying scales
  – Expected results generated by a source of truth; this won’t work for everything, but should cover 80% of cases
- Programmatic access to the query plan
  – Add tools to make it easy to find tasks, operators, and patterns
- Java, runs in JUnit
- Ability to simulate user data and run user queries
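The requirement that “tests can specify which options make no sense for them” could be met with an annotation checked against the features configured for a run. The sketch below uses hypothetical names (`@SkipWhen`, `FeatureRunner`) and plain reflection rather than Capybara’s or JUnit’s real machinery.

```java
import java.lang.annotation.*;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.*;

// Sketch: skip a test when any feature it declares as incompatible is
// active in this run. @SkipWhen and FeatureRunner are invented names.
public class FeatureRunner {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.METHOD)
    public @interface SkipWhen { String[] value(); }

    // Invoke every static no-arg method in `tests`, skipping those whose
    // @SkipWhen overlaps the active feature set; return what actually ran.
    public static List<String> run(Class<?> tests, Set<String> activeFeatures) throws Exception {
        List<String> ran = new ArrayList<>();
        for (Method m : tests.getDeclaredMethods()) {
            SkipWhen skip = m.getAnnotation(SkipWhen.class);
            if (skip != null && !Collections.disjoint(Arrays.asList(skip.value()), activeFeatures)) {
                continue; // test says this permutation makes no sense for it
            }
            if (m.getParameterCount() == 0 && Modifier.isStatic(m.getModifiers())) {
                m.invoke(null);
                ran.add(m.getName());
            }
        }
        return ran;
    }

    // Example: an ACID test is meaningless when Spark is the engine.
    public static class MyTests {
        public static void basicSelect() {}
        @SkipWhen({"engine=spark"}) public static void acidUpdate() {}
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run(MyTests.class, new HashSet<>(Arrays.asList("engine=spark"))));
    }
}
```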
What’s There Today
- Automated data generation (random, stats based, dev specified)
- Data loaded into Hive and a benchmark
  – State is remembered so that tables are not created for every test
- Queries run against Hive and the benchmark
- Comparison of select queries and insert statements
- Works on a dev’s machine or against a cluster
  – Dev’s machine: miniclusters and Derby
  – Cluster: user-provided cluster and Postgres
- A few basic tables provided for tests
  – alltypes, capysrc, capysrcpart, TPC-H-like tables
- UserQueryGenerator
  – Takes in a set of user queries
  – Reads the user’s metastore (the user has to first run analyze table on the included tables)
  – Generates a Java test file that builds simulated data
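To make the “stats based” data generation concrete, here is a minimal sketch: given simple column statistics (min, max, null fraction), produce values that resemble the user’s data. The `ColumnStats` shape is invented for illustration; Capybara reads real statistics from the user’s metastore.

```java
import java.util.*;

// Sketch of stats-driven data generation for one bigint column.
// ColumnStats and DataGenerator are illustrative names only.
public class DataGenerator {

    public static class ColumnStats {
        final long min, max;
        final double nullFraction;
        public ColumnStats(long min, long max, double nullFraction) {
            this.min = min; this.max = max; this.nullFraction = nullFraction;
        }
    }

    // Generate `rows` values honoring the stats. A fixed seed keeps runs
    // reproducible, which matters when a generated row exposes a bug.
    public static List<Long> generate(ColumnStats stats, int rows, long seed) {
        Random rand = new Random(seed);
        List<Long> out = new ArrayList<>(rows);
        for (int i = 0; i < rows; i++) {
            if (rand.nextDouble() < stats.nullFraction) {
                out.add(null); // honor the observed null fraction
                continue;
            }
            long span = stats.max - stats.min + 1;
            out.add(stats.min + (long) (rand.nextDouble() * span));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(generate(new ColumnStats(0, 100, 0.1), 5, 42L));
    }
}
```

Generating at different row counts from the same stats is what makes the “at varying scales” requirement cheap to satisfy.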
What’s There Today, Continued
- SQL Ansifier: takes a Hive query and converts it to ANSI SQL to run against the benchmark (incomplete)
- A given run of tests can be configured with a set of features
  – e.g. file format=orc, engine=tez
- Annotations
  – ignore a test when it is inappropriate for the configured features (e.g. no ACID when Spark is the engine)
  – set configuration for features (e.g. @AcidOn)
- Scale can be set
- User can provide a custom benchmark and comparator
- Programmatic access to the query plan
  – very limited tools today; more work needed here
- Initial patch posted to HIVE-12316
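The SQL Ansifier’s job can be sketched with one rewrite rule: Hive quotes identifiers with backticks, while ANSI SQL uses double quotes. The class below is a hypothetical, single-rule stand-in; the real tool handles more Hive-isms and, per the slide, is still incomplete.

```java
// Sketch of the "SQL Ansifier" idea: rewrite Hive syntax into ANSI SQL so
// the same query can run against the benchmark database. Only the
// identifier-quoting rule is shown; Ansifier is an illustrative name.
public class Ansifier {

    // Hive: select `key` from t   ->   ANSI: select "key" from t
    public static String ansify(String hiveSql) {
        return hiveSql.replace('`', '"');
    }

    public static void main(String[] args) {
        System.out.println(ansify("select `key` from `capysrc`"));
    }
}
```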
Missing Pieces
- Limited working options
  – Need to add HBase metastore, LLAP, Spark, security, Hive Streaming, ...
  – Tez is there but SUPER slow
  – JDBC is in process
  – Binary data and complex types don’t work
  – Parallel data generation and comparison are written but not yet tested
  – Not yet a way to set or switch users (for security tests)
- Limited usage testing
  – Many options haven’t been tried, and I’m sure some don’t work
  – Limited qfiles converted
Example Test

@Test
public void simple() throws Exception {
  TableTool.createAllTypes();
  runQuery("select cvarchar from alltypes");
  sortAndCompare();
}
Example Test

@Test
public void simpleJoin() throws Exception {
  TableTool.createPseudoTpch();
  runQuery("select p_name, avg(l_price) " +
           "from ph_lineitem join ph_part " +
           "on (l_partkey = p_partkey) " +
           "group by p_name " +
           "order by p_name");
  compare();
}
Example Test

@Test
public void q1() throws Exception {
  set("hive.auto.convert.join", true);
  runQuery("drop table if exists t");
  runQuery("create table t (a string, b bigint)");
  runQuery("insert into t select c, d from u");
  IMetaStoreClient msClient = new HiveMetaStoreClient(new HiveConf());
  Table msTable = msClient.getTable("default", "t");
  TestTable tTable = new TestTable(msTable);
  tableCompare(tTable);
}
Example Explain

@Test
public void explain() throws Exception {
  TableTool.createCapySrc();
  Explain explain = explain("select k, value from capysrc order by k");
  // Expect that somewhere in the plan is a MapRedTask.
  MapRedTask mrTask = explain.expect(MapRedTask.class);
  // Find all scans in the MapRedTask.
  List<TableScanOperator> scans = explain.findAll(mrTask, TableScanOperator.class);
  Assert.assertEquals(1, scans.size());
}
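The `findAll` call above amounts to a typed search over the plan tree. A minimal, self-contained sketch of that idea follows; `PlanNode`, `TableScan`, and `Join` are stand-ins for Hive’s real task and operator classes, not Capybara’s actual types.

```java
import java.util.*;

// Sketch of programmatic plan inspection: walk a tree of plan nodes and
// collect every node of a requested type. All class names are invented
// stand-ins for Hive's real plan classes.
public class PlanWalker {

    public static class PlanNode {
        final List<PlanNode> children = new ArrayList<>();
        PlanNode add(PlanNode child) { children.add(child); return this; }
    }
    public static class TableScan extends PlanNode {}
    public static class Join extends PlanNode {}

    // Depth-first traversal; returns all nodes assignable to `type`.
    public static <T extends PlanNode> List<T> findAll(PlanNode root, Class<T> type) {
        List<T> found = new ArrayList<>();
        if (type.isInstance(root)) found.add(type.cast(root));
        for (PlanNode c : root.children) found.addAll(findAll(c, type));
        return found;
    }

    public static void main(String[] args) {
        PlanNode plan = new Join().add(new TableScan()).add(new TableScan());
        System.out.println(findAll(plan, TableScan.class).size()); // 2
    }
}
```

Searching by class rather than by text is what frees tests from golden-file diffs of `explain` output.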
Run a Test

Locally, using the default options:
  mvn test -Dtest=TestSkewJoin

Locally, specifying Tez as the engine:
  mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.engine=tez

On a cluster:
  mvn test -Dtest=TestSkewJoin -Dhive.test.capybara.use.cluster=true \
    -DHADOOP_HOME=your_hadoop_path -DHIVE_HOME=your_hive_path
Simulate User Queries
- Create select queries, one file for each test (a file may contain more than one query)
- Run analyze table with column-stats collection for each table with source data
- Then run the generator, which outputs TestQueries.java:
  hive --service capygen -i queries/*.sql -o TestQueries
Questions