Using a Hive UDF of the Advanced Matching Module
To run a Hive UDF job, you can either run these steps individually in your Hive client within a single session, or create an HQL file listing all the required steps sequentially and run it in one go.
- In your Hive client, log in to the required Hive database.
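  For example, assuming the input table `data` used in the queries below lives in a database named `dqsamples` (a placeholder name for this sketch):

  ```sql
  -- Select the database containing the input table.
  -- 'dqsamples' is a placeholder; substitute your own database name.
  USE dqsamples;
  ```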
- Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK AMM Module:
  ADD JAR <Directory path>/amm.hive.${project.version}.jar;
- Create an alias for the Hive UDF of the Data Quality job you wish to run.
  Note: The string in quotes is the class name needed for this job to run. For example:
  CREATE TEMPORARY FUNCTION bestofbreed as 'com.pb.bdq.amm.process.hive.consolidation.bestofbreed.BestOfBreedUDAF';
- Switch off map-side aggregation by setting the hive.map.aggr configuration property to false, as indicated in this example:
  set hive.map.aggr = false;
  Note: This configuration is required for all the UDFs.
- Specify the configurations and other details for the job, and assign these to the respective variables or configuration properties.
  Note: The rule must be in JSON format. For example:
  set hivevar:rule='{"consolidationConditions": [{"consolidationRule":{"conditionClass":"simpleRule", "operation":"HIGHEST", "fieldName":"column2", "value":null, "valueFromField":false, "valueNumeric":true}, "actions":[]}], "removeDuplicates":true}';
  Note: Use the configuration properties, such as pb.bdq.match.rule, pb.bdq.match.express.column, and pb.bdq.consolidation.sort.field, in the respective job configurations, where indicated in the respective sample HQL files.
- Specify the header fields of the input table in comma-separated format, and assign them to a variable or configuration property:
  set hivevar:header='column1,column2,column3,column4,column5,id';
  Note: Use the configuration property, where indicated. For example, pb.bdq.match.header, pb.bdq.consolidation.header, and so on, in the respective sample HQL files.
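  Where a sample HQL file calls for configuration properties rather than hivevar variables, they can be set the same way. A sketch, assuming these pb.bdq.* properties accept the same values as their hivevar counterparts (confirm the exact values against the sample HQL files):

  ```sql
  -- Hypothetical sketch: supplying job settings through the configuration
  -- properties named in the notes above, instead of hivevar variables.
  set pb.bdq.consolidation.sort.field=id;
  set pb.bdq.consolidation.header=column1,column2,column3,column4,column5,id;
  ```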
- Set the sorting parameter to the alias used in the query, by assigning it to the hivevar:sortfield variable:
  set hivevar:sortfield='id';
- To run the job and display the job output on the console, write the query as indicated in this example:
  SELECT tmp2.record["column1"], tmp2.record["column2"], tmp2.record["column3"],
         tmp2.record["column4"], tmp2.record["column5"]
  FROM (
      SELECT filter(${hivevar:rule}, ${hivevar:sortfield}, ${hivevar:header},
                    innerRowID.column1, innerRowID.column2, innerRowID.column3,
                    innerRowID.column4, innerRowID.column5, innerRowID.id) AS matchgroup
      FROM (
          SELECT column1, column2, column3, column4, column5, rowid(*) AS id FROM data
      ) innerRowID
      GROUP BY column3
  ) AS innerResult
  LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record;
  To run the job and dump the job output into a designated file, write the query as indicated in this example:
  INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/HiveUDF/filter/'
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  COLLECTION ITEMS TERMINATED BY '||'
  MAP KEYS TERMINATED BY ':'
  SELECT tmp2.record["column1"], tmp2.record["column2"], tmp2.record["column3"],
         tmp2.record["column4"], tmp2.record["column5"]
  FROM (
      SELECT filter(${hivevar:rule}, ${hivevar:sortfield}, ${hivevar:header},
                    innerRowID.column1, innerRowID.column2, innerRowID.column3,
                    innerRowID.column4, innerRowID.column5, innerRowID.id) AS matchgroup
      FROM (
          SELECT column1, column2, column3, column4, column5, rowid(*) AS id FROM data
      ) innerRowID
      GROUP BY column3
  ) AS innerResult
  LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record;
Note: Use the alias defined earlier for the UDF.
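As noted at the start of this section, the individual steps can also be compiled into a single HQL file and run in one go (for example, with hive -f). A minimal sketch reusing the examples above; <Directory path> and the UDF class name are placeholders to substitute for your installation:

```sql
-- filter_job.hql: all steps of the job compiled into one file.
-- <Directory path> and <filter UDF class name> are placeholders.
ADD JAR <Directory path>/amm.hive.${project.version}.jar;
CREATE TEMPORARY FUNCTION filter as '<filter UDF class name>';

-- Required for all the UDFs: switch off map-side aggregation.
set hive.map.aggr = false;

-- Job configuration: rule (JSON), input header, and sort field.
set hivevar:rule='{"consolidationConditions": [{"consolidationRule":{"conditionClass":"simpleRule", "operation":"HIGHEST", "fieldName":"column2", "value":null, "valueFromField":false, "valueNumeric":true}, "actions":[]}], "removeDuplicates":true}';
set hivevar:header='column1,column2,column3,column4,column5,id';
set hivevar:sortfield='id';

-- Run the job and display the output on the console.
SELECT tmp2.record["column1"], tmp2.record["column2"], tmp2.record["column3"],
       tmp2.record["column4"], tmp2.record["column5"]
FROM (
    SELECT filter(${hivevar:rule}, ${hivevar:sortfield}, ${hivevar:header},
                  innerRowID.column1, innerRowID.column2, innerRowID.column3,
                  innerRowID.column4, innerRowID.column5, innerRowID.id) AS matchgroup
    FROM (
        SELECT column1, column2, column3, column4, column5, rowid(*) AS id FROM data
    ) innerRowID
    GROUP BY column3
) AS innerResult
LATERAL VIEW explode(innerResult.matchgroup) tmp2 AS record;
```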