Using a Hive UDF of the Data Normalization Module
To run a Hive UDF job, you can either run these steps individually in your Hive client within a single session, or compile all the required steps sequentially into an HQL file and run it in one go.
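If you choose the single-file approach, you can run the compiled HQL file from within a Hive CLI session with the source command, or non-interactively with hive -f. A minimal sketch, assuming the steps below are saved to a hypothetical file at /home/hadoop/hql/tablelookup_job.hql:
source /home/hadoop/hql/tablelookup_job.hql;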
- In your Hive client, log in to the required Hive database.
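For example, assuming a hypothetical database named customer_db:
USE customer_db;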
- Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK DNM Module.
ADD JAR <Directory path>/dnm.hive.${project.version}.jar;
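For example, with the JAR copied to a hypothetical path /home/hadoop/lib (keep ${project.version} as shipped with your SDK), you can confirm the registration with LIST JARS:
ADD JAR /home/hadoop/lib/dnm.hive.${project.version}.jar;
LIST JARS;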
- Create an alias for the Hive UDF of the Data Quality job you wish to run.
Note: The string in quotes is the name of the class required for this job to run. For example:
CREATE TEMPORARY FUNCTION tablelookup as 'com.pb.bdq.dnm.process.hive.tablelookup.TableLookUpUDF';
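To confirm the alias is available in the current session, you can check it with standard Hive commands:
SHOW FUNCTIONS LIKE 'tablelookup';
DESCRIBE FUNCTION tablelookup;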
- Specify the reference data path. The reference data can be placed in any of these ways; an annotated sketch of the JSON fields follows this list.
  - Reference data is on HDFS, and is to be downloaded to a working directory for the jobs.
    - If the reference data is in unarchived file format, set the reference directory as:
      set hivevar:referenceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData", "dataDownloader":{"dataDownloader":"DC"}}';
    - If the reference data is in archived format, set the reference directory as:
      set hivevar:referenceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData.zip", "dataDownloader":{"dataDownloader":"DC"}}';
  - Reference data is on HDFS, and is to be downloaded to local data nodes for the jobs. In this case, set the reference data directory as:
      set hivevar:referenceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData", "dataDownloader":{"dataDownloader":"HDFS","localFSRepository":"/local/download"}}';
  - Reference data is on a local path. Ensure that the data is present on the same path on each node of the cluster. Set the reference directory as:
      set hivevar:referenceDataDetails='{"referenceDataPathLocation":"LocaltoDataNodes", "dataDir":"/home/data/referenceData"}';
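As a reading aid, here is the first of the settings above annotated with the role each JSON field plays, as inferred from the examples in this list:
-- "referenceDataPathLocation": "HDFS" or "LocaltoDataNodes" -- where the reference data resides
-- "dataDir": directory or archive holding the reference data
-- "dataDownloader": how the data reaches the nodes; "DC" downloads to the job's
--                   working directory, while "HDFS" with "localFSRepository"
--                   downloads to the given local path on each node
set hivevar:referenceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData", "dataDownloader":{"dataDownloader":"DC"}}';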
- Specify the configurations and other details for the job, and assign these to the respective variables or configuration properties.
Note: The rule must be in JSON format. For example:
set hivevar:rule='{"rules":[{"action":"Standardize", "source":"CityCode", "tableName":"State Name Abbreviations", "lookupMultipleWordTerms":false, "lookupIndividualTermsWithinField":false, "destination":"CityCode"}]}';
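The same rule, spread across lines as HQL comments for readability; the field roles noted here are inferred from this example:
-- '{"rules":[{
--    "action": "Standardize",                  -- lookup action to perform
--    "source": "CityCode",                     -- input field whose terms are looked up
--    "tableName": "State Name Abbreviations",  -- lookup table to match against
--    "lookupMultipleWordTerms": false,
--    "lookupIndividualTermsWithinField": false,
--    "destination": "CityCode"                 -- field receiving the standardized value
-- }]}'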
Note: Ensure that you use the configuration properties in the respective job configurations, for example, pb.bdq.match.rule, pb.bdq.match.express.column, pb.bdq.consolidation.sort.field, and so on, where indicated in the respective sample HQL files.
- Specify the header fields of the input table in comma-separated format, and assign them to a variable or configuration property.
set hivevar:header='AccountDescription,Address,ApartmentNumber,CityCode';
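The examples that follow query a table named citizen_data. A minimal sketch of such a table, assuming STRING columns matching the header fields above (the schema is illustrative, not prescribed by the SDK):
CREATE TABLE IF NOT EXISTS citizen_data (
accountdescription STRING,
address STRING,
apartmentnumber STRING,
citycode STRING
);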
- To run the job and display the job output on the console, write the query as indicated in this example:
SELECT bar.ret["StandardizationTermIdentified"], bar.ret["accountdescription"], bar.ret["address"], bar.ret["apartmentnumber"], bar.ret["citycode"]
FROM (
SELECT tablelookup(${hivevar:rule}, ${hivevar:referenceDataDetails}, ${hivevar:header}, accountdescription, address, apartmentnumber, citycode) AS ret
FROM citizen_data
) bar;
To run the job and dump the job output in a designated file, write the query as indicated in this example:
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/TableLookup/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
SELECT bar.ret["StandardizationTermIdentified"], bar.ret["accountdescription"], bar.ret["address"], bar.ret["apartmentnumber"], bar.ret["citycode"]
FROM (
SELECT tablelookup(${hivevar:rule}, ${hivevar:referenceDataDetails}, ${hivevar:header}, accountdescription, address, apartmentnumber, citycode) AS ret
FROM citizen_data
) bar;
Note: Use the alias defined earlier for the UDF.
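Putting it all together, a sketch of a complete HQL file for this job; the database name and JAR path are assumptions, and ${project.version} must be replaced with your SDK version before running:
-- tablelookup_job.hql -- run with: hive -f tablelookup_job.hql
USE customer_db;
ADD JAR /home/hadoop/lib/dnm.hive.${project.version}.jar;
CREATE TEMPORARY FUNCTION tablelookup as 'com.pb.bdq.dnm.process.hive.tablelookup.TableLookUpUDF';
set hivevar:referenceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData", "dataDownloader":{"dataDownloader":"DC"}}';
set hivevar:rule='{"rules":[{"action":"Standardize", "source":"CityCode", "tableName":"State Name Abbreviations", "lookupMultipleWordTerms":false, "lookupIndividualTermsWithinField":false, "destination":"CityCode"}]}';
set hivevar:header='AccountDescription,Address,ApartmentNumber,CityCode';
SELECT bar.ret["StandardizationTermIdentified"], bar.ret["accountdescription"], bar.ret["address"], bar.ret["apartmentnumber"], bar.ret["citycode"]
FROM (
SELECT tablelookup(${hivevar:rule}, ${hivevar:referenceDataDetails}, ${hivevar:header}, accountdescription, address, apartmentnumber, citycode) AS ret
FROM citizen_data
) bar;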