Using a Hive UDF of Universal Addressing Module
To run each Hive UDF job, you can either run these steps individually on your Hive client within a single session, or create an HQL file compiling all the required steps sequentially and run it in one go.
- In your Hive client, log in to the required Hive database.
-
Register the JAR file of Spectrum™ Data & Address Quality for Big Data SDK UAM Module.
ADD JAR /home/hduser/uam/uam.universaladdress.hive.${project.version}.jar;
-
Create an alias for the Hive UDF of the Address Quality job you want to
run.
Note: String in quotes represents the class names needed for this job to run.For example:
CREATE TEMPORARY FUNCTION addressvalidation as 'com.pb.bdq.uam.process.hive.addressvalidation.AddressValidationUDF';
-
Enable or disable the hive fetch task conversion.
For Example:
set hive.fetch.task.conversion=none;
-
Use
hivevar:engineconf
to set the engine configurations. It includes details, such as database settings, COBOL runtime path, process type, DPV DB path, suiteLinkDBPath, ewsDBPath, rdiDBPath, lacsDBPath and preloading typeFor Example:set hivevar:engineconf='{"referenceData":{"dataDir":"/user/hduser/ReferenceData/ AddressQuality/UAM/Data.zip","referenceDataPathLocation":"HDFS"},"cobolRuntimePath":"", "modulesDir":"","dpvDbPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip", "suiteLinkDBPath":"/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip","ewsDBPath": "/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip","rdiDBPath":null,"lacsDBPath": "/user/hduser/ReferenceData/AddressQuality/UAM/Data.zip"}';
-
Specify the settings for the input data using the
hivevar:inputoption
parameter.For example:set hivevar:inputoption='{"casing":"Mixed","matchMode":"Relaxed","defaultCountry":"GBR", "maximumResults":2,"returnInputAddress":false, "returnParsedAddress":false,"returnPrecisionCode":false,"returnMatchScore":true, "mustMatchAddressNumber":false,"mustMatchStreet":false,"mustMatchCity":false, "mustMatchLocality":false,"mustMatchState":false,"mustMatchStateProvince":false, "mustMatchPostCode":false,"keepMultiMatch":true,"preferPostalOverCity":false, "cityFallback":true,"postalFallback":true, "validationLevel":"ADDRESS"}';
-
Set the general configurations, such as cacheSize, maxAddressObjectCount, and
maxMemoryUsageMB, using
hivevar:generalconf
.For example:set hivevar:generalconf='{"cacheSize":"LARGE","maxThreadCount":8,"maxAddressObjectCount":8, "rangesToExpand":"NONE","flexibleRangeExpansion":"ON","enableTransactionLogging":false, "maxMemoryUsageMB":1024,"verbose":false}';
-
Specify the desired validation level to be used in a particular Hive job.
Currently, only address validation is supported. So, set this value to
VALIDATE.
For example;
set hivevar:processtype='VALIDATE';
-
Specify the header fields of the input table in comma-separated
format, and assign to a variable or configuration property.
set hivevar:header='inputkeyvalue,AddressLine1,AddressLine2,City,postalcode, StateProvince,firmname,Country';
-
To run the job and display the job output on the
console, write the query as indicated in this example:
Note: This query returns a map of key value pairs containing output fields for each row.
To run the job and dump the job output in a designated file, write the query as indicated in this example:SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"], tmp2.record["StreetName"],tmp2.record["PostalCode"],tmp2.record["ElementInputStatus"], tmp2.record["MailabilityScore"] FROM ( SELECT globalvalidation(${hivevar:engineconf}, ${hivevar:generalconf},${hivevar:inputconf},${hivevar:unlockCode},${hivevar:header},recordid, addressline1,city,stateprovince,postalcode,country) as mygp from address) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record ;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/GlobalAddressing/' row format delimited FIELDS TERMINATED BY ',' lines terminated by '\n' STORED AS TEXTFILE SELECT tmp2.record["HouseNumber"],tmp2.record["Confidence"],tmp2.record["AddressLine1"], tmp2.record["StreetName"],tmp2.record["PostalCode"],tmp2.record["ElementInputStatus"], tmp2.record["MailabilityScore"] FROM ( SELECT globalvalidation(${hivevar:engineconf}, ${hivevar:generalconf},${hivevar:inputconf},${hivevar:unlockCode},${hivevar:header}, recordid,addressline1,city,stateprovince,postalcode,country) as mygp from address) as addressgroup LATERAL VIEW explode(addressgroup.mygp) tmp2 as record ;
Note: Use the alias defined earlier for the UDF.