Using a Hive UDF of the Universal Name Module
To run a Hive UDF job, you can either run the steps below individually in your Hive client within a single session, or create an HQL file that compiles all the required steps sequentially and run it in one go.
- In your Hive client, log in to the required Hive database.
- Register the JAR file of the Spectrum™ Data & Address Quality for Big Data SDK UNM Module:
  ADD JAR <Directory path>/unm.hive.${project.version}.jar;
- Create an alias for the Hive UDF of the Data Quality job you wish to run. For example:
  CREATE TEMPORARY FUNCTION opennameparser as 'com.pb.bdq.unm.process.hive.opennameparser.OpenNameParserUDF';
- Specify the reference data path. The following placement options are available:
  - Reference data is on HDFS and is to be downloaded to a working directory for the jobs.
    - If the reference data is in unarchived file format, set the reference directory as:
      set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData","dataDownloader":{"dataDownloader":"DC"}}';
    - If the reference data is in archived format, set the reference directory as:
      set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"./referenceData.zip","dataDownloader":{"dataDownloader":"DC"}}';
  - Reference data is on HDFS and is to be downloaded to local nodes for the jobs. In this case, set the reference data directory as:
    set hivevar:refereceDataDetails='{"referenceDataPathLocation":"HDFS", "dataDir":"/home/data/dm/referenceData","dataDownloader":{"dataDownloader":"HDFS","localFSRepository":"/local/download"}}';
  - Reference data is on a local path. Ensure that the data is present on each node of the cluster at the same path, and set the reference directory as:
    set hivevar:refereceDataDetails='{"referenceDataPathLocation":"LocaltoDataNodes", "dataDir":"/home/data/referenceData"}';
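The refereceDataDetails values above are plain JSON strings embedded in single quotes, which makes quoting mistakes easy. A minimal Python sketch for building and validating them before pasting into the set statement; the field names and sample paths are taken from the examples above, and the helper itself is illustrative, not part of the SDK:

```python
import json

def reference_data_details(location, data_dir, downloader=None, local_repo=None):
    """Build the JSON value for hivevar:refereceDataDetails.

    location   -- "HDFS" or "LocaltoDataNodes" (values from the examples above)
    data_dir   -- path to the reference data directory or archive
    downloader -- optional downloader type, e.g. "DC" or "HDFS"
    local_repo -- local download repository, used with the "HDFS" downloader
    """
    details = {"referenceDataPathLocation": location, "dataDir": data_dir}
    if downloader is not None:
        dd = {"dataDownloader": downloader}
        if local_repo is not None:
            dd["localFSRepository"] = local_repo
        details["dataDownloader"] = dd
    return json.dumps(details)

# Unarchived data on HDFS, downloaded to the working directory:
print("set hivevar:refereceDataDetails='%s';"
      % reference_data_details("HDFS", "./referenceData", downloader="DC"))
```

Round-tripping the string through json.loads is a cheap way to catch a missing brace or quote before the value reaches Hive.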
- Specify the configurations and other details for the job, and assign these to the respective variables or configuration properties.
  Note: The rule must be in JSON format. For example:
  set hivevar:rule='{"name":"name", "culture":"", "splitConjoinedNames":false, "shortcutThreshold":0, "parseNaturalOrderPersonalNames":false, "naturalOrderPersonalNamesPriority":1, "parseReverseOrderPersonalNames":false, "reverseOrderPersonalNamesPriority":2, "parseConjoinedNames":false, "naturalOrderConjoinedPersonalNamesPriority":3, "reverseOrderConjoinedPersonalNamesPriority":4, "parseBusinessNames":false, "businessNamesPriority":5}';
  Note: Use the configuration properties in the respective job configurations, for example, pb.bdq.match.rule, pb.bdq.match.express.column, and pb.bdq.consolidation.sort.field, where indicated in the respective sample HQL files.
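Like the reference data details, the rule is a long JSON string in single quotes, so a syntax slip is easy to miss until the job fails. A small sketch that generates and sanity-checks the rule string; the field names and values are copied from the example above, and the script itself is illustrative, not part of the SDK:

```python
import json

# Default OpenNameParser rule, mirroring the example above.
rule = {
    "name": "name",
    "culture": "",
    "splitConjoinedNames": False,
    "shortcutThreshold": 0,
    "parseNaturalOrderPersonalNames": False,
    "naturalOrderPersonalNamesPriority": 1,
    "parseReverseOrderPersonalNames": False,
    "reverseOrderPersonalNamesPriority": 2,
    "parseConjoinedNames": False,
    "naturalOrderConjoinedPersonalNamesPriority": 3,
    "reverseOrderConjoinedPersonalNamesPriority": 4,
    "parseBusinessNames": False,
    "businessNamesPriority": 5,
}

rule_json = json.dumps(rule)
# Round-trip to confirm the string is valid JSON before using it in Hive.
assert json.loads(rule_json) == rule
print("set hivevar:rule='%s';" % rule_json)
```

json.dumps also takes care of rendering Python's False as JSON's false, which is a common hand-editing mistake.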
- Specify the header fields of the input table in comma-separated format, and assign them to a variable or configuration property:
  set hivevar:header='inputrecordid,Name,nametype';
- To run the job and display the job output on the console, write the query as shown in this example:
  select adTable.adid["Name"], adTable.adid["NameScore"], adTable.adid["CultureCode"] from (select opennameparser(${hivevar:rule}, ${hivevar:refereceDataDetails}, ${hivevar:header}, inputrecordid, name, nametype) as tmp1 from nameparser) as tmp LATERAL VIEW explode(tmp1) adTable AS adid;
- To run the job and dump the job output into a designated file, write the query as shown in this example:
  INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/opennameparser/' row format delimited FIELDS TERMINATED BY ',' lines terminated by '\n' STORED AS TEXTFILE select adTable.adid["Name"], adTable.adid["NameScore"], adTable.adid["CultureCode"] from (select opennameparser(${hivevar:rule}, ${hivevar:refereceDataDetails}, ${hivevar:header}, inputrecordid, name, nametype) as tmp1 from nameparser) as tmp LATERAL VIEW explode(tmp1) adTable AS adid;
  Note: Use the alias defined earlier for the UDF.
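As noted at the start, the individual steps can also be compiled into a single HQL file and run in one go (for example with hive -f). A rough Python sketch of assembling such a file; the JAR path, the abbreviated rule value, and the output file name are placeholders to be replaced with real values:

```python
# Assemble the steps above into one HQL script that can be run with:
#   hive -f opennameparser.hql
steps = [
    "ADD JAR /path/to/unm.hive.jar;",  # placeholder JAR path
    "CREATE TEMPORARY FUNCTION opennameparser as "
    "'com.pb.bdq.unm.process.hive.opennameparser.OpenNameParserUDF';",
    # Replace '...' with the full rule JSON from the configuration step.
    "set hivevar:rule='...';",
    "set hivevar:refereceDataDetails='{\"referenceDataPathLocation\":"
    "\"LocaltoDataNodes\",\"dataDir\":\"/home/data/referenceData\"}';",
    "set hivevar:header='inputrecordid,Name,nametype';",
    # Final query: the console-output variant from the example above.
    'select adTable.adid["Name"], adTable.adid["NameScore"], '
    'adTable.adid["CultureCode"] from (select opennameparser('
    "${hivevar:rule}, ${hivevar:refereceDataDetails}, ${hivevar:header}, "
    "inputrecordid, name, nametype) as tmp1 from nameparser) as tmp "
    "LATERAL VIEW explode(tmp1) adTable AS adid;",
]

script = "\n".join(steps)
with open("opennameparser.hql", "w") as f:
    f.write(script)
```

Each list entry is one complete, semicolon-terminated HQL statement, so the statements run sequentially in a single Hive session, exactly as if the steps above were typed one by one.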