Using a Hive UDF
To run each Hive UDF job, you can either run these steps individually on your Hive client within a single session, or create an HQL file compiling all the required steps sequentially and run it in one go.
- In your Hive client, log in to the required Hive database.
- Register the JAR file of the particular Spectrum™ Data & Address Quality for Big Data SDK Module to which the desired Data Quality Hive UDF belongs.
-
In case of the Validate Address UDF, to set the path of the COBOL libraries,
set the environment variable
LD_LIBRARY_PATH
as:set mapreduce.admin.user.env = LD_LIBRARY_PATH=/home/hduser/~/runtime/lib: /home/hduser/~/runtime/bin:/home/hduser/~/server/modules/universaladdress/lib, ACU_RUNCBL_JNI_ONLOAD_DISABLE=1, G1RTS=/home/hduser/~/ ;
- In case of the Validate Address Global UDF, add the file libAddressDoctor5.so file as well.
-
In case of the Validate Address Loqate UDF, add these required files to the
distributed cache.
- loqate-core.car
- LoqateVerificationLevel.csv
- Loqate.csv
- countryTables.csv
- countryNameTables.csv
-
Create an alias for the Hive UDF of the Data Quality job you wish to run.
For example:
CREATE TEMPORARY FUNCTION matchkeygenerator as 'com.pb.bdq.amm.process.hive.matchkeygenerator.MatchKeyGeneratorUDF';
-
Specify the reference data path.
- If the reference data is on HDFS, add the reference data and set
reference directory as shown in this example.
If the reference data is in file format:
hdfs://<HOST>:<PORT>/home/hduser/Refdata/; set hivevar:refdir='./Refdata';
If the reference data is in archive format:
hdfs://<HOST>:<PORT>/home/hduser/ref.zip; set hivevar:refdir='./ref.zip';
- If the reference data is on local path, ensure data is present on each
node of the cluster on the same path.
Set the reference directory as shown:
set hivevar:refdir='/home/hadoop/reference/';
- If the reference data is on HDFS, add the reference data and set
reference directory as shown in this example.
-
Specify the configurations, such as the match rule, sort field, express match
column, and other details for the job and assign to respective variable or
configuration properties.
Note: The rule must be in JSON format.
For example
set rule='{"matchKeys":[{"expressMatchKey":false, "matchKeyField":"MatchKey1", "rules":[{"algorithm":"Soundex"," field":"businessname", "startPosition":1, "length":0,"active":true, "sortInput":null, "removeNoiseCharacters":false}]}, {"expressMatchKey":false, "matchKeyField":"MatchKey2", "rules":[{"algorithm":"Koeln", "field":"businessname", "startPosition":1, "length":0, "active":true, "sortInput":null, "removeNoiseCharacters":false}]}]}';
Note: Use the configuration properties in the respective job configurations. For example,pb.bdq.match.rule
,pb.bdq.match.express.column
, andpb.bdq.consolidation.sort.field
, wherever indicated in the respective sample HQL files. -
Specify the header fields of the input table, in comma-separated format, and
assign to a variable or configuration property.
set pb.bdq.match.header='businessname,recordid';
Note: Use the configuration property where indicated in the HQL file. For example,pb.bdq.match.header
andpb.bdq.consolidation.header
. -
Switch off the aggregatiion of data between Reducer and Mapper, by seting the
Hive.Map.Aggr
environment variable configuration to false, as indicated in this example:set hive.map.aggr = false;
Note: This configuration is required for all UDFs. -
Set the general configurations for running the job as indicated in this
example:
set pb.bdq.uam.universaladdress.general.configuration = {"dFileType":"SPLIT", "dMemoryModel":"MEDIUM", "lacsLinkMemoryModel":"MEDIUM", "suiteLinkMemoryModel":"MEDIUM"};
Note: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the input configurations for running the job as indicated in this
example:
set pb.bdq.uam.universaladdress.input.configuration = {"outputStandardAddress":true, "outputPostalData":false, "outputParsedInput":false, "outputAddressBlocks":true, "performUSProcessing":true, "performCanadianProcessing": false, "performInternationalProcessing":false, "outputFormattedOnFail":false, "outputCasing":"MIXED", "outputPostalCodeSeparator":true, "outputMultinationalCharacters": false, "performDPV":false, "performRDI":false, "performESM":false, "performASM":false, "performEWS":false, "performLACSLink":false, "performLOT":false, "failOnCMRAMatch":false, "extractFirm":false, "extractUrb":false, "outputReport3553":false, "outputReportSERP":false, "outputReportSummary":true, "outputCASSDetail":false, "outputFieldLevelReturnCodes":false, "keepMultimatch":false, "maximumResults":10, "standardAddressFormat": "STANDARD_ADDRESS_FORMAT_COMBINED_UNIT", "standardAddressPMBLine": "STANDARD_ADDRESS_PMB_LINE_NONE", "cityNameFormat":"CITY_FORMAT_STANDARD", "vanityCityFormatLong":true, "outputCountryFormat":"ENGLISH", "homeCountry": "United States", "streetMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "firmMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "directionalMatchingStrictness": "MATCHING_STRICTNESS_MEDIUM", "dualAddressLogic":"DUAL_NORMAL", "dpvSuccessfulStatusCondition":"A", "reportListFileName":"", "reportlistProcessorName":"", "reportlistNumber":1, "reportMailerAddress":"", "reportMailerName":"", "reportMailerCityLine":"", "canReportMailerCPCNumber":"", "canReportMailerAddress":"", "canReportMailerName":"", "canReportMailerCityLine":"", "internationalCityStreetSearching" :100, "addressLineSearchOnFail":true, "outputStreetAlias":true, "outputVeriMoveBlock":false, "dpvDetermineNoStat":false, "dpvDetermineVacancy":false, "outputAbbreviatedAlias":false, "outputPreferredAlias":false, "outputPreferredCity":"CITY_OVERRIDE_NAME_ZIP4", "performSuiteLink":false, "suppressZplusPhantomCarrierR777":false, "canStandardAddressFormat" :"D", "canEnglishApartmentLabel":"APT", "canFrenchApartmentLabel":"APP", "canFrenchFormat": "C", "canOutputCityFormat":"D", "canOutputCityAlias":true, "canDualAddressLogic":"D", "canPreferHouseNum":false, "canSSLVRFLG":false, "canRuralRouteFormat":"A", "canNonCivicFormat":"A", "canDeliveryOfficeFormat":"I", "canEnableSERP":false, "canSwitchManagedPostalCodeConfidence":false, "stats":null, "counts":null, "z3seg":null, "serpStats":null, "dpvSeedList":null, "lacsSeedList":null, "zipInputSet":null, "reportName" :null, "currentUser":null, "jobName":null, "jobId":null, "jobRequest":false, "properties": {"DPVDetermineVacancy":"N", "DualAddressLogic":"N", "ExtractUrb":"N", "CanFrenchFormat" :"C", "AddressLineSearchOnFail":"Y", "OutputFieldLevelReturnCodes":"N", "OutputFormattedOnFail":"N", "OutputStreetNameAlias":"Y", "OutputReportSERP":"N", "OutputAddressBlocks":"Y", "ExtractFirm":"N", "CanEnglishApartmentLabel":"APT", "OutputPreferredCity":"Z", "FirmMatchingStrictness":"M", "CanFrenchApartmentLabel":"APP", "KeepMultimatch":"N", "StandardAddressPMBLine":"N", "PerformSuiteLink":"N", "CanStandardAddressFormat":"D", "DPVSuccessfulStatusCondition":"A", "PerformLACSLink":"N", "PerformUSProcessing":"Y", "PerformEWS":"N", "StandardAddressFormat":"C", "SuppressZplusPhantomCarrierR777":"N", "HomeCountry":"United States", "ReportMailerAddress":"", "OutputReport3553":"N", "OutputVeriMoveDataBlock":"N", "CanDeliveryOfficeFormat":"I", "OutputAbbreviatedAlias":"N", "PerformCanadianProcessing": "N", "PerformDPV":"N", "PerformInternationalProcessing":"N", "CanSSLVRFlg":"N", "StreetMatchingStrictness":"M", "InternationalCityStreetSearching":"100", "canSwitchManagedPostalCodeConfidence":"N", "CanDualAddressLogic":"D", "PerformASM": "N", "OutputCasing":"M", "ReportListFileName":"", "CanReportMailerAddress":"", "ReportMailerCityLine":"", "CanReportMailerCPCNumber":"", "ReportListProcessorName":"", "CanOutputCityAlias":"Y", "DirectionalMatchingStrictness":"M", "CanRuralRouteFormat": "A", "CanOutputCityFormat":"D", "ReportListNumber":"1", "CanReportMailerCityLine":"", "OutputMultinationalCharacters":"N", "EnableSERP":"N", "CanNonCivicFormat":"A", "OutputShortCityName":"S", "OutputPostalCodeSeparator":"Y", "FailOnCMRAMatch":"N", "PerformLOT":"N", "OutputCountryFormat":"E", "CanPreferHouseNum":"N", "CanReportMailerName":"", "PerformRDI":"N", "ReportMailerName":"", "PerformESM":"N", "OutputReportSummary":"Y", "OutputVanityCityFormatLong":"Y", "OutputPreferredAlias":"N", "DPVDetermineNoStat":"N", "MaximumResults":"10"}}};
Note: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the engine configurations for running the job as indicated in this
example:
set pb.bdq.uam.universaladdress.engine.configurations = { "referenceData":{ "dataDir":"/home/hduser/resources/uam/universaladdress/UAM_universaladdress4.0_Feb15/", "referenceDataPathLocation":"LocaltoDataNodes"}, "cobolRuntimePath":"/home/hduser/addressquality/", "modulesDir":"/home/hduser/tapan/addressquality/modules", "dpvDbPath":null, "suiteLinkDBPath":null, "ewsDBPath":null, "rdiDBPath":null, "lacsDBPath":null};
Note: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the process type to indicate the desired validation level. We currently
support address validation only.
For example, in the Validate Address job, set the process type as below:
set pb.bdq.uam.universaladdress.process.type=VALIDATE;
Note: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs. -
To run the job and display the job output on the console, write the query as
indicated in this example:
To run the job and dump the job output in a designated file, write the query as indicated in the below example:SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM ( SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/MatchKey/' row format delimited FIELDS TERMINATED BY ',' MAP FIELDS TERMINATED BY ':' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM ( SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
Note: Ensure to use the alias defined earlier for the UDF.
Important: For all UDAF jobs, use the respective configuration properties as
variables while defining the input parameters, where indicated in the respective
sample HQL files.
For example, pb.bdq.match.rule
,
pb.bdq.match.express.column
and
pb.bdq.consolidation.sort.field
.