Cómo usar una UDF de Hive
Para ejecutar cada trabajo basado en UDF, puede ejecutar los siguientes pasos individualmente en su cliente de Hive en una única sesión, o bien puede crear un archivo HQL que agrupe todos los pasos requeridos de forma secuencial y ejecutarlo de una sola vez.
- En su cliente de Hive, inicie sesión en la base de datos de Hive obligatoria.
- Registre el archivo JAR del módulo Big Data Quality SDK específico al que pertenece la UDF de Hive de Calidad de los datos.
-
In case of the Validate Address UDAF, to set the path of the COBOL libraries, set the environment variable
LD_LIBRARY_PATH
as below:set mapreduce.admin.user.env = LD_LIBRARY_PATH=/home/hduser/~/runtime/lib: /home/hduser/~/runtime/bin:/home/hduser/~/server/modules/universaladdress/lib, ACU_RUNCBL_JNI_ONLOAD_DISABLE=1, G1RTS=/home/hduser/~/ ;
- In case of the Validate Address Global UDAF, add the file libAddressDoctor5.so file as well.
-
En el caso de UDAF de Validate Address Loqate, agregue estos campos obligatorios a la memoria caché distribuida.
- loqate-core.car
- LoqateVerificationLevel.csv
- Loqate.csv
- countryTables.csv
- countryNameTables.csv
-
Cree un alias de la UDF de Hive del trabajo de calidad de datos que desea ejecutar.
Por ejemplo:
CREATE TEMPORARY FUNCTION matchkeygenerator as 'com.pb.bdq.amm.process.hive.matchkeygenerator.MatchKeyGeneratorUDF';
-
Especifique las configuraciones, como la regla de cruce, el orden de campos, la columna de cruce inmediato y otros detalles del trabajo, y asígnelas a la variable o las propiedades de configuración correspondientes.
Nota: La regla debe tener el formato JSON.
Por ejemplo:
set rule='{"matchKeys":[{"expressMatchKey":false, "matchKeyField":"MatchKey1", "rules":[{"algorithm":"Soundex"," field":"businessname", "startPosition":1, "length":0,"active":true, "sortInput":null, "removeNoiseCharacters":false}]}, {"expressMatchKey":false, "matchKeyField":"MatchKey2", "rules":[{"algorithm":"Koeln", "field":"businessname", "startPosition":1, "length":0, "active":true, "sortInput":null, "removeNoiseCharacters":false}]}]}';
Nota: Asegúrese de usar las propiedades de configuración respectivas para el trabajo. Por ejemplo,pb.bdq.match.rule
,pb.bdq.match.express.column
,pb.bdq.consolidation.sort.field
, etc., como se indica en los archivos HQL de muestra correspondientes. -
Especifique los campos del encabezado de los datos de entrada, en formato separado por comas, y asígnelos a una variable o propiedad de configuración.
set pb.bdq.match.header='businessname,recordid';
Nota: Asegúrese de usar la propiedad de configuración donde sea necesario. Por ejemplo,pb.bdq.match.header
,<codeph>pb.bdq.match.header</codeph>,pb.bdq.consolidation.header
, etc., como se indica en los archivos HQL de muestra correspondientes. -
Switch off the aggregatiion of data between Reducer and Mapper, by seting the
Hive.Map.Aggr
environment variable configuration to false, as indicated in the below example:set hive.map.aggr = false;
Nota: This configuration is required for all UDAFs. -
Set the general configurations for running the job as indicated in the below example:
set pb.bdq.uam.universaladdress.general.configuration = {"dFileType":"SPLIT", "dMemoryModel":"MEDIUM", "lacsLinkMemoryModel":"MEDIUM", "suiteLinkMemoryModel":"MEDIUM"};
Nota: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the input configurations for running the job as indicated in the below example:
set pb.bdq.uam.universaladdress.input.configuration = {"outputStandardAddress":true, "outputPostalData":false, "outputParsedInput":false, "outputAddressBlocks":true, "performUSProcessing":true, "performCanadianProcessing":false, "performInternationalProcessing":false, "outputFormattedOnFail":false, "outputCasing":"MIXED", "outputPostalCodeSeparator":true, "outputMultinationalCharacters":false, "performDPV":false, "performRDI":false, "performESM":false, "performASM":false, "performEWS":false, "performLACSLink":false, "performLOT":false, "failOnCMRAMatch":false, "extractFirm":false, "extractUrb":false, "outputReport3553":false, "outputReportSERP":false, "outputReportSummary":true, "outputCASSDetail":false, "outputFieldLevelReturnCodes":false, "keepMultimatch":false, "maximumResults":10, "standardAddressFormat":"STANDARD_ADDRESS_FORMAT_COMBINED_UNIT", "standardAddressPMBLine":"STANDARD_ADDRESS_PMB_LINE_NONE", "cityNameFormat":"CITY_FORMAT_STANDARD", "vanityCityFormatLong":true, "outputCountryFormat":"ENGLISH", "homeCountry":"United States", "streetMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "firmMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "directionalMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "dualAddressLogic":"DUAL_NORMAL", "dpvSuccessfulStatusCondition":"A", "reportListFileName":"", "reportlistProcessorName":"", "reportlistNumber":1, "reportMailerAddress":"", "reportMailerName":"", "reportMailerCityLine":"", "canReportMailerCPCNumber":"", "canReportMailerAddress":"", "canReportMailerName":"", "canReportMailerCityLine":"", "internationalCityStreetSearching":100, "addressLineSearchOnFail":true, "outputStreetAlias":true, "outputVeriMoveBlock":false, "dpvDetermineNoStat":false, "dpvDetermineVacancy":false, "outputAbbreviatedAlias":false, "outputPreferredAlias":false, "outputPreferredCity":"CITY_OVERRIDE_NAME_ZIP4", "performSuiteLink":false, "suppressZplusPhantomCarrierR777":false, "canStandardAddressFormat":"D", "canEnglishApartmentLabel":"APT", "canFrenchApartmentLabel":"APP", "canFrenchFormat":"C", "canOutputCityFormat":"D", "canOutputCityAlias":true, "canDualAddressLogic":"D", "canPreferHouseNum":false, "canSSLVRFLG":false, "canRuralRouteFormat":"A", "canNonCivicFormat":"A", "canDeliveryOfficeFormat":"I", "canEnableSERP":false, "canSwitchManagedPostalCodeConfidence":false, "stats":null, "counts":null, "z3seg":null, "serpStats":null, "dpvSeedList":null, "lacsSeedList":null, "zipInputSet":null, "reportName":null, "currentUser":null, "jobName":null, "jobId":null, "jobRequest":false, "properties":{"DPVDetermineVacancy":"N", "DualAddressLogic":"N", "ExtractUrb":"N", "CanFrenchFormat":"C", "AddressLineSearchOnFail":"Y", "OutputFieldLevelReturnCodes":"N", "OutputFormattedOnFail":"N", "OutputStreetNameAlias":"Y", "OutputReportSERP":"N", "OutputAddressBlocks":"Y", "ExtractFirm":"N", "CanEnglishApartmentLabel":"APT", "OutputPreferredCity":"Z", "FirmMatchingStrictness":"M", "CanFrenchApartmentLabel":"APP", "KeepMultimatch":"N", "StandardAddressPMBLine":"N", "PerformSuiteLink":"N", "CanStandardAddressFormat":"D", "DPVSuccessfulStatusCondition":"A", "PerformLACSLink":"N", "PerformUSProcessing":"Y", "PerformEWS":"N", "StandardAddressFormat":"C", "SuppressZplusPhantomCarrierR777":"N", "HomeCountry":"United States", "ReportMailerAddress":"", "OutputReport3553":"N", "OutputVeriMoveDataBlock":"N", "CanDeliveryOfficeFormat":"I", "OutputAbbreviatedAlias":"N", "PerformCanadianProcessing":"N", "PerformDPV":"N", "PerformInternationalProcessing":"N", "CanSSLVRFlg":"N", "StreetMatchingStrictness":"M", "InternationalCityStreetSearching":"100", "canSwitchManagedPostalCodeConfidence":"N", "CanDualAddressLogic":"D", "PerformASM":"N", "OutputCasing":"M", "ReportListFileName":"", "CanReportMailerAddress":"", "ReportMailerCityLine":"", "CanReportMailerCPCNumber":"", "ReportListProcessorName":"", "CanOutputCityAlias":"Y", "DirectionalMatchingStrictness":"M", "CanRuralRouteFormat":"A", "CanOutputCityFormat":"D", "ReportListNumber":"1", "CanReportMailerCityLine":"", "OutputMultinationalCharacters":"N", "EnableSERP":"N", "CanNonCivicFormat":"A", "OutputShortCityName":"S", "OutputPostalCodeSeparator":"Y", "FailOnCMRAMatch":"N", "PerformLOT":"N", "OutputCountryFormat":"E", "CanPreferHouseNum":"N", "CanReportMailerName":"", "PerformRDI":"N", "ReportMailerName":"", "PerformESM":"N", "OutputReportSummary":"Y", "OutputVanityCityFormatLong":"Y", "OutputPreferredAlias":"N", "DPVDetermineNoStat":"N", "MaximumResults":"10"}}};
Nota: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the engine configurations for running the job as indicated in the below example:
set pb.bdq.uam.universaladdress.engine.configurations = { "referenceData":{ "dataDir":"/home/hduser/resources/uam/universaladdress/UAM_universaladdress4.0_Feb15/", "referenceDataPathLocation":"LocaltoDataNodes"}, "cobolRuntimePath":"/home/hduser/tapan/addressquality/", "modulesDir":"/home/hduser/tapan/addressquality/modules", "dpvDbPath":null, "suiteLinkDBPath":null, "ewsDBPath":null, "rdiDBPath":null, "lacsDBPath":null};
Nota: This configuration is required only for Universal Addressing Module Hive UDAFs. -
Set the process type to indicate the desired validation level. We currently support address validation only.
For example, in the Validate Address job, set the process type as below:
set pb.bdq.uam.universaladdress.process.type=VALIDATE;
Nota: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs. -
Para hacer el trabajo y ver los resultados en la consola, escriba la consulta como se indica en el siguiente ejemplo:
Para hacer el trabajo y volcar los resultados en un archivo designado, escriba la consulta como se indica en el siguiente ejemplo:SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM ( SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/MatchKey/' row format delimited FIELDS TERMINATED BY ',' MAP FIELDS TERMINATED BY ':' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM ( SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
Nota: Asegúrese de usar el alias definido antes para la UDF.
Importante: For all UDAF jobs, use the respective configuration properties as variables while defining the input parameters, where indicated in the respective sample HQL files.
Por ejemplo, pb.bdq.match.rule
,pb.bdq.match.express.column
,pb.bdq.consolidation.sort.field
, etc.