Cómo usar una UDF de Hive

Para ejecutar cada trabajo basado en UDF, puede ejecutar los siguientes pasos individualmente en su cliente de Hive en una única sesión, o bien puede crear un archivo HQL que agrupe todos los pasos requeridos de forma secuencial y ejecutarlo de una sola vez.

  1. En su cliente de Hive, inicie sesión en la base de datos de Hive obligatoria.
  2. Registre el archivo JAR del módulo Big Data Quality SDK específico al que pertenece la UDF de Hive de Calidad de los datos.
  3. In case of the Validate Address UDAF, to set the path of the COBOL libraries, set the environment variable LD_LIBRARY_PATH as below:
    set mapreduce.admin.user.env = LD_LIBRARY_PATH=/home/hduser/~/runtime/lib:
    /home/hduser/~/runtime/bin:/home/hduser/~/server/modules/universaladdress/lib, ACU_RUNCBL_JNI_ONLOAD_DISABLE=1, G1RTS=/home/hduser/~/ ;
  4. In case of the Validate Address Global UDAF, add the file libAddressDoctor5.so file as well.
  5. En el caso de UDAF de Validate Address Loqate, agregue estos campos obligatorios a la memoria caché distribuida.
    • loqate-core.car
    • LoqateVerificationLevel.csv
    • Loqate.csv
    • countryTables.csv
    • countryNameTables.csv
  6. Cree un alias de la UDF de Hive del trabajo de calidad de datos que desea ejecutar.
    Por ejemplo:
    CREATE TEMPORARY FUNCTION matchkeygenerator as 'com.pb.bdq.amm.process.hive.matchkeygenerator.MatchKeyGeneratorUDF';
  7. Especifique las configuraciones, como la regla de cruce, el orden de campos, la columna de cruce inmediato y otros detalles del trabajo, y asígnelas a la variable o las propiedades de configuración correspondientes.
    Nota: La regla debe tener el formato JSON.

    Por ejemplo:

    set rule='{"matchKeys":[{"expressMatchKey":false, "matchKeyField":"MatchKey1",
    "rules":[{"algorithm":"Soundex"," field":"businessname", "startPosition":1, "length":0,"active":true, "sortInput":null, "removeNoiseCharacters":false}]}, 
    {"expressMatchKey":false, "matchKeyField":"MatchKey2", "rules":[{"algorithm":"Koeln", "field":"businessname", "startPosition":1, "length":0, "active":true, "sortInput":null, "removeNoiseCharacters":false}]}]}';
    Nota: Asegúrese de usar las propiedades de configuración respectivas para el trabajo. Por ejemplo, pb.bdq.match.rule,pb.bdq.match.express.column,pb.bdq.consolidation.sort.field, etc., como se indica en los archivos HQL de muestra correspondientes.
  8. Especifique los campos del encabezado de los datos de entrada, en formato separado por comas, y asígnelos a una variable o propiedad de configuración.
    set pb.bdq.match.header='businessname,recordid';
    Nota: Asegúrese de usar la propiedad de configuración donde sea necesario. Por ejemplo, pb.bdq.match.header,<codeph>pb.bdq.match.header</codeph>,pb.bdq.consolidation.header, etc., como se indica en los archivos HQL de muestra correspondientes.
  9. Switch off the aggregatiion of data between Reducer and Mapper, by seting theHive.Map.Aggr environment variable configuration to false, as indicated in the below example:
    set hive.map.aggr = false;
    Nota: This configuration is required for all UDAFs.
  10. Set the general configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.general.configuration = {"dFileType":"SPLIT", "dMemoryModel":"MEDIUM", "lacsLinkMemoryModel":"MEDIUM", "suiteLinkMemoryModel":"MEDIUM"};
    Nota: This configuration is required only for Universal Addressing Module Hive UDAFs.
  11. Set the input configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.input.configuration = {"outputStandardAddress":true, "outputPostalData":false, "outputParsedInput":false, "outputAddressBlocks":true, "performUSProcessing":true, "performCanadianProcessing":false, "performInternationalProcessing":false, "outputFormattedOnFail":false, "outputCasing":"MIXED", "outputPostalCodeSeparator":true, "outputMultinationalCharacters":false, "performDPV":false, "performRDI":false, "performESM":false, "performASM":false, "performEWS":false, "performLACSLink":false, "performLOT":false, "failOnCMRAMatch":false, "extractFirm":false, "extractUrb":false, "outputReport3553":false, "outputReportSERP":false, "outputReportSummary":true, "outputCASSDetail":false, "outputFieldLevelReturnCodes":false, "keepMultimatch":false, "maximumResults":10, "standardAddressFormat":"STANDARD_ADDRESS_FORMAT_COMBINED_UNIT", "standardAddressPMBLine":"STANDARD_ADDRESS_PMB_LINE_NONE", "cityNameFormat":"CITY_FORMAT_STANDARD", "vanityCityFormatLong":true, "outputCountryFormat":"ENGLISH", "homeCountry":"United States", "streetMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "firmMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "directionalMatchingStrictness":"MATCHING_STRICTNESS_MEDIUM", "dualAddressLogic":"DUAL_NORMAL", "dpvSuccessfulStatusCondition":"A", "reportListFileName":"", "reportlistProcessorName":"", "reportlistNumber":1, "reportMailerAddress":"", "reportMailerName":"", "reportMailerCityLine":"", "canReportMailerCPCNumber":"", "canReportMailerAddress":"", "canReportMailerName":"", "canReportMailerCityLine":"", "internationalCityStreetSearching":100, "addressLineSearchOnFail":true, "outputStreetAlias":true, "outputVeriMoveBlock":false, "dpvDetermineNoStat":false, "dpvDetermineVacancy":false, "outputAbbreviatedAlias":false, "outputPreferredAlias":false, "outputPreferredCity":"CITY_OVERRIDE_NAME_ZIP4", "performSuiteLink":false, "suppressZplusPhantomCarrierR777":false, "canStandardAddressFormat":"D", "canEnglishApartmentLabel":"APT", "canFrenchApartmentLabel":"APP", "canFrenchFormat":"C", "canOutputCityFormat":"D", "canOutputCityAlias":true, "canDualAddressLogic":"D", "canPreferHouseNum":false, "canSSLVRFLG":false, "canRuralRouteFormat":"A", "canNonCivicFormat":"A", "canDeliveryOfficeFormat":"I", "canEnableSERP":false, "canSwitchManagedPostalCodeConfidence":false, "stats":null, "counts":null, "z3seg":null, "serpStats":null, "dpvSeedList":null, "lacsSeedList":null, "zipInputSet":null, "reportName":null, "currentUser":null, "jobName":null, "jobId":null, "jobRequest":false, "properties":{"DPVDetermineVacancy":"N", "DualAddressLogic":"N", "ExtractUrb":"N", "CanFrenchFormat":"C", "AddressLineSearchOnFail":"Y", "OutputFieldLevelReturnCodes":"N", "OutputFormattedOnFail":"N", "OutputStreetNameAlias":"Y", "OutputReportSERP":"N", "OutputAddressBlocks":"Y", "ExtractFirm":"N", "CanEnglishApartmentLabel":"APT", "OutputPreferredCity":"Z", "FirmMatchingStrictness":"M", "CanFrenchApartmentLabel":"APP", "KeepMultimatch":"N", "StandardAddressPMBLine":"N", "PerformSuiteLink":"N", "CanStandardAddressFormat":"D", "DPVSuccessfulStatusCondition":"A", "PerformLACSLink":"N", "PerformUSProcessing":"Y", "PerformEWS":"N", "StandardAddressFormat":"C", "SuppressZplusPhantomCarrierR777":"N", "HomeCountry":"United States", "ReportMailerAddress":"", "OutputReport3553":"N", "OutputVeriMoveDataBlock":"N", "CanDeliveryOfficeFormat":"I", "OutputAbbreviatedAlias":"N", "PerformCanadianProcessing":"N", "PerformDPV":"N", "PerformInternationalProcessing":"N", "CanSSLVRFlg":"N", "StreetMatchingStrictness":"M", "InternationalCityStreetSearching":"100", "canSwitchManagedPostalCodeConfidence":"N", "CanDualAddressLogic":"D", "PerformASM":"N", "OutputCasing":"M", "ReportListFileName":"", "CanReportMailerAddress":"", "ReportMailerCityLine":"", "CanReportMailerCPCNumber":"", "ReportListProcessorName":"", "CanOutputCityAlias":"Y", "DirectionalMatchingStrictness":"M", "CanRuralRouteFormat":"A", "CanOutputCityFormat":"D", "ReportListNumber":"1", "CanReportMailerCityLine":"", "OutputMultinationalCharacters":"N", "EnableSERP":"N", "CanNonCivicFormat":"A", "OutputShortCityName":"S", "OutputPostalCodeSeparator":"Y", "FailOnCMRAMatch":"N", "PerformLOT":"N", "OutputCountryFormat":"E", "CanPreferHouseNum":"N", "CanReportMailerName":"", "PerformRDI":"N", "ReportMailerName":"", "PerformESM":"N", "OutputReportSummary":"Y", "OutputVanityCityFormatLong":"Y", "OutputPreferredAlias":"N", "DPVDetermineNoStat":"N", "MaximumResults":"10"}}};
    Nota: This configuration is required only for Universal Addressing Module Hive UDAFs.
  12. Set the engine configurations for running the job as indicated in the below example:
    set pb.bdq.uam.universaladdress.engine.configurations = { "referenceData":{ "dataDir":"/home/hduser/resources/uam/universaladdress/UAM_universaladdress4.0_Feb15/", "referenceDataPathLocation":"LocaltoDataNodes"}, "cobolRuntimePath":"/home/hduser/tapan/addressquality/", "modulesDir":"/home/hduser/tapan/addressquality/modules", "dpvDbPath":null, "suiteLinkDBPath":null, "ewsDBPath":null, "rdiDBPath":null, "lacsDBPath":null};
    Nota: This configuration is required only for Universal Addressing Module Hive UDAFs.
  13. Set the process type to indicate the desired validation level. We currently support address validation only.
    For example, in the Validate Address job, set the process type as below:
    set pb.bdq.uam.universaladdress.process.type=VALIDATE; 
    Nota: This configuration is required only for the Validate Address and Validate Address Loqate Hive UDAFs.
  14. Para hacer el trabajo y ver los resultados en la consola, escriba la consulta como se indica en el siguiente ejemplo:
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    Para hacer el trabajo y volcar los resultados en un archivo designado, escriba la consulta como se indica en el siguiente ejemplo:
    INSERT OVERWRITE LOCAL DIRECTORY '/home/hadoop/MatchKey/' row format delimited FIELDS TERMINATED BY ',' MAP FIELDS TERMINATED BY ':' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE 
    SELECT businessname, recordid, bar.ret["MatchKey1"] AS MatchKey1, bar.ret["MatchKey2"] AS MatchKey2 FROM (
    SELECT *, matchkeygenerator (${hiveconf:rule}, ${hiveconf:header}, businessname, recordid) AS ret FROM cust ) bar;
    Nota: Asegúrese de usar el alias definido antes para la UDF.
Importante: For all UDAF jobs, use the respective configuration properties as variables while defining the input parameters, where indicated in the respective sample HQL files.

Por ejemplo, pb.bdq.match.rule,pb.bdq.match.express.column,pb.bdq.consolidation.sort.field, etc.