hive中get_json_object函数不支持解析json中文key

问题

今天在 Hive 中 get_json_object 函数解析 json 串的时候,发现函数不支持解析 json 中文 key。
例如:

select get_json_object('{ "姓名":"张三" , "年龄":"18" }', '$.姓名');

我们希望的结果是得到姓名对应的值张三,而运行之后的结果为 NULL 值。

select get_json_object('{ "abc姓名":"张三" , "abc":"18" }', '$.abc姓名');

我们希望的结果是得到姓名对应的值张三,而运行之后的结果为 18

产生问题的原因

是什么原因导致的呢?我们查看 Hive 官网中 get_json_object 函数的介绍,可以发现 get_json_object 函数不能解析 json 里面中文的 key,如下图所示:
在这里插入图片描述
json 路径只能包含字符 [0-9a-z_],即不能包含 大写或特殊字符 。此外,键不能以数字开头。

那为什么 json 路径只能包含字符 [0-9a-z_] 呢?

通过查看源码我们发现 get_json_object 对应的 UDF 类的源码如下:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.fasterxml.jackson.core.json.JsonReadFeature;
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;/*** UDFJson.*/
@Description(name = "get_json_object",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  @   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")//定义了一个名为UDFJson的类,继承自UDF类。
public class UDFGetJsonObjectCN extends UDF {//定义一个静态正则表达式模式,用于匹配JSON路径中的键。//匹配英文key:匹配一个或多个大写字母、小写字母、数字、下划线、连字符、冒号或空格。private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");//定义一个静态正则表达式模式,用于匹配JSON路径中的索引。private static final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");//创建一个ObjectMapper对象,用于解析JSON字符串。private static final ObjectMapper objectMapper = new ObjectMapper();//创建一个JavaType对象,用于表示Map类型。private static final JavaType MAP_TYPE = objectMapper.getTypeFactory().constructType(Map.class);//创建一个JavaType对象,用于表示List类型。private static final JavaType LIST_TYPE = objectMapper.getTypeFactory().constructType(List.class);//静态代码块,用于配置ObjectMapper的一些特性。static {// Allows for unescaped ASCII control characters in JSON valuesobjectMapper.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS.mappedFeature());// Enabled to accept quoting of all character backslash qooting mechanismobjectMapper.enable(JsonReadFeature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.mappedFeature());}// An LRU cache using a linked hash map//定义了一个静态内部类HashCache,用作LRU缓存。static class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}//声明了一个名为extractObjectCache的HashMap对象,用于缓存已提取的JSON对象。Map<String, Object> extractObjectCache = new HashCache<String, Object>();//声明了一个名为pathExprCache的HashMap对象,用于缓存已解析的JSON路径表达式。Map<String, String[]> pathExprCache = new HashCache<String, String[]>();//声明了一个名为indexListCache的HashMap对象,用于缓存已解析的JSON路径中的索引列表。Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();//声明了一个名为mKeyGroup1Cache的HashMap对象,用于缓存JSON路径中的键。Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();//声明了一个名为mKeyMatchesCache的HashMap对象,用于缓存JSON路径中的键是否匹配的结果。Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();//构造函数,没有参数。public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*///evaluate方法,用于提取指定路径的JSON对象并返回JSON字符串。public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean unknownType = pathString.equals("$");boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {if (unknownType) {try {extractObject = objectMapper.readValue(jsonString, LIST_TYPE);} catch (Exception e) {// Ignore exception}if (extractObject == null) {try {extractObject = objectMapper.readValue(jsonString, MAP_TYPE);} catch (Exception e) {return null;}}} else {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = objectMapper.readValue(jsonString, javaType);} catch (Exception e) {return null;}}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}Text result = new Text();if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(objectMapper.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}//extract方法,递归地提取JSON对象。private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}//创建一个名为jsonList的AddingList对象,用于存储提取出来的JSON对象。private transient AddingList jsonList = new AddingList();//定义了一个静态内部类AddingList,继承自ArrayList<Object>,用于添加JSON对象到jsonList中。private static class AddingList extends ArrayList<Object> {private static final long serialVersionUID = 1L;@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};//extract_json_withindex方法,根据JSON路径中的索引提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}//extract_json_withkey方法,根据JSON路径中的键提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

代码做了一些注释,我们可以发现 private final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*"); 这个就是匹配 key 的模式串,它的意思是匹配以数字、字母、_、-、:、空格 为开头的字符串。那么这个匹配模式串就决定了,get_json_object 函数无法匹配出 key 中带中文的键值对,即select get_json_object('{ "姓名":"张三" , "年龄":"18" }', '$.姓名'); 结果为 null;而 select get_json_object('{ "abc姓名":"张三" , "abc":"18" }', '$.abc姓名'); 中只能匹配以数字、字母、_、-、:、空格 为开头的字符串,所以将 abc姓名 中的 abc 当作 key 去取 value 值,所以得到的值为18

解决办法

知道问题的原因了,那么我们怎么解决这个问题呢,其实很简单,我们只需要修改代码中匹配 key 的正则表达式就可以了。
其实我们可以将 get_json_object 函数的源码拿出来重新写一个 UDF 函数就可以了。

Hive-2.1.1 版本

需要注意自己 Hive 的版本,我们以 Hive-2.1.1 版本为例,代码如下:

package com.yan.hive.udf;import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParser.Feature;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.map.type.TypeFactory;
import org.codehaus.jackson.type.JavaType;/*** UDFJson.**/
@Description(name = "get_json_object_cn",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  &amp;#064;   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")
public class UDFGetJsonObjectCN extends UDF {//private final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");private final Pattern patternKey = Pattern.compile("^([^\\[\\]]+).*");private final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");private static final JsonFactory JSON_FACTORY = new JsonFactory();static {// Allows for unescaped ASCII control characters in JSON valuesJSON_FACTORY.enable(Feature.ALLOW_UNQUOTED_CONTROL_CHARS);// Enabled to accept quoting of all character backslash qooting mechanismJSON_FACTORY.enable(Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER);}private static final ObjectMapper MAPPER = new ObjectMapper(JSON_FACTORY);private static final JavaType MAP_TYPE = TypeFactory.fromClass(Map.class);private static final JavaType LIST_TYPE = TypeFactory.fromClass(List.class);// An LRU cache using a linked hash mapstatic class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}static Map<String, Object> extractObjectCache = new HashCache<String, Object>();static Map<String, String[]> pathExprCache = new HashCache<String, String[]>();static Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();static Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();static Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();Text result = new Text();public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*/public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = MAPPER.readValue(jsonString, javaType);} catch (Exception e) {return null;}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(MAPPER.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}private transient AddingList jsonList = new AddingList();private static class AddingList extends ArrayList<Object> {@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

需要导入的依赖,要和自己集群的版本契合,Hadoop 的版本及 Hive 的版本。

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.atguigu.hive</groupId><artifactId>hivetest</artifactId><version>1.0-SNAPSHOT</version><properties><hadoop.version>3.0.0</hadoop.version><hive.version>2.1.1</hive.version><jackson.version>1.9.2</jackson.version><guava.version>14.0.1</guava.version></properties><dependencies><dependency><groupId>org.apache.hive</groupId><artifactId>hive-exec</artifactId><version>${hive.version}</version></dependency><!-- https://mvnrepository.com/artifact/org.apache.hive/hive-jdbc --><dependency><groupId>org.apache.hive</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version></dependency><dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>${hadoop.version}</version></dependency><dependency><groupId>com.google.guava</groupId><artifactId>guava</artifactId><version>${guava.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-core-asl</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-mapper-asl</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-jaxrs</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.codehaus.jackson</groupId><artifactId>jackson-xc</artifactId><version>${jackson.version}</version></dependency></dependencies></project>

注意: 因为上述UDF中也用到了 com.google.guavaorg.codehaus.jackson,所以这两个依赖要和 hive 版本中所用的依赖版本一致。

Hive-4.0.0 版本

package com.yan.hive.udf;import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;import com.fasterxml.jackson.core.json.JsonReadFeature;
import com.fasterxml.jackson.databind.JavaType;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.google.common.collect.Iterators;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;/*** @author Yan* @create 2023-08-05 22:21* hive解析json中文key*/
@Description(name = "get_json_object_cn",value = "_FUNC_(json_txt, path) - Extract a json object from path ",extended = "Extract json object from a json string based on json path "+ "specified, and return json string of the extracted json object. It "+ "will return null if the input json string is invalid.\n"+ "A limited version of JSONPath supported:\n"+ "  $   : Root object\n"+ "  .   : Child operator\n"+ "  []  : Subscript operator for array\n"+ "  *   : Wildcard for []\n"+ "Syntax not supported that's worth noticing:\n"+ "  ''  : Zero length string as key\n"+ "  ..  : Recursive descent\n"+ "  &amp;#064;   : Current object/element\n"+ "  ()  : Script expression\n"+ "  ?() : Filter (script) expression.\n"+ "  [,] : Union operator\n"+ "  [start:end:step] : array slice operator\n")//定义了一个名为UDFJson的类,继承自UDF类。
public class UDFGetJsonObjectCN extends UDF {//定义一个静态正则表达式模式,用于匹配JSON路径中的键。//匹配英文key:匹配一个或多个大写字母、小写字母、数字、下划线、连字符、冒号或空格。//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s]+).*");//可以匹配中文,\\p{L}来匹配任意Unicode字母字符,包括中文字符:英文、数字、下划线、连字符、冒号、空格和中文字符。//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s\\p{L}]+).*");//可以匹配中文,\\p{L}来匹配任意Unicode字母字符,包括中文字符,但不包含特殊字符,特殊字符需自己添加//private static final Pattern patternKey = Pattern.compile("^([a-zA-Z0-9_\\-\\:\\s?%*+\\p{L}]+).*");//可以匹配中文,包含特殊字符,但不包含英文下的点(.);还有就是匹配不到路径中的索引了//private static final Pattern patternKey = Pattern.compile("^(.+).*");//可以匹配中文,包含特殊字符,不包中括号"[]",但不包含英文下的点(.);这样就可以匹配路径中的索引了private static final Pattern patternKey = Pattern.compile("^([^\\[\\]]+).*");//定义一个静态正则表达式模式,用于匹配JSON路径中的索引。private static final Pattern patternIndex = Pattern.compile("\\[([0-9]+|\\*)\\]");//创建一个ObjectMapper对象,用于解析JSON字符串。private static final ObjectMapper objectMapper = new ObjectMapper();//创建一个JavaType对象,用于表示Map类型。private static final JavaType MAP_TYPE = objectMapper.getTypeFactory().constructType(Map.class);//创建一个JavaType对象,用于表示List类型。private static final JavaType LIST_TYPE = objectMapper.getTypeFactory().constructType(List.class);//静态代码块,用于配置ObjectMapper的一些特性。static {// Allows for unescaped ASCII control characters in JSON valuesobjectMapper.enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS.mappedFeature());// Enabled to accept quoting of all character backslash qooting mechanismobjectMapper.enable(JsonReadFeature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER.mappedFeature());}// An LRU cache using a linked hash map//定义了一个静态内部类HashCache,用作LRU缓存。static class HashCache<K, V> extends LinkedHashMap<K, V> {private static final int CACHE_SIZE = 16;private static final int INIT_SIZE = 32;private static final float LOAD_FACTOR = 0.6f;HashCache() {super(INIT_SIZE, LOAD_FACTOR);}private static final long serialVersionUID = 1;@Overrideprotected boolean removeEldestEntry(Map.Entry<K, V> eldest) {return size() > CACHE_SIZE;}}//声明了一个名为extractObjectCache的HashMap对象,用于缓存已提取的JSON对象。Map<String, Object> extractObjectCache = new HashCache<String, Object>();//声明了一个名为pathExprCache的HashMap对象,用于缓存已解析的JSON路径表达式。Map<String, String[]> pathExprCache = new HashCache<String, String[]>();//声明了一个名为indexListCache的HashMap对象,用于缓存已解析的JSON路径中的索引列表。Map<String, ArrayList<String>> indexListCache =new HashCache<String, ArrayList<String>>();//声明了一个名为mKeyGroup1Cache的HashMap对象,用于缓存JSON路径中的键。Map<String, String> mKeyGroup1Cache = new HashCache<String, String>();//声明了一个名为mKeyMatchesCache的HashMap对象,用于缓存JSON路径中的键是否匹配的结果。Map<String, Boolean> mKeyMatchesCache = new HashCache<String, Boolean>();//构造函数,没有参数。public UDFGetJsonObjectCN() {}/*** Extract json object from a json string based on json path specified, and* return json string of the extracted json object. It will return null if the* input json string is invalid.** A limited version of JSONPath supported: $ : Root object . : Child operator* [] : Subscript operator for array * : Wildcard for []** Syntax not supported that's worth noticing: '' : Zero length string as key* .. : Recursive descent &amp;#064; : Current object/element () : Script* expression ?() : Filter (script) expression. [,] : Union operator* [start:end:step] : array slice operator** @param jsonString*          the json string.* @param pathString*          the json path expression.* @return json string or null when an error happens.*///evaluate方法,用于提取指定路径的JSON对象并返回JSON字符串。public Text evaluate(String jsonString, String pathString) {if (jsonString == null || jsonString.isEmpty() || pathString == null|| pathString.isEmpty() || pathString.charAt(0) != '$') {return null;}int pathExprStart = 1;boolean unknownType = pathString.equals("$");boolean isRootArray = false;if (pathString.length() > 1) {if (pathString.charAt(1) == '[') {pathExprStart = 0;isRootArray = true;} else if (pathString.charAt(1) == '.') {isRootArray = pathString.length() > 2 && pathString.charAt(2) == '[';} else {return null;}}// Cache pathExprString[] pathExpr = pathExprCache.get(pathString);if (pathExpr == null) {pathExpr = pathString.split("\\.", -1);pathExprCache.put(pathString, pathExpr);}// Cache extractObjectObject extractObject = extractObjectCache.get(jsonString);if (extractObject == null) {if (unknownType) {try {extractObject = objectMapper.readValue(jsonString, LIST_TYPE);} catch (Exception e) {// Ignore exception}if (extractObject == null) {try {extractObject = objectMapper.readValue(jsonString, MAP_TYPE);} catch (Exception e) {return null;}}} else {JavaType javaType = isRootArray ? LIST_TYPE : MAP_TYPE;try {extractObject = objectMapper.readValue(jsonString, javaType);} catch (Exception e) {return null;}}extractObjectCache.put(jsonString, extractObject);}for (int i = pathExprStart; i < pathExpr.length; i++) {if (extractObject == null) {return null;}extractObject = extract(extractObject, pathExpr[i], i == pathExprStart && isRootArray);}Text result = new Text();if (extractObject instanceof Map || extractObject instanceof List) {try {result.set(objectMapper.writeValueAsString(extractObject));} catch (Exception e) {return null;}} else if (extractObject != null) {result.set(extractObject.toString());} else {return null;}return result;}//extract方法,递归地提取JSON对象。private Object extract(Object json, String path, boolean skipMapProc) {// skip MAP processing for the first path element if root is arrayif (!skipMapProc) {// Cache patternkey.matcher(path).matches()Matcher mKey = null;Boolean mKeyMatches = mKeyMatchesCache.get(path);if (mKeyMatches == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);}if (!mKeyMatches.booleanValue()) {return null;}// Cache mkey.group(1)String mKeyGroup1 = mKeyGroup1Cache.get(path);if (mKeyGroup1 == null) {if (mKey == null) {mKey = patternKey.matcher(path);mKeyMatches = mKey.matches() ? Boolean.TRUE : Boolean.FALSE;mKeyMatchesCache.put(path, mKeyMatches);if (!mKeyMatches.booleanValue()) {return null;}}mKeyGroup1 = mKey.group(1);mKeyGroup1Cache.put(path, mKeyGroup1);}json = extract_json_withkey(json, mKeyGroup1);}// Cache indexListArrayList<String> indexList = indexListCache.get(path);if (indexList == null) {Matcher mIndex = patternIndex.matcher(path);indexList = new ArrayList<String>();while (mIndex.find()) {indexList.add(mIndex.group(1));}indexListCache.put(path, indexList);}if (indexList.size() > 0) {json = extract_json_withindex(json, indexList);}return json;}//创建一个名为jsonList的AddingList对象,用于存储提取出来的JSON对象。private transient AddingList jsonList = new AddingList();//定义了一个静态内部类AddingList,继承自ArrayList<Object>,用于添加JSON对象到jsonList中。private static class AddingList extends ArrayList<Object> {private static final long serialVersionUID = 1L;@Overridepublic Iterator<Object> iterator() {return Iterators.forArray(toArray());}@Overridepublic void removeRange(int fromIndex, int toIndex) {super.removeRange(fromIndex, toIndex);}};//extract_json_withindex方法,根据JSON路径中的索引提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withindex(Object json, ArrayList<String> indexList) {jsonList.clear();jsonList.add(json);for (String index : indexList) {int targets = jsonList.size();if (index.equalsIgnoreCase("*")) {for (Object array : jsonList) {if (array instanceof List) {for (int j = 0; j < ((List<Object>)array).size(); j++) {jsonList.add(((List<Object>)array).get(j));}}}} else {for (Object array : jsonList) {int indexValue = Integer.parseInt(index);if (!(array instanceof List)) {continue;}List<Object> list = (List<Object>) array;if (indexValue >= list.size()) {continue;}jsonList.add(list.get(indexValue));}}if (jsonList.size() == targets) {return null;}jsonList.removeRange(0, targets);}if (jsonList.isEmpty()) {return null;}return (jsonList.size() > 1) ? new ArrayList<Object>(jsonList) : jsonList.get(0);}//extract_json_withkey方法,根据JSON路径中的键提取JSON对象。@SuppressWarnings("unchecked")private Object extract_json_withkey(Object json, String path) {if (json instanceof List) {List<Object> jsonArray = new ArrayList<Object>();for (int i = 0; i < ((List<Object>) json).size(); i++) {Object json_elem = ((List<Object>) json).get(i);Object json_obj = null;if (json_elem instanceof Map) {json_obj = ((Map<String, Object>) json_elem).get(path);} else {continue;}if (json_obj instanceof List) {for (int j = 0; j < ((List<Object>) json_obj).size(); j++) {jsonArray.add(((List<Object>) json_obj).get(j));}} else if (json_obj != null) {jsonArray.add(json_obj);}}return (jsonArray.size() == 0) ? null : jsonArray;} else if (json instanceof Map) {return ((Map<String, Object>) json).get(path);} else {return null;}}
}

依赖

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.atguigu.hive</groupId><artifactId>hivetest</artifactId><version>1.0-SNAPSHOT</version><properties><hadoop.version>3.3.1</hadoop.version><hive.version>4.0.0</hive.version><jackson.version>2.13.5</jackson.version><guava.version>22.0</guava.version></properties><dependencies><dependency><groupId>com.fasterxml.jackson</groupId><artifactId>jackson-bom</artifactId><version>${jackson.version}</version><type>pom</type><scope>import</scope></dependency><dependency><groupId>com.google.guava</groupId><artifactId>guava</artifactId><version>${guava.version}</version></dependency><dependency><groupId>com.fasterxml.jackson.core</groupId><artifactId>jackson-core</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>com.fasterxml.jackson.core</groupId><artifactId>jackson-databind</artifactId><version>${jackson.version}</version></dependency><dependency><groupId>org.apache.hive</groupId><artifactId>hive-jdbc</artifactId><version>${hive.version}</version></dependency></dependencies></project>

Hive 版本不同,之间的依赖可能就有些许差距,如果不注意的话可能会报依赖错误。

参考文章:

get_json_object不能解析json里面中文的key

get_json_object源码

impala&hive自定义UDF解析json中文key

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.rhkb.cn/news/100557.html

如若内容造成侵权/违法违规/事实不符,请联系长河编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

七夕节日表白:七大网页风格与其适用人群

&#x1f337;&#x1f341; 博主猫头虎 带您 Go to New World.✨&#x1f341; &#x1f984; 博客首页——猫头虎的博客&#x1f390; &#x1f433;《面试题大全专栏》 文章图文并茂&#x1f995;生动形象&#x1f996;简单易学&#xff01;欢迎大家来踩踩~&#x1f33a; &a…

SVM详解

公式太多了&#xff0c;就用图片用笔记呈现&#xff0c;SVM虽然算法本质一目了然&#xff0c;但其中用到的数学推导还是挺多的&#xff0c;其中拉格朗日约束关于α>0这块证明我看了很长时间&#xff0c;到底是因为悟性不够。对偶问题也是&#xff0c;用了一个简单的例子才明…

文心一言 VS 讯飞星火 VS chatgpt (79)-- 算法导论7.4 4题

四、请用go语言&#xff0c;证明:RANDOMIZED-QUICKSORT 期望运行时间是 Ω(nlgn)。 文心一言&#xff1a; 首先&#xff0c;为了证明RANDOMIZED-QUICKSORT的期望运行时间是Ω(nlg n)&#xff0c;我们需要证明在最坏的情况下&#xff0c;该算法的运行时间是O(nlg n)。然后&…

1.0的星火2.0必将燎原——图文声影PPT全测试

一、前言 大家好&#xff0c;勇哥又来分享AI模型了&#xff0c;前几天讯飞发布的星火大模型2.0迅速的进入了我们圈子里&#xff0c;为了有更多更好的模型分享给大家&#xff0c;分享星火大模型2.0是必须做的&#xff0c;我做一个传递着&#xff0c;希望大家也星火相传啊。 我…

解决执行 spark.sql 时版本不兼容的一种方式

场景描述 hive 数据表的导入导出功能部分代码如下所示&#xff0c;使用 assemble 将 Java 程序和 spark 相关依赖一起打成 jar 包&#xff0c;最后 spark-submit 提交 jar 到集群执行。 public class SparkHiveApplication {public static void main(String[] args){long sta…

操作系统——shell编程

文章目录 shell入门什么是 Shell&#xff1f;Shell 编程的 Hello World Shell 变量Shell 编程中的变量介绍Shell 字符串入门Shell 字符串常见操作Shell 数组 Shell 基本运算符算数运算符关系运算符逻辑运算符布尔运算符字符串运算符文件相关运算符 shell流程控制if 条件语句for…

市面上那里有稳定L2股票行情数据接口?

随着市场的发展和技术的进步&#xff0c;level2股票行情数据接口已经成为股票交易软件的标准配置之一。虽然这些券商软件的功能在很大程度上相似&#xff0c;但它们仍然有自己的特点和优势。 例如&#xff1a;通过股票交易所以其专业的研究报告和丰富的信息服务而受到广泛关注&…

Shell编程基础02

0目录 1.case语法 2.grep 3.sed 4.awk 5.linux安装mysql 1.case语法 创建一个txt文档 执行 查询用户名 case 用法 写一个计算器脚本 加入函数 补充查看进程命名 2.find grep命令 Find 查询当前目录下 以sh结尾的文件 Grep 查询义开头的 或者加入正则表达…

Java --- 二维数组

一、二维数组的定义 public class TwoArrayTest {public static void main(String[] args) {//二维数组声明与初始化//方式1&#xff1a;静态初始化int[][] arr new int[][]{{1,2,3},{1,2,3},{1,2,3}};//方式2&#xff1a;动态初始化int[][] arr2 new int[3][3];arr2[0][1] …

开学季有哪些电容笔值得买吗?便宜的电容笔推荐

开学有哪些电容笔值得入手呢&#xff1f;这款名为Apple Pencil的平替电容笔&#xff0c;在压感功能方面这一项上&#xff0c;与Apple Pencil相比的主要区别就是&#xff0c;平替电容笔仅仅只拥有倾斜的压感功能&#xff0c;但用来书写和做笔记还是绰绰有余的&#xff0c;价格便…

C++系列-引用

引用 引用的基本使用引用的起源引用的语法引用的本质引用的注意事项引用和指针 引用作为函数参数引用作为函数的返回值常量引用其它用返回值方式调用函数&#xff08;case 1&#xff09;用函数的返回值初始化引用的方式调用函数&#xff08;case 2&#xff09;用返回引用的方式…

探工业互联网的下一站!腾讯云助力智造升级

引言 数字化浪潮正深刻影响着传统工业形态。作为第四次工业革命的重要基石&#xff0c;工业互联网凭借其独特的价值快速崛起&#xff0c;引领和推动着产业变革方向。面对数字化时代给产业带来的机遇与挑战&#xff0c;如何推动工业互联网的规模化落地&#xff0c;加速数字经济…

【Axure模板】APP帮助中心原型,在线客服意见反馈模块高保真原型

作品概况 页面数量&#xff1a;共 10 页 兼容软件&#xff1a;Axure RP 9/10&#xff0c;不支持低版本 应用领域&#xff1a;原型设计模板 作品申明&#xff1a;页面内容仅用于功能演示&#xff0c;无实际功能 作品特色 该模板作品为APP帮助与客服的通用模块&#xff0c;…

MySQL分页查询-性能优化

MySQL分页查询优化 一、背景二、原因三、解决四、原理探究 https://blog.csdn.net/hollis_chuang/article/details/130570281 一、背景 业务背景&#xff1a;给C端10万级别的用户&#xff0c;同时发送活动消息&#xff0c;活动消息分为6类。数据背景&#xff1a;mysql表有百万…

拒绝摆烂!C语言练习打卡第五天

&#x1f525;博客主页&#xff1a;小王又困了 &#x1f4da;系列专栏&#xff1a;每日一练 &#x1f31f;人之为学&#xff0c;不日近则日退 ❤️感谢大家点赞&#x1f44d;收藏⭐评论✍️ 目录 一、选择题 &#x1f4dd;1.第一题 &#x1f4dd;2.第二题 &#x1f4d…

【高级IO】- 多路转接之 poll | epoll

目录 I/O多路转接 - poll poll 函数 poll 服务器 poll 服务器 poll 的优点 poll 的缺点 I/O 多路转接 - epoll epoll 的相关系统调用 epoll_create 函数 epoll_ctl 函数 epoll_wait 函数 epoll 工作原理 epoll 服务器 ​编辑 epoll 的优点&#xff08;与 sele…

在线零售供应链管理:从0到100

全球57%的公司将其供应链视为竞争优势&#xff0c;然而只有6%的人表示完全了解它。无论您是首次在线卖家还是经验丰富的专业人士&#xff0c;驾驭管理高效且具有成本效益的零售供应链的复杂性都可能具有挑战性。一起来了解一下吧&#xff01; 供应链管理是什么&#xff1f;它基…

html动态爱心代码【二】(附源码)

目录 前言 效果演示 内容修改 完整代码 总结 前言 七夕马上就要到了&#xff0c;为了帮助大家高效表白&#xff0c;下面再给大家带来了实用的HTML浪漫表白代码(附源码)背景音乐&#xff0c;可用于520&#xff0c;情人节&#xff0c;生日&#xff0c;表白等场景&#xff0c…

前端打开弹窗时将链接转化为二维码

qrcodejs2 1.安装qrcodejs2 2.在使用页面中引入 import QRCode from "qrcodejs2";3.在组件中注册(Vue2项目) components: {QRCode,}, 4.在data中定义qrcode&#xff0c;以及方法中使用 showCode(row) {this.dialogVisible true;this.$nextTick(() > { 需要n…

21 移动测试神器:带你玩转Appium

Appium特点 Appium 作为目前主流的移动应用自动化测试框架&#xff0c;具有极强的灵活性&#xff0c;主要体现在以下 5 个方面&#xff1a; 测试用例的实现支持多种编程语言&#xff0c;比如 Java、Ruby、Python 等&#xff1b; Appium Server 支持多平台&#xff0c;既…