Fixing garbled characters when HttpClient 4.5 redirects to a URL containing Chinese

1. Problem description

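I came across a URL A that, when requested, redirects to a URL B containing Chinese characters. I assumed that with automatic redirects enabled in HttpClient, downloading the page behind A would be trivial, but the result was not what I expected. HttpClient garbled the Chinese characters when it read the redirect target B, so the download failed. The original post showed the error message in a screenshot:

[Image: screenshot of the error message]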

2. Solution

The crux of the problem is the Charset setting of the ConnectionConfig object. If you use a connection pool, configure it as follows:

PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setDefaultConnectionConfig(ConnectionConfig.custom().setCharset(Charset.forName("UTF-8")).build());

If you only use an HttpClient object directly, you can do this instead:

CloseableHttpClient httpClient = HttpClients.custom()
            .setDefaultConnectionConfig(ConnectionConfig.custom().setCharset(Charset.forName("UTF-8")).build())
            .build();

3. Analysis

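The solution is given up front above. If you are interested, the rest of this article walks through how I arrived at it.

First, we need to understand what HttpClient does while following a redirect.

By default, HttpClient's redirect strategy is implemented by the DefaultRedirectStrategy class. Its getLocationURI(...) method obtains the redirect target URL: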

public URI getLocationURI(HttpRequest request, HttpResponse response, HttpContext context) throws ProtocolException {
    Args.notNull(request, "HTTP request");
    Args.notNull(response, "HTTP response");
    Args.notNull(context, "HTTP context");
    HttpClientContext clientContext = HttpClientContext.adapt(context);
    Header locationHeader = response.getFirstHeader("location");
    if(locationHeader == null) {
        throw new ProtocolException("Received redirect response " + response.getStatusLine() + " but no location header");
    } else {
        String location = locationHeader.getValue();
        if(this.log.isDebugEnabled()) {
            this.log.debug("Redirect requested to location \'" + location + "\'");
        }

        RequestConfig config = clientContext.getRequestConfig();
        URI uri = this.createLocationURI(location);

        try {
            if(!uri.isAbsolute()) {
                if(!config.isRelativeRedirectsAllowed()) {
                    throw new ProtocolException("Relative redirect location \'" + uri + "\' not allowed");
                }

                HttpHost redirectLocations = clientContext.getTargetHost();
                Asserts.notNull(redirectLocations, "Target host");
                URI requestURI = new URI(request.getRequestLine().getUri());
                URI absoluteRequestURI = URIUtils.rewriteURI(requestURI, redirectLocations, false);
                uri = URIUtils.resolve(absoluteRequestURI, uri);
            }
        } catch (URISyntaxException var12) {
            throw new ProtocolException(var12.getMessage(), var12);
        }

        RedirectLocations redirectLocations1 = (RedirectLocations)clientContext.getAttribute("http.protocol.redirect-locations");
        if(redirectLocations1 == null) {
            redirectLocations1 = new RedirectLocations();
            context.setAttribute("http.protocol.redirect-locations", redirectLocations1);
        }

        if(!config.isCircularRedirectsAllowed() && redirectLocations1.contains(uri)) {
            throw new CircularRedirectException("Circular redirect to \'" + uri + "\'");
        } else {
            redirectLocations1.add(uri);
            return uri;
        }
    }
}

Note the key line:

Header locationHeader = response.getFirstHeader("location");

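As you can see, when a response calls for a redirect, HttpClient first reads the Location header of the response, wraps its value in a URI object, and then issues a new request to that URI.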
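To make the URI-wrapping step concrete, here is a minimal JDK-only sketch. It assumes the wrapping ends up in java.net.URI, and it uses a made-up redirect target on example.com. The class and method names are hypothetical and this is not HttpClient code. It shows that a correctly decoded Location value parses and can be percent-encoded, while a value whose UTF-8 bytes were read one byte per character is rejected, which is one plausible way the failure could surface.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

class LocationUriDemo {
    // Hypothetical redirect target; \u4e2d\u6587 is the word "Chinese" written in Chinese.
    static final String LOCATION = "http://example.com/\u4e2d\u6587";

    /** Returns true if java.net.URI accepts the given Location value. */
    static boolean parses(String location) {
        try {
            new URI(location);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Percent-encodes the non-ASCII characters of a parsable Location value. */
    static String toAscii(String location) {
        return URI.create(location).toASCIIString();
    }

    /** Simulates a Location value whose UTF-8 bytes were read one byte per char. */
    static String garble(String location) {
        return new String(location.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(parses(LOCATION));          // the decoded value is accepted
        System.out.println(toAscii(LOCATION));         // percent-encoded form
        System.out.println(parses(garble(LOCATION)));  // the garbled value contains control chars
    }
}
```

The garbled value fails because two of the misread bytes land in the C1 control range, which java.net.URI does not accept unescaped.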

With that in mind, we set a breakpoint at this line and inspect the location value actually obtained. It turns out that the value of location is already garbled at this point.
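A quick way to check what kind of garbling you are looking at in the debugger is to reinterpret the characters as bytes and decode them as UTF-8. The sketch below assumes the server sent raw UTF-8 bytes in the Location header. The class name, method name and sample value are hypothetical. If the result is readable Chinese, the value was UTF-8 that had been decoded one byte per character.

```java
import java.nio.charset.StandardCharsets;

class MisdecodedHeaderCheck {
    /** Re-encodes each char as a single byte, then decodes those bytes as UTF-8. */
    static String reinterpretAsUtf8(String value) {
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical garbled value: the six UTF-8 bytes of "\u4e2d\u6587" seen as six chars.
        String garbled = "/\u00e4\u00b8\u00ad\u00e6\u0096\u0087";
        String repaired = reinterpretAsUtf8(garbled);
        System.out.println(repaired.equals("/\u4e2d\u6587")); // true
    }
}
```

This is only a diagnostic. The configuration in section 2 fixes the decoding at the source, so no repair of this kind is needed afterwards.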

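So the problem is not in the response.getFirstHeader(String name) method but in the response itself. In other words, the HttpResponse instance we get back after sending the request already carries the garbled value.

We therefore keep tracing downwards to see what HttpRequestExecutor, which returns the HttpResponse object, is doing.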

protected HttpResponse doSendRequest(HttpRequest request, HttpClientConnection conn, HttpContext context) throws IOException, HttpException {
    Args.notNull(request, "HTTP request");
    Args.notNull(conn, "Client connection");
    Args.notNull(context, "HTTP context");
    HttpResponse response = null;
    context.setAttribute("http.connection", conn);
    context.setAttribute("http.request_sent", Boolean.FALSE);
    conn.sendRequestHeader(request);
    if(request instanceof HttpEntityEnclosingRequest) {
        boolean sendentity = true;
        ProtocolVersion ver = request.getRequestLine().getProtocolVersion();
        if(((HttpEntityEnclosingRequest)request).expectContinue() && !ver.lessEquals(HttpVersion.HTTP_1_0)) {
            conn.flush();
            if(conn.isResponseAvailable(this.waitForContinue)) {
                response = conn.receiveResponseHeader();
                if(this.canResponseHaveBody(request, response)) {
                    conn.receiveResponseEntity(response);
                }

                int status = response.getStatusLine().getStatusCode();
                if(status < 200) {
                    if(status != 100) {
                        throw new ProtocolException("Unexpected response: " + response.getStatusLine());
                    }

                    response = null;
                } else {
                    sendentity = false;
                }
            }
        }

        if(sendentity) {
            conn.sendRequestEntity((HttpEntityEnclosingRequest)request);
        }
    }

    conn.flush();
    context.setAttribute("http.request_sent", Boolean.TRUE);
    return response;
}

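We find that the request is actually sent and the response actually received by the following two pieces of code (the second appears above inside the expect-continue branch; doReceiveResponse, which reads the response to an ordinary request, makes the same two receive calls):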

conn.sendRequestHeader(request);
response = conn.receiveResponseHeader();
if(this.canResponseHaveBody(request, response)) {  
    conn.receiveResponseEntity(response);
}

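By default, the implementation class of conn is DefaultBHttpClientConnection.

Since the Location value that drives the redirect lives in the response headers, we step into the receiveResponseHeader() method of DefaultBHttpClientConnection to see what is going on: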

public HttpResponse receiveResponseHeader() throws HttpException, IOException {
    this.ensureOpen();
    HttpResponse response = (HttpResponse)this.responseParser.parse();
    this.onResponseReceived(response);
    if(response.getStatusLine().getStatusCode() >= 200) {
        this.incrementResponseCount();
    }

    return response;
}

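The actual reading of the headers is still not visible here, but we do discover responseParser. Tracing further, we find that the parse() method of responseParser is implemented by the abstract class AbstractMessageParser: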

public T parse() throws IOException, HttpException {
    int st = this.state;
    switch(st) {
    case 0:
        try {
            this.message = this.parseHead(this.sessionBuffer);
        } catch (ParseException var4) {
            throw new ProtocolException(var4.getMessage(), var4);
        }

        this.state = 1;
    case 1:
        Header[] headers = parseHeaders(this.sessionBuffer, this.messageConstraints.getMaxHeaderCount(), this.messageConstraints.getMaxLineLength(), this.lineParser, this.headerLines);
        this.message.setHeaders(headers);
        HttpMessage result = this.message;
        this.message = null;
        this.headerLines.clear();
        this.state = 0;
        return result;
    default:
        throw new IllegalStateException("Inconsistent parser state");
    }
}

The Header[] array in this code is a strong sign that we are very close to the target, so we go one level deeper into the parseHeaders(...) method:

public static Header[] parseHeaders(SessionInputBuffer inbuffer, int maxHeaderCount, int maxLineLen, LineParser parser, List<CharArrayBuffer> headerLines) throws HttpException, IOException {
    Args.notNull(inbuffer, "Session input buffer");
    Args.notNull(parser, "Line parser");
    Args.notNull(headerLines, "Header line list");
    CharArrayBuffer current = null;
    CharArrayBuffer previous = null;

    do {
        if(current == null) {
            current = new CharArrayBuffer(64);
        } else {
            current.clear();
        }

        int headers = inbuffer.readLine(current);
        int i;
        if(headers == -1 || current.length() < 1) {
            Header[] var12 = new Header[headerLines.size()];

            for(i = 0; i < headerLines.size(); ++i) {
                CharArrayBuffer var13 = (CharArrayBuffer)headerLines.get(i);

                try {
                    var12[i] = parser.parseHeader(var13);
                } catch (ParseException var11) {
                    throw new ProtocolException(var11.getMessage());
                }
            }

            return var12;
        }

        if((current.charAt(0) == 32 || current.charAt(0) == 9) && previous != null) {
            for(i = 0; i < current.length(); ++i) {
                char buffer = current.charAt(i);
                if(buffer != 32 && buffer != 9) {
                    break;
                }
            }

            if(maxLineLen > 0 && previous.length() + 1 + current.length() - i > maxLineLen) {
                throw new MessageConstraintException("Maximum line length limit exceeded");
            }

            previous.append(' ');
            previous.append(current, i, current.length() - i);
        } else {
            headerLines.add(current);
            previous = current;
            current = null;
        }
    } while(maxHeaderCount <= 0 || headerLines.size() < maxHeaderCount);

    throw new MessageConstraintException("Maximum header count exceeded");
}

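This method is fairly long, but only two variables matter: inbuffer and current. The former is a SessionInputBuffer whose instream field holds our response stream. The latter is a CharArrayBuffer, which is essentially a character array.

At this point we can be fairly sure that the garbling happens while the response stream is converted into a character array.

Stepping into SessionInputBufferImpl, the implementation of SessionInputBuffer, we find that it has a CharsetDecoder field, and tracing shows that this field is null by default. Without a decoder, each byte is appended to the buffer as a single character, so a multi-byte UTF-8 sequence is split into several meaningless characters. All we need to do is follow the method at the start of this article, which gives the implementation a CharsetDecoder for UTF-8, and the garbled Chinese problem is solved.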
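The difference between the two paths can be sketched with the JDK alone. This is an illustration, not the HttpCore source: the class and method names are hypothetical, the header line is made up, and the byte-widening loop stands in for what the null-decoder path is assumed to do.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class HeaderLineDecodingDemo {
    /** No decoder: widen every byte to one char (assumed behaviour of the null-decoder path). */
    static String withoutDecoder(byte[] line) {
        StringBuilder sb = new StringBuilder(line.length);
        for (byte b : line) {
            sb.append((char) (b & 0xff));
        }
        return sb.toString();
    }

    /** With a decoder: let the CharsetDecoder turn byte sequences into chars. */
    static String withDecoder(byte[] line, CharsetDecoder decoder) {
        try {
            return decoder.decode(ByteBuffer.wrap(line)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("bytes are not valid for this charset", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical header line; \u4e2d\u6587 is the word "Chinese" written in Chinese.
        String header = "Location: http://example.com/\u4e2d\u6587";
        byte[] wire = header.getBytes(StandardCharsets.UTF_8);

        System.out.println(withoutDecoder(wire).equals(header)); // false: six garbled chars
        System.out.println(withDecoder(wire, StandardCharsets.UTF_8.newDecoder()).equals(header)); // true
    }
}
```

The ASCII part of the header survives both paths unchanged, which is why the problem only shows up once a header value contains non-ASCII bytes.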

Reposted from blog.csdn.net/magicpenta/article/details/81290496