HttpClient 4.5: fixing garbled characters when redirecting to a URL that contains Chinese

1. Problem description

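I ran into a URL A that, when requested, redirects to a URL B containing Chinese characters. I assumed that with HttpClient's automatic redirect handling enabled, downloading the page behind A would be trivial, but it was not. HttpClient garbled the Chinese characters when it read the redirect target URL B, so the download failed. The exact error is shown in the screenshot below: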

2. Solution

The root cause is the charset setting of the ConnectionConfig object. If you use a connection pool, set it on the connection manager like this:

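PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
// Decode response header lines (including Location) as UTF-8.
cm.setDefaultConnectionConfig(ConnectionConfig.custom().setCharset(Charset.forName("UTF-8")).build());
CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(cm).build();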

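If you only use an HttpClient object and let the builder create its own connection manager, you can do this instead (according to the HttpClientBuilder Javadoc, this builder-level value can be overridden by setConnectionManager(...), so with your own pool use the approach above):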

// Decode response header lines (including Location) as UTF-8.
CloseableHttpClient httpClient = HttpClients.custom()
        .setDefaultConnectionConfig(ConnectionConfig.custom().setCharset(Charset.forName("UTF-8")).build())
        .build();

3. Analysis

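I gave the solution up front. If you are interested, let's walk through how I got there.

First, we need to understand what HttpClient does during a redirect.

By default, HttpClient's redirect strategy is implemented by the DefaultRedirectStrategy class. Its getLocationURI(...) method obtains the redirect target URL. The code looks like this: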

public URI getLocationURI(HttpRequest request, HttpResponse response, HttpContext context) throws ProtocolException {
    Args.notNull(request, "HTTP request");
    Args.notNull(response, "HTTP response");
    Args.notNull(context, "HTTP context");
    HttpClientContext clientContext = HttpClientContext.adapt(context);
    Header locationHeader = response.getFirstHeader("location");
    if(locationHeader == null) {
        throw new ProtocolException("Received redirect response " + response.getStatusLine() + " but no location header");
    } else {
        String location = locationHeader.getValue();
        if(this.log.isDebugEnabled()) {
            this.log.debug("Redirect requested to location \'" + location + "\'");
        }

        RequestConfig config = clientContext.getRequestConfig();
        URI uri = this.createLocationURI(location);

        try {
            if(!uri.isAbsolute()) {
                if(!config.isRelativeRedirectsAllowed()) {
                    throw new ProtocolException("Relative redirect location \'" + uri + "\' not allowed");
                }

                HttpHost redirectLocations = clientContext.getTargetHost();
                Asserts.notNull(redirectLocations, "Target host");
                URI requestURI = new URI(request.getRequestLine().getUri());
                URI absoluteRequestURI = URIUtils.rewriteURI(requestURI, redirectLocations, false);
                uri = URIUtils.resolve(absoluteRequestURI, uri);
            }
        } catch (URISyntaxException var12) {
            throw new ProtocolException(var12.getMessage(), var12);
        }

        RedirectLocations redirectLocations1 = (RedirectLocations)clientContext.getAttribute("http.protocol.redirect-locations");
        if(redirectLocations1 == null) {
            redirectLocations1 = new RedirectLocations();
            context.setAttribute("http.protocol.redirect-locations", redirectLocations1);
        }

        if(!config.isCircularRedirectsAllowed() && redirectLocations1.contains(uri)) {
            throw new CircularRedirectException("Circular redirect to \'" + uri + "\'");
        } else {
            redirectLocations1.add(uri);
            return uri;
        }
    }
}

Note the key line:

Header locationHeader = response.getFirstHeader("location");

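As you can see, when a response asks for a redirect, HttpClient first reads the Location header of the response, wraps its value in a URI object, and then issues a new request to that URI.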
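To make that step concrete, here is a minimal sketch of turning a Location value into an absolute URI using only java.net.URI. The class name LocationResolveSketch, the method resolveLocation and the example.com URLs are made up for illustration. HttpClient itself goes through URIUtils, as the listing above shows, so treat this as a simplified model rather than the library's code.

```java
import java.net.URI;

class LocationResolveSketch {

    // Turn a Location header value into an absolute URI.
    // A relative value is resolved against the URI of the original request.
    static URI resolveLocation(String requestUri, String location) {
        URI uri = URI.create(location);
        if (!uri.isAbsolute()) {
            uri = URI.create(requestUri).resolve(uri);
        }
        return uri;
    }

    public static void main(String[] args) {
        String request = "http://example.com/a/index.html";
        // Relative Location: resolved against the request URI.
        System.out.println(resolveLocation(request, "/b/page.html"));
        // Absolute Location: used as-is.
        System.out.println(resolveLocation(request, "http://other.example.org/x"));
    }
}
```

Everything here depends on the Location string being a valid URI in the first place, which is exactly what goes wrong in our case.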

With that in mind, we set a breakpoint at this line and look at the Location value that is actually obtained. It turns out that the value is already garbled at this point.
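The kind of garbling seen in the debugger can be reproduced with the JDK alone. The sketch below assumes the server sends the Location value as raw UTF-8 bytes and that the client reads them one byte per character (equivalent to ISO-8859-1). The class name, method names and the example.com URL are hypothetical. It also shows one plausible reason the download then fails: java.net.URI rejects the garbled string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

class GarbledLocationSketch {

    // Simulate UTF-8 bytes on the wire being read one byte per char.
    static String misdecode(String original) {
        byte[] wire = original.getBytes(StandardCharsets.UTF_8);
        return new String(wire, StandardCharsets.ISO_8859_1);
    }

    // True if java.net.URI accepts the value.
    static boolean parsesAsUri(String value) {
        try {
            new URI(value);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese".
        String location = "http://example.com/\u4e2d\u6587";
        String garbled = misdecode(location);

        // Each Chinese character became three unrelated characters.
        System.out.println(location.length() + " chars -> " + garbled.length() + " chars");
        System.out.println("original parses as URI: " + parsesAsUri(location));
        System.out.println("garbled parses as URI:  " + parsesAsUri(garbled));
    }
}
```

In this example the garbled value contains control characters in the U+0080 to U+009F range, which java.net.URI does not allow in a path. Other inputs may parse but still point at the wrong resource.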

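At this point we can be sure the problem is not in the response's getFirstHeader(String name) method but in the response itself. In other words, the HttpResponse instance we get back after sending the request already carries the garbled value.

So we keep tracing downward and look at what HttpRequestExecutor, which returns the HttpResponse object, is doing.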

protected HttpResponse doSendRequest(HttpRequest request, HttpClientConnection conn, HttpContext context) throws IOException, HttpException {
    Args.notNull(request, "HTTP request");
    Args.notNull(conn, "Client connection");
    Args.notNull(context, "HTTP context");
    HttpResponse response = null;
    context.setAttribute("http.connection", conn);
    context.setAttribute("http.request_sent", Boolean.FALSE);
    conn.sendRequestHeader(request);
    if(request instanceof HttpEntityEnclosingRequest) {
        boolean sendentity = true;
        ProtocolVersion ver = request.getRequestLine().getProtocolVersion();
        if(((HttpEntityEnclosingRequest)request).expectContinue() && !ver.lessEquals(HttpVersion.HTTP_1_0)) {
            conn.flush();
            if(conn.isResponseAvailable(this.waitForContinue)) {
                response = conn.receiveResponseHeader();
                if(this.canResponseHaveBody(request, response)) {
                    conn.receiveResponseEntity(response);
                }

                int status = response.getStatusLine().getStatusCode();
                if(status < 200) {
                    if(status != 100) {
                        throw new ProtocolException("Unexpected response: " + response.getStatusLine());
                    }

                    response = null;
                } else {
                    sendentity = false;
                }
            }
        }

        if(sendentity) {
            conn.sendRequestEntity((HttpEntityEnclosingRequest)request);
        }
    }

    conn.flush();
    context.setAttribute("http.request_sent", Boolean.TRUE);
    return response;
}

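We find that the request is actually sent and the response headers are actually read by the following two fragments (the same conn.receiveResponseHeader() call is also what doReceiveResponse(...) uses to read the final response):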

conn.sendRequestHeader(request);

response = conn.receiveResponseHeader();
if(this.canResponseHaveBody(request, response)) {  
    conn.receiveResponseEntity(response);
}

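By default, the implementation class of conn is DefaultBHttpClientConnection.

Since the Location value that drives the redirect lives in the response headers, we step into the receiveResponseHeader() method of DefaultBHttpClientConnection to see what happens there: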

public HttpResponse receiveResponseHeader() throws HttpException, IOException {
    this.ensureOpen();
    HttpResponse response = (HttpResponse)this.responseParser.parse();
    this.onResponseReceived(response);
    if(response.getStatusLine().getStatusCode() >= 200) {
        this.incrementResponseCount();
    }

    return response;
}

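We still cannot see how the response headers are actually read here, but we do discover responseParser. Tracing further, we find that the parse() method of responseParser is implemented by the abstract class AbstractMessageParser: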

public T parse() throws IOException, HttpException {
    int st = this.state;
    switch(st) {
    case 0:
        try {
            this.message = this.parseHead(this.sessionBuffer);
        } catch (ParseException var4) {
            throw new ProtocolException(var4.getMessage(), var4);
        }

        this.state = 1;
    case 1:
        Header[] headers = parseHeaders(this.sessionBuffer, this.messageConstraints.getMaxHeaderCount(), this.messageConstraints.getMaxLineLength(), this.lineParser, this.headerLines);
        this.message.setHeaders(headers);
        HttpMessage result = this.message;
        this.message = null;
        this.headerLines.clear();
        this.state = 0;
        return result;
    default:
        throw new IllegalStateException("Inconsistent parser state");
    }
}

Notice the Header[] array in this code. We are clearly getting close to the target, so we go one level deeper into the parseHeaders(...) method:

public static Header[] parseHeaders(SessionInputBuffer inbuffer, int maxHeaderCount, int maxLineLen, LineParser parser, List<CharArrayBuffer> headerLines) throws HttpException, IOException {
    Args.notNull(inbuffer, "Session input buffer");
    Args.notNull(parser, "Line parser");
    Args.notNull(headerLines, "Header line list");
    CharArrayBuffer current = null;
    CharArrayBuffer previous = null;

    do {
        if(current == null) {
            current = new CharArrayBuffer(64);
        } else {
            current.clear();
        }

        int headers = inbuffer.readLine(current);
        int i;
        if(headers == -1 || current.length() < 1) {
            Header[] var12 = new Header[headerLines.size()];

            for(i = 0; i < headerLines.size(); ++i) {
                CharArrayBuffer var13 = (CharArrayBuffer)headerLines.get(i);

                try {
                    var12[i] = parser.parseHeader(var13);
                } catch (ParseException var11) {
                    throw new ProtocolException(var11.getMessage());
                }
            }

            return var12;
        }

        if((current.charAt(0) == 32 || current.charAt(0) == 9) && previous != null) {
            for(i = 0; i < current.length(); ++i) {
                char buffer = current.charAt(i);
                if(buffer != 32 && buffer != 9) {
                    break;
                }
            }

            if(maxLineLen > 0 && previous.length() + 1 + current.length() - i > maxLineLen) {
                throw new MessageConstraintException("Maximum line length limit exceeded");
            }

            previous.append(' ');
            previous.append(current, i, current.length() - i);
        } else {
            headerLines.add(current);
            previous = current;
            current = null;
        }
    } while(maxHeaderCount <= 0 || headerLines.size() < maxHeaderCount);

    throw new MessageConstraintException("Maximum header count exceeded");
}

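This method is fairly long, but we only need to care about two variables: inbuffer and current. The former is a SessionInputBuffer object whose instream field holds what is effectively our response stream. The latter is a CharArrayBuffer, which is essentially a char array.

At this point we can be fairly sure that the garbling happens while the response stream is being converted into that char array.

Stepping into SessionInputBufferImpl, the implementation of SessionInputBuffer, we find that the class has a CharsetDecoder field, and tracing shows that it is null by default. All we need to do is follow the method from the beginning of this article so that the implementation gets a CharsetDecoder for UTF-8, and the garbled Chinese characters are fixed.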
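The effect of that decoder can be modelled with the JDK alone. The sketch below is not the httpcore source. It assumes, based on my reading of the 4.4.x code, that with a null decoder each header byte is simply widened to one char, and that with a decoder the bytes go through CharsetDecoder. The class name HeaderLineDecodeSketch and the method decodeLine are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

class HeaderLineDecodeSketch {

    // Convert the raw bytes of one header line into a String.
    // decoder == null models the default: one byte becomes one char.
    static String decodeLine(byte[] line, CharsetDecoder decoder) {
        if (decoder == null) {
            StringBuilder sb = new StringBuilder(line.length);
            for (byte b : line) {
                sb.append((char) (b & 0xff)); // plain widening, no charset involved
            }
            return sb.toString();
        }
        try {
            return decoder.decode(ByteBuffer.wrap(line)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Header line is not valid in the configured charset", e);
        }
    }

    public static void main(String[] args) {
        String header = "Location: /\u4e2d\u6587";
        byte[] wire = header.getBytes(StandardCharsets.UTF_8);

        String withoutDecoder = decodeLine(wire, null);
        String withDecoder = decodeLine(wire, StandardCharsets.UTF_8.newDecoder());

        System.out.println("no decoder matches original:    " + header.equals(withoutDecoder));
        System.out.println("UTF-8 decoder matches original: " + header.equals(withDecoder));
    }
}
```

Pure ASCII header lines come out the same on both paths, which would explain why the problem only shows up once a redirect target contains non-ASCII characters.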