Capital One TPS Summary

Credit Card Fraud Detection (asked 7 times from 2015 to 2017)

What machine learning model would you use to classify fraudulent transactions on credit cards?

feature selection

How would you choose a classification method, and which one works well here? A later follow-up may ask which method is the least useful.

bias variance trade off - What does regularization do?
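One way to make the regularization question concrete is a one-feature ridge fit with no intercept, where the slope has the closed form sum(x*y) / (sum(x*x) + lambda). The sketch below uses made-up numbers and a hypothetical helper name, `ridge_slope`; it only illustrates how a larger penalty shrinks the coefficient toward zero, trading some bias for lower variance.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a one-feature, no-intercept ridge fit.

    Minimizes sum((y - w*x)^2) + lam * w^2, which gives
    w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


# Made-up data, roughly y = 2x plus noise.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# lam = 0 is ordinary least squares; larger lam pulls the slope toward 0.
for lam in (0.0, 1.0, 10.0, 100.0):
    print(f"lambda={lam:>6}: slope={ridge_slope(xs, ys, lam):.3f}")
```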

Missing target values

false positive/false negative - Are false positives or false negatives more important? What is the effect of FP and FN?
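A small sketch can help when answering the FP/FN question: count the confusion-matrix cells at a given score threshold and weigh them by cost. The labels, scores, and the two cost figures below are invented for illustration (a false negative, i.e. missed fraud, is assumed to cost far more than a false positive, i.e. a blocked legitimate purchase); the function names are hypothetical.

```python
def confusion_counts(y_true, scores, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are flagged as fraud."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def total_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Total cost of the errors made at this threshold."""
    _, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    return fp * fp_cost + fn * fn_cost


# Invented example: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.55, 0.40, 0.60, 0.70, 0.90]

# Assumed costs: 5 per false positive, 100 per false negative.
for t in (0.30, 0.50, 0.65):
    print(t, confusion_counts(y_true, scores, t), total_cost(y_true, scores, t, 5, 100))
```

With costs this lopsided, the lowest threshold wins even though it raises more false alarms, which is the usual argument for why false negatives tend to matter more in fraud.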

What is VIF (in regression output)?
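For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below covers only the simplest case, assumed here for brevity: with exactly two predictors, R_j^2 equals the squared Pearson correlation between them. The helper names and data are hypothetical; with more predictors you would need a full regression per column.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]    # correlated with x1
x3 = [1, 0, -2, 0, 1]   # uncorrelated with x1

print(vif_two_predictors(x1, x2))
print(vif_two_predictors(x1, x3))
```

A VIF of 1 means no inflation; a commonly quoted rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.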

potential issues

exploratory analysis and data cleaning

How would you handle missing or garbage data?
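As one possible answer, here is a minimal sketch of median imputation that also keeps a "was missing" flag, so a model can still learn from the fact that a value was absent. The income figures are made up and `impute_median_with_flag` is a hypothetical helper; the skewed last value shows why the median can be a safer fill than the mean.

```python
from statistics import mean, median


def impute_median_with_flag(values):
    """Replace None with the median of observed values.

    Returns (filled_values, was_missing_flags), flags being 0/1.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


# Made-up incomes with two gaps and one extreme value.
incomes = [42_000, None, 55_000, 61_000, None, 1_000_000]

filled, flags = impute_median_with_flag(incomes)
print(filled)
print(flags)

# The mean of the observed values is dragged up by the extreme value.
print(mean(v for v in incomes if v is not None))
```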

How would you use existing features to add new features?
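A sketch of what deriving new features can look like when the raw data is a list of transaction dates. The feature names (`tenure_days`, `days_since_last_txn`, `txn_count_30d`) and the 30-day window are assumptions chosen for illustration, not anything prescribed in the interviews.

```python
from datetime import date


def transaction_features(txn_dates, open_date, as_of):
    """Summarize one customer's transaction dates into numeric features."""
    # Transactions in the 30 days up to and including as_of.
    recent = [d for d in txn_dates if 0 <= (as_of - d).days < 30]
    last = max(txn_dates) if txn_dates else None
    return {
        "tenure_days": (as_of - open_date).days,
        "days_since_last_txn": (as_of - last).days if last else None,
        "txn_count_30d": len(recent),
    }


features = transaction_features(
    txn_dates=[date(2017, 6, 15), date(2017, 12, 10), date(2017, 12, 25)],
    open_date=date(2017, 1, 1),
    as_of=date(2018, 1, 1),
)
print(features)
```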

Logistic regression, random forests

Difference between random forest and gradient boosted tree.

Anomaly detection or novelty detection techniques might also be helpful because of the heavy class imbalance that normally exists in such scenarios.

The interviewer asked about many possible problems with the model and how you would deal with them when time is limited.

A couple of things to keep in mind regarding fraud:
1) You're dealing with an imbalanced data set (your fraud cases may be 3-5% of all your data). So consider either oversampling or giving higher weight to your fraud cases.
2) Your data may not contain all the true fraud cases. In other words, there may be actual fraud cases not captured in your data, so some form of anomaly detection may be needed.
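The two options in point 1 can be sketched in a few lines. The weight formula used here, n_samples / (n_classes * class_count), is the same one scikit-learn applies for `class_weight="balanced"`; the oversampler simply duplicates minority rows at random. Both function names and the 95/5 split are illustrative assumptions.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


def random_oversample(rows, labels, seed=0):
    """Duplicate random rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, cnt in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        out_rows.extend(rng.choices(pool, k=target - cnt))
        out_labels.extend([cls] * (target - cnt))
    return out_rows, out_labels


# Illustrative data: 95 legitimate rows, 5 fraud rows.
labels = [0] * 95 + [1] * 5
rows = list(range(100))

weights = balanced_class_weights(labels)
over_rows, over_labels = random_oversample(rows, labels)
print(weights)
print(Counter(over_labels))
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split leaks copies of the same fraud case into the evaluation data.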

Predicting whether a customer will cancel their credit card (asked 3 times in 2018)

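Suppose you are given a set of datasets, such as one year of credit card transaction records and customers' personal information. The bank wants to predict whether a customer will close their account within one month; if so, the bank plans to offer those customers some cashback rewards to retain them. You are asked to build a model that predicts account closure. The interviewer's questions were: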

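1. Which features would you choose? (It felt like anything reasonably related was acceptable. Follow-up: if the data is a pile of transaction dates and the like, how would you rebuild features from it?)
2. How would you do data cleaning?
a.       How do you detect outliers?
b.       How do you fill in missing data? (I said you could fill in a constant such as the mean; he then asked when filling with the mean is inappropriate and what would work better.)
c.       What do you do if the target value is also missing?
3. Which model would you choose? (I said decision tree; he then asked whether there were other models, what the pros and cons of each are, and what the target is. The target should be a binary value, whether the customer will close the account in one month; if a regression returns a value between 0 and 1, it represents how likely that is.)
4. How do you evaluate the model's performance, and with which package?
5. If the data is very large, say 1 TB, how do you sample it, and with which package?
6. If the model is inaccurate, what losses would that cause the bank?
7. Once the model has produced a set of predicted target values, how should rewards be given out based on them? (I said to plot the distribution and give rewards to the top few percent of customers most likely to close. He asked what other approaches there are; I was not sure whether this was testing modeling or business sense.)
8. The last one was an open question identical to one I had seen on the forum: two customers both have a 5000 limit, but one uses 100% of it and the other only 2%. Could both of them close their accounts within a month? The interviewer is probably checking whether your first reaction is to think about the model or about other aspects.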
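For question 5, one technique worth knowing is reservoir sampling, which draws a uniform random sample of k records in a single pass without loading the whole file or knowing its length in advance. The sketch below is a plain-Python version of the classic "Algorithm R"; it is an illustration of the idea, not a claim about what the interviewer expected, and on a real 1 TB dataset a distributed tool would more likely do the sampling.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


# Works on any iterable, e.g. lines of a large file read one at a time.
picked = reservoir_sample(range(1_000_000), k=5, seed=42)
print(picked)
```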

All the steps from feature engineering through final model tuning and validation.

How you built the model, which parameters you used, what the results were, and why you chose this model.

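Credit card churn model
   1. Feature engineering, e.g. computing tenure from the start date, and so on
   2. Missing values
   3. Which model would you use, and why?
   4. The data volume has now grown; what do you do? Spark. If you had to choose, would you use RSpark or PySpark, and why?
   5. The model output now shows that a customer with 0% credit limit utilization and a customer with 95% utilization are both high risk, and both are likely to close their cards soon. How would you handle this? I answered that the churn model is a starting point; the marketing department usually designs a retention program based on its results. These two kinds of at-risk customers need different incentive plans.
         1) Customers with 0% utilization are basically hard to win back.
         2) Customers with 95% utilization can most likely be retained, by lowering the interest rate, increasing cashback, and so on.
         3) Based on test results, you could build an uplift model to see which high-churn users can actually be won back, and focus the treatment on them.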
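The uplift idea in the last point can be illustrated with the simplest possible estimate: within each customer segment, compare the retention rate of customers who received the offer with that of a control group who did not. This is a two-group difference, not a full uplift model, and it assumes a randomized test in which every segment has both treated and control customers. The segment names and counts below are invented.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """records: iterable of (segment, treated, retained), treated/retained as 0 or 1.

    Returns {segment: treated retention rate - control retention rate}.
    """
    # segment -> treated flag -> [retained_count, total_count]
    totals = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for segment, treated, retained in records:
        cell = totals[segment][treated]
        cell[0] += retained
        cell[1] += 1
    result = {}
    for segment, cells in totals.items():
        ctrl_kept, ctrl_n = cells[0]
        trt_kept, trt_n = cells[1]
        result[segment] = trt_kept / trt_n - ctrl_kept / ctrl_n
    return result


# Invented test results for the two at-risk groups discussed above.
records = (
    [("util_0", 1, 1)] + [("util_0", 1, 0)] * 3      # treated: 1 of 4 retained
    + [("util_0", 0, 1)] + [("util_0", 0, 0)] * 3    # control: 1 of 4 retained
    + [("util_95", 1, 1)] * 3 + [("util_95", 1, 0)]  # treated: 3 of 4 retained
    + [("util_95", 0, 1)] + [("util_95", 0, 0)] * 3  # control: 1 of 4 retained
)

uplift = uplift_by_segment(records)
print(uplift)
```

In this made-up data the offer changes nothing for the 0% utilization group and helps the 95% group, which is the pattern the answer above guessed at; real test results could of course look different.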

Tell me some useful packages you use in R/Python.
How do you detect multicollinearity?
How do you join two data sets?
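For the join question, it can help to show that you understand what a join does underneath. The sketch below is a hand-written left join on a shared key using a dictionary index; the function name and sample rows are hypothetical. In practice you would normally reach for a SQL `JOIN` or pandas `DataFrame.merge` instead.

```python
def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`.

    Every left row is kept. A left row with several matches on the right
    appears once per match; a left row with no match is kept without any
    right-hand columns.
    """
    # Index the right side by key so each lookup is O(1).
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left_rows:
        matches = index.get(row[key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append(dict(row))
    return joined


customers = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}]
transactions = [{"id": 1, "amount": 20}, {"id": 1, "amount": 35}]

result = left_join(customers, transactions, "id")
print(result)
```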
