They define their data scientist as 50% software engineer, 30% statistician, and 20% business analyst. The main tools they use are Python, Amazon Web Services, and Spark (Scala).

2016

A Credit Card Fraud case covering everything from feature engineering to model building. The single case took 45 minutes, so it went into a lot of detail.

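I was rejected after the first round, and the rejection email arrived less than an hour later. Honestly, I thought I had answered well, or at least not badly enough to fail in the first round. The case was the same: fraud detection. I had applied through a friend's referral, and the office was in TX. It also took about 45 minutes, though I don't know whether the specific questions were the same. As I remember, I analyzed the problem from the start: this is a supervised problem (because you need to know whether each person eventually committed fraud, Yes/No). Then came feature selection. He asked how I would do it, and I said I would use Lasso, that this step matters a lot, and that you also need to watch the bias-variance tradeoff. He specifically pointed out that no other candidate had mentioned this and said it was a good point. Then he asked more about Lasso and about checking model quality at the end: validation, CV, and so on.

At the very end he started throwing hard questions at me: what if the target is missing? I was completely dumbfounded. You take a perfectly good supervised problem and turn it into an unsupervised one, so all I could say was to solve it with unsupervised methods. I suggested clustering: find the records that fall into the same cluster and label them Yes/No. He still wasn't satisfied. I really couldn't think of anything else, so I turned it around and said: then look into why the target is missing. My thinking was that, in theory, this target cannot be missing. A person has either committed fraud or not. Even with ten thousand motives to commit fraud, as long as they haven't done it the label is No, and once they have it is Yes. How could it be missing? Only if your database has a bug or someone deliberately deleted it. And even if it was lost for technical reasons, a company should have backups of this kind of data; it can't just be gone forever.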

2018

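We started with my resume. Then the interviewer set up a hypothetical dataset and scenario and asked step by step: cleaning, feature engineering, model selection, and validation. Large data volumes also came up: how to handle very large data and which tools to use. The questions were fairly detailed, and in places I had to roughly describe how the code would be implemented. The call lasted one hour.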

2018

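Credit card churn model. Others on the forum have already shared this, so I'll just add some details:

   1. Feature engineering, e.g. computing tenure from the start date.
   2. Missing values.
   3. Which model would you use, and why?
   4. The data volume grows; what now? Spark. If you had to choose, SparkR or PySpark? Why?
   5. Now the model produces its output. A user with 0% credit limit utilization and a user with 95% utilization are both high-risk and both likely to close their card soon. How would you handle them? I answered that the churn model is a starting point; typically the marketing department designs a retention program based on its results. These two kinds of at-risk users need different incentive plans.
         1) Users at 0% utilization are basically very hard to win back.
         2) Users at 95% utilization can most likely be retained: lower the interest rate, increase cashback, and so on.
         3) Based on test results, you can then build an uplift model to see which high-churn users can be won back, and focus the treatment on them.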
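
The two features mentioned above (tenure from a start date, and credit limit utilization) can be sketched as follows. This is only a minimal illustration: the field names `start_date`, `balance`, and `credit_limit`, the snapshot date, and the sample record are all hypothetical, not taken from the interview.

```python
from datetime import date

def tenure_months(start: date, as_of: date) -> int:
    """Whole months between account opening and the snapshot date."""
    months = (as_of.year - start.year) * 12 + (as_of.month - start.month)
    if as_of.day < start.day:
        months -= 1  # the current month is not complete yet
    return max(months, 0)

def utilization(balance: float, credit_limit: float) -> float:
    """Share of the credit limit in use; 0.0 when the limit is missing or zero."""
    if not credit_limit:
        return 0.0
    return balance / credit_limit

# Hypothetical customer record (illustrative values only)
customer = {"start_date": date(2015, 3, 20), "balance": 4750.0, "credit_limit": 5000.0}
snapshot = date(2018, 6, 1)
features = {
    "tenure_months": tenure_months(customer["start_date"], snapshot),
    "utilization": utilization(customer["balance"], customer["credit_limit"]),
}
```

Treating a missing or zero limit as 0.0 utilization is itself a modeling choice; a separate missing-value indicator may be a better fit depending on the data.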

2018

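The process of building a predictive model for credit card customer retention, predicting whether a user will cancel their credit card. I didn't answer well but still barely passed. I hadn't expected this to be more of a business conversation than a purely technical interview, and in my 6 years in the US I have never used a credit card, so I came across as lacking common sense. The interview covered essentially every step from feature engineering to final model tuning and validation.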

2018

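The phone interview mainly asked about my current work experience and some projects, and whether I had used predictive modeling, regression, and the like. At the end of the call, the first onsite round was scheduled.

They ask about a lot of modeling details, especially predictive modeling, random forest, logistic regression, and so on. If your resume lists this kind of experience, prepare thoroughly, because they dig into details: how you built the model, which parameters you used, what the results were, and why you chose that model.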

2018

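The interview started with self-introductions. The interviewer shared their screen to show a Word document. The question was roughly:
Our server run cost is xxx, other fixed costs are xxx, and the server can hold xxx TB of traffic. We have about xxx customers, and each pays us a server usage fee of xxx/month. We allocate xxx GB to each user, but on average each user only uses xx% of it, so we can use the remaining space to take on more customers. Question: what is the annual profit? There is another server, server B, whose cost is xxx and capacity is xxx... Weigh the two and decide whether we should replace our existing server with server B.
The job fit part was simple: how much Python I know, which packages I usually use, and recent DS projects I have done.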
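
The arithmetic behind the server case can be sketched roughly as below. The original post gives no numbers, so every value here is a made-up placeholder, and several modeling choices are assumptions: costs are treated as annual and per server, 1 TB is taken as 1000 GB, and servers are sized by average usage with no safety margin.

```python
import math

def customers_per_server(capacity_tb, allocated_gb, avg_usage_frac):
    """Customers one server can host when sized by average actual usage."""
    used_gb_per_customer = allocated_gb * avg_usage_frac
    return math.floor(capacity_tb * 1000 / used_gb_per_customer)

def annual_profit(n_customers, monthly_fee, server_cost_per_year,
                  fixed_cost_per_year, capacity_tb, allocated_gb, avg_usage_frac):
    """Annual revenue minus server and fixed costs (all inputs are assumptions)."""
    per_server = customers_per_server(capacity_tb, allocated_gb, avg_usage_frac)
    n_servers = math.ceil(n_customers / per_server)
    revenue = n_customers * monthly_fee * 12
    cost = n_servers * server_cost_per_year + fixed_cost_per_year
    return revenue - cost

# Placeholder inputs shared by both options (not from the interview)
common = dict(n_customers=1000, monthly_fee=20, fixed_cost_per_year=50_000,
              allocated_gb=100, avg_usage_frac=0.25)
profit_a = annual_profit(server_cost_per_year=30_000, capacity_tb=10, **common)
profit_b = annual_profit(server_cost_per_year=50_000, capacity_tb=20, **common)
switch_to_b = profit_b > profit_a
```

In an interview the interesting part is usually the discussion around the formula: migration cost, headroom for usage spikes, and how the comparison changes as the customer base grows.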

2018

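We introduced ourselves, and he told me about what he does at Capital One. The technical interview then lasted about 50 minutes and consisted only of a case.

The case: a sporting goods retailer asks you to optimize their online ad bidding system to raise the response rate. Assume you have visit data for 3,000,000 users, each row has more than 150 columns, and the overall response rate is known to be 1/1000.
Questions asked:
1. What do you choose as the target?
Response or not
2. Which metric?
AUC-ROC
3. How do you handle NA?
It depends. If NA is meaningful, leave it as is. If NA is missing due to data extraction, fill it with a simple if-else rule, the mean (or median), or a regression.
4. How do you do feature engineering?
Encode categorical variables; use 'groupby' with 'mean/median/std' to generate some features.
5. What if the data is very large?
MapReduce, but I had never used it, so I used local parallelization as an example: how to distribute the data across threads, and how to collect and merge the results.
6. Which model?
GBDT, LightGBM/XGBoost
7. How do you evaluate the model?
k-fold CV
8. What about overfitting/underfitting?
We discussed each. Try to get more data; tune hyperparameters.
9. If the model's predictions go wrong, what is the impact?
I discussed case by case how things would change overall and what the effect on an individual user would be.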
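
With a response rate of 1/1000, a model that always predicts "no response" is 99.9% accurate, which is one reason a ranking metric such as AUC-ROC is a common choice here. The following is a small pure-Python sketch of AUC computed through the rank-sum formulation; the function name `roc_auc` is my own, and in practice one would normally rely on a library implementation.

```python
def roc_auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation, average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The result can be read as the probability that a randomly chosen responder is scored above a randomly chosen non-responder.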

2018

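After self-introductions, we went straight into the classic question of predicting whether a credit card customer will close their account. Suppose you are given a set of datasets, e.g. a year of credit card transaction records and customer personal information. The bank wants to predict whether a customer will close their account within one month; if so, it plans to offer some cashback rewards to retain them. You are asked to build a model to predict account closure. The interviewer's questions:

1. Which features would you choose? (Seemed open-ended, as long as they are relevant. Follow-up: if you have things like a bunch of transaction dates, how would you rebuild them into features?)
2. How do you do data cleaning:
a.       How do you detect outliers?
b.       How do you fill in missing data? (I said you could fill with a constant such as the mean; he followed up on when filling with the mean is inappropriate and what would be better.)
c.       What if the target value is also missing?
3. Which model do you choose? (I said decision tree; he then asked whether there are other models, their pros and cons, and what the target is. The target should be a binary value, whether the customer will close the account in one month; if a regression yields a value between 0 and 1, it represents how likely.)
4. How do you assess model performance, and with which package?
5. If the data size is large, say 1 TB, how do you sample, and with which package?
6. If the model is inaccurate, what losses would the bank incur?
7. If the model predicts a set of target values, how should rewards be given out based on them? (I said to plot the distribution and give rewards to the top few percent of customers most likely to close. Follow-up: what other approaches are there? I wasn't sure whether this was testing modeling or business sense.)
8. The last one is an open question identical to one I saw on the forum: two people both have a 5000 limit, but one uses 100% and the other only 2%. Could both close their accounts within a month? The interviewer is probably watching whether your first reaction is to think about the model or about other aspects.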
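
For the question about sampling from roughly 1 TB of data, one possible answer is reservoir sampling, which draws a uniform sample in a single pass without loading everything into memory. The sketch below is only an illustration of that idea (Algorithm R); the function name and the use of a plain iterable in place of a real file or Spark source are my assumptions, not something the interviewer specified.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Uniform sample of k items from a stream of unknown length, in one pass."""
    rng = random.Random(seed)
    sample = []
    for n, row in enumerate(rows):
        if n < k:
            sample.append(row)  # fill the reservoir first
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = row  # keep the new row with probability k / (n + 1)
    return sample

# Works on any iterable, e.g. a generator reading a large file line by line
subset = reservoir_sample(range(1_000_000), k=5)
```

If the target is rare, a plain uniform sample may contain very few positives, so stratifying by the label is worth mentioning as a follow-up.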

2017

What machine learning model would you use to classify fraudulent transactions on credit cards?

He asked me many technical questions surrounding applications of machine learning to credit/financial industry. It covered questions about exploratory analysis and data cleaning, and was very focused on using random forests.

Logistic regression, random forests

Anomaly detection/novelty detection techniques might be also helpful because of the huge data imbalance that normally exists in such scenarios.

2017

For the phone interview, I spoke with a data scientist at an office different from the one I was interviewing for. I was asked an ill-defined question about how to develop a model for predicting credit card fraud, given a vague set of feature data. Ultimately, I don't think the interviewer liked my approach, as I received notification from the recruiter about one week later that the hiring manager was not interested in proceeding.

2017

Phone Screening - Fraud detection, modeling related questions.

How would you develop a model to predict credit card fraud? How would you handle missing or garbage data? How would you use existing features to add new features?

2017

Given a dataset, how would you model it to extract a particular piece of information? How would you architect the pipeline?

2017

Technical phone interview that went over credit card fraud case. Asked about what kind of model, features to engineer, false positive/false negative, regularization, and potential issues

Explain the bias-variance tradeoff.
Write pseudocode for map reduce.
What does regularization do?
Difference between random forest and gradient boosted tree.
Describe a time you worked on a team.
How do you learn something new?
Are false positives or false negatives more important?
What is VIF (in regression output)?
Interpret this ANOVA table.
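
For the map reduce question, a single-machine sketch of the three phases (map, shuffle, reduce) might look like the following. It is not a distributed implementation; the task (total spend per card) and the names `map_phase`, `shuffle`, and `reduce_phase` are chosen here purely for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record."""
    card_id, amount = record
    yield card_id, amount

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def map_reduce(records):
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical transactions: (card id, amount)
transactions = [("card_a", 20.0), ("card_b", 5.0), ("card_a", 100.0)]
totals = map_reduce(transactions)
```

In a real cluster the map and reduce calls run on different workers and the shuffle moves data over the network, which is usually the expensive step.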

2018

It was a phone interview, and they asked about 20 questions. I was bombarded with questions, and honestly I didn't like the way she conducted the interview. She was not patient at all when I was answering the questions.

Tell me some useful packages you use in R.
How do you detect multicollinearity?
How do you join two data sets?
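
The interview asked about R, where base `merge()` or the dplyr join functions would be the usual answer. To show what a join does underneath, here is a small hash-join sketch in Python; the function `hash_join` and the sample rows are hypothetical and only meant to illustrate inner versus left joins.

```python
from collections import defaultdict

def hash_join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; `how` is 'inner' or 'left'."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)  # build a lookup table on the right side
    joined = []
    for row in left:
        matches = index.get(row[key], [])
        for match in matches:
            joined.append({**row, **match})
        if not matches and how == "left":
            joined.append(dict(row))  # keep unmatched left rows as they are
    return joined

# Hypothetical tables
customers = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
accounts = [{"id": 1, "limit": 5000}]
inner = hash_join(customers, accounts, "id")
left = hash_join(customers, accounts, "id", how="left")
```

A point worth raising in the interview is what happens with duplicate keys: each left row is repeated once per matching right row, which can silently inflate row counts.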

2018

Phone interview discussing a predictive model for the credit card business.

2015

Phone call with current data scientist: about 45 minutes. Asked some questions to see if you knew what you were talking about with statistics (what sort of distribution would this be, explain SD/mean, etc.). Then a question about a data set and what some problems with it could be. Lastly walk through how you would solve a simple problem with map reduce.

2015

How to build up a model to predict credit card fraud?

He first went through the job function and what the team does. Then came a case interview on how to detect credit card fraud. He asked how to do validation, about specific methods, and how exactly to implement them. He also asked about many possible problems with the model and how to deal with them when time is limited.

Using a binary classification method such as logistic regression, CART, decision tree, random forest, or neural network it is possible to fit a model that predicts the outcome of the dependent variable (whether fraudulent or legitimate transaction). Multiple features can be used to fit the model based on the given dataset.

A couple of things to keep in mind regarding fraud:
1) You're dealing with an imbalanced data set (your fraud cases may be 3-5% of all your data). So, consider either oversampling or giving higher weight to your fraud cases.
2) Your data may not have all the true fraud cases; in other words, there may be actual fraud cases not captured in your data. So, some form of anomaly detection may be needed.
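
The "higher weight for fraud cases" idea can be sketched with inverse-frequency class weights, computed as n_samples / (n_classes * class_count). This is the same heuristic scikit-learn uses for its "balanced" class weights; the 5% fraud rate below is just an illustrative value within the range mentioned above.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

# Illustrative data: 5% fraud (label 1), 95% legitimate (label 0)
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
```

Here each fraud case ends up weighted 19 times as heavily as a legitimate one, so both classes contribute equally to the total loss.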

2016

How to detect fraud given the current data. You are expected to answer how to deal with missing values, how to use classification methods, and which one is good to use. Later there is also a question about which method is the least useful. What is the effect of FP and FN? How would you change the model if two transactions of very different amounts (one large, like $1,000, the other $5) turned out to be classified the same?
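
One way to reason about the $1,000 versus $5 question is to leave the model alone and make the decision rule depend on the amount at risk. The sketch below assumes a very simple cost model that is not from the interview: a missed fraud costs the transaction amount, and a false alarm costs a fixed, hypothetical `false_alarm_cost`. Flagging is then worthwhile when p * amount > (1 - p) * false_alarm_cost.

```python
def flag_threshold(amount, false_alarm_cost):
    """Fraud probability above which flagging has lower expected cost."""
    return false_alarm_cost / (false_alarm_cost + amount)

def should_flag(p_fraud, amount, false_alarm_cost=10.0):
    """Flag when expected fraud loss exceeds expected false-alarm cost."""
    return p_fraud > flag_threshold(amount, false_alarm_cost)

# Same predicted probability, very different amounts
flag_large = should_flag(p_fraud=0.05, amount=1000.0)
flag_small = should_flag(p_fraud=0.05, amount=5.0)
```

Under these assumptions the $1,000 transaction is flagged at a 5% fraud probability while the $5 one is not, even though the classifier scored them identically.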

2016

How to set threshold for credit card fraud detection
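
A common way to approach the threshold question is to pick the cutoff that minimizes total cost on a validation set, given separate costs for false positives and false negatives. The following is a rough sketch under that assumption; the cost values and the tiny validation set are made up for illustration.

```python
def best_threshold(labels, scores, fp_cost, fn_cost):
    """Return (threshold, total_cost) minimizing cost; flag when score >= threshold."""
    # Candidate cutoffs: every observed score, plus "never flag"
    candidates = sorted(set(scores)) + [float("inf")]
    best = None
    for t in candidates:
        fp = sum(1 for y, s in zip(labels, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(labels, scores) if s < t and y == 1)
        cost = fp * fp_cost + fn * fn_cost
        if best is None or cost < best[1]:
            best = (t, cost)
    return best

# Hypothetical validation data: 1 = fraud
val_labels = [0, 0, 0, 1, 0, 1]
val_scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.9]
threshold, cost = best_threshold(val_labels, val_scores, fp_cost=1, fn_cost=10)
```

Because a missed fraud is assumed to cost ten times a false alarm here, the chosen cutoff is low enough to catch both fraud cases at the price of one false positive.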
