I am working with a large CSV file (873,323 x 271) that looks similar to the table below:
| Part_Number | Type_Code | Building_Code | Handling_Code | Price to Buy | Price to Sell | Name |
|:-----------:|:-------------:|:--------------:|:-------------:|:------------:|:-------------:|:-------------:|
| A | 1, 2 | XX, XX, XX | Y, Y, Y, Y, Y | 304.32 | 510 | Mower |
| B | 1, 1, 1 | XX, XX, XX | Y, Y, Y, Y | 1282.04 | 5000 | Saw |
| C | 1, 2, 3 | XX, XX | Y, Y | 68.91 | 65 | Barrel (Hard) |
| D | 1, 1, 1, 1, 1 | XX, XX, XX, XX | Y, Y, Y | 0 | 300 | Barrel (Make) |
| E | 1 | XX | Y, Y, Y, Y | 321.11 | 415 | Cement Mixer |
| F | 2 | XX, XX, XX | Y | 194.44 | 1095 | Cement Mix |
There is a mix of column types: some are numerical, some are strings, and some are strings that look like lists (e.g., Type_Code, Building_Code, Handling_Code).
What I am trying to accomplish is:
If every item in a list-like cell is the same value, remove the list-like structure and replace it with just that value; e.g., 1, 1, 1 should become just 1. Numerical values and non-list-like strings should not be changed.
Applied to the table above, the result should be:
| Part_Number | Type_Code | Building_Code | Handling_Code | Price to Buy | Price to Sell | Name |
|:-----------:|:---------:|:-------------:|:-------------:|:------------:|:-------------:|:-------------:|
| A | 1, 2 | XX | Y | 304.32 | 510 | Mower |
| B | 1 | XX | Y | 1282.04 | 5000 | Saw |
| C | 1, 2, 3 | XX | Y | 68.91 | 65 | Barrel (Hard) |
| D | 1 | XX | Y | 0 | 300 | Barrel (Make) |
| E | 1 | XX | Y | 321.11 | 415 | Cement Mixer |
| F | 2 | XX | Y | 194.44 | 1095 | Cement Mix |
(I.e., since Building_Code was just repetitions of XX, it should just say XX.)
Below is my current attempt:
    import pandas as pd
    # Read in CSV
    df = pd.read_csv('C:\\Users\\wundermahn\\Desktop\\test_stack_csv.csv')

    # Turn all columns into a list
    for col in df.columns:
        col_name = str(col)
        temp = pd.DataFrame(df[col_name].tolist())
        df.drop(col, axis=1, inplace=True)
        df = pd.concat([df, temp], axis=1, join='inner')

    # Now loop through the columns and remove items from the list
    for col in df.columns:
        # If all items are the same
        if (len(set(col)) <= 1):
            # Set it to be that item
            col = col[0]
        else:
            # If they aren't the same, then just take the items out of the list
            col = str(col)

    print(df)
But I get an error:
    Traceback (most recent call last):
      File "c:\Users\wundermahn\Desktop\stack_0318.py", line 15, in <module>
        if (len(set(col)) <= 1):
    TypeError: 'int' object is not iterable
How can I achieve my desired result?
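On the traceback itself: a likely cause is that iterating over df.columns yields column labels rather than column data, and that pd.DataFrame(series.tolist()) produces a column labelled with the integer 0, so set(col) ends up being called on an int. The snippet below is a minimal sketch of that behaviour on a made-up two-column frame (the frame and the variable names are illustrative assumptions, not the real CSV):

```python
import pandas as pd

# Hypothetical stand-in for the real CSV
df = pd.DataFrame({"Type_Code": ["1, 1, 1", "1, 2"], "Price": [1.5, 2.0]})

# Iterating over df.columns gives the labels, not the Series themselves
labels = [col for col in df.columns]

# Rebuilding a column from a list of strings gives it the integer label 0
temp = pd.DataFrame(df["Type_Code"].tolist())
int_labels = list(temp.columns)

# Calling set() on that integer label reproduces the TypeError
try:
    set(int_labels[0])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(labels, int_labels, error_message)
```

Note also that assigning to col inside the loop only rebinds the loop variable; it does not write anything back into df.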
This calls for a custom function that splits each string on "," and joins the pieces back together after removing duplicates, for which I have used dict.fromkeys:
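    # Split on commas, strip whitespace, drop duplicates (keeping order), then rejoin
    f = lambda x: ','.join(dict.fromkeys([i.strip() for i in x.split(',')]).keys())
    # Apply f to every cell of the string (object) columns only
    # (on pandas >= 2.1, DataFrame.map replaces the deprecated applymap)
    df.loc[:, df.dtypes.eq('object')] = df.select_dtypes('O').applymap(f)
    print(df)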
      Part_Number Type_Code Building_Code Handling_Code  Price to Buy  \
    0           A       1,2            XX             Y        304.32   
    1           B         1            XX             Y       1282.04   
    2           C     1,2,3            XX             Y         68.91   
    3           D         1            XX             Y          0.00   
    4           E         1            XX             Y        321.11   
    5           F         2            XX             Y        194.44   

       Price to Sell           Name  
    0            510          Mower  
    1           5000            Saw  
    2             65  Barrel (Hard)  
    3            300  Barrel (Make)  
    4            415   Cement Mixer  
    5           1095     Cement Mix  
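One caveat: the approach above de-duplicates every list, so a cell such as 1, 2, 1 would become 1,2, and untouched lists lose the space after the comma. If cells should only change when all of their items are identical, a stricter variant might look like the sketch below. The helper name collapse_if_uniform and the sample frame are assumptions made for illustration, and any non-numeric cell that contains a comma is treated as list-like.

```python
import pandas as pd

def collapse_if_uniform(cell):
    """Return the single repeated item if every comma-separated item in
    `cell` is identical; otherwise return `cell` unchanged."""
    if not isinstance(cell, str):
        return cell  # NaN and other non-string values pass through
    items = [item.strip() for item in cell.split(",")]
    return items[0] if len(set(items)) == 1 else cell

# Hypothetical sample data modelled on the question's table
df = pd.DataFrame({
    "Part_Number": ["A", "B", "C"],
    "Type_Code": ["1, 2", "1, 1, 1", "1, 2, 1"],
    "Building_Code": ["XX, XX, XX", "XX, XX, XX", "XX, XX"],
    "Price to Buy": [304.32, 1282.04, 68.91],
})

# Only touch non-numeric columns; numeric columns are left as they are
text_cols = df.select_dtypes(exclude="number").columns
for col in text_cols:
    df[col] = df[col].map(collapse_if_uniform)

print(df)
```

Here 1, 1, 1 collapses to 1, while 1, 2 and 1, 2, 1 are left exactly as they were. On a frame with hundreds of thousands of rows, a per-cell Python function like this may be slow, so it is worth timing on a sample first.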