[Seongdong Cohort 1, 전Z전능 Data Analyst] Day 53: Final Project

데이터분석가_안졍 2024. 1. 4. 12:41


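Yesterday I got as far as loading the data with Python.

Today I checked for missing values and outliers, and decided to replace the missing values with an "etc" placeholder (etc_address).

I plan to analyze the data by time unit, so I am going to create derived variables from the timestamp.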

# Check missing-value counts per column
pd.isna(RT).sum() # no missing values
pd.isna(RT_ver1).sum() # no missing values
pd.isna(RT_ver2).sum() # 493977 missing in address

# Replace missing values in RT_ver2's address with etc_address
RT_ver2['address'] = RT_ver2['address'].fillna("etc_address")
RT_ver2['address'].value_counts()
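
To double-check that the fill did what I wanted, here is a minimal sketch on a toy table. The `toy` DataFrame and its values are made up for illustration (the real RT_ver2 is not reproduced here); only the column names follow the post.

```python
import pandas as pd

# Hypothetical stand-in for RT_ver2; the values are invented for this sketch.
toy = pd.DataFrame({
    "destination_code": ["D1", "D2", "D3", "D4"],
    "address": ["Seoul", None, "Busan", None],
})

# Count missing values per column before filling.
missing_before = toy.isna().sum()

# Replace missing addresses with a placeholder label.
toy["address"] = toy["address"].fillna("etc_address")

# Count again and look at the label frequencies after filling.
missing_after = toy.isna().sum()
address_counts = toy["address"].value_counts()

print(missing_before)
print(missing_after)
print(address_counts)
```

Comparing the counts before and after is a cheap way to confirm that every missing address ended up under the placeholder label.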

# Check for outliers
RT['ship_time'].value_counts().sort_index() # spans 2020-12-31 08 to 2023-11-01 11; could be truncated to hourly resolution
RT['center_code'].value_counts().sort_index()
RT['region'].value_counts().sort_index()
RT['qt'].value_counts().sort_index() # I will use this one
RT['box_qt'].value_counts().sort_index()
RT['unit_qt'].value_counts().sort_index()
RT['product_code'].value_counts().sort_index() # I will use this one
RT['destination_code'].value_counts().sort_index()
RT['ship_unit'].value_counts().sort_index()
RT['ship_unit_qt'].value_counts().sort_index()
RT['std_weight'].value_counts().sort_index()
RT['condition'].value_counts().sort_index()
RT['remark1'].value_counts().sort_index()
RT['remark2'].value_counts().sort_index()
RT['cbm'].value_counts().sort_index()
RT['customer_code'].value_counts().sort_index()
RT_ver1['customer_code'].value_counts().sort_index()
RT_ver1['func'].value_counts().sort_index()
RT_ver2['destination_code'].value_counts().sort_index()
RT_ver2['address'].value_counts().sort_index()
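
Writing one line per column gets long, so the same check could be done in a loop. This is only a sketch: `summarize_columns` is a helper name I made up, and the `toy` table is invented data, not the real RT.

```python
import pandas as pd

def summarize_columns(df, columns=None):
    """Return {column name: value counts sorted by value} for the chosen columns."""
    columns = list(df.columns) if columns is None else columns
    return {col: df[col].value_counts().sort_index() for col in columns}

# Hypothetical stand-in for RT with one suspiciously large quantity.
toy = pd.DataFrame({
    "qt": [1, 2, 2, 500],
    "region": ["A", "B", "A", "A"],
})

summaries = summarize_columns(toy)
for col, counts in summaries.items():
    print(f"--- {col} ---")
    print(counts)
```

Sorting by index puts the smallest and largest values at the two ends of each output, which is where odd values such as the 500 above tend to show up.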

# Add derived variables: year / month / day / hour / minute
RT['year'] = RT['ship_time'].dt.year
RT['month'] = RT['ship_time'].dt.month
RT['day'] = RT['ship_time'].dt.day
RT['hour'] = RT['ship_time'].dt.hour
RT['minute'] = RT['ship_time'].dt.minute
RT.head()
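
The `.dt` accessor only works when the column is already a datetime type. A small sketch, assuming the timestamps might arrive as text and using an invented `toy` table; the `ship_hour` column is my own hypothetical name for the hourly-truncated timestamp mentioned in the outlier check.

```python
import pandas as pd

# Hypothetical stand-in for RT; the two timestamps are invented examples.
toy = pd.DataFrame({
    "ship_time": ["2020-12-31 08:15:00", "2023-11-01 11:40:00"],
})

# Convert text to datetime64 so the .dt accessor is available.
toy["ship_time"] = pd.to_datetime(toy["ship_time"])

# Derived variables, same pattern as in the post.
toy["year"] = toy["ship_time"].dt.year
toy["month"] = toy["ship_time"].dt.month
toy["day"] = toy["ship_time"].dt.day
toy["hour"] = toy["ship_time"].dt.hour

# Truncate each timestamp down to the start of its hour.
toy["ship_hour"] = toy["ship_time"].dt.floor("h")

print(toy)
```

Keeping a single truncated timestamp column can be handier for grouping by hour than combining the separate year, month, day and hour columns.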