{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "ojlhGzdxhkwR"
},
"source": [
"# Pandas. Loading libraries"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"id": "xChor81V6mtD"
},
"outputs": [],
"source": [
"## Library overview and import"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Z22T_R766hsO"
},
"source": [
" - Pandas is a library for data processing and analysis. It is designed for data of various kinds: matrix data, panel data, and time series. It aims to be the most powerful and flexible open-source tool for data analysis."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pandas in ./venv/lib/python3.13/site-packages (2.2.3)\n",
"Requirement already satisfied: numpy>=1.26.0 in ./venv/lib/python3.13/site-packages (from pandas) (2.2.3)\n",
"Requirement already satisfied: python-dateutil>=2.8.2 in ./venv/lib/python3.13/site-packages (from pandas) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.13/site-packages (from pandas) (2025.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.13/site-packages (from pandas) (2025.1)\n",
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n"
]
}
],
"source": [
"!pip install pandas"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"id": "DqYWosnHhkwU"
},
"outputs": [],
"source": [
"import pandas as pd # Import the pandas module"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ZNKpaic1hkwb"
},
"source": [
"pandas has two main data structures:\n",
"- Series: a one-dimensional array with labeled indices (usually holding data of a single type)\n",
"- DataFrame: a two-dimensional, tabular structure that is easy to resize and can hold data of different types\n",
"\n",
"Both can be created by hand with the library's own constructors:\n",
"- pandas.Series(data=None, index=None, dtype=None)\n",
"- pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n",
"\n",
"- **data** - the data to store in the structure\n",
"- **index** - row labels\n",
"- **columns** - column names\n",
"- **dtype** - data type\n",
"\n",
"In practice you always pass data; the other parameters are optional\n"
]
},
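{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of building both structures by hand. The values, labels and variable names below are made up purely for illustration; only `pd.Series` and `pd.DataFrame` with their `data`, `index`, `columns` and `dtype` parameters come from the description above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical values, used only to illustrate the constructors\n",
"s = pd.Series(data=[6.5, 7.0, 8.2], index=['a', 'b', 'c'], dtype='float64')\n",
"print(s)\n",
"\n",
"# A small DataFrame: two rows, two named columns, custom row labels\n",
"small_df = pd.DataFrame(data=[[1, 'x'], [2, 'y']], index=['row1', 'row2'], columns=['num', 'letter'])\n",
"print(small_df)"
]
},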
{
"cell_type": "markdown",
"metadata": {
"id": "tMHOWBBWhkwf"
},
"source": [
"Of course, we can create DataFrames ourselves!\n",
"\n",
"For example, someone has found a piece of data for us and asks us to reproduce this dataset:\n",
"\n",
" \n",
"\n",
"Let's work out what is what here and write it down using a construct we already know: lists. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"id": "9yW-A-fRhkwi"
},
"outputs": [],
"source": [
"columns = ['country', 'province', 'region_1', 'region_2'] # A list that holds the column names\n",
"index = [0, 1, 10, 100] # A list that holds the row indices\n",
"\n",
"# A list with the data; each table row is a separate list (note: 'NaN' here is an ordinary string, not a real missing value)\n",
"data = [['Italy', 'Sicily & Sardinia', 'Etna', 'NaN'], \n",
" ['Portugal', 'Douro', 'NaN', 'NaN'],\n",
" ['US', 'California', 'Napa Valley', 'Napa'],\n",
" ['US', 'New York', 'Finger Lakes', 'Finger Lakes']]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6jUo7y0uhkwo"
},
"source": [
"Now let's assemble it into a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"id": "jMEdfOOdhkwp",
"outputId": "b5fae3e6-3e8d-4297-d468-0be74894b070",
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" country province region_1 region_2\n",
"0 Italy Sicily & Sardinia Etna NaN\n",
"1 Portugal Douro NaN NaN\n",
"10 US California Napa Valley Napa\n",
"100 US New York Finger Lakes Finger Lakes"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(data, columns=columns, index=index) # Create the DataFrame (passing the data, the column names and the row indices)\n",
"df # Display the DataFrame (in a notebook this renders better without print())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"id": "TIJhU5vEhkwv"
},
"outputs": [],
"source": [
"## Loading and saving data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CjIlX-Ar6vd7"
},
"source": [
"\n",
"- Functions of the form **`pd.read_<format>`** and DataFrame methods of the form **`to_<format>`**\n",
"read and write data, respectively. The full list can be found in the documentation:\n",
"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\n",
"\n",
"Let's learn to read data in csv (comma-separated values) format with the function:\n",
"\n",
"- pd.read_csv() : \n",
"\n",
"It has a great many arguments; the critically important ones are:\n",
" - **filepath_or_buffer** - a string with the name (path) of the file\n",
" - **sep** - the delimiter between values\n",
" - **header** - the number of the line in the file that holds the column names, or None if there is none\n",
" - **names** - a list of column names\n",
" - **index_col** - a column number, a list, or nothing: the column(s) to take the row labels from"
]
},
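{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how these arguments interact, here is a small sketch that reads CSV text from an in-memory buffer instead of a file. The text, the separator and the column names are hypothetical and chosen only for illustration; `io.StringIO` simply stands in for a file on disk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import io\n",
"import pandas as pd\n",
"\n",
"# Hypothetical CSV text: no header line, values separated by semicolons\n",
"csv_text = 'r1;1.5;10\\nr2;2.5;20\\n'\n",
"\n",
"demo = pd.read_csv(\n",
"    io.StringIO(csv_text),      # a buffer is accepted in place of a file path\n",
"    sep=';',                    # values are separated by semicolons\n",
"    header=None,                # the text has no line with column names\n",
"    names=['label', 'a', 'b'],  # so we supply the names ourselves\n",
"    index_col='label',          # use the 'label' column as row labels\n",
")\n",
"demo"
]
},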
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"id": "mWdKBTMNhkwx"
},
"outputs": [],
"source": [
"data = pd.read_csv('water_potability.csv') # Load water_potability.csv with read_csv and store the result in data"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "g8zunGkmhkw2"
},
"source": [
"**Let's see what was loaded**\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "slhGLHJNhkw4",
"outputId": "58af12df-d33f-4a2a-e3f6-5a763ba68831"
},
"outputs": [
{
"data": {
"text/plain": [
" ph Hardness Solids Chloramines Sulfate \\\n",
"3271 4.668102 193.681735 47580.991603 7.166639 359.948574 \n",
"3272 7.808856 193.553212 17329.802160 8.061362 NaN \n",
"3273 9.419510 175.762646 33155.578218 7.350233 NaN \n",
"3274 5.126763 230.603758 11983.869376 6.303357 NaN \n",
"3275 7.874671 195.102299 17404.177061 7.509306 NaN \n",
"\n",
" Conductivity Organic carbon Trihalomethanes Turbidity Potability \n",
"3271 526.424171 13.894419 66.687695 4.435821 1 \n",
"3272 392.449580 19.903225 NaN 2.798243 1 \n",
"3273 432.044783 11.039070 69.845400 3.298875 1 \n",
"3274 402.883113 11.168946 77.488213 4.708658 1 \n",
"3275 327.459760 16.140368 78.698446 2.309149 1 "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.tail() # The tail method shows the last 5 rows of the DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TAEcUwXohkw9"
},
"source": [
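"Something looks off with the first column, so let's adjust it a little. Keep in mind that with `index_col=0` the first column of the file, `ph`, becomes the row index and is no longer counted among the data columns."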
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"id": "UQ_ne0wIhkw-"
},
"outputs": [],
"source": [
"data = pd.read_csv('water_potability.csv', index_col=0) # index_col sets the column to use as the DataFrame index (here column 0, i.e. ph)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 924
},
"id": "u5iBpJ0jhkxC",
"outputId": "b8c9ab01-2747-467a-e833-870c9d83d11b",
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
"ph \n",
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
"3.716080 129.422921 18630.057858 6.635246 NaN 592.885359 \n",
"8.099124 224.236259 19909.541732 9.275884 NaN 418.606213 \n",
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
"5.584087 188.313324 28748.687739 7.544869 326.678363 280.467916 \n",
"10.223862 248.071735 28749.716544 7.513408 393.663396 283.651634 \n",
"8.635849 203.361523 13672.091764 4.563009 303.309771 474.607645 \n",
"NaN 118.988579 14285.583854 7.804174 268.646941 389.375566 \n",
"11.180284 227.231469 25484.508491 9.077200 404.041635 563.885481 \n",
"7.360640 165.520797 32452.614409 7.550701 326.624353 425.383419 \n",
"7.974522 218.693300 18767.656682 8.110385 NaN 364.098230 \n",
"7.119824 156.704993 18730.813653 3.606036 282.344050 347.715027 \n",
"NaN 150.174923 27331.361962 6.838223 299.415781 379.761835 \n",
"7.496232 205.344982 28388.004887 5.072558 NaN 444.645352 \n",
"6.347272 186.732881 41065.234765 9.629596 364.487687 516.743282 \n",
"7.051786 211.049406 30980.600787 10.094796 NaN 315.141267 \n",
"9.181560 273.813807 24041.326280 6.904990 398.350517 477.974642 \n",
"8.975464 279.357167 19460.398131 6.204321 NaN 431.443990 \n",
"7.371050 214.496610 25630.320037 4.432669 335.754439 469.914551 \n",
"\n",
" Organic carbon Trihalomethanes Turbidity Potability \n",
"ph \n",
"NaN 10.379783 86.990970 2.963135 0 \n",
"3.716080 15.180013 56.329076 4.500656 0 \n",
"8.099124 16.868637 66.420093 3.055934 0 \n",
"8.316766 18.436524 100.341674 4.628771 0 \n",
"9.092223 11.558279 31.997993 4.075075 0 \n",
"5.584087 8.399735 54.917862 2.559708 0 \n",
"10.223862 13.789695 84.603556 2.672989 0 \n",
"8.635849 12.363817 62.798309 4.401425 0 \n",
"NaN 12.706049 53.928846 3.595017 0 \n",
"11.180284 17.927806 71.976601 4.370562 0 \n",
"7.360640 15.586810 78.740016 3.662292 0 \n",
"7.974522 14.525746 76.485911 4.011718 0 \n",
"7.119824 15.929536 79.500778 3.445756 0 \n",
"NaN 19.370807 76.509996 4.413974 0 \n",
"7.496232 13.228311 70.300213 4.777382 0 \n",
"6.347272 11.539781 75.071617 4.376348 0 \n",
"7.051786 20.397022 56.651604 4.268429 0 \n",
"9.181560 13.387341 71.457362 4.503661 0 \n",
"8.975464 12.888759 63.821237 2.436086 0 \n",
"7.371050 12.509164 62.797277 2.560299 0 "
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(20) # The head method shows the first 20 rows of the DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UnU4xLLzhkxG"
},
"source": [
"**Information about the loaded data**:\n",
"\n",
"- Count how many records there are\n",
"- Look at the data types\n",
"- Check whether there are missing values"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "Z-MKWiELhkxP",
"outputId": "68ca424e-83eb-4d17-d779-1ea7f1a6b4ae"
},
"outputs": [
{
"data": {
"text/plain": [
"(3276, 9)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.shape # The .shape attribute (as with NumPy arrays) gives the dimensions of the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "SEr52zb4hkxT",
"outputId": "d72fe356-c89d-4d61-c62b-a64359f748d2"
},
"outputs": [
{
"data": {
"text/plain": [
"29484"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.size # The .size attribute (as with NumPy arrays) gives the number of elements in the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "sw8ATDX1hkxJ",
"outputId": "2cfce98c-00bf-4093-8bf5-6573e0cd909f",
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 3276\n",
"Solids 3276\n",
"Chloramines 3276\n",
"Sulfate 2495\n",
"Conductivity 3276\n",
"Organic carbon 3276\n",
"Trihalomethanes 3114\n",
"Turbidity 3276\n",
"Potability 3276\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.count() # The count method returns the number of non-missing entries in each column"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "OwVkE1MKX6mW",
"outputId": "9a9f1142-de9c-4ffa-e051-bb58af728151"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 100\n",
"Solids 100\n",
"Chloramines 100\n",
"Sulfate 74\n",
"Conductivity 100\n",
"Organic carbon 100\n",
"Trihalomethanes 98\n",
"Turbidity 100\n",
"Potability 100\n",
"dtype: int64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(100).count() # Apply .count() to the first hundred records of the DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5AjpVmYFhkxX"
},
"source": [
"- The info() method also shows the data type of each column"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 306
},
"id": "G8RHx3kvhkxZ",
"outputId": "cf46dd23-3acf-4d2e-c8c4-046fa7b1f8d6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Index: 3276 entries, nan to 7.87467135779128\n",
"Data columns (total 9 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 Hardness 3276 non-null float64\n",
" 1 Solids 3276 non-null float64\n",
" 2 Chloramines 3276 non-null float64\n",
" 3 Sulfate 2495 non-null float64\n",
" 4 Conductivity 3276 non-null float64\n",
" 5 Organic carbon 3276 non-null float64\n",
" 6 Trihalomethanes 3114 non-null float64\n",
" 7 Turbidity 3276 non-null float64\n",
" 8 Potability 3276 non-null int64 \n",
"dtypes: float64(8), int64(1)\n",
"memory usage: 255.9 KB\n"
]
}
],
"source": [
"data.info() # The .info() method shows the type of each column and the memory used"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "cMRzYwQdhkxd",
"outputId": "4afa01d4-ec65-452a-fc89-c503253c1efa"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness float64\n",
"Solids float64\n",
"Chloramines float64\n",
"Sulfate float64\n",
"Conductivity float64\n",
"Organic carbon float64\n",
"Trihalomethanes float64\n",
"Turbidity float64\n",
"Potability int64\n",
"dtype: object"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dtypes # The .dtypes attribute shows just the type of each column"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "S3TniwKUhkxh"
},
"source": [
"Let's start checking for missing values! \n",
"\n",
"- .isnull() returns a table where False means the cell is filled and True means the cell is empty :( isna() is the same method under another name"
]
},
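{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small sketch of what such a mask looks like, using a tiny made-up frame (the column names and values are hypothetical). It also compares the results of `isnull()` and `isna()` on the same data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame with one missing number and one missing string\n",
"sample = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', None]})\n",
"\n",
"mask = sample.isnull()  # True where a cell is missing, False where it is filled\n",
"print(mask)\n",
"\n",
"# Compare with the mask produced by isna()\n",
"print(mask.equals(sample.isna()))"
]
},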
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "uq1iywLbYsxS",
"outputId": "fcc31e0d-6e49-4967-ac34-d03865227b1f"
},
"outputs": [
{
"data": {
"text/plain": [
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
"ph \n",
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
"3.716080 129.422921 18630.057858 6.635246 NaN 592.885359 \n",
"8.099124 224.236259 19909.541732 9.275884 NaN 418.606213 \n",
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
"\n",
" Organic carbon Trihalomethanes Turbidity Potability \n",
"ph \n",
"NaN 10.379783 86.990970 2.963135 0 \n",
"3.716080 15.180013 56.329076 4.500656 0 \n",
"8.099124 16.868637 66.420093 3.055934 0 \n",
"8.316766 18.436524 100.341674 4.628771 0 \n",
"9.092223 11.558279 31.997993 4.075075 0 "
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head() # Display the first 5 rows of the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "_oxBR6lAzfgu",
"outputId": "1b2f600d-ea50-4289-cfd6-976ea1526877"
},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "ZjTn7cM5zyta",
"outputId": "b3889e78-08c3-4bdf-c4ec-260d35eed9ea"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 0\n",
"Solids 0\n",
"Chloramines 0\n",
"Sulfate 781\n",
"Conductivity 0\n",
"Organic carbon 0\n",
"Trihalomethanes 162\n",
"Turbidity 0\n",
"Potability 0\n",
"dtype: int64"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isna().sum() # Count the missing values in each column with the .sum() method"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "D7aHOhOGY5pe",
"outputId": "ce904bd7-3087-40a2-824e-7f7047f358db"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 0\n",
"Solids 0\n",
"Chloramines 0\n",
"Sulfate 26\n",
"Conductivity 0\n",
"Organic carbon 0\n",
"Trihalomethanes 2\n",
"Turbidity 0\n",
"Potability 0\n",
"dtype: int64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.head(100).isna().sum() # Count the missing values in each column for the first hundred records"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "ZaiGPw-KY9eQ",
"outputId": "3ffbc798-c5bc-4970-c6c3-1608da30afe4"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 0\n",
"Solids 0\n",
"Chloramines 0\n",
"Sulfate 26\n",
"Conductivity 0\n",
"Organic carbon 0\n",
"Trihalomethanes 2\n",
"Turbidity 0\n",
"Potability 0\n",
"dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.isna().head(100).sum() # Count the missing values in each column for the first hundred records (equivalent to the previous cell)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "NHeX2czDhkxi",
"outputId": "9995758b-2f88-47ca-ab63-37cb364875dd"
},
"outputs": [
{
"data": {
"text/plain": [
"Hardness 0.000000\n",
"Solids 0.000000\n",
"Chloramines 0.000000\n",
"Sulfate 0.238400\n",
"Conductivity 0.000000\n",
"Organic carbon 0.000000\n",
"Trihalomethanes 0.049451\n",
"Turbidity 0.000000\n",
"Potability 0.000000\n",
"dtype: float64"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"round(data.isna().sum() / data.shape[0], 6) # Compute the fraction of missing values in each column (relative to the number of rows)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "tvAQTignhkxo",
"outputId": "7d223855-7d97-4529-bb74-572e21ed89a2"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"943\n"
]
}
],
"source": [
"proc = data.isna().sum().sum() # Count the total number of missing values (across all columns) in the DataFrame\n",
"print(proc) # Print the number of missing values"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
},
"id": "EOZz-GAPhkxr",
"outputId": "b7997cfd-a292-48ff-d431-ff425d210a7c"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"3.2%\n"
]
}
],
"source": [
"# Express the result as a percentage of all cells\n",
"proc = data.isna().sum().sum() / data.size\n",
"print(round(100*proc,1), '%', sep='')"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "OuW1gRtlhkxz"
},
"outputs": [],
"source": [
"### How to assess missing values visually"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E8w2yGJ1hkx6"
},
"source": [
"What should we do with them?\n",
"\n",
"There are not many options: \n",
"\n",
"1) Drop them: \n",
"- dropna(axis=0, how='any'): axis=0 drops rows, axis=1 drops columns; how='any' drops a row or column if at least one of its cells is empty; how='all' drops it only if all of its cells are empty\n",
"\n",
"2) Fill in the values ourselves:\n",
"- fillna() - how to fill the gaps is an art of its own. "
]
},
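{
"cell_type": "markdown",
"metadata": {},
"source": [
"A short sketch of both options on a tiny made-up frame (the column names and values are hypothetical). Filling with the column median is only one possible strategy, picked here for illustration; it is not a recommendation for this dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical frame: the second row is completely empty\n",
"toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})\n",
"\n",
"rows_any = toy.dropna(axis=0, how='any')  # keep only rows without any missing value\n",
"rows_all = toy.dropna(axis=0, how='all')  # drop only rows where every value is missing\n",
"cols_any = toy.dropna(axis=1, how='any')  # drop every column that contains a missing value\n",
"\n",
"filled = toy.fillna(toy.median())         # replace missing values with the column median\n",
"\n",
"print(rows_any, rows_all, cols_any, filled, sep='\\n\\n')"
]
},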
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: matplotlib in ./venv/lib/python3.13/site-packages (3.10.0)\n",
"Requirement already satisfied: contourpy>=1.0.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (1.3.1)\n",
"Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.13/site-packages (from matplotlib) (0.12.1)\n",
"Requirement already satisfied: fonttools>=4.22.0 in ./venv/lib/python3.13/site-packages (from matplotlib) (4.56.0)\n",
"Requirement already satisfied: kiwisolver>=1.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (1.4.8)\n",
"Requirement already satisfied: numpy>=1.23 in ./venv/lib/python3.13/site-packages (from matplotlib) (2.2.3)\n",
"Requirement already satisfied: packaging>=20.0 in ./venv/lib/python3.13/site-packages (from matplotlib) (24.2)\n",
"Requirement already satisfied: pillow>=8 in ./venv/lib/python3.13/site-packages (from matplotlib) (11.1.0)\n",
"Requirement already satisfied: pyparsing>=2.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (3.2.1)\n",
"Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.13/site-packages (from matplotlib) (2.9.0.post0)\n",
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)\n",
"Requirement already satisfied: seaborn in ./venv/lib/python3.13/site-packages (0.13.2)\n",
"Requirement already satisfied: numpy!=1.24.0,>=1.20 in ./venv/lib/python3.13/site-packages (from seaborn) (2.2.3)\n",
"Requirement already satisfied: pandas>=1.2 in ./venv/lib/python3.13/site-packages (from seaborn) (2.2.3)\n",
"Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in ./venv/lib/python3.13/site-packages (from seaborn) (3.10.0)\n",
"Requirement already satisfied: contourpy>=1.0.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.1)\n",
"Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n",
"Requirement already satisfied: fonttools>=4.22.0 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.56.0)\n",
"Requirement already satisfied: kiwisolver>=1.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.8)\n",
"Requirement already satisfied: packaging>=20.0 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.2)\n",
"Requirement already satisfied: pillow>=8 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0)\n",
"Requirement already satisfied: pyparsing>=2.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.1)\n",
"Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)\n",
"Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.13/site-packages (from pandas>=1.2->seaborn) (2025.1)\n",
"Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.13/site-packages (from pandas>=1.2->seaborn) (2025.1)\n",
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)\n"
]
}
],
"source": [
"!pip install matplotlib\n",
"!pip install seaborn"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 735
},
"id": "ToPE3VkWhkx1",
"outputId": "6a5c7213-1a94-4823-e5cf-3d19468a890d"
},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt # Import the matplotlib.pyplot module\n",
"import seaborn as sns # Import the seaborn module\n",
"%matplotlib inline\n",
"\n",
"fig, ax = plt.subplots(figsize=(20,12)) # Create the figure and axes for the plot\n",
"sns_heatmap = sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis') # Visualize the missing values\n",
"plt.show() # Display the plot"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 495
},
"id": "uZgh1E3nhkx6",
"outputId": "3a6709c7-3fbc-4260-bcd9-69c47228b013"
},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" \n",
" Hardness \n",
" Solids \n",
" Chloramines \n",
" Sulfate \n",
" Conductivity \n",
" Organic carbon \n",
" Trihalomethanes \n",
" Turbidity \n",
" Potability \n",
" \n",
" \n",
" ph \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" \n",
" NaN \n",
" 204.890455 \n",
" 20791.318981 \n",
" 7.300212 \n",
" 368.516441 \n",
" 564.308654 \n",
" 10.379783 \n",
" 86.99097 \n",
" 2.963135 \n",
" 0 \n",
" \n",
" \n",
" 3.716080 \n",
" 129.422921 \n",
" 18630.057858 \n",
" 6.635246 \n",
" Python \n",
" 592.885359 \n",
" 15.180013 \n",
" 56.329076 \n",
" 4.500656 \n",
" 0 \n",
" \n",
" \n",
" 8.099124 \n",
" 224.236259 \n",
" 19909.541732 \n",
" 9.275884 \n",
" Python \n",
" 418.606213 \n",
" 16.868637 \n",
" 66.420093 \n",
" 3.055934 \n",
" 0 \n",
" \n",
" \n",
" 8.316766 \n",
" 214.373394 \n",
" 22018.417441 \n",
" 8.059332 \n",
" 356.886136 \n",
" 363.266516 \n",
" 18.436524 \n",
" 100.341674 \n",
" 4.628771 \n",
" 0 \n",
" \n",
" \n",
" 9.092223 \n",
" 181.101509 \n",
" 17978.986339 \n",
" 6.546600 \n",
" 310.135738 \n",
" 398.410813 \n",
" 11.558279 \n",
" 31.997993 \n",
" 4.075075 \n",
" 0 \n",
" \n",
" \n",
" 5.584087 \n",
" 188.313324 \n",
" 28748.687739 \n",
" 7.544869 \n",
" 326.678363 \n",
" 280.467916 \n",
" 8.399735 \n",
" 54.917862 \n",
" 2.559708 \n",
" 0 \n",
" \n",
" \n",
" 10.223862 \n",
" 248.071735 \n",
" 28749.716544 \n",
" 7.513408 \n",
" 393.663396 \n",
" 283.651634 \n",
" 13.789695 \n",
" 84.603556 \n",
" 2.672989 \n",
" 0 \n",
" \n",
" \n",
" 8.635849 \n",
" 203.361523 \n",
" 13672.091764 \n",
" 4.563009 \n",
" 303.309771 \n",
" 474.607645 \n",
" 12.363817 \n",
" 62.798309 \n",
" 4.401425 \n",
" 0 \n",
" \n",
" \n",
" NaN \n",
" 118.988579 \n",
" 14285.583854 \n",
" 7.804174 \n",
" 268.646941 \n",
" 389.375566 \n",
" 12.706049 \n",
" 53.928846 \n",
" 3.595017 \n",
" 0 \n",
" \n",
" \n",
" 11.180284 \n",
" 227.231469 \n",
" 25484.508491 \n",
" 9.077200 \n",
" 404.041635 \n",
" 563.885481 \n",
" 17.927806 \n",
" 71.976601 \n",
" 4.370562 \n",
" 0 \n",
" \n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
"ph \n",
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
"3.716080 129.422921 18630.057858 6.635246 Python 592.885359 \n",
"8.099124 224.236259 19909.541732 9.275884 Python 418.606213 \n",
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
"5.584087 188.313324 28748.687739 7.544869 326.678363 280.467916 \n",
"10.223862 248.071735 28749.716544 7.513408 393.663396 283.651634 \n",
"8.635849 203.361523 13672.091764 4.563009 303.309771 474.607645 \n",
"NaN 118.988579 14285.583854 7.804174 268.646941 389.375566 \n",
"11.180284 227.231469 25484.508491 9.077200 404.041635 563.885481 \n",
"\n",
" Organic carbon Trihalomethanes Turbidity Potability \n",
"ph \n",
"NaN 10.379783 86.99097 2.963135 0 \n",
"3.716080 15.180013 56.329076 4.500656 0 \n",
"8.099124 16.868637 66.420093 3.055934 0 \n",
"8.316766 18.436524 100.341674 4.628771 0 \n",
"9.092223 11.558279 31.997993 4.075075 0 \n",
"5.584087 8.399735 54.917862 2.559708 0 \n",
"10.223862 13.789695 84.603556 2.672989 0 \n",
"8.635849 12.363817 62.798309 4.401425 0 \n",
"NaN 12.706049 53.928846 3.595017 0 \n",
"11.180284 17.927806 71.976601 4.370562 0 "
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.fillna(\"Python\").head(10) # С помощью метода .fillna() заменяем все пропуски словом Python"
]
},
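{
"cell_type": "markdown",
"metadata": {},
"source": [
"Filling every gap with one string turns numeric columns into mixed-type columns. A minimal sketch of a per-column alternative, assuming a small toy frame `df_demo` (hypothetical, not the notebook's dataset):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one gap in each column\n",
"df_demo = pd.DataFrame({'ph': [7.0, np.nan, 8.1], 'label': ['a', None, 'c']})\n",
"\n",
"# A dict maps each column name to its own fill value\n",
"filled = df_demo.fillna({'ph': df_demo['ph'].median(), 'label': 'unknown'})\n",
"print(filled)\n",
"```\n",
"\n",
"Like the call above, this returns a new DataFrame and leaves `df_demo` unchanged."
]
},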
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"id": "mBsHwML6hkyF"
},
"outputs": [
{
"ename": "SyntaxError",
"evalue": "invalid syntax (192570114.py, line 3)",
"output_type": "error",
"traceback": [
"\u001b[0;36m Cell \u001b[0;32mIn[28], line 3\u001b[0;36m\u001b[0m\n\u001b[0;31m Теперь посмотрим, а что содержательно у нас есть на руках.\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
]
}
],
"source": [
"### Описательные статистики\n",
"\n",
"Теперь посмотрим, а что содержательно у нас есть на руках. \n",
"\n",
"Глазами просматривать не будем, а попросим посчитать основные описательные статистики. Причем сразу все.\n",
"\n",
"- describe() - метод, который возвращает табличку с описательными статистиками. В таком виде считает все для числовых столбцов"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"id": "ZWz60or1hkyG",
"outputId": "134781c0-28b6-4137-85d1-8376216860c6",
"scrolled": true
},
"outputs": [],
"source": [
"data.describe() # Отобразим описательные статистики нашего датафрейма (только числовые данные)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "0aeamrWMhkyK"
},
"source": [
"Немножко магии, и для нечисловых данные тоже будут свои описательные статистики. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"id": "jKTF-2BHhkyK",
"outputId": "244a91a6-e8b4-42a9-d464-3728bc5c3dc5"
},
"outputs": [],
"source": [
"data.describe(include=['O']) # # Отобразим описательные статистики нашего датафрейма ('O' - в том числе и строковые)"
]
},
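{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two `describe()` calls above report different statistics. A minimal sketch of the difference, assuming a hypothetical toy frame `df_demo` with one numeric and one string column:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; the explicit object dtype keeps the example\n",
"# independent of the default string dtype of the installed pandas version\n",
"df_demo = pd.DataFrame({\n",
"    'price': [10.0, 20.0, 30.0],\n",
"    'country': pd.Series(['US', 'France', 'US'], dtype=object),\n",
"})\n",
"\n",
"numeric_stats = df_demo.describe()              # count, mean, std, min, quartiles, max\n",
"object_stats = df_demo.describe(include=['O'])  # count, unique, top, freq\n",
"\n",
"print(numeric_stats)\n",
"print(object_stats)\n",
"```\n",
"\n",
"For object columns, `top` is the most frequent value and `freq` is how many times it occurs."
]
},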
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-IbwBRL_hkyO"
},
"outputs": [],
"source": [
"### Срезы данных\n",
"\n",
"Допустим, нам не нужен датасет, а только определенные столбцы или строки или столбцы и строки. \n",
"\n",
"\n",
"Как делать?\n",
"Помним, что:\n",
"- у столбцов есть названия\n",
"- у строк есть названия\n",
"- если нет названий, то они пронумерованы с нуля\n",
"\n",
"Основываясь на этой идее, мы начнем отбирать данные."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 80
},
"id": "4uT9dn4vhkyO",
"outputId": "f57acc81-2f88-41fd-ada0-028d80ed3ca7"
},
"outputs": [],
"source": [
"data.head(1) # Отобразим первую строчку датафрейма"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "5b2HPHJwhkyT"
},
"source": [
"#### Отбираем по столбцам. Версия 1. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 221
},
"id": "ocn9YgmnhkyZ",
"outputId": "438dc10b-ff81-42d3-deea-eefa140305d5"
},
"outputs": [],
"source": [
"array = data['price'] # Отобразим столбец price\n",
"array"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 221
},
"id": "tBeyZQLPIMIJ",
"outputId": "95d6ac17-ddde-46d9-e9a8-6e265eb12085"
},
"outputs": [],
"source": [
"data.price"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"id": "YVzV30CQhkyV",
"outputId": "26839d1c-a250-4ec0-a388-50f50e45af89"
},
"outputs": [],
"source": [
"data.price.head() # Отобразим столбец price (альтернативные вариант)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "IDhUYDK5hkye",
"outputId": "536a6bdf-0016-4faf-e984-f1bfa8d356ad"
},
"outputs": [],
"source": [
"new_df = data[['price','country']].head() # Отобразим столбцы 'price' и 'country'\n",
"new_df"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Pw7bsVKPhkyg"
},
"source": [
"#### Отбираем по строкам. Версия 1. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 359
},
"id": "-3ntG2CzhDyV",
"outputId": "799530a2-5339-4ddf-8cbc-ef187d8a148f"
},
"outputs": [],
"source": [
"data[10:20] # Отобразим с 10й по 20ю строки датафрейма"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"id": "DaW5dRU7hkyh",
"outputId": "9665e0f7-f195-4217-b8a1-674700cdc917"
},
"outputs": [],
"source": [
"data[10:20:3] # Отобразим с 10й по 20ю строки датафрейма с шагом 2"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 359
},
"id": "zXqL-lBEhGkG",
"outputId": "c741c699-f40d-4417-9bb4-bc849da4f19b"
},
"outputs": [],
"source": [
"data[::5].head(10) # Отобразим каждую 5ю строку датафрейма"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "pV0kczWfhkyk"
},
"source": [
"#### Отбор по столбцам. Версия 2. Все еще по названиям "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 495
},
"id": "blyn4oRnJOlm",
"outputId": "cc0258b0-2735-4d7f-cb92-ac9b23b2a83e"
},
"outputs": [],
"source": [
"data.head(10)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 173
},
"id": "LfRYRSsohkyk",
"outputId": "c7f8b402-bba9-4da4-eebe-6402c24030c2"
},
"outputs": [],
"source": [
"data.loc[4:7, ['price', 'points']] # Отобразим два столбца 'price' и 'points', и в них строки с индексами с 4 по 7"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CK4-ntzDhkyo"
},
"source": [
"#### Отбор по строкам. Версия 2. Все еще по названиям "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "eqAQs0YIhkyq",
"outputId": "d0bab15c-91be-41a2-ef6f-82f2faf5e702"
},
"outputs": [],
"source": [
"data.loc[:5,:] # Отобразим строки с индексом от 0 до 5 (то же, что и data.loc[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NGugpgJfhkyv"
},
"source": [
"#### Отбор по строчкам и столбцам. Версия 3. По номеру строк и столбцов"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "CG0aSTW8hkyv",
"outputId": "af2503a1-524e-431c-f61e-5d0a5590a9b1"
},
"outputs": [],
"source": [
"data.iloc[::5, [1,3]].head() # Отобразим каждую 5 строку и 1 и 3 столбец"
]
},
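{
"cell_type": "markdown",
"metadata": {},
"source": [
"`.loc` and `.iloc` treat slice boundaries differently, which is easy to miss when the index happens to start at zero. A minimal sketch, assuming a hypothetical toy frame `df_demo` whose index labels start at 10:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data; the index labels deliberately differ from the positions\n",
"df_demo = pd.DataFrame(\n",
"    {'price': [10, 20, 30, 40, 50], 'points': [80, 85, 90, 95, 100]},\n",
"    index=[10, 11, 12, 13, 14],\n",
")\n",
"\n",
"by_label = df_demo.loc[11:13, ['price']]  # labels 11, 12, 13: the end label is included\n",
"by_position = df_demo.iloc[1:3, [0]]      # positions 1 and 2: the end position is excluded\n",
"\n",
"print(len(by_label), len(by_position))\n",
"```\n",
"\n",
"Both selections start at the same row, but the label-based slice returns one row more."
]
},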
{
"cell_type": "markdown",
"metadata": {
"id": "sIao6149hkyy"
},
"source": [
"#### Отбор с условиями\n",
"\n",
"Так, а если мне нужны вина дороже $15 долларов? Как быть?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "YsIrLdRnhkyy"
},
"outputs": [],
"source": [
"#задаем маску\n",
"mask = data['price'] > 15"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 119
},
"id": "nfVB6YBbhky0",
"outputId": "ebfb750f-f4f2-43a3-c62b-4e39273046de"
},
"outputs": [],
"source": [
"mask.head() # Отобразим маску"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 359
},
"id": "FADnit0Ghky2",
"outputId": "16cf6881-4c3c-408e-f661-4141182143d5"
},
"outputs": [],
"source": [
"#и отбираем данные\n",
"temp = data[mask] # Выбираем данные из датафрейма в соответствии с маской и записываем их в новый даатафрейм temp\n",
"temp # Отображаем temp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "EHT3VYtNhky4",
"outputId": "84091e0c-6995-4d73-b63a-c69aeec6ecd4"
},
"outputs": [],
"source": [
"data[data.price>300].head()# Альтернативный вариант"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 616
},
"id": "Moi8GwyVhky8",
"outputId": "f4020760-204a-42f0-9df3-28242583e16e"
},
"outputs": [],
"source": [
"data[(data.price > 200) & ((data.country == 'US') | (data.country == 'France'))].head(15) # Составное условие"
]
},
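{
"cell_type": "markdown",
"metadata": {},
"source": [
"When one column is compared against several values, the chain of `|` conditions above can be replaced with `.isin()`. A minimal sketch on a hypothetical toy frame `df_demo` (not the notebook's dataset):\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"df_demo = pd.DataFrame({\n",
"    'price': [250, 90, 300, 500],\n",
"    'country': ['US', 'US', 'Italy', 'France'],\n",
"})\n",
"\n",
"# Each comparison needs its own parentheses: & and | bind tighter than > and ==\n",
"mask_or = (df_demo['price'] > 200) & (\n",
"    (df_demo['country'] == 'US') | (df_demo['country'] == 'France')\n",
")\n",
"\n",
"# Same condition, with the country test written as a membership check\n",
"mask_isin = (df_demo['price'] > 200) & df_demo['country'].isin(['US', 'France'])\n",
"\n",
"selected = df_demo[mask_isin]\n",
"print(selected)\n",
"```\n",
"\n",
"Both masks select the same rows; the `.isin()` form stays readable as the list of countries grows."
]
},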
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Da0WfR_5hky_"
},
"outputs": [],
"source": [
"### Мультииндексация"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "gIqtjR45hky_",
"outputId": "48a1fdf1-4c7f-4c3c-8e76-b9e773cb15c7"
},
"outputs": [],
"source": [
"data.head() # Отобразим наш датафрем"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 1000
},
"id": "JizHrXguhkzC",
"outputId": "4b793963-14d9-4f87-bb3f-b26bb14d2d8e"
},
"outputs": [],
"source": [
"data_ = data.groupby(['country', 'price']).count() # Сграппируем данные сначала по странам, а затем по price\n",
"data_.head(100) # Отобразим первые 50 строк нового датафрейма"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 450
},
"id": "hE9aG1imhkzG",
"outputId": "b2abe0e9-93f7-4044-e88c-844918452e52"
},
"outputs": [],
"source": [
"data_.loc['US'] # Отобразим все данные для 'US'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 170
},
"id": "ZteYvkfehkzL",
"outputId": "938087ed-492f-4ddf-c3f1-16ca3b38f331"
},
"outputs": [],
"source": [
"data_.loc['US', 100] # Отобразим данные для 'US', у кого 100 points"
]
},
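{
"cell_type": "markdown",
"metadata": {},
"source": [
"Grouping by two columns produces a two-level row index (a MultiIndex). A minimal sketch of how `.loc` addresses each level, assuming a hypothetical toy frame `df_demo`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"df_demo = pd.DataFrame({\n",
"    'country': ['US', 'US', 'US', 'France'],\n",
"    'price': [100, 100, 20, 100],\n",
"    'points': [90, 95, 85, 92],\n",
"})\n",
"\n",
"# The group keys (country, price) become the two levels of the row index\n",
"grouped = df_demo.groupby(['country', 'price']).count()\n",
"\n",
"us_rows = grouped.loc['US']           # first level only: a DataFrame indexed by price\n",
"us_100 = grouped.loc[('US', 100), :]  # both levels as a tuple: a single row (Series)\n",
"\n",
"print(us_rows)\n",
"print(us_100)\n",
"```\n",
"\n",
"Writing the row key as a tuple followed by `:` makes it explicit that both values address index levels rather than a row and a column."
]
},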
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6o6JX1OnhkzP"
},
"outputs": [],
"source": [
"#### Как изменять значения в табличке"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "sVulAc0HsaLu",
"outputId": "6fca9bb4-357b-4398-e509-6191a1ee9e74"
},
"outputs": [],
"source": [
"data_backup = data.copy() # Создаем копию нашего датафрейма и записываем в переменную data_backup\n",
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 297
},
"id": "eMhSX4jqhkzP",
"outputId": "5ee71b23-0935-46c4-925f-912e28eeda25"
},
"outputs": [],
"source": [
"data.iloc[0,1] = 'kotiki' # Вставляем новое значение в 0 строку и 1 стоблец\n",
"data.iloc[2,2] = '129' # Вставляем новое значение в 2 строку и 2 стоблец\n",
"data.iloc[3:5,2:5] = 'new' # Вставляем новое значение с 3 по 5 строку и со 2го по 5ый стоблец\n",
"data.head(8)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "RZvBCkMCsiOT"
},
"outputs": [],
"source": [
"data = data_backup.copy() # Восстанавливаем данные из копии"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "SNliNx3TPCTX",
"outputId": "2f0ccacf-df6b-4499-c844-12be866957dc"
},
"outputs": [],
"source": [
"data.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "qXlU-wqyhkzT",
"outputId": "b65e6f8a-0562-43d7-b09a-ff2c8f9eab35"
},
"outputs": [],
"source": [
"data.loc[data.country == 'US', 'region_2'] = 'Syberia'\n",
"data.loc[data.price > 100, 'points'] = 200\n",
"data.loc[data.price > 100, 'price'] = 1000\n",
"data.head()"
]
},
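{
"cell_type": "markdown",
"metadata": {},
"source": [
"When several assignments share one condition, and one of them changes the column the condition depends on, it is safer to compute the mask once. A minimal sketch, assuming a hypothetical toy frame `df_demo`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data\n",
"df_demo = pd.DataFrame({'price': [50, 150, 300], 'points': [85, 92, 97]})\n",
"\n",
"# Compute the mask once, before price itself is overwritten\n",
"expensive = df_demo['price'] > 100\n",
"\n",
"df_demo.loc[expensive, 'points'] = 200\n",
"df_demo.loc[expensive, 'price'] = 1000\n",
"\n",
"print(df_demo)\n",
"```\n",
"\n",
"With a stored mask the result does not depend on the order of the two assignments."
]
},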
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6PumqSIU7Q7U"
},
"outputs": [],
"source": [
"## Перевод в Numpy\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"id": "Obs9TzQ9E8ss",
"outputId": "7ea94ae7-0ee5-4773-be88-14ab92522349"
},
"outputs": [],
"source": [
"np_data = data.values # Получаем данные из датафрейма и записываем их в переменную np_data\n",
"print(np_data.shape) # Выводим размерность np_data\n",
"np_data.dtype"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 105
},
"id": "y1n1cNpdrFqQ",
"outputId": "f6f3c2df-9732-40ed-8569-5d53223b9a24"
},
"outputs": [],
"source": [
"print(np_data[0]) # Выводим 0ой элемент из массива"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 717
},
"id": "UsAcU8mBnWwn",
"outputId": "4eec37c8-c418-4140-de98-c4ec5b6bbe8b"
},
"outputs": [],
"source": [
"# Выведем первые 10 элементов из np_data\n",
"for i in range(10):\n",
" print(np_data[i])"
]
},
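{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dtype of the resulting array depends on the columns that go into it. A minimal sketch, assuming a hypothetical toy frame `df_demo` and using `.to_numpy()`, the method form of `.values`:\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data with one numeric and one string column\n",
"df_demo = pd.DataFrame({'price': [10.0, 20.0], 'country': ['US', 'France']})\n",
"\n",
"mixed = df_demo.to_numpy()               # mixed column types fall back to dtype object\n",
"numeric = df_demo[['price']].to_numpy()  # numeric columns only keep a numeric dtype\n",
"\n",
"print(mixed.dtype, numeric.dtype, mixed.shape)\n",
"```\n",
"\n",
"Selecting the numeric columns before converting keeps the array usable for fast numerical work."
]
},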
{
"cell_type": "markdown",
"metadata": {
"id": "8F6G2AUnKFCA"
},
"source": [
"# **Глоссарий**\n",
"\n",
"\n",
"pd.DataFrame(данные, columns = [колонки, если есть], index = [индексы ,если есть]) - создать датафрейм\n",
"\n",
"pd.read_csv(полный адрес расположения файла) - открыть .csv файл\n",
"\n",
"------------\n",
"\n",
".head() - посмотреть верхушку датафрейма (первые n строк)\n",
"\n",
".tail() - посмотреть конец датафрейма (последние n строк)\n",
"\n",
".columns - список колонок датафрейма\n",
"\n",
".values - вывести массив всех значений датафрейма\n",
"\n",
".index - список индексов датафрейма\n",
"\n",
".tolist() - перевести в список\n",
"\n",
".count() - посчитать количество определенных величин во фрейме\n",
"\n",
".describe() - посмотреть основные статистические характеристики фрейма\n",
"\n",
".shape - форма фрейма (строки, колонки)\n",
"\n",
".size - размер фрейма строки*колонки\n",
"\n",
".info() - информация о данных каждой колонки\n",
"\n",
".dtypes - тип данных каждой колонки\n",
"\n",
".isnull() - где недостает значений\n",
"\n",
".isna()- есть ли значения None\n",
"\n",
".dropna() - выкинуть строки/колонки с None\n",
"\n",
".fillna() - заполнить заданным значеним ячейки, где есть None\n",
"\n",
".loc[] - вывести значения по названиям колонок\n",
"\n",
".iloc[] - вывести значения по индексам колонок\n",
"\n",
".drop() - выкинуть определенные значения\n",
"\n",
"--------------\n",
"\n",
"pd.to_datetime(колонка, которую переводим в формат временного ряда)\n",
"\n",
".groupby() - сгруппировать по конкретному признаку\n",
"\n",
".copy() - создать копию\n",
"\n",
".sort_values() - сортировка значений\n",
"\n",
"pd.concat([df1,df2]) - конкатенация фреймов\n",
"\n",
".merge(второй_датафрейм, on = 'общая колонка, по которой склеиваем', how = 'с какой стороны') - конкатенация фреймов через общий признак\n",
"\n",
"-------------\n",
"\n",
"\n",
".corr() - вычислить корреляцию\n",
"\n",
".median() - вычислить медиану\n",
"\n",
".cumsum() - вычислить куммулятивную сумму\n",
"\n",
".cumprod() - вычислить коммулятивное произведение\n",
"\n",
".cummax() - вычислить коммулятивный максимум\n",
"\n",
"-------------\n",
"\n",
".quantile([]) - вычислить квантили\n",
"\n",
".nunique() - уникальные значения для n-колонок/строк\n",
"\n",
".unique() - уникальные значения определенной колонки/строк\n",
"\n",
"------------\n",
"\n",
".apply(функция) - применить функцию для колонки/строки\n",
"\n",
".agg(набор_функций) - применить ряд функций для колонки/строки\n"
]
}
],
"metadata": {
"colab": {
"provenance": [],
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}