{
|
||
"cells": [
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "ojlhGzdxhkwR"
|
||
},
|
||
"source": [
"# Pandas. Loading the libraries"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"metadata": {
"id": "xChor81V6mtD"
},
"source": [
"## Library overview and loading"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Z22T_R766hsO"
|
||
},
|
||
"source": [
" - <a href=\"http://pandas.pydata.org/\">Pandas</a> is a library for data processing and analysis. It is designed for data of various kinds: matrix data, panel data, and time series. It aims to be the most powerful and flexible open-source tool for data analysis."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 2,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Requirement already satisfied: pandas in ./venv/lib/python3.13/site-packages (2.2.3)\n",
|
||
"Requirement already satisfied: numpy>=1.26.0 in ./venv/lib/python3.13/site-packages (from pandas) (2.2.3)\n",
|
||
"Requirement already satisfied: python-dateutil>=2.8.2 in ./venv/lib/python3.13/site-packages (from pandas) (2.9.0.post0)\n",
|
||
"Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.13/site-packages (from pandas) (2025.1)\n",
|
||
"Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.13/site-packages (from pandas) (2025.1)\n",
|
||
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)\n",
|
||
"\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!pip install pandas"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 3,
|
||
"metadata": {
|
||
"id": "DqYWosnHhkwU"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"import pandas as pd # Import the pandas module under its conventional alias pd"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "ZNKpaic1hkwb"
|
||
},
|
||
"source": [
"pandas has two main data structures:\n",
"- Series: a one-dimensional labeled array (usually holding data of a single type)\n",
"- DataFrame: a two-dimensional, table-like structure that can be resized easily and can hold columns of different types\n",
"\n",
"Both can be created by hand with constructors from the library itself:\n",
"- pandas.Series(data=None, index=None, dtype=None)\n",
"- pandas.DataFrame(data=None, index=None, columns=None, dtype=None)\n",
"\n",
"- **data** - the data to store in the structure\n",
"- **index** - the row labels\n",
"- **columns** - the column names\n",
"- **dtype** - the data type\n",
"\n",
"All of these parameters default to None, so formally each one is optional; in practice you almost always pass data\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "tMHOWBBWhkwf"
|
||
},
|
||
"source": [
"Of course, we can build DataFrames ourselves!\n",
"\n",
"For example, suppose someone found a fragment of data and asks us to reproduce this dataset:\n",
"\n",
"<img src=\"https://i.imgur.com/FUCGiKP.png\">\n",
"\n",
"Let's work out what is what here and write it down in a structure we already know: lists."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 4,
|
||
"metadata": {
|
||
"id": "9yW-A-fRhkwi"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"columns = ['country', 'province', 'region_1', 'region_2'] # List of column names\n",
"index = [0, 1, 10, 100] # List of row labels (the index)\n",
"\n",
"# List with the data: each table row is a separate inner list\n",
"# Note: 'NaN' here is an ordinary string, not a real missing value\n",
"data = [['Italy', 'Sicily & Sardinia', 'Etna', 'NaN'],\n",
"        ['Portugal', 'Douro', 'NaN', 'NaN'],\n",
"        ['US', 'California', 'Napa Valley', 'Napa'],\n",
"        ['US', 'New York', 'Finger Lakes', 'Finger Lakes']]"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "6jUo7y0uhkwo"
|
||
},
|
||
"source": [
"Now let's assemble it into a DataFrame"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 5,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 173
|
||
},
|
||
"id": "jMEdfOOdhkwp",
|
||
"outputId": "b5fae3e6-3e8d-4297-d468-0be74894b070",
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>country</th>\n",
|
||
" <th>province</th>\n",
|
||
" <th>region_1</th>\n",
|
||
" <th>region_2</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>0</th>\n",
|
||
" <td>Italy</td>\n",
|
||
" <td>Sicily & Sardinia</td>\n",
|
||
" <td>Etna</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>1</th>\n",
|
||
" <td>Portugal</td>\n",
|
||
" <td>Douro</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10</th>\n",
|
||
" <td>US</td>\n",
|
||
" <td>California</td>\n",
|
||
" <td>Napa Valley</td>\n",
|
||
" <td>Napa</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>100</th>\n",
|
||
" <td>US</td>\n",
|
||
" <td>New York</td>\n",
|
||
" <td>Finger Lakes</td>\n",
|
||
" <td>Finger Lakes</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" country province region_1 region_2\n",
|
||
"0 Italy Sicily & Sardinia Etna NaN\n",
|
||
"1 Portugal Douro NaN NaN\n",
|
||
"10 US California Napa Valley Napa\n",
|
||
"100 US New York Finger Lakes Finger Lakes"
|
||
]
|
||
},
|
||
"execution_count": 5,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"df = pd.DataFrame(data, columns=columns, index=index) # Create the DataFrame, passing the data, the column names, and the row labels\n",
"df # Display the DataFrame (as the last expression in a cell it renders as a table, which print() would not do)"
|
||
]
|
||
},
|
||
{
"cell_type": "markdown",
"metadata": {
"id": "TIJhU5vEhkwv"
},
"source": [
"## Loading and writing data"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "CjIlX-Ar6vd7"
|
||
},
|
||
"source": [
"\n",
"- Functions of the form **pd.read_format** read data, and DataFrame methods of the form **to_format**\n",
"write it. <br /> The full list is in the documentation:\n",
"https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html\n",
"\n",
"Let's learn to read data in CSV (comma-separated values) format with the function:\n",
"\n",
"- <a href=\"http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv\"> pd.read_csv()</a>: \n",
"\n",
"It takes a great many arguments; the critically important ones are:\n",
" - **filepath_or_buffer** - a string with the file name (path or URL)\n",
" - **sep** - the delimiter between values\n",
" - **header** - the number of the row in the file that holds the column names, or None if there is no such row\n",
" - **names** - a list of column names\n",
" - **index_col** - a column number, a list of them, or nothing: the column(s) to use as the row labels"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 7,
|
||
"metadata": {
|
||
"id": "mWdKBTMNhkwx"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"data = pd.read_csv('water_potability.csv') # Load the file water_potability.csv with read_csv and store the result in data"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "g8zunGkmhkw2"
|
||
},
|
||
"source": [
"**Let's see what was loaded**\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 8,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "slhGLHJNhkw4",
|
||
"outputId": "58af12df-d33f-4a2a-e3f6-5a763ba68831"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>ph</th>\n",
|
||
" <th>Hardness</th>\n",
|
||
" <th>Solids</th>\n",
|
||
" <th>Chloramines</th>\n",
|
||
" <th>Sulfate</th>\n",
|
||
" <th>Conductivity</th>\n",
|
||
" <th>Organic carbon</th>\n",
|
||
" <th>Trihalomethanes</th>\n",
|
||
" <th>Turbidity</th>\n",
|
||
" <th>Potability</th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>3271</th>\n",
|
||
" <td>4.668102</td>\n",
|
||
" <td>193.681735</td>\n",
|
||
" <td>47580.991603</td>\n",
|
||
" <td>7.166639</td>\n",
|
||
" <td>359.948574</td>\n",
|
||
" <td>526.424171</td>\n",
|
||
" <td>13.894419</td>\n",
|
||
" <td>66.687695</td>\n",
|
||
" <td>4.435821</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3272</th>\n",
|
||
" <td>7.808856</td>\n",
|
||
" <td>193.553212</td>\n",
|
||
" <td>17329.802160</td>\n",
|
||
" <td>8.061362</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>392.449580</td>\n",
|
||
" <td>19.903225</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>2.798243</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3273</th>\n",
|
||
" <td>9.419510</td>\n",
|
||
" <td>175.762646</td>\n",
|
||
" <td>33155.578218</td>\n",
|
||
" <td>7.350233</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>432.044783</td>\n",
|
||
" <td>11.039070</td>\n",
|
||
" <td>69.845400</td>\n",
|
||
" <td>3.298875</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3274</th>\n",
|
||
" <td>5.126763</td>\n",
|
||
" <td>230.603758</td>\n",
|
||
" <td>11983.869376</td>\n",
|
||
" <td>6.303357</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>402.883113</td>\n",
|
||
" <td>11.168946</td>\n",
|
||
" <td>77.488213</td>\n",
|
||
" <td>4.708658</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3275</th>\n",
|
||
" <td>7.874671</td>\n",
|
||
" <td>195.102299</td>\n",
|
||
" <td>17404.177061</td>\n",
|
||
" <td>7.509306</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>327.459760</td>\n",
|
||
" <td>16.140368</td>\n",
|
||
" <td>78.698446</td>\n",
|
||
" <td>2.309149</td>\n",
|
||
" <td>1</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" ph Hardness Solids Chloramines Sulfate \\\n",
|
||
"3271 4.668102 193.681735 47580.991603 7.166639 359.948574 \n",
|
||
"3272 7.808856 193.553212 17329.802160 8.061362 NaN \n",
|
||
"3273 9.419510 175.762646 33155.578218 7.350233 NaN \n",
|
||
"3274 5.126763 230.603758 11983.869376 6.303357 NaN \n",
|
||
"3275 7.874671 195.102299 17404.177061 7.509306 NaN \n",
|
||
"\n",
|
||
" Conductivity Organic carbon Trihalomethanes Turbidity Potability \n",
|
||
"3271 526.424171 13.894419 66.687695 4.435821 1 \n",
|
||
"3272 392.449580 19.903225 NaN 2.798243 1 \n",
|
||
"3273 432.044783 11.039070 69.845400 3.298875 1 \n",
|
||
"3274 402.883113 11.168946 77.488213 4.708658 1 \n",
|
||
"3275 327.459760 16.140368 78.698446 2.309149 1 "
|
||
]
|
||
},
|
||
"execution_count": 8,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.tail() # The tail method shows the last 5 rows of the DataFrame"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "TAEcUwXohkw9"
|
||
},
|
||
"source": [
"Something looks off with the first column, so let's adjust how the file is read"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 9,
|
||
"metadata": {
|
||
"id": "UQ_ne0wIhkw-"
|
||
},
|
||
"outputs": [],
|
||
"source": [
"data = pd.read_csv('water_potability.csv', index_col=0) # index_col selects the column to use as the DataFrame index; here column 0 is ph"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 10,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 924
|
||
},
|
||
"id": "u5iBpJ0jhkxC",
|
||
"outputId": "b8c9ab01-2747-467a-e833-870c9d83d11b",
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Hardness</th>\n",
|
||
" <th>Solids</th>\n",
|
||
" <th>Chloramines</th>\n",
|
||
" <th>Sulfate</th>\n",
|
||
" <th>Conductivity</th>\n",
|
||
" <th>Organic carbon</th>\n",
|
||
" <th>Trihalomethanes</th>\n",
|
||
" <th>Turbidity</th>\n",
|
||
" <th>Potability</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>ph</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>204.890455</td>\n",
|
||
" <td>20791.318981</td>\n",
|
||
" <td>7.300212</td>\n",
|
||
" <td>368.516441</td>\n",
|
||
" <td>564.308654</td>\n",
|
||
" <td>10.379783</td>\n",
|
||
" <td>86.990970</td>\n",
|
||
" <td>2.963135</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3.716080</th>\n",
|
||
" <td>129.422921</td>\n",
|
||
" <td>18630.057858</td>\n",
|
||
" <td>6.635246</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>592.885359</td>\n",
|
||
" <td>15.180013</td>\n",
|
||
" <td>56.329076</td>\n",
|
||
" <td>4.500656</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.099124</th>\n",
|
||
" <td>224.236259</td>\n",
|
||
" <td>19909.541732</td>\n",
|
||
" <td>9.275884</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>418.606213</td>\n",
|
||
" <td>16.868637</td>\n",
|
||
" <td>66.420093</td>\n",
|
||
" <td>3.055934</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.316766</th>\n",
|
||
" <td>214.373394</td>\n",
|
||
" <td>22018.417441</td>\n",
|
||
" <td>8.059332</td>\n",
|
||
" <td>356.886136</td>\n",
|
||
" <td>363.266516</td>\n",
|
||
" <td>18.436524</td>\n",
|
||
" <td>100.341674</td>\n",
|
||
" <td>4.628771</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9.092223</th>\n",
|
||
" <td>181.101509</td>\n",
|
||
" <td>17978.986339</td>\n",
|
||
" <td>6.546600</td>\n",
|
||
" <td>310.135738</td>\n",
|
||
" <td>398.410813</td>\n",
|
||
" <td>11.558279</td>\n",
|
||
" <td>31.997993</td>\n",
|
||
" <td>4.075075</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5.584087</th>\n",
|
||
" <td>188.313324</td>\n",
|
||
" <td>28748.687739</td>\n",
|
||
" <td>7.544869</td>\n",
|
||
" <td>326.678363</td>\n",
|
||
" <td>280.467916</td>\n",
|
||
" <td>8.399735</td>\n",
|
||
" <td>54.917862</td>\n",
|
||
" <td>2.559708</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10.223862</th>\n",
|
||
" <td>248.071735</td>\n",
|
||
" <td>28749.716544</td>\n",
|
||
" <td>7.513408</td>\n",
|
||
" <td>393.663396</td>\n",
|
||
" <td>283.651634</td>\n",
|
||
" <td>13.789695</td>\n",
|
||
" <td>84.603556</td>\n",
|
||
" <td>2.672989</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.635849</th>\n",
|
||
" <td>203.361523</td>\n",
|
||
" <td>13672.091764</td>\n",
|
||
" <td>4.563009</td>\n",
|
||
" <td>303.309771</td>\n",
|
||
" <td>474.607645</td>\n",
|
||
" <td>12.363817</td>\n",
|
||
" <td>62.798309</td>\n",
|
||
" <td>4.401425</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>118.988579</td>\n",
|
||
" <td>14285.583854</td>\n",
|
||
" <td>7.804174</td>\n",
|
||
" <td>268.646941</td>\n",
|
||
" <td>389.375566</td>\n",
|
||
" <td>12.706049</td>\n",
|
||
" <td>53.928846</td>\n",
|
||
" <td>3.595017</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>11.180284</th>\n",
|
||
" <td>227.231469</td>\n",
|
||
" <td>25484.508491</td>\n",
|
||
" <td>9.077200</td>\n",
|
||
" <td>404.041635</td>\n",
|
||
" <td>563.885481</td>\n",
|
||
" <td>17.927806</td>\n",
|
||
" <td>71.976601</td>\n",
|
||
" <td>4.370562</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.360640</th>\n",
|
||
" <td>165.520797</td>\n",
|
||
" <td>32452.614409</td>\n",
|
||
" <td>7.550701</td>\n",
|
||
" <td>326.624353</td>\n",
|
||
" <td>425.383419</td>\n",
|
||
" <td>15.586810</td>\n",
|
||
" <td>78.740016</td>\n",
|
||
" <td>3.662292</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.974522</th>\n",
|
||
" <td>218.693300</td>\n",
|
||
" <td>18767.656682</td>\n",
|
||
" <td>8.110385</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>364.098230</td>\n",
|
||
" <td>14.525746</td>\n",
|
||
" <td>76.485911</td>\n",
|
||
" <td>4.011718</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.119824</th>\n",
|
||
" <td>156.704993</td>\n",
|
||
" <td>18730.813653</td>\n",
|
||
" <td>3.606036</td>\n",
|
||
" <td>282.344050</td>\n",
|
||
" <td>347.715027</td>\n",
|
||
" <td>15.929536</td>\n",
|
||
" <td>79.500778</td>\n",
|
||
" <td>3.445756</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>150.174923</td>\n",
|
||
" <td>27331.361962</td>\n",
|
||
" <td>6.838223</td>\n",
|
||
" <td>299.415781</td>\n",
|
||
" <td>379.761835</td>\n",
|
||
" <td>19.370807</td>\n",
|
||
" <td>76.509996</td>\n",
|
||
" <td>4.413974</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.496232</th>\n",
|
||
" <td>205.344982</td>\n",
|
||
" <td>28388.004887</td>\n",
|
||
" <td>5.072558</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>444.645352</td>\n",
|
||
" <td>13.228311</td>\n",
|
||
" <td>70.300213</td>\n",
|
||
" <td>4.777382</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>6.347272</th>\n",
|
||
" <td>186.732881</td>\n",
|
||
" <td>41065.234765</td>\n",
|
||
" <td>9.629596</td>\n",
|
||
" <td>364.487687</td>\n",
|
||
" <td>516.743282</td>\n",
|
||
" <td>11.539781</td>\n",
|
||
" <td>75.071617</td>\n",
|
||
" <td>4.376348</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.051786</th>\n",
|
||
" <td>211.049406</td>\n",
|
||
" <td>30980.600787</td>\n",
|
||
" <td>10.094796</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>315.141267</td>\n",
|
||
" <td>20.397022</td>\n",
|
||
" <td>56.651604</td>\n",
|
||
" <td>4.268429</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9.181560</th>\n",
|
||
" <td>273.813807</td>\n",
|
||
" <td>24041.326280</td>\n",
|
||
" <td>6.904990</td>\n",
|
||
" <td>398.350517</td>\n",
|
||
" <td>477.974642</td>\n",
|
||
" <td>13.387341</td>\n",
|
||
" <td>71.457362</td>\n",
|
||
" <td>4.503661</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.975464</th>\n",
|
||
" <td>279.357167</td>\n",
|
||
" <td>19460.398131</td>\n",
|
||
" <td>6.204321</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>431.443990</td>\n",
|
||
" <td>12.888759</td>\n",
|
||
" <td>63.821237</td>\n",
|
||
" <td>2.436086</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>7.371050</th>\n",
|
||
" <td>214.496610</td>\n",
|
||
" <td>25630.320037</td>\n",
|
||
" <td>4.432669</td>\n",
|
||
" <td>335.754439</td>\n",
|
||
" <td>469.914551</td>\n",
|
||
" <td>12.509164</td>\n",
|
||
" <td>62.797277</td>\n",
|
||
" <td>2.560299</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
|
||
"ph \n",
|
||
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
|
||
"3.716080 129.422921 18630.057858 6.635246 NaN 592.885359 \n",
|
||
"8.099124 224.236259 19909.541732 9.275884 NaN 418.606213 \n",
|
||
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
|
||
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
|
||
"5.584087 188.313324 28748.687739 7.544869 326.678363 280.467916 \n",
|
||
"10.223862 248.071735 28749.716544 7.513408 393.663396 283.651634 \n",
|
||
"8.635849 203.361523 13672.091764 4.563009 303.309771 474.607645 \n",
|
||
"NaN 118.988579 14285.583854 7.804174 268.646941 389.375566 \n",
|
||
"11.180284 227.231469 25484.508491 9.077200 404.041635 563.885481 \n",
|
||
"7.360640 165.520797 32452.614409 7.550701 326.624353 425.383419 \n",
|
||
"7.974522 218.693300 18767.656682 8.110385 NaN 364.098230 \n",
|
||
"7.119824 156.704993 18730.813653 3.606036 282.344050 347.715027 \n",
|
||
"NaN 150.174923 27331.361962 6.838223 299.415781 379.761835 \n",
|
||
"7.496232 205.344982 28388.004887 5.072558 NaN 444.645352 \n",
|
||
"6.347272 186.732881 41065.234765 9.629596 364.487687 516.743282 \n",
|
||
"7.051786 211.049406 30980.600787 10.094796 NaN 315.141267 \n",
|
||
"9.181560 273.813807 24041.326280 6.904990 398.350517 477.974642 \n",
|
||
"8.975464 279.357167 19460.398131 6.204321 NaN 431.443990 \n",
|
||
"7.371050 214.496610 25630.320037 4.432669 335.754439 469.914551 \n",
|
||
"\n",
|
||
" Organic carbon Trihalomethanes Turbidity Potability \n",
|
||
"ph \n",
|
||
"NaN 10.379783 86.990970 2.963135 0 \n",
|
||
"3.716080 15.180013 56.329076 4.500656 0 \n",
|
||
"8.099124 16.868637 66.420093 3.055934 0 \n",
|
||
"8.316766 18.436524 100.341674 4.628771 0 \n",
|
||
"9.092223 11.558279 31.997993 4.075075 0 \n",
|
||
"5.584087 8.399735 54.917862 2.559708 0 \n",
|
||
"10.223862 13.789695 84.603556 2.672989 0 \n",
|
||
"8.635849 12.363817 62.798309 4.401425 0 \n",
|
||
"NaN 12.706049 53.928846 3.595017 0 \n",
|
||
"11.180284 17.927806 71.976601 4.370562 0 \n",
|
||
"7.360640 15.586810 78.740016 3.662292 0 \n",
|
||
"7.974522 14.525746 76.485911 4.011718 0 \n",
|
||
"7.119824 15.929536 79.500778 3.445756 0 \n",
|
||
"NaN 19.370807 76.509996 4.413974 0 \n",
|
||
"7.496232 13.228311 70.300213 4.777382 0 \n",
|
||
"6.347272 11.539781 75.071617 4.376348 0 \n",
|
||
"7.051786 20.397022 56.651604 4.268429 0 \n",
|
||
"9.181560 13.387341 71.457362 4.503661 0 \n",
|
||
"8.975464 12.888759 63.821237 2.436086 0 \n",
|
||
"7.371050 12.509164 62.797277 2.560299 0 "
|
||
]
|
||
},
|
||
"execution_count": 10,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.head(20) # The head method shows the first 20 rows of the DataFrame"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "UnU4xLLzhkxG"
|
||
},
|
||
"source": [
"**Information about the loaded data**:\n",
"\n",
"- Count how many records there are\n",
"- Look at the data types\n",
"- Check whether there are missing values"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 11,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 34
|
||
},
|
||
"id": "Z-MKWiELhkxP",
|
||
"outputId": "68ca424e-83eb-4d17-d779-1ea7f1a6b4ae"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"(3276, 9)"
|
||
]
|
||
},
|
||
"execution_count": 11,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.shape # The .shape attribute (as with NumPy arrays) gives the dimensions of the DataFrame: (rows, columns)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 12,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 34
|
||
},
|
||
"id": "SEr52zb4hkxT",
|
||
"outputId": "d72fe356-c89d-4d61-c62b-a64359f748d2"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"29484"
|
||
]
|
||
},
|
||
"execution_count": 12,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.size # The .size attribute (as with NumPy arrays) gives the total number of elements in the DataFrame"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 13,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "sw8ATDX1hkxJ",
|
||
"outputId": "2cfce98c-00bf-4093-8bf5-6573e0cd909f",
|
||
"scrolled": true
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 3276\n",
|
||
"Solids 3276\n",
|
||
"Chloramines 3276\n",
|
||
"Sulfate 2495\n",
|
||
"Conductivity 3276\n",
|
||
"Organic carbon 3276\n",
|
||
"Trihalomethanes 3114\n",
|
||
"Turbidity 3276\n",
|
||
"Potability 3276\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 13,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.count() # The count method returns the number of non-missing values in each column"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 14,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "OwVkE1MKX6mW",
|
||
"outputId": "9a9f1142-de9c-4ffa-e051-bb58af728151"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 100\n",
|
||
"Solids 100\n",
|
||
"Chloramines 100\n",
|
||
"Sulfate 74\n",
|
||
"Conductivity 100\n",
|
||
"Organic carbon 100\n",
|
||
"Trihalomethanes 98\n",
|
||
"Turbidity 100\n",
|
||
"Potability 100\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 14,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.head(100).count() # Apply .count() to the first hundred rows of the DataFrame"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "5AjpVmYFhkxX"
|
||
},
|
||
"source": [
"- The info() method additionally shows the data type of each column"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 15,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 306
|
||
},
|
||
"id": "G8RHx3kvhkxZ",
|
||
"outputId": "cf46dd23-3acf-4d2e-c8c4-046fa7b1f8d6"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"<class 'pandas.core.frame.DataFrame'>\n",
|
||
"Index: 3276 entries, nan to 7.87467135779128\n",
|
||
"Data columns (total 9 columns):\n",
|
||
" # Column Non-Null Count Dtype \n",
|
||
"--- ------ -------------- ----- \n",
|
||
" 0 Hardness 3276 non-null float64\n",
|
||
" 1 Solids 3276 non-null float64\n",
|
||
" 2 Chloramines 3276 non-null float64\n",
|
||
" 3 Sulfate 2495 non-null float64\n",
|
||
" 4 Conductivity 3276 non-null float64\n",
|
||
" 5 Organic carbon 3276 non-null float64\n",
|
||
" 6 Trihalomethanes 3114 non-null float64\n",
|
||
" 7 Turbidity 3276 non-null float64\n",
|
||
" 8 Potability 3276 non-null int64 \n",
|
||
"dtypes: float64(8), int64(1)\n",
|
||
"memory usage: 255.9 KB\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
"data.info() # The .info() method shows each column's non-null count and dtype, plus memory usage"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 16,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "cMRzYwQdhkxd",
|
||
"outputId": "4afa01d4-ec65-452a-fc89-c503253c1efa"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness float64\n",
|
||
"Solids float64\n",
|
||
"Chloramines float64\n",
|
||
"Sulfate float64\n",
|
||
"Conductivity float64\n",
|
||
"Organic carbon float64\n",
|
||
"Trihalomethanes float64\n",
|
||
"Turbidity float64\n",
|
||
"Potability int64\n",
|
||
"dtype: object"
|
||
]
|
||
},
|
||
"execution_count": 16,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
"data.dtypes # The .dtypes attribute shows just the type of each column"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "S3TniwKUhkxh"
|
||
},
|
||
"source": [
"Let's start checking for missing values!\n",
"\n",
"- .isnull() returns a table of the same shape where False means the cell is filled and True means the cell is empty :( Its closest relative is isna(), of which it is an alias"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 17,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "uq1iywLbYsxS",
|
||
"outputId": "fcc31e0d-6e49-4967-ac34-d03865227b1f"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Hardness</th>\n",
|
||
" <th>Solids</th>\n",
|
||
" <th>Chloramines</th>\n",
|
||
" <th>Sulfate</th>\n",
|
||
" <th>Conductivity</th>\n",
|
||
" <th>Organic carbon</th>\n",
|
||
" <th>Trihalomethanes</th>\n",
|
||
" <th>Turbidity</th>\n",
|
||
" <th>Potability</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>ph</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>204.890455</td>\n",
|
||
" <td>20791.318981</td>\n",
|
||
" <td>7.300212</td>\n",
|
||
" <td>368.516441</td>\n",
|
||
" <td>564.308654</td>\n",
|
||
" <td>10.379783</td>\n",
|
||
" <td>86.990970</td>\n",
|
||
" <td>2.963135</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3.716080</th>\n",
|
||
" <td>129.422921</td>\n",
|
||
" <td>18630.057858</td>\n",
|
||
" <td>6.635246</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>592.885359</td>\n",
|
||
" <td>15.180013</td>\n",
|
||
" <td>56.329076</td>\n",
|
||
" <td>4.500656</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.099124</th>\n",
|
||
" <td>224.236259</td>\n",
|
||
" <td>19909.541732</td>\n",
|
||
" <td>9.275884</td>\n",
|
||
" <td>NaN</td>\n",
|
||
" <td>418.606213</td>\n",
|
||
" <td>16.868637</td>\n",
|
||
" <td>66.420093</td>\n",
|
||
" <td>3.055934</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.316766</th>\n",
|
||
" <td>214.373394</td>\n",
|
||
" <td>22018.417441</td>\n",
|
||
" <td>8.059332</td>\n",
|
||
" <td>356.886136</td>\n",
|
||
" <td>363.266516</td>\n",
|
||
" <td>18.436524</td>\n",
|
||
" <td>100.341674</td>\n",
|
||
" <td>4.628771</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9.092223</th>\n",
|
||
" <td>181.101509</td>\n",
|
||
" <td>17978.986339</td>\n",
|
||
" <td>6.546600</td>\n",
|
||
" <td>310.135738</td>\n",
|
||
" <td>398.410813</td>\n",
|
||
" <td>11.558279</td>\n",
|
||
" <td>31.997993</td>\n",
|
||
" <td>4.075075</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
|
||
"ph \n",
|
||
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
|
||
"3.716080 129.422921 18630.057858 6.635246 NaN 592.885359 \n",
|
||
"8.099124 224.236259 19909.541732 9.275884 NaN 418.606213 \n",
|
||
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
|
||
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
|
||
"\n",
|
||
" Organic carbon Trihalomethanes Turbidity Potability \n",
|
||
"ph \n",
|
||
"NaN 10.379783 86.990970 2.963135 0 \n",
|
||
"3.716080 15.180013 56.329076 4.500656 0 \n",
|
||
"8.099124 16.868637 66.420093 3.055934 0 \n",
|
||
"8.316766 18.436524 100.341674 4.628771 0 \n",
|
||
"9.092223 11.558279 31.997993 4.075075 0 "
|
||
]
|
||
},
|
||
"execution_count": 17,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"data.head() # Отобразим первые 5 строк нашего датафрейма"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "_oxBR6lAzfgu",
|
||
"outputId": "1b2f600d-ea50-4289-cfd6-976ea1526877"
|
||
},
|
||
"outputs": [],
|
||
"source": []
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 18,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "ZjTn7cM5zyta",
|
||
"outputId": "b3889e78-08c3-4bdf-c4ec-260d35eed9ea"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 0\n",
|
||
"Solids 0\n",
|
||
"Chloramines 0\n",
|
||
"Sulfate 781\n",
|
||
"Conductivity 0\n",
|
||
"Organic carbon 0\n",
|
||
"Trihalomethanes 162\n",
|
||
"Turbidity 0\n",
|
||
"Potability 0\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 18,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"data.isna().sum() # Подсчитаем количество пропусков в каждом столбце с помощью метода .sum()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 19,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "D7aHOhOGY5pe",
|
||
"outputId": "ce904bd7-3087-40a2-824e-7f7047f358db"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 0\n",
|
||
"Solids 0\n",
|
||
"Chloramines 0\n",
|
||
"Sulfate 26\n",
|
||
"Conductivity 0\n",
|
||
"Organic carbon 0\n",
|
||
"Trihalomethanes 2\n",
|
||
"Turbidity 0\n",
|
||
"Potability 0\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 19,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"data.head(100).isna().sum() # Подсчитаем количество пропусков в каждом столбце для первых ста записей"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 20,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "ZaiGPw-KY9eQ",
|
||
"outputId": "3ffbc798-c5bc-4970-c6c3-1608da30afe4"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 0\n",
|
||
"Solids 0\n",
|
||
"Chloramines 0\n",
|
||
"Sulfate 26\n",
|
||
"Conductivity 0\n",
|
||
"Organic carbon 0\n",
|
||
"Trihalomethanes 2\n",
|
||
"Turbidity 0\n",
|
||
"Potability 0\n",
|
||
"dtype: int64"
|
||
]
|
||
},
|
||
"execution_count": 20,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"data.isna().head(100).sum() # Подсчитаем количество пропусков в каждом столбце для первых ста записей (равнозначно предыдущей записи)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 21,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "NHeX2czDhkxi",
|
||
"outputId": "9995758b-2f88-47ca-ab63-37cb364875dd"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/plain": [
|
||
"Hardness 0.000000\n",
|
||
"Solids 0.000000\n",
|
||
"Chloramines 0.000000\n",
|
||
"Sulfate 0.238400\n",
|
||
"Conductivity 0.000000\n",
|
||
"Organic carbon 0.000000\n",
|
||
"Trihalomethanes 0.049451\n",
|
||
"Turbidity 0.000000\n",
|
||
"Potability 0.000000\n",
|
||
"dtype: float64"
|
||
]
|
||
},
|
||
"execution_count": 21,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"round(data.isna().sum() / data.shape[0], 6) # Посчитаем какую часть составляют пропуски от общего количества элементов"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 22,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 34
|
||
},
|
||
"id": "tvAQTignhkxo",
|
||
"outputId": "7d223855-7d97-4529-bb74-572e21ed89a2"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"943\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"proc = data.isna().sum().sum() # Подсчитаем сколько всего пропусков (во всех столбцах) в нашем датафрейме\n",
|
||
"print(proc) # Отобразим количество посчитанных пропусков"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 23,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 34
|
||
},
|
||
"id": "EOZz-GAPhkxr",
|
||
"outputId": "b7997cfd-a292-48ff-d431-ff425d210a7c"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"3.2%\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"# Переведем полученное значение в процентное отображение\n",
|
||
"proc = data.isna().sum().sum() / data.size\n",
|
||
"print(round(100*proc,1), '%', sep='')"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 24,
|
||
"metadata": {
|
||
"id": "OuW1gRtlhkxz"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"### Как оценить пропуски визуально"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "E8w2yGJ1hkx6"
|
||
},
|
||
"source": [
|
||
"Что с ним делать?\n",
|
||
"\n",
|
||
"Выбора не очень много: <br>\n",
|
||
"\n",
|
||
"1) Удалять: \n",
|
||
"- dropna(axis=0, how='any'): axis = 0 - удаляем построчно, axis = 1 выкидываем столбец; how ='any' - выкидываем, если есть хотя бы одна ячейка пустая. how = 'all' - выкидываем, если есть полностью пустая строка или столбец\n",
|
||
"\n",
|
||
"2) Вставлять информацию самим:\n",
|
||
"- fillna() - это отдельное искусство, как заполнять. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 25,
|
||
"metadata": {},
|
||
"outputs": [
|
||
{
|
||
"name": "stdout",
|
||
"output_type": "stream",
|
||
"text": [
|
||
"Requirement already satisfied: matplotlib in ./venv/lib/python3.13/site-packages (3.10.0)\n",
|
||
"Requirement already satisfied: contourpy>=1.0.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (1.3.1)\n",
|
||
"Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.13/site-packages (from matplotlib) (0.12.1)\n",
|
||
"Requirement already satisfied: fonttools>=4.22.0 in ./venv/lib/python3.13/site-packages (from matplotlib) (4.56.0)\n",
|
||
"Requirement already satisfied: kiwisolver>=1.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (1.4.8)\n",
|
||
"Requirement already satisfied: numpy>=1.23 in ./venv/lib/python3.13/site-packages (from matplotlib) (2.2.3)\n",
|
||
"Requirement already satisfied: packaging>=20.0 in ./venv/lib/python3.13/site-packages (from matplotlib) (24.2)\n",
|
||
"Requirement already satisfied: pillow>=8 in ./venv/lib/python3.13/site-packages (from matplotlib) (11.1.0)\n",
|
||
"Requirement already satisfied: pyparsing>=2.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib) (3.2.1)\n",
|
||
"Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.13/site-packages (from matplotlib) (2.9.0.post0)\n",
|
||
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.7->matplotlib) (1.17.0)\n",
|
||
"\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n",
|
||
"Requirement already satisfied: seaborn in ./venv/lib/python3.13/site-packages (0.13.2)\n",
|
||
"Requirement already satisfied: numpy!=1.24.0,>=1.20 in ./venv/lib/python3.13/site-packages (from seaborn) (2.2.3)\n",
|
||
"Requirement already satisfied: pandas>=1.2 in ./venv/lib/python3.13/site-packages (from seaborn) (2.2.3)\n",
|
||
"Requirement already satisfied: matplotlib!=3.6.1,>=3.4 in ./venv/lib/python3.13/site-packages (from seaborn) (3.10.0)\n",
|
||
"Requirement already satisfied: contourpy>=1.0.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.3.1)\n",
|
||
"Requirement already satisfied: cycler>=0.10 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (0.12.1)\n",
|
||
"Requirement already satisfied: fonttools>=4.22.0 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (4.56.0)\n",
|
||
"Requirement already satisfied: kiwisolver>=1.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (1.4.8)\n",
|
||
"Requirement already satisfied: packaging>=20.0 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (24.2)\n",
|
||
"Requirement already satisfied: pillow>=8 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (11.1.0)\n",
|
||
"Requirement already satisfied: pyparsing>=2.3.1 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (3.2.1)\n",
|
||
"Requirement already satisfied: python-dateutil>=2.7 in ./venv/lib/python3.13/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.9.0.post0)\n",
|
||
"Requirement already satisfied: pytz>=2020.1 in ./venv/lib/python3.13/site-packages (from pandas>=1.2->seaborn) (2025.1)\n",
|
||
"Requirement already satisfied: tzdata>=2022.7 in ./venv/lib/python3.13/site-packages (from pandas>=1.2->seaborn) (2025.1)\n",
|
||
"Requirement already satisfied: six>=1.5 in ./venv/lib/python3.13/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.17.0)\n",
|
||
"\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n",
|
||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"!pip install matplotlib\n",
|
||
"!pip install seaborn"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 26,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 735
|
||
},
|
||
"id": "ToPE3VkWhkx1",
|
||
"outputId": "6a5c7213-1a94-4823-e5cf-3d19468a890d"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"image/png": "",
|
||
"text/plain": [
|
||
"<Figure size 2000x1200 with 1 Axes>"
|
||
]
|
||
},
|
||
"metadata": {},
|
||
"output_type": "display_data"
|
||
}
|
||
],
|
||
"source": [
|
||
"import matplotlib.pyplot as plt # Загружаем модуль matplotlib.pyplot\n",
|
||
"import seaborn as sns # Загружаем модуль seaborn\n",
|
||
"%matplotlib inline\n",
|
||
"\n",
|
||
"fig, ax = plt.subplots(figsize=(20,12)) # Создаем область под график\n",
|
||
"sns_heatmap = sns.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis') # Визуализируем прпуски\n",
|
||
"plt.show() # Отображаем график"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 27,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 495
|
||
},
|
||
"id": "uZgh1E3nhkx6",
|
||
"outputId": "3a6709c7-3fbc-4260-bcd9-69c47228b013"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"data": {
|
||
"text/html": [
|
||
"<div>\n",
|
||
"<style scoped>\n",
|
||
" .dataframe tbody tr th:only-of-type {\n",
|
||
" vertical-align: middle;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe tbody tr th {\n",
|
||
" vertical-align: top;\n",
|
||
" }\n",
|
||
"\n",
|
||
" .dataframe thead th {\n",
|
||
" text-align: right;\n",
|
||
" }\n",
|
||
"</style>\n",
|
||
"<table border=\"1\" class=\"dataframe\">\n",
|
||
" <thead>\n",
|
||
" <tr style=\"text-align: right;\">\n",
|
||
" <th></th>\n",
|
||
" <th>Hardness</th>\n",
|
||
" <th>Solids</th>\n",
|
||
" <th>Chloramines</th>\n",
|
||
" <th>Sulfate</th>\n",
|
||
" <th>Conductivity</th>\n",
|
||
" <th>Organic carbon</th>\n",
|
||
" <th>Trihalomethanes</th>\n",
|
||
" <th>Turbidity</th>\n",
|
||
" <th>Potability</th>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>ph</th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" <th></th>\n",
|
||
" </tr>\n",
|
||
" </thead>\n",
|
||
" <tbody>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>204.890455</td>\n",
|
||
" <td>20791.318981</td>\n",
|
||
" <td>7.300212</td>\n",
|
||
" <td>368.516441</td>\n",
|
||
" <td>564.308654</td>\n",
|
||
" <td>10.379783</td>\n",
|
||
" <td>86.99097</td>\n",
|
||
" <td>2.963135</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>3.716080</th>\n",
|
||
" <td>129.422921</td>\n",
|
||
" <td>18630.057858</td>\n",
|
||
" <td>6.635246</td>\n",
|
||
" <td>Python</td>\n",
|
||
" <td>592.885359</td>\n",
|
||
" <td>15.180013</td>\n",
|
||
" <td>56.329076</td>\n",
|
||
" <td>4.500656</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.099124</th>\n",
|
||
" <td>224.236259</td>\n",
|
||
" <td>19909.541732</td>\n",
|
||
" <td>9.275884</td>\n",
|
||
" <td>Python</td>\n",
|
||
" <td>418.606213</td>\n",
|
||
" <td>16.868637</td>\n",
|
||
" <td>66.420093</td>\n",
|
||
" <td>3.055934</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.316766</th>\n",
|
||
" <td>214.373394</td>\n",
|
||
" <td>22018.417441</td>\n",
|
||
" <td>8.059332</td>\n",
|
||
" <td>356.886136</td>\n",
|
||
" <td>363.266516</td>\n",
|
||
" <td>18.436524</td>\n",
|
||
" <td>100.341674</td>\n",
|
||
" <td>4.628771</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>9.092223</th>\n",
|
||
" <td>181.101509</td>\n",
|
||
" <td>17978.986339</td>\n",
|
||
" <td>6.546600</td>\n",
|
||
" <td>310.135738</td>\n",
|
||
" <td>398.410813</td>\n",
|
||
" <td>11.558279</td>\n",
|
||
" <td>31.997993</td>\n",
|
||
" <td>4.075075</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>5.584087</th>\n",
|
||
" <td>188.313324</td>\n",
|
||
" <td>28748.687739</td>\n",
|
||
" <td>7.544869</td>\n",
|
||
" <td>326.678363</td>\n",
|
||
" <td>280.467916</td>\n",
|
||
" <td>8.399735</td>\n",
|
||
" <td>54.917862</td>\n",
|
||
" <td>2.559708</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>10.223862</th>\n",
|
||
" <td>248.071735</td>\n",
|
||
" <td>28749.716544</td>\n",
|
||
" <td>7.513408</td>\n",
|
||
" <td>393.663396</td>\n",
|
||
" <td>283.651634</td>\n",
|
||
" <td>13.789695</td>\n",
|
||
" <td>84.603556</td>\n",
|
||
" <td>2.672989</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>8.635849</th>\n",
|
||
" <td>203.361523</td>\n",
|
||
" <td>13672.091764</td>\n",
|
||
" <td>4.563009</td>\n",
|
||
" <td>303.309771</td>\n",
|
||
" <td>474.607645</td>\n",
|
||
" <td>12.363817</td>\n",
|
||
" <td>62.798309</td>\n",
|
||
" <td>4.401425</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>NaN</th>\n",
|
||
" <td>118.988579</td>\n",
|
||
" <td>14285.583854</td>\n",
|
||
" <td>7.804174</td>\n",
|
||
" <td>268.646941</td>\n",
|
||
" <td>389.375566</td>\n",
|
||
" <td>12.706049</td>\n",
|
||
" <td>53.928846</td>\n",
|
||
" <td>3.595017</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" <tr>\n",
|
||
" <th>11.180284</th>\n",
|
||
" <td>227.231469</td>\n",
|
||
" <td>25484.508491</td>\n",
|
||
" <td>9.077200</td>\n",
|
||
" <td>404.041635</td>\n",
|
||
" <td>563.885481</td>\n",
|
||
" <td>17.927806</td>\n",
|
||
" <td>71.976601</td>\n",
|
||
" <td>4.370562</td>\n",
|
||
" <td>0</td>\n",
|
||
" </tr>\n",
|
||
" </tbody>\n",
|
||
"</table>\n",
|
||
"</div>"
|
||
],
|
||
"text/plain": [
|
||
" Hardness Solids Chloramines Sulfate Conductivity \\\n",
|
||
"ph \n",
|
||
"NaN 204.890455 20791.318981 7.300212 368.516441 564.308654 \n",
|
||
"3.716080 129.422921 18630.057858 6.635246 Python 592.885359 \n",
|
||
"8.099124 224.236259 19909.541732 9.275884 Python 418.606213 \n",
|
||
"8.316766 214.373394 22018.417441 8.059332 356.886136 363.266516 \n",
|
||
"9.092223 181.101509 17978.986339 6.546600 310.135738 398.410813 \n",
|
||
"5.584087 188.313324 28748.687739 7.544869 326.678363 280.467916 \n",
|
||
"10.223862 248.071735 28749.716544 7.513408 393.663396 283.651634 \n",
|
||
"8.635849 203.361523 13672.091764 4.563009 303.309771 474.607645 \n",
|
||
"NaN 118.988579 14285.583854 7.804174 268.646941 389.375566 \n",
|
||
"11.180284 227.231469 25484.508491 9.077200 404.041635 563.885481 \n",
|
||
"\n",
|
||
" Organic carbon Trihalomethanes Turbidity Potability \n",
|
||
"ph \n",
|
||
"NaN 10.379783 86.99097 2.963135 0 \n",
|
||
"3.716080 15.180013 56.329076 4.500656 0 \n",
|
||
"8.099124 16.868637 66.420093 3.055934 0 \n",
|
||
"8.316766 18.436524 100.341674 4.628771 0 \n",
|
||
"9.092223 11.558279 31.997993 4.075075 0 \n",
|
||
"5.584087 8.399735 54.917862 2.559708 0 \n",
|
||
"10.223862 13.789695 84.603556 2.672989 0 \n",
|
||
"8.635849 12.363817 62.798309 4.401425 0 \n",
|
||
"NaN 12.706049 53.928846 3.595017 0 \n",
|
||
"11.180284 17.927806 71.976601 4.370562 0 "
|
||
]
|
||
},
|
||
"execution_count": 27,
|
||
"metadata": {},
|
||
"output_type": "execute_result"
|
||
}
|
||
],
|
||
"source": [
|
||
"data.fillna(\"Python\").head(10) # С помощью метода .fillna() заменяем все пропуски словом Python"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": 28,
|
||
"metadata": {
|
||
"id": "mBsHwML6hkyF"
|
||
},
|
||
"outputs": [
|
||
{
|
||
"ename": "SyntaxError",
|
||
"evalue": "invalid syntax (192570114.py, line 3)",
|
||
"output_type": "error",
|
||
"traceback": [
|
||
"\u001b[0;36m Cell \u001b[0;32mIn[28], line 3\u001b[0;36m\u001b[0m\n\u001b[0;31m Теперь посмотрим, а что содержательно у нас есть на руках.\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
|
||
]
|
||
}
|
||
],
|
||
"source": [
|
||
"### Описательные статистики\n",
|
||
"\n",
|
||
"Теперь посмотрим, а что содержательно у нас есть на руках. \n",
|
||
"\n",
|
||
"Глазами просматривать не будем, а попросим посчитать основные описательные статистики. Причем сразу все.\n",
|
||
"\n",
|
||
"- describe() - метод, который возвращает табличку с описательными статистиками. В таком виде считает все для числовых столбцов"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 297
|
||
},
|
||
"id": "ZWz60or1hkyG",
|
||
"outputId": "134781c0-28b6-4137-85d1-8376216860c6",
|
||
"scrolled": true
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.describe() # Отобразим описательные статистики нашего датафрейма (только числовые данные)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "0aeamrWMhkyK"
|
||
},
|
||
"source": [
|
||
"Немножко магии, и для нечисловых данные тоже будут свои описательные статистики. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 173
|
||
},
|
||
"id": "jKTF-2BHhkyK",
|
||
"outputId": "244a91a6-e8b4-42a9-d464-3728bc5c3dc5"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.describe(include=['O']) # # Отобразим описательные статистики нашего датафрейма ('O' - в том числе и строковые)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "-IbwBRL_hkyO"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"### Срезы данных\n",
|
||
"\n",
|
||
"Допустим, нам не нужен датасет, а только определенные столбцы или строки или столбцы и строки. \n",
|
||
"\n",
|
||
"\n",
|
||
"Как делать?\n",
|
||
"Помним, что:\n",
|
||
"- у столбцов есть названия\n",
|
||
"- у строк есть названия\n",
|
||
"- если нет названий, то они пронумерованы с нуля\n",
|
||
"\n",
|
||
"Основываясь на этой идее, мы начнем отбирать данные."
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 80
|
||
},
|
||
"id": "4uT9dn4vhkyO",
|
||
"outputId": "f57acc81-2f88-41fd-ada0-028d80ed3ca7"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.head(1) # Отобразим первую строчку датафрейма"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "5b2HPHJwhkyT"
|
||
},
|
||
"source": [
|
||
"#### Отбираем по столбцам. Версия 1. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 221
|
||
},
|
||
"id": "ocn9YgmnhkyZ",
|
||
"outputId": "438dc10b-ff81-42d3-deea-eefa140305d5"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"array = data['price'] # Отобразим столбец price\n",
|
||
"array"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 221
|
||
},
|
||
"id": "tBeyZQLPIMIJ",
|
||
"outputId": "95d6ac17-ddde-46d9-e9a8-6e265eb12085"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.price"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 119
|
||
},
|
||
"id": "YVzV30CQhkyV",
|
||
"outputId": "26839d1c-a250-4ec0-a388-50f50e45af89"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.price.head() # Отобразим столбец price (альтернативные вариант)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "IDhUYDK5hkye",
|
||
"outputId": "536a6bdf-0016-4faf-e984-f1bfa8d356ad"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"new_df = data[['price','country']].head() # Отобразим столбцы 'price' и 'country'\n",
|
||
"new_df"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "Pw7bsVKPhkyg"
|
||
},
|
||
"source": [
|
||
"#### Отбираем по строкам. Версия 1. "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 359
|
||
},
|
||
"id": "-3ntG2CzhDyV",
|
||
"outputId": "799530a2-5339-4ddf-8cbc-ef187d8a148f"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data[10:20] # Отобразим с 10й по 20ю строки датафрейма"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 173
|
||
},
|
||
"id": "DaW5dRU7hkyh",
|
||
"outputId": "9665e0f7-f195-4217-b8a1-674700cdc917"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data[10:20:3] # Отобразим с 10й по 20ю строки датафрейма с шагом 2"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 359
|
||
},
|
||
"id": "zXqL-lBEhGkG",
|
||
"outputId": "c741c699-f40d-4417-9bb4-bc849da4f19b"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data[::5].head(10) # Отобразим каждую 5ю строку датафрейма"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "pV0kczWfhkyk"
|
||
},
|
||
"source": [
|
||
"#### Отбор по столбцам. Версия 2. Все еще по названиям "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 495
|
||
},
|
||
"id": "blyn4oRnJOlm",
|
||
"outputId": "cc0258b0-2735-4d7f-cb92-ac9b23b2a83e"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.head(10)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 173
|
||
},
|
||
"id": "LfRYRSsohkyk",
|
||
"outputId": "c7f8b402-bba9-4da4-eebe-6402c24030c2"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.loc[4:7, ['price', 'points']] # Отобразим два столбца 'price' и 'points', и в них строки с индексами с 4 по 7"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "CK4-ntzDhkyo"
|
||
},
|
||
"source": [
|
||
"#### Отбор по строкам. Версия 2. Все еще по названиям "
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 235
|
||
},
|
||
"id": "eqAQs0YIhkyq",
|
||
"outputId": "d0bab15c-91be-41a2-ef6f-82f2faf5e702"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.loc[:5,:] # Отобразим строки с индексом от 0 до 5 (то же, что и data.loc[:5])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "NGugpgJfhkyv"
|
||
},
|
||
"source": [
|
||
"#### Отбор по строчкам и столбцам. Версия 3. По номеру строк и столбцов"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "CG0aSTW8hkyv",
|
||
"outputId": "af2503a1-524e-431c-f61e-5d0a5590a9b1"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.iloc[::5, [1,3]].head() # Отобразим каждую 5 строку и 1 и 3 столбец"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "sIao6149hkyy"
|
||
},
|
||
"source": [
|
||
"#### Отбор с условиями\n",
|
||
"\n",
|
||
"Так, а если мне нужны вина дороже $15 долларов? Как быть?"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "YsIrLdRnhkyy"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#задаем маску\n",
|
||
"mask = data['price'] > 15"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 119
|
||
},
|
||
"id": "nfVB6YBbhky0",
|
||
"outputId": "ebfb750f-f4f2-43a3-c62b-4e39273046de"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"mask.head() # Отобразим маску"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 359
|
||
},
|
||
"id": "FADnit0Ghky2",
|
||
"outputId": "16cf6881-4c3c-408e-f661-4141182143d5"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#и отбираем данные\n",
|
||
"temp = data[mask] # Выбираем данные из датафрейма в соответствии с маской и записываем их в новый даатафрейм temp\n",
|
||
"temp # Отображаем temp"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "EHT3VYtNhky4",
|
||
"outputId": "84091e0c-6995-4d73-b63a-c69aeec6ecd4"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data[data.price>300].head()# Альтернативный вариант"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 616
|
||
},
|
||
"id": "Moi8GwyVhky8",
|
||
"outputId": "f4020760-204a-42f0-9df3-28242583e16e"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data[(data.price > 200) & ((data.country == 'US') | (data.country == 'France'))].head(15) # Составное условие"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "Da0WfR_5hky_"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"### Мультииндексация"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "gIqtjR45hky_",
|
||
"outputId": "48a1fdf1-4c7f-4c3c-8e76-b9e773cb15c7"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.head() # Отобразим наш датафрем"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 1000
|
||
},
|
||
"id": "JizHrXguhkzC",
|
||
"outputId": "4b793963-14d9-4f87-bb3f-b26bb14d2d8e"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data_ = data.groupby(['country', 'price']).count() # Сграппируем данные сначала по странам, а затем по price\n",
|
||
"data_.head(100) # Отобразим первые 50 строк нового датафрейма"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 450
|
||
},
|
||
"id": "hE9aG1imhkzG",
|
||
"outputId": "b2abe0e9-93f7-4044-e88c-844918452e52"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data_.loc['US'] # Отобразим все данные для 'US'"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 170
|
||
},
|
||
"id": "ZteYvkfehkzL",
|
||
"outputId": "938087ed-492f-4ddf-c3f1-16ca3b38f331"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data_.loc['US', 100] # Отобразим данные для 'US', у кого 100 points"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "6o6JX1OnhkzP"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"#### Как изменять значения в табличке"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "sVulAc0HsaLu",
|
||
"outputId": "6fca9bb4-357b-4398-e509-6191a1ee9e74"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data_backup = data.copy() # Создаем копию нашего датафрейма и записываем в переменную data_backup\n",
|
||
"data.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 297
|
||
},
|
||
"id": "eMhSX4jqhkzP",
|
||
"outputId": "5ee71b23-0935-46c4-925f-912e28eeda25"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.iloc[0,1] = 'kotiki' # Вставляем новое значение в 0 строку и 1 стоблец\n",
|
||
"data.iloc[2,2] = '129' # Вставляем новое значение в 2 строку и 2 стоблец\n",
|
||
"data.iloc[3:5,2:5] = 'new' # Вставляем новое значение с 3 по 5 строку и со 2го по 5ый стоблец\n",
|
||
"data.head(8)"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "RZvBCkMCsiOT"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data = data_backup.copy() # Восстанавливаем данные из копии"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "SNliNx3TPCTX",
|
||
"outputId": "2f0ccacf-df6b-4499-c844-12be866957dc"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 204
|
||
},
|
||
"id": "qXlU-wqyhkzT",
|
||
"outputId": "b65e6f8a-0562-43d7-b09a-ff2c8f9eab35"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"data.loc[data.country == 'US', 'region_2'] = 'Syberia'\n",
|
||
"data.loc[data.price > 100, 'points'] = 200\n",
|
||
"data.loc[data.price > 100, 'price'] = 1000\n",
|
||
"data.head()"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"id": "6PumqSIU7Q7U"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"## Перевод в Numpy\n"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 51
|
||
},
|
||
"id": "Obs9TzQ9E8ss",
|
||
"outputId": "7ea94ae7-0ee5-4773-be88-14ab92522349"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"np_data = data.values # Получаем данные из датафрейма и записываем их в переменную np_data\n",
|
||
"print(np_data.shape) # Выводим размерность np_data\n",
|
||
"np_data.dtype"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 105
|
||
},
|
||
"id": "y1n1cNpdrFqQ",
|
||
"outputId": "f6f3c2df-9732-40ed-8569-5d53223b9a24"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"print(np_data[0]) # Выводим 0ой элемент из массива"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "code",
|
||
"execution_count": null,
|
||
"metadata": {
|
||
"colab": {
|
||
"base_uri": "https://localhost:8080/",
|
||
"height": 717
|
||
},
|
||
"id": "UsAcU8mBnWwn",
|
||
"outputId": "4eec37c8-c418-4140-de98-c4ec5b6bbe8b"
|
||
},
|
||
"outputs": [],
|
||
"source": [
|
||
"# Выведем первые 10 элементов из np_data\n",
|
||
"for i in range(10):\n",
|
||
" print(np_data[i])"
|
||
]
|
||
},
|
||
{
|
||
"cell_type": "markdown",
|
||
"metadata": {
|
||
"id": "8F6G2AUnKFCA"
|
||
},
|
||
"source": [
|
||
"# **Глоссарий**\n",
|
||
"\n",
|
||
"\n",
|
||
"pd.DataFrame(данные, columns = [колонки, если есть], index = [индексы ,если есть]) - создать датафрейм\n",
|
||
"\n",
|
||
"pd.read_csv(полный адрес расположения файла) - открыть .csv файл\n",
|
||
"\n",
|
||
"------------\n",
|
||
"\n",
|
||
".head() - посмотреть верхушку датафрейма (первые n строк)\n",
|
||
"\n",
|
||
".tail() - посмотреть конец датафрейма (последние n строк)\n",
|
||
"\n",
|
||
".columns - список колонок датафрейма\n",
|
||
"\n",
|
||
".values - вывести массив всех значений датафрейма\n",
|
||
"\n",
|
||
".index - список индексов датафрейма\n",
|
||
"\n",
|
||
".tolist() - перевести в список\n",
|
||
"\n",
|
||
".count() - посчитать количество определенных величин во фрейме\n",
|
||
"\n",
|
||
".describe() - посмотреть основные статистические характеристики фрейма\n",
|
||
"\n",
|
||
".shape - форма фрейма (строки, колонки)\n",
|
||
"\n",
|
||
".size - размер фрейма строки*колонки\n",
|
||
"\n",
|
||
".info() - информация о данных каждой колонки\n",
|
||
"\n",
|
||
".dtypes - тип данных каждой колонки\n",
|
||
"\n",
|
||
".isnull() - где недостает значений\n",
|
||
"\n",
|
||
".isna()- есть ли значения None\n",
|
||
"\n",
|
||
".dropna() - выкинуть строки/колонки с None\n",
|
||
"\n",
|
||
".fillna() - заполнить заданным значеним ячейки, где есть None\n",
|
||
"\n",
|
||
".loc[] - вывести значения по названиям колонок\n",
|
||
"\n",
|
||
".iloc[] - вывести значения по индексам колонок\n",
|
||
"\n",
|
||
".drop() - выкинуть определенные значения\n",
|
||
"\n",
|
||
"--------------\n",
|
||
"\n",
|
||
"pd.to_datetime(колонка, которую переводим в формат временного ряда)\n",
|
||
"\n",
|
||
".groupby() - сгруппировать по конкретному признаку\n",
|
||
"\n",
|
||
".copy() - создать копию\n",
|
||
"\n",
|
||
".sort_values() - сортировка значений\n",
|
||
"\n",
|
||
"pd.concat([df1,df2]) - конкатенация фреймов\n",
|
||
"\n",
|
||
".merge(второй_датафрейм, on = 'общая колонка, по которой склеиваем', how = 'с какой стороны') - конкатенация фреймов через общий признак\n",
|
||
"\n",
|
||
"-------------\n",
|
||
"\n",
|
||
"\n",
|
||
".corr() - вычислить корреляцию\n",
|
||
"\n",
|
||
".median() - вычислить медиану\n",
|
||
"\n",
|
||
".cumsum() - вычислить куммулятивную сумму\n",
|
||
"\n",
|
||
".cumprod() - вычислить коммулятивное произведение\n",
|
||
"\n",
|
||
".cummax() - вычислить коммулятивный максимум\n",
|
||
"\n",
|
||
"-------------\n",
|
||
"\n",
|
||
".quantile([]) - вычислить квантили\n",
|
||
"\n",
|
||
".nunique() - уникальные значения для n-колонок/строк\n",
|
||
"\n",
|
||
".unique() - уникальные значения определенной колонки/строк\n",
|
||
"\n",
|
||
"------------\n",
|
||
"\n",
|
||
".apply(функция) - применить функцию для колонки/строки\n",
|
||
"\n",
|
||
".agg(набор_функций) - применить ряд функций для колонки/строки\n"
|
||
]
|
||
}
|
||
],
|
||
"metadata": {
|
||
"colab": {
|
||
"provenance": [],
|
||
"toc_visible": true
|
||
},
|
||
"kernelspec": {
|
||
"display_name": "Python 3 (ipykernel)",
|
||
"language": "python",
|
||
"name": "python3"
|
||
},
|
||
"language_info": {
|
||
"codemirror_mode": {
|
||
"name": "ipython",
|
||
"version": 3
|
||
},
|
||
"file_extension": ".py",
|
||
"mimetype": "text/x-python",
|
||
"name": "python",
|
||
"nbconvert_exporter": "python",
|
||
"pygments_lexer": "ipython3",
|
||
"version": "3.13.2"
|
||
}
|
||
},
|
||
"nbformat": 4,
|
||
"nbformat_minor": 4
|
||
}
|