- [[scikit-learn]], [[sklearn pipeline]]
- [course page](https://courses.dataschool.io/courses/building-an-effective-machine-learning-workflow-with-scikit-learn/723182-introduction)
- [[sklearn - save models]]
# Tips
- `NaN` (missing) values are a different problem from categories that appear in the test data but were never seen in the training data
- Most scikit-learn estimators do not accept `NaN` values; a notable exception is histogram-based gradient boosting (`HistGradientBoostingClassifier`, `HistGradientBoostingRegressor`), which handles missing values natively without imputing them
- Drop missing rows only if you don't lose too much data and if data are missing at random.
- If the pipeline imputes missing values, `pipe.fit(X, y)` learns the imputation values from `X`, and `pipe.predict(X_new)` fills missing values in the testing data `X_new` with those values learned from `X`, not with values computed from `X_new`!
- Why? The model can only **learn** from training data, not testing data! Whatever is learned from the training data gets applied to the testing data.
- Training data is where transformers learn how to encode the data. Those encodings are applied to the test data.
- Conceptually, predictions are made one row at a time (though under the hood it's done differently for efficiency reasons), so nothing can be learned during testing; the pipeline has to use the knowledge it learned during training!
- All pipeline steps other than the final step must be transformers. The final step can be a transformer or a model.
- Steps in pipelines run sequentially; steps in column transformers run in parallel, independently.
- 5-fold cross-validation is often recommended based on empirical findings.
- We can cross-validate entire pipelines!
- When given a pipeline, `cross_val_score` splits the data first and then fits the transformers on each training fold only (split then transform, not transform then split) to avoid [[data leakage]]
- **Hyper-parameters** are values we define/set before fitting; **parameters** are learned by the model during fitting
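
The imputation tips above can be made concrete with a toy sketch. The arrays below are made-up numbers (not the Titanic data), and `SimpleImputer` with `strategy="mean"` is used only as an example of a transformer that learns something during `fit`:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy training and "new" data (made-up numbers, for illustration only)
X_train = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_new = np.array([[np.nan], [100.0]])

imp = SimpleImputer(strategy="mean")
imp.fit(X_train)                      # learns the mean of the non-missing training values

train_mean = imp.statistics_[0]       # mean of 10, 20 and 30
X_new_imputed = imp.transform(X_new)  # fills NaN with the training mean, not X_new's own mean

print(train_mean)
print(X_new_imputed)
```

The missing value in `X_new` is filled with the mean learned from `X_train`; the value 100 in `X_new` plays no part in the imputation.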
# Code
http://bit.ly/first-ml-lesson
Simple pipeline
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]
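# one-hot encode the categorical columns; build a document-term matrix from the names
ohe = OneHotEncoder()
vect = CountVectorizer()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')  # 'Parch' and 'Fare' pass through unchanged
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)       # transformers and model learn from the training data only
pipe.predict(X_new)  # what was learned is applied to the new data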
```
Simple pipeline with `GridSearchCV`
```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
```
Complex or full pipeline
http://bit.ly/complex-pipeline
```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']
df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()
imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()
# steps in column transformers are independent!
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')
logreg = LogisticRegression(solver='liblinear', random_state=1)
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()
pipe.named_steps.keys()
params = {}
params["logisticregression__penalty"] = ["l1", "l2"]
params["logisticregression__C"] = [0.1, 1, 10]
params["columntransformer__pipeline__onehotencoder__drop"] = [None, "first"]
params["columntransformer__countvectorizer__ngram_range"] = [(1, 1), (1, 2)]
params["columntransformer__simpleimputer__add_indicator"] = [False, True]
grid = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
grid.fit(X, y)
results = pd.DataFrame(grid.cv_results_)
results.sort_values("rank_test_score")
grid.best_score_
grid.best_params_
grid.predict(X_new) # predict with best model
# get best model
m = grid.best_estimator_
m.predict(X_new)
# save model
from joblib import dump, load
import sklearn
dump(grid, f"_mod_{sklearn.__version__}.joblib")
g = load(f"_mod_{sklearn.__version__}.joblib")
g.predict(X_new)
```
![[Pasted image 20210514190846.png]]
```python
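# missing values become their own one-hot category (scikit-learn >= 0.24)
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# dtype=object keeps np.nan as a real missing value instead of the string 'nan'
X = np.array([["a", np.nan, "b"]], dtype=object).reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2; older versions use sparse=False
ohe.fit_transform(X)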
```
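
The tip that `NaN` values differ from categories unseen in training can be sketched in the same way. This is a minimal illustration on made-up data: with the default `handle_unknown='error'`, an unseen category raises an error at transform time, while `handle_unknown='ignore'` encodes it as all zeros. The variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up data: "c" appears only in the new data, never in training
X_train = np.array([["a"], ["b"], ["a"]], dtype=object)
X_new = np.array([["a"], ["c"]], dtype=object)

# Default behaviour: an unseen category raises a ValueError at transform time
strict = OneHotEncoder().fit(X_train)
try:
    strict.transform(X_new)
    raised = False
except ValueError:
    raised = True

# handle_unknown="ignore": the unseen category is encoded as a row of zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(X_train)
encoded = lenient.transform(X_new).toarray()

print(raised)
print(encoded)
```

Whether ignoring unseen categories is appropriate depends on the problem; the sketch only shows that this is a separate setting from missing-value handling.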
```python
# get learned attributes of fitted transformers (run after pipe.fit)
ct.named_transformers_.simpleimputer.statistics_
```
# Setting hyper-parameters
```python
pipe.named_steps.keys()
```
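
The step names returned by `named_steps` are the prefixes used to address hyper-parameters, in the form `<step name>__<parameter name>`. The following is a minimal sketch on a small stand-in pipeline (`pipe_demo` is a hypothetical name, not the Titanic pipeline above) that sets two hyper-parameters with `set_params` and reads them back with `get_params`:

```python
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small stand-in pipeline, for illustration only
pipe_demo = make_pipeline(SimpleImputer(), LogisticRegression(solver="liblinear"))

# make_pipeline names each step after its lowercased class name
step_names = list(pipe_demo.named_steps.keys())

# Hyper-parameters are addressed as <step name>__<parameter name>
pipe_demo.set_params(simpleimputer__strategy="median", logisticregression__C=10)

current = pipe_demo.get_params()
print(step_names)
print(current["simpleimputer__strategy"], current["logisticregression__C"])
```

The same `step__param` names are what `GridSearchCV` expects as keys in its parameter grid.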