course - building effective ML workflow with scikit-learn - Kevin Markham

- [[scikit-learn]], [[sklearn pipeline]]
- [course page](https://courses.dataschool.io/courses/building-an-effective-machine-learning-workflow-with-scikit-learn/723182-introduction)
- [[sklearn - save models]]

# Tips

- `NaN` values are different from categories in the test data that were not seen in the training data.
- Most estimators do not accept `NaN` values. A notable exception is histogram-based gradient boosting trees, which handle missing values without imputing them.
- Drop rows with missing values only if you don't lose too much data and the data are missing at random.
- If a pipeline fitted with `pipe.fit(X, y)` imputes missing values, `pipe.predict(X_new)` fills the gaps in the testing data `X_new` with the value learned from `X`, not from `X_new`!
    - Why? The model can only **learn** from training data, not testing data. Whatever is learned from the training data gets applied to the testing data.
    - Training data is where transformers learn how to encode the data. Those encodings are then applied to the test data.
    - Conceptually, predictions are made one row at a time (though under the hood it's done differently for efficiency reasons), so nothing can be learned during testing; the pipeline has to use what it learned during training.
- All pipeline steps other than the final step must be transformers. The final step can be a transformer or a model.
- Steps in pipelines run sequentially; steps in column transformers run in parallel, independently.
- 5-fold cross-validation is often recommended based on empirical findings.
- We can cross-validate entire pipelines!
- `cross_val_score` applies column transformers **after** splitting the data (split then transform, not transform then split) to avoid [[data leakage]].
- **Hyper-parameters** are values we set ourselves; **parameters** are learned by the model during fitting.

# Code

http://bit.ly/first-ml-lesson

Simple pipeline

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']

df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]

ohe = OneHotEncoder()
vect = CountVectorizer()

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)
```

Simple pipeline with `GridSearchCV`

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svc = svm.SVC()
gs = GridSearchCV(svc, parameters)
gs.fit(iris.data, iris.target)
```

Complex or full pipeline

http://bit.ly/complex-pipeline

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]

# fill missing categories with the string 'missing', then one-hot encode
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()
imp_ohe = make_pipeline(imp_constant, ohe)

vect = CountVectorizer()
imp = SimpleImputer()

# steps in column transformers are independent!
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

# cross-validate the whole pipeline
cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

# step names are the prefixes used in the parameter grid below
pipe.named_steps.keys()

params = {}
params["logisticregression__penalty"] = ["l1", "l2"]
params["logisticregression__C"] = [0.1, 1, 10]
params["columntransformer__pipeline__onehotencoder__drop"] = [None, "first"]
params["columntransformer__countvectorizer__ngram_range"] = [(1, 1), (1, 2)]
params["columntransformer__simpleimputer__add_indicator"] = [False, True]

grid = GridSearchCV(pipe, params, cv=5, scoring="accuracy")
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
results.sort_values("rank_test_score")

grid.best_score_
grid.best_params_
grid.predict(X_new)  # predict with best model

# get best model
m = grid.best_estimator_
m.predict(X_new)

# save model
from joblib import dump, load
import sklearn

dump(grid, f"_mod_{sklearn.__version__}.joblib")
g = load(f"_mod_{sklearn.__version__}.joblib")
g.predict(X_new)
```

![[Pasted image 20210514190846.png]]

```python
# imputers create a category for missing values
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# note: np.array turns np.nan into the string 'nan' here because the other values are strings
X = np.array([["a", np.nan, "b"]]).reshape(-1, 1)
ohe = OneHotEncoder(sparse_output=False)  # `sparse=False` before scikit-learn 1.2
ohe.fit_transform(X)
```

```python
# get attributes of transformers
ct.named_transformers_.simpleimputer.statistics_
```

# Setting hyper-parameters

```python
pipe.named_steps.keys()
```
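
The step names returned by `named_steps` are what the double-underscore parameter names are built from. As a rough sketch of that naming scheme, the block below builds a small stand-in pipeline and inspects its names. The `toy` frame and every `toy_*` variable are made up for illustration and are not part of the course code; nothing is fitted.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data (not the Titanic set used above)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 30.0, 41.0],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ["Sex"]),
    (SimpleImputer(), ["Age"]),
)
toy_pipe = make_pipeline(toy_ct, LogisticRegression())

# make_pipeline / make_column_transformer name each step after its lowercased class
step_names = list(toy_pipe.named_steps.keys())

# Every setting is addressed as <step>__<substep>__<parameter>
param_names = sorted(toy_pipe.get_params().keys())

# The same names work with set_params and as keys in a GridSearchCV grid
toy_pipe.set_params(
    logisticregression__C=10,
    columntransformer__simpleimputer__add_indicator=True,
)
c_value = toy_pipe.get_params()["logisticregression__C"]

print(step_names)
print([p for p in param_names if p.endswith("__add_indicator")])
print(c_value)
```

Printing `param_names` in full is a quick way to find the exact key for a grid when a step sits several levels deep.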