


9 Predictive Analysis with Machine Learning Algorithms



In this chapter, we reach the top of the data analysis pyramid: making predictions from the data and putting the final model to use once it has been created.


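9.1 Exploratory Data Analysis

Chapter 8 explored the data, but the work there was mainly about supplementing and completing the data rather than analyzing it in detail. Next, we analyze the data in an exploratory way in the Jupyter Notebook environment. First, open Jupyter Notebook in the directory where the project was created.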


9.1.1 Environment Configuration and Data Loading



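You can view the data directly with the show() method. The output looks tidy in PyCharm but is much less friendly in Jupyter Notebook. For example, viewing the first 3 rows gives the following output.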






9.1.2 How Many Flights Are Delayed






9.1.3 How Many Flights Depart Late but Arrive Early





Combining the previous analyses, we can break down the basic delay statistics: the total number of flights, the number of departure delays, the number of arrival delays, the percentage of delayed flights, and so on. The final output is as follows.
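The breakdown described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the notebook's own code: the rows are hypothetical, the helper name `summarize_delays` is made up for this example, and `DepDelay` and `ArrDelay` are assumed to be delays in minutes, with negative values meaning the flight was early.

```python
# Hypothetical sample rows; DepDelay and ArrDelay are assumed to be minutes,
# and a negative value means the flight was early.
flights = [
    {"FlightNum": "1519", "DepDelay": 5.0, "ArrDelay": -3.0},
    {"FlightNum": "204", "DepDelay": 0.0, "ArrDelay": 0.0},
    {"FlightNum": "88", "DepDelay": 42.0, "ArrDelay": 37.0},
    {"FlightNum": "731", "DepDelay": -2.0, "ArrDelay": 4.0},
]


def summarize_delays(rows):
    """Count late departures and arrivals and the share of late arrivals."""
    total = len(rows)
    late_departures = sum(1 for r in rows if r["DepDelay"] > 0)
    late_arrivals = sum(1 for r in rows if r["ArrDelay"] > 0)
    # Flights that left late but still arrived on time or early
    made_up_time = sum(
        1 for r in rows if r["DepDelay"] > 0 and r["ArrDelay"] <= 0
    )
    return {
        "total_flights": total,
        "late_departures": late_departures,
        "late_arrivals": late_arrivals,
        "departed_late_arrived_on_time": made_up_time,
        "pct_late_arrivals": round(100.0 * late_arrivals / total, 1),
    }


summary = summarize_delays(flights)
print(summary)
```

On a Spark DataFrame the same counts would come from filters and `count()` calls; the plain-Python form is used here only so the snippet runs without a cluster.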


9.1.4 Average Delay Time



The delay percentage shows that about 1/3 of flights are at risk of delay. The total or average delay time can be calculated as well. The code and output are as follows. (The average departure delay is 9.4 and the average arrival delay is 4.4.)
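As a rough illustration of the averaging step, the sketch below computes mean delays over hypothetical values while skipping missing entries. The numbers are invented and do not reproduce the 9.4 and 4.4 figures quoted above; `average_delay` is a name made up for this example.

```python
from statistics import fmean

# Hypothetical delay values (assumed to be minutes); None marks a missing value
dep_delays = [5.0, 0.0, 42.0, -2.0, None]
arr_delays = [-3.0, 0.0, 37.0, 4.0, None]


def average_delay(values):
    """Mean of the non-missing values."""
    present = [v for v in values if v is not None]
    return fmean(present)


avg_dep = average_delay(dep_delays)
avg_arr = average_delay(arr_delays)
print("average departure delay:", avg_dep)
print("average arrival delay:", avg_arr)
```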


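9.1.5 Causes of Delay

According to the output fields, the table already divides the causes of delay into five types: WeatherDelay, CarrierDelay, ASDelay, SecurityDelay, LateAircraftDelay. The delay time for each of the five causes can be computed and summed. The code and output are as follows. (The first column is the total delay time and the remaining five columns are the per-cause breakdown.)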
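A minimal sketch of the per-cause totals, in plain Python on hypothetical rows. The column names are taken from the list above; the helper name `total_delay_by_cause` and the `TotalDelay` key are made up for this example, and missing values are assumed to count as zero.

```python
# Column names as listed in the text above
CAUSE_COLUMNS = [
    "WeatherDelay",
    "CarrierDelay",
    "ASDelay",
    "SecurityDelay",
    "LateAircraftDelay",
]

# Hypothetical rows; None stands for a missing value
rows = [
    {"WeatherDelay": 0, "CarrierDelay": 15, "ASDelay": 0,
     "SecurityDelay": 0, "LateAircraftDelay": 20},
    {"WeatherDelay": 30, "CarrierDelay": 0, "ASDelay": 5,
     "SecurityDelay": 0, "LateAircraftDelay": 0},
    {"WeatherDelay": None, "CarrierDelay": 10, "ASDelay": None,
     "SecurityDelay": None, "LateAircraftDelay": None},
]


def total_delay_by_cause(rows):
    """Sum each cause column and put the grand total first."""
    totals = {cause: 0 for cause in CAUSE_COLUMNS}
    for row in rows:
        for cause in CAUSE_COLUMNS:
            totals[cause] += row.get(cause) or 0  # treat None as 0
    result = {"TotalDelay": sum(totals.values())}
    result.update(totals)
    return result


cause_totals = total_delay_by_cause(rows)
print(cause_totals)
```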




9.2 Feature Engineering



9.2.1 Removing Null Values


filled_on_time_features = simple_on_time_features.filter(
    simple_on_time_features.ArrDelay.isNotNull()
    & simple_on_time_features.DepDelay.isNotNull()
)

9.2.2 Processing Time Data



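import iso8601
import datetime


def convert_hours(hours_minutes):
    # Split an 'HHMM' string into its hour and minute parts
    hours = hours_minutes[:-2]
    minutes = hours_minutes[-2:]
    # '2400' is not a valid time of day, so clamp it to 23:59
    if hours == '24':
        hours = '23'
        minutes = '59'
    time_string = "{}:{}:00Z".format(hours, minutes)
    return time_string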



A test call and its output are as follows. (The result is still a string, but it is now a time-formatted string ready for the subsequent steps.)



def compose_datetime(iso_date, time_string):
    return "{} {}".format(iso_date, time_string)



A test call and its output are as follows. (The function builds a complete date-time string.)



def create_iso_string(iso_date, hours_minutes):
    time_string = convert_hours(hours_minutes)
    full_datetime = compose_datetime(iso_date, time_string)
    return full_datetime



A test call and its output are as follows. (This function builds on the previous two functions.)
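To make the string-building chain concrete, here is a self-contained sketch that repeats the three helpers so it runs on its own. The sample date and times are hypothetical inputs chosen for illustration, not the values used in the original notebook.

```python
def convert_hours(hours_minutes):
    """Turn an 'HHMM' string such as '0730' into 'HH:MM:00Z'."""
    hours = hours_minutes[:-2]
    minutes = hours_minutes[-2:]
    # '2400' is not a valid time of day, so clamp it to 23:59
    if hours == '24':
        hours = '23'
        minutes = '59'
    return "{}:{}:00Z".format(hours, minutes)


def compose_datetime(iso_date, time_string):
    """Join a date string and a time string with a space."""
    return "{} {}".format(iso_date, time_string)


def create_iso_string(iso_date, hours_minutes):
    """Build a full date-time string from a date and an 'HHMM' time."""
    return compose_datetime(iso_date, convert_hours(hours_minutes))


print(create_iso_string("2015-01-01", "0730"))  # 2015-01-01 07:30:00Z
print(create_iso_string("2015-01-01", "2400"))  # 2015-01-01 23:59:00Z
```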



def create_datetime(iso_string):
    return iso8601.parse_date(iso_string)



A test call and its output are as follows.



def convert_datetime(iso_date, hours_minutes):
    iso_string = create_iso_string(iso_date, hours_minutes)
    dt = create_datetime(iso_string)
    return dt



A test call and its output are as follows.
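The parsing step turns the composed string into a timezone-aware datetime. The sketch below uses only the standard library as a stand-in for `iso8601.parse_date`, so that it runs without the third-party package. The helper name `parse_iso_utc` is hypothetical, and it handles only the exact `YYYY-MM-DD HH:MM:SSZ` shape produced by the helpers above.

```python
from datetime import datetime, timezone


def parse_iso_utc(iso_string):
    """Stand-in for iso8601.parse_date, limited to 'YYYY-MM-DD HH:MM:SSZ'."""
    naive = datetime.strptime(iso_string, "%Y-%m-%d %H:%M:%SZ")
    # The trailing 'Z' means UTC, so attach the UTC timezone explicitly
    return naive.replace(tzinfo=timezone.utc)


dt = parse_iso_utc("2015-01-01 07:30:00Z")
print(dt.isoformat())  # 2015-01-01T07:30:00+00:00
```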



def day_of_year(iso_date_string):
    dt = iso8601.parse_date(iso_date_string)
    doy = dt.timetuple().tm_yday
    return doy



A test call and its output are as follows.
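A quick way to check the day-of-year logic is shown below. It is a sketch that swaps `iso8601.parse_date` for the standard library's `date.fromisoformat`, which is enough for plain `YYYY-MM-DD` strings; the sample dates are hypothetical.

```python
from datetime import date


def day_of_year(iso_date_string):
    """Return the 1-based day of the year for a 'YYYY-MM-DD' string."""
    return date.fromisoformat(iso_date_string).timetuple().tm_yday


print(day_of_year("2015-01-01"))  # 1
print(day_of_year("2015-02-01"))  # 32
print(day_of_year("2016-12-31"))  # 366 (2016 is a leap year)
```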



def alter_feature_datetimes(row):
    flight_date = iso8601.parse_date(row['FlightDate'])
    scheduled_dep_time = convert_datetime(row['FlightDate'], row['CRSDepTime'])
    scheduled_arr_time = convert_datetime(row['FlightDate'], row['CRSArrTime'])

    # Handle overnight flights
    if scheduled_arr_time < scheduled_dep_time:
        scheduled_arr_time += datetime.timedelta(days=1)

    doy = day_of_year(row['FlightDate'])

    return {
        'FlightNum': row['FlightNum'],
        'FlightDate': flight_date,
        'DayOfWeek': int(row['DayOfWeek']),
        'DayOfMonth': int(row['DayOfMonth']),
        'DayOfYear': doy,
        'Carrier': row['Carrier'],
        'Origin': row['Origin'],
        'Dest': row['Dest'],
        'Distance': row['Distance'],
        'DepDelay': row['DepDelay'],
        'ArrDelay': row['ArrDelay'],
        'CRSDepTime': scheduled_dep_time,
        'CRSArrTime': scheduled_arr_time,
    }
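The overnight-flight adjustment in `alter_feature_datetimes` is easy to miss, so here it is in isolation. This is a sketch with hypothetical times; `fix_overnight_arrival` is a name introduced only for this example.

```python
from datetime import datetime, timedelta


def fix_overnight_arrival(scheduled_dep, scheduled_arr):
    """Move the arrival to the next day when it would precede the departure."""
    if scheduled_arr < scheduled_dep:
        scheduled_arr += timedelta(days=1)
    return scheduled_arr


# Both times are first built on the flight date, as in the function above
dep = datetime(2015, 1, 1, 23, 10)
arr = datetime(2015, 1, 1, 0, 45)

fixed_arr = fix_overnight_arrival(dep, arr)
print(fixed_arr)        # 2015-01-02 00:45:00
print(fixed_arr - dep)  # 1:35:00
```

Because both scheduled times start out on the same flight date, an arrival earlier than the departure can only mean the flight lands on the following day.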



Finally, the seventh function, alter_feature_datetimes, is applied to process the data. The output is as follows.




Then sort the data by the keys shown below, confirm that the data is correct, and save it locally. The code and results are as follows.

import datetime

sorted_features = timestamp_df.sort(
    timestamp_df.DayOfYear,
    timestamp_df.Carrier,
    timestamp_df.Origin,
    timestamp_df.Dest,
    timestamp_df.FlightNum,
    timestamp_df.CRSDepTime,
    timestamp_df.CRSArrTime,
)
sorted_features.show()

# Write a single output file, replacing any previous result
sorted_features.repartition(1).write.mode("overwrite").json("../data/simple_flight_delay_features.json")

9.3 Model Creation, Evaluation, and Saving

9.3.1 Module Import and Data Loading





Count the records that were read in and check how much memory the running process consumes.
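One possible way to do both checks is sketched below. It assumes the features are stored as JSON Lines (one JSON object per line) and measures memory with the standard library's `tracemalloc`; the original notebook may use a different tool. The file written here is a small hypothetical stand-in created in a temporary directory.

```python
import json
import os
import tempfile
import tracemalloc

# Hypothetical records standing in for the saved flight features
sample_records = [
    {"FlightNum": "1519", "Origin": "SFO", "Dest": "ATL", "DepDelay": 5.0, "ArrDelay": -3.0},
    {"FlightNum": "204", "Origin": "JFK", "Dest": "LAX", "DepDelay": 0.0, "ArrDelay": 0.0},
    {"FlightNum": "88", "Origin": "ORD", "Dest": "DEN", "DepDelay": 42.0, "ArrDelay": 37.0},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for record in sample_records:
            f.write(json.dumps(record) + "\n")

    # Track allocations made while the data is read in
    tracemalloc.start()
    with open(path, encoding="utf-8") as f:
        training_data = [json.loads(line) for line in f]
    current_bytes, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()

print("records read:", len(training_data))
print("memory in use (bytes):", current_bytes, "peak (bytes):", peak_bytes)
```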


9.3.2 Random Sampling and Label Processing





You can pick any record and inspect the basic structure of the data. For example, select the record at index 2 to view its contents.
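The sampling and label-separation steps might look like the sketch below. The data is hypothetical, the sample size and seed are arbitrary, and treating `ArrDelay` as the label is an assumption based on the deployment code later in the chapter, which predicts a delay from `DepDelay` and other fields.

```python
import random

# Hypothetical records; the real data comes from the saved feature file
training_data = [
    {"ArrDelay": float(i), "DepDelay": float(i + 1), "Origin": "SFO", "Dest": "ATL"}
    for i in range(10)
]

random.seed(42)  # arbitrary seed so the sample is repeatable
sampled = random.sample(training_data, k=5)

# Separate the label (assumed to be ArrDelay) from the remaining features
labels = [row["ArrDelay"] for row in sampled]
features = [
    {key: value for key, value in row.items() if key != "ArrDelay"}
    for row in sampled
]

# Inspect the record at index 2
print(features[2])
print(labels[2])
```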




9.3.3 Processing Fields in the Dataset





9.3.4 Splitting the Dataset



9.3.5 Model Creation and Evaluation



First, a linear regression model is created in four steps: import the model to be used, initialize the model, fit the model, and make predictions with the model.
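The four steps can be sketched with scikit-learn as follows. The training data here is a tiny hypothetical set with a single numeric feature, chosen so the result is easy to check by hand; the real features are far richer.

```python
# Step 1: import the model to be used
from sklearn.linear_model import LinearRegression

# Hypothetical data: departure delay as the only feature, arrival delay as
# the label; every flight here makes up exactly 2 minutes in the air
X_train = [[0.0], [10.0], [20.0], [30.0]]
y_train = [-2.0, 8.0, 18.0, 28.0]

# Step 2: initialize the model
regressor = LinearRegression()

# Step 3: fit the model to the training data
regressor.fit(X_train, y_train)

# Step 4: predict for data the model has not seen
predicted = regressor.predict([[15.0]])
print(predicted[0])
```

Since the hypothetical labels follow `ArrDelay = DepDelay - 2` exactly, the prediction for a 15-minute departure delay should come out very close to 13.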






The evaluation score of a single model gives only a rough sense of how well it fits the data; the score becomes meaningful only when it is compared with the scores of other models. Next, a gradient boosting model is built (with a large amount of data, this affects the running speed of the program) and evaluated with the same criteria. The code and output results are as follows.
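To illustrate why scores are only meaningful in comparison, the sketch below scores two sets of predictions against the same labels. The metrics, median absolute error and R², are an assumption; the text does not say which ones the notebook uses. All numbers are hypothetical, and the metrics are written out in plain Python so the arithmetic is visible.

```python
from statistics import fmean, median


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between labels and predictions."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical test labels and the predictions of two models
y_test = [10, 0, 25, -5, 40]
pred_model_a = [12, 3, 20, 0, 30]
pred_model_b = [11, 1, 24, -3, 37]

scores = {
    name: {
        "median_absolute_error": median_absolute_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    for name, pred in [("model_a", pred_model_a), ("model_b", pred_model_b)]
}
print(scores)
```

Using identical metrics and identical test data for every model is what makes the comparison fair; here the second set of predictions is better on both measures.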


9.3.6 Saving the Model



9.4 Model Deployment



Many predictive model systems die in the lab where they were created, largely because people don't know how to deploy a model online. Deploying a prediction system is our next important task, and it is also a key skill that turns a data scientist into a veteran.

9.4.1 Predicting a Single Record



Taking a single test record as an example, the deployment steps break down as follows:


import joblib

# Load the saved vectorizer and regression model
vectorizer = joblib.load("../models/sklearn_vectorizer.pkl")
regressor = joblib.load("../models/sklearn_regressor.pkl")


# Build the feature dictionary for one flight
prediction_features = {}
prediction_features['DepDelay'] = 5
prediction_features['Origin'] = 'SFO'
prediction_features['Dest'] = 'ATL'
prediction_features['FlightNum'] = 1519
prediction_features['Carrier'] = 'AA'
print(prediction_features)


feature_vectors = vectorizer.transform(prediction_features)


result = regressor.predict(feature_vectors)[0]
print("The delay time is: " + str(round(result, 0)))



9.4.2 Loading the Model in a Web Page for Prediction


import joblib

# Load the saved vectorizer and regression model once, at startup
vectorizer = joblib.load("../../models/sklearn_vectorizer.pkl")
regressor = joblib.load("../../models/sklearn_regressor.pkl")


@app.route("/flights/delays/predict/regress", methods=['POST'])
def regress_flight_delays():
    # Form fields accepted by the API and the type each one is converted to
    api_field_type_map = \
        {
            "DepDelay": int,
            "Carrier": str,
            "FlightDate": str,
            "Dest": str,
            "FlightNum": str,
            "Origin": str
        }

    # Read and convert the submitted form values
    api_form_values = {}
    for api_field_name, api_field_type in api_field_type_map.items():
        api_form_values[api_field_name] = request.form.get(api_field_name, type=api_field_type)

    # Copy the fields the model uses directly
    prediction_features = {}
    prediction_features['DepDelay'] = api_form_values['DepDelay']
    prediction_features['Origin'] = api_form_values['Origin']
    prediction_features['Dest'] = api_form_values['Dest']
    prediction_features['FlightNum'] = api_form_values['FlightNum']
    prediction_features['Carrier'] = api_form_values['Carrier']

    # Derive the date features from the flight date
    date_features_dict = predict_utils.get_regression_date_args(api_form_values['FlightDate'])
    for api_field_name, api_field_value in date_features_dict.items():
        prediction_features[api_field_name] = api_field_value

    # Vectorize the features and predict the delay
    feature_vectors = vectorizer.transform([prediction_features])
    result = regressor.predict(feature_vectors)[0]

    result_obj = {"Delay": round(result, 0)}
    return json.dumps(result_obj)
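The form-handling loop in the route above relies on `request.form.get(name, type=...)` to convert each submitted string. The sketch below imitates that conversion on a plain dictionary so it can be tried without a running Flask app. The helper name `coerce_form_values` is hypothetical, and returning `None` for missing or unconvertible values reflects my understanding of the default behavior of `get` with a `type` argument.

```python
API_FIELD_TYPE_MAP = {
    "DepDelay": int,
    "Carrier": str,
    "FlightDate": str,
    "Dest": str,
    "FlightNum": str,
    "Origin": str,
}


def coerce_form_values(form, field_type_map):
    """Convert raw string form values to the types the model expects.

    Missing fields and values that cannot be converted become None.
    """
    values = {}
    for name, field_type in field_type_map.items():
        raw = form.get(name)
        try:
            values[name] = field_type(raw) if raw is not None else None
        except ValueError:
            values[name] = None
    return values


# Hypothetical submitted form: every value arrives as a string
form = {"DepDelay": "5", "Carrier": "AA", "FlightDate": "2016-12-25",
        "Dest": "ATL", "FlightNum": "1519"}
api_form_values = coerce_form_values(form, API_FIELD_TYPE_MAP)
print(api_form_values)
```

A `None` in the result is a signal that the request was incomplete or malformed, which a production route would probably want to reject before calling the model.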


import sys, os, re
import pymongo
import datetime, iso8601


def process_search(results):
  """Process elasticsearch hits and return flights records"""
  records = []
  total = 0
  if results['hits'] and results['hits']['hits']:
    total = results['hits']['total']
    hits = results['hits']['hits']
    for hit in hits:
      record = hit['_source']
      records.append(record)
  return records, total


def get_navigation_offsets(offset1, offset2, increment):
  """Calculate offsets for fetching lists of flights from MongoDB"""
  offsets = {}
  offsets['Next'] = {'top_offset': offset2 + increment, 'bottom_offset': offset1 + increment}
  offsets['Previous'] = {'top_offset': max(offset2 - increment, 0),
                         'bottom_offset': max(offset1 - increment, 0)}  # Don't go < 0
  return offsets


def strip_place(url):
  """Strip the existing start and end parameters from the query string"""
  try:
    p = re.match(r'(.+)\?start=.+&end=.+', url).group(1)
  except AttributeError as e:
    return url
  return p


def get_flight_distance(client, origin, dest):
  """Get the distance between a pair of airport codes"""
  query = {
    "Origin": origin,
    "Dest": dest,
  }
  record = client.example.origin_dest_distances.find_one(query)
  return record["Distance"]


def get_regression_date_args(iso_date):
  """Given an ISO Date, return the day of year, day of month, day of week as the API expects them."""
  print(iso_date)
  dt = iso8601.parse_date(iso_date)
  print(dt)
  day_of_year = dt.timetuple().tm_yday
  print(day_of_year)
  day_of_month = dt.day
  day_of_week = dt.weekday()
  print(day_of_week)
  return {
    "DayOfYear": day_of_year,
    "DayOfMonth": day_of_month,
    "DayOfWeek": day_of_week,
  }


def get_current_timestamp():
  iso_now = datetime.datetime.now().isoformat()
  return iso_now
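The date helper is the piece of this module that the prediction route depends on. The sketch below shows what it returns for one hypothetical date, using the standard library's `date.fromisoformat` in place of `iso8601.parse_date` and omitting the debug prints.

```python
from datetime import date


def get_regression_date_args(iso_date):
    """Return day of year, day of month, and day of week for a 'YYYY-MM-DD' date."""
    dt = date.fromisoformat(iso_date)
    return {
        "DayOfYear": dt.timetuple().tm_yday,
        "DayOfMonth": dt.day,
        "DayOfWeek": dt.weekday(),  # Monday is 0, Sunday is 6
    }


date_args = get_regression_date_args("2016-12-25")
print(date_args)  # {'DayOfYear': 360, 'DayOfMonth': 25, 'DayOfWeek': 6}
```

Note that `weekday()` counts from 0 for Monday. If the `DayOfWeek` values in the training data follow a different convention, the two would need to be aligned before prediction.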


@app.route("/flights/delays/predict")
def flight_delays_page():
    # Fields shown in the prediction form; 'colname' is the label displayed on the page
    form_config = [
        {'field': 'DepDelay', 'label': 'Departure Delay', 'colname': 'Departure delay time'},
        {'field': 'Carrier', 'colname': 'Airline'},
        {'field': 'FlightDate', 'label': 'Date', 'colname': 'Flight date'},
        {'field': 'Origin', 'colname': 'Origin'},
        {'field': 'Dest', 'label': 'Destination', 'colname': 'Destination'},
        {'field': 'FlightNum', 'label': 'Flight Number', 'colname': 'Flight number'},
    ]
    return render_template('flight_delays_predict.html', form_config=form_config)


{% extends "index.html" %}
{% block body2 %}

/ <a href="/flights/delays/predict">Predict Flight Delays</a>
<p class="lead" style="margin: 10px; margin-left: 0px;">
</p>

<form id="flight_delay_regression" action="/flights/delays/predict/regress" method="post">
{% for item in form_config %}
{% if 'label' in item %}
<label for="{{item['field']}}">{{item['colname']}}</label>
{% else %}
<label for="{{item['field']}}">{{item['colname']}}</label>
{% endif %}
<input name="{{item['field']}}" style="width: 36px; margin-right: 10px;" value=""></input>
{% endfor %}
<button type="submit" class="btn btn-xs btn-default" style="height:">Submit Query</button>
</form>
<div style="margin-top: 10px;">
<p>Estimated flight delay: <span id="result" style="display: inline-block;"></span> minutes</p>
</div>
<script>// Attach a submit handler to the form
$( "#flight_delay_regression" ).submit(function( event ) {
// Stop form from submitting normally
event.preventDefault();
// Get some values from elements on the page:
var $form = $( this ),
term = $form.find( "input[name='s']" ).val(),
url = $form.attr( "action" );
// Send the data using post
var posting = $.post( url, $( "#flight_delay_regression" ).serialize() );
// Put the results in a div
posting.done(function( data ) {
result = JSON.parse(data);
$( "#result" ).empty().append( result.Delay );
});
});
</script>
{% endblock %}





You can then fill in the form, for example with the single test record used earlier, and get the final prediction by clicking the submit query button. (The output is consistent with the result of the single-record test.)




You can change the input data and test again. The input and output are as follows.


Original: https://blog.51cto.com/u_15713987/5462732
Author: 百木从森
Title: [Big Data Hands-On Project 8] Predictive Analysis with Machine Learning Algorithms and Online Deployment





