Abstract: Flowering forecasts inform recommendations for orchard cleaning, pest control, field management and fertilization, which can help increase tree vigor and resistance. Flowering forecasting is an important part of both the agro-meteorological index system and the meteorological service system. In this paper, by analyzing local meteorological data and phenological data of “Red Fuji” apples in Ji County, Linfen City, Shanxi Province, with the help of machine learning and neural networks, we propose a method that combines time series forecasting with classification forecasting to build a dynamic forecasting model of local flowering in Ji County. We then evaluate the effectiveness of the model based on the number of error days and the number of days in advance. The results show that the proposed multivariable LSTM network predicts the meteorological factors well, with a model loss below 0.2. In the binary classification task of flowering judgment, applying the combining strategies of ensemble learning improves the result: the AUC rises from 0.81 and 0.80 for the single RF and AdaBoost models, respectively, to 0.82. The proposed model has high applicability and accuracy for flowering forecasting. The model also avoids the problem of rounding decimal values that arises when flowering dates are predicted by regression.
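The combining step described above can be illustrated with a minimal sketch. It assumes that the two classifiers output a flowering probability per day and that the combining strategy is a plain average of those probabilities (soft voting); the abstract does not state which strategy was used. The labels and scores are hypothetical toy values, not data from the paper, and `auc` and `soft_vote` are illustrative helper names.

```python
def auc(labels, scores):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive sample gets the higher score (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def soft_vote(*score_lists):
    """Average the per-sample scores of several models (assumed strategy)."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# Hypothetical toy data: 1 = flowering, 0 = not flowering.
labels = [0, 0, 0, 1, 1, 1]
rf_scores = [0.2, 0.6, 0.3, 0.5, 0.9, 0.7]   # stand-in for RF probabilities
ada_scores = [0.4, 0.3, 0.6, 0.7, 0.5, 0.8]  # stand-in for AdaBoost probabilities

combined = soft_vote(rf_scores, ada_scores)
print(auc(labels, rf_scores), auc(labels, ada_scores), auc(labels, combined))
```

In this toy data each single model ranks 8 of the 9 positive-negative pairs correctly, while the averaged scores rank all 9 correctly, because the two models make their ranking mistakes on different samples.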
Abstract: In this study, we analyze brain activity data from functional magnetic resonance imaging (fMRI) of 820 subjects, each scanned at 4 different times. The repeated scanning lets us compare the consistency of imaging characteristics within subjects against the variability across subjects. The most consistent characteristics are then used to predict subjects’ traits. We concentrate on four predictive methods (Regression, Logistic Regression, Linear Discriminant Analysis and Random Forest) to predict traits such as gender and age from the activity observed between brain regions. The predictions are based on the adjusted communication activity among the brain regions, as assessed from the 4 scans of each subject. Because the 116 brain regions give rise to a large number of such communications, we performed a preliminary selection of the most promising pairs of brain regions. Logistic Regression performed best in classifying subject gender from the communication activity among brain regions. The accuracy rate was 85.6 percent for a Logistic Regression model selected by AIC step-wise selection, whereas the Logistic Regression model that retained the entire set of ranked predictors reached an accuracy rate of 87.7 percent. Interestingly, the model with the AIC-selected features was better at classifying males, whereas the complete ranked model was better at classifying females. The Random Forest technique performed best for the prediction of age (grouped into five categories as provided by the original data), with a 48.8 percent accuracy rate. Any set of between 200 and 1600 predictors gave similar accuracy rates.
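For reference, the criterion behind AIC step-wise selection is the standard Akaike information criterion, given here in its usual textbook form rather than as a formula quoted from the paper:

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Here $k$ is the number of estimated parameters and $\hat{L}$ is the maximized likelihood of the model. A step-wise procedure adds or removes one predictor at a time and keeps the change only if it lowers the AIC, so a predictor must improve the fit enough to offset the penalty of 2 for the extra parameter.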
Abstract: Network-based intrusion detection has become a common task for evaluating machine learning algorithms. Although the KDD Cup’99 dataset is imbalanced across its intrusion classes, it still plays a significant role in evaluating machine learning algorithms. In this work, we use the singular value decomposition (SVD) technique for feature dimension reduction. We then reconstruct the features from the reduced features and the selected eigenvectors. The reconstruction loss is used to decide the intrusion class for a given network feature: the class with the smallest reconstruction loss is accepted as the intrusion class for that sample. The proposed system yields 97.90% accuracy on the KDD Cup’99 dataset for the stated task. We also analyzed the system on the individual intrusion categories separately. This analysis suggests using an ensemble of multiple classifiers; therefore, we also created a random forest classifier. The random forest classifier performs significantly better than the SVD-based system, achieving 99.99% accuracy for intrusion detection on the same training and testing data set.
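The reconstruction-loss rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each class is represented by a single leading right singular vector (an eigenvector of X^T X) found by power iteration, whereas the paper presumably keeps several; the two-feature samples and the class names are hypothetical.

```python
import math


def top_singular_vector(rows, iters=200):
    """Leading right singular vector of the matrix `rows`,
    computed by power iteration on X^T X."""
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        # proj = X v, then w = X^T proj
        proj = [sum(r[j] * v[j] for j in range(dim)) for r in rows]
        w = [sum(proj[i] * rows[i][j] for i in range(len(rows))) for j in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


def reconstruction_loss(x, v):
    """Squared error after projecting x onto v and reconstructing it."""
    coeff = sum(a * b for a, b in zip(x, v))
    return sum((a - coeff * b) ** 2 for a, b in zip(x, v))


def fit(samples_by_class):
    """One basis vector per class (rank-1 simplification, an assumption)."""
    return {label: top_singular_vector(rows) for label, rows in samples_by_class.items()}


def predict(model, x):
    """Return the class whose basis reconstructs x with the smallest loss."""
    return min(model, key=lambda label: reconstruction_loss(x, model[label]))


# Hypothetical two-feature training samples for two classes.
training = {
    "normal": [[1.0, 0.1], [2.0, 0.2], [3.0, 0.2]],
    "dos": [[0.1, 1.0], [0.2, 2.0], [0.3, 3.1]],
}
model = fit(training)
print(predict(model, [4.0, 0.3]), predict(model, [0.2, 2.5]))
```

A sample that lies close to a class's dominant direction is reconstructed almost exactly from that class's basis and poorly from the others, which is what makes the smallest loss a usable class decision.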
Abstract: This paper reviews sixty-eight research articles published between 2000 and 2017, as well as textbooks, that employed four classification algorithms as their main statistical tools: K-Nearest-Neighbor (KNN), Support Vector Machines (SVM), Random Forest (RF) and Neural Network (NN). The aim was to examine and compare these nonparametric classification methods on the following attributes: robustness to training data, sensitivity to changes, data fitting, stability, ability to handle large data sizes, sensitivity to noise, time invested in parameter tuning, and accuracy. The performance, strengths and shortcomings of each algorithm were examined, and a conclusion was reached on which performs best. The literature reviewed indicates that RF is too sensitive to small changes in the training dataset, is occasionally unstable, and tends to overfit. KNN is easy to implement and understand but becomes significantly slower as the size of the data grows, and the ideal value of K for the KNN classifier is difficult to set. SVM and RF are insensitive to noise or overtraining, which shows their ability to deal with unbalanced data. Larger input datasets lengthen classification times for NN and KNN more than for SVM and RF. Among these nonparametric classification methods, NN has the potential to become a more widely used classification algorithm, but because of its time-consuming parameter tuning procedure, its high computational complexity, the numerous types of NN architectures to choose from and the large number of training algorithms, most researchers recommend SVM and RF as easier and widely used methods that repeatedly achieve high accuracy and are often faster to implement.
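The slowdown of KNN on larger datasets follows directly from how a brute-force implementation works, as the minimal sketch below shows. The function name, the Euclidean distance and the toy points are assumptions made for illustration; practical libraries often use tree or hashing indexes to reduce the cost.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Brute-force KNN: every prediction measures the distance from the
    query to every training point, so the cost of one query grows
    linearly with the size of the training set."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

The parameter `k` is the value of K that the review describes as difficult to set: a small value makes the vote sensitive to individual noisy points, while a large value lets distant points of another class outvote the true neighbors.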
Abstract: Multi-level, multi-scale resource selection models built with machine learning were compared for generating predictive maps of jaguar (Panthera onca) habitat in the Brazilian Pantanal. Multiple spatial scales and temporal movement levels were run within several analytical modeling frameworks for comparison. The analysis included multi-scale raster grains (30 m, 90 m, 180 m, 360 m, 720 m, 1440 m) and GPS collaring temporal movement levels (point, path, and step). Various analytical methods that could accommodate the structural levels of the data (group, individual, case-control) were compared. The models compared included conditional logistic regression, generalized additive modeling (GAM), and classification and regression tree methods such as random forests (RF) and gradient boosted regression trees (GBM). The goal of the study was to discuss the potential and limitations of machine learning methods that use GPS collaring data to produce predictive habitat suitability maps at the various scales and levels available. Results indicated that choosing the appropriate temporal level and raster scale improved model outputs. Overall, the larger-level analytical modeling frameworks and those that used multi-scale raster grains showed the best model evaluation, with the inherent caveat that they predict a broader scale and subset of the data. Identifying the appropriate spatial scale, temporal scale and statistical model needs careful consideration in predictive mapping efforts.
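One way to derive coarser raster grains from a fine one is block averaging, sketched below. This is an assumption made for illustration: the abstract does not say how the 90 m to 1440 m grains were produced, and `aggregate` is a hypothetical helper. A factor of 3 corresponds to going from 30 m to 90 m cells.

```python
def aggregate(grid, factor):
    """Coarsen a raster by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by the factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            block = [grid[r + i][c + j] for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Hypothetical 6 x 6 raster of 30 m cells; the value is just the cell index.
fine = [[r * 6 + c for c in range(6)] for r in range(6)]

coarse_90m = aggregate(fine, 3)  # 2 x 2 raster of 90 m cells
print(coarse_90m)
```

Each coarse cell summarizes a larger neighborhood, which is why the choice of grain changes what a habitat model can detect.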
Abstract: It is well known that injuries have a major influence on an NBA match and may reverse the result between two teams with a wide disparity in strength. In this article, to reduce the uncertainty about this risk in an upcoming match, we propose a pipeline that gathers player-level data, including fundamental statistics and performance in the previous match, and team-level data, including basic information and the opposing team’s status in the match being predicted. Constrained by limited and extremely unbalanced data, our model showed limited power for injury prediction overall, but it performed reasonably well on injuries to a team’s star player. We also analyze the contribution of each factor to the prediction. The analysis indicates that a player’s own performance matters most for their injury. Principal Component Analysis is also applied to reduce the dimension of the data and to show the correlation between different features.
Abstract: Software programs are always prone to change, for several reasons. In a software product line, change is more frequent because many software units are carried over from one release to the next, and new files are added to the reused files. In this work, we explore the possibility of building a model that can predict which files have a high chance of changing from one release to another. Knowing which files are likely to change is vital because it helps to improve planning, manage resources and reduce cost. It also helps to improve the software process, which should lead to better software quality. We also explore how different learners perform in this context and whether learning improves as the software evolves. Predicting change from one release to the next was successful using logistic regression, J48, and random forest, with accuracy and precision between 72% and 100%, recall between 74% and 100%, and F-score between 80% and 100%. We found no clear evidence that prediction performance improves as the project evolves.
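The precision, recall and F-score figures reported above follow the standard definitions, which can be computed as in this minimal sketch. The labels are hypothetical toy values (1 = file changed in the next release, 0 = unchanged), not data from the study, and the function name is illustrative.

```python
def precision_recall_f1(actual, predicted, positive=1):
    """Standard precision, recall and F1 for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels for eight files.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]

print(precision_recall_f1(actual, predicted))
```

Precision penalizes files flagged as changing that stay unchanged, recall penalizes changed files that the model misses, and the F-score is their harmonic mean, so it is high only when both are high.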