Our competition data on Kaggle are an MNIST replacement consisting of Japanese characters, and contain the following two datasets:
Kuzushiji-MNIST is a drop-in replacement for the MNIST dataset (28×28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format. Since MNIST restricts us to 10 classes, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST.
Kuzushiji-49, as the name suggests, has 49 classes (28×28 grayscale, 270,912 images); it is a much larger, but imbalanced, dataset containing 48 Hiragana characters and one Hiragana iteration mark.
The data are courtesy of Tarin Clanuwat.
Background: The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples and a test set of 10,000 examples. It is currently a de facto benchmark dataset for machine learning models (classification, regression, neural networks, clustering). The K-MNIST datasets we will use are formatted in the same way but contain different values, which means you cannot apply models already trained on MNIST directly; you have to re-train your models.
Download
k49-train-imgs.npz, k49-train-labels.npz, k49-test-imgs.npz, kmnist-train-imgs.npz, kmnist-train-labels.npz, kmnist-test-imgs.npz,
from our Kaggle competition site. The .npz file format is directly readable by numpy. You will learn classifiers using the training data of K-MNIST, make predictions based on the test features, and upload your predictions as a CSV file to Kaggle for evaluation. Predictions based on the K49 training data will be graded as bonus points.
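To get started, these arrays load directly with numpy. Below is a minimal sketch; it assumes each archive stores its array under numpy's default key 'arr_0', and the 'Id'/'Class' submission columns are placeholders, so check the competition page for the exact format Kaggle expects.

```python
import numpy as np

# Load the K-MNIST arrays. The key 'arr_0' is numpy's default name
# for a single unnamed array saved with np.savez; adjust if the
# archives use a different key.
X_train = np.load('kmnist-train-imgs.npz')['arr_0']    # (60000, 28, 28)
y_train = np.load('kmnist-train-labels.npz')['arr_0']  # (60000,)
X_test = np.load('kmnist-test-imgs.npz')['arr_0']      # (10000, 28, 28)

# Flatten each 28x28 image into a 784-dimensional feature vector and
# rescale pixel values from [0, 255] to [0, 1].
X_train = X_train.reshape(len(X_train), -1) / 255.0
X_test = X_test.reshape(len(X_test), -1) / 255.0

# Write predictions as a Kaggle submission. The 'Id' and 'Class'
# column names are assumptions, not the confirmed competition format.
def write_submission(preds, path='submission.csv'):
    with open(path, 'w') as f:
        f.write('Id,Class\n')
        for i, label in enumerate(preds):
            f.write(f'{i},{label}\n')
```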
Kaggle will then score your predictions, and report your performance on a random subset of the test data to place your team on the public leaderboard. After the competition, the score on the remainder of the test data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.
Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible classifier and check its leaderboard quality. You will need to do your own validation, for example by splitting the training data into multiple folds, to tune the parameters of your learning algorithms before uploading predictions for your top models. The competition closes (uploads will no longer be accepted or scored) on March 22nd, 2019 at 11:59pm Pacific Daylight Time.
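For local validation, a stratified K-fold loop is one option; the sketch below assumes scikit-learn is available, and make_model is a hypothetical factory for whatever classifier you are tuning.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Estimate accuracy locally before spending one of the two daily
# Kaggle uploads. Stratification keeps class proportions balanced
# across folds, which matters for the imbalanced K49 data.
def cross_validate(make_model, X, y, n_splits=5, seed=0):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in skf.split(X, y):
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[val_idx], y[val_idx]))  # accuracy
    return np.mean(scores), np.std(scores)
```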
Project Requirements
Each project team will learn several different classifiers for the Kaggle data, to try to predict class labels as accurately as possible. We expect you to experiment with different types of classification models, or combine two of them. Suggestions include:
K-Nearest Neighbors. KNN models for this data will need to overcome two issues: the large number of training and test examples, and the high data dimension. As noted in class, distance-based methods often do not work well in high dimensions, so you may need to perform some kind of feature selection to decide which features are most important. Also, computing distances between all pairs of training and test instances may be too slow; you may need to reduce the number of training examples somehow (for example by clustering), or use more efficient algorithms to find nearest neighbors. Finally, the right distance for prediction may not be Euclidean in the original feature scaling (these are raw pixel intensities); you may want to experiment with scaling features differently.
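One way to address both issues is sketched below, assuming scikit-learn: PCA reduces the dimension before the distance computation, and KNeighborsClassifier can use tree-based neighbor search rather than brute-force all-pairs distances. The component count and k are arbitrary starting points, not tuned values.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Project to 50 principal components, then classify by the 5 nearest
# neighbors in the reduced space; both numbers should be tuned.
knn = make_pipeline(
    PCA(n_components=50),
    KNeighborsClassifier(n_neighbors=5),
)
knn.fit(X_train, y_train)  # arrays as loaded earlier
```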
Linear models. Since we have a large amount of training data, softmax regression for multiple classes is a natural baseline, optionally with a pre-processing step based on a clustering algorithm (for example, replacing raw pixels with distances to cluster centroids); a sketch follows below.
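A minimal softmax-regression baseline, sketched with scikit-learn (one assumed choice of library); the clustering pre-processing step is omitted here for brevity.

```python
from sklearn.linear_model import LogisticRegression

# With more than two classes and the default 'lbfgs' solver,
# LogisticRegression fits a multinomial (softmax) model. max_iter is
# raised because 784-dimensional inputs can be slow to converge.
softmax = LogisticRegression(max_iter=300)
softmax.fit(X_train, y_train)  # arrays as loaded earlier
```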
Support Vector Machines. Like KNN, SVM classifiers scale poorly with the number of training examples, so some data pre-processing or subsampling may be required.
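The subsampling idea might look like the sketch below, again assuming scikit-learn; the subsample size and kernel parameters are placeholders to tune.

```python
import numpy as np
from sklearn.svm import SVC

# Kernel SVM training scales super-linearly in the number of
# examples, so fit on a random subsample of the training set first.
rng = np.random.default_rng(0)
idx = rng.choice(len(X_train), size=10_000, replace=False)
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train[idx], y_train[idx])  # arrays as loaded earlier
```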
Neural networks. The key to learning a good NN model on these data will be to ensure that your training algorithm does not become trapped in poor local optima. You should monitor its performance across backpropagation iterations on training/validation data, and verify that predictive performance improves to reasonable values. Start with few layers (2-3) and moderate numbers of hidden nodes (100-1000) per layer, and verify improvements over baseline linear models.
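One small starting configuration, sketched with scikit-learn's MLPClassifier (any deep-learning library would serve equally well): early stopping holds out part of the training data and monitors validation score across iterations, in line with the advice above.

```python
from sklearn.neural_network import MLPClassifier

# Two hidden layers within the suggested size range. verbose=True
# prints the training loss each iteration so you can watch progress;
# early_stopping halts when the held-out validation score plateaus.
mlp = MLPClassifier(hidden_layer_sizes=(256, 128),
                    early_stopping=True,
                    random_state=0,
                    verbose=True)
mlp.fit(X_train, y_train)  # arrays as loaded earlier
```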
Other. You tell us! Apply another class of learners, or a variant or combination of methods like the above. You can use existing libraries or modify code provided in the course. The only requirement is that you understand the model you are applying, and can clearly explain its properties in the project report.
For each machine learning algorithm, you should do enough work to make sure that it achieves reasonable performance, with accuracy similar to (or better than) baselines like softmax regression for the 10 classes. Then, take your best learned models, and combine them using a blending or stacking technique. This could be done via a simple average/vote, or a weighted vote based on another learning algorithm. Feel free to experiment and see what performance gains are possible.
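A minimal probability-averaging blend is sketched below; it assumes each fitted model exposes predict_proba (true of the scikit-learn sketches above) and reuses the hypothetical write_submission helper from earlier.

```python
import numpy as np

# Average the predicted class probabilities from several fitted
# models and take the most probable class. Weighted averages, with
# weights tuned on a validation split, are a natural next step.
def blend_predict(models, X):
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return models[0].classes_[np.argmax(probs, axis=1)]

final_preds = blend_predict([softmax, knn, mlp], X_test)
write_submission(final_preds)
```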