German Credit Data Set Arff
Format: The datasets are in annotated transaction format with labels: every line is one transaction. A transaction is a space-separated list of item identifiers (offset 0); the last item is either 1 or 0 and represents the class label. The meaning of every label is given in the header of the file: lines starting with @ describe the item numbers, and the @class line describes the two classes. To parse the files correctly, ignore all lines starting with @ or %, as well as empty lines. (The format is a combination of the FIMI format with annotations like those of the ARFF format.)
Sources: The original datasets were collected from an external repository. More datasets can be found elsewhere, but they are not annotated.
Preprocessing: Preprocessing steps were added to the @relation tag of every file. Attributes having more than 10% missing values were removed, as well as the remaining examples that had missing values. Also, zoo-1 and splice-1 had their unique ID attribute removed. Numerical attributes were binarized using unsupervised discretisation with 7 binary split points (8 bins) and equal-frequency binning. Nominal attributes were transformed using one item for every value. Multi-class problems were made binary by selecting the largest class.
Properties: Different datasets have different properties and will behave differently. A key property to watch is density (the relative number of 1's in the binary format): traditional itemset mining focused on very large and sparse datasets. In constraint-based mining, dense datasets are considered harder to mine because of the large number of candidates. For discriminative itemset mining, class labels are given; the number of positive transactions is indicated for each dataset. The number of itemsets (standard and closed/maximal condensed) is also given, for verification of correctness and as a guideline for usage. A mining tool was used to find them.
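The parsing rule and the density property described above can be sketched in a few lines of Python. This is a minimal sketch, not the dataset authors' own tooling: the function names, the sample text, and the exact header syntax (`@0: colour=red`) are assumptions made for illustration, and density is computed over the item columns only, leaving out the class label.

```python
from typing import List, Tuple


def parse_transactions(text: str) -> Tuple[List[List[int]], List[int]]:
    """Parse annotated transaction text into (item lists, class labels)."""
    transactions, labels = [], []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip header annotations (@...), comments (%...) and empty lines.
        if not line or line.startswith("@") or line.startswith("%"):
            continue
        tokens = [int(tok) for tok in line.split()]
        *items, label = tokens  # the last token is the class label (1 or 0)
        transactions.append(items)
        labels.append(label)
    return transactions, labels


def density(transactions: List[List[int]], n_items: int) -> float:
    """Fraction of 1's in the binary transaction-by-item matrix."""
    ones = sum(len(set(t)) for t in transactions)
    return ones / (len(transactions) * n_items)


# Hypothetical sample, not taken from any real dataset file.
SAMPLE = """\
@relation toy-example
@0: colour=red
@1: colour=blue
@2: size=small
@3: size=large
@class: 1 0
% a comment line

0 2 1
1 3 0
0 3 1
"""

transactions, labels = parse_transactions(SAMPLE)
print(transactions)                      # [[0, 2], [1, 3], [0, 3]]
print(labels)                            # [1, 0, 1]
print(density(transactions, n_items=4))  # 0.5
```

In this toy file each transaction holds 2 of the 4 possible items, so the density is 0.5, which would count as a dense dataset in the sense used above.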
Free download page for Project VIKAMINE's credit-g-demo-dataset.arff. VIKAMINE is a flexible environment for visual analytics, data mining and business intelligence, implemented in pure Java. It features several powerful visualization and mining methods.
Dec 14, 2016. The original dataset contains 1000 entries with 20 categorical/symbolic attributes prepared by Prof. In this dataset, each entry represents a person who takes out credit from a bank. Each person is classified as a good or bad credit risk according to the set of attributes.
1001 Datasets and Data repositories (List of lists of lists)
This is a LIST of 'lists of lists'. Messy presentation (mainly for my own use) to pull together raw datasets for when I'm in the mood to get creative: search text on a single page as a starting point for exploration. Later I will look at a better format. If you have a suggestion for a list of lists to add to this list :) please message me or post a comment.
July 31, 2017.
Data.gov: The US Government pledged last year to make all government data available freely online. This site is the first stage and acts as a portal to all sorts of amazing information on everything from climate to crime.
US Census Bureau: A wealth of information on the lives of US citizens covering population data, geographic data and education.
Another interesting place to explore government-related data, with some visualisation tools built in.
European Union Open Data Portal: As the above, but based on data from European Union institutions.
Data.gov.uk: Data from the UK Government, including the British National Bibliography, with metadata on all UK books and publications since 1950.
A pilot project with many government and geospatial datasets.
Open government data from US, EU, Canada, CKAN, and more.
The CIA World Factbook: Information on history, population, economy, government, infrastructure and military of 267 countries.
Healthdata.gov: 125 years of US healthcare data including claim-level Medicare data, epidemiology and population statistics.
NHS and Social Care Information Centre: Health data sets from the UK National Health Service.
Statistics on the situation of women and children worldwide.
World hunger, health, and disease statistics.
Amazon Web Services public datasets: Huge resource of public data, including the 1000 Genome Project, an attempt to build the most comprehensive database of human genetic information, and a database of satellite imagery of Earth.
Facebook Graph: Although much of the information on users’ Facebook profiles is private, a lot isn’t. Facebook provides the Graph API as a way of querying the huge amount of information that its users are happy to share with the world (or can’t hide because they haven’t worked out how the privacy settings work).
A fascinating tool for facial recognition data.
Public data from some of the provider's courses.
A place to check out data related to economics, healthcare, food and agriculture, and the automotive industry.
Data from world development indicators, OECD, and human development indicators, mostly related to economics data and the world.
A data scraping service that also includes data feeds.
A social data sharing service that allows you to upload your own data and connect with others who are uploading their data.
Gapminder: Compilation of data from sources including the World Health Organization and World Bank covering economic, medical and social statistics from around the world.
Google Trends: Statistics on search volume (as a proportion of total search) for any given term, since 2004.
Google Finance: 40 years’ worth of stock market data, updated in real time.
Google Books Ngrams: Search and analyze the full text of any of the millions of books digitised as part of the Google Books project.
National Climatic Data Center: Huge collection of environmental, meteorological and climate data sets from the US National Climatic Data Center. The world’s largest archive of weather data.
DBPedia: Wikipedia is comprised of millions of pieces of data, structured and unstructured, on every subject under the sun. DBPedia is an ambitious project to catalogue and create a public, freely distributable database allowing anyone to analyze this data.
Searchable, indexed archive of news articles going back to 1851.
Freebase: A community-compiled database of structured data about people, places and things, with over 45 million entries.
Million Song Data Set: Metadata on over a million songs and pieces of music. Part of Amazon Web Services.
A dataset specifically pre-processed for machine learning.
A large catalog of financial data sets.
Raw data from fascinating research into American life.
A number of cancer-related datasets.
Weka makes learning applied machine learning easy, efficient, and fun. It is a GUI tool that allows you to load datasets, run algorithms, and design and run experiments with results statistically robust enough to publish.
I recommend Weka to beginners in machine learning because it lets them focus on learning rather than getting bogged down in details that can come later. In this post, I want to show you how easy it is to load a dataset, run an advanced classification algorithm and review the results. If you follow along, you will have machine learning results in under 5 minutes, and the knowledge and confidence to go ahead and try more datasets and more algorithms.
Download Weka and Install
Visit the Weka download page and locate a version of Weka suitable for your computer (Windows, Mac, or Linux). Weka requires Java. You may already have Java installed; if not, there are versions of Weka listed on the download page (for Windows) that include Java and will install it for you. I’m on a Mac myself, and like everything else on Mac, Weka just works out of the box. If you are interested in machine learning, then I know you can figure out how to download and install software on your own computer. If you need help installing Weka, a separate post provides step-by-step instructions.
Figure: Weka GUI Chooser
Click the “Explorer” button to launch the Weka Explorer. This GUI lets you load datasets and run classification algorithms. It also provides other features, like data filtering, clustering, association rule extraction, and visualization, but we won’t be using these features right now.
Open the data/iris.arff Dataset
Click the “Open file” button to open a data set and double click on the “data” directory. Weka provides a number of small common machine learning datasets that you can use to practice on. Select the “iris.arff” file to load the Iris dataset.
Figure: Weka Explorer Interface with the Iris dataset loaded
The Iris Flower dataset is a famous dataset from statistics and is heavily borrowed by researchers in machine learning. It contains 150 instances (rows) and 4 attributes (columns) and a class attribute for the species of iris flower (one of setosa, versicolor, and virginica).
Select and Run an Algorithm
Now that you have loaded a dataset, it’s time to choose a machine learning algorithm to model the problem and make predictions.
Click the “Classify” tab. This is the area for running algorithms against a loaded dataset in Weka. You will note that the “ZeroR” algorithm is selected by default. Click the “Start” button to run this algorithm.
Figure: Weka Results for the ZeroR algorithm on the Iris flower dataset
The ZeroR algorithm selects the majority class in the dataset (all three species of iris are equally present in the data, so it picks the first one: setosa) and uses that to make all predictions. This is the baseline for the dataset and the measure by which all algorithms can be compared. The result is 33%, as expected (3 classes, each equally represented; assigning one of the three to every prediction results in 33% classification accuracy).
You will also note that the test options select Cross Validation by default with 10 folds. This means that the dataset is split into 10 parts: the first 9 are used to train the algorithm, and the 10th is used to assess the algorithm. This process is repeated, allowing each of the 10 parts of the split dataset a chance to be the held-out test set.
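To see where the 33% comes from, the majority-class baseline and the 10-fold procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not Weka's implementation: the function names are made up for this example, only the class labels are used, and the folds are formed by taking every 10th instance, whereas Weka shuffles and stratifies the data itself.

```python
from collections import Counter


def zero_r(train_labels):
    """Return the majority class; ties go to the class seen first."""
    counts = Counter(train_labels)
    best = max(counts.values())
    return next(label for label in train_labels if counts[label] == best)


def cross_validate_zero_r(labels, k=10):
    """k-fold cross-validation accuracy of the majority-class baseline."""
    correct = 0
    for fold in range(k):
        # Every k-th instance forms the held-out fold; the rest is training data.
        test = [y for i, y in enumerate(labels) if i % k == fold]
        train = [y for i, y in enumerate(labels) if i % k != fold]
        prediction = zero_r(train)
        correct += sum(1 for y in test if y == prediction)
    return correct / len(labels)


# Class labels only: 50 instances of each species, as in the Iris dataset.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

accuracy = cross_validate_zero_r(labels, k=10)
print(f"ZeroR 10-fold accuracy: {accuracy:.1%}")  # ZeroR 10-fold accuracy: 33.3%
```

Each held-out fold contains 5 flowers of every species, and the baseline always answers with the same class, so it gets 5 of 15 right in every fold.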
The ZeroR algorithm is important, but boring. Click the “Choose” button in the “Classifier” section, click on “trees”, and click on the “J48” algorithm. This is an implementation of the C4.8 algorithm in Java (“J” for Java, 48 for C4.8, hence the J48 name) and is a minor extension to the famous C4.5 algorithm. Click the “Start” button to run the algorithm.
Figure: Just the results of the J48 algorithm on the Iris flower dataset in Weka
Firstly, note the classification accuracy. You can see that the model achieved a result of 144/150 correct or 96%, which seems a lot better than the baseline of 33%.
Secondly, look at the confusion matrix. You can see a table of actual classes compared to predicted classes and you can see that there was 1 error where an Iris-setosa was classified as an Iris-versicolor, 2 cases where Iris-virginica was classified as an Iris-versicolor, and 3 cases where an Iris-versicolor was classified as an Iris-setosa (a total of 6 errors).
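The link between the confusion matrix and the 96% figure can be checked with a few lines of Python. The matrix below is rebuilt from the error counts in the text and is an assumption as far as the exact cells go: the three Iris-versicolor errors are placed in the Iris-virginica column, following the reading a commenter gives further down, rather than in the Iris-setosa column. The totals and the accuracy do not depend on that choice.

```python
classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Rows are actual classes, columns are predicted classes.
# Assumed layout: the versicolor errors are placed under virginica.
confusion = [
    [49, 1, 0],   # actual Iris-setosa
    [0, 47, 3],   # actual Iris-versicolor
    [0, 2, 48],   # actual Iris-virginica
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
errors = total - correct
accuracy = correct / total

print(f"{correct}/{total} correct, {errors} errors, accuracy {accuracy:.0%}")
# 144/150 correct, 6 errors, accuracy 96%

# Per-class recall: share of each actual class that was predicted correctly.
for i, (name, row) in enumerate(zip(classes, confusion)):
    print(f"{name}: recall {row[i] / sum(row):.2f}")
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the total number of instances.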
This table can help to explain the accuracy achieved by the algorithm.
Summary
In this post, you loaded your first dataset and ran your first machine learning algorithm (an implementation of the C4.8 algorithm) in Weka. The ZeroR algorithm doesn’t really count: it’s just a useful baseline. You now know how to load the datasets that are provided with Weka and how to run algorithms: go forth and try different algorithms and see what you come up with.
Leave a note in the comments if you can achieve better than 96% accuracy on the Iris dataset.
Well, just learning the tool etc., but using the above setup, I changed the test option to ‘Use Training Set’ and got 98% accuracy.
Detailed Accuracy By Class
Class            TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area
Iris-setosa      1.000    0.000    1.000      1.000   1.000      1.000  1.000     1.000
Iris-versicolor  0.980    0.020    0.961      0.980   0.970      0.955  0.990     0.969
Iris-virginica   0.960    0.010    0.980      0.960   0.970      0.955  0.990     0.970
Weighted Avg.    0.980    0.010    0.980      0.980   0.980      0.970  0.993     0.980
Confusion Matrix a b c.
Really nice work Sandra! Changing the test option to “use training set” changes the nature of the experiment and the results are not really comparable. This change tells you how well the model performed on the data on which it was trained (it already knows the answers). This is good if you are making a descriptive model, but not helpful if you want to use that model to make predictions.
To get an idea of how good it is at making predictions, we need to test it on data that it has not “seen” before, where it must make predictions that we can compare to the actual results. Cross validation does this for us (10 times in fact).
Great work on Multilayer Perceptron! That’s a complicated algorithm that has a lot of parameters you can play with. Maybe you could try some other datasets from the “data” directory in Weka.
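The point made in the reply about “use training set” can be shown with a deliberately extreme toy model. The sketch below is not Weka code and the data is hypothetical: the `Memorizer` class simply stores every training example, so it “already knows the answers” on the training data but can only guess on inputs it has never seen.

```python
from collections import Counter


class Memorizer:
    """Toy model: stores every training example and looks answers up."""

    def fit(self, xs, ys):
        self.table = dict(zip(xs, ys))
        # Fall back to the most frequent training label for unseen inputs.
        self.default = Counter(ys).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, xs, ys):
    """Share of inputs for which the model's prediction matches the label."""
    hits = sum(1 for x, y in zip(xs, ys) if model.predict(x) == y)
    return hits / len(xs)


# Hypothetical data: inputs are IDs, labels alternate between 0 and 1.
train_x = list(range(10))
train_y = [x % 2 for x in train_x]
test_x = list(range(10, 20))
test_y = [x % 2 for x in test_x]

model = Memorizer().fit(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)
print(f"accuracy on training data: {train_acc:.0%}")  # 100%
print(f"accuracy on unseen data:   {test_acc:.0%}")   # 50%
```

A real classifier such as J48 generalizes far better than this lookup table, but the same effect explains why a score measured on the training set tends to be higher than a cross-validated one.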
Hello everyone, hello Jason. I must say this is exciting. I have absolutely no foundation in computer science or programming, and I was never very good at mathematics, but somehow I am in love with the idea of machine learning, probably because I have a real-life scenario I want to experiment with. I have 20 weekends or more of historical data of matches played, and I would like to see how Weka can predict the outcome of matches played within that 20-week period. My data is in tabular form and is stored in Microsoft Word. It is a forecast of football matches played in the past. Pattern detection is the key: by poring over historical data of matches played in the past, patterns begin to emerge, and I use these to forecast the outcome of the next game. I use the following attributes for detecting patterns and making predictions, which on paper are always 80-100% accurate, but when I make a bet, it fails (results, team names, codes, week’s color, row number):
Results = Matches that result in draws.
Team names = Believe it or not, team names are used as parameters to make predictions. How? By the letters they begin with.
Codes = These are strings of 3-4 characters, either digits or a combination of letters and digits; depending on where they are placed in the table, they offer insight into detecting patterns.
Week’s color = In the football forecasting world, there are 4 colours used to represent each week in a month: RED, BLUE, BROWN and PURPLE. These also allow the forecaster to see emerging patterns.
Row number = Each week, the data is presented in table form with two competing teams occupying a row, and a number is associated with that row. These numbers are used to make predictions.
So I would like to TEACH WEKA how I detect these patterns so that my task can be automated and tweaked however I like. In plain English, how do I write out my “pattern detecting style” for Weka to understand, and how do I get this information loaded into Weka for processing into my desired results? Going by my scenario, what will be my attributes? What will be my instances? What will be the classifiers? What algorithms do I use to achieve my aim, or will I need to write new algorithms? I sincerely hope someone will come to my rescue.
Can you please help me? I am new to data mining concepts. I want to know how to extract features from a given URL name and measure the accuracy. For example, given a URL name, it should extract features such as an @ in it, and it should also report the age of the URL, along with further features such as IP address, long or short URL, https and SSL, hsf, redirect page, and anchor tag, and it should report the accuracy too. Then I want to implement this using the C4.5 classifier algorithm to find whether the given URL name is malicious or benign. Please, can someone help me with this process?
I’m trying to use libsvm for classification (2 classes) in 10-fold cross-validation mode. The output predictions that I get have an instance number, but I don’t know which instances of my dataset these correspond to. For example, my output predictions look like this:
inst#, actual, predicted, error, probability distribution
1 2:R 2:R 0.1
2 2:R 2:R 0.1
3 2:R 2:R 0.1
4 2:R 2:R 0.1
5 1:S 1:S.1 0
6 1:S 1:S.1 0
1 2:R 2:R 0.1
2 2:R 1:S +.1 0
3 2:R 2:R 0.1
4 2:R 2:R 0.1
5 1:S 1:S.1 0
6 1:S 2:R + 0.1
How does my dataset get divided into 10 parts, and which files do these instances correspond to? I’m interested in knowing which files get incorrectly classified. Is there some other/better way to do this?
I’ve been playing around for a while with WEKA, and now I get good prediction results. But I still wonder how to apply the model further. I mean, I train and tune algorithms and get better results, but then? When I try to input, say, a set of four attributes corresponding to those of the IRIS set, it doesn’t recognize it as something that it can use in the model. If I put in these four attributes and an empty column, it accepts this, but I don’t know how to predict the class then. How should I set the parameters in WEKA to do that, please? Thanks in advance.
Just a quick note that I love the whole site and have never had such an easy time establishing a new direction of endeavor with a high degree of confidence and understanding! This post had me up, running valid data, and evaluating the output from the classifier in under 10 minutes! Simply amazing!
One small issue of note, though. The last paragraph of section 5, just before the Summary, covers the Confusion Matrix. In that paragraph, the third case cited refers to the three instances in which “Iris-versicolor was classified as a Iris-setosa.” My understanding of the table, however, is that there were three instances in which the Iris-versicolor was classified as an Iris-virginica, not Iris-setosa as stated. Naturally Mr. Murphy would select the most ironic location for this confusion of interpretation.
Thanks for your helpful information. I have a specific question.
Using the steps that you have mentioned, we can train a machine learning model in WEKA and test its accuracy. I am wondering how we can classify new instances, with no class labels, using a model that we have trained in WEKA. For example, let's say that we have 1000 instances of positive and negative sentences. We train a machine learning model using an algorithm. Afterwards, we want to label 100 new sentences that have not already been classified with either positive or negative labels. How can we do this using WEKA?
A quick question: I ran the SMO classifier, as I needed a Support Vector Machine, and got a set of results that included a list of the features used, under a line that reads, “Machine Linear: showing attribute weights, not support vectors”. Each feature has a value to the left and the label “(normalised)” next to it. What does this mean, please? Values for each feature were used in the classification, so I assume the numbers refer to some sort of weighting, i.e. how heavily each feature impacted the results. Is this the case? Any chance someone can please explain this in simple terms, as I am a beginner, or at least point me to a website with a detailed explanation of the SMO classifier and ALL the contents of its results section?
I will be taking a course in Predictive Analytics through UC San Diego in April and I thought I’d get a start by looking at WEKA. That will be the tool that we will be using in the course. I liked the example of the iris file that you gave and how it analyzed the data and found a certain number of records misclassified within the data set. I’m trying to understand the real-life application of this tool, and I’m thinking about my previous role as a manager over all incidents that were reported by customers trying to access our websites. Some of those issues were data-related problems, e.g. missing data, incorrect data in the wrong fields, etc. So would I be able to run an analysis of the data and identify records that had these data issues in my dataset, and then drill down to the actual record to identify the real root cause and address the issue upstream in the data ETL process?
I thought I posted a comment last night, but don’t see it 🙁 At any rate, here is my question/thought. I will be taking Predictive Analytics through UC San Diego next month and will be using WEKA in the course. Your example on the Iris made it easier for me to digest how to use this tool and I thank you for that. I’m thinking about the practical use of this tool in my everyday work. I worked for a large insurance company and was responsible for 7 websites and all of the reported issues/incidents within those portals from customers (internal and external).
I’m assuming this would be a good tool to look at data-related issues. For example, we get data feeds from a mainframe to a repository. In that data, there can be defects, e.g. missing data, incorrect data in the wrong field, etc. I’m assuming this will help with this type of issue. My questions are: is my assumption above correct, and if so, does this tool then allow you to easily identify the data at a line level?
Hi Jason, I am trying to classify tweets into 3 categories: +ve, -ve and neutral. Currently I have around 800 tweets. Steps followed by me:
1. Converting the tweet text column to string using NominalToString.
2. Applying StringToWordVector (stemmer: Snowball, tokenizer: NGramTokenizer or CharNGramTokenizer).
3. Applying AttributeSelection (with default settings) under “preprocess” only, so as to automatically select the attributes.
I am getting an accuracy of around 65% with Naive Bayes. How can I improve the result? I need to know the exact process or settings to follow. Is there anything else I should be doing? Thanks a lot, Jason, for this great initiative.
Thanks for this great machine learning tool.
I tried using Weka to classify the spambase dataset into either spam or non-spam, but it is not giving me the right result. Can you explain how I can use Weka for spam email classification using any dataset? What I first did was convert the text file to a .csv file, then select normalize in Weka, and then select the classification algorithms that I want.
Thanks Jason. I am using MultilayerPerceptron for my dataset and I am getting the following results.
Time taken to build model: 1064.19 seconds
Evaluation on training set
Summary
Correctly Classified Instances 393%
Incorrectly Classified Instances 106%
Kappa statistic 0.7819
Mean absolute error 0.0152
Root mean squared error 0.0853
Relative absolute error 31.9084%
Root relative squared error 55.3309%
Total Number of Instances 49947
I need more accuracy, i.e. accuracy = 90% with the same model.