
Are the results of 10 years of machine learning unreliable? Berkeley & MIT study questions 30 classic models

Xin Zhiyuan reports

[Xin Zhiyuan editor's note] As we all know, progress in machine learning depends heavily on a handful of benchmarks, such as CIFAR-10, ImageNet, or MuJoCo. This raises a crucial question: how reliable are our current measures of machine learning progress?


In recent years, artificial intelligence has been advancing on all fronts: big milestones that "surpass human level" arrive one after another, while small gains happen almost daily (for which we can thank arXiv), as "state-of-the-art" results are constantly refreshed across papers. The whole field seems to be flourishing.

The reality, however, may not be so rosy.

A new study from Berkeley and MIT re-examined several classic classifiers of the past decade (such as VGG and ResNet) and found that, because of overfitting to the test set, many classifiers are not as accurate as claimed: when evaluated on a newly constructed test set, their accuracy generally dropped, by anywhere from 4% to 10%.

The researchers say this result can be taken as evidence that reported model accuracies are brittle numbers, easily affected by small natural variations in the data distribution.

The study also raises a question worth reflecting on: how reliable, after all, are the methods and metrics we currently use to measure progress in machine learning?

Reusing the same test set repeatedly, models fail to generalize to new data

The authors write that over the past five years, machine learning has become an experimental field. Driven by deep learning, the vast majority of published papers follow the same template: show how much a new method improves performance on a few key benchmarks. In other words, comparisons are blunt numerical contests, and very few people try to explain why a method works.

And when comparing numbers, most research relies on a small set of standard benchmarks such as CIFAR-10, ImageNet, or MuJoCo. Moreover, because the ground-truth data distribution is usually very hard to obtain, researchers can only evaluate a model's performance on a separate held-out test set.

"Now, in whole algorithm and model design process, repeat the method that uses same test part to had been accepted generally for many times. Although will new model and the result previously undertake comparative,be very natural think of a way, but apparent and current research technique destroyed classification implement independent at testing market this one key assumes. But apparent and current research technique destroyed classification implement independent at testing market this one key assumes..

The obvious harm of this mismatch is that it becomes easy to design models that perform well only on one specific test set but do not actually generalize to new data.
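
To make the failure mode concrete, here is a minimal sketch (ours, not from the paper) of the adaptive loop the authors criticize: the test set is queried repeatedly during model selection, so the final "test accuracy" is no longer an unbiased estimate of performance on new data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Toy data standing in for CIFAR-10; the pattern, not the dataset, is the point.
X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

best_score, best_model = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_test, y_test)  # <-- test set reused for model selection
    if score > best_score:
        best_score, best_model = score, model

# best_score is now optimistically biased: the test set has implicitly
# leaked into the design loop, violating its independence.
print(f"reported 'test' accuracy: {best_score:.3f}")
```

Done once, this is harmless; done across thousands of papers targeting the same benchmark, the bias compounds, which is exactly the adaptivity concern at issue here.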

A reproducibility experiment on CIFAR-10: accuracy of classic models such as VGG and ResNet drops across the board

To examine the consequences of this phenomenon, the researchers re-investigated CIFAR-10 and its classifiers. The main goal of the study was to measure how well existing classifiers generalize to new, unseen data drawn from the same distribution.

They chose the standard CIFAR-10 dataset because its transparent creation process makes it especially well suited to this task. In addition, CIFAR-10 has been a research staple for nearly a decade, making it an excellent test case for investigating whether adaptivity (Adaptivity) has already led to overfitting.


In the experiment, the researchers first built a new test set of roughly 2,000 images that the models had definitely never seen, carefully matching the subclass distribution of the new test set to that of the original CIFAR-10 dataset so the two stay as consistent as possible.
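
As a rough illustration of the matching step, here is a sketch under our own assumptions about the data layout (the paper's actual procedure, which drew candidates from the Tiny Images repository with keyword-level matching, is more involved): stratified sampling so that the new test set mirrors the original label distribution.

```python
import numpy as np

def sample_matched_test_set(candidate_labels, original_labels, n_total=2000, seed=0):
    """Sample indices from a candidate pool so that class proportions
    match the original test set's label distribution."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(original_labels, return_counts=True)
    proportions = counts / counts.sum()
    chosen = []
    for cls, p in zip(classes, proportions):
        pool = np.flatnonzero(candidate_labels == cls)  # candidates of this class
        n_cls = int(round(p * n_total))                 # quota preserving proportions
        chosen.append(rng.choice(pool, size=n_cls, replace=False))
    return np.concatenate(chosen)
```

For CIFAR-10 the classes are balanced, so each of the 10 classes would contribute about 200 of the 2,000 images.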

They then evaluated 30 image classifiers on the new test set, including the classic VGG and ResNet, the more recently proposed ResNeXt, PyramidNet, and DenseNet, and Shake-Drop, published at ICLR 2018, which combines a regularization method with earlier classifiers and achieved the then-current state of the art.

The results are shown in the table below: it reports each model's accuracy on the original CIFAR-10 test set and on the new test set, where Gap is the difference between the two accuracies. Δ Rank denotes the change in ranking; for instance, "-2" means the model's rank drops two positions on the new test set.
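
For readers reproducing the table, a minimal sketch of how the Gap and Δ Rank columns can be computed from the two accuracy lists; the numbers below are illustrative placeholders, not the paper's figures.

```python
# model name -> (accuracy on original test set, accuracy on new test set)
# Placeholder values for illustration only.
models = {
    "shake_shake": (0.971, 0.899),
    "resnet":      (0.932, 0.852),
    "vgg":         (0.931, 0.851),
}

# Rank 0 = best model under each test set.
orig_rank = {m: r for r, m in enumerate(sorted(models, key=lambda m: -models[m][0]))}
new_rank = {m: r for r, m in enumerate(sorted(models, key=lambda m: -models[m][1]))}

for m, (orig, new) in models.items():
    gap = orig - new
    delta_rank = orig_rank[m] - new_rank[m]  # negative = dropped positions
    print(f"{m:12s} orig={orig:.3f} new={new:.3f} gap={gap:.3f} ΔRank={delta_rank:+d}")
```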


The results show that model accuracy on the new test set drops markedly compared with the original test set. For example, VGG and ResNet both reach an accuracy of 93% on the original dataset, but fall to around 85% on the new test set. The authors also note that a model's performance on the existing test set turns out to be highly predictive of its performance on the new test set.

To explain this result, the authors put forward a number of hypotheses and discussed them one by one; after setting aside statistical error and hyperparameter tuning, the chief remaining explanation is overfitting to the test set.
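
To see why statistical error alone cannot explain the drop, one can bound the sampling noise of an accuracy estimated on about 2,000 test images; a rough sketch using the standard normal approximation (our arithmetic, not the paper's analysis):

```python
import math

def accuracy_ci(acc, n, z=1.96):
    """95% normal-approximation confidence interval for an accuracy
    estimated from n i.i.d. test samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n)
    return acc - half_width, acc + half_width

# With ~2000 new test images at ~85% accuracy the interval is roughly +/-1.6%,
# far smaller than the observed 4-10% drops, so sampling noise alone
# cannot account for the gap.
print(accuracy_ci(0.85, 2000))  # -> approximately (0.834, 0.866)
```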

The authors note that their results also reveal a surprising side of current machine learning progress: although researchers have been adapting (Adapting) to the CIFAR-10 test set for many years, progress has not stalled. The best-performing model remains the recently proposed Shake-Shake network (with Cutout regularization). Moreover, on the new test set, Shake-Shake's advantage over a standard ResNet grows from 4% to 8%. This suggests that the research methodology of "attacking" a single test set over a long period is surprisingly resistant to overfitting.

At the same time, the result casts doubt on the robustness of current classifiers. Even though the new dataset introduces only a small change (a distribution shift), the classification accuracy of widely used existing models drops significantly across the board. For example, the accuracy losses of VGG and ResNet mentioned above correspond to years of progress on CIFAR-10.

The authors specifically point out that the distribution shift (Distributional Shift) induced by their experiment is neither adversarial (Adversarial) nor the result of using a different data source. Therefore, even in benign settings, distribution shift poses a serious challenge, and researchers need to rethink to what degree current models truly generalize.

Machine learning research also needs to pay attention to reproducibility

Sebastian Raschka, author of the book Python Machine Learning, commented that the study reminds machine learning researchers of the problem of repeatedly reusing the test set (and thereby violating its independence).

Hardmaru, a research scientist at Google Brain, said on Twitter that reliable methods for evaluating machine learning progress are very important. He hopes to see similar studies for text and translation, examining how architectures benchmarked on PTB, wikitext, enwik8, WMT'14 EN-FR, and EN-DE fare on new test sets drawn from the same distribution.

That said, hardmaru added that if similar results were obtained on PTB, it would actually be good news for the deep learning community, because it would mean that the typical practice of hyper-optimizing on this small dataset really can surface new methods that generalize better.

The authors note that future work should explore whether other datasets (for example, ImageNet) and other tasks (such as language modeling) show the same resilience to overfitting. In addition, we should work out which naturally occurring distribution shifts are challenging for image classifiers.

To truly understand generalization, more research should collect carefully designed new data and evaluate how existing algorithms perform on it. Much as reproducibility experiments in medicine or psychology recruit new participants, reproducible machine learning research also needs more studies of model performance on fresh data.

Related paper: Do CIFAR-10 Classifiers Generalize to CIFAR-10? https://arxiv.org/pdf/1806.00451.pdf