Today's most interesting news: Google introduced an algorithm that automatically finds optimal image augmentation policies, and a new paper investigates possible overfitting in published CIFAR-10 models.
AutoAugment
Introducing AutoAugment, a reinforcement learning algorithm which increases both the amount and diversity of existing data by finding optimal image augmentation policies, leading to state-of-the-art performance in computer vision models. Learn more at https://t.co/Yq5Hw9B1QV
— Google AI (@GoogleAI) June 4, 2018
Computer vision model training uses different augmentation of input images (random crops, distortions, etc.). Instead of hand-engineering the operation sequence, one can instead use meta learning to learn effective sequences. Gets new state-of-the-art of 83.54% top1 on ImageNet! https://t.co/NRWfrdARDx
— Jeff Dean (@JeffDean) June 5, 2018
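A minimal sketch (my own construction, NumPy-only, not Google's AutoAugment code) of the kind of crop-then-flip augmentation sequence the tweets describe. AutoAugment's contribution is *searching* over such operation sequences, magnitudes, and probabilities instead of hand-picking them:

```python
import numpy as np

def random_crop(img, size, rng):
    """Crop a (H, W, C) image to (size, size, C) at a random offset."""
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]

def random_flip(img, rng):
    """Flip the image horizontally with probability 0.5."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def augment(img, crop_size=24, rng=None):
    """One fixed crop-then-flip policy; AutoAugment would learn this sequence."""
    rng = rng or np.random.default_rng(0)
    return random_flip(random_crop(img, crop_size, rng), rng)

img = np.zeros((32, 32, 3))   # stand-in for a CIFAR-10 image
out = augment(img)
print(out.shape)              # (24, 24, 3)
```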
ImageNet Top-1 performance of 2018 is like CIFAR10 performance of 2010.
— hardmaru (@hardmaru) May 25, 2018
Overfitting?
Are we overfitting on CIFAR10? "We measure the accuracy of CIFAR-10 classifiers by creating a new test set of truly unseen images. [...] We find a large drop in accuracy (4% to 10%) for a broad range of deep learning models" https://t.co/wPhSUmIr9x
— Giorgio Patrini (@GiorgioPatrini) June 4, 2018
"Do CIFAR-10 Classifiers Generalize to CIFAR-10?" -- very interesting article highlighting the well-known test set re-use (and violation of independence) issues (i.e., optimisitic bias in absolute performance) in machine learning research empirically: https://t.co/joAN2ASftJ
— Sebastian Raschka (@rasbt) June 4, 2018
“Our sense of progress largely rests on a small number of standard benchmarks such as CIFAR-10, ImageNet, or MuJoCo. This raises a crucial question: How reliable are our current measures of progress in machine learning?” 🔥 https://t.co/WbQ2eInAjT
— hardmaru (@hardmaru) June 4, 2018
I've seen similar results on MNIST -- 99% accuracy classifier dropped to 90% when applied to a new handwriting dataset with same preprocessing
— Yaroslav Bulatov (@yaroslavvb) June 4, 2018
it's good to see though that the original ranking is largely preserved when the classifiers are compared via the new test set. So, basically only the absolute error is affected, which is more of a minor issue
— Sebastian Raschka (@rasbt) June 4, 2018
Stating the obvious: a lot of current deep learning tricks are overfit to the validation sets of well-known benchmarks, including CIFAR10. It's nice to see this quantified. This has been a problem with ImageNet since at least 2015. https://t.co/OGvgn2fwJH
— François Chollet (@fchollet) June 4, 2018
(Long thread. Click on the tweet to view the full thread)
If publishing your paper requires you to select specific ideas, architectures, and hyperparameters according to a fixed validation set, then it's no longer a validation set, it's a training set. And there are no guarantees that the selected ideas will generalize to real data.
— François Chollet (@fchollet) June 4, 2018
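Chollet's point can be made concrete with a toy simulation (my own construction, not from the CIFAR-10 paper): select the best of 200 pure-noise classifiers on a fixed binary test set, and it looks clearly better than chance, yet it falls back to roughly 50% on a truly fresh test set:

```python
import numpy as np

rng = np.random.default_rng(42)
n_test, n_models = 1000, 200

# Labels for the "public" test set and a truly fresh one.
y_old = rng.integers(0, 2, n_test)
y_new = rng.integers(0, 2, n_test)

# 200 classifiers that are pure noise: each predicts random labels.
preds_old = rng.integers(0, 2, (n_models, n_test))
preds_new = rng.integers(0, 2, (n_models, n_test))

acc_old = (preds_old == y_old).mean(axis=1)
best = acc_old.argmax()   # "model selection" against the fixed test set

print(f"best model on old test set: {acc_old[best]:.3f}")  # noticeably above 0.5
print(f"same model on new test set: {(preds_new[best] == y_new).mean():.3f}")
```

The inflated score on the reused set is exactly the "optimistic bias in absolute performance" the thread describes.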
Statistical Pitfalls in News
ICYMI, 📽 great slides for #dataliteracy:
"Statistical pitfalls in the news" 🗣 @maartenzam https://t.co/T6q8DhWoJN #ddj #EIJC18 #SoDS18 pic.twitter.com/Zc2TRVYLAt
— Mara Averick (@dataandme) June 5, 2018
Visualization
I think the latest xkcd is written directly to you @dataandme pic.twitter.com/tmYdOeXzli
— Thomas Lin Pedersen (@thomasp85) June 4, 2018
Listed Authors. Hey, join our book project: https://t.co/po9FQ1j6ZQ pic.twitter.com/dKPf11QqGc
— PHD Comics (@PHDcomics) June 4, 2018
Despite the different numbers, there is a clear takeaway from all these studies: a lot of people died after Hurricane Maria. https://t.co/GnnKFgh8F7 pic.twitter.com/YPQ0WncGZ9
— FiveThirtyEight (@FiveThirtyEight) June 5, 2018
Some things that only interactive graphics can do:
- Guide a reader through complex charts
- Incorporate live data
- Challenge the reader with a simple game
- Perform live experiments
- Immerse with 3D, VR and AR
- Use sound https://t.co/7nyChCoPfJ
— Elliot Bentley (@elliot_bentley) June 4, 2018
Easily Reproducible Research
From a paper I'm reading: "we have also created a docker container that reproduces all the experiments and figures in this paper". Very nice! A high bar, but I think eventually this should be the norm in machine learning conferences too.
— Dumitru Erhan (@doomie) June 4, 2018
If you care about the truth of your findings, about generating reliable knowledge, then you should *want* as many people as possible try to reproduce your results. And you should make it frictionless. https://t.co/19C9qOsDYa
— François Chollet (@fchollet) June 5, 2018
Making it deliberately difficult to reproduce or evaluate your results is a sure sign that you are writing papers for the wrong reasons.
Science may be a job, but before that, it is humanity's engine for knowledge creation. Make it work.
— François Chollet (@fchollet) June 5, 2018
Tutorials and Resources
Hey #Alteryx18 - interested in learning how you can step up your #BI efforts? Check out our white paper "Moving from #BusinessIntelligence to #MachineLearning with Automation" #AugmentedAnalytics #CitizenDataScientist https://t.co/5nD8OifZx9
— DataRobot (@DataRobot) June 4, 2018
For those folks looking for a book to help them as they work through the @fastdotai course - quite a few students are saying that @fchollet's book is a great match https://t.co/WyMRJq5TW5
— Jeremy Howard (@jeremyphoward) June 4, 2018
#DataScience Resources : Cheat Sheets
by @DataScienceFree |
Read full article here: https://t.co/r5FgYLwFex #ML #MachineLearning #DataScientists #Programming #Analytics #DataAnalytics #BigData #Data #Cheatsheet #RT
cc: @bigdata @tableau @guardiandata @bigdatagal pic.twitter.com/XhHluOtAbL
— Ronald van Loon (@Ronald_vanLoon) June 4, 2018
A great intro to pipelines! Learning this is one of my mini-projects.
Gives you superpowers when working with structured data and is a very nice way to reason about calculations in general.
Did you know sklearn had FunctionTransformer and Imputer? 👌🙂 https://t.co/JU1xaCvc3C
— Radek (@radekosmulski) June 5, 2018
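A minimal sketch of the pipeline pieces the tweet names, with toy data of my own. Note that the `Imputer` of 2018-era scikit-learn has since moved to `sklearn.impute.SimpleImputer`:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Toy feature matrix with missing values.
X = np.array([[1.0, np.nan], [2.0, 10.0], [np.nan, 12.0], [4.0, 8.0]])
y = np.array([0, 1, 1, 0])

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill NaNs with column means
    ("log", FunctionTransformer(np.log1p)),      # any stateless transform works here
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)
print(pipe.predict(X))
```

Because all steps live in one `Pipeline`, the imputation statistics are fit only on the training data, which avoids exactly the kind of leakage discussed in the overfitting section above.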
Fighting fire with machine learning: how two California high school students are using @TensorFlow to predict wildfires → https://t.co/z3mfUIdaeh #SearchOn pic.twitter.com/ZmEpooLvDt
— Google (@Google) June 4, 2018
Data Fallacies to Avoid — An Illustrated Collection of Mistakes that People Often Make When Analyzing Data: https://t.co/ZygJljEEUF #abdsc #BigData #DataScience #Statistics #DataLiteracy #DataEthics #DataScientists #StatisticalLiteracy pic.twitter.com/QSC1NW7bdH
— Kirk Borne (@KirkDBorne) June 5, 2018
#rstats
💫 code-through w/ MKL neural ntwks:
"Data Science for Fraud Detection" by @ShirinGlander https://t.co/Y3ypUr7MkI via @codecentric #rstats pic.twitter.com/YbMlpYtK1J
— Mara Averick (@dataandme) June 4, 2018
ggplot2 has a lot (and I say A LOT) of theme customisation options, and I can never remember the argument names. I forgot that @rstudio can have addins and one of them makes the #ggplot theme customisation easier 🙂 #rstats pic.twitter.com/57khHf3zS6
— Emi Tanaka 🌾 (@statsgen) June 2, 2018
New position adjustment for #rstats #ggplot2: `position_waterfall`. Part of in-development pkg ggbg https://t.co/NF1CbhMebH. pic.twitter.com/1G4bDoK673
— BrodieG (@BrodieGaslam) June 2, 2018
👍 overview:
"Enterprise Dashboards with R Markdown" by nathanstephens
https://t.co/YwE2q5K8dE #rstats #rmarkdown #dataviz pic.twitter.com/WSygtre5LI
— Mara Averick (@dataandme) June 4, 2018
+1 to @erictleung's recommendation for #rstats skimr package (especially the mini histogram!). #cascadiarconf pic.twitter.com/YdXPDEzt0M
— Caitlin Hudon👩🏼💻 (@beeonaposy) June 3, 2018
🦄 fun dataset to get your nerd 🤓 on!
"Cleaning up and combining data, a dataset for practice" https://t.co/dCfNEUsUJr #STEMed pic.twitter.com/7COm2xyh7Q
— Mara Averick (@dataandme) June 4, 2018
If you want a ton of great R content, follow these accounts. I'm going to attempt to be more like them in terms of sharing great coding-related content, but focus more on python 😊 https://t.co/HI6TYu4RFq
— Data Science Renee (@BecomingDataSci) June 4, 2018
💕 With @JennyBryan, @revodavid, @hspter, the R-loving family Robinson (@robinson_es, @drob), @juliasilge, @thomasp85, @hrbrmstr, @kierisi, some rando dude @hadleywickham, and many many more, I'm pretty sure I'm riding coattails of #rstats-Twitter glory! 🐎✨
— Mara Averick (@dataandme) June 4, 2018
Notable Research
New paper: "Accounting for the Neglected Dimensions of AI Progress": https://t.co/uvD4sU0uPr
Joint work with Fernando Martínez-Plumed, Shahar Avin, Allan Dafoe, Seán Ó hÉigeartaigh, and José Hernández-Orallo.
(1/n) pic.twitter.com/d0MPukms3Y
— Miles Brundage (@Miles_Brundage) June 5, 2018
To coincide with the talk and the paper, we are releasing a new set of ELMo models of various sizes/amounts of data. Download at https://t.co/dOpuNmUeHy - seamless integration with #allennlp and the original tensorflow code. #naacl2018
— Mark Neumann (@MarkNeumannnn) June 4, 2018
A nice review article about relational reasoning, relational inductive biases in typical deep learning building blocks, and graph networks. By @PeterWBattaglia et al. from DeepMind, Google Brain, MIT, University of Edinburgh. https://t.co/kmypHzcizw
— hardmaru (@hardmaru) June 5, 2018
I wrote a post explaining "Yes, but Did It Work?: Evaluating Variational Inference" by Yao et al. https://t.co/fQYYYo4ibc (ICML 2018), mostly for my own understanding https://t.co/z2OfSsAxb6
— Stephanie Hyland (@_hylandSL) June 3, 2018
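The diagnostic in Yao et al. is the PSIS k̂ statistic; as a simpler, related sanity check (my substitution, not the paper's actual method), the effective sample size of importance weights between a "variational" approximation and the target also exposes a bad fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_logpdf(x, scale=1.0):
    """Log-density of a zero-mean normal with the given scale."""
    return -0.5 * (x / scale) ** 2 - np.log(scale) - 0.5 * np.log(2 * np.pi)

def is_ess(logw):
    """Effective sample size of self-normalized importance weights."""
    w = np.exp(logw - logw.max())
    w /= w.sum()
    return 1.0 / np.sum(w ** 2)

# Target: standard normal. Two candidate approximations of it.
x_good = rng.normal(0.0, 1.1, 5000)   # slightly wide: a decent fit
x_bad = rng.normal(0.0, 0.4, 5000)    # much too narrow: a bad fit

ess_good = is_ess(norm_logpdf(x_good) - norm_logpdf(x_good, 1.1))
ess_bad = is_ess(norm_logpdf(x_bad) - norm_logpdf(x_bad, 0.4))

print(ess_good)   # close to the full 5000 draws
print(ess_bad)    # a small fraction of 5000: don't trust this fit
```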
Miscellaneous
The root of gender bias: mistaking self-confidence and hubris disguised... https://t.co/BGpu4kHYm9
— Yann LeCun (@ylecun) June 4, 2018
Really proud of our partnership with @theRSAorg on this report. As technology becomes more sophisticated and ingrained in all of our lives, it makes robust public engagement by the tech sector an ethical responsibility. https://t.co/ytBDqKIyqb
— Mustafa Suleyman (@mustafasuleymn) June 4, 2018
We’ve got something new that we think you’ll like... check out our community-owned Data Management Skillbuilding Hub – a one-stop-shop for teaching & learning best practices for data mgmt. Use, share & contribute to continually improve these resources. https://t.co/GVmYu0T7Uq pic.twitter.com/cBq1gKuMhg
— DataONE (@DataONEorg) June 4, 2018
Interesting interview with Andrew Ng https://t.co/ZJuaGztn4U
— Nando de Freitas (@NandoDF) June 4, 2018
"To start, focus on what things DO, not what they ARE" - such an unbelievably powerful idea on learning shared by @math_rachel in one of her lectures!!!
— Radek (@radekosmulski) June 4, 2018
Planning on making this the basis of what I'm going to work on next.
As a kid my dad told me "The age you consider 'old' is the square-root of your age times 10". At 9 you think 30 is old, at 16 you think 40, etc.
Turns out he was wrong.
It's the square-root of your age times 8. pic.twitter.com/wTIzcjyvYg
— Tomer Ullman (@TomerUllman) June 4, 2018
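The joke's arithmetic checks out, and the "corrected" factor of 8 has a neat fixed point (my observation, not the tweet's):

```python
import math

def old_age(age, k=10):
    """The tweeted rule: the age you consider 'old' is k * sqrt(your age)."""
    return k * math.sqrt(age)

print(old_age(9))        # 30.0 -- at 9 you think 30 is old
print(old_age(16))       # 40.0 -- at 16 you think 40 is old

# With the factor of 8, the rule has a fixed point at 64:
print(old_age(64, k=8))  # 64.0 -- at 64 you finally count as old
```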
Just load the entire dataset into RAM in a single np.array! Long live feed_dict's for TensorFlow. https://t.co/2tIvjbJ3P8
— hardmaru (@hardmaru) June 4, 2018
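The pattern hardmaru is defending is easy to sketch (NumPy-only here, since TF1-style `feed_dict` code is now deprecated; the shapes are my own illustrative numbers): keep the whole dataset in one in-RAM array and slice minibatches out of it with fancy indexing. A full CIFAR-10-sized array (60,000 images of 32x32x3 float32) is only about 700 MB, so this is often perfectly reasonable:

```python
import numpy as np

rng = np.random.default_rng(0)

# The whole "dataset" as a single in-RAM array (toy shapes for illustration).
X = rng.normal(size=(60_000, 8)).astype(np.float32)
y = rng.integers(0, 10, 60_000)

def minibatches(X, y, batch_size, rng):
    """Yield shuffled minibatches by fancy-indexing the in-RAM arrays --
    the same values you would pass via a TF1-style feed_dict."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        yield X[b], y[b]

xb, yb = next(minibatches(X, y, 128, rng))
print(xb.shape, yb.shape)   # (128, 8) (128,)
```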
(Actual conversation)
“What do you do?”
“I’m a data scientist”
<silence>
“I work with computers”
<stares at my sweet jogging pants>
“from home”
“Like the jobs on the telephone poles?” https://t.co/Z6G2uyOs7s
— Chris Albon (@chrisalbon) June 5, 2018