diff --git a/01b-AI_Possibilities-what_is_ai.Rmd b/01b-AI_Possibilities-what_is_ai.Rmd
index 9e53509d..adcd5479 100644
--- a/01b-AI_Possibilities-what_is_ai.Rmd
+++ b/01b-AI_Possibilities-what_is_ai.Rmd
@@ -65,7 +65,7 @@ In this case study, we will look at how artificial intelligence has been utilize
 There are many uses of AI for improving financial institutions, each with potential benefits and risks. Most financial institutions weigh the benefits and risks carefully before implementation.
 
-For instance, if a financial institution takes a high-risk prediction seriously, such as predicting a financial crisis or a large recession, then it would have huge impact on a bank’s policy and allows the bank to act early. However, many financial institutions are hesitant to take action based on artificial intelligence predictions because the prediction is for a high-risk situation. If the prediction is not accurate then there can be severe consequences. Additionally, data on rare events such as financial crises are not abundant, so researchers worry that there is not enough data to train accurate models @nelson2023.
+For instance, if a financial institution takes a high-risk prediction seriously, such as predicting a financial crisis or a large recession, it would have a huge impact on the bank’s policy and allow the bank to act early. However, many financial institutions are hesitant to take action based on artificial intelligence predictions precisely because the stakes are high: if the prediction is not accurate, there can be severe consequences. Additionally, data on rare events such as financial crises are not abundant, so researchers worry that there is not enough data to train accurate models [@nelson2023].
 
 Many banks prefer to pilot AI for low-risk, repeated predictions, in which the events are common and there is a lot of data to train the model on.
@@ -77,7 +77,7 @@ Let’s look at a few examples that illustrate the potential benefits and risks o
 ottrpal::include_slide("https://docs.google.com/presentation/d/1b8ivojtu3UA0HcACLqcghS300Ia4Wu7iXmgp6KacEJw/edit#slide=id.g2639341f200_0_58")
 ```
 
-An important task in analysis of economic data is to classify business by institutional sector. For instance, given 10 million legal entities in the European Union, they need to be classified by financial sector to conduct downstream analysis. In the past, classifying legal entities was curated by expert knowledge @moufakkir2023.
+An important task in the analysis of economic data is to classify businesses by institutional sector. For instance, the roughly 10 million legal entities in the European Union need to be classified by financial sector before downstream analysis can be conducted. In the past, classifying legal entities was curated by expert knowledge [@moufakkir2023].
 
 Text-based analysis and machine learning classifiers, which are all considered AI models, help reduce this manual curation time. An AI model would extract important keywords and classify into an appropriate financial sector, such as “non-profits”, “small business”, or “government”. This would be a low-risk use of AI, as one could easily validate the result to the true financial sector.
 
@@ -87,8 +87,7 @@ Text-based analysis and machine learning classifiers, which are all considered A
 ottrpal::include_slide("https://docs.google.com/presentation/d/1b8ivojtu3UA0HcACLqcghS300Ia4Wu7iXmgp6KacEJw/edit#slide=id.g2639341f200_0_70")
 ```
 
-
-Banks are considering expanding upon existing traditional economic models to bring in a wider data sources, such as pulling in social media feeds as an indicator of public sentiment. The National bank of France has started to use social media information to estimate the public perception of inflation. The Malaysian national bank has started to incorporate new articles into its financial model of gross domestic product estimation. 
However, the use of these new data sources may may raise questions about government oversight of social media and public domain information @omfif2023.
+Banks are considering expanding upon existing traditional economic models to bring in wider data sources, such as pulling in social media feeds as an indicator of public sentiment. France’s national bank has started to use social media information to estimate the public perception of inflation. Malaysia’s national bank has started to incorporate news articles into its financial model of gross domestic product estimation. However, the use of these new data sources may raise questions about government oversight of social media and public domain information [@omfif2023].
 
 #### Using Large Language Models to predict inflation
 
@@ -96,17 +95,9 @@ Banks are considering expanding upon existing traditional economic models to bri
 ottrpal::include_slide("https://docs.google.com/presentation/d/1b8ivojtu3UA0HcACLqcghS300Ia4Wu7iXmgp6KacEJw/edit#slide=id.g2639341f200_0_14")
 ```
 
-The US Federal Reserve has researched the idea of using pre-trained large language models from Google to make inflation predictions. Usually, inflation is predicted from the Survey of Professional Forecasters, which pools forecasts from a range of financial forecasts and experts. When compared to the true inflation rate, the researchers found that the large language models performed slightly better than the Survey of Professional Forecasters @stlouisfed2023.
-
-A concern of using pre-trained large language models is that the data sources used for model training are not known, so the financial institution may be using data that is not in line with its policy. Also, a potential risk of using large language models that perform similarly is the convergence of predictions. If large language models make very similar predictions, banks would act similarly and make similar policies, which may lead to financial instability @omfif2023.
-
-
-
-
-
-
-
+The US Federal Reserve has researched the idea of using pre-trained large language models from Google to make inflation predictions. Usually, inflation is predicted from the Survey of Professional Forecasters, which pools forecasts from a range of financial forecasters and experts. When compared to the true inflation rate, the researchers found that the large language models performed slightly better than the Survey of Professional Forecasters [@stlouisfed2023].
+A concern with using pre-trained large language models is that the data sources used for model training are not known, so the financial institution may be using data that is not in line with its policy. Also, a potential risk of using large language models that perform similarly is the convergence of predictions. If large language models make very similar predictions, banks would act similarly and make similar policies, which may lead to financial instability [@omfif2023].
 
 ## What Is and Is Not AI
 
@@ -174,3 +165,7 @@ While the core functionality of speed cameras relies on sensor technology and pr
 This is considered AI. Social media algorithms, like Instagram's, make recommendations based on user behavior. For example, if you spend a lot of time viewing a page that was recommended, the system interprets that as positive feedback and will make similar recommendations. Typically, these recommendations get better over time as the user generates more user-specific data.
 
 You supply data through your behaviors, the algorithm gets trained, and you interact with the suggestions via the app.
+
+## Summary
+
+The definition of artificial intelligence (AI) has shifted over time. We use the three-part framework of data, algorithms, and interfaces to describe AI applications. You will need to consider whether specific technologies meet the criteria for being classified as AI under this framework. Adaptability and training with new data are key factors to keep in mind as we move further in the course.
diff --git a/01c-AI_Possibilities-how_ai_works.Rmd b/01c-AI_Possibilities-how_ai_works.Rmd index 83a74671..8c90a0f0 100644 --- a/01c-AI_Possibilities-how_ai_works.Rmd +++ b/01c-AI_Possibilities-how_ai_works.Rmd @@ -5,65 +5,108 @@ ottrpal::set_knitr_image_path() # VIDEO How AI Works +TODO: Slides here: https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g263e06ef889_36_397 + # How AI Works -Let's briefly revisit our definition of AI: it must have data, training via an algorithm, and an interface. How do each of these work? We'll explore below. +Let's briefly revisit our definition of AI: it must have data, algorithm(s), and an interface. Let's dive into each of these in more detail below. + +## Early Warning for Skin Cancer + +Each year in the United States, 6.1 million adults are treated for skin cancer (basal cell and squamous cell carcinomas), totaling nearly $10 billion in costs [@CDC2023]. It is one of the most common forms of cancer in the United States, and mortality from skin cancer is a real concern. Fortunately, early detection through regular screening can increase survival rates to over 95% [@Melarkode2023]. Cost and accessibility of screening providers, however, means that many people aren't getting the preventative care they need. + +Increasingly, AI is being used to flag potential skin cancer. AI focused on skin cancer detection could be used by would-be patients to motivate them to seek a professional opinion, or by clinicians to validate their findings or help with continuous learning. -## The Data Explosion +1. **Data**: Images of skin -Let's say we're driving a car or taking public transportation in a city. We might notice a pattern between the amount of traffic on roads, and the time of day. If you commute once at a specific time of day and observe the traffic around you, you have one data point. You can do this a bunch of times and collect more data. +1. 
**Algorithm**: Detection of possible skin cancer -Historically, this is the way data has been collected, and you could manage that data in an Excel Spreadsheet. However, as computer storage has become cheaper and data collection methods have become more sophisticated, our ability to access data has exploded in scale. It's not hard to imagine that using traffic cameras, dashcams, and car sensors could collect a lot more information than any one person. +1. **Interface**: Web portal or app where you can submit a new picture -Think about how much text information is freely available on the internet! Treating that as input data, AI systems can look for patterns of words that typically go together. For example, you're much more likely to see the phrase "cancer is a disease" than "cancer is a computer program". +## Collecting Datapoints + +Let's say a clinician, *Dr. Derma*, is learning how to screen for skin cancer. When Dr. D sees their first instance of skin cancer, they now have one data point. Dr. D could make future diagnoses based on this one data point, but it might not be very accurate. Over time, as Dr. D does more screenings of skin with and without cancer, they will get a better and better idea of what skin cancer looks like. This is part of what we do best. Human beings are powerhouses when it comes to pattern recognition and processing [@Mattson2014]. + +Like Dr. D, AI will get better at finding the right patterns with more data. In order to train an AI algorithm to detect possible skin cancer, we'll first want to gather as many pictures of normal and cancerous skin as we can. This is the **raw data** [@Leek2017]. 
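The "more data points, better estimates" idea can be sketched numerically. This is an illustrative aside in plain Python (not part of the chapter's materials; the observations are invented): with a single data point, Dr. D's estimate of how common cancerous spots are is badly off, while ten observations settle much closer to the underlying rate.

```python
# Illustrative aside: each observation records whether a screened spot
# turned out to be cancerous. These values are invented for the example.
observations = [True, False, False, True, False, False, False, False, False, False]

def estimate_after(n, obs):
    """Estimated rate of cancerous spots after the first n observations."""
    seen = obs[:n]
    return sum(seen) / len(seen)

one_point = estimate_after(1, observations)    # a single data point: estimate is 100%
ten_points = estimate_after(10, observations)  # ten data points: estimate settles at 20%
```

The same logic is why gathering as many images as we can matters: each additional labeled example nudges the learned pattern closer to reality.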
```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'} -ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g2a3877ab699_0_79") +ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g263e06ef889_36_153") ``` ### What Is Data -Data comes in many shapes and forms. Data can be **structured**, such as a spreadsheet of times and traffic volume or counts of viral particles in different patients. Data can also be **unstructured**, such as might be found in social media text or genome sequence data. +In our skin cancer screening example, our data is all of the information stored in an image. However, data comes in many shapes and forms. Data can be **structured**, such as a spreadsheet of the time of day plus traffic volume or counts of viral particles in different patients. Data can also be **unstructured**, such as might be found in social media text or genome sequence data. Other kinds of data can be collected and used to train algorithms. These might include survey data collected directly from consumers, medical data collected in a healthcare setting, purchase or transaction tracking, and online tracking of your time on certain web pages [@Cote2022]. +Quantity *and* quality of data are very important. More data makes it easier to detect and account for minor differences among observations. However, that shouldn't come at the cost of quality. It is sometimes better to have fewer, high resolution or high quality images in our dataset than many images that are blurry, discolored, or in other ways questionable. +
-It is **essential** that you and your team think critically about data sources. Many companies releasing generative AI systems have come under fire for training these systems on data that doesn't belong to them [@Walsh2023]. Individual people also have a right to data privacy. No personal data should be used without permission, even if that data could be interesting or useful. -
+Representative diversity of datasets is crucial for the effectiveness of AI. For instance, if an AI used for skin cancer screening only encounters instances of skin cancer on lighter skin tones, it might fail to alert individuals with darker skin tones. -## Machines Can Learn Like Us +The tech industry's lack of diversity contributes to these issues, often leading to the discovery of failures only after harm has occurred. + -Human beings are powerhouses when it comes to pattern recognition and processing [@Mattson2014]. We are constantly observing the world around us, collecting data to learn and make decisions. For example, we might notice a pattern between the amount of traffic on roads in a city, and the time of day. +Large Language Models (LLMs), which we will cover later, are great examples of high quantity and quality of data. Think about how much text information is freely available on the internet! Throughout the internet, we're much more likely to see the phrase "cancer is a disease" than "cancer is a computer program". Many LLMs are trained on sources like [Wikipedia](https://www.wikipedia.org/), which are typically grammatically sound and informative, leading to higher quality output. ```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'} -ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.gcf1264c749_0_140") +ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g2a3877ab699_0_79") ``` -Much like the human brain, machine learning detects patterns within data. **Machine learning** is at the heart of artificial intelligence, allowing computers to learn and make predictions. In more complex machine learning, computers make millions of calculations, mastering the mapping of inputs (observations) to outputs (predictions). This process mirrors how humans learn through experience. - -
-**Machine Learning**: Machine learning is a way for computers to learn from examples and improve their performance over time, resembling how humans learn from experience. +
+It is **essential** that you and your team think critically about data sources. Many companies releasing generative AI systems have come under fire for training these systems on data that doesn't belong to them [@Walsh2023]. Individual people also have a right to data privacy. No personal data should be used without permission, even if that data could be interesting or useful.
-A machine learning system refines its understanding by continuously updating its parameters based on the feedback received from the provided data. For example, our system might be guessing traffic by time of day, but also judging its accuracy while accounting for other factors, such as whether or not it was a work day, if some workers are on holiday, or how many people live in the city. +### Preparing the Data + +It's important to remember that AI systems need specific instructions to start detecting patterns. We'll need to take our raw data and indicate which pictures are positive for skin cancer and which aren't. This process is called **labeling** and has to be done by humans. + +Once data is labeled, either "cancer" or "not cancer", we can use it to train the algorithm in the next step. This data is aptly called **training data**. ```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'} -ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g1965a5f7f0a_0_44") +ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g263e06ef889_36_318") ``` -The rise of machine learning has been propelled by our ability to collect vast amounts of data and sophisticated types of AI and computing power. +## Understanding the Algorithm + +Our goal is "detection of possible skin cancer", but how does a computer do that? + +First, we'll need to break down the image into attributes called **features**. This could be the presence of certain color pixels, percentage of certain shades, spot perimeter regularity, or other features. Features can be determined by computers or by data scientists who know what kind of features are important. It's not uncommon for an AI looking at image data to have thousands of features. 
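One of the features named above, spot perimeter regularity, can be boiled down to a single number. As a toy sketch in plain Python (not the chapter's actual pipeline; real screening models derive far richer features from pixels), a "circularity" score compares a spot's area to its perimeter: a perfect circle scores 1.0, and more irregular outlines score lower.

```python
import math

def circularity(area, perimeter):
    """Circularity feature: 4 * pi * area / perimeter**2.
    A perfect circle scores 1.0; irregular outlines score lower."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    return (4 * math.pi * area) / (perimeter ** 2)

# A circle of radius 5 (area pi*r**2, perimeter 2*pi*r) scores 1.0.
round_spot = circularity(math.pi * 25, 2 * math.pi * 5)

# A long, thin 1-by-20 region of similar area scores far lower.
ragged_spot = circularity(20, 42)
```

Each image then becomes a list of such numbers, one per feature, which is what the algorithm actually compares across labeled examples.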
+Because we've supplied a bunch of images with labels, AI can look for patterns that are present in cancerous images but not in others.
+As an example, here is a very simple algorithm with one feature (spot perimeter):
+1. Calculate the perimeter of a darker spot in the image.
-
+1. If the perimeter of the spot is exactly circular, label the image "not cancer".
+1. If the perimeter of the spot is not circular, label the image "cancer".
-
+### Testing the Algorithm
-
+After setting up and quantifying the features, we want to make sure the AI is actually doing a good job. We'll take some images the AI hasn't seen before, called **test data**. We know the correct answers, but the AI does not. The AI will measure the features within each of the images to provide an educated guess of the proper label. Every time the AI gets a label wrong, it will reassess parts of the algorithm. For example, it might make the tweak below:
-
+1. Calculate the perimeter of a darker spot in the image.
-
+1. If the perimeter of the spot is close to circular, label the image "not cancer".
+1. If the perimeter of the spot is not close to circular, label the image "cancer".
+
+Humans play a big part in deciding what kinds of scores are acceptable when producing outputs. With cancer screening, we might be very worried about missing a real instance of cancer. Therefore, we might tell the AI to score false negatives more harshly than false positives.
+
+```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'}
+ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g263e06ef889_36_360")
+```
+
+## Interfacing with AI
+
+Finally, AI would not work without an interface. This is where we can get creative. In our skin cancer screening, we might create a website where providers or patients could upload a picture of an area that needs screening.
+
+- Because skin images could be considered medical data, we would need to think critically about what happens to images after they are uploaded. Are images deleted after a screening prognosis is made? Will images be used to update the training data?
+
+- Telling people they might have cancer could be very upsetting for them. Our interface should provide supporting resources and clear disclaimers about its abilities.
+
+```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'}
+ottrpal::include_slide("https://docs.google.com/presentation/d/1OydUyEv1uEzn8ONPznxH1mGd4VHC9n88_aUGqkHJX4I/edit#slide=id.g263e06ef889_36_397")
+```
diff --git a/01d-AI_Possibilities-ai_types.Rmd b/01d-AI_Possibilities-ai_types.Rmd
index c7da38c5..5b122558 100644
--- a/01d-AI_Possibilities-ai_types.Rmd
+++ b/01d-AI_Possibilities-ai_types.Rmd
@@ -5,17 +5,22 @@ ottrpal::set_knitr_image_path()
 
 # VIDEO Different Types of AI
 
-# Types of AI
+# Demystifying Types of AI
 
-How they work..
+We've learned a bit about how AI works. However, there are many different types of AI with different combinations of data, algorithms, and interfaces. There are also general terms that are important to know. Let's explore some of these below.
 
-
+## Machine Learning
+**Machine learning** is a broad concept describing how computers learn from data. It includes traditional methods like decision trees and linear regression, as well as more modern approaches such as deep learning. It involves training models on labeled data to make predictions or uncover patterns or groupings in data. Machine learning is often the "algorithm" part of our data - algorithm - interface framework.
 
-
+## Neural Networks
-
+**Neural networks** are a specific class of algorithms within the broader field of machine learning. They organize data into layers, including an input layer for data input and an output layer for results, with intermediate "hidden" layers in between.
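The layered structure can be made concrete with a minimal sketch in plain Python (an illustrative aside with made-up weights; real networks have far more nodes and learn their weights from data): values enter the input layer, a hidden layer transforms them, and an output layer produces a single result.

```python
import math

def layer(inputs, weights, biases):
    """One layer: each node takes a weighted sum of the inputs plus a bias,
    then squashes it with a logistic activation to a value between 0 and 1."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(1 / (1 + math.exp(-total)))
    return outputs

# Input layer: two made-up feature values.
inputs = [0.8, 0.2]

# Hidden layer: two nodes, each with one (made-up) weight per input.
hidden = layer(inputs, weights=[[0.5, -0.3], [0.1, 0.9]], biases=[0.0, 0.1])

# Output layer: one node combining the hidden activations into a result.
output = layer(hidden, weights=[[1.2, -0.7]], biases=[0.05])[0]
```

Stacking more hidden layers between the input and output is what makes a network "deep."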
-
+You can think of layers like different teams in an organization. The input layer handles scoping and strategy, the output layer finalizes deliverables, and the intermediate layers piece together existing project materials and create new ones. These layers help neural networks understand hierarchical patterns in data.
-
+The connections between nodes have weights that the network learns during training. The network can then adjust these weights to minimize errors in predictions. Neural networks often require large amounts of labeled data for training, and their performance may continue to improve with more data.
+
+```{r, echo=FALSE, fig.alt='CAPTION HERE', out.width = '100%', fig.align = 'center'}
+ottrpal::include_slide("https://docs.google.com/presentation/d/1UiYOR_4a68524XsCv-f950n_CfbyNJVez2KdAjq2ltU/edit#slide=id.g2a694e3cce9_0_0")
+```
diff --git a/book.bib b/book.bib
index f229b91b..2609af
--- a/book.bib
+++ b/book.bib
@@ -349,7 +349,7 @@ @misc{pearce_beware_2021
 @misc{CDC2023,
   title = {Melanoma of the Skin Statistics},
   url = {https://www.cdc.gov/cancer/skin/statistics/index.htm},
-  author = {US Centeres for Disease Control and Prevention},
+  author = {CDC},
   language = {en},
   urldate = {2023-12-14},
   year= {2023}
@@ -365,3 +365,11 @@ @article{Melarkode2023
   year={2023},
   publisher={MDPI}
 }
+
+@misc{Leek2017,
+  title = {Demystifying Artificial Intelligence},
+  url = {https://leanpub.com/demystifyai},
+  author = {Leek, Jeffrey T and Narayanan, Divya},
+  language = {en},
+  year = {2017}
+}
diff --git a/resources/dictionary.txt b/resources/dictionary.txt
index d3ab8858..a6a65507 100644
--- a/resources/dictionary.txt
+++ b/resources/dictionary.txt
@@ -49,6 +49,7 @@ ChatGPT's
 CIO
 Coursera
 css
+curation
 cyberattacks
 cybersecurity
 DALL
@@ -57,7 +58,9 @@ DaSL
 DaSL's
 Datatrail
 DataTrail
+deliverables
 deepfakes
+Derma
 Dockerfile
 Dockerhub
 dropdown
@@ -74,6 +77,7 @@ GPT
 HIPAA
 IDARE
 impactful
+IRB
 IRBs
 ITCR
 itcrtraining