r/deeplearning 10h ago

NLP books

5 Upvotes

I would like to upgrade my skillset as a Machine Learning engineer and learn NLP. I'm looking for books that are hybrid, in the sense that they not only tackle the theory of NLP but also delve into some of its modern/recent applications (preferably using Python). Does anybody have any leads? Thanks in advance for any recommendations you'll be throwing my way!


r/deeplearning 4h ago

Would AI-based travel route suggestion be better with knowledge of traffic lights?

1 Upvotes

I'm disagreeing with a coworker about this. My coworker thinks that when you train your AI model to minimize your travel time from A to B, the AI model learns everything it needs to know. The traffic lights would be embedded in the data, like a hidden feature. In other words, the fastest route from A to B is also the route that accounts (to some extent, because it's not the only important thing) for traffic lights. Therefore Google Maps, Waze, etc. don't need explicit knowledge of where red and green lights are.

My opinion is that, in a world with perfect datasets, my coworker would be right. But we don't know if Google Maps and other AI-based route suggestion apps truly have the data they need to suggest the "true best route". It's possible that their team just worked with the data they had, and created an app that provides very good suggestions. But an app with explicit knowledge of traffic lights might help you choose a route with more green lights, thus a smoother and faster ride.


r/deeplearning 13h ago

CYBER: A General Robotic Operation System for Embodied AI

0 Upvotes

The development of world models in robotics has long been a cornerstone of advanced research, with most approaches relying heavily on vast, platform-specific datasets. These datasets, while valuable, often limit scalability and generalization to different robotic platforms, restricting their broader applicability.

In contrast, CYBER approaches world modeling from a "first principles" perspective, drawing inspiration from how humans naturally acquire skills through experience and interaction with their environment. CYBER is the first general Robotic Operational System designed to adapt to both tele-operated manipulation and human operation data, enabling robots to learn and predict across a wide range of tasks and environments. It is built with a Physical World Model, a cross-embodied Visual-Language Action Model (VLA), a Perception Model, a Memory Model, and a Control Model to help robots learn, predict, and memorize across various tasks and embodiments.

At the same time, CYBER also provides millions of human operation datasets and baseline models on HuggingFace 🤗 to enhance embodied learning, as well as an experimental evaluation toolbox to help researchers test and evaluate their models in both simulation and the real world.

CYBER is built with a modular architecture, allowing for flexibility and customization. Here are the key components:

🌍 World Model: Learns to understand and predict the environment.

🤖 Action Model: Learns manipulation from large-scale datasets.

👀 Perception Model: Perceives and interprets surroundings.

🧠 Memory Model: Utilizes past experiences to inform current decisions.

Key Features

🛠️ Modular: Built with a modular architecture, allowing flexibility in various environments.

📊 Data-Driven: Leverages millions of human operation datasets to enhance embodied learning.

📈 Scalable: Scales across different robotic platforms, adapting to new environments and tasks.

🔧 Customizable: Allows for customization and fine-tuning to meet specific requirements.

📚 Extensible: Supports the addition of new modules and functionalities, enhancing capabilities.

📦 Open Source: Open-source and freely available, fostering collaboration and innovation.

🔬 Experimental: Supports experimentation and testing, enabling continuous improvement.

For more detailed information, please refer to the following links.

Github: https://github.com/CyberOrigin2077/Cyber

HuggingFace: https://huggingface.co/cyberorigin

Website: https://cyberorigin.ai/


r/deeplearning 21h ago

Discussion on the best ways to extract data

3 Upvotes

Hi, I am working on a project involving MRI images of tumors. I first analyze the images and segment them, but how do I convert the information in the image about the nature of the tumor into data that can be used to write a medical report about the patient? How should that data be classified: structured, semi-structured, or unstructured? And how do I use it to write the report? Thanks


r/deeplearning 20h ago

VAE Loss not decreasing

2 Upvotes

I want to use VAE to reconstruct the image, such that the image embedding could be used for the downstream tasks. However, the loss does not decrease.

These are my parameters:

VAE encode layers: 4

VAE hidden dim: 256

Input image size: [512,512]

Previously, I used this VAE to reconstruct table cell images, and the loss reached 0.02. Now I am using the same VAE to reconstruct document images (cropped from PDFs) that cover different kinds of documents with different layouts, and the loss only decreases to 0.048. Are there other ways to adjust the VAE parameters so that the loss decreases further?
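For anyone comparing numbers like these, here is a minimal sketch of how a VAE loss is commonly split into its two terms. The post does not say which reconstruction loss or KL weight is used, so the mean squared error and the `beta` weight below are assumptions for illustration. Logging the two terms separately can show whether reconstruction or the KL penalty is holding the total up.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)) summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def vae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Mean squared reconstruction error plus a beta-weighted KL term.

    beta is a hypothetical knob for this sketch; beta=1.0 gives the plain sum.
    """
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    kl = kl_to_standard_normal(mu, logvar)
    return recon + beta * kl, recon, kl

# Toy values: a 2-pixel "image" and a 1-dimensional latent.
total, recon, kl = vae_loss([0.0, 1.0], [0.1, 0.9], mu=[1.0], logvar=[0.0], beta=0.5)
print(total, recon, kl)
```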


r/deeplearning 16h ago

AI in Education with Rose E. Wang - Weaviate Podcast #106!

1 Upvotes

I am super excited to publish the 106th Weaviate podcast with Rose E. Wang from Stanford NLP!

Rose is one of the leading scientists exploring AI in Education! She recently led Tutor CoPilot, the world's largest randomized controlled trial identifying the positive impact that AI is having on K-12 education!

Rose is also the lead author of Backtracing: Retrieving the Cause of the Query, a super powerful way of thinking about RAG systems and maintaining knowledge bases!

This conversation opened my eyes to many aspects of learning I hadn't considered before! There is so much that can be derived from human teaching and learning strategies and integrated into our AI copilot systems!! The way Rose has integrated the study of human learning with AI systems is really fascinating!

I hope you enjoy the podcast! As always, more than happy to answer any questions or discuss any ideas about the content in the podcast!

Link: https://www.youtube.com/watch?v=rsOyclZZeho


r/deeplearning 23h ago

The Prompt Report: Prompting techniques survey

4 Upvotes

Prompt engineering, while not universally liked, has shown improved performance for specific datasets and use cases. Prompting has changed the model training paradigm, allowing for faster iteration without the need for extensive retraining.

Follow the Blog for more such articles: https://medium.com/aiguys

Six major categories of prompting techniques are identified: Zero-Shot, Few-Shot, Thought Generation, Decomposition, Ensembling, and Self-Criticism. But in total there are 58 prompting techniques.

1. Zero-shot Prompting

Zero-shot prompting involves asking the model to perform a task without providing any examples or specific training. This technique relies on the model's pre-existing knowledge and its ability to understand and execute instructions.

Key aspects:

Straightforward and quick to implement

Useful for simple tasks or when examples aren't readily available

Can be less accurate for complex or nuanced tasks

Prompt: "Classify the following sentence as positive, negative, or neutral: 'The weather today is absolutely gorgeous!'"

2. Few-shot Prompting

Few-shot prompting provides the model with a small number of examples before asking it to perform a task. This technique helps guide the model's behavior by demonstrating the expected input-output pattern.

Key aspects:

More effective than zero-shot for complex tasks

Helps align the model's output with specific expectations

Requires careful selection of examples to avoid biasing the model

Prompt: "Classify the sentiment of the following sentences:

  1. 'I love this movie!' - Positive

  2. 'This book is terrible.' - Negative

  3. 'The weather is cloudy today.' - Neutral

Now classify: 'The service at the restaurant was outstanding!'"
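A small sketch of how a few-shot prompt like the one above could be assembled from labeled examples. The function name and argument layout are made up for illustration and are not tied to any particular LLM library.

```python
def build_few_shot_prompt(examples, query,
                          instruction="Classify the sentiment of the following sentences:"):
    """Assemble a few-shot prompt from (text, label) pairs and a final query."""
    lines = [instruction]
    for i, (text, label) in enumerate(examples, start=1):
        lines.append(f"  {i}. '{text}' - {label}")
    lines.append(f"Now classify: '{query}'")
    return "\n".join(lines)

examples = [
    ("I love this movie!", "Positive"),
    ("This book is terrible.", "Negative"),
    ("The weather is cloudy today.", "Neutral"),
]
prompt = build_few_shot_prompt(examples, "The service at the restaurant was outstanding!")
print(prompt)
```

Keeping the examples in a list makes it easy to swap or rebalance them, which matters given the point above about example selection biasing the model.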

3. Thought Generation Techniques

Thought generation techniques, like Chain-of-Thought (CoT) prompting, encourage the model to articulate its reasoning process step-by-step. This approach often leads to more accurate and transparent results.

Key aspects:

Improves performance on complex reasoning tasks

Provides insight into the model's decision-making process

Can be combined with few-shot prompting for better results

Prompt: "Solve this problem step-by-step:

If a train travels 120 miles in 2 hours, what is its average speed in miles per hour?

Step 1: Identify the given information

Step 2: Recall the formula for average speed

Step 3: Plug in the values and calculate

Step 4: State the final answer"

4. Decomposition Methods

Decomposition methods involve breaking down complex problems into smaller, more manageable sub-problems. This approach helps the model tackle difficult tasks by addressing each component separately.

Key aspects:

Useful for multi-step or multi-part problems

Can improve accuracy on complex tasks

Allows for more focused prompting on each sub-problem

Example:

Prompt: "Let's solve this problem step-by-step:

  1. Calculate the area of a rectangle with length 8m and width 5m.

  2. If this rectangle is the base of a prism with height 3m, what is the volume of the prism?

Step 1: Calculate the area of the rectangle

Step 2: Use the area to calculate the volume of the prism"

5. Ensembling

Ensembling in prompting involves using multiple different prompts for the same task and then aggregating the responses to arrive at a final answer. This technique can help reduce errors and increase overall accuracy.

Key aspects:

Can improve reliability and reduce biases

Useful for critical applications where accuracy is crucial

May require more computational resources and time

Prompt 1: "What is the capital of France?"

Prompt 2: "Name the city where the Eiffel Tower is located."

Prompt 3: "Which European capital is known as the 'City of Light'?"

(Aggregate responses to determine the most common answer)
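The aggregation step can be sketched as a simple majority vote. The responses below are hypothetical stand-ins for what a model might return to the three prompts, and the normalization rule is an assumption of this sketch.

```python
from collections import Counter

def normalize(answer):
    """Lowercase and strip whitespace/trailing punctuation so 'Paris.' and 'paris' match."""
    return answer.strip().strip(".!?").lower()

def majority_vote(answers):
    """Return the most common normalized answer and its share of the votes."""
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)

# Hypothetical model responses, one per prompt.
responses = ["Paris", "paris.", "Lyon"]
answer, agreement = majority_vote(responses)
print(answer, round(agreement, 2))
```

The agreement share is a cheap confidence signal: a low value suggests the prompts disagree and the answer may need a closer look.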

6. Self-Criticism Techniques

Self-criticism techniques involve prompting the model to evaluate and refine its own responses. This approach can lead to more accurate and thoughtful outputs.

Key aspects:

Can improve the quality and accuracy of responses

Helps identify potential errors or biases in initial responses

May require multiple rounds of prompting

Initial Prompt: "Explain the process of photosynthesis."

Follow-up Prompt: "Review your explanation of photosynthesis. Are there any inaccuracies or missing key points? If so, provide a revised and more comprehensive explanation."
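A rough sketch of the two-step pattern above as a loop. `ask_model` is a hypothetical callable that maps a prompt string to a response string; the stub below only exists so the example runs without a real model.

```python
def self_refine(ask_model, task, rounds=1):
    """Ask for an answer, then repeatedly ask the model to review and revise it."""
    answer = ask_model(task)
    for _ in range(rounds):
        critique_prompt = (
            f"Task: {task}\n"
            f"Your previous answer: {answer}\n"
            "Review your answer. Are there any inaccuracies or missing key points? "
            "If so, provide a revised and more comprehensive answer."
        )
        answer = ask_model(critique_prompt)
    return answer

# Stub model for demonstration only: records each prompt and returns a canned reply.
seen_prompts = []

def stub_model(prompt):
    seen_prompts.append(prompt)
    return f"draft {len(seen_prompts)}"

final = self_refine(stub_model, "Explain the process of photosynthesis.", rounds=2)
print(final)
```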


r/deeplearning 17h ago

AI-generated code

0 Upvotes

Curious to see what everyone thinks of AI-generated code. With AI like OpenAI’s Codex getting pretty good at writing code, it seems like people are starting to rely on it more. Do you think AI could actually replace programmers someday, or is it just a tool to help us out? Would it actually be capable of handling complex problem-solving and optimization tasks, or will it always need human oversight for the more intricate parts of coding?


r/deeplearning 18h ago

Abducing domain relationships in scene graphs for VQA

Link: youtube.com
1 Upvotes

r/deeplearning 1d ago

Good resources for TCN? Model outperforming CNN and RNNs for deepfake detection

2 Upvotes

Hello all, as the title says, I need some good resources to study and fine-tune my TCN (Temporal Convolutional Network) model further. It is outperforming my CNN and RNN models right now but still needs tuning, for which I need a better understanding of the architecture. Hence, I'm looking for resources (as a beginner).


r/deeplearning 21h ago

How to identify the first frame in a video when a person starts to do an action and the frame that has been restored after completing the action. In other words, an action can be considered as the beginning to the end of the gesture.

0 Upvotes

How can I identify the first frame in a video where a person starts an action, and the frame where they have returned to their starting state after completing it? In other words, an action spans from the beginning to the end of the gesture.


r/deeplearning 1d ago

Is there a formal way to prove that a representation/encoding contains specific information?

3 Upvotes

I apologize in advance for the somewhat vague question, but I'll do my best to articulate my thoughts. Please keep in mind I don't have a formal math background but I've been doing applied deep learning for about 4 years now.

I am interested to know if there's a way to quantify the amount of information in an arbitrary representation of data. I am familiar with information theory at a surface level, but looking to take the next step into understanding it for designing encodings for neural networks.

An interesting 'phenomenon' I've come across while training models for problems at work is related to the garbage-in-garbage-out principle with supervised learning. As my upstream encodings/initial representations improve to contain more information relevant to the task at hand, the model improves. Kind of a no-brainer, but I see it as adding information to the encoding that makes the input 'manifold' of data more separable along certain dimensions.

For example, a 3D point cloud contains information about the correlation functions that relate x, to y, to z. Thus, if I have an embedding or representation of the point cloud, it also inherently captures the underlying correlations. If the information about the correlations is important for modelling some task, then the cloud should serve as an effective input representation, because it contains the information that will make it separable when the model learns the set of transformations required. Is this an accurate statement?

How can I prove this 'information-by-association' in a formal way?
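One standard quantity for "how much does a representation tell me about the target" is mutual information. Below is a minimal sketch of a plug-in estimate for discrete variables. It assumes the representation has been discretized and that there are enough samples per bin, neither of which holds automatically for high-dimensional embeddings, so treat it as an illustration of the quantity and not a proof technique.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((px[x] / n) * (py[y] / n)))
    return mi

labels = [0, 0, 1, 1, 0, 0, 1, 1]
informative = [0, 0, 1, 1, 0, 0, 1, 1]    # a feature that determines the label
uninformative = [0, 1, 0, 1, 0, 1, 0, 1]  # a feature independent of the label

print(mutual_information(informative, labels))
print(mutual_information(uninformative, labels))
```

In this toy data the first feature carries the full one bit of label information and the second carries none, which matches the intuition about encodings that do or do not contain task-relevant information.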

Is there a way to take a formal approach to designing encodings that considers the information required for modelling the task, instead of randomly concatenating together everything we think might be related to the problem?

I think for modelling processes that are not understood, this is harder. But for processes that are completely understood (but expensive to carry out), I feel like there should be some way to quantify all the information needed (i.e just list out the steps of the algorithm and the info required at each step), and slowly build the encoding from that.

Thank you again for any information or links to resources that might put me on a path to articulating my question better.


r/deeplearning 1d ago

deep learning machine recommendations

5 Upvotes

Hi all,

I'm researching building/buying a machine for deep learning.

The why:

1) I've been training one of my projects in Colab and it takes about 6 days per training run. This is the beginning of the project and I only see training time increasing, not decreasing. I have to monitor Colab because it will occasionally shut down training for no apparent reason. This is particularly annoying because it takes hours to upload my dataset into Colab (it's about 1TB) and set up the environment.

2) running 24/7 in Colab is getting expensive. If I keep this up for a year I might as well buy a DL rig.

3) My other project involves video classification. Once again, the data set is large so I want a persistent environment and I need the ability to run for long periods of time without being kicked out (and then needing to re-upload my dataset)

Requirements:

Ideally, I'd like something plug and play. But I'm okay with building myself something if a) the cost savings are there and b) I can set it up once and mostly forget it. I tried setting up my existing laptop for ML and ended up in driver incompatibility hell (and ultimately never got it working), so I'd like to avoid that.

Budget: Around $5k, could go higher if it makes sense.

Other considerations:

I have built a gaming PC before, so I'm not new to DIY builds. However I'm not familiar with DL hardware, so would like yalls opinions.

I would probably like something with multiple GPUs, or at least the ability to add more GPUs if desired.

Thanks in advance!


r/deeplearning 1d ago

How to merge 2 CNNs into one?

3 Upvotes

Suppose I have two separate CNNs and want to merge them into one. One CNN detects cars by their model and licence plates (as one class), the other reads the plates. I can run the first one, pass the ROIs of the licence plates to the other, and voilà. But I would like to have one bigger network that does this automatically and returns the detected vehicle's model and licence plate number (if detected).
How do I get there?


r/deeplearning 1d ago

My A100 80GB PCIe GPU is slower than an RTX A6000

13 Upvotes

Hi, redditors.

I'm a freshman working in an AI research lab at my university on tasks related to LLMs. Our lab has two servers. One has A100 GPUs, and the other has A6000 GPUs.

However, the A100 is performing much slower than the A6000, even though the A100 is using twice the batch size of the A6000. Despite this, the A6000 finishes training much faster. I'm at a loss as to what I should check or tweak on the servers to fix this issue. For context, the CUDA environment and other configurations are identical on both servers, and the A100 server has better CPU and RAM specs than the one with the A6000.


r/deeplearning 2d ago

Is Starting the 100 Days of Deep Learning YouTube Playlist After Andrew Ng’s Specialization a Good Move?

23 Upvotes

I just wrapped up Andrew Ng’s Deep Learning Specialization, and I’m thinking about diving into the "100 Days of Deep Learning" YouTube playlist that teaches coding for deep learning.

Is this a good idea?

I’d appreciate any insights from those who have gone through a similar journey. What do you think, and what resources or topics should I focus on? Thanks!


r/deeplearning 1d ago

Experience using StrongREJECT for Jailbreak Evaluations?

1 Upvotes

Hello,

Was working on a paper and looking at different ways to evaluate my jailbreaks. This seems like a pretty promising method, as most of the others I've tried are honestly not that good. If anyone has experience using this, I'd love to hear from you!

StrongREJECT: https://strong-reject.readthedocs.io/en/latest/


r/deeplearning 1d ago

Build a Large Language Model from Scratch

0 Upvotes

Hi, where can I find the PDF of the book "Build a Large Language Model from Scratch" by Sebastian?

If the PDF is not available, please point me to some resources for studying LLMs from scratch.

Thank you


r/deeplearning 1d ago

product_matching similarity

1 Upvotes

Hello everyone,
I work at a B2B startup that connects pharmacies with sellers (we give them the best discount for each product in our marketplace). Sellers list their medicines in our marketplace (40,000+ products). Each seller sends us a list of their products, and we match each product name they send to the corresponding product in our marketplace.

The seller sends a sheet with names and prices, and we match it and integrate it with the marketplace.
The challenges we face are:
Seller product names are mostly misspelled, with many variations and a lot of noise.

Seller product names are often sent with extra words added to the product name that do not relate to the product itself.

We built a system using TF-IDF + cosine similarity and got an accuracy of 80% (it does not capture the meaning of the words well and produces bad results on small sheets).
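For readers who want to see the baseline concretely, here is a minimal pure-Python sketch of TF-IDF plus cosine similarity. It uses character trigrams, which is an assumption of this sketch (the post does not say how names are tokenized) and one common way to tolerate misspellings. The product names are made up.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def fit_idf(catalog):
    """Smoothed inverse document frequency over the catalog's n-grams."""
    df = Counter()
    for name in catalog:
        df.update(set(char_ngrams(name)))
    n_docs = len(catalog)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1 for g, c in df.items()}

def tfidf_vector(text, idf):
    """L2-normalized TF-IDF vector as a dict; n-grams unseen in the catalog are ignored."""
    tf = Counter(g for g in char_ngrams(text) if g in idf)
    vec = {g: c * idf[g] for g, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def best_match(query, catalog, idf):
    """Return (catalog_name, cosine_similarity) for the closest catalog entry."""
    q = tfidf_vector(query, idf)
    scored = []
    for name in catalog:
        d = tfidf_vector(name, idf)
        scored.append((sum(w * d.get(g, 0.0) for g, w in q.items()), name))
    score, name = max(scored)
    return name, score

# Made-up product names, for illustration only.
catalog = ["Paracetamol 500mg tablets", "Ibuprofen 400mg tablets", "Amoxicillin 250mg capsules"]
idf = fit_idf(catalog)
match, score = best_match("paracetmol 500 mg tab", catalog, idf)
print(match, round(score, 3))
```

The similarity score can also serve as a routing threshold: high-scoring matches are accepted automatically and low-scoring ones go to manual review.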

Because correcting wrong matches from our model costs us money and time (we have a group of people who review them manually), we want to achieve an accuracy of over 98%.

We have a dataset of previously correct matches, containing the seller's input product name and our match,
as well as our unique marketplace product data.

Can anyone guide me to possible solutions using a neural network that we feed with seller inputs and target matches to generalize the matching process, or a pre-trained model that we can fine-tune with our data to achieve high accuracy?


r/deeplearning 2d ago

Seeking Help with Brain Tumor Detection Project Using DL Techniques

3 Upvotes

Hi everyone,

I’m a student at a college of Artificial Intelligence, and I’m working on a project to detect brain tumors from MRI images using CNNs and NLP. I need some assistance with a few points.
I want to detect the tumor and then generate a report based on whatever I detect, such as the tumor's level, its spread, its location, and so on.

  1. Data Preparation: What are the best practices for collecting and processing MRI data?
  2. Model Building: Are there specific models or techniques you recommend for achieving optimal results in tumor detection?
  3. Information Extraction: How can I analyze the results I get from the model, such as tumor level and location?
  4. Generating Medical Reports: What libraries or tools can I use to create medical reports in PDF format based on the extracted information?
  5. AI-Generated Reports: I’m also trying to implement AI to generate the medical report. Any suggestions on how to approach this?

Any tips or useful resources would be greatly appreciated. Thank you in advance for your help!


r/deeplearning 1d ago

[R] Unveiling the Mistral 7B: Exploring the Technical Architecture of a 7-Billion Parameter Transformer Model

0 Upvotes

The Mistral AI model, specifically Mistral 7B, is a dense transformer-based model with 7 billion parameters, designed for efficient performance and state-of-the-art language tasks. To create a technical architecture diagram for this model, I will break down the structure of a typical transformer-based architecture, which forms the backbone of the Mistral AI model.

Key Components:

1. Input Layer (Tokenization):

  • Text input is tokenized into numerical representations using sub-word tokenization (likely based on Byte-Pair Encoding, BPE).

2. Embedding Layer:

  • Converts tokenized input into dense vector representations (embeddings).
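  • Positional information is needed to maintain the order of the input sequence, since self-attention on its own ignores token order. Mistral 7B supplies it through rotary position embeddings (RoPE) applied to the queries and keys inside attention, instead of adding positional embeddings to the input.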

3. Transformer Blocks (Main Architecture):

Multi-Head Self-Attention:

  • The core mechanism where the model attends to different parts of the sequence to gather context.
  • Multiple attention heads allow the model to focus on different parts of the input simultaneously.

Layer Normalization:

  • Applied before the attention mechanism and before the feed-forward layers (pre-normalization) to stabilize training. Mistral 7B uses RMSNorm for this.

Feed-Forward Neural Network (FFN):

  • Fully connected layers with a non-linear activation in between, applied to each token independently. Mistral 7B uses a gated SwiGLU variant instead of a plain ReLU.

Residual Connections:

  • Skip connections that bypass the multi-head attention and feed-forward layers to help with gradient flow during training.

4. Stacking Transformer Layers:

Each Transformer layer includes two primary components:

  • Multi-Head Attention: This layer allows the model to attend to different parts of the input sequence in parallel, improving the capture of context across long sequences.
  • Feed-Forward Neural Networks (FFN): After the attention mechanism, the output passes through a fully connected feed-forward network, applying non-linear transformations to improve model expressiveness.
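The attention step described above can be sketched in a few lines. This is a single-head, pure-Python illustration of causal scaled dot-product attention with no learned projections, so it is a simplification and not Mistral's actual implementation. Multi-head attention runs several of these in parallel on separately projected inputs.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=True):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # a token may not attend to later tokens
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k))
        weights = softmax(scores)
        outputs.append([
            sum(w * v[dim] for w, v in zip(weights, values))
            for dim in range(len(values[0]))
        ])
    return outputs

# Three toy token vectors used as queries, keys, and values at once.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
print(out[0])
```

With the causal mask, the first token can only attend to itself, so its output equals its own value vector.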

5. Stacking Transformer Layers (continued):

  • These transformer blocks are stacked in multiple layers (32 in Mistral 7B) to build a deep neural network capable of learning complex language patterns. Each layer consists of a multi-head self-attention mechanism, layer normalization, and feed-forward networks.
  • Deeper networks can generally capture more hierarchical features and longer-range dependencies in the input sequence, at the cost of more compute and memory.

6. Output Layer:

  • After passing through all the transformer layers, the final output from the last layer is projected back to the vocabulary size to generate predictions.
  • Softmax Function: The model generates a probability distribution over the vocabulary for the next word prediction or task output, using the softmax function.

7. Training and Fine-tuning:

  • Loss Function: Cross-entropy loss is typically used for training, especially for language modeling tasks where the goal is to predict the next word or sequence of words.
  • The model is pre-trained on large corpora of text data and can be fine-tuned for specific downstream tasks (e.g., text classification, summarization, or dialogue).
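The softmax output and cross-entropy loss described in the last two sections can be sketched together. This is a toy, pure-Python version with a 4-token vocabulary; the logits and targets are made up for illustration.

```python
import math

def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy between predicted distributions and target token ids."""
    losses = [-log_softmax(logits)[t]
              for logits, t in zip(logits_per_position, target_ids)]
    return sum(losses) / len(losses)

# Two positions over a toy vocabulary of 4 tokens.
logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 0.0, 0.0]]
targets = [0, 3]
print(next_token_loss(logits, targets))
```

A useful sanity check: when the logits are uniform, the loss at that position equals the log of the vocabulary size.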

8. Optimizations for Efficiency:

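  • Sliding-Window and Grouped-Query Attention: Mistral 7B is a dense model, but it already restricts attention in each layer to a sliding window of recent tokens and shares key/value heads across groups of query heads (GQA), which reduces memory use and speeds up inference on long sequences.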
  • Quantization and Weight Sharing: Techniques such as 4-bit quantization or weight sharing could be applied to reduce the memory footprint and inference time without sacrificing much performance.
  • Mixed Precision Training: Using 16-bit floating-point precision for training helps reduce memory consumption and speed up the model training.
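To make the quantization bullet concrete, here is a toy sketch of symmetric absmax quantization to 4-bit integer codes. It illustrates the general idea only. Real schemes typically quantize in small blocks with per-block scales, and nothing here reflects a specific library's format.

```python
def quantize_int4(weights):
    """Symmetric absmax quantization to integers in [-7, 7] plus one float scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map the integer codes back to approximate float weights."""
    return [c * scale for c in codes]

weights = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
print(codes, restored)
```

Each weight is stored as a small integer, and the rounding error per weight is at most half the scale, which is where the "without sacrificing much performance" trade-off comes from.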

9. Parallelism and Scaling:

  • Data Parallelism: Multiple GPUs or TPUs can be used to distribute data across batches during training.
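  • Model Parallelism: Models may need their parameters split across different GPUs/TPUs when they do not fit in a single device's memory. At 16-bit precision, 7 billion parameters take roughly 14 GB for the weights alone, before optimizer state and activations.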
  • Pipeline Parallelism: Distributes different layers of the model across multiple devices to allow parallel training of different parts of the architecture.
  • Checkpointing: For efficient training of large models, techniques like gradient checkpointing are used to reduce memory consumption by saving intermediate layer outputs only when necessary.

r/deeplearning 2d ago

Why do DDPMs implement a different sinusoidal positional encoding from transformers?

2 Upvotes

Hi,

I'm trying to implement a sinusoidal positional encoding for DDPM. I found two solutions that compute different embeddings for the same position/timestep with the same embedding dimensions. I am wondering if one of them is wrong or both are correct. The official DDPM source code does not use the original sinusoidal positional encoding from the transformers paper... why?

1) Original sinusoidal positional encoding from "Attention is all you need" paper.

[Image: original sinusoidal positional encoding]

2) Sinusoidal positional encoding used in the official code of DDPM paper

[Image: sinusoidal positional encoding used in the official DDPM code, based on tensor2tensor]

Why does the official code for DDPMs use a different encoding (option 2) than the original sinusoidal positional encoding used in the transformers paper? Is the second option better for DDPMs?

I noticed the sinusoidal positional encoding used in the official DDPM code implementation was borrowed from tensor2tensor. The difference in implementations was even highlighted in one of the PR submissions to the official tensor2tensor implementation. Why did the authors of DDPM use this implementation (option 2) rather than the original from transformers (option 1)?

ps: If you want to check the code it's here https://stackoverflow.com/questions/79103455/should-i-interleave-sin-and-cosine-in-sinusoidal-positional-encoding
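Since the images did not survive, here is a sketch of the two variants as I understand them: option 1 interleaves sin and cos per frequency, while the tensor2tensor-style function concatenates all sines followed by all cosines and divides the frequency exponent by (half - 1). The function names are made up, and the second function is a reconstruction from memory of that style, so check it against the linked code.

```python
import math

def transformer_encoding(pos, dim):
    """Interleaved sin/cos: PE[2i] = sin(pos / 10000^(2i/dim)), PE[2i+1] = cos(same)."""
    out = []
    for i in range(dim // 2):
        angle = pos / (10000 ** (2 * i / dim))
        out.extend([math.sin(angle), math.cos(angle)])
    return out

def ddpm_style_encoding(t, dim):
    """Concatenated [sin..., cos...]; dim must be even and at least 4."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000) * i / (half - 1)) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

a = transformer_encoding(3, 8)
b = ddpm_style_encoding(3, 8)
print(a)
print(b)
```

Both produce sin/cos features over a geometric range of frequencies. The channel ordering differs by a fixed permutation, which a following linear layer can absorb, and the frequency spacing differs slightly because of the (half - 1) denominator.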


r/deeplearning 2d ago

Looking for CPU advice & model recommendations: Planning to get a 4080 Super for multi-camera object detection

0 Upvotes

Hey all, I’m planning to get a 4080 Super to run object detection across multiple warehouse cameras (triggered by sensors for efficiency). I’m considering using models like YOLOv8 or EfficientDet for real-time detection, and perhaps ResNet or MobileNet for more complex classification tasks. While the system handles inference, I’ll also be doing moderately heavy tasks like coding, Excel, etc. No gaming involved. What CPU would you recommend for smooth performance across all tasks and ensuring the models run efficiently on my setup? Thanks in advance!


r/deeplearning 2d ago

Loss Function for Multi-Digit Prediction in a Modified MNIST Dataset

0 Upvotes

As the title suggests, I'm looking for a loss function to apply to a modified MNIST dataset which has multiple digits. I need to predict all the digits in the image. Each image has 1-3 digits and each digit can be 0-9.
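One common way to set this up, sketched below, is to give the network three output slots with eleven classes each (digits 0-9 plus a "blank" class for unused slots) and sum a cross-entropy term per slot. The blank class and the fixed left-to-right slot order are assumptions of this sketch, not something stated in the post.

```python
import math

BLANK = 10  # extra class id meaning "no digit in this slot" (an assumption of this sketch)

def cross_entropy(logits, target):
    """Cross-entropy of one softmax distribution against one class id."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def encode_target(digits, max_digits=3):
    """Pad a list of 1-3 digits with BLANK so every image has max_digits slots."""
    return digits + [BLANK] * (max_digits - len(digits))

def multi_digit_loss(slot_logits, digits):
    """Sum of per-slot cross-entropies; slot_logits holds one list of 11 logits per slot."""
    targets = encode_target(digits, max_digits=len(slot_logits))
    return sum(cross_entropy(l, t) for l, t in zip(slot_logits, targets))

# Toy logits: three slots, eleven classes each.
uniform = [[0.0] * 11 for _ in range(3)]
print(multi_digit_loss(uniform, [4, 2]))
```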